~ The amazing flying wizards ~
Version November 2002
(How a bunch of leet
seekers enters databases)
by VVAA (Various Authors)
(edited by fravia+)
Great stuff for seekers, truncation galore, read and take note of the many tricks...
Everything began with a simple and interesting question...
site specific searches/filenames ?
the-scientist.com publishes some articles that are "hot":
Data derived from the Science Watch/ Hot Papers database and the Web of Science (ISI, Philadelphia) [these are paid services] show that Hot Papers are cited 50 to 100 times more often than the average paper of the same type and age.
The URL for one of them is
How could we search for a list of "hot" articles on that site (because we cannot / will not / should not pay for the original service from Science Watch)?
I tried in Google:
but it fails, I think because "hot" is not in the domain but in the filename.
You will admit that this is an interesting question for a seeker. A typical case of 'closed database', where information is hoarded instead of being freely spread.
But the fundamental law of the web is on the side of the spreaders, and very strong web-winds blow against the commercial hoarders...
Re: site specific searches/filenames ?
I think your problem is truncation... I only know two main search
engines that perform this kind of search: altavista and hotbot
Here are the queries you can use:
(host:the-scientist.com OR host:www.the-scientist.com) AND url:hot_** on altavista
+"the-scientist.com" +domain:com +hot_* on hotbot
the second query brings the following url:
http://www.the-scientist.com/hotpapersarchive.htm, maybe that's what you want.
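To make clear what "truncation" buys you, here is a minimal Python sketch that applies the same idea offline to a list of URLs you already hold: keep only those on the site whose filename starts with `hot_`. This is not how altavista or hotbot worked internally; the helper names and the example.com entry are made up for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse
import posixpath


def filename_matches(url, pattern):
    """Return True if the last path segment of `url` matches the wildcard pattern."""
    filename = posixpath.basename(urlparse(url).path)
    return fnmatchcase(filename, pattern)


def filter_by_truncation(urls, site, pattern):
    """Keep URLs hosted on `site` (or a subdomain) whose filename matches `pattern`."""
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        on_site = host == site or host.endswith("." + site)
        if on_site and filename_matches(url, pattern):
            kept.append(url)
    return kept


sample_urls = [
    "http://www.the-scientist.com/yr2000/jan/hot_000110.html",
    "http://www.the-scientist.com/hotpapersarchive.htm",
    "http://www.the-scientist.com/yr2002/oct/hot_021028.html",
    "http://www.example.com/hot_stuff.html",  # hypothetical entry on the wrong host
]

hits = filter_by_truncation(sample_urls, "the-scientist.com", "hot_*")
for url in hits:
    print(url)
```

Note that the archive page is filtered out here, because `hot_*` insists on the underscore; a looser pattern such as `hot*` would keep it.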
Truncation! How important! Moreover, those among you that have visited the first link given above by Nemo will
have noticed some URLs: www.the-scientist.com/yr2000/jan/hot_000110.html,
etc... mmmm :-)
Re: Re: Re: site specific searches/filenames ?
well... here's a couple of my tricks (guess that's what you'd call them :)
archive org of course will give you lots of info ... however even with many url names (including nemo's above) you still get delivered TO the registration page
(I am unclear as to the problem though --- it seems to indicate it is a free reg (?) and then u can proceed to read free of charge ?) that's what i read
I didn't register so am not clear or sure about this (butt you indicate that you have to pay)?
butt to a trick that works sometimes ... sometimes...
lets take Your posted url: http://www.the-scientist.com/yr2002/oct/hot_021028.html
now lets feed it to google ...
now lets click on googles cache of (your url... or any url at the site that you have brought up in the search engine that produces the doc that you want to read)
google's cache whisks you off to their Registration page ...
because somewhere it is reading an if and/or else statement telling it you are a bad guy
could it be in googles cache?
so, back space now back to the google page
and click again on the cache link
now VERY QUICKLY before it can transfer you to the registration page HIT YOUR STOP BUTTON (if you are not quick enough u may have to try several times...)
you will be staring at a blank googles cache page...
butt lets not stop there ... lets look at view source... [i tried
pasting the source here but it didn't want to work correctly] ... make
a copy of your 'source' and put it in an editor and view the html
page :) :) works fine
not only is the code for transferring you to the bad guy page there
(i guess ---im not a coder)... butt WhalaH --- also THERE you will see
is the article that you wanted to read in the source :) :)
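The "copy the source into an editor" step can also be done with a few lines of Python. A minimal sketch, assuming you have already saved the page source as a string: it drops everything inside script and style elements and keeps the readable text. The sample source below is invented for illustration; it is not the real page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def visible_text(html_source):
    """Return the readable text of a saved HTML source, one chunk per line."""
    parser = VisibleText()
    parser.feed(html_source)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical saved source: a redirect script followed by the article itself.
saved_source = """<html><head>
<script>window.location = "/register.html";</script>
</head><body><h1>Hot Papers</h1><p>Article text is still here.</p></body></html>"""

print(visible_text(saved_source))
```

Nothing is executed here, so the redirect never fires: the script is just text that gets thrown away.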
what i find even more interesting in this little project is
the scientist's robots.txt
User-agent: * # applies to all robots
now why in the world would they block out a specific name???
google shows some rather nice returns for that name :) [although i don't
have time to figure out if he is listing and giving the articles essays away
for free at his sites --- the pdf files seem to work --- or why his name is
??? ohoooooooooo well geeeeeeeeeeesh -- you know the clocks were rolled BACK
yesterday and its only 11:20 here --- but really it should be 12:20 and by 12:20
I have had at least one beer so i guess my overactive mind isn't slowed down
enough because of the lack of beeeeeeeeeeeeeeeeeeeeeeeeer ... ok all it
means is this i guess --- http://www.the-scientist.com/eugene_garfield/
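Reading a robots.txt for what it tries to hide is easy to automate. A minimal sketch, with one loud assumption: the Disallow line in the sample is NOT quoted in the thread; it is reconstructed from the /eugene_garfield/ path mentioned above and may not match the real file. The function name is made up.

```python
def disallowed_paths(robots_txt):
    """Map each User-agent to the list of paths it is asked not to fetch."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share the rules that follow them.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value)
            rules.setdefault(value, [])
            last_was_agent = True
        else:
            if field == "disallow" and value:
                for agent in current_agents:
                    rules[agent].append(value)
            last_was_agent = False
    return rules


# Sample file: the first line is from the thread, the second is an assumption.
sample_robots = """User-agent: *   # applies to all robots
Disallow: /eugene_garfield/
"""

print(disallowed_paths(sample_robots))
```

Every path listed this way is a hint about what the site owner would rather not see indexed, which is exactly why a seeker reads it first.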
so lets re-evaluate
if you try nemo's page with your url on it ... and
click it it brings you to the registration or login page ... they
want an email address to let you in --- yes?
don't u suppose that someone who works
there has an email address that lets him in???
lets ask google:
well sonofabeehive ... google lists a number of members
lets snatch the very first one's email address and try it
lets disguise this emailaddress a little here, so that the harvesting bots and other nasties
won't index it:
now let's paste that above (corrected of course, with @ and everything) into
the email addy into the login page ... annnnnnnnnnnd
what was that quote?? oh yes "That's funny ..." Isaac Asimov
sunofabee there's your article with pictures ...
(DO NOT --- mess with the guys account info should you try this --- thanks)
He did it! Among much other useful information in the snippet above (like the checking
of the robots.txt), Jeff shows you an incredibly powerful access-shortcut whenever someone
dares to stop seekers asking for a registered email address: they want an email address? Let's
give them proper emailaddresses a-plenty! :-)
Re: Re: Re: Re: Re: on second thought
Haha, jeff, you rock! (again and as always)
My 2c: since the redirection is handled by
The voice of the rational! Google cache gives for certain that
that everyone uses
with lynx (see the
tools page) and be prepared for some surprises! :-)
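The thread suggests the bounce to the registration page happens on the client side, which would explain why a text browser is not carried along. A minimal sketch, assuming the redirect is either a script assignment to `location` or a meta refresh (the surviving text does not say which): it only inspects a saved source and reports which mechanism it sees. The patterns and sample strings are illustrative, not exhaustive.

```python
import re

# <meta http-equiv="refresh" ...> in any quoting style.
META_REFRESH = re.compile(r'<meta[^>]+http-equiv\s*=\s*["\']?refresh', re.IGNORECASE)

# location = ..., location.href = ..., location.replace(...), location.assign(...)
JS_REDIRECT = re.compile(
    r'\blocation(?:\.href)?\s*=(?!=)|\blocation\.(?:replace|assign)\s*\(',
    re.IGNORECASE,
)


def client_side_redirects(html_source):
    """Return the client-side redirect mechanisms found in the source."""
    found = []
    if META_REFRESH.search(html_source):
        found.append("meta-refresh")
    if JS_REDIRECT.search(html_source):
        found.append("javascript")
    return found


# Hypothetical samples for illustration.
scripted = '<script>window.location = "/register.html";</script><p>article</p>'
refreshed = '<meta http-equiv="refresh" content="0; url=/register.html">'
plain = "<p>just an article</p>"

print(client_side_redirects(scripted))
print(client_side_redirects(refreshed))
print(client_side_redirects(plain))
```

A client that neither runs scripts nor honours meta refresh simply shows what is in the source, which is the whole point of the lynx advice.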
Re: Re: Re: Re: Re: Re: on second thought and third thought
(I knew one of you js guys would know what to do!! :) :)
on third thought, as i was driving to the store and re-thinking,
i feel i did something wrong in this thread
digital was trying very hard to understand google
and proceed at abcdefg
I jumped him all the way to google - xyz
i should not have done that ... because his specific
efforts and google-questions were not really answered by my tricks
i apologize digital ... please proceed with your questions
i just finally figured out what you meant
and then click on googles cache
oh yeeees ... so much easier :) thanks!
you rock! :)
It's true, he went to google "xyz", even if we did learn a lot by his digression :-)
Re: Re: Re: Re: Re: Re: Re: on second thought and third thought
jeff your post was excellent, also Nemo's that showed how to master
booleans on the engines capable of them (not google unfortunately).
strangely, your findings were due to a misunderstanding:
the URL is freely accessible directly, I wonder why google's cache
redirects to a sign-in page! will have to check that. I use proxomitron
so there are no automatic redirects :-)
but sure I will apply your steps in the future, thanks
The registration policy has probably changed over time, and as Nemo pointed out: "it's not so much a question of booleans, because since late 2000 you can do that on google as well: Boolean Searching on Google. You can even use the operator OR inside phrases: "advertising OR advertisement statistics".
It's a problem of truncation.
You can use * on google, but it doesn't work in the same way... on google it replaces an entire word, example: "ad augusta * angusta".
On altavista you can use * or **, here is an explanation how they work: truncation on altavista.
You can read some more information about * on google here".
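The difference between the two kinds of star is easier to see in code. A minimal sketch using regular expressions: one function treats * as "exactly one whole word" inside a phrase (the google behaviour described above), the other treats a trailing * as "any run of further characters" (truncation). The difference between altavista's * and ** is not modelled, and both function names are made up.

```python
import re


def whole_word_wildcard(phrase):
    """Compile a phrase in which each lone * stands for exactly one whole word."""
    parts = [r"\S+" if token == "*" else re.escape(token) for token in phrase.split()]
    return re.compile(r"(?<!\S)" + r"\s+".join(parts) + r"(?!\S)", re.IGNORECASE)


def truncation(stem):
    """Compile a stem whose trailing * stands for any run of word characters."""
    if not stem.endswith("*"):
        raise ValueError("a truncated stem must end with *")
    return re.compile(r"(?<!\w)" + re.escape(stem[:-1]) + r"\w*", re.IGNORECASE)


phrase = whole_word_wildcard("ad augusta * angusta")
print(bool(phrase.search("ad augusta per angusta")))   # one word fills the star
print(bool(phrase.search("ad augusta angusta")))       # nothing to fill the star

stem = truncation("hot_*")
print(stem.search("see hot_021028 for details").group(0))

# The whole-word reading cannot complete a filename: here * is a literal star.
print(bool(whole_word_wildcard("hot_*").search("hot_021028")))
```

That last line is the original problem in miniature: a star that only replaces whole words is no help when the unknown part sits inside a filename.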
Re: Re: Re: site specific searches/filenames ?
>I wonder why Google does not allow wildcards, it is true that most often you
>get the results, but the * is very useful sometimes.
Yes, truncation is very useful sometimes... the best search engine for
truncation was northernlight where you could use queries like this one:
+"*lempicka*.jpg" hehehe... here
is the url:
>Lets see if I understand your logic:
>(host:the-scientist.com OR host:www.the-scientist.com) AND url:hot_** on altavista
>is the OR here just to get all the domains if the URL changes?
>won't host:the-scientist.com catch all?
You're right host:www.the-scientist.com is contained in host:the-scientist.com, but I joined the two just in case...
>+"the-scientist.com" +domain:com +hot_* on hotbot
>why the +domain:com here?
Because you said that the pages are in a commercial site... and you showed a working page...
the long way
looked at the example above and ran
at AV and shifted the months to get the following
seems to stop there so that may be the last time AV spidered it.
did a little induction (deduction? guessing?) and came up with
checked a few, seem to be good. except for the ones not out yet.
Once again, the power of guessing...hehe :-)
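The guessing itself can be mechanised. A minimal sketch, built only on the two URLs quoted in this thread (yr2000/jan/hot_000110.html and yr2002/oct/hot_021028.html): the directory looks like yrYYYY/mon and the filename like hot_YYMMDD.html. Everything else is an assumption: the three-letter month names other than jan and oct, the seven-day step, and the function names. Candidates still have to be checked one by one, as the poster did.

```python
from datetime import date, timedelta

# Only "jan" and "oct" appear in the thread; the other names are guesses.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def hot_url(day):
    """Build the guessed URL for one date, following the two known examples."""
    return "http://www.the-scientist.com/yr{:04d}/{}/hot_{:02d}{:02d}{:02d}.html".format(
        day.year, MONTHS[day.month - 1], day.year % 100, day.month, day.day
    )


def candidate_urls(start, end, step_days=7):
    """List guessed URLs from `start` to `end` inclusive, every `step_days` days."""
    urls = []
    day = start
    while day <= end:
        urls.append(hot_url(day))
        day += timedelta(days=step_days)
    return urls


candidates = candidate_urls(date(2000, 1, 10), date(2002, 10, 28))
print(len(candidates))
print(candidates[0])
print(candidates[-1])
```

Both known dates fall on a Monday, so a weekly step starting from the first one lands on the second; whether every intermediate candidate exists is exactly what has to be tested by hand.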
Let's just imagine that there exist on the web sites similar to the-scientist, but that are not as open and freely accessible as this one is, and require a fee in order to access information (yes, it happens, alas).
Let's imagine that they too have a subdirectory structure similar to the one explained above. Well, you could do the same, and 'guess' them. See, for an example, the bottom part of my flange page.
(c) III Millennium: [fravia+], all rights