~ Some Oddities @ Raging ~
(Reversing a new search engine)
By ~S~ Humphrey P. and others
Courtesy of searchlore.org, May
in fieri document for advanced searchers
Check [a look at the STRUCTURE of
Altavista] as well
This is an exceptional document (though, as usual with Humphrey P,
not very easy to read :-) which
represents an ongoing exercise in order to understand some quirks
A little history: The advent of Google
good algos, quick delivery of excellent results, not to mention the "cached pages"
most excellent bonus)
Altavista to try to re-gain its 'core' audience, which was migrating en masse
google (as funny as it may sound to you most users stick to ONE search engine for
whatever query they are performing :-(
Raging represents Altavista 'striking
it seems indeed a nice tool. But how does the "new" Raging engine really work and
interface with Altavista's databases?
That's what Humphrey and other seekers try to
understand (and explain) here...
|Presentation (by Gregor Samsa)|
I tried the new search interface altavista provides at
it ! It seems to have all features from altavista's simple search and there are NO
a good deal faster than the 'regular' altavista web interface.
to have a
on the way the customizing works. It is on a separate page and you customize your
or after you search. Apparently, the options are kept as long as you stay
I found something which might be a new feature.
was unknown to me.
Compare these two searches:
==> 1702 hits
The first search I performed using the default settings. On
altavista page this sets the KL-parameter to "XX", which stands for 'any
result I got by checking the boxes for English, French, German and Spanish on the
page. I do not know why, but the engine finds a larger number of hits when certain
chosen than by the default (which should be 'any language' here as well, shouldn't
Does this also work with the 'old' interface ? No. As you can see
following examples, only the LAST language parameter given is used.
==> 27 hits in Spanish
==> 2385 hits in English
Striking here, too: A search
for English pages turns up more hits than a
search for pages in any language. I'd really like to see the code they
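The "only the LAST language parameter is used" behaviour is what you would get from any query parser that keeps one value per name and lets later values win. A minimal sketch in Python, just to make the idea concrete: the kl name and the host come from the searches above, while the q parameter, the sample URL and the "XX" fallback are assumptions for illustration. Nothing here is AltaVista's actual code.

```python
from urllib.parse import urlsplit, parse_qs


def effective_language(url, default="XX"):
    """Return the kl value a 'last one wins' parser would end up using."""
    params = parse_qs(urlsplit(url).query)  # repeated names become lists, in order
    values = params.get("kl", [])
    return values[-1] if values else default


# Hypothetical query URL carrying four kl parameters in a row.
url = ("http://www.altavista.com/cgi-bin/query"
       "?q=%2BGustav+%2BII+%2BAdolf&kl=en&kl=fr&kl=de&kl=es")
print(effective_language(url))  # prints "es"
```

Note that parse_qs treats names as case-sensitive, so in this sketch kl and KL would be two different parameters.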
|Trouble in Bilbao City (by ~S~ Humphrey
Trouble, in Bilbao City, (with a CapiTal T and that rhymes with P and that stands
Tabulation of presumptions, followed by
1702 pages found // word count: Adolf: 387514; Gustav: 461825; II:
38 pages found:
Searching in ONLY: • English • French • German • Spanish // word count: Adolf:
461825; II: 33305000
Cookie: AV_RAGESETTINGS=v1:30:3:1::enfrdees Tue Dec
27 hits in Spanish
found. // word
count: Adolf: 387514; Gustav: 461825; II:
2385 hits in English
pages found. //
word count: Adolf: 387514; Gustav: 461825; II:
number of word counts: Adolf: 387514; Gustav: 461825; II: 33305000
to be looking
at the same database, and the same search tree.
You haven't seen all
(a) 1702 (b)
5438 nor (d) 2,385 pages.
How are they alike? How are they
Ordinary ad filled search, where
1,702 pages found. // word count: Adolf: 387514; Gustav:
Comparing -e- with -a- I conclude: Statistics for results
where &kl=xx have
not changed between what the old altavista.com WAS doing and the new
Let's pick a really esoteric international subject.
brand new, because I want to see all the found items to figure out what
&stype=stext WAS doing.
Maybe that will give a clue as to what &Translate=on IS doing.
there WAS a
difference between .es and .en so, let's add .xx to those two in a reduced field
compare our three
2 pages found // word count:
Adolf: 387514; Gustav: 461825; II:
13 pages found // word count:
Adolf: 387514; Gustav: 461825; II:
6 pages found. // word count:
Adolf: 387514; Gustav: 461825; II: 33305000
The phenomenon persists. We
are getting more
pages with &kl=en (13) than with &kl=es (2) or &kl=xx (6).
And the pages
Not the same.
If you are drawing Venn diagrams, none of
circles overlap. There are no intersections.
Whatever [any language]
(&kl=xx) means, it
doesn't mean all the languages in the list OR (see ORs)
-the tardy student
there is a hidden time limit, and three different hash tables, and the &kl=xx is
the biggest hash
table, so the finder doesn't get through it in time to report all there is to
OR More is less.
+Gustav +II +Adolf (2385)
35 pages found //word count: Bilbao: 270771; Adolf:
461825; II: 33305000
There's a difference between &kl=xx and
XX does have
(by ~S~ Humphrey P.)|
"I" quit quite too soon in our analysis of AltaVista's new search
For, by itself, choosing between using kl=XX or kl=xx doesn't
answer your original question about why kl=en [English] should find more entries
than kl=XX [any language].
For, that is what you had proven with your
You had used kl=XX (big XX = inclusive OR?) and you got more
kl=en [English] than kl=XX [any language] with the +Gustav +"II" +Adolf
You had not used kl=xx (little xx = exclusive OR?). I stumbled upon
that. I made that mistake. And I got fewer kl=xx than kl=en, but more kl=XX than
kl=en with the +Gustav +"II" +Adolf +Bilbao query.
Let's see. I have the
tabulations. I tried it again, the next day, when I was awake, and could see I
hadn't proven anything.
A few days later so statistics are a little bit
+Gustav +"II" +Adolf
word count: Adolf: 374906; Gustav:
27 pages found.
836 pages found.
1,702 pages found.
+Gustav +"II" +Adolf +Bilbao
word count: Bilbao: 251970; Adolf:
374906; Gustav: 462989; II:
35 pages found.
Ok, now, our observations are these:
www.altavista.com didn't allow several languages at a time, but took the last one
in the list.
with the +Gustav +"II" +Adolf query,
[&kl=xx 836,] &kl=XX 1702, &kl=en 2384
It's odd that there are more English
(en) than AnyLanguage (XX).
+Gustav +"II" +Adolf +Bilbao
&kl=es 2, &kl=xx 6, &kl=en 13, &kl=XX 35
There are fewer
&kl=xx than &kl=en, but more &kl=XX than &kl=en. Apparently XX means intersection
(inclusive vs exclusive OR)
However, in our first set of
facts, we don't see how the XX (1702) included all the EN
Did I get the facts stated correctly, this
All we've managed to do, so far, is to watch the shadow of
"altavista.com/cgi-bin/query?" from a distance on different terrain at two
different times of day. We're not close to making a sundial, yet.
really want, is to have a look at the source.
Well, how big is it, and what
is it called?
And we are assuming there is a difference in terrain? That
"ragingsearch.altavista.com/cgi-bin/query?" is really different from
"www.altavista.com/cgi-bin/query?" -?- We just want to know how it's
I should learn more about CGI, instead of just
guessing, don't you think? Although CGI is not a language, but rather a way of
doing things, there should be constraints.
Let's see, what do I want to
There's a page of parameters somewhere for AltaVista...
means language, ...
Some time ago, we were looking for original
development and doctoral theses and white papers and user's manuals, but
"everybody" got bored with that... I'd bet the original database design hasn't
changed much from then. Might shut up and go looking again. Berkeley, wasn't it?
jeff had a section on .CGI didn't he? On his
ww.cgi-resources.com/Programs_and_Scripts/Perl/ The Cgi Resource Index
http://www.servers.nu/index.html Cgi scripts
Two resource sites, and a tut.
Ok, I can add a
simpler intro tut to
Stanislav's CGI Programming Tutorial
"Each pair is separated by an ampersand
(&). Please note I said separated, not terminated. There is no ampersand after the
but my find doesn't go very far. Just takes away a few of my
"The trick is in realizing that, despite appearances, the URL
has nothing to do with the command line of the program.
How, then, do you get
to the data from within your program? You read it from the environment variable
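To see what the tutorial is getting at, here is a minimal sketch of a CGI-style program in Python. It assumes the usual CGI conventions: GET data arrives in the QUERY_STRING environment variable, POST data arrives on standard input with its size in CONTENT_LENGTH, and the answer goes to standard output. The function name read_form is made up for this example, and this is of course not what sits behind AltaVista's "query?".

```python
import os
import sys
from urllib.parse import parse_qsl


def read_form(environ, stdin):
    """Collect the name=value pairs the way a CGI program receives them."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # POST: the pairs are in the request body, which the server pipes to stdin.
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the server copies everything after the '?' into QUERY_STRING.
        raw = environ.get("QUERY_STRING", "")
    return parse_qsl(raw, keep_blank_values=True)  # splits on '&', then on '='


def main():
    pairs = read_form(os.environ, sys.stdin)
    # A CGI reply is a header block, a blank line, then the body, all on stdout.
    sys.stdout.write("Content-Type: text/plain\r\n\r\n")
    for name, value in pairs:
        sys.stdout.write(f"{name} = {value}\n")


if __name__ == "__main__":
    main()
```

Run by hand with QUERY_STRING set in the shell, it simply echoes the pairs back, which is enough to watch how '+' and '%2B' are decoded.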
Hmmm. Has Mammon mastered CGI and explained it
(that was foolish! Cardinal Newman. what
did he know ...)
[host:eccentrica.org +mammon +cgi]
Hmmm. Java is
right down there with Voting for
hweb.cgi.general&utag= EarthWeb Discussions: earthweb.cgi.general
p=earthweb.cgi.general&item=541&utag= Subject: Want to learn CGI programming
http://www.wdvl.com Web developer's Virtual Library: Encyclopedia of
Web Design Tutorials, Articles and
http://www.wdvl.com/Authoring/CGI/ CGI: The Common Gateway
Interface for Server-side Processing by Alan Richmond
Yahoo! Home >
Computers and Internet > Internet > World Wide Web > CGI - Common Gateway
http://hoohoo.ncsa.uiuc.edu/cgi/ The Common Gateway
http://www.execpc.com/~keithp/bdlogcgi.htm How to use your
"With so many free CGI scripts available (including
http://www.execpc.com/~keithp/bdlogftr.htm Bestdam Logger Lite)..."
Do you suppose, someone has liberated "query?" and posted it somewhere? Can you
use an altavista search engine on your own pages? Who has done that and then gone
fishing and is just sitting there, like a wrecked ship?
What do I want to know:
Is it always a script? Nahh. Could be
Perl. Could be C. Could be any language. Anything to take information from the
"QUERY_STRING" or from STDIN, and send it to STDOUT.
What am I going
to find in /cgi-bin/ -?- A program? Nothing? (Everybody will want theirs to be
Even if I know its name, will I be able to download
it from /cgi-bin/ -?- (Maybe the subdirectory is protected. Maybe the actual
program is named something different. Maybe sending the URL activates the
program, rather than selecting it for download. Is GET or POST always
How do you put it there? Or is it even there? Maybe you
just tell the sysadmin about it, and he tells the server that when you see a URL
of www.altavista.com/cgi-bin/query? you look up "query" in a table and locate
whatever program is associated, wherever you told him it was.
- That when
you see a URL of ragingsearch.altavista.com/cgi-bin/query? you find a different
entry in a table, and associate a different program with it...
sends the URL to a separate server...
OR, ragingsearch is read out of the URL
and the first simple change at the top of the program "query?" branches to the
"ragingsearch" variations, wherever.
- The last one is most likely.
They wouldn't penalize you for trying the new thing by confining you to one
server. Besides, backup and mirroring is complicated enough, without inventing
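The "most likely" guess, one program that looks at the host name and branches, is easy to sketch. Under CGI conventions the Host header of the request is handed to the program as the HTTP_HOST environment variable. Everything else below (the function names, the "raging" and "classic" labels, the startswith rule) is hypothetical and only illustrates the branching idea.

```python
import os


def pick_variant(environ):
    """Choose a front-end variant from the host name the client asked for."""
    host = environ.get("HTTP_HOST", "").lower()
    # Hypothetical rule: hosts starting with "ragingsearch." get the new look.
    if host.startswith("ragingsearch."):
        return "raging"
    return "classic"


def handle(environ):
    """Same query, same back end; only the presentation label differs."""
    variant = pick_variant(environ)
    query = environ.get("QUERY_STRING", "")
    return f"[{variant}] would run query: {query}"


if __name__ == "__main__":
    print(handle(os.environ))
```

If this guess is right, both interfaces would share one database and one search tree, which fits the identical word counts seen earlier.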
Hmmm. I don't know what it feels like. It will feel
different, when I know something.
Or, can you give me a Zen dope
slap, and fix my attention upon the truth in a minute?
Some contributions (may not be very useful, but, at least, I've tried)....
What a difference ! But do you see where I want to go ? Why have I chosen
such an uninteresting query ? The huge number of results !
Scooter's manual, page 32 :
"A new option on the avs_search (timeout) and a
new api call (avs_timer) allows multi-threaded
applications to enforce maximum query processing times."
A few lines down in the manual,
it's explained that the administrator can
define the applications which have a timer, and which ones do not have one.
I think you've
come to the same conclusion as I have (which may be wrong, of
course, but which seems logical) : a multi-language query (xx), which
would necessarily take more time to execute than a specific-language
query, will have a timer set. On the contrary, a specific one won't
..... (errr.. I'm not sure about this one, though)
Ok, we now know that a
multi-language query has a timer.... BUT, if there
are few results, we can assume that the timer won't be reached... Error !
Haha ! Can you imagine that when searching for special material you only see
2 results, although there are at least 70 !!!!
So, provided that Altavista detects a multi-language query, the timer is
set, and ALWAYS reached.. Too bad for us : we do not have access to
all the databases with a multi-language search.
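To picture what such a timer would do to a result list, here is a small sketch of a scan that gives up when its time budget runs out and reports whatever it has found so far. It is only a model of the hypothesis: the names timed_search and fake_clock, the toy "postings" data and the budget values are all invented, and the real avs_search timeout may behave quite differently.

```python
import time


def timed_search(postings, wanted, budget_seconds, clock=time.monotonic):
    """Scan postings until finished or until the time budget is spent.

    Returns (hits, finished); finished is False when the timer cut the scan short.
    """
    deadline = clock() + budget_seconds
    hits = []
    for doc_id, words in postings:
        if clock() >= deadline:
            return hits, False          # out of time: report a partial list
        if wanted <= words:             # every wanted word occurs in the page
            hits.append(doc_id)
    return hits, True


def fake_clock(step=1.0):
    """Deterministic clock for the demo: each call advances time by 'step'."""
    now = [0.0]

    def tick():
        now[0] += step
        return now[0]

    return tick


# Toy database: even-numbered pages contain both words, odd ones only one.
docs = [(i, {"gustav", "adolf"} if i % 2 == 0 else {"gustav"}) for i in range(10)]
full, done = timed_search(docs, {"gustav", "adolf"}, 100, clock=fake_clock())
cut, done_cut = timed_search(docs, {"gustav", "adolf"}, 4, clock=fake_clock())
print(len(full), done)        # prints "5 True"
print(len(cut), done_cut)     # prints "2 False"
```

The point of the toy: the cut-off scan reports fewer pages even though nothing is missing from the database, and a bigger table to walk through would hit the same deadline sooner.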
Another example :
I know the results have nothing to do with the subject of the search, but I
don't care.... You see here that a (so-called) no-results query could
actually have some. Terrifying !
As a conclusion : ALWAYS make single language queries
One thing is still unclear in my mind, though : why should the timer be
reached, if there are few results ?????? Maybe people at Altavista don't
want their servers to be overloaded... But I doubt this... I'm tired.
Rumsteack (from France, of course)
|Some more data
(by Gregor Samsa)|
I checked the number of hits for each language separately:
search A was :
[+Gustav +"II" +Adolf]
(word count: Adolf: 365276; Gustav: 446836; II:
search B was:
[+Gustav +"II" +Adolf +Bilbao]
Bilbao: 251970; Adolf: 365276; Gustav: 446836; II: 31069265)
language               A      B
&KL=xx               836      4
&kl=xx               836      4
&KL=XX                 0      0
&kl=XX              1702     27
Czech (cs)            87      0
Danish (da)           34      0
Dutch (nl)            41      1
English (en)        2384     13
Estonian (et)         43
Finnish (fi)          28      0
French (fr)           54      1
German (de)         2946      8
Greek (el)             1      0
Hebrew (he)            0      0
Hungarian (hu)        19      0
Icelandic (is)         1      0
Italian (it)          38      0
Japanese (ja)         11      0
Korean (ko)            4      0
Latvian (lv)           0      0
Lithuanian (lt)        0      0
Norwegian (no)        30      0
Polish (pl)           13      0
Portuguese (pt)       12      0
Romanian (ro)          3      0
Russian (ru)           4      0
Swedish (sv)        1426      4
all checked         6559     27
I tried all those &kl=/&KL= combinations with yy and YY. No
I assumed there was a limit somewhere (timeout or
Did not find anything about that, but realized a pattern nevertheless:
Testing with [+Gustav +"II" +Adolf], comparing the expected number of hits with the
lang. => hits (diff.)
el + hu => 20 (0)
pl + pt =>
ro + ru => 7 (0)
ro + pt => 14 (-1)
ro + pl => 15 (-1)
da + fr => 76
fi + fr => 70 (-12)
fi + da => 58 (-4)
de + lv => 2551 (-395)
pl + lv =>
pt + lv => 11 (-1)
en + lv => 2099 (-285)
en + de => 4650 (-680)
There's a system. Do you see it ?
Every language seems to lose a certain number
of hits (compared to when it is used as the only language parameter) when combined with another
language. I first suspected this when there were two hits fewer for [pt + pl] than I had
expected, and in combination with [ro] each of these languages seemed to have one hit fewer.
I then tried a language without any hits (lv) and combined it with both German and English.
[de + lv] showed 395 results fewer than I had expected, [en + lv] 285 fewer. If I was right, [en +
de] should show 285 + 395 = 680 results fewer than the sum of the two languages alone. Indeed,
the list above shows exactly that: [en + de] => 4650 (-680).
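The arithmetic is easy to re-check. The sketch below uses only figures quoted in the table and the list above. The idea of a fixed "loss" per language is the hypothesis being tested, not a known property of the engine, and the helper name shortfall is made up.

```python
# Single-language hit counts for [+Gustav +"II" +Adolf], taken from the table.
single = {"en": 2384, "de": 2946, "lv": 0, "pt": 12, "ro": 3, "ru": 4}

# Observed hit counts for two-language searches, taken from the list.
# Keys are language pairs in alphabetical order.
pairs = {
    ("de", "lv"): 2551,
    ("en", "lv"): 2099,
    ("de", "en"): 4650,
    ("lv", "pt"): 11,
    ("pt", "ro"): 14,
    ("ro", "ru"): 7,
}


def shortfall(a, b):
    """Hits missing from a two-language search, relative to the two single searches."""
    key = tuple(sorted((a, b)))
    return single[a] + single[b] - pairs[key]


# lv has no hits of its own, so pairing with lv isolates the other language's loss.
loss_en = shortfall("en", "lv")
loss_de = shortfall("de", "lv")
loss_pt = shortfall("pt", "lv")
print("en loses", loss_en, "/ de loses", loss_de)            # en loses 285 / de loses 395
print("predicted en+de shortfall:", loss_en + loss_de)       # 680
print("observed  en+de shortfall:", shortfall("en", "de"))   # 680

# ro + ru shows no shortfall at all, so (if losses are never negative) ro loses 0
# and ro + pt should be short by exactly pt's own loss.
print("ro+pt shortfall:", shortfall("ro", "pt"), "/ pt alone loses", loss_pt)  # 1 / 1
```

Both checks agree with the additive pattern, which supports the hypothesis without explaining where the lost hits go.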
I try not to draw conclusions too early. I really do not know how
this CGI works or what their database is like (THAT would be an interesting point !)
I don't know if you can use something like Oracle with 350.000.000 records and still expect a
reasonable response time for your SQL selects. Well, in the end I'm back at the beginning: Ich
weiß, dass ich nichts weiß (I know that I know nothing).
I stop here and go to bed. It seems to be a good
idea not to work long nights on such things - one forgets about cookies too easily...
|Oh, I get it (finally)
(by ~S~ Humphrey P.)|
Oh, I get it. (finally.)
If you were to design a search engine, you would let
the advanced search go longer, or try harder, but cut the simple searcher off
You are assuming that the simple searcher doesn't want to wait,
or doesn't care about the last few million 'Treffer' (hits) the 'pinball' search engine
The problem with stopping before searching the whole database, is
that the newest item might be the last one in the database, and you'd come away
thinking that AltaVista never updated their database...
or, the one which
best matches your string of keywords, might be in the part of the database the
search engine never got to.
It would be better to presort, to preindex, to
presearch, to run on the fastest available machinery, anything you could do to
optimize searching, in order to make the whole database available to both the simple
searcher and the one who customized his search.
But, the presentation is
very important, and very different... In the simple search, you are making
assumptions for the searcher. Call them defaults, call them design flaws,
call them public relations...
You've been to a wonderful site before,
haven't you? You know they have something, but the site has so many pages,
and no sitemap, so you decide to use their search engine. And you give it a
two or three word search, and the search engine says: 'nothing found.'
You know it is there. But you don't know the one word which would find
I've written some of those bafflers myself - design flaws.
jeff's lookime for fravia was like that... I couldn't quite get
it to give me the results I wanted. For instance, trying to find 'java,'
but not 'java script.'
Let's not pick on jeff. He grabbed
what he could. There are lots of other conundrum site search engines
at Big companies.
You get way too many hits with one word.
And with two words, you either get no hits, as though it had to be
a phrase (tight AND)
Or else the simple engine assumes the two
words don't need to have anything in common, nor refer to
the same topic, but just appear on the same page (lazy OR.)
else the 'lucky' way of using the engine is there, and 'natural' to the
programmer, but not to anyone else. (That's the kind I write.)
after scanning its 'billions and billions,' will put the entries in the
order of 'most found first,' ... 'least found last.'
since you brought it up, is running a popularity contest, and putting up
front those sites which others most often refer to. And besides that,
they consult a WWW directory.
(By the way, you should be able to
improve your site's 'popularity' by writing your section of that web
Here's a query on [search engine optimization] which
(yes, of course there's more here than I am letting on, but skip it for now. ;)
(Just notice that you can't tell where Netscape ends and Google begins... Nor
later where Netscape Open Directory ends and Open Directory Project
Web Site Categories
1 - 3 of 3
reviewed web sites related to your search term.
1 - 10 of 94
Web sites reviewed and categorized by a
team of editors.
Help build the largest
human-edited directory on the web.
Become an Editor (http://home.netscape.com/escapes/search/beditor.html?cp=srpstatic)
a site (http://home.netscape.com/escapes/search/addsite.html?cp=srpstatic)
The first one whisks you off to:
"If you would like to help build Netscape
Open Directory, it's easy to apply:"
and sends you back to the front
http://search.netscape.com/ Netscape Search.
Where we see a
very nice presentation of the Netscape Directory.
Computers > Internet >
WWW (or Web) > Searching the
the Web by the Open Directory Project."
(Amazing what Netscape has
started, isn't it?)
"Discussion, help and tutorials, comparisons,
integrated search pages, mailing lists and newsletters,
submitting and positioning."
Somewhere in the 'Submitting and
Positioning,' or in the 'Search Engine Optimization,' somebody has got
them all figured out already... every search engine's strengths and
weaknesses, each one's strategies, their peculiarities...
of submitting and positioning as Reverse Engineering of Search
there is Search Engine Watch, top of the list. Editors' choice.)
I'm not saying that Open Directory Project has considered the whole web. Nor that ODP
is better than Yahoo, (it's not.) But just that Google uses this Open Directory to
form an 'opinion' about what is important, and what's popular. And it could be
your opinion which they are considering. After that, they have their own
Now, where in this thoughtful process would you
interrupt Google, and say, "I want your 5 second opinion, not your 20 second
How could you fool MetaCrawler into thinking that's what you
were giving them?
How to preprocess a query you had never seen before?
(Oh, there's one. Keep a history of queries you HAD seen before. Select one
which looks similar.)
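As a thought experiment, the "history of queries" idea could look like the sketch below. It normalises each query (lower case, words sorted) and falls back on a fuzzy string match from Python's difflib when there is no exact hit. The class name, the 0.7 cutoff and the stored "results" are assumptions for illustration; nobody outside knows whether any engine really does this.

```python
import difflib


class QueryHistory:
    """Remember answered queries and reuse the answer for look-alike queries."""

    def __init__(self):
        self._results = {}

    @staticmethod
    def _normalise(query):
        # Word order and case should not matter for "have I seen this before?".
        return " ".join(sorted(query.lower().split()))

    def remember(self, query, results):
        self._results[self._normalise(query)] = results

    def lookup(self, query, cutoff=0.7):
        """Return (matched_query, results), or None when nothing looks similar."""
        key = self._normalise(query)
        if key in self._results:
            return key, self._results[key]
        close = difflib.get_close_matches(key, list(self._results), n=1, cutoff=cutoff)
        if close:
            return close[0], self._results[close[0]]
        return None


history = QueryHistory()
history.remember('+Gustav +"II" +Adolf', ["cached result page"])
print(history.lookup('+adolf +gustav +"ii"'))           # same words, different order
print(history.lookup('+Gustav +"II" +Adolf +Bilbao'))   # one extra word
print(history.lookup("bilbao"))                         # nothing similar stored
```

The weakness shows immediately: the look-alike answer for the +Bilbao query is the answer to a different question, which is exactly the kind of shortcut that would produce odd hit counts.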
How to stop your ranking algorithm in the middle?
That's kind of like stopping in the middle of a sort, isn't it?
there's another one. If you thought you were going to be interrupted, you might
be doing a little presentation processing right along with the sorting. "Well, I
didn't get it all sorted, but here's what I've got sorted so far.")
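One way to have "what I've got sorted so far" always be presentable is to hand out results best-first, one at a time, instead of sorting everything before showing anything. A minimal sketch with a heap: the page names and scores are invented, and the interruption is simulated with a simple result limit rather than a real timer.

```python
import heapq


def best_first(scored_pages):
    """Yield (score, page) pairs from best to worst, one at a time."""
    # heapq is a min-heap, so store negated scores to pop the highest first.
    heap = [(-score, page) for page, score in scored_pages]
    heapq.heapify(heap)                      # linear time, no full sort yet
    while heap:
        neg_score, page = heapq.heappop(heap)
        yield -neg_score, page


def sorted_so_far(scored_pages, max_results):
    """Stop early; whatever was produced before stopping is already in order."""
    ranked = []
    for item in best_first(scored_pages):
        if len(ranked) >= max_results:       # stand-in for "you got interrupted"
            break
        ranked.append(item)
    return ranked


pages = [("a", 3), ("b", 9), ("c", 1), ("d", 7)]
print(sorted_so_far(pages, 2))   # prints [(9, 'b'), (7, 'd')]
```

Stopping this early loses the tail of the ranking but never shows a badly ordered head, which is about the best an interrupted search could offer.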
remember, the client is always right.
If he wants instant gratification, give
A rattle, a hug, the first card off the bottom of the deck...
something someone else has looked up before... what he asked for.
keep processing along his frame of mind, in anticipation that he will ask a further
question. You might keep processing, but with diminishing expectations that he will
ask a further question. If you can, keep his answered question in short term memory,
so you can pick up your search from there.
Gahhh! This is all
How do they do it? How do they do it?
(they put 'fravia' in every erotic keyword string... no. -
different topic ;-)
Tell me what you know about Google.
Still quite in fieri, I'm afraid... what about contributing with your own suggestions
(c) 2000: [fravia+], all rights