~ Some Oddities @ Raging ~
(Reversing a new search engine)
By ~S~ Humphrey P. and others
Courtesy of, May 2000

in fieri document for advanced searchers

Check [a look at the STRUCTURE of Altavista] as well

[Presentation] by Gregor Samsa
[Trouble in Bilbao City] by ~S~ Humphrey P.
[Delving deeper] by ~S~ Humphrey P.
[From France] by Rumsteack
[Some more data] by Gregor Samsa
[Oh, I get it (finally)] by ~S~ Humphrey P.

This is an exceptional document (though, as usual with Humphrey P, not very easy to read :-) which represents an ongoing exercise in order to understand some quirks at
A little history: The advent of Google (clean interface, good algos, quick delivery of excellent results, not to mention the most excellent "cached pages" bonus) forced Altavista to try to regain its 'core' audience, which was migrating en masse towards Google (as funny as it may sound to you, most users stick to ONE search engine for whatever query they are performing :-(
Raging represents Altavista 'striking back', and it seems indeed a nice tool. But how does the "new" Raging engine really work, and how does it interface with Altavista's databases? That's what Humphrey and other seekers try to understand (and explain) here...

Presentation (by Gregor Samsa)

I tried the new search interface altavista provides at
Man, you'll like it ! It seems to have all features from altavista's simple search and there are NO GRAFIX - it's a good deal faster than the 'regular' altavista web interface.

You'll have to have a look at the way the customizing works. It is on a separate page, and you can customize your results before or after you search. Apparently, the options are kept as long as you stay connected.


I found something which might be a new feature. At least it was unknown to me.
Compare these two searches:

==> 1702 hits

==> 5438 hits

The first search I performed using the default settings. On the 'old' altavista page this sets the KL-parameter to "XX", which stands for 'any language'.
The second result I got by checking the boxes for English, French, German and Spanish on the customizing page. I do not know why, but the engine finds a larger number of hits when certain languages are chosen than by the default (which should be 'any language' here as well, shouldn't it ?).

Does this also work with the 'old' interface ? No. As you can see by the following examples, only the LAST language parameter given is used.

==> 27 hits in Spanish

==> 2385 hits in English

Striking here, too: A search for English pages turns up more hits than a search for pages in any language. I'd really like to see the code they use...
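The "only the LAST language parameter is used" behaviour can be sketched in a few lines of Python. This is only a model of what was observed from outside, not AltaVista's code: the query string is assembled from the parameter names seen in these experiments (q, kl, stype), and the function names are invented for the illustration.

```python
from urllib.parse import parse_qs

# Query string modelled on the 'old' interface examples. Only the parameter
# names (q, kl, stype) come from the observations; the rest is illustration.
QUERY = "q=%2BGustav+%2B%22II%22+%2BAdolf&kl=es&kl=fr&kl=de&kl=en&stype=stext"


def languages_sent(query_string):
    """Return every kl value, in the order it appears in the query string."""
    return parse_qs(query_string).get("kl", [])


def language_used_last_wins(query_string):
    """Model of the observed behaviour: only the final kl value counts."""
    values = languages_sent(query_string)
    return values[-1] if values else None


decoded_terms = parse_qs(QUERY)["q"][0]          # the search terms, URL-decoded
effective_language = language_used_last_wins(QUERY)
```

With four kl values in the string, a "last one wins" reader ends up with en only, which would match the 2385 English hits above.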

Trouble in Bilbao City (by ~S~ Humphrey P.)

Trouble, in Bilbao City, (with a CapiTal T and that rhymes with P and that stands for POOL!)

Tabulation of presumptions, followed by results.

-a- BAdolf
1702 pages found // word count: Adolf: 387514; Gustav: 461825; II: 33305000

-b- +%2B%22II%22+%2BAdolf&FFF=off&wfmt=d&nbq=30&KL=en&KL=fr&KL=de&KL=es&Translate=on&prf=Submit
5438 pages found: Searching in ONLY: English French German Spanish // word count: Adolf: 387514; Gustav: 461825; II: 33305000

Cookie: AV_RAGESETTINGS=v1:30:3:1::enfrdees Tue Dec 31 05:58:26 2013

-c- stav+%2B%22II%22+%2BAdolf&kl=en&kl=fr&kl=de&kl=es&stype=stext
27 hits in Spanish
27 pages found. // word count: Adolf: 387514; Gustav: 461825; II: 33305000

-d- stav+%2B%22II%22+%2BAdolf&kl=es&kl=fr&kl=de&kl=en&stype=stext
2385 hits in English
2,385 pages found. // word count: Adolf: 387514; Gustav: 461825; II: 33305000

same number of word counts: Adolf: 387514; Gustav: 461825; II: 33305000

We seem to be looking at the same database, and the same search tree.
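A side note on the AV_RAGESETTINGS cookie recorded with search -b-: its last field looks like the four chosen language codes run together (en, fr, de, es), and the 30 may well be the nbq=30 from the URL. Both readings are guesses from a single sample, as is the idea that the fields are simply colon-separated. Under those assumptions, a small Python sketch to take such a value apart:

```python
def parse_rage_settings(value):
    """Split an AV_RAGESETTINGS value into its colon-separated fields.

    Assumption (from one observed cookie only): the last field is a run of
    two-letter language codes. The meaning of the other fields is unknown.
    """
    fields = value.split(":")
    blob = fields[-1]
    languages = [blob[i:i + 2] for i in range(0, len(blob), 2)]
    return {
        "version": fields[0],
        "other_fields": fields[1:-1],
        "languages": languages,
    }


settings = parse_rage_settings("v1:30:3:1::enfrdees")
```

If the guess is right, comparing the cookie after each change on the customizing page would show which field belongs to which option.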

You haven't seen all (a) 1702 (b) 5438 nor (d) 2,385 pages.

How are they alike? How are they different?

-e- Ordinary ad-filled search, where &kl=xx 22II%22+%2BAdolf&kl=XX&stype=stext
1,702 pages found. // word count: Adolf: 387514; Gustav: 461825; II: 33305000

Comparing -e- with -a- I conclude: Statistics for results where &kl=xx have not changed between what the old WAS doing and the new IS doing.

Let's pick a really esoteric international subject. Perhaps something brand new, because I want to see all the found items to figure out what &stype=stext WAS doing. Maybe that will give a clue as to what &Translate=on IS doing.

You say there WAS a difference between .es and .en so, let's add .xx to those two in a reduced field search and compare our three results.

&kl=es q=%2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=es&stype=stext
2 pages found // word count: Bilbao: 270771; Adolf: 387514; Gustav: 461825; II: 33305000

&kl=en q=%2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=en&stype=stext
13 pages found // word count: Bilbao: 270771; Adolf: 387514; Gustav: 461825; II: 33305000

&kl=xx q=%2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=xx&stype=stext
6 pages found. // word count: Bilbao: 270771; Adolf: 387514; Gustav: 461825; II: 33305000

The phenomenon persists. We are getting more pages with &kl=en (13) than with &kl=es (2) or &kl=xx (6).

And the pages are (trombone fanfare!):

Not the same.

If you are drawing Venn diagrams, none of the three circles overlap. There are no intersections.
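For anyone who wants to redo the Venn diagram exercise: save the result URLs of each search in a set and let the machine do the intersections. A Python sketch follows. The page identifiers in it are placeholders generated only so the set sizes match the 2, 13 and 6 hits above; they are NOT the real result lists, which would have to be pasted in by hand.

```python
from itertools import combinations


def pairwise_overlaps(result_sets):
    """Return the intersection of every pair of named result sets."""
    return {
        (a, b): result_sets[a] & result_sets[b]
        for a, b in combinations(sorted(result_sets), 2)
    }


# Placeholder identifiers standing in for the URLs each search returned.
results = {
    "es": {f"es-page-{i}" for i in range(2)},
    "en": {f"en-page-{i}" for i in range(13)},
    "xx": {f"xx-page-{i}" for i in range(6)},
}

overlaps = pairwise_overlaps(results)
no_intersections = all(len(shared) == 0 for shared in overlaps.values())
```

With the real URLs in place of the placeholders, no_intersections being True would confirm the "three circles, no overlap" picture.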

Whatever [any language] (&kl=xx) means, it doesn't mean all the languages in the list OR (see ORs)

-the tardy student theory-
OR there is a hidden time limit, and three different hash tables, and the &kl=xx is the biggest hash table, so the finder doesn't get through it in time to report all there is to find...

-the superfluity theory-
OR More is less.
+Gustav +II +Adolf (2385)
+Gustav +II +Adolf +Bilbao (13)


&kl=XX v+%2B%22II%22+%2BAdolf+%2BBilbao&kl=XX
35 pages found //word count: Bilbao: 270771; Adolf: 387514; Gustav: 461825; II: 33305000

There's a difference between &kl=xx and &kl=XX.

XX does have intersection.

Delving deeper (by ~S~ Humphrey P.)

"I" quit quite too soon in our analysis of AltaVista's new search interface.

For, by itself, choosing between using kl=XX or kl=xx doesn't answer your original question about why kl=en [English] should find more entries than kl=XX [any language].

For, that is what you had proven with your example.

You had used kl=XX (big XX = inclusive OR?) and you got more kl=en [English] than kl=XX [any language] with the +Gustav +"II" +Adolf query.

You had not used kl=xx (little xx = exclusive OR?). I stumbled upon that. I made that mistake. And I got fewer kl=xx than kl=en, but more kl=XX than kl=en with the +Gustav +"II" +Adolf +Bilbao query.

Let's see. I have the tabulations. I tried it again, the next day, when I was awake, and could see I hadn't proven anything.

A few days later so statistics are a little bit different, but:

+Gustav +"II" +Adolf
word count: Adolf: 374906; Gustav: 462989; II: 33338616

&kl=es q=%2BGustav+%2B%22II%22+%2BAdolf&kl=es&stype=stext
27 pages found.

&kl=en BGustav+%2B%22II%22+%2BAdolf&kl=en&stype=stext
2,384 pages found

&kl=xx &q=%2BGustav+%2B%22II%22+%2BAdolf&kl=xx&stype=stext
836 pages found.

&kl=XX BGustav+%2B%22II%22+%2BAdolf&kl=XX&stype=stext
1,702 pages found.

+Gustav +"II" +Adolf +Bilbao
word count: Bilbao: 251970; Adolf: 374906; Gustav: 462989; II: 33338616

&kl=es q=%2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=es&stype=stext
2 pages found

&kl=en 2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=en&stype=stext
13 pages found.

&kl=xx %2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=xx&stype=stext
6 pages found

&kl=XX 2BGustav+%2B%22II%22+%2BAdolf+%2BBilbao&kl=XX&stype=stext
35 pages found.
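Typing eight of these query strings by hand is how kl=xx and kl=XX got mixed up in the first place, so here is a Python sketch that generates them. It only builds the part after the "?", using the three parameter names seen above (q, kl, stype). The function names are made up for the illustration, and nothing here talks to a server.

```python
from urllib.parse import urlencode

QUERIES = ['+Gustav +"II" +Adolf', '+Gustav +"II" +Adolf +Bilbao']
LANGUAGES = ["es", "en", "xx", "XX"]  # note: xx and XX behaved differently


def build_query(terms, kl, stype="stext"):
    """Return the URL-encoded query string for one search variant."""
    return urlencode({"q": terms, "kl": kl, "stype": stype})


def all_variants():
    """Every query combined with every language code, in a fixed order."""
    return [build_query(terms, kl) for terms in QUERIES for kl in LANGUAGES]


bilbao_es = build_query('+Gustav +"II" +Adolf +Bilbao', "es")
```

The encoding comes out exactly as in the tabulation: "+" becomes %2B, the quote becomes %22, and a space becomes "+".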


Ok, now, our observations are these:

The old didn't allow several languages at a time, but took the last one in the list.

with the +Gustav +"II" +Adolf query,
&kl=es 27, [&kl=xx 836,] &kl=XX 1702, &kl=en 2384

It's odd that there are more English (en) than AnyLanguage (XX).

+Gustav +"II" +Adolf +Bilbao query.
&kl=es 2, &kl=xx 6, &kl=en 13, &kl=XX 35

There are fewer &kl=xx than &kl=en, but more &kl=XX than &kl=en. Apparently XX means intersection (inclusive vs exclusive OR)

-4- (Today)
However, in our first set of facts, we don't see how the XX (1702) included all the EN (2384)


Did I get the facts stated correctly, this time?


All we've managed to do, so far, is to watch the shadow of
"" from a distance on different terrain at two different times of day. We're not close to making a sundial, yet.

What you really want, is to have a look at the source.
Well, how big is it, and what is it called?

And we are assuming there is a difference in terrain? That "" is really different from "" -?- We just want to know how it's different.


I should learn more about CGI, instead of just guessing, don't you think? Although CGI is not a language, but rather a way of doing things, there should be constraints.

Let's see, what do I want to find.

There's a page of parameters somewhere for AltaVista...
kl means language, ...

Some time ago, we were looking for original development and doctoral theses and white papers and user's manuals, but "everybody" got bored with that... I'd bet the original database design hasn't changed much from then. Might shut up and go looking again. Berkeley, wasn't it. Then

jeff had a section on .CGI didn't he? On his links page:

The Cgi Resource Index
Cgi scripts
Learn CGI today

Two resource sites, and a tut.

Ok, I can add a simpler intro tut to that. Adam Stanislav's CGI Programming Tutorial
"Each pair is separated by an ampersand (&). Please note I said separated, not terminated. There is no ampersand after the last pair."

but my find doesn't go very far. Just takes away a few of my illusions.

"The trick is in realizing that, despite appearances, the URL has nothing to do with the command line of the program.
How, then, do you get to the data from within your program? You read it from the environment variable QUERY_STRING."

Hmmm. Has Mammon mastered CGI and explained it somewhere?

[+mammon +cgi]

(that was foolish! Cardinal Newman. what did he know ...)

[ +mammon +cgi]

Hmmm. Java is right down there with Voting for Bush.

EarthWeb Discussions: earthweb.cgi.general newsgroup
Subject: Want to learn CGI programming
Webmonkey
Web developer's Virtual Library: Encyclopedia of Web Design Tutorials, Articles and Discussions
CGI: The Common Gateway Interface for Server-side Processing by Alan Richmond

Yahoo! Home > Computers and Internet > Internet > World Wide Web > CGI - Common Gateway Interface
The Common Gateway Interface
How to use your CGI-BIN
"With so many free CGI scripts available (including Bestdam Logger Lite)..."

Hmmm. Do you suppose someone has liberated "query?" and posted it somewhere? Can you use an altavista search engine on your own pages? Who has done that and then gone fishing and is just sitting there, like a wrecked ship?


Let's see. What do I want to know:

Is it always a script? Nahh. Could be Perl. Could be C. Could be any language. Anything to take information from the "QUERY_STRING" or from STDIN, and send it to STDOUT.
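To make the "QUERY_STRING or STDIN in, STDOUT out" idea concrete, here is a minimal CGI-style sketch in Python. It says nothing about how AltaVista's "query" is written; it only shows the general convention (GET data arrives in the QUERY_STRING environment variable, POST data on standard input with its length in CONTENT_LENGTH). The function names are invented, and the demonstration at the bottom fakes the environment instead of running under a web server.

```python
import io
from urllib.parse import parse_qs


def read_form(environ, stdin):
    """Return the submitted fields as a dict of lists, for GET or POST."""
    method = environ.get("REQUEST_METHOD", "GET").upper()
    if method == "POST":
        # POST: the body arrives on standard input (simplified: text stream).
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: everything after the "?" is handed over in QUERY_STRING.
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw)


def main(environ, stdin, stdout):
    """Echo the received fields; a real program would run the search here."""
    form = read_form(environ, stdin)
    stdout.write("Content-Type: text/plain\r\n\r\n")
    for name, values in form.items():
        stdout.write(f"{name} = {values}\n")


# Offline demonstration. Under a real server this would be
# main(os.environ, sys.stdin, sys.stdout).
demo_out = io.StringIO()
main({"REQUEST_METHOD": "GET", "QUERY_STRING": "q=%2BGustav&kl=en"},
     io.StringIO(""), demo_out)
```

Note that parse_qs keeps every value of a repeated name, so a repeated kl would arrive as a list; what the program then does with that list is its own decision.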

What am I going to find in /cgi-bin/ -?- A program? Nothing? (Everybody will want theirs to be named "query.")

Even if I know its name, will I be able to download it from /cgi-bin/ -?- (Maybe the subdirectory is protected. Maybe the actual program is named something different. Maybe sending the URL activates the program, rather than selecting it for download. Is GET or POST always implied?)

How do you put it there? Or is it even there? Maybe you just tell the sysadmin about it, and he tells the server that when you see an URL of you look up "query" in a table and locate whatever program is associated, wherever you told him it was.

- That when you see an URL of you find a different entry in a table, and associate a different program with it...
OR, ragingsearch sends the URL to a separate server...
OR, ragingsearch is read out of the URL and the first simple change at the top of the program "query?" branches to the "ragingsearch" variations, wherever.

- The last one is most likely. They wouldn't penalize you for trying the new thing by confining you to one server. Besides, backup and mirroring is complicated enough, without inventing exceptions.

Hmmm. I don't know what it feels like. It will feel differently, when I know something.

Or, can you give me a Zen dope slap, and fix my attention upon the truth in a minute?

Do it!

From France (by Rumsteack)


Some contributions (may not be very useful, but, at least, I've tried)....

http://ra (2,127)
http:// (31,548)

What a difference ! But do you see where I want to go ? Why have I chosen such an uninteresting query ? The huge number of results !

Scooter's manual, page 32 : "A new option on the avs_search (timeout) and a new api call (avs_timer) allows multi-threaded applications to enforce maximum query processing times."

A few lines down in the manual, it's explained that the administrator can define the applications which have a timer, and which ones do not have one.

I think you've come to the same conclusion as I have (which may be wrong, of course, but which should be logical) : a multi-language query (xx), which would necessarily take more time to execute than a specific-language query, will have a timer set. On the contrary, a specific one won't ..... (errr.. I'm not sure about this one, nevertheless)
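A toy model of this timer hypothesis, in Python. Everything in it is invented for the illustration: the index is a list of numbered documents, the "timer" is a budget of documents the engine may look at (deterministic, unlike a real clock), and only the any-language search gets a budget. It is not based on the avs_search API beyond the idea of a processing limit.

```python
def count_hits(index, matches, budget=None):
    """Scan index in order and count matching documents.

    If budget is given, stop after looking at that many documents,
    whether or not anything was found: a stand-in for a query timer.
    """
    hits = 0
    for scanned, doc in enumerate(index):
        if budget is not None and scanned >= budget:
            break
        if matches(doc):
            hits += 1
    return hits


# Hypothetical index: 1000 documents, even ids are "English", odd ids "German".
full_index = [(i, "en" if i % 2 == 0 else "de") for i in range(1000)]
english_index = [doc for doc in full_index if doc[1] == "en"]


def is_match(doc):
    return doc[0] % 5 == 0  # every fifth document matches the query


any_language_timed = count_hits(full_index, is_match, budget=400)
english_untimed = count_hits(english_index, is_match)
any_language_complete = count_hits(full_index, is_match)
```

In this toy, the timed any-language search reports 80 hits and the untimed English search 100, although 200 documents match in total: "English" beats "any language" simply because the bigger scan is cut short.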

Ok, we now know that a multi-language query has a timer.... BUT, if there are few results, we can assume that the timer won't be reached... Error !
Search (2)
earch=Search&KL=en (70)

Haha ! Do you imagine that when searching for special material you only see 2 results, although there are at least 70 !!!! So, provided that Altavista detects a multi-language query, the timer is set off, and ALWAYS reached.. Too bad for us : we do not have access to all the databases with a multi-language search.

Another example : (0 !!!!!!!) L=en (2)

I know the results have nothing to do with the subject of the search, but I don't care.... You see here that a (so-called) no-results query could actually have some. Terrifying !

As a conclusion : ALWAYS make single language queries !

One thing is still confused in my mind, nevertheless : why should the timer be reached, if there are few results ?????? Maybe people at Altavista don't want their servers to be overloaded... But I doubt this... I'm tired.

Rumsteack (from France, of course)

Some more data (by Gregor Samsa)

I checked the number of hits for each language separately:

search A was :
[+Gustav +"II" +Adolf]
(word count: Adolf: 365276; Gustav: 446836; II: 31069265)

search B was:
[+Gustav +"II" +Adolf +Bilbao]
(word count: Bilbao: 251970; Adolf: 365276; Gustav: 446836; II: 31069265)

language             A       B

&KL=xx             836       4
&kl=xx             836       4

&KL=XX               0       0
&kl=XX            1702      27

Chinese (zh)         2       0
Czech (cs)          87       0
Danish (da)         34       0
Dutch (nl)          41       1
English (en)      2384      13
Estonian (et)       43       0
Finnish (fi)        28       0
French (fr)         54       1
German (de)       2946       8
Greek (el)           1       0
Hebrew (he)          0       0
Hungarian (hu)      19       0
Icelandic (is)       1       0
Italian (it)        38       0
Japanese (ja)       11       0
Korean (ko)          4       0
Latvian (lv)         0       0
Lithuanian (lt)      0       0
Norwegian (no)      30       0
Polish (pl)         13       0
Portuguese (pt)     12       0
Romanian (ro)        3       0
Russian (ru)         4       0
Spanish (es)        27       2
Swedish (sv)      1426       4

all checked       6559      27

(BTW, I tried all those &kl=/&KL= combinations with yy and YY. No result)
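A quick cross-check of column A, as a Python sketch: add up the hits of the 25 languages taken one at a time and compare with the "all checked" figure. The numbers are copied from the table; the variable names are just for the illustration.

```python
# Hits for search A, one language at a time (copied from the table).
HITS_A = {
    "zh": 2, "cs": 87, "da": 34, "nl": 41, "en": 2384, "et": 43, "fi": 28,
    "fr": 54, "de": 2946, "el": 1, "he": 0, "hu": 19, "is": 1, "it": 38,
    "ja": 11, "ko": 4, "lv": 0, "lt": 0, "no": 30, "pl": 13, "pt": 12,
    "ro": 3, "ru": 4, "es": 27, "sv": 1426,
}
ALL_CHECKED_A = 6559   # every language box ticked at once
ANY_LANGUAGE_A = 1702  # &kl=XX

sum_of_singles = sum(HITS_A.values())
missing_when_all_checked = sum_of_singles - ALL_CHECKED_A
```

The single-language searches add up to more than "all checked", and far more than "any language", so neither of the combined figures is a plain sum of the parts.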


I assumed there was a limit somewhere (timeout or similar).
Did not find anything about that, but realized a pattern nevertheless:

Testing with [+Gustav +"II" +Adolf], comparing the expected number of hits with the actual one:

lang. => hits (diff.)

el + hu => 20 (0)
pl + pt => 23 (-2)
ro + ru => 7 (0)
ro + pt => 14 (-1)
ro + pl => 15 (-1)
da + fr => 76 (-12)
fi + fr => 70 (-12)
fi + da => 58 (-4)
de + lv => 2551 (-395)
pl + lv => 12 (-1)
pt + lv => 11 (-1)
en + lv => 2099 (-285)
en + de => 4650 (-680)

There's a system. Do you see it ?

Every language seems to lose a certain number of hits (compared to when it is used as only language parameter) when combined with another language. I first suspected this, when there were two hits less for [pt + pl] than I had expected, and in combination with [ro] each of these languages seemed to have one hit less.

I tried a language without any hits (lv) and combined it with both German and English. [de + lv] showed 395 results less than I had expected, [en + lv] 285 less. If I was right, [en + de] should show 285 + 395 = 680 results less than the sum of each language alone. Indeed !
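The same check, written out as a Python sketch so that it can be repeated with other pairs. The figures are the ones listed above; the helper name is invented. Since lv has no hits of its own, the whole shortfall of a pair with lv is attributed to the other language.

```python
# Hits when each language is the only language parameter.
SINGLE = {
    "el": 1, "hu": 19, "pl": 13, "pt": 12, "ro": 3, "ru": 4,
    "da": 34, "fr": 54, "fi": 28, "de": 2946, "en": 2384, "lv": 0,
}

# Hits actually reported when two languages are combined.
COMBINED = {
    ("el", "hu"): 20, ("pl", "pt"): 23, ("ro", "ru"): 7, ("ro", "pt"): 14,
    ("ro", "pl"): 15, ("da", "fr"): 76, ("fi", "fr"): 70, ("fi", "da"): 58,
    ("de", "lv"): 2551, ("pl", "lv"): 12, ("pt", "lv"): 11,
    ("en", "lv"): 2099, ("en", "de"): 4650,
}


def shortfall(pair):
    """Expected hits (sum of the two singles) minus the hits reported."""
    first, second = pair
    return SINGLE[first] + SINGLE[second] - COMBINED[pair]


loss_en = shortfall(("en", "lv"))   # loss attributed to English alone
loss_de = shortfall(("de", "lv"))   # loss attributed to German alone
prediction_holds = shortfall(("en", "de")) == loss_en + loss_de
```

The smaller pairs fit the same additive picture: pl and pt each lose 1 on their own (the lv pairs), and together they lose 2.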


I try not to draw conclusions too early. I really do not know how this CGI works or what their database is like (THAT would be an interesting point !)
I don't know if you can use something like Oracle with 350.000.000 records and still expect a reasonable response time for your SQL selects. Well, in the end I'm back at the beginning: Ich weiß, dass ich nichts weiß (I know that I know nothing)

I stop here and go to bed. It seems to be a good idea not to work long nights on such things - one forgets about cookies too easily... ;-)


Oh, I get it (finally) (by ~S~ Humphrey P.)

Oh, I get it. (finally.)

If you were to design a search engine, you would let the advanced search go longer, or try harder, but cut the simple searcher off sooner.

You are assuming that the simple searcher doesn't want to wait, or doesn't care about the last few million 'Treffe' the 'pinball' search engine makes?

The problem with stopping before searching the whole database, is that the newest item might be the last one in the database, and you'd come away thinking that AltaVista never updated their database...

or, the one which best matches your string of keywords, might be in the part of the database the search engine never got to.

It would be better to presort, to preindex, to presearch, to run on the fastest available machinery, anything you could do to optimize searching, in order to make the whole database available to both the simple searcher and the one who customized his search.

But, the presentation is very important, and very different... In the simple search, you are making assumptions for the searcher. Call them defaults, call them design flaws, call them public relations...

You've been to a wonderful site before, haven't you? You know they have something, but the site has so many pages, and no sitemap, so you decide to use their search engine. And you give it a two or three word search, and the search engine says: 'nothing found.' You know it is there. But you don't know the one word which would find it.

I've written some of those bafflers myself - design flaws.
jeff's lookime for fravia was like that... I couldn't quite get it to give me the results I wanted. For instance, trying to find 'java,' but not 'java script.'
Let's not pick on jeff. He grabbed what he could. There are lots of other conundrum site search engines at Big companies.

You get way too many hits with one word. And with two words, you either get no hits, as though it had to be a phrase (tight AND)

Or else the simple engine assumes the two words don't need to have anything in common, nor refer to the same topic, but just appear on the same page (lazy OR.)

Or else the 'lucky' way of using the engine is there, and 'natural' to the programmer, but not to anyone else. (That's the kind I write.)
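The first two behaviours, plus the plain "any word will do" OR for comparison, are easy to tell apart in a Python sketch. The three sample pages are invented, and the matching is naive whitespace splitting, nothing like a real engine's tokenizer.

```python
def words(text):
    return text.lower().split()


def match_phrase(query, page):
    """'Tight AND': the query words must appear together, in order."""
    q, p = words(query), words(page)
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))


def match_same_page(query, page):
    """All words somewhere on the page, however unrelated (the 'lazy' kind)."""
    on_page = set(words(page))
    return all(w in on_page for w in words(query))


def match_any(query, page):
    """Plain OR: a single query word is enough."""
    on_page = set(words(page))
    return any(w in on_page for w in words(query))


# Hypothetical pages for the illustration.
PAGES = [
    "java script tutorial for beginners",
    "java virtual machine internals",
    "a script about coffee from java",
]

phrase_hits = [p for p in PAGES if match_phrase("java script", p)]
same_page_hits = [p for p in PAGES if match_same_page("java script", p)]
any_hits = [p for p in PAGES if match_any("java script", p)]
```

A site search that silently picks one of these without saying which is exactly the baffler described above: one, two or three hits for the same two words.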

AltaVista, after scanning its 'billions and billions,' will put the entries in the order of 'most found first,' ... 'least found last.'


Google, since you brought it up, is running a popularity contest, and putting up front those sites which others most often refer to. And besides that, they consult a WWW directory.
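The two orderings can be caricatured in a few lines of Python. This is a deliberately crude sketch on three invented pages: "most found first" is modelled as counting how often the word occurs, and the "popularity contest" as counting inbound links. Neither is claimed to be what AltaVista or Google actually computes.

```python
# Hypothetical mini-web: page name -> its text and the pages it links to.
PAGES = {
    "a": {"text": "search search search engine", "links": ["b"]},
    "b": {"text": "search engine", "links": []},
    "c": {"text": "engine notes search", "links": ["b"]},
}


def rank_by_term_count(pages, term):
    """'Most found first': order pages by how often the term occurs."""
    def occurrences(name):
        return pages[name]["text"].lower().split().count(term)
    return sorted(pages, key=lambda name: (-occurrences(name), name))


def rank_by_inbound_links(pages):
    """'Popularity contest': order pages by how many other pages link to them."""
    inbound = {name: 0 for name in pages}
    for page in pages.values():
        for target in page["links"]:
            inbound[target] += 1
    return sorted(pages, key=lambda name: (-inbound[name], name))


by_count = rank_by_term_count(PAGES, "search")
by_links = rank_by_inbound_links(PAGES)
```

Page "a" wins the counting contest by repeating the word; page "b" wins the popularity contest because the others point at it.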

(By the way, you should be able to improve your site's 'popularity' by writing your section of that web directory.)

Here's a query on [search engine optimization] which uses that directory: (yes, of course there's more here than I am letting on, but skip it for now. ;) (Just notice that you can't tell where Netscape ends and Google begins... Nor later where Netscape Open Directory ends and Open Directory Project begins.)

Web Site Categories
1 - 3 of 3
Groups of reviewed web sites related to your search term.

Reviewed Web Sites
1 - 10 of 94
Web sites reviewed and categorized by a team of editors.

Get Involved
Help build the largest human-edited directory on the web.
Become an Editor
Suggest a site
Give Feedback

The first one whisks you off to:

"If you would like to help build Netscape Open Directory, it's easy to apply:"

and sends you back to the front door: Netscape Search.

Where we see a very nice presentation of the Netscape Directory.

Computers > Internet > WWW (or Web) > Searching the Web

"Searching the Web by the Open Directory Project."
(Amazing what Netscape has started, isn't it?)

"Discussion, help and tutorials, comparisons, integrated search pages, mailing lists and newsletters, submitting and positioning."

Somewhere in the 'Submitting and Positioning,' or in the 'Search Engine Optimization,' somebody has got them all figured out already... every search engine's strengths and weaknesses, each one's strategies, their peculiarities...

Think of submitting and positioning as Reverse Engineering of Search Engines. _Submitting_and_Positioning/Comparisons_and_Discussions

(And there is Search Engine Watch, top of the list. Editors choice.)


Well, I'm not saying that Open Directory Project has considered the whole web. Nor that ODP is better than Yahoo, (it's not.) But just that Google uses this Open Directory to form an 'opinion' about what is important, and what's popular. And it could be your opinion which they are considering. After that, they have their own database.


Now, where in this thoughtful process would you interrupt Google, and say, "I want your 5 second opinion, not your 20 second one" -?-

How could you fool MetaCrawler into thinking that's what you were giving them?

How to preprocess a query you had never seen before?

(Oh, there's one. Keep a history of queries you HAD seen before. Select one which looks similar.)
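The "history of queries" idea, as a Python sketch. It uses the standard library's difflib to pick the stored query that looks most like the new one; the history contents, the cutoff value and the function name are all invented for the illustration, and string similarity is of course only one possible meaning of "looks similar".

```python
import difflib


def nearest_seen_query(query, history, cutoff=0.6):
    """Return the most similar previously seen query, or None if none is close."""
    matches = difflib.get_close_matches(query, list(history), n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical history: past query -> whatever was cached for it.
HISTORY = {
    '+gustav +"ii" +adolf': "cached result list for the Gustav II Adolf query",
}

reused = nearest_seen_query('+gustav +"ii" +adolf +bilbao', HISTORY)
unknown = nearest_seen_query("zen dope slap", HISTORY)
```

The Bilbao query would be answered from the shorter query's cache, which is quick but would hand back pages that never mention Bilbao: instant gratification, not the right answer.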

How to stop your ranking algorithm in the middle? That's kind of like stopping in the middle of a sort, isn't it?

(Oh. there's another one. If you thought you were going to be interrupted, you might be doing a little presentation processing right along with the sorting. "Well, I didn't get it all sorted, but here's what I've got sorted so far.")
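"Here's what I've got sorted so far" is what a heap gives you naturally: the best items come out one at a time, and you can stop whenever you like. A Python sketch with invented scores follows; the step limit stands in for the interruption, and nothing here is known to be how any real engine ranks.

```python
import heapq


def best_so_far(scored_docs, steps):
    """Return the top results found within `steps` pops, plus the unsorted rest.

    scored_docs is a list of (doc, score) pairs; higher scores rank first.
    """
    # heapq is a min-heap, so scores are negated to pop the best first.
    heap = [(-score, doc) for doc, score in scored_docs]
    heapq.heapify(heap)
    ranked = []
    while heap and len(ranked) < steps:
        neg_score, doc = heapq.heappop(heap)
        ranked.append((doc, -neg_score))
    return ranked, len(heap)


# Hypothetical scored documents.
DOCS = [("a", 3), ("b", 9), ("c", 5), ("d", 1)]

top_two, left_unsorted = best_so_far(DOCS, steps=2)
```

Interrupted after two steps, the searcher still gets the two best documents in the right order; only the tail is missing.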

Just remember, the client is always right.
If he wants instant gratification, give him something.
A rattle, a hug, the first card off the bottom of the deck... something someone else has looked up before... what he asked for.

You might keep processing along his frame of mind, in anticipation that he will ask a further question. You might keep processing, but with diminishing expectations that he will ask a further question. If you can, keep his answered question in short term memory, so you can pick up your search from there.


Gahhh! This is all theoretical...

How do they do it? How do they do it?

The maggots know.

(they put 'fravia' in every erotic keyword string... no. - different topic ;-)


Tell me what you know about Google.

Still quite in fieri, I'm afraid... what about contributing with your own suggestions and observations?

(c) 2000: [fravia+], all rights reserved