- for exclusion (but NOT if using booleans), " " for exact match
( ) for nesting, AND NOT for boolean exclusion
no case sensitivity whatsoever
Now with the VERY important NEAR operator (November 2006)
[Once Yahoo had only 677 results viewable, now the SERPs stop at 1000]
For info on Yahoo's (Inktomi's) rich syntax, see Nemo's essay (September 2005)
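The operators listed above can be sketched as a tiny query builder. This is only a minimal illustration, not anything Yahoo publishes: the function names are hypothetical, and the OR keyword in the boolean form is an assumption (the notes above name only AND NOT explicitly).

```python
def quote_phrase(phrase):
    """Wrap a phrase in double quotes for an exact match."""
    return f'"{phrase}"'


def simple_query(terms, phrases=(), excluded=()):
    """Plain (non-boolean) form: a leading '-' marks an excluded term."""
    parts = list(terms)
    parts += [quote_phrase(p) for p in phrases]
    parts += [f"-{t}" for t in excluded]
    return " ".join(parts)


def boolean_query(any_of, excluded=()):
    """Boolean form: ( ) for nesting, AND NOT instead of '-'.

    The OR keyword is an assumption made for this sketch.
    """
    query = "(" + " OR ".join(any_of) + ")"
    for term in excluded:
        query += f" AND NOT {term}"
    return query


print(simple_query(["seeker"], phrases=["search engine"], excluded=["spam"]))
print(boolean_query(["yahoo", "inktomi"], excluded=["shopping"]))
```

The two forms are kept apart on purpose: as noted above, the "-" exclusion should not be mixed with boolean operators, so the boolean builder uses AND NOT instead.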
Yahoo recognized the tragic mistake of going commercial and
went 'back to basics' in late 2002 (better late than never). It
seems to be gaining momentum as part of the inktomi factories :-)
Note that yahoo recently bought the wondrous fast/alltheweb search engine (and promptly killed it :-( )
Yahoo is now one of the three "big players" (google, MSN and
Yahoo) and claimed, at the beginning of September 2005, to
have indexed 19 billion sites (against google's 8 billion). A few weeks later Google claimed 25 billion docs (against Yahoo's 20 billion).
Since the Web runs around 500 billion docs (and growing) the 'race' is rather pointless :-)
(More on google's ad hoc section)
The educated seeker should however never forget
that yahoo, ms-search and
ask (to cite just the three most important search engines after google)
have their own indexes (yahoo's index is in fact bigger than google's) and their own search algos, which may be better suited than google's for
specific searches. For instance, when searching for files inside file sharing repositories you should
use yahoo, not google. Another typical case where yahoo beats google black and white :-) is when searching images
in black and white: Compare
yahoo's b&w option!
Suffice to say that casual searchers and assorted web low-life use google exclusively, while the educated seeker knows
when to use all the different search engines.
To find our targets, to dig those "scattered gems in the starry web-firmament" we crave and love,
we use many search engines
like "feathered multicolored arrows in our seekers' quivers", knowing that each one of them is
apt to hit better, or more prone to miss, some specific targets.
Even longer answer: it is, after all, "A Question of Relevancy
Search engine users are typically most interested in the items returned on the first page of the search results.
It's not often that users dig down into the ninth or tenth pages, because it takes too long and those results are simply not as relevant.
Given this fact, it seems appropriate for a ranking algorithm to spend most of its modeling efforts getting the topmost items right.
Though the current algorithms used by Yahoo! do a very good job at determining the relevance of a web page for a particular query,
there is always room for improvement. That's why Olivier Chapelle, a senior research scientist at Yahoo!,
has spent the last several months trying to boost the ranking quality.
His work is based on the machine learning framework of structured output learning, where the input corresponds to a set
of documents and the output is a ranking. This approach is different from the regression model commonly used in current search engines.
In essence, the framework of structured output learning provides a new opportunity: instead of viewing the outputs of the documents
in an independent fashion, they are now coupled together in order to optimize the performance measure.
Why is this potentially a better approach? Well, by considering the documents independently, the regression model is
unaware of the global ranking. By contrast, structured output learning can find a rule that produces a better overall
ranking by taking into account all the documents associated with a query.
Indeed, in early tests, Chapelle's algorithm has yielded 3 to 4 percent improved accuracy rates on several public and
commercial ranking datasets. These results are featured in a paper entitled
Large margin optimization of ranking measures,
by Chapelle and his co-author.
Though excited by his initial results, Chapelle admits there is still a long way to go. "The next step is to do more systematic
experiments to validate the usefulness of this method" he says.
And if everything goes according to plan? "The long-term hope is that eventually this model
will be put in production and used for all searches on Yahoo!," says Chapelle".
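The contrast drawn in the quoted passage, between scoring each document independently and judging the ranking as a whole, can be illustrated with a toy example. This is emphatically not Chapelle's algorithm: it is a minimal sketch, with invented relevance labels and scores, showing that a model with a smaller per-document squared error can still produce a worse ranking. NDCG is used here merely as one common ranking measure; the quoted text does not say which measure is optimized.

```python
import math


def mse(scores, labels):
    """Mean squared error, computed document by document."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def dcg(labels_in_rank_order):
    """Discounted cumulative gain: items near the top count the most."""
    return sum((2 ** y - 1) / math.log2(rank + 2)
               for rank, y in enumerate(labels_in_rank_order))


def ndcg(scores, labels):
    """DCG of the ranking induced by the scores, divided by the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    return dcg(ranked_labels) / dcg(sorted(labels, reverse=True))


# Hypothetical relevance labels for three documents of one query.
labels = [2, 1, 0]

# Model A is close to the labels but swaps the two best documents.
scores_a = [1.1, 1.2, 0.0]
# Model B is far off in absolute terms but orders the documents correctly.
scores_b = [4.0, 3.0, 2.0]

for name, scores in (("A", scores_a), ("B", scores_b)):
    print(name, "mse =", round(mse(scores, labels), 3),
          "ndcg =", round(ndcg(scores, labels), 3))
```

Model A wins on squared error and loses on the ranking measure, which is the kind of mismatch the passage attributes to treating documents independently.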
Note that the problem of the (purportedly) "best top items" inside the SERP's first pages (which in fact often are among the worst because of heavy spamming)
can still be solved "artisanally" using the old
"yo yo" searching trick.
Yahoo has its own syntax as well
Hooo, Yahoo too
Of course we are not limited to google. Each search engine has its own quirks, and
Yahoo has its own syntax as well:
"My index's bigger than yours, nah, nah, nah, nah"
I have presented these data - that you won't find elsewhere on the web - at my last Helsinki conference.
They demonstrate that - for the main search engines - index size is only loosely (and perhaps inversely :-)
related to the quality of results returned.
In August 2005 Yahoo suddenly announced that it had indexed 19 Billion (milliards) documents.
Clearly an attempt to dwarf Google's famous "8 Billion" (Milliards) sites.
Alas! No wonder that the results of (almost) any test search you may launch remain in Google's favor:
as the following data prove, the biggest increase in Yahoo's results seems to have been in "frills" domains.
For instance Yahoo now indexes 9.560.000.000
"com" domain documents, versus the 1.690.000.000
indexed by google. As you can see, the most striking differences between domains are to be found
on crap & frill domains like "com", "info", "net" & "biz".
We can clearly see that the differences are less important for more content-rich domains like "edu", "org", "gov", "mil" and "int".
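The "com" figures quoted above can be turned into a quick back-of-the-envelope comparison. A minimal sketch, using only the numbers given in the text; note the assumption that the per-domain counts and the claimed index totals (19 and 8 billion) refer to the same moment, which the text does not guarantee.

```python
# Figures as quoted in the text (claimed totals, not independently verified).
yahoo_total = 19_000_000_000
google_total = 8_000_000_000
yahoo_com = 9_560_000_000
google_com = 1_690_000_000

# How many times more "com" documents Yahoo reports than google.
com_ratio = yahoo_com / google_com

# Share of each claimed index taken up by "com" documents
# (assumes counts and totals are comparable, see the note above).
yahoo_com_share = yahoo_com / yahoo_total
google_com_share = google_com / google_total

print(f"com ratio yahoo/google: {com_ratio:.2f}")
print(f"com share of yahoo index:  {yahoo_com_share:.1%}")
print(f"com share of google index: {google_com_share:.1%}")
```

Under these assumptions Yahoo reports more than five times as many "com" documents as google, and "com" alone would account for roughly half of its claimed index.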
Here are some graphs:
Note the sad preponderance of ".com" domains among those indexed:
Note the absolute preponderance of those very ".com" domains in Yahoo:
Would anyone in his right mind prefer a search engine that prefers "biz", "info", "net" and "com" domains?