Version 0.33, Updated 21/OCT/2006
underneath the commercial web
"Powersearching without google"
Fravia's talk at the Hack.lu (Luxembourg, Luxembourg - 19-21 October 2006)
This file dwells @ http://www.searchlores.org/Luxembourg_2006.htm
Recall and Precision
Let's analyze some simple music queries
Today's targets (python & proximity galore)
Finding the three targets
Look ma, no google!
Let's search elsewhere
Spamming (& "popularity")
Even more important is the fact that not only "tangible" and "digitized" targets are available to anyone, but also all kinds of solutions are there, at your disposal. And I don't mean just messageboard solutions on -say- how to port a proprietary driver to GNU/Linux.
I mean real concrete solutions:
Also always consider that the main search engines DO NOT overlap too much, and yet that together they cover (at best) only 1/4 of the web. This may be quite significant when deciding your search strategies.
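One rough way to see how little two engines overlap is to compare the result lists they return for the same query. The sketch below is only an illustration: the URL lists are made-up placeholders (not real results), and the names normalize and overlap are hypothetical helpers, not part of any engine's API. It measures the share of URLs that both lists have in common.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip().lower())
    host = parts.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def overlap(results_a, results_b):
    """Return (shared / union, shared set) for two lists of result URLs."""
    a = {normalize(u) for u in results_a}
    b = {normalize(u) for u in results_b}
    union = a | b
    shared = a & b
    score = len(shared) / len(union) if union else 0.0
    return score, shared


# Placeholder result lists, assumed for illustration only.
engine_a = [
    "http://www.example.org/python/",
    "http://example.com/book.pdf",
    "http://example.net/core",
]
engine_b = [
    "http://example.org/python",
    "http://example.edu/chun",
    "http://example.net/other",
]

score, shared = overlap(engine_a, engine_b)
print(f"shared: {sorted(shared)}  overlap: {score:.0%}")
```

A low score for the same query on two engines is a hint that the second engine is worth the extra visit.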
Many clueless zombies consider "searching the web" tantamount to typing one term into google and then hitting enter. Such a simplistic approach is wrong, and not only for the "one-termness" of it. The real problem is that google covers only a small part of the web.
In order to access a bigger part of it you will need to use techniques that range from stalking to social engineering, through trolling and password breaking.
site:
hostname:
link:
linkdomain: (links that point to one domain)
url:
intitle:
inurl: (a specific keyword as part of indexed urls, example: inurl:searching)
intitle & inurl are VERY important parameters... nomen est omen: images giotto5.jpg...
site:
allintitle: (all of the query words in the title)
intitle: (that word in the title)
allinURL: (all of the query words in the URL)
inURL: (that word in the URL)
cache:
link:
related: (pages that are "similar" to a specified web page)
info: (google's info)
Altavista's most important operator:
NEAR (more on this later)
MSN Live's operators:
contains: Restricts results to sites that have links to the file type(s) you specify. For example, to search for websites that contain links to mp3 files, type music contains:mp3.
filetype: Returns only web pages created in the file format you specify. Live Search recognizes html, txt, and pdf extensions. Live Search also recognizes the extensions for primary Office document types. For example, to find reports created in PDF format, type your subject followed by filetype:pdf, as in information filetype:pdf.
inanchor:, inbody:, intitle:, inurl: Return pages that contain the specified term in the anchor, body, title, or web address of the site, respectively. Specify only one term per keyword. You can string multiple keyword entries together as needed. For example, to find pages that contain google in the anchor, and the terms black and blue in the body, type inanchor:google inbody:black inbody:blue.
ip: Finds sites that are hosted by a specific IP address. The IP address must be a dotted quad address. Type the ip: keyword, followed by the IP address of the website. For example, type IP:184.108.40.206.
language: Returns web pages for a specific language. Specify the language code directly after the language: keyword.
link: Finds sites that have links to the specified website or domain. This is useful for determining who links to whom. Do not add a space between link: and the web address. For example, to find pages that contain the word games and that link to searchlores.org, type games link:searchlores.org
linkdomain: Finds sites that link to any page within the specified domain. Use this keyword to determine how many links are being made to a specific page, as well as how those links are made. For example, to see pages that link to searchlores, type linkdomain:searchlores.org.
linkfromdomain: Finds sites that are linked from the specified domain. Use this keyword to determine how many links are being made from a specific page, as well as how those links are made. For example, to see pages that are linked from my site, type linkfromdomain:fravia.com
loc:, location: Return web pages from a specific country or region. Specify the country or region code directly after the loc: keyword. To focus on two or more countries or regions, use a logical OR and group them. For example, "core python" (loc:RU OR loc:CN)
prefer: Adds emphasis on either a word or another operator. For example, type searching prefer:internet
site: Returns web pages that belong to the specified site. To focus on two or more domains, use a logical OR and group the domains. Do not add a space after the colon (:). You can use site search for web domains, top level domains, and directories that are not more than two levels deep. For example, to see web pages about media reporting from the BBC or CNN websites, type "media reporting" (site:bbc.co.uk OR site:cnn.com). You can also search for web pages that contain a specific search word on a site. For example, to find the library pages on searchlores, type site:www.searchlores.org/library
feed: Finds RSS or Atom feeds on a website. For example, to find RSS or Atom feeds about web searching, type feed:"web searching"
hasfeed: Finds web pages that contain an RSS or Atom feed on a website. You can add search words to narrow your search. For example, to find web pages on the Guardian website that contain RSS or Atom feeds about google, type site:www.guardian.co.uk hasfeed:google
url: Checks whether the listed domain or web address is in the Live Search index. Do not add a space between url: and the domain or web address. For example, to verify that searchlores is in the index, type url:searchlores.org
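Since these operators are just text glued onto the query, composing them by script is straightforward. Here is a minimal sketch, assuming only the syntax described above (no space after the colon, logical OR inside parentheses). The helper names quote, any_of and build_query are hypothetical, and the sketch only builds the query string: it does not talk to any search engine. Note that it always groups several values of one operator with OR, so a chained query like inbody:black inbody:blue would have to be assembled differently.

```python
def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as a phrase."""
    return f'"{term}"' if " " in term else term


def any_of(operator, values):
    """Group several values of one operator with OR, e.g. (site:a OR site:b)."""
    parts = [f"{operator}:{value}" for value in values]
    if len(parts) == 1:
        return parts[0]
    return "(" + " OR ".join(parts) + ")"


def build_query(terms, **operators):
    """Join plain search terms and operator groups into one query string."""
    pieces = [quote(term) for term in terms]
    for operator, values in operators.items():
        if isinstance(values, str):
            values = [values]
        pieces.append(any_of(operator, values))
    return " ".join(pieces)


# The two examples from the operator list above, rebuilt by the helper.
q1 = build_query(["media reporting"], site=["bbc.co.uk", "cnn.com"])
q2 = build_query(["information"], filetype="pdf")
print(q1)
print(q2)
```

Keeping the operator name as a keyword argument makes it easy to loop over many domains or file types without retyping the query by hand.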
Most important MSNLive operator:
linkfromdomain: (an outbound links operator)
Go for the format, go for the name, do it like the lamers or search elsewhere
You can search ftp, you can go local, or even better: regional. You can zap IRC channels and explore uncommon search engines
Going "regional" is ALWAYS a very good idea when searching. We have already seen how adding a simple .ru to our queries can help. But why Russia? WHERE should we search? Which are the (how should I say?) "less copyright-obsessed" countries?
See an interesting "piracy subdivision" published this summer by
We may as well use these 'scarecrow' data (produced by US lobbyist Robert Holleyman's "Business Software Alliance" in order to scrape together some money) for our own purposes...
And look! As you can see, Vietnam, Zimbabwe, Indonesia, China, Pakistan, Kazakhstan, Ukraine, Cameroon, Russia, Bolivia, Paraguay and Algeria seem to have a more relaxed attitude towards patent holders. Good to know :-)
Here are the relevant country codes: .vn, .zw, .id, .cn, .pk, .kz, .ua, .cm, .ru, .bo, .py and .dz. We could use these codes to restrict our searches only, or especially, to such relaxed places.
Of course some of these countries are just tiny local niches, with next to no activity and extremely weak signals, and can be ignored: throwing our clever queries at -say- Zimbabwe or Cameroon, we'll probably just be wasting our (or our bots') precious searching time.
Let's say that -in general- .vn (Vietnam), .id (Indonesia), .cn (China), .pk (Pakistan), .ua (Ukraine) and .ru (Russia) look promising enough. We may add -out of our experience- Iran, Korea, Bulgaria and India (.ir, .kr, .bg and .in).
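The shortlist above lends itself to a small loop: one regionally restricted query per country code. The sketch below assumes the loc: syntax shown in the MSN Live operator list (other engines will want a different restriction) and merely prints query strings. The function names and the two list names are hypothetical.

```python
# Country codes taken from the shortlist above.
PROMISING = ["vn", "id", "cn", "pk", "ua", "ru"]
FROM_EXPERIENCE = ["ir", "kr", "bg", "in"]


def regional_queries(phrase, codes):
    """Yield one loc:-restricted query per country code."""
    for code in codes:
        yield f'"{phrase}" loc:{code.upper()}'


def grouped_query(phrase, codes):
    """Build a single query that ORs several country restrictions together."""
    group = " OR ".join(f"loc:{code.upper()}" for code in codes)
    return f'"{phrase}" ({group})'


queries = list(regional_queries("core python", PROMISING + FROM_EXPERIENCE))
combined = grouped_query("core python", ["ru", "cn"])

for query in queries:
    print(query)
print(combined)
```

Running one query per country keeps the weak signals of each niche apart, while the grouped form is handier for a quick first look.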
So let's go local: let's visit China, where we can find, among hundreds of others, for instance this link, which requires just some guessing capacity (or some understanding of Chinese :-)
Of course we should also have a look in Vietnam, in Russia/Ukraine (where we will at once retrieve our Target and as many other programming books as you fancy), and here is how you would search in KOREA or in RUSSIA using MSN Search.
Caveat: this was all just academically speaking, duh. Once again: seekers don't need to download anything from the web, since they can always find their targets again and again if and when needed :-P
Searching through IRC channels and blogs can be -for specific targets- quite useful. However the signal/noise ratio is quite bad on these channels, and therefore IRC-searching and blog-searching are -in many cases- a waste of time compared to more effective searching techniques.
After all, and behind the hype, blogs are just messageboards where only the Author can start a thread, and IRC channels need, in order to be useful, a lot of social engineering.
I'll just direct you to some blog search engines and to some IRC search engines like this one. Nuff said.
At times simply switching to less known (but quite interesting) search engines can cut the mustard.
Here's a related search with kartoo
and here's another search using gigablast.
Finally, since we are speaking of a programming language, we may also have a look at the recent google codesearch:
return lang:python gives 283000 scripts, enough for some serious studying. Same with MSN Search macros.
So we found our targets again and again using a palette of different searching colours. These are all paths that lead into the forest, and you'll be able to find many more on your own. Now let's go back to the theory.
Google alone and you're never done
Using a plethora of methods (cloaking, doorway pages, hidden text, blog-farms, you name it) the SEO beasts ("search engine optimizers" they have the cheek to call themselves) deny everybody the possibility of gathering real knowledge by pushing their crap commercial sites up into the first positions of the SERPs.
For search engines that do not allow any algo fine-tuning, a possible defensive approach is the "yo-yo" approach: jumping from the start onto lower SERPs and then going slowly back up.
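The "yo-yo" boils down to the order in which you visit the result pages. A minimal sketch, under loud assumptions: the starting depth (page 5) and the ten results per page are arbitrary guesses for illustration, and yoyo_order and result_offsets are hypothetical helper names. How an offset is passed to a given engine is left out on purpose, since that differs from engine to engine.

```python
def yoyo_order(start_page, first_page=1):
    """Return SERP page numbers from a deep start page back up to the first."""
    if start_page < first_page:
        raise ValueError("start_page must not be above first_page")
    return list(range(start_page, first_page - 1, -1))


def result_offsets(pages, per_page=10):
    """Translate page numbers into zero-based result offsets."""
    return [(page - 1) * per_page for page in pages]


pages = yoyo_order(5)            # jump down first, then climb slowly back up
offsets = result_offsets(pages)  # assumes 10 results per page

print(pages)
print(offsets)
```

The idea is simply to read the less spammed lower pages first and to reach the heavily "optimized" top positions last.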
Such methods can soon prove even more crucial for Internet searching purposes: while google may not yet be a sinking boat, anyone can see how much water is already leaking through its many spammed holes.
So we have to refine our seeking techniques.
Instead of just using google again and again, every time we begin a search, we should carefully consider how and where we start our searches, delve a little more inside our own specific requirements, and avoid wasting too much time on irrelevant side paths...
Quaeras ut possis, quando non quis ut velis
There are various strategies you can use when searching the web. Some are more relevant for LONG TERM searching, some, on the contrary, for SHORT TERM searching.
But even the various simple techniques we used today (searching for mp3s and books) can and should be used together with the main search engines. On the ever moving web-quicksands it does not make much sense to give a list of links to places where you can "search alternatives". It is better, I believe, to (try to) show directly how to "search alternatively".
Of course there are various important non-commercial databases, like Infomine (http://infomine.ucr.edu), Librarians Internet Index (http://lii.org), The Internet Public Library (http://www.ipl.org/), Resource Discovery Network (http://rdn.ac.uk), Academic Info (http://www.academicinfo.net/), The Front (for journals: http://www.arxiv.org/multi?group=math&%2Ffind=Search) and finally the best one of all: The Open Directory Project (http://dmoz.org).
These are all possible alternatives to a single approach limited to the main search engines.
Yet lists of links are and remain just that: lists of links. Bound to decay into obsolescence.
We have seen some alternative approaches. Practice them on your own subjects and interests. Once you learn how to seek, the world is yours. Cosmic power for free.
Nil perpetuum, pauca diuturna sunt
An easy assignment for this evening (just in order to practice the various techniques explained today, lest you forget everything): find the new 1120-page second edition (September 2006) of our target, Core Python Programming by Wesley J. Chun (Prentice Hall, ISBN: 0132269937).
This search should take you at most 10 minutes if done now (and just a few seconds in a couple of months, when the book has percolated through the Web).
And now I'm finished.
Thank you for your patience. Any questions?
SEARCHING THE PAST (DISAPPEARED SITES)
http://webdev.archive.org/ ~ The 'Wayback' machine at Alexa: explore the Net as it was!
Alternatively, learn how to navigate through [Google's cache]!
Alternatively a new "preservation" project from Webcapture: the International Internet Preservation Consortium is coming along.
A quick tour of the main search engines...
Uhhh.. almost forgot, a small book-searching present for those that solved the assignment (all others shouldn't look):
finding Ubuntu books