by aapje & ritz
This query is used to find so-called "open directories" by searching for certain standard keywords that almost always appear in the server-generated pages of such directories. I believe the first mention of this technique that I read was in the solution to one of the labs [the one about "aqua" and "barbie"]. The trick is explained here for Google, but it also applies to other search engines.
Okay, perhaps a small example for those who have never heard of it. Let's say you want an etext of the book Wizard's First Rule by Terry Goodkind [why? let's say because you were reading the hardcopy, but left it somewhere else, and you are very anxious to start reading again]. This might be your search query:
"index +of" name size goodkind wizard rule txt
Exactly one result [as of 20/7/2002]. We're taking a shortcut here, because it took me about 15 minutes to get to this result, but this doc is not about how to refine your searches; it is about handing you some tools. One further note: since most [interesting] ebooks are illegal due to copyright, they are not always directly linked from html pages, so open directories are a good place to start searching for them. [also, if you happen to stumble on one and want to bookmark it for later reading, better just download and save it, because usually it's gone the next time you look ;-) ]
Of course, there is index +of [note the '+', because sometimes a search engine drops this "commonly used word"]. We have found that this string almost always appears at the start of the title of an open directory. So a good sub-query would be intitle:"index +of" [too bad I don't know any search engine capable of titlestartswith: queries :-) ].
Another very good keyword is "parent directory", because usually [sometimes automatically] there is a directory above the one you are now viewing. Sometimes, you want to hit a root directory though. This is a bit strange: we have found root directories that contain a parent directory link [that links right back to the same root], and we have found some without. Our guess is that it depends on the version and type of server that generated the page. Still, "parent directory" is a very good way to filter out all those pages with "index of" in the title that aren't real directories.
Other keywords that you might want to use are "server at" port [look at the bottom of the directory-page], name, size, last modified or description. You must be careful with these queries, as we have found that not all open directory pages contain these keywords. Especially description is often found missing. [btw notice the peculiar use of the words "found" and "missing" right next to each other :-P]
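The keyword checks above can also be run locally on a page you have already fetched. The following is only a rough sketch: the function name `directory_listing_score`, the weights and the sample pages are our own assumptions for illustration, and since not every open directory contains every keyword, a low score proves nothing.

```python
import re

# Keywords taken from the essay; the weights are illustrative assumptions,
# not measured values.
MARKERS = {
    "parent directory": 2,
    "server at": 1,
    "last modified": 1,
    "name": 1,
    "size": 1,
    "description": 1,
}


def directory_listing_score(html: str) -> int:
    """Score how strongly a page resembles a server-generated listing."""
    text = html.lower()
    score = 0
    # "index of" at the very start of the title is the strongest hint.
    title = re.search(r"<title>\s*(.*?)\s*</title>", text, re.DOTALL)
    if title and title.group(1).startswith("index of"):
        score += 3
    # Each keyword found anywhere in the page adds its weight.
    for marker, weight in MARKERS.items():
        if marker in text:
            score += weight
    return score
```

A page that scores high on several independent keywords is a better candidate than one that only has "index of" somewhere in its title.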
The last search keyword we have found is not about what is inside the page, but about what the page is: open directories never have the extension htm or html in their URL, because otherwise they would be files and not directories: http://www.someserver.com/something.html is a file and http://www.someserver.com/something/ is a directory. Of course http://www.someserver.com/somestupidthing.html/ is a directory too, but it would be pretty weird if people named their directories like this. [I also don't know what apache or google thinks about these extensions] So a very nice subquery to filter out even more non-directory pages is: -filetype:htm -filetype:html.
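The file-versus-directory distinction can be checked on the URL alone. This is a minimal sketch using Python's standard `urllib.parse`; the helper names are hypothetical, and it only looks at the shape of the URL, not at what the server actually returns.

```python
from urllib.parse import urlparse


def looks_like_directory_url(url: str) -> bool:
    """True if the URL path is empty or ends in '/', i.e. names a directory."""
    path = urlparse(url).path
    return path == "" or path.endswith("/")


def has_html_extension(url: str) -> bool:
    """True if the URL path ends in .htm or .html (what -filetype: removes)."""
    path = urlparse(url).path.lower()
    return path.endswith((".htm", ".html"))
```

With the three URLs from the paragraph above, `something.html` counts as a file, while `something/` and the weird `somestupidthing.html/` both count as directories.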
intitle:"index +of" -filetype:htm -filetype:html parent-directory
A few notes about this query: First, parent-directory means exactly the same as "parent directory", just in case you are wondering. Second, due to google's page ranking, you might still find non-directory pages appearing at the top of your results, especially for very broad searches. This is because google thinks that users would rather see "real" html pages than open directories [and indeed, most users do].
Now for a few examples of open directories that are not as standard as the default apache directory, just to keep you a bit edgy :) [btw they are not particularly interesting, except for the fact that they differ from the standard]
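If you run this kind of search often, the fixed part of the query can be kept in one place and the topic keywords appended. A small sketch, assuming nothing more than string concatenation; the function name `open_directory_query` is ours, and the result still has to be pasted into the search engine by hand.

```python
# The fixed part of the query, exactly as given in the essay.
BASE_TERMS = [
    'intitle:"index +of"',
    "-filetype:htm",
    "-filetype:html",
    "parent-directory",
]


def open_directory_query(*keywords: str) -> str:
    """Append topic keywords to the fixed open-directory query."""
    terms = list(BASE_TERMS)
    for keyword in keywords:
        # Multi-word keywords are quoted so they are searched as a phrase.
        terms.append(f'"{keyword}"' if " " in keyword else keyword)
    return " ".join(terms)


print(open_directory_query("goodkind", "wizard", "txt"))
```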
Missing "server at" port and description.
Generally weird, missing "server at" port, and half-french.
One more note about open directories in other languages. Not open directories in non-English-speaking countries, but open directories where the generated page contains non-English terms, things like ko instead of 'kb' [the French call their bytes octets, so in French it is kilo-octet instead of 'kilobyte'], or 'nom' instead of 'name'. We first thought that there weren't any, but after some discussion at ~S~ board, someone didn't understand what I was talking about, so I gathered some more examples. It turns out that there are numerous open directories with the term répertoire de parent, like for example this query. On the other hand, most open directories have their key terms in plain English, and searching for foreign generated open directories is a bit too advanced a topic for this essay. If you want further information, we have only scratched the surface of this subject in this thread on ~S~ board. This is only a very short summary of French open directories; more research must be conducted for other languages [see also the thread].
Two more things, still.. You can target a site specifically using site:de, -site:msn.com or site:volkskrant.nl. Second, in AlltheWeb you can't specify the filetype, but omitting that and using only the rest of the query still gives very good results.
We would like to thank Jeff for his comments, and especially for not understanding things :)
Quite an addition
First you need to generate your working context: a pool of pages that are directory listings. To achieve this you can use some of the methods listed in aapje&ritz's essay; I personally choose:
title:"index of /" AND "parent directory" NOT (filetype:htm OR filetype:html)
This gives 2,830,000 hits on google and 10,617,311 hits on altavista.
Now we can think of searching inside this data pool by adding some filters to our query. Just try to imagine you are searching for files inside a huge chaotic computer, and you click on "Search files or directory" in your start menu. You have different fields to fill in: File name or Directory name / Filetype / Date / Size.
It's the same for ODS. But you need to build your query yourself. And be creative :)
Remember how the wildcards work for your search engine (rtfm, and rtfe). You can search for a file or directory name just by adding the keyword to the query.
File example :
intitle:"index of /" "parent directory" ecd502 -filetype:htm -filetype:html
One result, and in an open directory you have ecd502.rar (157 MB), which is the archive of Easy Cd Creator 5.02 Platinum Edition. Yes, I know, the url is forbidden now, but I think you get the point :)
Directory example :
intitle:"index of /" "parent directory" intitle:"mp3" -filetype:htm -filetype:html
Again, I think you get the point :) To achieve directory searching, the best way is to search in the title or in the url. The two don't give the same results; just experiment and see which you prefer.
Adding keywords such as "*.mp3" or ".pdf" or ".txt" can be very effective. Just be creative! Look at this one for example:
intitle:"index of /" "parent directory" +"*.nfo" +"*.rar" +"*.r05" +"*.r10" -filetype:htm -filetype:html
Check result 3: Norton.SystemWorks.2002.Professional.Final.READ.NFO-ELUSiVE. Note that this way of naming a directory is something of a mark of quality: you know the archives haven't passed through many dudes before reaching you, because it's the cracking scene's way of packaging (Name.Of.Your.Release.Version.AdditionalInfo-CRACKINGGROUP). The keyword "*.nfo" filters for directories that contain some file.nfo, which is again a mark of quality. Finally, by using "*.rar" and "*.r10" we're sure to fish medium-sized software, because that is how the Rar compressor names multipart archives.
Another good approach is to search for specific groups. If you know for example that Radium is a sound software cracking group, use "radium.nfo" as keyword: intitle:"index of /" "parent directory" "radium.nfo" -filetype:htm -filetype:html
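To see why "*.r05" and "*.r10" hint at size, it helps to spell out the naming of multipart archives. The sketch below assumes the old-style RAR scheme (name.rar, then name.r00, name.r01, and so on); the function name `rar_volume_index` is ours, and newer archives that use a .partN.rar scheme are not covered.

```python
import re
from typing import Optional

# Old-style RAR volumes after the first one: .r00, .r01, ... (assumption:
# only this scheme is handled, not the newer .partN.rar naming).
_OLD_STYLE_PART = re.compile(r"\.r(\d{2})$", re.IGNORECASE)


def rar_volume_index(filename: str) -> Optional[int]:
    """Return the 0-based volume number, or None if not a RAR volume."""
    if filename.lower().endswith(".rar"):
        return 0  # the first volume carries the plain .rar extension
    match = _OLD_STYLE_PART.search(filename)
    if match:
        return int(match.group(1)) + 1  # .r00 is the second volume
    return None
```

Under that assumption a file ending in .r10 is the twelfth volume of its set, so a directory matching "*.r10" holds a release split into at least twelve pieces.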
It is also a synecdochical type query.
Pretty simple to do, but I have no enlightening example to give :) I just use this functionality to keep things 'fresh' (+2002).
Quoting erom in dark_rid.htm : 'add +"1M" to have the chance to find galleries with hires pictures.'
Again, be inspired.
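The four fields of the imaginary "Search files" dialog (name, filetype, date, size) can be mapped onto one query string. This is a sketch only: `ods_query` and its parameters are hypothetical names, and each field is translated into a plain keyword in the same way as the hand-written queries above.

```python
from typing import Optional


def ods_query(
    name: Optional[str] = None,
    extension: Optional[str] = None,
    year: Optional[int] = None,
    size_hint: Optional[str] = None,
) -> str:
    """Build an open-directory query from the four 'search dialog' fields."""
    terms = ['intitle:"index of /"', '"parent directory"']
    if name:
        terms.append(name)                   # file or directory name
    if extension:
        terms.append(f'+"*.{extension}"')    # filetype, e.g. +"*.nfo"
    if year:
        terms.append(f"+{year}")             # date, e.g. +2002
    if size_hint:
        terms.append(f'+"{size_hint}"')      # size, e.g. +"1M"
    terms += ["-filetype:htm", "-filetype:html"]
    return " ".join(terms)


print(ods_query(name="ecd502"))
print(ods_query(extension="nfo", year=2002, size_hint="1M"))
```

The first call reproduces the file example given earlier; the second combines a filetype, a year and a size hint in one query.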
There are some other tricks that can be used. On altavista you can filter your results by searching for pages whose title begins with index of. Compare these two query strings:
title:"index of" 11,441,981 hits
title:"index of" AND NOT title:"* index of" 11,310,633 hits
Unfortunately this trick doesn't work well on Google :(
Given that Open Directories (OD) always show links and anchors, you can use them to filter your results.
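Where the engine cannot do the "title begins with" test for you, the same filter can be applied afterwards to results you have already collected. A minimal sketch, assuming the results are available as (title, url) pairs; how you obtain them is left open, and the example.com entries are made up.

```python
from typing import Iterable, List, Tuple


def title_starts_with_index_of(title: str) -> bool:
    """True if the page title begins with 'index of' (case-insensitive)."""
    return title.strip().lower().startswith("index of")


def keep_listing_titles(
    results: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep only the (title, url) pairs whose title begins with 'index of'."""
    return [(t, u) for t, u in results if title_starts_with_index_of(t)]


# Hypothetical results, for illustration only.
sample = [
    ("Index of /music", "http://example.com/music/"),
    ("My big index of links", "http://example.com/links.html"),
]
print(keep_listing_titles(sample))
```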
Only works on altavista.
Example: search for the Brandenburg Concertos of Johann Sebastian Bach in mp3.
Query: title:"index of" AND NOT title:"* index of" AND anchor:Bach AND anchor:mp3 AND anchor:brandenb**
Comments: given that altavista is case sensitive, we can use capital letters in names to filter the results; I don't know in what language Brandenburg will be written, which is why I use truncation.
OD usually truncate the anchors if they are too long, so it's better to use the link field.
Works on altavista, alltheweb, lycos
Example: search for the mp3 of Pachelbel's Canon in D. Query strings:
altavista: title:"index of" AND NOT title:"* index of" AND link:(Pachelbel AND mp3 AND canon)
alltheweb: +normal.title:"index of" +link.all:pachelbel +link.all:canon +link.extension:mp3 -url.all:htm -url.all:html
lycos: +normal.title:"index of" +link.all:pachelbel +link.all:canon +link.extension:mp3 -url.all:htm -url.all:html
A little off topic: once upon a time the field title: was supported by hotbot/MSN and you could use these query strings:
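Since the same search has to be phrased differently per engine, the two syntaxes above can be generated from one list of words. This sketch only reproduces the field names that appear in the query strings above; the function names are hypothetical, and whether an engine still accepts these fields has to be checked by hand.

```python
from typing import Sequence


def altavista_link_query(terms: Sequence[str]) -> str:
    """AltaVista style: boolean operators and a single link:(...) group."""
    joined = " AND ".join(terms)
    return f'title:"index of" AND NOT title:"* index of" AND link:({joined})'


def alltheweb_link_query(words: Sequence[str], extension: str) -> str:
    """AlltheWeb/Lycos style: one +field:value term per word."""
    parts = ['+normal.title:"index of"']
    parts += [f"+link.all:{word}" for word in words]
    parts.append(f"+link.extension:{extension}")
    parts += ["-url.all:htm", "-url.all:html"]
    return " ".join(parts)


print(altavista_link_query(["Pachelbel", "mp3", "canon"]))
print(alltheweb_link_query(["pachelbel", "canon"], "mp3"))
```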
hotbot: +title:"index of" +Pachelbel +canon +linkextension:mp3
MSN: +title:"index of" +Pachelbel +canon +linkextension:mp3
Since hotbot and MSN search the Inktomi database, I tried the hotbot/MSN fields on the Inktomi search engine. The fields domain:, link:, linkdomain: and linkextension: work; feature:[acrobat, activex, audio, embed, flash, form, frame, image, script, shockwave, table, video, vrml] seems to do something/work, but title: seems to fail miserably.
I think you should also add this rtfm because they are not always up-to-date, (or at least up-to-guess:) and this way we have a good list of names to do our own research.
As the 'regular' index+of trick mostly finds Apache OD pages, here's a query to catch other fish: "parent directory" dir OR gif OR jpg.
It does look very similar, but not quite: "parent directory" dir OR gif OR jpg -"index+of" ... oops (*)... "parent directory" dir OR gif OR jpg -"index+of" -sex
Or, for better filtering, try something like: "parent directory" dir OR gif OR jpg -"index+of" -the (~300k)
Note that most of these sites are running IIS. A better filter exclusively for IIS servers is "to parent directory" dir OR gif OR jpg -"index+of" -the (~300k)
The others are either stranger fish or common servers with custom index pages (hence we shouldn't use '-the' on them, as they are likely to contain human-written text).
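The same two phrases can be used to sort pages you have already downloaded. This is a heuristic sketch, not a server fingerprint: the function name and the labels are ours, and custom index pages can fool it in either direction.

```python
def guess_listing_family(page_text: str) -> str:
    """Guess which kind of directory listing a page is, from its wording."""
    text = page_text.lower()
    # IIS listings typically carry a "[To Parent Directory]" link.
    if "to parent directory" in text:
        return "iis-style"
    # The 'regular' trick: "index of" plus "parent directory".
    if "index of" in text and "parent directory" in text:
        return "apache-style"
    # A parent link without "index of": the stranger fish.
    if "parent directory" in text:
        return "other"
    return "not-a-listing"
```

Note that the IIS test has to come first, because "to parent directory" also contains the plain phrase "parent directory".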
Btw tricks like '-the' are a nice way of filtering out specialized generated content - stats, sources, email lists, or indeed index+of pages.
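The idea behind '-the' is that machine-generated pages rarely contain ordinary English filler words. A local equivalent, as a sketch under that same assumption; the function names are hypothetical, and a whole-word match is used so that file names like theme.css do not count.

```python
import re


def contains_word(text: str, word: str = "the") -> bool:
    """True if `word` occurs as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def looks_generated(text: str) -> bool:
    """Rough local version of the '-the' filter: no 'the', probably generated."""
    return not contains_word(text, "the")
```

As with the query itself, this throws away custom index pages that contain a sentence or two of human text, so use it only where that loss is acceptable.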
(*) As it seems, not only porn research is of FUNDAMENTAL teaching for seekers, but it works the other way around too :)