|NOT MUCH TO DO WITH FSF||This workshop has -apparently- not much to do with
the free software movement, nor with the open source movement,
and in this respect it differs from most of the other workshops at FOSDEM. |
Yet -together with my friend Richard Stallman- I think that people working for such worthy aims deserve to know the techniques I will describe today.
Another difference, as you will have noticed, is that I am the only one here who uses a pseudonym instead of his real name. The simple reason for this is that some of the searching techniques I will describe may be seen as "non kosher" in many of our euroamerican copyright-obsessed dictatorships.
|ANYTHING||In fact we will see together today,
how to find ANYTHING on the web. |
Indeed the web is so huge that -with almost no exceptions- searching in the correct way will enable you to find anything you may be looking for. Any image, any music, any book, any document, any data, any software (proprietary or not), any newspaper... that has been published in the history of mankind. Whole national libraries are going online at this very moment in some god-forsaken country in Africa or Asia. Half a million people are putting, in these very 5 seconds, half a million scanned images on some god-forsaken homepage. Another 200 thousand users are uploading, in THESE five seconds, 200 thousand mp3s somewhere on the web.
Some of these books, images and pieces of music have never been on the web before. But they will now remain on the Internet for ETERNITY.
|NO WAY BACK||Indeed, everything that has been published once, will remain
on the web for ever and ever (and ever), in copycatted electrons, because the very moment something is
there it will be copied. A simple demonstration of this is that you can find things that
DO NOT EXIST ANYMORE on the web using one of the many repositories.
One of them is the Google cache, another is the Wayback Machine, but there are many more that will allow you to find data that have been 'pulled off' the web.
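To make the repository idea concrete, here is a minimal sketch of how one might build lookup addresses for a page that has vanished. It assumes the Internet Archive's public URL patterns (a snapshot path and an availability query); neither pattern is given in this talk, and `example.com/vanished.htm` is a purely hypothetical target. The code only builds the addresses, it does not fetch anything.

```python
from urllib.parse import urlencode

# Assumption: these two URL patterns are the Internet Archive's public
# interface. They are not taken from the workshop text.
WAYBACK_SNAPSHOT = "https://web.archive.org/web/{timestamp}/{url}"
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def snapshot_url(url: str, timestamp: str = "2002") -> str:
    """Return the address of the capture closest to `timestamp`.

    `timestamp` is a prefix of YYYYMMDDhhmmss; a bare year is enough.
    """
    return WAYBACK_SNAPSHOT.format(timestamp=timestamp, url=url)


def availability_query(url: str, timestamp: str = "2002") -> str:
    """Return the query address that asks whether a capture of `url` exists."""
    params = urlencode({"url": url, "timestamp": timestamp})
    return WAYBACK_AVAILABILITY + "?" + params


if __name__ == "__main__":
    # Hypothetical page that has been 'pulled off' the live web.
    print(snapshot_url("example.com/vanished.htm"))
    print(availability_query("example.com/vanished.htm"))
```

The same two helpers can be pointed at any other repository by swapping the patterns, which is the point: the target is gone, but its copies have addresses of their own.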
|SOMEWHERE... WHERE?||Yep, all this stuff is on the web, somewhere, but where?|
You will have to use a lot of different techniques and approaches to search the web effectively. As you will see, google, though worthy, is by FAR not the solution for your searches. In order to understand WHY you have to use different tools, you should first of all understand what the web looks like from a searcher's point of view.
Explain diameter 19: do not despair
There is ONE important thing in this image that I hope you will not forget: the difference between the INDEXED web (a couple of milliard/billion pages) and the NOT INDEXED web (9 milliard/billion pages more). So when you are searching with the main search engines, with google, or with fast, or with wisenut, you are limiting yourself to -in the best case- a FIFTH of the web.
|SO, HOW DO YOU SEARCH?||Ok, how do you search for your needles in this
huge ocean of commercial hay?|
Let's begin at the beginning: usually you do not search for a specific target: you search for people that have searched for that target. If the target has enough signal among the noise you may even search for people that have searched for people that have searched for that specific target... :-)
This approach is called COMBING, and is rather effective. But before explaining it, we will have to understand how the MAIN search engines really work, and WHY they are there. Simply stated, these "free" search engines exist in order to grep what you and millions of other users are searching for.
Anonymity - proxies - free homepages - free email addresses - free search engines
|EXAMPLE ~ S.E. DIFFERENCES||
Let's take an example:
you are interested in a specific camera, how to use it, if it is worth using... whatever.
Let's say a nikon F2|
Now, of course, you could search on google for nikon F2:
search?as_q=%22nikon+f2%22&num=100: 8180 results
This is a good, simple query and it is what most people would do. And they may even be happy with it.
Nevertheless a good idea would be to use ANOTHER main engine as well, let's say FAST:
&query=%22nikon+f2%22: 2561 results
Before discussing this, it would not be bad to use at least a THIRD main search engine on such a broad query:
query.dll?q=%22nikon+f2%22: 6807 results
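The three query strings above all carry the same encoded phrase, `%22nikon+f2%22`, which is just "nikon f2" in double quotes. A minimal sketch of that encoding follows. The query fragments are copied from the lines above; the host part of each link is not given in the text, so none is invented here, and the label `third` is only a placeholder name for the third engine.

```python
from urllib.parse import quote_plus

# Query-string fragments as they appear in the three examples above.
# The hosts are not given in the text, so only the fragments are built.
ENGINE_FRAGMENTS = {
    "google": "search?as_q={q}&num=100",
    "fast": "&query={q}",
    "third": "query.dll?q={q}",  # placeholder label for the third engine
}


def phrase(terms: str) -> str:
    """Lowercase the terms and wrap them in double quotes (exact phrase)."""
    return '"' + " ".join(terms.lower().split()) + '"'


def fragments(terms: str) -> dict:
    """Return the URL-encoded query fragment for every engine."""
    encoded = quote_plus(phrase(terms))  # '"' becomes %22, space becomes +
    return {name: tpl.format(q=encoded) for name, tpl in ENGINE_FRAGMENTS.items()}


if __name__ == "__main__":
    for name, fragment in fragments("Nikon F2").items():
        print(name, fragment)
```

Building all the fragments from one phrase makes it cheap to send the same broad query to several engines instead of one.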
The first thing to understand is that you should ALWAYS use at least two main search engines when starting a broad query. As you may see if you follow the links above, wisenut, for instance, is more 'asian-centered' than google, which in our case, searching for a Japanese camera, would probably be useful.
Now a normal user would be happy: Whoa, 2000 - 6000 - 8000 results! I may browse for ever just here.
In fact, first of all, you CANNOT really see all those results. There is a difference between the number stated by the search engines and the results you may really check yourself.
If you really tried to see ALL those links, you would quickly discover that you cannot.
Here is a table I made two months ago, based on another query. As you can see, there is a huge difference between alleged results and results you can investigate:
(Based on the broad query: "advanced searching")
|WHILE WE ARE STILL AT THE MAIN SEARCH ENGINES||Some simple rules:|
1. Always use more than one search engine! "Google alone and you will never be done!"
2. Always use lowercase
3. Always use MORE searchterms, not only one "one-two-three-four, and if possible even more!"
This is EXTREMELY important. Note that -lacking better ideas- even a simple REPETITION of the same term can give you more accurate results:
nikon: 1,410,000 (alleged) results
nikon nikon: 627,000 (alleged) results
If you are interested in this 'pleonastic' stuff, read The epanaleptical approach.
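Rules 2 and 3 and the repetition trick are mechanical enough to sketch in a few lines. This is only an illustration under my own assumptions: the helper names are hypothetical, and the share computed at the end simply restates the two (alleged) result counts quoted above.

```python
def build_query(terms, repeat=1):
    """Lowercase every term, drop empty ones, and repeat the list `repeat` times."""
    cleaned = [t.strip().lower() for t in terms if t.strip()]
    return " ".join(cleaned * repeat)


def remaining_share(before: int, after: int) -> float:
    """Fraction of the alleged results that is left after refining a query."""
    return after / before


if __name__ == "__main__":
    print(build_query(["Nikon"]))            # single term, lowercased
    print(build_query(["Nikon"], repeat=2))  # the plain repetition trick
    # Alleged counts quoted above: 1,410,000 for one term, 627,000 repeated.
    print(round(remaining_share(1_410_000, 627_000), 2))
```

On the quoted numbers, repeating the term leaves a bit less than half of the alleged results.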
Since we did not do it before, it's time to use more searchterms now. Here is a "better" query for our target (I will use google, but remember -yourself- to use also OTHER main searchengines when broadsearching, you will be amazed by the non overlapping results).
Now look at this query:
You may not recognize the querycodes above... it is just "links.html#nikon". We are already slowly moving away from simple main search engine searching towards combing. In fact I was searching for pages of links that are of interest for my query. I can go further:
"nikon.htm" OR "nikon2.htm"
Another approach: +nikon nikkor +photo resources
As you see, it's infested with commercial sites... there is some need for our yo-yo here
You get the idea.
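The file-name guessing used a few lines up can be generated rather than typed. Here is a minimal sketch; the suffix and extension lists are my own assumptions, chosen only so that the default call reproduces the "nikon.htm" OR "nikon2.htm" query shown earlier.

```python
def filename_guesses(keyword, suffixes=("", "2"), extensions=(".htm",)):
    """Guess likely page names for a keyword, e.g. nikon.htm and nikon2.htm."""
    kw = keyword.strip().lower()
    return [kw + suffix + ext for ext in extensions for suffix in suffixes]


def or_query(candidates):
    """Quote every candidate and join them with the OR operator."""
    return " OR ".join('"' + c + '"' for c in candidates)


if __name__ == "__main__":
    print(or_query(filename_guesses("Nikon")))
    # A wider net: hypothetical extra extension.
    print(or_query(filename_guesses("Nikon", extensions=(".htm", ".html"))))
```

The idea is the combing one: people who collected links on a subject tend to name the page after it, so guessing the name finds the collectors.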
I could also use the netcraft trick: do we happen to have many "nikonsites"?
Whoa... 800 site NAMES contain the word nikon! (But many of them will be dormant).
|BUT THE REAL DIFFERENCE||But the real difference between the simple
queries we have made above and a good seeker approach, is that above we are still just
"skimming", or only slightly touching, the relatively
small INDEXED part of the web. We are still missing 4/5ths of it! That's the reason you will have to learn at least some
rudiments of combing.|
The first -simple- combing approach (remember: searching those that have already searched) is to use glorious old USENET!
Local searching (spanish search engines - buscadores hispanos)
getting at the target from behind: netcraft, synecdochial searching.
Passwords through google
Database accessing (politically correct)
brute forcing? Guessing / Searching
Bots searching, scrolls, wands
Software reversing: commercial bots capering
|ANY QUESTIONS?||Now, at the beginning of our workshop I told you
that you can find ANYTHING on the web. My experience has taught me that there is
almost always -unfortunately- ONE sad exception. It is almost always next to impossible to find quickly
the curious targets that people
ask for at the end of my workshops, so do not ask me to find something specific for you now...|
eliminating advertisement in Opera 6.1 Windoze
Code is now packed, use an unpacker first, else you won't see this:
.text:0044C2B3 ; ----------------------------------------------
.text:0044C2B3
.text:0044C2B3 loc_44C2B3:              ; CODE XREF: .text:0044C26Fj
.text:0044C2B3                          ; .text:0044C281j ...
.text:0044C2B3 0C FF    or al, 0FFh
.text:0044C2B5 C9       leave
.text:0044C2B6 C3       retn
you would probably prefer to have here -instead of that vicious OR FF- a well placed and less advertisement friendly AND 00
.text:0044C2B3 ; ----------------------------------------------
.text:0044C2B3
.text:0044C2B3 loc_44C2B3:              ; CODE XREF: .text:0044C26Fj
.text:0044C2B3                          ; .text:0044C281j ...
.text:0044C2B3 24 00    and al, 00h
.text:0044C2B5 C9       leave
.text:0044C2B6 C3       retn
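For those who do not read assembly, the difference between the two listings is only what ends up in the 8-bit AL register before the routine returns. The sketch below models just those two instructions on an 8-bit value. It is an illustration, not a tool: how the caller interprets AL is not stated in the listing, and the function names are my own.

```python
def or_al_ff(al: int) -> int:
    """Model of `or al, 0FFh`: every bit is set, whatever AL held before."""
    return (al | 0xFF) & 0xFF


def and_al_00(al: int) -> int:
    """Model of `and al, 00h`: every bit is cleared, whatever AL held before."""
    return (al & 0x00) & 0xFF


if __name__ == "__main__":
    # Check all 256 possible values of an 8-bit register.
    print(all(or_al_ff(v) == 0xFF for v in range(256)))
    print(all(and_al_00(v) == 0x00 for v in range(256)))
```

Both instructions are two bytes long (`0C FF` and `24 00` in the listings), which is why one can stand in for the other without shifting the `leave` and `retn` that follow.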