One day I was searching for information on the deep web, prompted by an article
I had read on another page, and one of the results that came up was a PDF on the
subject. The essay attempted to put numbers on the size and percentage of the
deep versus the visible web, and it sparked off many thoughts. I then started to
look at the URL at which the essay lived. Backstepping to the root path gave an
interesting-looking site, which contained a pointer to a tool designed to
perform searches for data in deep or hidden databases.
I have been following the [oslse project] for quite some time, and have also been
assembling a list of URLs for queries on search engines. This was so I could
add more engines to my local search bot, and perform wider and narrower searches
without having to visit the individual search pages.
So I followed the pointer and came to the tool's home site. I read with much interest the claims made for this tool, many of which I had seen with other tools of this type, but a couple jumped out at me:
1. It could perform query types on engines which did not themselves support that particular query type.
2. The number and variety of engines it supported.
The next step was to read more and see what it could do.
While reading their details I decided to view their updates page, where all the data files were provided, nicely labelled for you. This sounded too easy to be true.
My main thought was that if this tool could perform all these searches, it must contain the information needed to perform them. It was this information I was after, not the product itself, as it would save me the time of doing the work myself.
One of the main themes that came up in the OSLSE was how to parse the results pages and extract only the results and nothing else. This was another reason this tool held my interest, as it seemed able to perform types of queries not actually supported by the search engine.
After downloading the files, the first things to be checked were the data file updates (quite old!). On inspection, one of them seemed to be encrypted in some way, but still seemed to retain its original layout, as repetitive patterns and linefeeds were still present.
I have also seen a number of similar tools becoming available on the web. My thought is that these will become more popular and widespread as more people become frustrated with not being able to find what they are looking for using conventional search engines, and are prepared to pay a small fee to cut down on their searching time. (I mean people who have not invested, and are not interested in investing, time and effort in [learning] these most valuable skills.) Note, as a matter of interest, that several military and academic institutions recommend this tool and use it on training courses ;)
This does not mean we should take their recommendation without thinking about it, seeing how the program works, and asking whether it can be of help; it may prove worthless - how will we know unless we check first? Other people choose the approach of writing their own tools, which they can then extend and customise to their own requirements.
The fact that the process would probably be repeated with other similar tools (in my quest for more specialised search engine data) had a real impact on the approach taken, as reversing every single one of them to get the data out would take a lot of time.
I knew that getting the encrypted file to plaintext would be quite easy, as parts of it were visible in a hex view of the main file, so I knew what to look for - but I decided on a different approach.
So I decided to put the task of looking inside the software to one side, and wondered if there was a really quick and easy (almost Zen) approach to getting the data from this program, without altering or reversing it.
Whilst sitting in my comfortable chair, sipping my favourite drink and listening to some appropriate music, a thought came to me: all I needed was to see the queries the program sends out, as those would give me the query strings. So how to do this? I have a local proxy which logs all web requests to a nice log file. Normally, when evaluating software, I run the program through the proxy so it can be checked for spyware or adware components (just in case). If this program was run with its requests going through the proxy, the result would be a log file containing all the URLs requested by the program - which would be the query strings for the search engines. So all I needed to do was point it at the local proxy and then enter a query string that would be easy to pick out ("AAAQUERYAAA").
So first step was to clean the log file of my local proxy ;)
Then to install the program, and run it whilst not connected to the internet.
At this point I encountered a problem - the program would not run on my PC; it crashed every time before doing anything, so it was removed. Then I called a friend and said I had found this nice tool that might be of interest, but that it did not like my PC or something on it. Soon another PC with a clean install was sitting on the desk - he installed the software and agreed to the license on HIS PC! (It ran fine on his PC! Umm.)
The first time the program was run, it was done offline and set to point to the local proxy. The proxy was set up to return a dummy good page for any 404 errors, which contained a couple of valid but distant links. The proxy was configured to allow requests to these pages and return valid pages from the cache.
Once the program was configured to point to the local proxy, a basic search was done.
When the search was complete, the proxy logs were checked and, lo and behold, all the query strings were present and correct. Also to be noted was that the links present on the 404 pages (well - fake good results) had also been requested by the program.
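Picking the query strings out of the log is then a matter of searching for the marker. A small sketch, assuming one logged URL per line (the log format and the "{query}" placeholder convention are mine, not the tool's):

```python
# Pull the per-engine query templates out of the proxy log by looking
# for the easy-to-spot marker string entered as the search query.
MARKER = "AAAQUERYAAA"

def extract_templates(log_lines):
    """Return each logged URL containing the marker, with the marker
    swapped for a placeholder ready for later substitution."""
    templates = []
    for line in log_lines:
        line = line.strip()
        if MARKER in line:
            templates.append(line.replace(MARKER, "{query}"))
    return templates
```

Anything without the marker (adverts, the links the tool followed from the fake result pages) simply drops out.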
This made me think: the program had sent the queries to the search engines, and when the results were returned it had requested the links on the results page. This I thought was very interesting. It must be doing some processing on the pages pointed to by the results, which suddenly made me think *OF COURSE* - this is how it is able to perform queries of a type not supported by a particular search engine. It must do the query using a supported one and then do the extra work itself - a very nice idea. Suddenly the possibilities seemed endless and my mind started to wander to ideas of my own, but back to the project.
Then another thought hit me like a brick on the head: HOLD ON!! I had not formatted my fake 404 page like the results page of a query, mainly because I had not been expecting it to grab those pages as well. But given that it had grabbed them - why? The links on my page in no way looked like a results page. I had assumed the tool parsed the results pages and had rules for picking the links, so you did not end up grabbing loads of adverts. But the simple fact that it had grabbed my links as well proved that it must not, and must handle that in another way - if it does at all!
Could it be that they have a set of rules for throwing away links, and grab everything that passes - without bothering to check whether they are really results or just links on the page?
At this point I posted a message to the messageboard, which was simple and just
gave a URL, asking if anyone knew anything about it without shouting about
what it was or could do. I posted in this way so it could easily be overlooked
and discarded by people just wandering by, and I wondered whether anyone would
pick up on the same points if they followed similar lines of thought. A teaser
was left for those who wanted to see.
When no one sent a reply after some days, my thought was that it had been too obscure and should have had a red neon sign on it. Then RH posted a reply which was certain to grab attention - so at least one person had seen the potential. This was the point at which Laurent saw the light and joined in, which resulted in me writing the essay you are reading.
The process described in this essay is very simple, but often overlooked
or not considered in projects of this type. It provided me with all the
required information without violating the license of the software.
Notably, it is also an approach which can be used with other similar tools. If the user supplies a search string which can serve as a marker - where their own query will be inserted later - then the actual query URL string for each search engine can be retrieved with the minimum of effort. These can then be inserted into a local search bot, which replaces the marker with the user query.
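The search-bot side of this - filling each saved template with the user's real query - can be sketched as follows. The "{query}" placeholder is an illustrative convention of mine, not anything from the original tool:

```python
# Substitute the user's query into each saved engine template.
# Templates are the logged URLs with the marker already swapped for
# a "{query}" placeholder; the query itself must be URL-encoded.
from urllib.parse import quote_plus

def build_query_urls(templates, user_query):
    """Return one ready-to-fetch URL per search engine template."""
    encoded = quote_plus(user_query)
    return [t.replace("{query}", encoded) for t in templates]
```

A search bot then only needs to fetch each returned URL - one request per engine - without ever visiting the individual search pages.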
The above observations on how this tool works have also given rise to a number of new thought lines for writers of search bots and I think some important lessons can be learnt from the methods employed by this tool.
The main change in my thinking concerns the need for complex rules for parsing the results page of each engine. This need seems to be lessened by how the tool works - this is explained more in part two.
The approach taken in this essay also appeals because it does not utilise any specific files from the original software or any copyrighted material: the search engines themselves provide the query URLs, and we do not use any of the extra data contained in the software - just the queries, which could be gained directly from each search engine.
I have repeated the same process with a number of other such tools, and have gained a large amount of information on some very specialised search engines. This process was completed without touching any code or disassemblers.
You are probably sitting there reading this thinking 'when do we get the meat?' Well, it is in the second part.
This investigation changed direction during the process, as many do, and managed to get the information without having to dig any deeper. The approach also changed from being aimed at a specific target to a more general one which could be reused easily and quickly.
I must point out that during the writing of this essay, I DID NOT USE THIS
PROGRAM MYSELF OR AGREE TO THE LICENSE, or reverse engineer the code or data files within it.
All the information was gained through viewing log files on a local proxy server.
(information which would be present on any proxy server between you and the search engines)
I want to point out the importance of log files, especially on proxy servers: they are a very valuable resource and greatly help in understanding similar processes and programs. They are also a valuable searching resource, as they often contain URLs which are internal links (not external). I have found many interesting databases using such log files.
At this point I thought about looking into the program itself - not to find the
URLs for the queries, as those had already been revealed by the logs; my main
interest was that nice big engine file.
Laurent has written an essay about the process of gaining the original data files from this program. This essay is part two of the series and delves much deeper under the skin of the software and the way it works. I followed the same path as Laurent for the second part of my investigation, but leave it to him to describe the process, as his essay is well written and concise and raises some interesting points and thoughts.
You should consider these two approaches as companions - approaches to the same problem which can be used to complement each other, thus allowing a more thorough examination of the software under the microscope: firstly by watching it work, and then by looking into it.
Onto Part Two - Delving Deeper
Thanks to Laurent for getting me typing this essay.
Hope you enjoyed reading.
Copyright (c) 2001, WayOutThere