Manually searching reliable news sources is a time-consuming task, and not one suited to combing through thousands of sources. The time constraints involved in such a search are unacceptable, to say the least. Computers and networks are sufficiently advanced that harnessing their power is both inevitable and desirable.
Relying on `weblogs' that cater to your specific interests is undesirable because of the perceived, if not actual, lack of professionalism that is rampant in the weblog arena. These sites are not always on top of news events. One of the largest weblog sources for `geek news', Slashdot(2), is often full of editorial mistakes and serious bias.
Being dependent on third-party commercial entities for appropriate search results can lead to skewed results, and extreme bias. In the past, it has not been uncommon for search engines to sell their page rankings to the highest bidder(3). A cynical and skeptical nature is to be appreciated when dealing with commercial entities.
When doing large, professional searching, it is customary to use bots, or software search agents, to comb the World Wide Web for valuable information. There are many tools available for writing bots(4), and much literature on the subject.
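In its simplest form, a bot is little more than a page fetch plus a pattern match. A minimal sketch in Python, with the sample HTML and the regular expression invented purely for illustration (a real bot would fetch its document over HTTP, e.g. with urllib):

```python
import re

# A toy page, standing in for a fetched document. In a real bot the HTML
# would come from something like urllib.request.urlopen(url).read().decode().
SAMPLE_HTML = """
<html><body>
<a href="/story/101">Probe Reaches Mars</a>
<a href="/story/102">Markets Rally on News</a>
</body></html>
"""

# One regular expression pulls out both the link target and the headline text.
LINK_RE = re.compile(r'<a href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')

def extract_headlines(html):
    """Return (url, title) pairs for every link found in the page."""
    return [(m.group('url'), m.group('title')) for m in LINK_RE.finditer(html)]

headlines = extract_headlines(SAMPLE_HTML)
for url, title in headlines:
    print(url, '->', title)
```

The hard part, of course, is not the fetching or the matching, but knowing which pages to visit and which patterns apply to them; that is the subject of the next section.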
Regular Expression Web Templates
Large web sites are not easy to maintain; hence the rise of dynamic web sites and scripting tools, such as PHP, Perl, ASP, JSP/Servlets, Python, etc. Because of the use of such dynamic tools, most web sites fit a `template.' Logical tools create logical web sites, even when the structure is masked by extra long and obscure URLs. A human can decipher and extract such a structure, and hence map out the essence of a web site. This work is manifest by "sending out finely tuned software agents, or bots, that learn not only which pages to search, but also what information to grab from those pages."(1)
The structure extracted by a human is used once, in the form of a custom bot, then lost: both because of the ever-changing nature of the web, and because there exists no standard way of communicating the structure of a web site to another person, or bot. RDF(5) and other meta tag standards are not useful here, because their use is voluntary. RDF is a great idea for a perfect world. We need pragmatic solutions for an imperfect world.
A standard form for communicating the structure of a web site is needed, so that this structure can be fed to a bot, and information gathered efficiently, without the extreme duplication of work that is so rampant today in the creation of custom search agents. Regular expressions are the natural choice for modelling a single page, but an appropriate form for the structure of a whole web site is needed. This structural form must be completely modelled in one file, be operating system agnostic, and must cater specifically to the HTTP protocol. There must also be a table of metadata at the head of the file, indicating all the data that can be culled from the web site in question. For example, when modelling a news site, the table of metadata must indicate that `Science Headlines', as well as `Economic Headlines', are available. Thus, a robot able to digest this standard form need only be told what data is relevant, and not how to retrieve it and parse it.
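As a sketch of what such a structural form might look like: a template pairing the metadata table at its head (what can be culled) with, for each entry, a page and a regular expression (how to cull it). Every name, page path, and pattern here is invented for illustration; no real site or file format is implied.

```python
import re

# A hypothetical site template. The metadata table at its head lists every
# kind of data the site offers; each rule maps that data to the page that
# holds it and the regular expression that extracts it.
TEMPLATE = {
    'site': 'example-news.invalid',
    'metadata': ['Science Headlines', 'Economic Headlines'],
    'rules': {
        'Science Headlines': {
            'page': '/science/index.html',
            'pattern': r'<h2 class="sci">(?P<headline>[^<]+)</h2>',
        },
        'Economic Headlines': {
            'page': '/economy/index.html',
            'pattern': r'<h2 class="econ">(?P<headline>[^<]+)</h2>',
        },
    },
}

def cull(template, wanted, fetch):
    """The bot is told only *what* is relevant; the template says *how*
    to retrieve and parse it."""
    rule = template['rules'][wanted]
    html = fetch(rule['page'])
    return re.findall(rule['pattern'], html)

# Stand-in for an HTTP fetch, so the sketch is self-contained.
def fake_fetch(page):
    return '<h2 class="sci">Comet Spotted</h2><h2 class="sci">Gene Mapped</h2>'

print(cull(TEMPLATE, 'Science Headlines', fake_fetch))
```

Note that the generic robot itself never changes; only the template file does, which is exactly what makes a central, volunteer-maintained repository of such templates feasible.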
Please note that this standard form for communicating the structure of a web site is the general case, and can be applied specifically to the culling of news headlines. News headlines are chosen for discussion here because of a perceived interest of the Readers by the Author.
By harnessing the many hands of the Internet, it will be possible to keep abreast of changes to the structure of different web sites, by utilizing the good will of World Wide Volunteers. A central repository of Web Templates will be needed to house the structures of web sites, and this will necessarily need to be completely Free. Make no mistake: this technology must be available to one and all, and I don't care if we all have to live in a cardboard box to do it.
When News Content is made available by these generic robots, text classification technology can be utilized to categorize content, and to assign user preferences to said content. Content could be organized in such a manner as to extract certain patterns, with the goal of finding valuable information. Envision a Library, or, if the immensity of that thought is too grand, then perhaps a News Library. Proven algorithms could be used, for example Bayesian classification, or perhaps cutting edge technology could be implemented, such as Support Vector Machines (SVM). This is an extremely fruitful area of research, and I recommend it to all who are interested in understanding the nature of information, and hence, the nature of the 'net.
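As a small taste of the Bayesian approach mentioned above, here is a toy Naive Bayes classifier that sorts headlines into categories from a handful of examples. The training corpus and categories are invented for illustration; a real News Library would train on thousands of labelled headlines.

```python
import math
from collections import Counter, defaultdict

# Tiny labelled corpus, invented for illustration.
TRAINING = [
    ('science', 'probe reaches mars orbit'),
    ('science', 'gene sequence mapped by lab'),
    ('economy', 'markets rally on trade news'),
    ('economy', 'bank cuts interest rate'),
]

def train(examples):
    """Count word frequencies per category, plus category priors."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    vocab = set()
    for cat, text in examples:
        cat_counts[cat] += 1
        for word in text.split():
            word_counts[cat][word] += 1
            vocab.add(word)
    return word_counts, cat_counts, vocab

def classify(text, word_counts, cat_counts, vocab):
    """Pick the category with the highest log-posterior, using
    add-one (Laplace) smoothing so unseen words do not zero out a score."""
    total = sum(cat_counts.values())
    best_cat, best_score = None, float('-inf')
    for cat in cat_counts:
        score = math.log(cat_counts[cat] / total)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for word in text.split():
            score += math.log((word_counts[cat][word] + 1) / denom)
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

model = train(TRAINING)
print(classify('mars probe mapped', *model))  # prints 'science'
```

Even this naive word-counting scheme captures the flavour of the technique: each category accumulates evidence word by word, and the strongest posterior wins.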
I would like to thank you, Gentle Reader, for staying the course and reaching this place of rest. I have written this text in the hopes of sparking useful, and enlightened, discussion. I know that I will not be disappointed. Your comments are welcome, and anxiously awaited.
(1) Mining the 'Deep Web' With Specialized Drills, Lisa Guernsey,
http://partners.nytimes.com/2001/01/25/technology/25SEAR.html or also @ +Forseti's: http://qu00l.net/seeking-nyt.html
(2) Slashdot: News for Nerds. Stuff that matters.,
(3) Pay For Placement?, Danny Sullivan,
(4) Bot Writing, Bot Trapping & Bot Wars: How to search the web, fravia+,
http://www.searchlore.org/bots.htm or http://www.searchlores.org/bots.htm
(5) Resource Description Framework (RDF), W3C,
Information Retrieval on the Web (2000), Mei Kobayashi, Koichi Takeda,
(CiteSeer is an excellent source for computer science papers, spanning text classification technology as well as the future of bots - you won't be disappointed! Also, perhaps do some searches there on `bayesian classifier', `bayesian networks', and `support vector machines' for some text classification algorithms - please note that implementations are forthcoming, to be integrated with the generic bot)
2000 Search Engine Watch Awards: Best Specialty Search, Danny Sullivan,
http://www.searchenginewatch.com/awards/index.html#specialty
(mentions www.moreover.com, and is informative)
Moreover: Business Intelligence and Dynamic Content,
(commercial implementation of the culling of news sources, currently offering free searches of their database - it could be much greater were it Free, our Aim)
Autonomy: Automating the Digital Economy,
(commercial implementation of basic text classification algorithms, aimed at diverse content types to automate the `understanding' of text - see their White Papers for an intro to their tech - a Free implementation will be completed soon)