In this essay I will present a comparison of three strategies for searching for information in a given set of text documents. I will try to show the strengths and weaknesses of each approach, and how the three strategies can build upon and complement each other.
2003-04-25: This is currently in the draft phase, and it's not even complete in its structure. I am still trying to find the best way to lay out my ideas.
2003-04-28: I just recently discovered another approach using 'Contextual Network Graphs' (see Appendix) that promises to give the benefits of LSI while avoiding its computational challenges. I will present this technique as the 'third' approach once I understand it better.
2003-04-30: Azeem Azhar has an excellent summary of the two schools of thought in search engine design, and how they pertain to Google's recently announced purchase of Applied Semantics.
Feedback is welcome via email at
martin at traumwind dot de or as comment to this weblog post
I do not claim to have a very deep understanding of all the
computational details involved in the approaches described here. I do
not have a degree in computer science.
My goal for this essay and the work surrounding it is to make the basic principles understandable to the intermediate scripter. I wish to help others find the information that helped me shed light on something that otherwise seems to be a topic of obscure wizardry.
Baylor Wetzel puts it like this in his excellent article 'Build a Smarter Search Engine': So here's what I ask. Think back on your computer science schooling or go pick up a book on algorithms (no, not a language-specific one but an honest-to-goodness theory book) and see if you can't find something useful in there. And if you do, talk some teammates into trying it out. Hey, while you're at it, why not think of a simple way to explain it and then write an article? Many of us would be happy to read it.
A note on application of the described approaches: My interest here
does not lie with large-scale search-engines for millions of users. I'm
researching this topic for my very personal information-retrieval
needs. I'm looking for solutions that will help me - the user - to make
sense of the information at my disposal.
Searching by keywords is perhaps the best-known and most widespread strategy for finding information on the World Wide Web. Almost anybody who uses the Web has used popular search engines like Google, Yahoo or Altavista or their competitors at least once. For many they are a constant tool in their research tool-box.
Let's take a closer look at what keyword search does and what it offers, before discussing its weaknesses:
Basically a keyword search engine takes a user-input 'query' (a list of words) and tries to find documents containing those words. The trick lies in the way this is done, and how efficient it is. Also important are the criteria by which the resulting list of documents is ranked. A keyword search engine needs:
- First of all, a set of documents. In the case of a single website, this would be all documents found on the site. In the case of Google, the 'set' is optimally every available document on the internet. You can easily see where the optimisation problems and the high tech of search engines lie if you consider the sheer number of documents on the 'net today.
- A naive approach to keyword searching (and one that works rather well on a small scale) would be to actually search all available documents for the given search words each and every time. The Windows 'Find' available in Explorer or the Unix command 'grep' do just that, and are reasonable tools for local searching. I have such a 'brute-force search engine' in use on my weblog. It works rather well and is sufficiently fast. Written in PHP, it is about 80 lines of code, display included.
For larger collections of documents the sheer processing time and cost make such a strategy unwieldy and inefficient. Various techniques exist to optimize this, mostly dealing with pre-generating search indices.
- Querying this document set usually follows these steps:
- split the user query into a list of words.
- for each word in the query search document set for documents containing the word.
- return list of documents, ranked by certain criteria.
- A way of 'ranking' results according to user input. With most large document sets, 'swamping' with results is a common problem. Either the search does not return anything relevant (wrong keywords / keywords not found in the set) or it returns a far too large set of documents to be useful (too unspecific a query, or just too many documents altogether). So there needs to be a way to 'reduce' the result set of documents to those that will most likely be useful to the user.
Common approaches are:
- Rank by the number of words from the query contained in the document ('any word' search). Also often referred to as 'relevance'.
- Rank by date/site/author or any other criteria.
- Advanced queries can also use 'boolean' modifiers. Searching for documents containing one word BUT NOT another or containing any one of a list of words (OR) etc.
- Another advanced feature of many keyword search engines is searching for 'similar words' along with the actual query words. Searching for word stems, allowing for spelling errors, and including synonyms of the query words are common examples.
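To make the above concrete, here is a minimal sketch of such a brute-force 'any word' search with ranking and a boolean NOT modifier. It is in Python rather than the Perl/PHP used elsewhere in this essay, and the tiny document set is invented for illustration:

```python
import re

def tokenize(text):
    # lowercase the text and split it into plain words
    return re.findall(r"[a-z]+", text.lower())

def search(query, documents):
    """Brute-force 'any word' search: scan every document on every
    query, rank by how many query words each document contains."""
    query_words = set(tokenize(query))
    results = []
    for doc_id, text in documents.items():
        hits = query_words & set(tokenize(text))
        if hits:
            results.append((len(hits), doc_id))
    results.sort(reverse=True)   # more matching words = higher rank
    return [doc_id for score, doc_id in results]

def search_not(query, excluded, documents):
    # boolean modifier: drop results that contain the excluded word
    return [d for d in search(query, documents)
            if excluded.lower() not in tokenize(documents[d])]

docs = {
    "a": "Perl makes text search easy",
    "b": "keyword search engines index text documents",
    "c": "cooking with garlic and olive oil",
}
print(search("text search", docs))                 # ['b', 'a']
print(search_not("text search", "keyword", docs))  # ['a']
```

Every query re-reads every document, which is exactly why this only works on a small scale.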
Apart from the technical weaknesses of various implementations - which mostly have to do with a necessary compromise between performance and technical complexity - keyword search has one major drawback: it finds only documents containing the words the user queries for.
So why is that a problem?
Well, to begin with, what if you don't know the topic-specific 'keywords' you should be searching for? Keyword search shines when the query is as specific as possible. You need to know what you are looking for. And more specifically: you need to know the keywords used in the documents you are looking for.
A good example of the counter-intuitive nature of choosing good keywords is this tip from a search-expert:
Imagine you are looking for documents on a given topic, and would like to find only documents that have some 'expertise'. Try to add the keyword 'bibliography' to your search terms.
Now that surely works, and thinking about it one soon realizes why: documents that are serious enough to list a 'bibliography' will mostly be of more merit than those that don't. So even if we are not actually interested in the bibliography, the word itself helps us find documents we are interested in.
What if you have found one document that talks about the topic you are interested in? Having a way to search for 'similar documents' might be a way to find information about a topic you are not (yet) very familiar with (and don't know the proper keywords for).
A technique currently all the rage for its uses in detecting spam could also be useful for detecting similarities between two documents. (The technique is called 'naive Bayesian' because of the simplifying assumption that word occurrences are independent of each other, not because the approach itself is naive.)
This approach basically builds lists of word occurrences for each document, and then compares how many words are common between two documents. Some weighting is involved to make comparisons more valuable. A 'score' of intersections is then calculated to give a numerical 'similarity factor' for two texts.
This technique is very successfully used to 'classify' a document as belonging to one of two groups (spam/non-spam). This is done by building two 'reference sets' of documents (one for is-spam and one for is-not-spam) and then comparing a new document to both of these references.
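As a toy illustration of the word-overlap idea (in Python rather than PHP, and with a weighting heuristic that is my own invention, not the 'real' Bayesian math):

```python
import math
import re
from collections import Counter

def word_counts(text):
    # the per-document word-occurrence list
    return Counter(re.findall(r"[a-z]+", text.lower()))

def similarity(text_a, text_b):
    """Score the intersection of two word lists. The weighting is
    an assumed heuristic: words that occur very often in the pair
    contribute less to the score than rarer shared words."""
    ca, cb = word_counts(text_a), word_counts(text_b)
    score = 0.0
    for word in set(ca) & set(cb):
        score += 1.0 / math.log(2.0 + ca[word] + cb[word])
    return score

def classify(doc, spam_refs, ok_refs):
    # compare the new document against both reference sets,
    # pick the label of the set it resembles more
    spam_score = sum(similarity(doc, r) for r in spam_refs)
    ok_score = sum(similarity(doc, r) for r in ok_refs)
    return "spam" if spam_score > ok_score else "ok"

spam_refs = ["buy cheap pills now", "cheap pills buy now fast"]
ok_refs = ["meeting notes for the project", "project meeting tomorrow"]
print(classify("buy cheap pills", spam_refs, ok_refs))         # spam
print(classify("notes from the meeting", spam_refs, ok_refs))  # ok
```

A real spam filter weighs words by how strongly they predict each class over a training corpus; this sketch only shows the shape of the comparison.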
Some simple experiments with PHP [link nbayes.php] show that this works reasonably well on a one-on-one basis. For a 'search' - which would mean comparing a 'query document' against each and every document in the collection - this quickly becomes unfeasible, for very much the same reasons as the 'brute-force' keyword search mentioned above. Some pre-generation of word-occurrence lists etc. should speed things up, which would also be useful for 'simple' keyword search. So maybe this approach could 'build upon' the indices that were built for a keyword search.
So where is the problem with this naive bayesian approach?
Well, to start with, it again only counts actual word-for-word matches. So, to over-simplify, it simply lists how many words appear in both documents to give a similarity rating. Agreed, this is a crass simplification; careful weighting of words (rating very common words less and significant words higher, accounting for a word-weight to document-length ratio etc.) pushes this very far along, and it has some very powerful uses.
One very strong point of this approach, on the other hand, is the fact that it can be implemented with the basic string and list handling routines available in most scripting languages alone, without the need to resort to C-based extensions (the same is true for the brute-force keyword search described above).
An advanced version of the naive Bayesian comparison is based on a so-called 'VectorSpace' model. This is an approach that lives in the obscurity of numerical computing, as it involves some advanced vector/matrix computations usually reserved for the mathematical sciences. But a recent perl.com article [link!] has changed this, and has moved this interesting technique into the grasp of scripting languages (also thanks to Perl's PDL library).
Although the mentioned article and the accompanying code make some claims that I don't fully agree with, the underlying techniques are very valuable, and I have implemented a very useful 'similar entries' feature for my weblog using them.
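A bare-bones sketch of the VectorSpace idea: each document becomes a vector of term weights, and similarity is the cosine of the angle between two vectors. Plain Python dictionaries stand in for PDL's matrices here, and the tf-idf weighting follows the common textbook convention, not necessarily the article's exact formulas:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def build_vectors(documents):
    """Turn each document into a tf-idf weighted term vector:
    term frequency times log(N / number of docs containing the term)."""
    tokenized = {d: tokenize(t) for d, t in documents.items()}
    n = len(documents)
    df = Counter(w for words in tokenized.values() for w in set(words))
    vectors = {}
    for d, words in tokenized.items():
        tf = Counter(words)
        vectors[d] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vectors

def cosine(a, b):
    # cosine similarity of two sparse vectors (dicts)
    dot = sum(weight * b.get(w, 0.0) for w, weight in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similar_entries(doc_id, documents):
    """Rank all other documents by similarity - a 'similar entries' list."""
    vecs = build_vectors(documents)
    scores = [(cosine(vecs[doc_id], vecs[d]), d)
              for d in documents if d != doc_id]
    scores.sort(reverse=True)
    return [d for score, d in scores]

docs = {
    "p1": "perl scripting and text processing",
    "p2": "text processing with perl scripts",
    "p3": "gardening tips for spring flowers",
}
print(similar_entries("p1", docs))   # ['p2', 'p3']
```

LSI goes one step further and decomposes the whole document/term matrix, which is where the heavy numerical machinery comes in; this sketch stops at the plain vector model.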
- cleaned documents: the first step for each approach is to 'clean' the documents of 'noise' (markup etc.) which does not contribute to the 'character' of a given document. (Many implementations even go so far as to reduce the number of unique words by 'stemming' and omitting common 'stop words'.)
- per document word-list: what unique words are contained in what document and in what weight (amount, position)?
- from those document-word lists - which require some computing to generate - it is rather easy to generate the classical reverse-keyword index. This is an index listing all documents that contain any given keyword.
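The shared preparation steps above can be sketched as follows (the stop-word list and the crude markup-stripping regex are stand-ins for illustration, not a recommendation):

```python
import re
from collections import defaultdict

# an assumed, deliberately tiny stop-word list
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def clean(markup):
    """Strip markup 'noise' and reduce the text to lowercase words."""
    text = re.sub(r"<[^>]+>", " ", markup)       # crude tag removal
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_indices(documents):
    """From the per-document word lists, derive the classical
    reverse-keyword index: keyword -> set of documents containing it."""
    word_lists = {d: clean(t) for d, t in documents.items()}
    reverse_index = defaultdict(set)
    for d, words in word_lists.items():
        for w in words:
            reverse_index[w].add(d)
    return word_lists, dict(reverse_index)

docs = {"x": "<p>The Perl script</p>", "y": "<h1>Script output</h1>"}
word_lists, index = build_indices(docs)
print(index["script"])   # {'x', 'y'}
```

The word lists feed the similarity approaches, while the reverse index answers keyword lookups without scanning any document - one preparation pass serving all three methods.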
So a lot of the preparation needed for the 'similarity' approaches can also be used for a keyword-based search. Set up properly, this can be used to accommodate each of the three search methods. In that way each can play its strengths to help where another approach falls short.
Keyword search, and especially the reverse-keyword index, is the best-known and most tried-and-tested technique for website search. Naive Bayesian or VectorSpace search are rather obscure and experimental at the moment (there are some very advanced applications in the commercial sector using LSI, though).
The main problems with the 'similarity search' approaches are mainly questions of optimization. The same is true for keyword search, but that has been used and implemented for some time longer, so there's more material and experience there.
One of the main problems with any optimized search approach is that of being able to add new documents 'on the fly'. Often it is necessary to re-index the whole document set to add new material. For small, slow-changing sets this is no big deal, but just take a small number of fast-moving news sites (weblogs) and even advanced search engines like ht://Dig are hard pressed to cope. And if you'd like to have (almost-)realtime addition of content to the index, you are looking at some serious problem-solving.
The VectorSpace approach is especially hard here, as the document/word matrix can only be computed once all contained documents have been added, and it will completely change with each addition. That makes 'incremental' addition look really hard.
I would also like to mention another approach that is fairly new to me, and hasn't been well documented yet: 'Semantic Search of Unstructured Data using Contextual Network Graphs'. A link to an introductory PDF can be found in the Appendix to this essay. I can't currently say very much about this approach, as I still need to read up on it. But from the introductory text on the website, it seems to address some of the shortcomings of VectorSpace / LSI (being flexible and easily updateable, better scalability etc.), with the additional bonus that it is not covered by a patent (see LSI), but is real open source. That point at least should give it some real power and influence for uses on the internet.
- An example of a simple keyword search engine - no database, just a deep faith in Perl's ability to parse files quickly.
- A slightly dated (1999) but still very relevant primer on building a reverse-index keyword search engine in Perl from Dr Dobbs Magazine: Full-Text Searching in Perl by Tim Kientzle (the actual article is only available for pay nowadays)
- building a VectorSpace search engine in Perl
- Finding Out About (FOA): a book discussing very many of the topics involved here, with some sample code etc. (There's also a rundown HTML version that can give a good overview of what's in the book.)
- Semant-O-Matic - an experimental LSI search engine indexing about 1400+ Weblog posts from the time between March - June 2002
- Semantic Search of Unstructured Data using Contextual
Network Graphs (PDF)
Martin Spernau (1970) is a webdeveloper and avid blogger from Germany with several years of experience in Perl/PHP programming. He has written several site-search tools for clients, and regularly experiments with alternative ways of making information accessible.
- TGGoogleBrowser - A visualisation tool for the Google 'related pages' feature using a Java applet by TouchGraph
- The 'similar entries' feature for traumwind.de/blog/ is an experiment to see how the VectorSpace approach can be put to practical use.
- A big source of shame for him is the fact that his personal website and blog to date do not feature a working keyword search engine. Martin blames this on not yet having found or written one that works the way he wants it to. This essay - and the planned code to accompany it - is one big step in the right direction.
[update:] There is now a 'brute force keyword search' on 'Sender Traumwind', with viewable source
This text is copyright 2003 by Martin Spernau, all rights reserved. Permission to quote is granted as long as the author is credited.