Three approaches to searching

In this essay I will compare three strategies for searching for information in a given set of text documents. I will try to show the strengths and weaknesses of each approach, and how the three strategies can build upon and complement each other.

Status of this Document:

2003-04-25: This is currently in the draft phase, and it's not even complete in its structure. I am still trying to find the best way to lay out my ideas.

2003-04-28: I just recently discovered another approach using 'Contextual Network Graphs' (see Appendix) that promises to give the benefits of LSI while avoiding its computational challenges. I will present this technique as the 'third' approach once I understand it better.

2003-04-30: Azeem Azhar has an excellent summary of the two schools of thought in search engine design, and how they pertain to Google's recently announced purchase of Applied Semantics.

Feedback is welcome via email at martin at traumwind dot de or as a comment to this weblog post.

A Disclaimer or two

I do not claim to have a very deep understanding of all the computational details involved in the approaches described here. I do not have a degree in computer science.

My goal for this essay and the work surrounding it is to make the basic principles understandable to the intermediate scripter. I wish to help others find the information that helped me shed light on something that otherwise seems to be a topic of obscure wizardry.

Baylor Wetzel puts it like this in his excellent article 'Build a Smarter Search Engine': "So here's what I ask. Think back on your computer science schooling or go pick up a book on algorithms (no, not a language-specific one but an honest-to-goodness theory book) and see if you can't find something useful in there. And if you do, talk some teammates into trying it out. Hey, while you're at it, why not think of a simple way to explain it and then write an article? There are many of us who would be happy to read it."

A note on application of the described approaches: My interest here does not lie with large-scale search-engines for millions of users. I'm researching this topic for my very personal information-retrieval needs. I'm looking for solutions that will help me - the user - to make sense of the information at my disposal.

Keyword search

Searching by keywords is perhaps the best-known and most widespread strategy for finding information on the World Wide Web. Almost anybody who uses the Web has used popular search engines like Google, Yahoo or Altavista or one of their competitors at least once. For many they are a constant tool in their research tool-box.

Let's take a closer look at what keyword search does and what it offers, before discussing its weaknesses:

Goal: Find documents containing a user-supplied set of words

Basically, a keyword search engine takes a user-supplied 'query' (a list of words) and tries to find documents containing those words. The trick lies in the way this is done, and how efficient it is. Also important are the criteria by which the resulting list of documents is 'ranked'.
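
To make this a bit more concrete, here is a minimal sketch of a lookup against a reverse (inverted) index, written in plain PHP. The document names and texts are made up purely for illustration, and a real engine would of course add ranking, stemming and so on on top of this.

    <?php
    // Minimal reverse (inverted) index sketch: for each word we keep a list
    // of the documents it appears in. All names and texts are made up.
    $documents = array(
        'doc1' => 'the quick brown fox jumps over the lazy dog',
        'doc2' => 'a lazy afternoon with a good book',
        'doc3' => 'the fox hunts at night',
    );

    // Build the index: word => list of document ids
    $index = array();
    foreach ($documents as $id => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            $index[$word][] = $id;
        }
    }

    // A query is answered by intersecting the document lists of its words
    function keyword_search($query, $index) {
        $words = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
        $result = null;
        foreach ($words as $word) {
            $docs = isset($index[$word]) ? $index[$word] : array();
            $result = ($result === null) ? $docs : array_intersect($result, $docs);
        }
        return ($result === null) ? array() : array_values($result);
    }

    print_r(keyword_search('lazy fox', $index)); // only doc1 contains both words
    ?>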

What is needed to supply keyword searching?

Weaknesses of keyword search

Apart from the technical weaknesses of various implementations - which mostly come down to a necessary compromise between performance and technical complexity - keyword search has one major drawback: it finds only documents containing the words the user queries for.

So why is that a problem?
Well, to begin with, what if you don't know the topic-specific 'keywords' you should be searching for? Keyword search shines when the query is as specific as possible. You need to know what you are looking for. And most specifically: you need to know the keywords used in the documents you are looking for.

A good example of the counter-intuitive nature of choosing good keywords is this tip from a search-expert:
Imagine you are looking for documents on a given topic, and would like to find only documents that have some 'expertise'. Try to add the keyword 'bibliography' to your search terms.
Now that surely works, and thinking about it one soon realizes why: documents that are serious enough to list a 'bibliography' will mostly be of more merit than those that don't. So even if we are not actually interested in the bibliography, the word itself helps us find documents we are interested in.

An alternative: similarity search

What if you have found one document that talks about the topic you are interested in? Having a way to search for 'similar documents' might help you find information about a topic you are not (yet) very familiar with (and don't know the proper keywords for).

A naive approach: naive Bayesian comparison

A technique currently all the rage for its use in detecting spam could also be useful for detecting similarities between two documents.
(This technique is called 'naive Bayesian' not because it is a naive approach, but because it uses something that is not really Bayesian...)
This approach basically builds lists of word occurrences for each document, and then compares how many words are common between two documents. Some weighting is involved to make the comparisons more valuable. A 'score' of intersections is then calculated to give a numerical 'similarity factor' for two texts.
This technique is very successfully used to 'classify' a document as belonging to one of two groups (spam/non-spam). This is done by building two 'reference sets' of documents (one for is-spam and one for is-not-spam) and then comparing a new document to both of these references.

Some simple experiments with PHP [link nbayes.php] show that this works reasonably well on a one-on-one basis. For a 'search' - which would mean comparing a 'query document' against each and every document in the collection - this quickly becomes unfeasible, for much the same reasons as the 'brute-force' keyword search mentioned above. Some pre-generation of word-occurrence lists etc. should speed things up, and would also be useful for 'simple' keyword search. So maybe this approach could 'build upon' the indices that were built for a keyword search.
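
For illustration only - the following is not the nbayes.php script linked above, just a rough PHP sketch of the word-overlap idea, with made-up example texts and a deliberately crude scoring rule (no weighting yet):

    <?php
    // Each document is reduced to a word-count list; the 'similarity' is the
    // sum of the counts the two lists have in common. Texts are made up.
    function word_counts($text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        return array_count_values($words);
    }

    function similarity($counts_a, $counts_b) {
        $score = 0;
        foreach ($counts_a as $word => $count) {
            if (isset($counts_b[$word])) {
                // credit each shared word with the smaller of its two counts
                $score += min($count, $counts_b[$word]);
            }
        }
        return $score;
    }

    $a = word_counts('the fox hunts the rabbit at night');
    $b = word_counts('at night the owl hunts too');
    echo similarity($a, $b); // shared words: 'the', 'hunts', 'at', 'night'
    ?>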

So where is the problem with this naive Bayesian approach?
Well, to start with, it again only counts actual word-for-word matches. To over-simplify: it simply counts how many words appear in both documents to give a similarity rating. Agreed, this is a crass simplification - careful weighting of words (rating very common words lower and significant words higher, accounting for the word-weight to document-length ratio, etc.) pushes this quite far, and it has some very powerful uses.
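
To make the weighting idea a bit more concrete: one common choice is an 'inverse document frequency' style factor, where a word that appears in almost every document contributes very little. The formula and the numbers below are just an illustration, not a claim about any particular implementation:

    <?php
    // A word's weight falls the more documents it appears in.
    function word_weight($word, $doc_frequency, $total_docs) {
        $df = isset($doc_frequency[$word]) ? $doc_frequency[$word] : 1;
        return log($total_docs / $df);
    }

    // Made-up example: in a 1000-document collection 'the' appears in 950
    // documents, 'bibliography' in 12. 'the' gets a weight near zero,
    // 'bibliography' a much higher one.
    $doc_frequency = array('the' => 950, 'bibliography' => 12);
    printf("the: %.2f  bibliography: %.2f\n",
        word_weight('the', $doc_frequency, 1000),
        word_weight('bibliography', $doc_frequency, 1000));
    ?>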

One very strong point of this approach, on the other hand, is the fact that it can be implemented with the basic string and list handling routines available in most scripting languages alone, without the need to resort to C-based extensions (the same is true for the keyword/reverse-index search).

A better solution: similarity through VectorSpace

An advanced version of the naive Bayesian comparison is based on a so-called 'VectorSpace' model. This is an approach that lives in the obscurity of numerical computing, as it involves some advanced vector/matrix computations usually reserved for the mathematical sciences. But thanks to a recent perl.com article [link!] this has changed, and the technique has moved within the grasp of scripting languages (thanks also to Perl's PDL library).
Although the mentioned article and the accompanying code make some claims that I don't fully agree with, the underlying techniques are very valuable, and I have implemented a very useful 'similar entries' feature for my weblog using it.
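
To give a feel for the underlying idea without PDL or any matrix library, here is a minimal PHP sketch: every document becomes a vector of word counts over a shared vocabulary, and similarity is measured as the cosine of the angle between two such vectors. (The full VectorSpace/LSI machinery adds term weighting and dimension reduction on top of this; the example texts are made up.)

    <?php
    // Turn a text into a vector of word counts over a fixed vocabulary.
    function term_vector($text, $vocabulary) {
        $counts = array_count_values(
            preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY));
        $vector = array();
        foreach ($vocabulary as $word) {
            $vector[] = isset($counts[$word]) ? $counts[$word] : 0;
        }
        return $vector;
    }

    // Cosine of the angle between two vectors: 1.0 = identical direction,
    // 0.0 = no words in common.
    function cosine($a, $b) {
        $dot = 0; $len_a = 0; $len_b = 0;
        foreach ($a as $i => $value) {
            $dot   += $value * $b[$i];
            $len_a += $value * $value;
            $len_b += $b[$i] * $b[$i];
        }
        if ($len_a == 0 || $len_b == 0) return 0;
        return $dot / (sqrt($len_a) * sqrt($len_b));
    }

    $texts = array(
        'doc1' => 'the fox hunts the rabbit',
        'doc2' => 'the owl hunts at night',
        'doc3' => 'a good book for a lazy afternoon',
    );

    // Shared vocabulary over the whole collection
    $vocabulary = array();
    foreach ($texts as $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        $vocabulary = array_merge($vocabulary, $words);
    }
    $vocabulary = array_values(array_unique($vocabulary));

    $v1 = term_vector($texts['doc1'], $vocabulary);
    $v2 = term_vector($texts['doc2'], $vocabulary);
    $v3 = term_vector($texts['doc3'], $vocabulary);

    // doc1 and doc2 share some words, doc1 and doc3 share none
    printf("doc1/doc2: %.2f  doc1/doc3: %.2f\n", cosine($v1, $v2), cosine($v1, $v3));
    ?>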

What is a VectorSpace?

Finding 'similar' text that share almost no common keywords

So how can these three approaches be combined?

Same prerequisites

So a lot of the preparation needed for the 'similarity' approaches can also be used for a keyword-based search. Set up properly, this can accommodate each of the three search methods. In that way each can play to its strengths and help where another approach falls short.
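
As a rough sketch of what 'set up properly' could look like: the per-document word-count lists are built once, and both the reverse index for keyword search and the raw material for the similarity comparisons are derived from that single structure (the names and texts are again made up for illustration):

    <?php
    // One pass over the collection produces the word-count lists...
    function word_counts($text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        return array_count_values($words);
    }

    $documents = array(
        'doc1' => 'the fox hunts the rabbit',
        'doc2' => 'the owl hunts at night',
    );

    $counts = array();
    foreach ($documents as $id => $text) {
        $counts[$id] = word_counts($text);
    }

    // ...from which the reverse index for keyword search is derived...
    $index = array();
    foreach ($counts as $id => $words) {
        foreach ($words as $word => $n) {
            $index[$word][] = $id;
        }
    }
    print_r($index['hunts']); // both documents contain 'hunts'

    // ...while the same $counts arrays are the input for the word-overlap
    // and VectorSpace comparisons sketched earlier.
    ?>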

Conclusion

Keyword search, and especially the reverse keyword index, is the best-known and most tried-and-tested technique for website search. Naive Bayesian or VectorSpace search are rather obscure and experimental at the moment (there are some very advanced applications in the commercial sector using LSI, though).

The problems with the 'similarity search' approaches are mainly questions of optimization. The same is true for keyword search, but that has been used and implemented for somewhat longer, so there is more material and experience available.

One of the main problems with any optimized search approach is being able to add new documents 'on the fly'. Often it is necessary to re-index the whole document set to add new material. For small, slowly changing sets this is no big deal, but take just a small number of fast-moving news sites (weblogs) and even advanced search engines like ht://dig are hard pressed to cope. And if you'd like to have (almost) realtime addition of content to the index, you are looking at some serious problem-solving.

The VectorSpace approach is especially hard here, as the document/word matrix can only be computed once all contained documents have been added, and it will completely change with each addition. That makes 'incremental' addition look really hard.

On the Horizon

I would also like to mention another approach, which is fairly new to me and hasn't been well documented yet: 'Semantic Search of Unstructured Data using Contextual Network Graphs'. A link to an introductory PDF can be found in the Appendix to this essay. I can't currently say very much about this approach, as I still need to read up on it. But from the introductory text on the website, it seems to address some of the shortcomings of VectorSpace / LSI (being flexible and easily updatable, offering better scalability, etc.), with the additional bonus that it is not covered by a patent (see LSI) but is real open source. That point at least should give it some real power and influence for uses on the internet.


Appendix

Sources / further reading

The Author

Martin Spernau (1970) is a web developer and avid blogger from Germany with several years of experience in Perl/PHP programming. He has written several site-search tools for clients, and regularly experiments with alternative ways of making information accessible.

Legal / Copyright

This text is copyright 2003 by Martin Spernau, all rights reserved. Permission to quote is given as long as the author is credited.