In this essay I will compare three strategies for searching for information in a given set of text documents. I will try to show the strengths and weaknesses of each approach, and how the three strategies can build upon and complement each other.
Status of this Document:
2003-04-25: This is currently in the draft phase, and it's not even complete in its structure. I am still trying to find the best way to lay out my ideas.
2003-04-28: I just recently discovered another approach using 'Contextual Network Graphs' (see Appendix) that promises to give the benefits of LSI, but avoids its computational challenges. I will present this technique as the 'third' approach once I understand it better.
2003-04-30: Azeem Azhar has an excellent summary of the two schools of thought in search engine design, and how they pertain to Google's recently announced purchase of Applied Semantics.
Feedback is welcome via email at martin at traumwind dot de
or as comment to this weblog post
A Disclaimer or two
I do not claim to have a very deep understanding of all the
computational details involved in the approaches described here. I do
not have a degree in computer science.
My goal for this essay and the work surrounding it is to make the basic principles understandable to the intermediate scripter. I wish to help others find the information that helped me shed light on something that otherwise seems to be a topic of obscure wizardry.
Baylor Wetzel puts it like this in his excellent article 'Build a Smarter Search Engine': So here's what I ask. Think back on your computer science schooling or go pick up a book on algorithms (no, not a language-specific one but an honest-to-goodness theory book) and see if you can't find something useful in there. And if you do, talk some teammates into trying it out. Hey, while you're at it, why not think of a simple way to explain it and then write an article? Many of us would be happy to read it.
A note on application of the described approaches: My interest here
does not lie with large-scale search-engines for millions of users. I'm
researching this topic for my very personal information-retrieval
needs. I'm looking for solutions that will help me - the user - to make
sense of the information at my disposal.
Keyword search
Searching by keywords is perhaps the best known and most widespread strategy for finding information on the World Wide Web. Almost anybody who uses the Web has used a popular search engine like Google, Yahoo or Altavista or one of their competitors at least once. For many they are a constant tool in their research tool-box.
Let's take a closer look at what keyword search does and what it offers, before discussing its weaknesses:
Goal: Find documents containing a user-supplied set of words
Basically a keyword search engine takes a user-input 'query' (a list of words) and tries to find documents containing those words. The trick lies in the way this is done, and how efficient it is. Also important are the criteria by which the resulting list of documents is 'ranked'.
What is needed to provide keyword searching?
- First of all, a set of documents. In the case of a single website, this would be all documents found on the site. In the case of Google, the 'set' is optimally every available document on the internet. You can easily see where the optimisation problems and the high tech of search engines lie if you consider the sheer number of documents on the 'net today.
- A naive approach to keyword searching (and one that works rather well on a small scale) would be to actually search all available documents for the given search words each and every time. The Windows 'Find' available in Explorer or the Unix command 'grep' do just that, and are reasonable tools for local searching. I have such a 'brute-force search engine' in use on my weblog. It works rather well and is sufficiently fast. Written in PHP it is about 80 lines of code, display included. For larger collections of documents the sheer processing time and cost will make such a strategy unwieldy and inefficient. Various techniques exist to optimize this, mostly dealing with pre-generating search indices. (A sketch of such a brute-force search, including a simple ranking, follows after this list.)
- Querying this document set usually follows these steps:
- split the user query into a list of words.
- for each word in the query search document set for documents containing the word.
- return list of documents, ranked by certain criteria.
- A way of 'ranking' results according to user input. With most large document sets, 'swamping' with results is a common problem. Either the search does not return anything relevant (wrong keywords / keywords not found in the set) or it returns a far too large set of documents to be useful (too unspecific a query, or just too many documents altogether). So there needs to be a way to 'reduce' the result set of documents to those that will most likely be useful to the user.
Common approaches are:
- Rank by the number of words from the query contained in the document ('any word' search). Also often referred to as 'relevance'
- Rank by date/site/author or any other criteria.
- Advanced queries can also use 'boolean' modifiers. Searching for documents containing one word BUT NOT another or containing any one of a list of words (OR) etc.
- Another advanced feature of many keyword search engines is searching for 'similar words' along with the actual query-words. Searching for word stems, allowing for spelling errors or synonyms of the query words are common.
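To make this a bit more concrete, here is a minimal sketch of such a brute-force search with the simple 'any word' ranking. This is not the code from my weblog, just the bare idea; the example documents and the function name are made up for illustration.

<?php
// A minimal sketch of the brute-force keyword search described above,
// with the simple 'any word' ranking. The example documents are made up
// for illustration; a real site search would read its own files here.
$documents = array(
    'post-001' => 'Notes on keyword search and reverse indices',
    'post-002' => 'An experiment with VectorSpace similarity search',
);

function keyword_search($query, $documents) {
    // split the user query into a list of lowercase words
    $words = preg_split('/\W+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    $results = array();
    foreach ($documents as $name => $text) {
        $text  = strtolower($text);
        $score = 0;
        // 'any word' search: count how many query words the document contains
        foreach ($words as $word) {
            if (strpos($text, $word) !== false) {
                $score++;
            }
        }
        if ($score > 0) {
            $results[$name] = $score;
        }
    }
    // rank by the number of query words found ('relevance')
    arsort($results);
    return $results;
}

print_r(keyword_search('vectorspace search', $documents));
?>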
Weaknesses of keyword search
Apart from the technical weaknesses of various implementations - which mostly have to do with a necessary compromise between performance and technical complexity - keyword search has one major drawback: it finds only documents containing the words the user queries for.
So why is that a problem?
Well, to begin with, what if you don't know the topic specific
'keywords' you should be searching for? Keyword search shines when the
query can be as specific as possible. You need to know what you are
looking for. And most specifically: you need to know the keywords used
in the documents you are looking for.
A good example of the counter-intuitive nature of choosing good keywords is this tip from a search-expert:
Imagine you are looking for documents on a given topic, and would
like to find only documents that have some 'expertise'. Try to add the
keyword 'bibliography' to your search terms.
Now that surely works, and thinking about it one soon realizes why: documents that are serious enough to list a 'bibliography' will mostly be of more merit than those that don't. So even if we are not actually interested in the bibliography, the word itself helps us find documents we are interested in.
An alternative: similarity search
What if you have found one document that talks about the topic you are interested in? Having a way to search for 'similar documents' might be a way to find information about a topic you are not (yet) very familiar with (and don't know the proper keywords for).
A naive approach: naive Bayesian comparison
A technique currently all the rage for its uses in detecting spam could also be useful for detecting similarities between two documents.
(This technique is called 'naive Bayesian' not because it is a naive approach, but because it uses something that is not really Bayesian...)
This approach basically builds lists of word occurrences for each document, and then compares how many words are common between two documents. Some weighting is involved to make comparisons more valuable. A 'score' of intersections is then calculated to give a numerical 'similarity factor' for two texts.
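Here is a minimal sketch of such a word-overlap comparison. This is not the code from my nbayes.php experiments, just the bare idea; the crude normalisation by the smaller word list is my own simplification, and real implementations weight the words much more carefully.

<?php
// Crude sketch of a word-overlap 'similarity factor' between two texts.
// Real implementations weight the words; here every shared word counts as 1.
function word_counts($text) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_count_values($words);   // word => number of occurrences
}

function similarity($text_a, $text_b) {
    $a = word_counts($text_a);
    $b = word_counts($text_b);
    $shared = array_intersect_key($a, $b);   // words occurring in both texts
    // normalise by the size of the smaller word list so short texts
    // are not automatically 'dissimilar'
    $smaller = min(count($a), count($b));
    return $smaller ? count($shared) / $smaller : 0;
}

echo similarity("the quick brown fox", "a quick red fox");  // prints 0.5
?>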
This technique is very successfully used to 'classify' a document as belonging to one of two groups (spam/non-spam). This is done by building two 'reference sets' of documents (one for is-spam and one for is-not-spam) and then comparing a new document to both of these references.
Some simple experiments with PHP [link nbayes.php] show that this works reasonably well on a one-on-one basis. For a 'search' - which would mean comparing a 'query document' against each and every document in the collection - this quickly becomes unfeasible, for very much the same reasons as for the 'brute-force' keyword search mentioned above. Some pre-generation of word-occurrence lists etc. should speed things up, and would also be useful for 'simple' keyword search. So maybe this approach could 'build upon' the indices that were built for a keyword search.
So where is the problem with this naive bayesian approach?
Well, to start with, it again only counts actual word-for-word matches. So, to over-simplify, it simply counts how many words appear in both documents to give a similarity rating. Agreed, this is a crass simplification: careful weighting of words (rating very common words less and significant words higher, accounting for a word-weight to document-length ratio etc.) pushes this very far along, and it has some very powerful uses.
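To give an idea of what such weighting can look like, here is a sketch of the 'rare words count for more' idea (known as TF-IDF in the literature). The shape of the $doc_word_lists array is my own assumption, not taken from any particular implementation.

<?php
// Sketch of the 'rare words count for more' weighting idea (TF-IDF style).
// $doc_word_lists is assumed to be an array of (word => count) lists,
// one per document, e.g. as produced by word_counts() above.
function weight($word, $word_counts, $doc_word_lists) {
    if (!isset($word_counts[$word])) {
        return 0;
    }
    // in how many documents of the whole set does the word appear?
    $docs_with_word = 0;
    foreach ($doc_word_lists as $counts) {
        if (isset($counts[$word])) {
            $docs_with_word++;
        }
    }
    if ($docs_with_word == 0) {
        return 0;
    }
    // frequency of the word relative to the document's length ...
    $tf = $word_counts[$word] / array_sum($word_counts);
    // ... damped by how common the word is across the whole document set
    $idf = log(count($doc_word_lists) / $docs_with_word);
    return $tf * $idf;
}
?>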
One very strong point of this approach, on the other hand, is the fact that it can be implemented with the basic string and list handling routines available in most scripting languages alone, without the need to resort to C-based extensions (the same is true for the keyword/reverse-index search).
A better solution: similarity through VectorSpace
An advanced version of the naive Bayesian comparison is based on a so-called 'VectorSpace' model. This is an approach that lives in the obscurity of numerical computing, as it involves some advanced vector/matrix computations usually reserved for the mathematical sciences. But a recent perl.com article [link!] has changed that and moved this interesting technique into the grasp of scripting languages (also thanks to Perl's PDL library).
Although the mentioned article and the accompanying code make some claims that I don't fully agree with, the underlying techniques are very valuable, and I have implemented a very useful 'similar entries' feature for my weblog using them.
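Stripped of PDL and of LSI's dimension reduction, the heart of the VectorSpace model is comparing word-weight vectors by the angle between them. Here is a bare-bones sketch of that cosine comparison; the vector layout as a (word => weight) array is my assumption, this is not the article's code.

<?php
// The underlying idea of the VectorSpace model, without the PDL matrix
// machinery: every document becomes a vector of word weights, and two
// documents are 'similar' when the angle between their vectors is small
// (cosine close to 1).
function cosine_similarity($vec_a, $vec_b) {
    $dot = 0; $norm_a = 0; $norm_b = 0;
    foreach ($vec_a as $word => $weight) {
        $norm_a += $weight * $weight;
        if (isset($vec_b[$word])) {
            $dot += $weight * $vec_b[$word];
        }
    }
    foreach ($vec_b as $weight) {
        $norm_b += $weight * $weight;
    }
    if ($norm_a == 0 || $norm_b == 0) {
        return 0;
    }
    return $dot / (sqrt($norm_a) * sqrt($norm_b));
}

// $vec_a and $vec_b would be (word => weight) arrays, e.g. raw counts
// or the weighted counts from the sketch further up.
?>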
What is a VectorSpace?
Finding 'similar' texts that share almost no common keywords
So how can these three approaches be combined?
Same prerequisites
- cleaned documents: the first step for each approach is to 'clean' the documents of 'noise' (markup etc.) which does not contribute to the 'character' of a given document. (Many implementations even go so far as to reduce the number of unique words by 'stemming' and omitting 'stopwords'.)
- per-document word list: which unique words are contained in which document, and with what weight (count, position)?
- from those document-word lists - which require some computing to generate - it is rather easy to generate the classical reverse-keyword index. This is an index listing all documents that contain any given keyword.
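As a sketch of how this falls out of the per-document word lists, here is how such a reverse index could be built; the shape of $doc_word_lists is again my own assumption.

<?php
// Turning per-document word lists into a classical reverse-keyword index.
// $doc_word_lists is assumed to map document name => (word => count),
// e.g. built with word_counts() from the sketch further up.
function build_reverse_index($doc_word_lists) {
    $index = array();
    foreach ($doc_word_lists as $doc => $counts) {
        foreach ($counts as $word => $count) {
            $index[$word][] = $doc;   // word => list of documents containing it
        }
    }
    return $index;
}

// A keyword query then only needs one array lookup per query word
// instead of scanning every document.
?>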
So a lot of the preparation needed for the 'similarity' approaches can also be used for a keyword-based search. Set up properly, this can be used to accommodate each of the three search methods. In that way each can play its strengths to help where another approach falls short.
Conclusion
Keyword search, and especially the reverse-keyword index, is the best known and most tried-and-tested technique for website search. Naive Bayesian or VectorSpace search are rather obscure and experimental at the moment (there are some very advanced applications in the commercial sector using LSI, though).
The main problems with the 'similarity search' approaches are mainly questions of optimization. The same is true for keyword search, but that has been used and implemented for somewhat longer, so there's more material and experience there.
One of the main problems with any optimized search approach is that of being able to add new documents 'on the fly'. Often it is necessary to re-index the whole document set to add new material. For small, slowly changing sets this is no big deal, but take just a small number of fast-moving news sites (weblogs) and even advanced search engines like ht://dig are hard pressed to cope. And if you'd like to have (almost-)realtime addition of content to the index, you are looking at some serious problem-solving.
The VectorSpace approach is especially hard here, as the document/word matrix can only be computed once all contained documents have been added, and it will completely change with each addition. That makes 'incremental' addition look really hard.
On the Horizon
I would also like to mention another approach that is fairly new to me, and hasn't been well documented yet: 'Semantic Search of Unstructured Data using Contextual Network Graphs'. A link to an introductory PDF can be found in the Appendix to this essay. I can't currently say very much about this approach, as I still need to read up on it. But from the introductory text on the website, it seems it tries to address some of the shortcomings of VectorSpace / LSI (being flexible and easily updatable, offering better scalability etc.), with the additional bonus that it is not covered by a patent (see LSI) but is real open source. That point at least should give it some real power and influence for uses on the internet.
Appendix
Sources / further reading
- An example of a simple keyword search engine - no database, just a deep faith in Perl's ability to parse files quickly.
- A slightly dated (1999) but still very relevant primer on building a reverse-index keyword search engine in Perl from Dr Dobbs Magazine: Full-Text Searching in Perl by Tim Kientzle (the actual article is only available for pay nowadays)
- building a VectorSpace search engine in Perl
- Finding Out About (FOA): a book discussing very many of the topics involved here, with some sample code etc. (There's also a rundown HTML version that can give a good overview of what's in the book)
- Semant-O-Matic - an experimental LSI search engine indexing 1400+ weblog posts from March to June 2002
- Semantic Search of Unstructured Data using Contextual
Network Graphs (PDF)
The Author
Martin Spernau (1970) is a web developer and avid blogger from Germany with several years of experience in Perl/PHP programming. He has written several site-search tools for clients, and regularly experiments with alternative ways of making information accessible.
- TGGoogleBrowser - A visualisation tool for the Google 'related pages' feature using a Java applet by TouchGraph
- The 'similar entries' feature for traumwind.de/blog/ is an experiment to see how the VectorSpace approach can be put to practical use.
- A big source of shame for him is the fact that his personal website and blog to date does not feature a working keyword search engine. Martin blames this on not yet having found or written one that works the way he wants it to. This essay - and the planned code to accompany it - is one big step in the right direction.
[update:] There is now a 'brute force keyword search' on 'Sender Traumwind', with viewable source
Legal / Copyright
This text is copyright 2003 by Martin Spernau, all rights reserved. Permission to quote is given as long as the author is credited.