Collaborative Filtering - (almost) for free
One of the bigger projects I have been working on in the recent past that came through my beloved RentACoder (more on why I love it so later) was a request to build 'Collaborative Filtering' functionality for a series of news-oriented weblogs. I'm calling it 'almost for free' because the system needed only minimal changes to the weblogs themselves to function. I will try to outline the system and the concepts behind it briefly here. My agreement with the buyer prevents me from posting actual code, but maybe that's better anyway (it's not much code after all). Please feel free to ask any questions you might have directly.
The basic requirement
The code was to provide automatic URL recommendations for each single post in the weblogs, based on usage history (aka clicktracks). The idea was to have something akin to Amazon's 'people who bought this book also bought' feature, only for visited URLs. For each page the system was to display a list of further URLs that the user might also be interested in. To make things simpler and to be able to pre-generate the recommendations, the usage pattern of the current user would not be (directly) taken into consideration for the recommendations.
Another requirement was that the system must not require any kind of special action on the part of either the website visitor or the website maintainer. The recommendation system should be (almost) automatic. The visitor should not even need to be aware of it, and the maintainer should have zero extra work to keep things going once it is set up.
The implementation idea
In layman's terms one could describe the recommendation process like this:
- track visited URLs per user
- see these sets of visited URLs as a kind of 'neighbourhood'
- for each URL, compute 'similar' URLs based on the neighbourhoods
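The original implementation is in Perl, which I can't post; but just to illustrate the first two steps, here is a minimal Python sketch of grouping clicktrack records into per-session URL 'neighbourhoods' (session IDs and URLs are invented sample data):

```python
from collections import defaultdict

# Each clicktrack record is a (session_id, url) pair -- invented sample data.
clicks = [
    ("sess-a", "/post/1"), ("sess-a", "/post/2"),
    ("sess-b", "/post/1"), ("sess-b", "/post/3"),
    ("sess-c", "/post/2"), ("sess-c", "/post/3"),
]

# Group the visited URLs by session; each resulting set is one 'neighbourhood'.
neighbourhoods = defaultdict(set)
for session, url in clicks:
    neighbourhoods[session].add(url)
```

URLs that often appear in the same neighbourhoods are then considered 'similar'.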
In the implementation, this comparison is done via the Search::ContextGraph Perl module. Having some prior experience with this module for text-similarity comparisons, I found it ideally suited to the task once the basic conceptual hurdle was cleared: I needed to transpose the document->term concept to session->url, so that I could search the graph for 'similar URLs' instead of 'similar documents'.
The actual system consists of two distinct stages:
- the user-tracking / clicktrack logging
- and the 'similarity engine' that works from that data
Clicktrack logging is actually implemented in a very simple and convenient fashion: we simply use mod_usertrack, which has been part of the default Apache setup for ages. Enabling mod_usertrack is a no-brainer; we only added some special setup for logging the resulting data, but as we now know, that isn't really necessary. The module does a very simple job: it tries to set a cookie whenever it sees a request that does not already have one. The cookie always gets logged, so clients not accepting (and returning) cookies are a slight concern, as they appear as a large number of 'sessions' of length one. But that can be taken care of while processing, and in practice it doesn't matter much.
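For the curious, a basic mod_usertrack setup could look roughly like this (the directives are from the standard Apache module; the cookie name, log format, and log path are placeholders, not our actual setup):

```apache
LoadModule usertrack_module modules/mod_usertrack.so

# Hand out a tracking cookie to any request that doesn't already carry one
CookieTracking on
CookieName clicktrack

# Log the cookie value alongside each request so sessions can be rebuilt later
LogFormat "%{cookie}n %t \"%r\"" usertrack
CustomLog logs/clicktrack_log usertrack
```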
The 'similarity engine' is again fairly simple. It collects all requested URLs from the logging data and records each 'session' or user-id that has visited a given URL, weighting by number of requests. This is similar to the inverted-index approach used when searching for document similarities, where the key is the term and the values are the documents it occurs in. This index is then 'simply' fed to a Search::ContextGraph object, and a 'find_similar' is performed for each URL. The resulting URL sets are then stored in a SQL table for later retrieval.
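Since I can't show the Search::ContextGraph code itself, here is a rough Python stand-in for the engine's idea: build the inverted session->url index (dropping the length-one sessions mentioned above) and rank 'similar' URLs by weighted co-occurrence. Note this uses plain cosine similarity, not the module's contextual-network-graph algorithm, and all data is invented:

```python
import math
from collections import defaultdict

# session -> {url: request_count}; invented sample data
sessions = {
    "sess-a": {"/post/1": 2, "/post/2": 1},
    "sess-b": {"/post/1": 1, "/post/3": 1},
    "sess-c": {"/post/2": 1, "/post/3": 2},
    "sess-d": {"/post/9": 1},  # length-one session (cookie refused) -- dropped
}

# Invert to url -> {session: weight}, skipping length-one sessions
index = defaultdict(dict)
for sess, urls in sessions.items():
    if len(urls) < 2:
        continue
    for url, count in urls.items():
        index[url][sess] = count

def find_similar(url, top_n=3):
    """Rank other URLs by cosine similarity of their session-weight vectors."""
    base = index[url]
    scores = []
    for other, vec in index.items():
        if other == url:
            continue
        dot = sum(base.get(s, 0) * w for s, w in vec.items())
        if dot == 0:
            continue
        norm = math.sqrt(sum(w * w for w in base.values())) * \
               math.sqrt(sum(w * w for w in vec.values()))
        scores.append((dot / norm, other))
    return [u for _, u in sorted(scores, reverse=True)[:top_n]]
```

In the real system the output of this per-URL step is what gets written to the SQL table for the weblogs to display.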
And that was basically it. All the complexity in the actual code comes from dealing with a set of MySQL tables, incremental log-parsing, normalizing results, and some needed fallbacks for faulty logfiles, etc.
A little bell and a whistle
Results and further thoughts
The recommendations that this system produces can really only be evaluated in a subjective way: does the system recommend useful 'see alsos'? But this depends very much on the underlying data, which in turn depends on the actual usage patterns of the website visitors. In evaluating the results one has to keep that in mind. Only if the majority of website visitors actually click on links to related and interesting content - and not simply on the first two or three links on the main page - does this system stand a chance of giving meaningful output. Also, one must not forget that there is a certain chance that the recommendation links themselves produce a shift towards just those URLs that were recommended...
In any case, this system was fun to write and seems to be rather useful to the client. It is definitely a very interesting use for all the logging data a normal weblog produces.
- CPAN: Search::ContextGraph
- Semantic Search of Unstructured Data using Contextual Network Graphs (PDF) (this is the basis for the Perl module)
- creative-weblogging's list of weblogs (the system runs for most of them)
- Tools for Thought
- Associative Text... the Bernstein way
- DeBabelizing the profile space
- Tinderbox: going from cloud to text
All image, text, and audio material is © Martin Spernau; use and reproduction require the author's permission