more VectorSpace thoughts
(Thursday 24th April 2003)

here's some timing info for my current setup:
[Thu Apr 24 22:49:50 2003] vs.pl: reading archive from ../html/blog/archive
[Thu Apr 24 22:49:50 2003] vs.pl: found 1257 postings
[Thu Apr 24 22:49:50 2003] vs.pl: setting up VectorSpace
[Thu Apr 24 22:49:50 2003] vs.pl: building Index
[Thu Apr 24 22:49:50 2003] vs.pl: Making word list:
[Thu Apr 24 22:50:01 2003] vs.pl: Finished with word list
[Thu Apr 24 22:50:01 2003] vs.pl: doing queries
[Thu Apr 24 22:50:01 2003] vs.pl: making cosines for doc 0, 2003-04-24_22-33
[Thu Apr 24 22:50:02 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-33.sim
[Thu Apr 24 22:50:02 2003] vs.pl:
[Thu Apr 24 22:50:02 2003] vs.pl: making cosines for doc 1, 2003-04-24_22-29
[Thu Apr 24 22:50:03 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-29.sim
[Thu Apr 24 22:50:03 2003] vs.pl:
[Thu Apr 24 22:50:03 2003] vs.pl: making cosines for doc 2, 2003-04-24_22-26
[Thu Apr 24 22:50:05 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-26.sim
[Thu Apr 24 22:50:05 2003] vs.pl:
[Thu Apr 24 22:50:05 2003] vs.pl: making cosines for doc 3, 2003-04-24_15-53
[Thu Apr 24 22:50:06 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_15-53.sim
[Thu Apr 24 22:50:06 2003] vs.pl:
[Thu Apr 24 22:50:06 2003] vs.pl: making cosines for doc 4, 2003-04-24_10-41

so the first optimisation could be to go semi-incremental. No need to re-do all 1200+ entries in each run, as most of them don't really change.
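A minimal sketch of what "semi-incremental" could look like, in Python rather than the Perl of vs.pl. It assumes each posting is a file named after its ID and that its result lives in a matching `.sim` file; the function name `stale_postings` and that layout are hypothetical, only the `.sim` suffix is taken from the log above.

```python
import os

def stale_postings(archive_dir, sims_dir):
    """Return posting names whose .sim file is missing or older than the posting."""
    stale = []
    for name in sorted(os.listdir(archive_dir)):
        path = os.path.join(archive_dir, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories such as sims/
        sim_path = os.path.join(sims_dir, name + ".sim")
        # Redo a posting only if it has no result yet or was edited afterwards.
        if (not os.path.exists(sim_path)
                or os.path.getmtime(sim_path) < os.path.getmtime(path)):
            stale.append(name)
    return stale
```

This is only "semi" incremental: a new posting can also belong in the similarity lists of older entries, so those would still need a full pass every now and then.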
next step surely would be to pre-generate the word-index phase (it takes about 11 seconds of each run, going by the log above), which could also be used for the keyword search.
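As a rough illustration of a pre-generated word index that doubles as a keyword search, here is a Python sketch. The tokenizer, the JSON file format and all function names are assumptions made for the example, not what vs.pl actually does.

```python
import json
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(postings):
    """postings: dict of posting id -> text. Returns word -> {posting id: count}."""
    index = defaultdict(dict)
    for doc_id, text in postings.items():
        for word in tokenize(text):
            index[word][doc_id] = index[word].get(doc_id, 0) + 1
    return dict(index)

def save_index(index, path):
    """Write the index to disk so later runs can skip the word-list phase."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)

def load_index(path):
    """Read a previously saved index back in."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def keyword_search(index, word):
    """Return the ids of all postings containing the word, sorted."""
    return sorted(index.get(word.lower(), {}))
```

The same word-to-postings mapping serves both purposes: the counts feed the document vectors, and the keys answer keyword lookups directly.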
each actual 'query' takes roughly a second to perform, which is rather fast if you consider that it compares one doc to 1200+ others. Over all 1257 postings, though, that still adds up to some 20 minutes per full run.
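For reference, one such 'query' boils down to something like the following Python sketch. It assumes plain word counts as vector components and sparse word-to-count maps as storage; the real VectorSpace code may weight and store terms differently, and the names `cosine` and `similar` are made up here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as word -> count dicts."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

def similar(doc_id, vectors):
    """Compare one doc against all others; best matches come first."""
    query = vectors[doc_id]
    scores = [(cosine(query, vec), other)
              for other, vec in vectors.items() if other != doc_id]
    return sorted(scores, reverse=True)
```

Each call to `similar` walks over every other document once, which is why the cost of a full run grows with the square of the number of postings.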

And no, keeping the whole VectorSpace in-memory between user-queries is not really an option. The Perl VM size for this is about 125MB on my machine, which is ok for running it every so often, but not for running it continuously.

Another gripe is the fact that it's difficult to incrementally add new words (or docs containing new words) to a 'pre-generated' space.
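To show where the difficulty comes from, here is a small Python sketch. It assumes the pre-generated space stores each doc as a fixed-length list with one slot per known word; that storage layout and both function names are assumptions for illustration only.

```python
def to_vector(words, word_list):
    """Count words into a fixed-length vector; report words that have no slot."""
    position = {word: i for i, word in enumerate(word_list)}
    vector = [0] * len(word_list)
    unknown = []
    for word in words:
        if word in position:
            vector[position[word]] += 1
        else:
            unknown.append(word)  # no dimension exists for this word yet
    return vector, unknown

def extend_space(word_list, vectors, new_words):
    """Add new words as extra dimensions; every stored vector has to grow."""
    added = [w for w in dict.fromkeys(new_words) if w not in word_list]
    longer_list = word_list + added
    padded = {doc: vec + [0] * len(added) for doc, vec in vectors.items()}
    return longer_list, padded
```

A single new word means touching every stored vector, which is the part that makes a 'pre-generated' space awkward to update.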

[ by Martin ]

Martin Spernau
© 1994-2003
