indexing slow-down (i.e., speeding up programs)
John Vokey
vokey at uleth.ca
Sat Apr 12 17:16:00 EDT 2003
Here's my task: I've got two LARGE text files, one a spelling
dictionary (from Excalibur) and the other a dictionary of lexical
frequencies (i.e., item and frequency); there are over 113,000 unique
words (as lines) in the first, and about a million different entries
(as separate lines) in the second, many of them not regular words (e.g., dates,
numbers, etc.); furthermore, in the second, the same word is listed
multiple times depending on use (e.g., ``rose'' as a noun and ``rose''
as a verb). I want to extract the lexical frequency (ignoring use
type) of each of the words in the first from the second, including
assigning a frequency of zero for those from the first not found in the
second.
Fortunately, both are already alphabetised, so as I move sequentially
through the first, I can simply start searching from where I last left
off in the second, summing over multiple use listings of the same word.
So far so good. I use the ``repeat for each ... in ... '' construct
for the first dictionary, and a variable pointer for the second that I
advance as I find each word from the first (I call a simple function
that skips over non-words and advances the pointer, returning the line
of the next word in the lexical dictionary). Both text files are read
into memory at the start of the routine (which takes less than a second
using the ``url file://...'' command; way to go, metacard!).
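In rough form, the loop looks something like this (just a sketch: the
variable and file names are invented, the frequency file is assumed to be
``word frequency'' per line, and the skipping of non-word entries is left
out):

    put url "file:dict1.txt" into dict1
    put url "file:dict2.txt" into dict2
    put the number of lines of dict2 into tLast
    put 1 into pointer
    repeat for each line tWord in dict1
      -- advance past entries that sort before the current word
      repeat while pointer <= tLast and word 1 of line pointer of dict2 < tWord
        add 1 to pointer
      end repeat
      -- sum the frequencies over every use listing of the word
      put 0 into tFreq
      repeat while word 1 of line pointer of dict2 is tWord
        add word 2 of line pointer of dict2 to tFreq
        add 1 to pointer
      end repeat
      put tWord & tab & tFreq & cr after tResults
    end repeat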
Here's my problem: initially, the routine takes much less than 1
second per hundred words (which, when multiplied by the number of words
remaining to index, results in an estimate of some small fraction of an
hour to do the whole task). However, it rapidly (as an exponential
function) slows down, so that by the time it reaches the middle of the
alphabet (M), it takes many minutes per 100 words, and the time
estimate for the remaining items keeps growing (now over 60 hours!).
Clearly, either the ``repeat for each'' command for the first
dictionary or the ``get line pointer...'' for the second (or both)
get(s) slower and slower as I progress through the dictionaries,
presumably because to do one or the other (or both), metacard counts
carriage returns from the beginning of the dictionary.
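A quick way to check that presumption would be to time indexed line
access against ``repeat for each'' over the same text, something like
this (handler and variable names made up):

    on compareAccess tText
      put the milliseconds into tStart
      repeat with k = 1 to 1000
        get line k of tText        -- presumably re-counts returns from line 1 each time
      end repeat
      put "indexed:" && the milliseconds - tStart & cr into tReport
      put the milliseconds into tStart
      put 0 into k
      repeat for each line tLine in tText   -- one forward scan, no re-counting
        add 1 to k
        if k >= 1000 then exit repeat
      end repeat
      put "for each:" && the milliseconds - tStart after tReport
      answer tReport
    end compareAccess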
I've tried: deleting the items from the front of the lexical
dictionary as I've searched them, so that each new search starts at
line 1 of the edited list [i.e., put line pointer to (the number of
lines of dict2) of dict2 into dict2], but that slows it down even more
(presumably because metacard needs to count carriage returns to find the
number of the last line each time). I've tried several variations of
offset(,,skip) using the skip variable as an absolute pointer, but it
also appeared to make things slower presumably because it counts chars
from the start of the dictionary. I guess I could divide dict2 (or
dict1) into many small segments, moving to each successive segment as
the previous one was exhausted, but I was hoping for something more
elegant.
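For what it's worth, what I mean by using skip as an absolute pointer is
roughly this (again a sketch with invented names, assuming entries look
like ``word space frequency''; the summing over multiple use listings is
omitted):

    put 0 into tSkip
    repeat for each line tWord in dict1
      put offset(cr & tWord & space, dict2, tSkip) into tPos
      if tPos = 0 then
        put tWord & tab & 0 & cr after tResults   -- not found in dict2
      else
        add tPos to tSkip   -- tSkip now sits just before the matching entry
        -- the frequency would then be read from the entry starting around char tSkip
      end if
    end repeat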
What I need are *absolute* pointers (preferably a memory address, or
a pointer or a handle to such), rather than the relative (to the
beginning of the list) pointers given by the line (and possibly ``for
each'') construct. Arrays presumably would work (but doesn't metacard
then have to search the indices to resolve the array reference?), and
reading the million-line file into an array just sets the problem
back one step.
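That said, if array keys are hashed rather than searched linearly, then a
one-pass build over dict2 (using ``repeat for each'', so no line counting)
followed by direct lookups might sidestep the pointer business entirely;
something like the following, with the file names invented and the
word-then-frequency layout assumed:

    put url "file:dict2.txt" into dict2
    repeat for each line tEntry in dict2      -- single pass over the million lines
      put word 1 of tEntry into tKey
      if tFreqOf[tKey] is empty then put 0 into tFreqOf[tKey]
      add word 2 of tEntry to tFreqOf[tKey]   -- sums the repeated use listings
    end repeat
    put url "file:dict1.txt" into dict1
    repeat for each line tWord in dict1
      if tFreqOf[tWord] is empty then put 0 into tFreqOf[tWord]
      put tWord & tab & tFreqOf[tWord] & cr after tResults
    end repeat

Non-word entries would just become keys that are never looked up, so they
wouldn't even need to be skipped.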
Any suggestions would be appreciated. As I receive the list in
digest form, if you have a scathingly brilliant idea, please send a
copy of it directly to my email address. TIA
--
John R. Vokey, Ph.D. |\ _,,,---,,_
Professor /,`.-'`' -. ;-;;,_
Department of Psychology and Neuroscience |,4- ) )-,_. ,\ ( `'-'
University of Lethbridge '---''(_/--' `-'\_)