indexing slow-down (i.e., speeding up programs)
Raymond E. Griffith
rgriffit at ctc.net
Sat Apr 12 21:07:01 EDT 2003
on 4/12/03 4:14 PM, John Vokey at vokey at uleth.ca wrote:
> Here's my task: I've got two LARGE text files, one a spelling
> dictionary (from Excalibur) and the other a dictionary of lexical
> frequencies (i.e., item and frequency); there are over 113,000 unique words
> (as lines) in the first, and about a million different entries (as
> separate lines) in the second, many not regular words (e.g., dates,
> numbers, etc.); furthermore, in the second, the same word is listed
> multiple times depending on use (e.g., ``rose'' as a noun and ``rose''
> as a verb). I want to extract the lexical frequency (ignoring use
> type) of each of the words in the first from the second, including
> assigning a frequency of zero for those from the first not found in the
> second.
>
> Fortunately, both are already alphabetised, so as I move sequentially
> through the first, I can simply start searching from where I last left
> off in the second, summing over multiple use listings of the same word.
> So far so good. I use the ``repeat for each ... in ...'' construct
> for the first dictionary, and a variable pointer for the second that I
> advance as I find each word from the first (I call a simple function
> that skips over non-words and advances the pointer, returning the line
> of the next word in the lexical dictionary). Both text files are read
> into memory at the start of the routine (which takes less than a second
> using the ``url file://...'' command; way to go, metacard!).
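>
> Schematically, the loop looks like this (simplified; the real handler
> also skips the non-word entries, and the names are illustrative):
>
>   put 1 into pointer
>   repeat for each line thisWord in dict1
>     put 0 into theFreq
>     -- sum the frequencies of every dict2 entry for this word;
>     -- ``line pointer of dict2'' is the relative lookup in question
>     repeat while word 1 of line pointer of dict2 is thisWord
>       add word 2 of line pointer of dict2 to theFreq
>       add 1 to pointer
>     end repeat
>     put thisWord & tab & theFreq & return after theResults
>   end repeat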
>
> Here's my problem: initially, the routine takes much less than 1
> second per hundred words (which, when multiplied by the number of words
> remaining to index, results in an estimate of some small fraction of an
> hour to do the whole task). However, it rapidly (as an exponential
> function) slows down, so that by the time it reaches the middle of the
> alphabet (M), it takes many minutes per 100 words, and an
> ever-increasing time estimate for the remaining items (now over 60
> hours!). Clearly, either the ``repeat for each'' command for the first
> dictionary or the ``get line pointer...'' for the second (or both)
> get(s) slower and slower as I progress through the dictionaries,
> presumably because to do one or the other (or both), metacard counts
> carriage returns from the beginning of the dictionary.
>
> I've tried: deleting the items from the front of the lexical
> dictionary as I've searched them, so that each new search starts at
> line 1 of the edited list [i.e., put line pointer to (the number of
> lines of dict2) of dict2 into dict2], but that slows it down even more
> (presumably because metacard needs to count crs to find the number of
> the last line each time). I've tried various machinations of
> offset(,,skip) using the skip variable as an absolute pointer, but it
> also appeared to make things slower presumably because it counts chars
> from the start of the dictionary. I guess I could divide dict2 (or
> dict1) into many small segments, moving to each successive segment as
> the previous one was exhausted, but I was hoping for something more
> elegant.
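>
> The offset machinations were along these lines (sketched), with skip
> carried forward between searches:
>
>   put offset(thisWord, dict2, skip) into foundAt
>   -- offset() counts from char skip+1 of dict2
>   add foundAt + the length of thisWord to skip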
>
> What I need are *absolute* pointers (preferably a memory address, or
> a pointer or a handle to such), rather than the relative (to the
> beginning of the list) pointers given by the line (and possibly ``for
> each'') construct. Arrays presumably would work (but doesn't metacard
> then have to search the indices to resolve the array reference?), and
> reading the million-line file into the array just sets the problem
> back one step.
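>
> By an array approach I mean something like this (a sketch; the names
> are made up):
>
>   -- build the table once: key is the word, element is the summed
>   -- frequency over its multiple use listings
>   repeat for each line thisLine in dict2
>     put word 1 of thisLine into w
>     if freqA[w] is empty then put word 2 of thisLine into freqA[w]
>     else add word 2 of thisLine to freqA[w]
>   end repeat
>   -- each lookup is then a single array reference
>   repeat for each line thisWord in dict1
>     put freqA[thisWord] into theFreq
>     if theFreq is empty then put 0 into theFreq
>     put thisWord & tab & theFreq & return after theResults
>   end repeat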
>
> Any suggestions would be appreciated. As I receive the list in
> digest form, if you have a scathingly brilliant idea, please send a
> copy of it directly to my email address. TIA
I don't know if it is "brilliant" or not -- but you may have to read the
initial files into several separate variables, or separate elements of an
array. Your pointer would then be two-dimensional (line,variable) or
(line,indexnumber).
You should then be able to do a double repeat:

  repeat for each element thisElement in theList
    repeat for each line thisLine in thisElement
      -- process thisLine here
    end repeat
  end repeat
You would then reap the benefits of a smaller search area, as well as having
everything still at hand -- all for a little extra time loading the program
in the first place.
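
For instance, if the elements were keyed by first letter (untested, and
assuming each line of the big dictionary starts with the word itself):

  -- split the million-line list into one bucket per initial letter
  repeat for each line thisLine in dict2
    put thisLine & return after bucketA[char 1 of thisLine]
  end repeat

Each word from the spelling dictionary would then only be searched for
in bucketA[char 1 of thisWord], a much smaller chunk.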
I don't know -- it's just a guess. Try it, though, and let me know how it
works.
Raymond