Where should I store a list of stop words?

Developer (https://www.devze.com), 2023-02-06 20:06. Source: Internet
My function parses texts and removes short words, such as "a", "the", "in", "on", "at", etc.

The list of these words might be modified in the future. Switching between different lists (e.g., for different languages) might also be an option.

So, where should I store such a list?

  • About 50-200 words
  • Many reads every minute
  • Almost no writes (modifications) - for example, once in a few months

I have these options in mind:

  1. A list inside the code (fastest, but it doesn't sound like good practice)
  2. A separate file "stop_words.txt" (how fast is reading from a file? Would I have to re-read the same data from the same file every time I call the function, i.e. every few seconds?)
  3. A database table. Would it really be efficient when the list of words is almost static?

I am using Ruby on Rails (if that makes any difference).


If it's only about 50-200 words, I'd store it in memory in a data structure that supports fast lookup, such as a hash map (I don't know what such a structure is called in Ruby).
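In Ruby, the structures in question would be `Hash` or `Set`. As a minimal sketch, assuming one in-memory list per language: the constant `STOP_WORDS`, the helper `stop_word?`, and the sample words (the German ones especially) are illustrative names and data, not something from the question.

```ruby
require "set"

# Illustrative word lists only; real lists would hold the 50-200 words.
STOP_WORDS = {
  en: Set["a", "the", "in", "on", "at"].freeze,
  de: Set["der", "die", "das", "in", "an"].freeze
}.freeze

# Hypothetical helper: true if `word` is a stop word for the given language.
# Raises KeyError for a language that has no list.
def stop_word?(word, lang: :en)
  STOP_WORDS.fetch(lang).include?(word.downcase)
end
```

Keeping the lists keyed by language makes switching between them a matter of passing a different `lang`.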

You could use option 2 or 3 (persist the data in a file or database table, depending on what's easier for you), then read the data into memory at the start of your application. Store the time at which the data was read, and re-read it from the persistent storage when a request comes in and the in-memory copy is more than X minutes old.

That's basically a cache. Ruby on Rails may already provide such a mechanism, but I know too little about it to answer that.
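A rough sketch of such a time-based cache in plain Ruby, assuming the words live in a text file with one word per line. The class name `StopWordCache`, the `ttl` parameter, and the file layout are assumptions made for illustration.

```ruby
require "set"

# Hypothetical helper: keeps the stop words in memory and re-reads the file
# once the cached copy is older than `ttl` seconds.
class StopWordCache
  def initialize(path, ttl: 300)
    @path = path
    @ttl = ttl
    @words = nil
    @loaded_at = nil
    @mutex = Mutex.new # requests may arrive on several threads
  end

  def stop_word?(word)
    words.include?(word.downcase)
  end

  private

  # Returns the cached Set, reloading it first if it is missing or stale.
  def words
    @mutex.synchronize do
      if @words.nil? || monotonic_now - @loaded_at > @ttl
        @words = File.readlines(@path, chomp: true)
                     .map { |w| w.strip.downcase }
                     .reject(&:empty?)
                     .to_set
        @loaded_at = monotonic_now
      end
      @words
    end
  end

  # Monotonic clock, so system clock adjustments do not affect expiry.
  def monotonic_now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

With 50-200 words, a reload every few minutes costs next to nothing, and every lookup in between is served from memory.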


Since look-up of the stop-words needs to be fast, I'd store the stop-words in a hash table. That way, checking whether a word is a stop-word takes O(1) time on average.

Now, since the list of stop-words may change, it makes sense to persist the list in a text file, and read that file upon program start (or every few minutes / upon file modification if your program runs continuously).
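The "upon file modification" variant could look roughly like this, assuming a one-word-per-line file and naive whitespace tokenization. `StopWordList` and `remove_from` are hypothetical names, and a real parser would probably also need to handle punctuation.

```ruby
require "set"

# Hypothetical loader: keeps the stop words in a Set and re-reads the file
# only when its modification time has changed.
class StopWordList
  def initialize(path)
    @path = path
    @mtime = nil
    @words = Set.new
  end

  # Returns `text` with stop words removed (words are split on whitespace).
  def remove_from(text)
    reload_if_changed
    text.split.reject { |word| @words.include?(word.downcase) }.join(" ")
  end

  private

  def reload_if_changed
    mtime = File.mtime(@path)
    return if mtime == @mtime

    @words = Set.new(
      File.foreach(@path).map { |line| line.strip.downcase }.reject(&:empty?)
    )
    @mtime = mtime
  end
end
```

Checking the modification time still touches the file system on each call, but it is far cheaper than re-reading and re-parsing the file.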
