
Word Base/Stem Dictionary

Developer https://www.devze.com 2023-01-21 06:29 Source: web

It seems my Google-fu is failing me.

Does anyone know of a freely available word base dictionary that contains just the bases of words? So, for something like strawberries, it would have strawberry. It should NOT contain abbreviations, misspellings, or alternate spellings (like UK versus US). Anything quickly usable in Java would be good, but just a text file of mappings or anything that could be read in would be helpful.


This is called lemmatization, and what you call the "base of a word" is called a lemma. Tools such as morpha and its reimplementation in the Stanford POS tagger do this. Both, however, require POS-tagged input to resolve the inherent ambiguity in natural language.

(POS tagging means determining the word categories, e.g. noun, verb. I've been assuming you want a tool that handles English.)
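To illustrate why the POS tag matters, here is a minimal sketch of a lookup keyed by word plus tag. The class name `LemmaLookup`, the tag names and the handful of entries are all made up for the example; this is not how morpha or the Stanford tools work internally, just a toy table showing that the same surface form can map to different lemmas.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy lemma lookup keyed by (word, part of speech); all entries are illustrative only. */
final class LemmaLookup {
    private static final Map<String, String> LEMMAS = new HashMap<>();

    static {
        // Hypothetical hand-written entries; a real lemmatizer uses rules plus a lexicon.
        put("strawberries", "NOUN", "strawberry");
        put("adding", "VERB", "add");
        put("saw", "NOUN", "saw"); // the tool
        put("saw", "VERB", "see"); // read here as the past tense of "see"
    }

    private static void put(String word, String pos, String lemma) {
        LEMMAS.put(key(word, pos), lemma);
    }

    private static String key(String word, String pos) {
        return word.toLowerCase(Locale.ROOT) + "/" + pos;
    }

    /** Returns the lemma for the tagged word, or the lowercased word itself when it is unknown. */
    static String lemma(String word, String pos) {
        return LEMMAS.getOrDefault(key(word, pos), word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(lemma("saw", "NOUN"));
        System.out.println(lemma("saw", "VERB"));
    }
}
```

Without the tag there is no way to choose between the two entries for "saw", which is the ambiguity mentioned above.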

Edit: since you're going to use this for search, here are a few tips:

  • Simple stemming for English has a mixed reputation in the search engine world. Sometimes it works, often it doesn't.
  • Automatic spelling correction may work better. This is what Google does. It's expensive in terms of computing time, though, if you want to do it right.
  • Lemmatization may provide benefits, but probably only if you index and search for both the words and the lemmas. (Same advice goes for stemming.)
  • Here's a plugin for Lucene that does lemmatization.
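The "index both the words and the lemmas" tip can be sketched with a tiny in-memory inverted index. This is only an illustration under assumptions: `DualIndex`, the `w:`/`l:` term prefixes and the pluggable lemmatizer function are names invented here, and it has nothing to do with the Lucene plugin's actual API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

/** Toy inverted index that posts every token under both its surface form and its lemma. */
final class DualIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final UnaryOperator<String> lemmatizer; // assumed to be supplied by the caller

    DualIndex(UnaryOperator<String> lemmatizer) {
        this.lemmatizer = lemmatizer;
    }

    /** Splits the text on non-word characters and indexes each token twice. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            post("w:" + token, docId);                   // exact surface form
            post("l:" + lemmatizer.apply(token), docId); // lemma
        }
    }

    private void post(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    /** Returns documents matching the query word exactly or through its lemma. */
    Set<Integer> search(String queryWord) {
        String q = queryWord.toLowerCase(Locale.ROOT);
        Set<Integer> hits = new TreeSet<>();
        hits.addAll(postings.getOrDefault("w:" + q, Collections.emptySet()));
        hits.addAll(postings.getOrDefault("l:" + lemmatizer.apply(q), Collections.emptySet()));
        return hits;
    }
}
```

Keeping the two kinds of terms apart means a real engine could rank exact matches above lemma-only matches; this sketch just merges them.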

(Preceding remarks are based on my own research; I wrote my master's thesis about lemmatization in search engines for very noisy data.)


This isn't exactly what you're asking for, but the Wikipedia article on stemming was enlightening and contains a number of links to free stemming programs, which presumably include lists of word stems.


http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start

The Merriam-Webster's Collegiate 9th Edition link on this page contains a word file of only the root forms of words. "Strawberry" is in there; "strawberries" is not. Likewise, "add" is in there; "adding" is not. Not sure if this is what you are after, but it was helpful for me.
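If it helps, a word file like that could be read into a set in Java roughly as follows. I'm assuming one word per line here, which may not match the actual file layout, and `BaseWordList` is just a name picked for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Loads a base-form word list, assuming one word per line (an assumption about the format). */
final class BaseWordList {
    static Set<String> load(Reader source) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim().toLowerCase(Locale.ROOT);
                if (!word.isEmpty()) {
                    words.add(word); // blank lines are skipped
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Inline sample data stands in for the downloaded file.
        Set<String> words = load(new StringReader("strawberry\nadd\n"));
        System.out.println(words.contains("strawberry"));
        System.out.println(words.contains("strawberries"));
    }
}
```

For the real file, pass something like `java.nio.file.Files.newBufferedReader(path)` instead of the `StringReader`.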
