Extracting keywords from an article

Developer https://www.devze.com 2023-03-11 10:51 Source: Internet
I have articles and keywords stored inside MySQL. The site will preprocess the new articles to find how many matching keywords there are and then update a table which stores the relevant keywords related to the article. This will then be used on the front-end by highlighting keywords within the article and will link users to articles with the same matching keywords.

My concern here is how to do this processing efficiently. My idea is: when processing a new article, find the n-grams of the text (up to 3- or 4-grams) and then search each one against the keywords table in the MySQL database. This may end up being a slow mess; I haven't tried it. But maybe I'm approaching this the wrong way?
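To make the n-gram idea concrete, here is a minimal sketch in Python (the same logic ports to PHP). The function names `tokenize`, `ngrams` and `candidate_phrases` are made up for illustration, and the tokenizer is a simplifying assumption: it lowercases and keeps only ASCII letters, digits and apostrophes.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified, ASCII only)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def ngrams(tokens, max_n=4):
    """Yield every run of 1..max_n consecutive tokens, joined by single spaces."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def candidate_phrases(text, max_n=4):
    """Return the distinct n-grams of the text: the phrases to look up as keywords."""
    return set(ngrams(tokenize(text), max_n))
```

An article of k words yields a little under max_n * k n-grams before deduplication, so the cost depends mostly on how each candidate is looked up. Collecting the distinct candidates first means each phrase only needs to be checked once.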

Any resources on how to do this efficiently would be awesome. Language used here is primarily PHP.


I've never used PHP to do it, but in .NET, I'll usually do what was alluded to by samxli. I load all keywords into a hashtable. I've done it with up to 120,000 keywords and it works pretty fast.

The .NET hashtable object has a contains([key]) method. So for each word in the article I'll just call:

theHashTable.contains(theWord)

If it does contain the word, I'll index it. This has worked pretty well for me without having to use other frameworks. I haven't used hashtables in PHP, but PHP's normal arrays are implemented as ordered hash tables, so an associative array keyed by keyword (checked with isset()) should behave the same way.

The key to using a hashtable is that lookups are fast: the key is hashed to find its bucket directly, so checking a word takes roughly constant time on average no matter how many keywords are loaded. (This is hashing, not a B-tree; B-trees are what database indexes such as MySQL's typically use.)
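A rough sketch of this approach in Python, where a `set` plays the role of the hashtable. It also checks multi-word phrases, as the question suggests. The name `find_keywords`, the `max_words` parameter and the simple ASCII tokenizer are assumptions for illustration, not anything from the original setup; in practice the keyword list would be loaded from the MySQL keywords table once per batch.

```python
import re

def find_keywords(article_text, keywords, max_words=4):
    """Count how often each known keyword (1..max_words words long) occurs."""
    # Hash-based set: membership tests take constant time on average.
    keyword_set = {k.lower() for k in keywords}
    tokens = re.findall(r"[a-z0-9']+", article_text.lower())
    counts = {}
    for n in range(1, max_words + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in keyword_set:
                counts[phrase] = counts.get(phrase, 0) + 1
    return counts
```

For example, `find_keywords("PHP arrays are hash tables. A hash table is fast.", ["hash table", "PHP", "lucene"])` returns `{"php": 1, "hash table": 1}`. Note that this matches exact tokens only, so "hash tables" does not count as "hash table"; stemming would be a separate step.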


For a faster approach, you can index your keywords and search them with Lucene, i.e. build a query from your document. The most convenient way to extract keywords is to use a large corpus to build IDF frequencies and then extract the words/phrases with the highest TF-IDF. But in your case, with a restricted keyword set, the first approach is the best.
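For reference, TF-IDF extraction can be sketched as below. This is only an illustration under stated assumptions: the name `top_tfidf_terms` is made up, it scores single words rather than phrases, the tokenizer is ASCII only, and the smoothed IDF formula is one common variant among several.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (simplified, ASCII only)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def top_tfidf_terms(document, corpus, k=5):
    """Rank the words of `document` by TF-IDF against `corpus` (a list of texts)."""
    doc_sets = [set(tokenize(text)) for text in corpus]
    n_docs = len(doc_sets)
    tf = Counter(tokenize(document))
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus texts contain the term.
        df = sum(1 for words in doc_sets if term in words)
        # Smoothed IDF, so unseen terms do not cause a division by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = (count / total) * idf
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Words that are frequent in the article but rare in the corpus rank highest, which is why this needs a reasonably large corpus to work well.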

Also take a look at Maui (http://code.google.com/p/maui-indexer/) and KEA (http://www.nzdl.org/Kea/).
