
Is there a well-known classifier library?

开发者 https://www.devze.com 2022-12-09 23:22 Source: web

I'm crawling data from the internet without classifying it.

Is there a library you would recommend for this?

EDIT

I'm crawling jobs from other websites, and I need to group them into different industries.


To sort unlabelled data into groups, you want clustering, not classification. The most complete machine learning library is the Java-based Weka. You'll probably want to start by extracting text from the web pages (remove script and style elements completely, strip other tags), and then running the text through the StringToWordVector filter before performing clustering.
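
The pipeline described above (strip markup, turn text into word counts, then cluster) can be sketched in plain Python. This is only an illustration of the idea, not Weka's API: `extract_text`, `word_vector`, `cosine`, and `kmeans` are hypothetical helper names, the word counts are a rough stand-in for what a filter like StringToWordVector produces, and the clustering is a deliberately tiny k-means that seeds its centroids with the first `k` documents.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping everything inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def word_vector(text):
    """Lowercased term counts (a simple bag of words)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def kmeans(vectors, k, rounds=10):
    """Tiny k-means on cosine similarity; seeds with the first k vectors (an arbitrary choice)."""
    centroids = [Counter(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(rounds):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the summed counts of its members
        # (cosine similarity ignores scale, so no division is needed).
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total
    return labels
```

Because the seeding is naive, results depend on document order; a real library will handle initialisation, stop words, and term weighting far better than this sketch does.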


My current employer developed a system to categorize web pages. We could not find any useful libraries, so we had to build our own. We do not license ours out.

I can give you some hints. Spam analyzers classify email into Junk or Not Junk. You can use the same tools, such as Bayesian filters, CRM-114, etc., to do your own classification of any text, including web pages.
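
To make the spam-filter analogy concrete, here is a minimal sketch of a Bayesian text classifier, assuming you have a few hand-labelled job postings to train on. It is a generic multinomial naive Bayes with add-one smoothing, not the code of any particular spam filter; the `NaiveBayes` class and the industry labels are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.doc_counts = Counter()              # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.doc_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for word in words:
                # Smoothed likelihood, so unseen words do not zero out the score.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data; real use needs far more labelled examples.
nb = NaiveBayes()
nb.train("java developer building banking software", "IT")
nb.train("python engineer for web backend", "IT")
nb.train("registered nurse for hospital ward", "Healthcare")
nb.train("clinic seeks medical assistant", "Healthcare")
print(nb.classify("senior software engineer"))
```

Unlike a Junk / Not Junk filter, this handles any number of labels, which is what grouping jobs by industry needs.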

You will have to watch the results of these very carefully and give them a lot of human feedback. You can often find keyword sets that will score very well for you. Finding those keyword sets takes time and effort, and the sets will change somewhat over time.
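
As a rough illustration of keyword-set scoring, the sketch below counts keyword hits per industry and returns no label when nothing matches, so those postings can go to a human reviewer. The keyword sets, the `min_hits` threshold, and the function names are all hypothetical; in practice the sets come out of the review-and-feedback loop described above.

```python
import re

# Hypothetical keyword sets, to be refined by hand over time.
KEYWORDS = {
    "IT": {"software", "developer", "java", "python", "engineer"},
    "Healthcare": {"nurse", "hospital", "clinic", "patient", "medical"},
}


def keyword_scores(text, keyword_sets=KEYWORDS):
    """Count how many words of the text fall into each keyword set."""
    words = re.findall(r"[a-z]+", text.lower())
    return {label: sum(w in kws for w in words) for label, kws in keyword_sets.items()}


def best_industry(text, keyword_sets=KEYWORDS, min_hits=1):
    """Return the top-scoring label, or None when no set reaches min_hits."""
    scores = keyword_scores(text, keyword_sets)
    label = max(scores, key=scores.get)
    return label if scores[label] >= min_hits else None
```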

You will have to write code to divide web pages into topic sections because most pages are not all one thing. There are ad frames, navigation and other things.
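
One possible starting point for that splitting, sketched with Python's standard `html.parser`: drop elements that usually hold navigation or ads, and start a new section at each heading. The tag lists are assumptions (many sites use generic `div` elements and need site-specific rules), and `SectionSplitter` and `split_sections` are names invented for this example.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    # Assumed markers of non-content regions; adjust per site.
    SKIPPED = {"script", "style", "nav", "aside", "footer", "iframe"}
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.in_heading = False
        self.sections = []  # each entry is [heading text, list of body chunks]

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS and self.skip_depth == 0:
            # A heading outside skipped regions opens a new section.
            self.in_heading = True
            self.sections.append(["", []])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())
        if self.skip_depth or not text:
            return
        if self.in_heading:
            self.sections[-1][0] = (self.sections[-1][0] + " " + text).strip()
        else:
            if not self.sections:
                # Text before the first heading goes into an untitled section.
                self.sections.append(["", []])
            self.sections[-1][1].append(text)


def split_sections(html):
    """Return a list of (heading, body text) pairs for the content regions."""
    parser = SectionSplitter()
    parser.feed(html)
    parser.close()
    return [(heading, " ".join(body)) for heading, body in parser.sections]
```

Each section can then be classified on its own instead of treating the whole page as one document.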
