
Recognizing language of a short text? [closed]

Developer https://www.devze.com 2022-12-24 12:24 Source: Internet
Closed. This question needs to be more focused. It is not currently accepting answers.


Closed 6 years ago.


I have a list of articles, and each article has its own title and description. Unfortunately, from the sources I am using, there is no way to know what language they are written in.

Furthermore, the text is not written entirely in one language; English words are almost always present.

I reckon I would need dictionary databases stored on my machine, but that feels a bit impractical. What would you suggest I do?


I'd use the guess-language project.

Edit: It is now hosted on Bitbucket.


Have you looked into http://ling.unizd.hr/~dcavar/LID/ and http://en.wikipedia.org/wiki/Language_identification ?


You could try the Google AJAX Language API if you don't mind using a web service to do your work for you.


In general, you're looking at n-gram identification. Since this is a Python question, you might take a look at http://github.com/koblas/ngramj-python, which is a pure Python port of the Java ngram library (another open source project).

The documentation is lacking, but it has really good accuracy.
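
To make the n-gram idea concrete, here is a minimal sketch in plain Python. It does not use ngramj-python or its API; the function names, the two toy training sentences, and the simple overlap score are all assumptions made for illustration. Real n-gram identifiers build their profiles from much larger corpora and usually compare ranked frequency profiles rather than plain set overlap.

```python
def trigrams(text):
    """Return overlapping character trigrams, padding the text with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# Toy training text (hypothetical); a usable profile needs far more data.
SAMPLES = {
    "en": ("the quick brown fox jumps over the lazy dog and this is "
           "the text that we want to read with the others"),
    "de": ("der schnelle braune fuchs springt auf den faulen hund und das "
           "ist der text den wir mit den anderen lesen wollen"),
}

# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(trigrams(text)) for lang, text in SAMPLES.items()}

def overlap_score(text, profile):
    """Fraction of the text's trigrams that also occur in the profile."""
    grams = trigrams(text)
    if not grams:
        return 0.0
    return sum(gram in profile for gram in grams) / len(grams)

def guess_by_trigrams(text):
    """Pick the language whose profile covers the most trigrams of the text."""
    return max(PROFILES, key=lambda lang: overlap_score(text, PROFILES[lang]))
```

Calling `guess_by_trigrams(title + " " + description)` on each article gives one label per article. Because the score is a fraction, you can also inspect `overlap_score` per language to spot mixed-language text.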


I know this is an old question, but in case people come across it while researching options for this task, another tool worth mentioning is langid.


If neo's recommendation is also impractical, I would try something like this:

Most languages have a few very common words that appear in many sentences and are rarely found in other languages.

Example: "the" in English; "der", "die", "das" in German; and so on.

Collect such words and search for them in your texts. The result can be a little fuzzy: if you find both "the" and "der", for example, it could be a German text containing some English sentences. Still, with enough words from each of your target languages you should get a high hit rate.
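
A rough sketch of this keyword approach in Python follows. The word lists are tiny and hand-picked purely for illustration, and the function names are hypothetical; a real list would need many more words per language.

```python
import re
from collections import Counter

# Hypothetical, deliberately small lists of very common words per language.
COMMON_WORDS = {
    "en": {"the", "and", "of", "is", "to", "that", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein"},
}

def common_word_hits(text):
    """Count how many words of the text appear in each language's list."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    hits = Counter()
    for word in words:
        for lang, vocab in COMMON_WORDS.items():
            if word in vocab:
                hits[lang] += 1
    return hits

def guess_by_common_words(text):
    """Return the language with the most hits, or None if nothing matched."""
    hits = common_word_hits(text)
    if not hits:
        return None
    return hits.most_common(1)[0][0]
```

Returning the full hit counts as well as the winner is useful here, because a text with hits in several languages is exactly the mixed-language case described above.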
