Detecting whether or not text is English (in bulk)

I'm looking for a simple way to detect whether a short excerpt of text, a few sentences, is English or not. It seems to me that this problem is much easier than trying to detect an arbitrary language. Is there any software out there that can do this? I'm writing in Python and would prefer a Python library, but something else would be fine too. I've tried Google, but then realized the TOS didn't allow automated queries.

I read about a method that detects English using trigrams.

Go over the text and find the most frequent trigrams (three-letter sequences) within its words. If they match the most frequent trigrams among English words, the text is probably written in English.
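The trigram idea can be sketched roughly as follows. This is only an illustration: the trigram set below is a small hand-picked assumption rather than a profile computed from a real corpus, and the names COMMON_ENGLISH_TRIGRAMS, top_trigrams and trigram_overlap are made up for this example.

```python
from collections import Counter
import re

# Assumption: a small hand-picked set of frequent English letter trigrams.
# A real profile would be computed from a large English corpus.
COMMON_ENGLISH_TRIGRAMS = {
    "the", "and", "ing", "ent", "ion", "her", "for", "tha", "nth", "int",
    "ere", "tio", "ter", "est", "ers", "ati", "hat", "ate", "all", "eth",
}

def top_trigrams(text, n=20):
    """Return the n most frequent letter trigrams found inside words."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return [trigram for trigram, _ in counts.most_common(n)]

def trigram_overlap(text, n=20):
    """Fraction of the text's top trigrams that are common in English."""
    top = top_trigrams(text, n)
    if not top:
        return 0.0
    hits = sum(1 for trigram in top if trigram in COMMON_ENGLISH_TRIGRAMS)
    return hits / len(top)
```

A higher overlap suggests English. Where to put the cutoff is something you would have to tune on your own data, and very short excerpts will give noisy scores.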

Take a look at this Ruby project:

https://github.com/feedbackmine/language_detector

EDIT: This won't work in this case, since OP is processing text in bulk which is against Google's TOS.

Use the Google Translate language detect API. Python example from the docs:

import json
import urllib.request

url = ('https://ajax.googleapis.com/ajax/services/language/detect?' +
       'v=1.0&q=Hola,%20mi%20amigo!&key=INSERT-YOUR-KEY&userip=INSERT-USER-IP')
# Replace the Referer value with the URL of your own site.
request = urllib.request.Request(url, None, {'Referer': 'INSERT-YOUR-SITE-URL'})
response = urllib.request.urlopen(request)
results = json.load(response)
if results['responseData']['language'] == 'en':
    print('English detected')

Although not as good as Google's own, Apache Nutch's LanguageIdentifier, which comes with its own pretrained n-gram models, has worked well for me. I had quite good results on a large corpus (50 GB of mostly-text PDFs) of real-world data in several languages.

It is in Java, but I'm sure you can read the n-gram profiles from it if you want to reimplement it in Python.

Google Translate API v2 allows automated queries, but it requires an API key, which you can get for free from the Google APIs console.

To detect whether text is English you could use the detect_language_v2() function (which uses that API) from my answer to the question "Python - can I detect unicode string language code?":

if all(lang == 'en' for lang in detect_language_v2(['some text', 'more text'])):
    # all text fragments are in English
    print('All text fragments are in English')

I recently wrote a solution for this. My solution is not foolproof, and I do not think it would be computationally viable for large amounts of text, but it seems to work well for smallish sentences.

Suppose you have two strings of text:

1. "LETMEBEGINBYSAYINGTHANKS"
2. "UNGHSYINDJFHAKJSNFNDKUAJUD"

The goal then is to determine that string 1 is probably English while string 2 is not. Intuitively, the way my mind determines this is by looking for the boundaries of English words in the sentences (LET, ME, BEGIN, etc.). But this is not straightforward computationally, because there are overlapping words (BE, GIN, BEGIN, SAY, SAYING, THANK, THANKS, etc.).

My method does the following:

1. Take the intersection of { known English words } and { all substrings of the text of all lengths }.
2. Construct a directed graph whose vertices are character positions in the text. Each word found adds an edge from its starting index to the index of the letter just after its end. E.g., (0) is L, so "LET" is represented by (0) -> (3), where (3) is M, the start of "ME".
3. Find the largest integer n between 0 and len(text) for which a simple directed path exists from index 0 to index n.
4. Divide that number n by the length of the text to get a rough idea of what percent of the text appears to be consecutive English words.

Note that my code assumes no spaces between words. If you have spaces already, then my method is silly, since the core of my solution is about figuring out where the spaces should be. (If you are reading this and you have spaces, then you are probably trying to solve a more sophisticated problem.) Also, for my code to work you need an English wordlist file. I got one from https://github.com/first20hours/google-10000-english, but you can use any such file, and I imagine this technique could be extended to other languages in the same way.

Here is the code:

from collections import defaultdict

# This function tests what percent of the string seems to me to be maybe
# English-language
# We use an English words list from here: 
# https://github.com/first20hours/google-10000-english
def englishness(maybeplaintext):
    maybeplaintext = maybeplaintext.lower()
    # Load the word list (one word per line) into a set for fast lookups.
    with open('words.txt', 'r') as f:
        words = set(f.read().lower().split("\n"))
    # Build the graph: an edge start -> end means that
    # maybeplaintext[start:end] is a known English word.
    wordGraph = defaultdict(list)
    lt = len(maybeplaintext)
    for start in range(0, lt):
        st = lt - start
        # Candidate words have 2 to st - 1 letters, so a candidate never
        # includes the final character of the text.
        for length in range(2, st):
            end = start + length
            possibleWord = maybeplaintext[start:end]
            if possibleWord in words:
                wordGraph[start].append(end)
    # Ok, now we have a big graph of words.
    # How far can we get from the first letter, moving exclusively
    # through the English language?
    furthest = 0
    values = set([a for sublist in list(wordGraph.values()) for a in sublist])
    numberVertices = len(set(wordGraph.keys()).union(values))
    for i in range(2, lt):
        if isReachable(numberVertices, wordGraph, i):
            furthest = i
    return furthest/lt
    
# Here I use my modified version of the technique from:
# https://www.geeksforgeeks.org/
#   find-if-there-is-a-path-between-two-vertices-in-a-given-graph/
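def isReachable(numberVertices, wordGraph, end):
    # Breadth-first search from index 0 along word edges.
    # numberVertices is not used here; it is kept so callers need not change.
    visited = {0}
    queue = [0]
    while queue:
        n = queue.pop(0)
        # Reaching or passing `end` counts as reaching it.
        if n >= end:
            return True
        for i in wordGraph[n]:
            if i not in visited:
                queue.append(i)
                visited.add(i)
    return False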

And here is the I/O for the initial examples I gave:

In [5]: englishness('LETMEBEGINBYSAYINGTHANKS')
Out[5]: 0.9583333333333334

In [6]: englishness('UNGHSYINDJFHAKJSNFNDKUAJUD')
Out[6]: 0.07692307692307693

So then approximately speaking, I am 96% certain that LETMEBEGINBYSAYINGTHANKS is English, and 8% certain that UNGHSYINDJFHAKJSNFNDKUAJUD is English. Which sounds about right!

To extend this to much larger pieces of text, my suggestion would be to subsample random short substrings and check their "englishness". Hope this helps!
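A minimal sketch of that subsampling idea, assuming you pass in a scoring function such as englishness() above. The name sampled_score and the parameters sample_len, samples and seed are made up for this example, and their default values are arbitrary.

```python
import random

def sampled_score(text, score, sample_len=24, samples=20, seed=0):
    """Average score(substring) over random fixed-length substrings.

    `score` is any function that maps a string to a number,
    for example the englishness() function above.
    """
    # Short texts are scored directly, without sampling.
    if len(text) <= sample_len:
        return score(text)
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    total = 0.0
    for _ in range(samples):
        start = rng.randrange(len(text) - sample_len + 1)
        total += score(text[start:start + sample_len])
    return total / samples
```

Usage would look like sampled_score(long_text, englishness). Keep in mind that a window starting in the middle of a word will score low with englishness(), because the path has to begin at index 0, so treat the average as a rough signal only.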