How to guess the nationality of a person from the surname?

Developer https://www.devze.com 2023-04-09 06:24 Source: Internet
What approach can I use to predict the nationality of a person from the surname?

I have a huge list of texts and the surnames of their authors. I would like to identify which texts were written by native speakers of Latin (Romance) languages and which were written by native English speakers, in order to study whether certain writing style patterns differ between the two groups.

I have looked in Google and in PubMed for a database of surnames, but I could not find any that is accessible for free. Another approach is to use some regexes, for example ".*ez" to identify some Hispanic surnames such as 'rodriguez', but that doesn't get me very far.
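For illustration, the regex idea above could be sketched roughly as follows in Python. The suffix list and the names `LATIN_SUFFIXES` and `guess_group` are assumptions made up for this example, not a validated resource; the result only marks a surname as a candidate for manual review and says nothing about the person's actual native language.

```python
import re

# Hypothetical, hand-picked suffixes (assumption): a few endings that are
# common in Spanish, Italian and Portuguese surnames. Not exhaustive.
LATIN_SUFFIXES = re.compile(r".*(?:ez|elli|eira|eiro)", re.IGNORECASE)


def guess_group(surname: str) -> str:
    """Return 'latin-candidate' if the surname ends with a listed suffix,
    otherwise 'unknown'. This is only a first-pass filter."""
    if LATIN_SUFFIXES.fullmatch(surname.strip()):
        return "latin-candidate"
    return "unknown"


if __name__ == "__main__":
    for name in ["rodriguez", "Ferreira", "Smith"]:
        print(name, guess_group(name))
```

Anything tagged 'unknown' is simply not covered by the patterns, so it still has to be checked by hand.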

Do you have any suggestions? Since I will manually revise all the associations after making the prediction, I don't need great accuracy, but any help or idea is welcome.


I don't think you can do this with any degree of reliability. A Rodriguez may well have a name of Spanish origin, but could have been born and brought up anywhere. They could be second-generation British, never have had Spanish spoken around them, and so fall into the category of native English speaker.


If these are actual authors, then maybe you can spider Amazon and check their 'Author information' details?

I don't think you can guess. Take Irish last names, for example: there are an estimated 80,000,000 people with Irish heritage, but only 4.5 million of them live in Ireland or went through Irish education.


There is no meaningful way to do this. There is no reason why people with Hispanic names cannot be native English speakers.

If you are going to revise it anyway, why not use the data you have?


Assuming you intend to do a programmatic comparison of the texts, you have to categorize the texts manually. Incorrect guesses would likely lead you to build a broken algorithm for textual analysis. This is especially problematic with machine learning approaches, such as artificial neural networks.
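One way to see how much incorrect guesses might hurt is to label a small sample by hand and compare it with the guessed labels. The sketch below assumes two equal-length lists of labels; the function name `disagreement_rate` and the label strings are made up for this example.

```python
def disagreement_rate(guessed, manual):
    """Fraction of items where the guessed label differs from the manual one.

    Both arguments are equal-length sequences of labels, one per text.
    """
    if len(guessed) != len(manual):
        raise ValueError("guessed and manual must have the same length")
    if not guessed:
        raise ValueError("at least one labelled item is required")
    mismatches = sum(g != m for g, m in zip(guessed, manual))
    return mismatches / len(guessed)


if __name__ == "__main__":
    guessed = ["latin", "english", "latin", "english"]
    manual = ["latin", "english", "english", "english"]
    print(disagreement_rate(guessed, manual))
```

If the rate on the hand-checked sample is high, the guessed labels are probably too noisy to train or evaluate anything on.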
