
Calculating probability that a string has been randomized? - Python


This is related to a question I asked earlier (question).

I have a list of manually created strings such as:

lucy87

gordan_king

fancy_unicorn77

joplucky_kanga90

base_belong_to_narwhals

and a list of randomized strings:

johnkdf

pancake90kgjd

fancy_jagookfk

manhattanljg


What gives away that the last set of strings is randomized is that they contain sequences such as 'kjg', 'jgf', 'lkd', and so on.

Is there any clever way I could separate the strings containing these apparently random sequences from the crowd?

I guess this relies a lot on the fact that certain characters are more likely to be placed next to others (e.g. 'co', 'ka', 'ja', ...).


Any ideas on this one? Kylotan mentioned Reverend, but I am not sure if it can be used for such a purpose.

Help would be much appreciated!


This is just a thought. I've never tried it myself...

Build a bloom filter from hashing every (overlapping) 4-letter sequence found in a dictionary. Test a string by counting how many 4-letter sequences in the string don't hit the filter. The more misses, the more likely it is that the word contains random junk.

Try tuning the size of the bloom filter and the number of letters per sequence.

Also note (thanks @MihaiD) that you should include a dictionary of names, preferably from multiple languages, in the bloom filter to minimise false positives.
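
A minimal sketch of that idea in Python, assuming a plain word list is available at /usr/share/dict/words (swap in whatever dictionary and name lists you have); the filter size, hash count, and 4-letter window here are untuned guesses:

```python
import hashlib

class BloomFilter:
    """Very small Bloom filter; size and hash count are untuned guesses."""
    def __init__(self, size=1 << 20, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted MD5 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + item).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def ngrams(word, n=4):
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# Fill the filter with every overlapping 4-letter sequence from the word list.
bloom = BloomFilter()
with open('/usr/share/dict/words') as f:
    for word in f:
        for gram in ngrams(word.strip().lower()):
            bloom.add(gram)

def miss_ratio(s):
    """Fraction of 4-letter sequences in s that never hit the filter."""
    grams = ngrams(''.join(c for c in s.lower() if c.isalpha()))
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in bloom) / len(grams)

print(miss_ratio('base_belong_to_narwhals'))  # expect a low ratio
print(miss_ratio('pancake90kgjd'))            # expect a higher ratio
```

You would then pick a threshold on the miss ratio by looking at scores for known-good and known-random strings.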


What scores do you get if you run the strings through something like TextCat? (I've seen a few different implementations of TextCat; maybe there's a Python one already out there; if not, it's not a hard algorithm to implement -- it's the data that's important.)

I'm thinking that if you strip the numbers out, the first set of strings will be closer to the "English" result in TextCat than the ones with random stuff in them.

How much closer and whether you might be able to use the TextCat data -- which is fundamentally based on which letters tend to be next to each other in particular languages -- to "pass" or "fail" a string is going to need some experimentation, though...
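
For illustration, here is a toy version of the TextCat-style "out-of-place" comparison; english_sample.txt stands in for whatever English reference text you build the profile from, and real TextCat profiles mix 1- to 5-grams from much larger corpora:

```python
import re
from collections import Counter

def profile(text, max_n=3, top=300):
    """Rank the most frequent 1- to max_n-letter grams of text (TextCat-style)."""
    text = re.sub(r'[^a-z]', ' ', text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if ' ' not in gram:
                counts[gram] += 1
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(reference, s):
    """Average rank displacement of s's grams vs. the reference profile.

    Smaller means closer to the reference language."""
    cand = profile(s)
    penalty = len(reference)  # grams unseen in the reference get the maximum penalty
    if not cand:
        return float(penalty)
    return sum(abs(reference.get(g, penalty) - r) for g, r in cand.items()) / len(cand)

# Build an English reference profile from whatever English text you have handy.
with open('english_sample.txt') as f:
    english = profile(f.read())

for s in ['gordan_king', 'fancy_unicorn77', 'pancake90kgjd', 'manhattanljg']:
    print(s, out_of_place(english, s))
```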


Try using a vanilla Bayes classifier. It should be enough for the general case.
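
As a rough illustration (plain Python, not the Reverend API), a naive Bayes classifier over character bigrams could look like this, assuming you label a small training set of human and random strings yourself; the strings from the question are used here only as placeholders:

```python
import math
from collections import Counter

def bigrams(s):
    letters = ''.join(c for c in s.lower() if c.isalpha())
    return [letters[i:i + 2] for i in range(len(letters) - 1)]

class BigramNaiveBayes:
    """Naive Bayes over character bigrams with add-one smoothing."""
    def __init__(self):
        self.counts = {'human': Counter(), 'random': Counter()}
        self.totals = {'human': 0, 'random': 0}

    def train(self, label, strings):
        for s in strings:
            for b in bigrams(s):
                self.counts[label][b] += 1
                self.totals[label] += 1

    def log_score(self, s, label):
        # Sum of smoothed log-likelihoods of each bigram under the label.
        total = self.totals[label]
        vocab = len(self.counts[label]) + 1
        return sum(math.log((self.counts[label][b] + 1) / (total + vocab))
                   for b in bigrams(s))

    def classify(self, s):
        return max(('human', 'random'), key=lambda label: self.log_score(s, label))

clf = BigramNaiveBayes()
clf.train('human', ['lucy87', 'gordan_king', 'fancy_unicorn77'])
clf.train('random', ['johnkdf', 'pancake90kgjd', 'manhattanljg'])
print(clf.classify('base_belong_to_narwhals'))
print(clf.classify('fancy_jagookfk'))
```

With a training set this tiny the results will be noisy; in practice you would train on many more labelled examples.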


It seems to me like you are trying to write code to recognize one particular set of small tweaks a spammer makes to a string to get past your filters. What I don't see is what is stopping them from, after all your hard work, making a 10-second tweak to their algorithm and defeating your new filter.


Some time ago I read a short article about random name generation, where they did the following: they built up a table containing exactly the information you already pointed out: "certain characters are more likely to be placed next to others".

So what they did was read a whole dictionary and determine which letters were more likely to be placed next to each other. I do not know how many letters in a row they considered. Maybe you should try more than just two consecutive letters, say something between 3 and 6.

Now I suggest you build up such a table (maybe in a better data structure), containing all "valid" consecutive letter combinations (and maybe their likelihood), and check whether the name being tested consists (almost) only of such "valid" combinations, as in the sketch below.
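
A rough Python sketch of that table idea, assuming a word list at /usr/share/dict/words and using pairs of consecutive letters; extending it to windows of 3-6 letters, or using the counts as probabilities rather than a yes/no check, follows the same pattern and should be more discriminating:

```python
from collections import Counter

# Count how often each pair of consecutive letters appears in the word list.
pair_counts = Counter()
with open('/usr/share/dict/words') as f:
    for word in f:
        word = word.strip().lower()
        for i in range(len(word) - 1):
            pair_counts[word[i:i + 2]] += 1

def valid_pair_ratio(name):
    """Fraction of consecutive letter pairs in name that the dictionary has seen."""
    letters = ''.join(c for c in name.lower() if c.isalpha())
    pairs = [letters[i:i + 2] for i in range(len(letters) - 1)]
    if not pairs:
        return 1.0
    return sum(1 for p in pairs if pair_counts[p] > 0) / len(pairs)

for name in ['joplucky_kanga90', 'base_belong_to_narwhals',
             'fancy_jagookfk', 'manhattanljg']:
    print(name, valid_pair_ratio(name))
```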
