Extracting words from domain

Developer https://www.devze.com 2023-04-07 12:16 Source: web

I have a bunch of domains I would like to explode into words. I downloaded a word list from wordlist.sourceforge.net and started writing a brute-force script that runs each domain through the dictionary list.

The problem is that I can't get it to produce good enough results. The simple script I wrote looks like this:

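$wd = [];
foreach ($domains as $dom) {
    foreach ($words as $w) {
        // stripos() returns the match offset or false, so compare strictly:
        // offset 0 is a valid match.
        if (stripos($dom, $w) !== false) {
            // Collect every dictionary word found anywhere in the domain.
            $wd[$dom][] = $w;
        }
    }
}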

$words is the dictionary array and $domains is just an array of domain names.

The results look like this:

[aheadsoftware] => Array
    (
        [0] => ahead
        [1] => head
        [2] => heads
        [3] => soft
        [4] => software
        [5] => ware
    )

Technically it works, but what I don't know how to code is the logic that makes the script understand that once it has matched 'ahead', 'head' and 'heads' are no longer available. It should also know to pick 'software' instead of 'soft' and 'ware'. Yes, I know, the world of linguistic computing is pure pain ;)


A naive solution: every time you have a match, before adding the word to the results, do another stristr lookup to see whether the word you are about to add is contained in any of the words already there. If it is, don't add it.

This would not work if, for example, the domain contains 'heads' and your dictionary lists 'head' first. You would probably rather have 'heads' in the results than 'head'.

You can get around that limitation by comparing lengths. If the word already in your results is longer, do not add the new word. If the new word is longer, remove the one already in the results and add the new one.
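
A minimal sketch of that idea, in case it helps. The function name extractWords is made up for illustration, and I use stripos for the containment checks (a word that contains another is necessarily the longer one, so the length comparison falls out of the substring test). Treat it as one possible reading of the approach, not a finished solution:

```php
<?php
// Hypothetical helper (not from the question): keep only the matches that
// are not contained in a longer match.
function extractWords(string $domain, array $words): array
{
    $kept = [];
    foreach ($words as $w) {
        // Skip dictionary words that do not occur in the domain at all.
        if ($w === '' || stripos($domain, $w) === false) {
            continue;
        }
        $covered = false;
        foreach ($kept as $i => $k) {
            if (stripos($k, $w) !== false) {
                // A kept word already contains the new one ('ahead' covers 'head').
                $covered = true;
                break;
            }
            if (stripos($w, $k) !== false) {
                // The new word is longer and contains a kept one: drop the shorter word.
                unset($kept[$i]);
            }
        }
        if (!$covered) {
            $kept[] = $w;
        }
    }
    // Reindex after any unset() calls.
    return array_values($kept);
}

$words = ['ahead', 'head', 'heads', 'soft', 'software', 'ware'];
print_r(extractWords('aheadsoftware', $words));
```

With the word list from the question this keeps 'ahead', 'heads' and 'software'. Note that 'heads' survives: it overlaps 'ahead' in the domain but is not contained in it, so a pure containment check does not catch that case. Handling overlaps would presumably need the match positions as well.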
