开发者

Counting common Bytes, Words and Double Words

开发者 https://www.devze.com 2022-12-31 14:37 出处:网络
I am scanning over a large amount of data and looking for 开发者_JS百科common trends in it.Every time I meet a recurrence of a unit, I want to increment the count of it. What is the best data structur

I am scanning over a large amount of data and looking for 开发者_JS百科common trends in it. Every time I meet a recurrence of a unit, I want to increment the count of it. What is the best data structure or way to hold this data. I need to be able to search it quickly, and also have a count with each unit of data.


You didn't specify a language, but a hash (associative array) is your best data structure.

It can sometimes be called a map/hashmap depending on a language (Java has HashMaps, Perl hash hashes, .

A hash/associative array/map data structure consists of a set of key-value pairs, with the values settable/gettable by the key. In you case, the key will be a string representing a word, a byte, or a double word (separate 3 hashmaps) and the value will be the count of the frequency.


Dictionary/Hash table would be the best if you need fast look up.


As has been mentioned, dictionaries/hash tables are your best bet. But your question is a little clear and I noticed that you mentioned compression in your tags; you may want to look at Huffman trees also.


As others have noted, a hash is an obvious candidate for your data structure.

For development and testing purposes, however, I would want that structure to be richer than a simple tally for each matched item. Rather, I would want store information that could be used to confirm the correctness of the code.

For starters, that information might contain the line number and some indication of the position where the match occurred. Here is an illustration in Perl:

use strict;
use warnings;

my %regexes= (
    rep_letter => qr/ ([a-z])         (\1   )+ /x,
    rep_word   => qr/ (\b \w+ \b) \W* (\1\W*)+ /x,
    doub_word  => qr/ (\b \w+   ) \W+  \1      /x,
);

my %ds;

while (my $line = <>){
    for my $r (keys %regexes){
        while ( $line =~ /$regexes{$r}/g ){
            # Data structure:
            #   $ds{REGEX_TYPE}{REPEATED_ITEM} = [
            #       [LINE_NO, pos_VALUE_OF_MATCH],
            #       etc. for each match
            #   ]
            #
            # For example:
            #   $ds{rep_word}{foo} = [
            #       [ 3, 11],
            #       [12, 88],
            #       ...
            #   ]
            push @{$ds{$r}{$1}}, [$., pos($line)];
        }
    }
}
0

精彩评论

暂无评论...
验证码 换一张
取 消