开发者

Is there a better way to calculate the frequency of all symbols in a file?

开发者 https://www.devze.com 2023-04-11 06:05 出处:网络
Okay, so, say I have a text file (not necessarily containing every possible symbol) and I\'d like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access

Okay, so, say I have a text file (not necessarily containing every possible symbol) and I'd like to calculate the frequency of each symbol and, after calculating the frequency, I then need to access each symbol and its frequency from most frequent to least frequent. The symbols are not necessarily ASCII characters, they could be arbitrary byte sequences, albeit all of the same length.

I was considering doing something like this (in pseudocode):

function add_to_heap (symbol)
    freq = heap.fi开发者_运维百科nd(symbol).frequency
    if (freq.exists? == true)
        freq++
    else
        symbol.freq = 1
        heap.insert(symbol)

MaxBinaryHeap heap
while somefile != EOF
    symbol = read_byte(somefile)
    heap.add_to_heap(symbol)
heap.sort_by_frequency()

while heap.root != empty
    root = heap.extract_root()
    do_stuff(root)

I was wondering: is there a better, simpler way to calculate and store how many times each symbol occurs in a file?


You can always use a HashMap isntead of the Heap. Like this you'll be performing operations that are in O(1) for each symbol found instead of O(log n) wheres n is the number of items currently on the heap.

However, if te number of distinct symbols is bounded by a reasonable number (1 Byte is ideal, 2 Byte should be still fine), you can just use an array of that size and again have O(1) but with a significantly lower constant cost.


If you're looking for a "best" solution based on running times, here's what I'd suggest:

When you're reading the file, you should have your symbols sorted (or hashed) by the value of the symbols themselves, not their frequencies. This'll let you find the current symbol in your list of already seen symbols quickly, rather than having to search through your entire list. You should also have that initial structure be able to perform fast inserts - I'd recommend a binary tree of a hash.

Once you've read all your symbols, then you should switch your ordering to be based on the frequency counts. I'd read everything into an array and then perform an in-place sort, but there are a bunch of equivalent ways to do this.

Hope this helps!

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号