I need to store millions of strings with common prefixes (they don't correspond to file system paths) in a Set-like structure in memory, and query the collection to see if a path exists.
e.g.
/path
/path/1
/path/2
/path/1/a
/path/1/b
I want to store these as efficiently as possible (they will be in memory). Given that there will be many common prefixes among the strings involved, would a Trie be a reasonable candidate?
I'm looking for a recommendation for an implementation of a suitable data structure in Java.
A Trie looks like the structure you need. Also consider Radix Tries, which, unlike plain tries, label edges with sequences of characters rather than single characters. I'm sure they behave better in your case, where the strings share quite a lot of prefixes.
see also ...
http://code.google.com/p/trie/
http://code.google.com/p/radixtree/
This looks like a good candidate implementation: https://github.com/rkapsi/patricia-trie
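If you want to see the basic idea before pulling in a library, a plain character-level trie is only a few dozen lines. This is a minimal sketch, not the API of any of the implementations linked above; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal character-level trie sketch. Each edge is a single character;
// a radix/PATRICIA trie would collapse single-child chains into one edge.
class TrieSet {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    void add(String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            n = n.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    boolean contains(String s) {
        Node n = root;
        for (int i = 0; i < s.length() && n != null; i++) {
            n = n.children.get(s.charAt(i));
        }
        return n != null && n.terminal;
    }
}
```

Note that with one `HashMap` per character this sketch will not save memory by itself; the libraries above use much denser node representations.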
Let's consider the tradeoffs before any suggestions.
You say you need to store "millions" of paths. I'll assume one million, because it makes the calculations easier (and even on servers, I haven't seen more than a million directories).
How long are those paths? You've shown an example with very short paths, so we're looking at maybe a hundred megabytes to store those million paths. I don't have a reference for maximum path lengths, but 256 characters sticks in my mind. At 2 bytes per Java char, your paths would take a maximum of 512 MB of memory. Do you have that much memory?
How evenly distributed are the pathnames? In other words, do you follow an 80:20 rule where 80% of the paths are found in 20% of the directories? The reason that I ask is because a trie structure needs some form of index between levels. If you have a lot of directories with only a few paths under them, you're going to have a lot of overhead to maintain a trie.
The recommendations: if I had enough memory, I'd use a HashSet<String> and be done with it.
If I didn't have a lot of memory, and a directory structure that followed the 80:20 rule (or, more likely, 95:5), I'd think of a HashMap<String,Set<String>>. The key of this map would be the longest leading path string that had a "reasonable" amount of duplication, and the values would be the remaining strings. You'd probe this map with progressively shorter leading components until you found a match, then probe the set for the remainder.
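The probing loop described above could look something like this. It's a sketch under the assumption that paths split on '/' boundaries; the class name and the policy for choosing where to split when populating the map are up to you.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Two-level lookup sketch: a map from leading path prefix to the set of
// remainders stored under it. Names are illustrative.
class PrefixSplitSet {
    private final Map<String, Set<String>> byPrefix = new HashMap<>();

    // Caller decides where to split each path into prefix + remainder.
    void add(String prefix, String remainder) {
        byPrefix.computeIfAbsent(prefix, k -> new HashSet<>()).add(remainder);
    }

    // Probe with progressively shorter leading components until a
    // prefix matches and its set contains the remainder.
    boolean contains(String path) {
        for (int cut = path.lastIndexOf('/'); cut > 0;
             cut = path.lastIndexOf('/', cut - 1)) {
            Set<String> remainders = byPrefix.get(path.substring(0, cut));
            if (remainders != null && remainders.contains(path.substring(cut))) {
                return true;
            }
        }
        return false;
    }
}
```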
That leaves open the question of "reasonable" duplication. It's the amount of duplication where the overhead of a two-piece data structure is overcome by the reduction in duplication. For example, /usr/bin/ might be a valid key (because it holds thousands of files and you save 9 characters, or 18 bytes, from each), but /usr/local/bin/ probably wouldn't be (at least on my system, it only holds a single file).
You could use a tree structure that mirrors the one on disk. However, remember that tree structures can spend as much or more memory on overhead as they save; they are not really designed to save memory.
If these files actually exist, perhaps you could rely on the disk subsystem's cache. It could be faster.
I would check whether you really need to do this, as you can store a million entries in a JVM quite comfortably. ;)
If you want to minimise memory consumption, you could compress the data in memory. This could be much smaller than any other option, but more complicated to make efficient.
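One way to compress in memory while exploiting the shared prefixes is front coding: sort the paths, then store for each entry only the number of leading characters it shares with its predecessor plus the new suffix. This is a sketch of the idea, not a production implementation; lookup here decodes linearly, whereas real implementations store periodic full entries so they can binary-search between them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Front-coding sketch: compact storage for sorted strings with long
// shared prefixes. Names are illustrative.
class FrontCodedSet {
    private final int[] shared;     // chars shared with previous entry
    private final String[] suffix;  // remaining characters of each entry

    FrontCodedSet(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        Collections.sort(sorted);
        shared = new int[sorted.size()];
        suffix = new String[sorted.size()];
        String prev = "";
        for (int i = 0; i < sorted.size(); i++) {
            String s = sorted.get(i);
            int k = commonPrefix(prev, s);
            shared[i] = k;
            suffix[i] = s.substring(k);
            prev = s;
        }
    }

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Linear decode: rebuild each entry from its predecessor and compare.
    boolean contains(String path) {
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < shared.length; i++) {
            current.setLength(shared[i]);
            current.append(suffix[i]);
            if (path.contentEquals(current)) return true;
        }
        return false;
    }
}
```

For the example paths above, most entries reduce to a two- or three-character suffix plus an int.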
What I would use:
- a multi-level map that resembles the directory structure.
- a balanced tree with single characters as keys and further trees as values.
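The first option, a multi-level map keyed by path component rather than by single characters, could be sketched like this (the names are illustrative, and I'm assuming '/'-separated paths):

```java
import java.util.HashMap;
import java.util.Map;

// One node per directory level, keyed by path component, so each
// component string is stored once no matter how many paths share it.
class PathTree {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        boolean terminal; // a stored path ends at this node
    }

    private final Node root = new Node();

    void add(String path) {
        Node n = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue; // skip the empty leading component
            n = n.children.computeIfAbsent(part, k -> new Node());
        }
        n.terminal = true;
    }

    boolean contains(String path) {
        Node n = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) continue;
            n = n.children.get(part);
            if (n == null) return false;
        }
        return n.terminal;
    }
}
```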
I would recommend storing the paths as they are, as Strings. I believe the overhead of trying to save memory will have the reverse effect.
Of course, it's simple enough to test whether it does, by benchmarking against the Trie data structures mentioned above.