Check for unique line data from file with 5 millions lines in Java_问答_开发者

Check for unique line data from file with 5 millions lines in Java

开发者 https://www.devze.com 2023-03-17 20:53 出处：网络

I have big file with row like ID|VALUE in one pass. In case of ID repeat, line must be ignored. How to effectively make this checking?

I have big file with row like ID|VALUE in one pass.

In case of ID repeat, line must be ignored.

How to effectively make this checking?

added: ID is long(8 bytes). I need a solut开发者_C百科ion that uses minimum of memory.

Thank's for help guys. I was able to increase heap space and use Set now.

You can store the data in a TLongObjectHashMap or use a TLongHashSet. These classes store primitive based information efficiently.

5 million long values will use < 60 MB in a TLongHashSet however a TLongObjectHashMap will also store your values efficiently.

To find out more about these classes

http://www.google.co.uk/search?q=TLongHashSet

http://www.google.co.uk/search?q=TLongObjectHashMap

You'll have to store ID's somewhere anyway in order to detect duplicates. Here I'd use a HashSet<String> and its contains method.

You have to read the entire file, one line at a time. You have to keep a Set of IDs and compare the incoming one to the values already in the Set. If a value appears, skip that line.

You wrote the use case yourself; there's no magic here.

This looks like a typical database task to me. If you have a database used in your app, you could utilize that to do your task. Create a table with a UNIQUE INTEGER field and start adding rows; you'll get an exception on the duplicated IDs. The database engine will take care of cursor windowing and caching so it fits in your memory budget. Then just drop that table when you're done.

There are two basic solutions;

First, as suggested by duffymo and Andreas_D above you can store all the values in a Set. This gives you O(n) time complexity and O(n) memory usage.

Second, if O(n) memory is too much, you can do it in O(1) memory by sacrificing speed. For each line in the file, read all other lines before it and discard if the ID appears before the current line.

What about probabilistic algorithms?

The Bloom filter ... is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not.