I have a file containing ~20 million lines (~1.5GB). Each line is of the form:
entry_1 entry_2 entry_3 ......... entry_5
The file contains duplicates but of the form:
entry_2 entry_1 entry_3 ......... entry_5
The contents of some lines are identical but the first two elements are often (possibly always) switched.
Does anyone have any suggestions on how to remove duplicates of this nature from a file of this size?
Thanks.
A suitable solution would depend on what constraints you have and how often you need to run this operation.
If this is a one-time (or infrequent) operation and if memory usage is not a big issue, something like this would be sufficient:
visited = set()  # use a set for faster lookups
with open(out_filename, "w") as outfile:
    with open(in_filename, "r") as infile:
        for line in infile:
            x = line.split()
            # normalize the first two entries so swapped duplicates map to the same key
            k = (tuple(sorted(x[:2])), tuple(x[2:]))
            if k not in visited:
                outfile.write(line)
                visited.add(k)
The memory usage depends on the number of unique entries we need to keep track of in visited. If there aren't many duplicates, you'll end up with almost all the data in memory.
If memory usage becomes an issue, you can do this in multiple stages:
- First pre-process the file by sorting the first 2 elements in each line.
- Sort the whole file by lines
- Removing duplicates is now straightforward, since duplicate entries will appear together.
Steps 2 and 3 can be merged, since you can simply discard duplicates when comparing entries while performing the sort.
If you don't mind using the shell, steps 2 and 3 can be done using sort -u yourfile.
Note that this changes the order of the lines in the file (which you mentioned is not an issue).
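As a rough sketch of the pre-processing step in Python (assuming single-space-separated fields; input.txt and normalized.txt are placeholder file names):

    # Stage 1: put the first two entries of every line into a canonical
    # (sorted) order so that swapped duplicates become byte-identical lines.
    with open("input.txt") as infile, open("normalized.txt", "w") as outfile:
        for line in infile:
            fields = line.split()
            fields[:2] = sorted(fields[:2])
            outfile.write(" ".join(fields) + "\n")

Running sort -u normalized.txt > deduped.txt on the result then covers stages 2 and 3 in one pass.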
To drastically reduce memory usage at the expense of some performance, you can use a file-based db to store and look up visited entries (in place of set()).
You can speed this up by storing a hash of the entries in memory, and only querying the db when a hash matches, to confirm whether the entries are indeed the same. The hash can be as simple as taking the first character of each entry, or the built-in hash() function, or an existing hash algorithm. Each method is a compromise between performance, hash size, and collision frequency. A good choice would depend on your data and your constraints.
It will take some effort to find an optimal solution, and it is only worth embarking on if you need to perform this operation often.
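As an illustration, here is a minimal sketch of that idea using Python's dbm module as the file-based store and the built-in hash() as the in-memory fingerprint; the file names, and the choice of dbm, are assumptions rather than part of the suggestion above:

    import dbm

    seen_hashes = set()                      # cheap in-memory fingerprints
    with dbm.open("visited.db", "n") as db:  # on-disk store of the full keys
        with open("in.txt") as infile, open("out.txt", "w") as outfile:
            for line in infile:
                fields = line.split()
                key = " ".join(sorted(fields[:2]) + fields[2:])
                h = hash(key)
                # only consult the on-disk store when the in-memory check says
                # the key might have been seen (hash collisions are possible)
                if h in seen_hashes and key in db:
                    continue
                db[key] = ""                 # record the full key on disk
                seen_hashes.add(h)
                outfile.write(line)

A set of integer hashes is far smaller than a set of full keys, and the db is only touched on potential matches.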
separator = " "
aSet = set()
with open('myfile', 'rU') as infile:
for line in infile:
tempLine = line.split(separator)
tempLine = tempLine[1:2] + tempLine[0:1] + tempLine[2:]
tempLine = separator.join(tempLine)
if line not in aSet and tempLine not in aSet:
aSet.add(line)
Now, aSet contains the list of lines that are unique regardless of whether entry_1 and entry_2 have been swapped.
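If you also want to write the surviving lines back out, something like the following would do it (myfile_dedup is just an assumed output name, and the original line order is not preserved):

    with open('myfile_dedup', 'w') as outfile:
        outfile.writelines(aSet)  # each element of aSet is a full line, newline included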
EDIT: If any of the entries can be swapped and the line is still considered a duplicate, then:
separator = " "
aSet = set()
with open('myfile', 'rU') as infile:
for line in infile:
aSet.add(frozenset(line.split(separator)))
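As written, this only collects the keys; a sketch that also keeps one line per key could look like the following (myfile_dedup is an assumed output name, and note that a frozenset also ignores repeated entries within a line):

    separator = " "
    seen = set()
    with open('myfile', 'r') as infile:
        with open('myfile_dedup', 'w') as outfile:
            for line in infile:
                key = frozenset(line.split(separator))
                if key not in seen:
                    seen.add(key)
                    outfile.write(line)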