Let's say I have 5 files: f1, f2, f3, f4, f5. How can I remove the common strings (lines of text that appear in all of the files) from all 5 files and put them into a 6th file, f6? Please let me know.
Format of the files:
property.a.p1=some string
property.b.p2=some string2
.
.
.
property.zzz.p4=123455
So if the above is an excerpt from file 1, and files 2 to 5 also contain the line property.a.p1=some string, then I'd like to remove that line from files 1 to 5 and put it in file 6. Each string is on its own line, so I would be comparing the files line by line. Each file is around 400 to 600 lines.
I found this on a forum for handling common strings in two files using Ruby; as written, it prints the lines of file2 that also appear in file1:
$ ruby -ne 'BEGIN {a=File.read("file1").split(/\n+/)}; print $_ if a.include?($_.chomp)' file2
See if this does what you want. It's a "2-pass" solution: the first pass uses a hash table to find the common lines, and the second uses that to filter out any lines that match them.
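# Collect the input files
$files = @(gci "file1.txt","file2.txt","file3.txt","file4.txt","file5.txt")
$hash = @{}
$common = new-object system.collections.arraylist

# Pass 1: count how many times each line occurs across all files
# (assumes a line is not repeated within a single file)
foreach ($file in $files) {
    get-content $file | foreach {
        $hash[$_] ++
    }
}

# A line seen once per file is common to all of them
$hash.keys |% {
    if ($hash[$_] -eq $files.count){[void]$common.add($_)}
}
$common | out-file common.txt

# Pass 2: write each file without the common lines
[regex]$common_regex = '^(' + (($common |foreach {[regex]::escape($_)}) -join "|") + ')$'
foreach ($file in $files) {
    $new_file = get-content $file |? {$_ -notmatch $common_regex}
    $new_file | out-file "new_$($file.name)"
}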
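Since the question mentions Ruby, here is a minimal sketch of the same two-pass idea in that language. The names common_lines and split_common are made up for illustration, and the "new_" prefix simply mirrors the script above. Unlike the hashtable version, this sketch compares lines exactly (case-sensitively) and counts a line only once per file even if it is repeated there.

```ruby
require "set"

# Pass 1: return the set of lines that occur in every one of the line lists.
def common_lines(line_lists)
  counts = Hash.new(0)
  line_lists.each do |lines|
    # uniq so that a line repeated inside one file is counted once for that file
    lines.uniq.each { |line| counts[line] += 1 }
  end
  counts.select { |_line, n| n == line_lists.size }.keys.to_set
end

# Pass 2: write the common lines to common_path and write each input file
# as new_<name> (next to the original) without those lines.
def split_common(paths, common_path)
  line_lists = paths.map { |path| File.readlines(path, chomp: true) }
  common = common_lines(line_lists)
  File.write(common_path, common.map { |line| "#{line}\n" }.join)
  paths.zip(line_lists).each do |path, lines|
    kept = lines.reject { |line| common.include?(line) }
    out = File.join(File.dirname(path), "new_#{File.basename(path)}")
    File.write(out, kept.map { |line| "#{line}\n" }.join)
  end
  common
end
```

With the five files from the question this would be called as split_common(%w[f1 f2 f3 f4 f5], "f6"). The originals are left untouched, so the new_ files can be checked before anything is replaced.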
Create a table in an SQL database like this:
create table properties (
file_name varchar(100) not null, -- Or whatever sizes make sense
prop_name varchar(100) not null,
prop_value varchar(100) not null
)
Then parse your files with some simple regular expressions or even just split:
# the limit of 2 keeps any '=' inside the value intact
prop_name, prop_value = line.strip.split('=', 2)
dump the parsed data into your table, and do a bit of SQL to find the properties that are common to all files:
select prop_name, prop_value
from properties
group by prop_name, prop_value
having count(*) = $n
Where $n is replaced by the number of input files. Now you have a list of all the common properties and their values, so write those to your new file, remove them from your properties table, and then spin through all the rows that are left in properties and write them to the appropriate files (i.e. the file named by the file_name column).
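To make the grouping logic concrete without setting up a database, here is a rough in-memory Ruby sketch of what the query and the follow-up steps compute. It is a stand-in for the SQL, not a replacement for it. parse_property and partition_rows are hypothetical names, and each row is assumed to be a [file_name, prop_name, prop_value] triple with at most one row per property per file.

```ruby
# Split "name=value" at the first "=" only, so values may contain "=".
def parse_property(line)
  name, value = line.strip.split("=", 2)
  [name, value]
end

# rows: array of [file_name, prop_name, prop_value] triples.
# Returns [common, remaining]:
#   common    - [prop_name, prop_value] pairs present in all file_count files
#   remaining - hash of file_name => rows that are left for that file
def partition_rows(rows, file_count)
  # group by prop_name, prop_value ... having count(*) = file_count
  groups = rows.group_by { |_file, name, value| [name, value] }
  common = groups.select { |_pair, members| members.size == file_count }.keys
  # "remove them from the table", then bucket what is left by file_name
  remaining = rows.reject { |_file, name, value| common.include?([name, value]) }
  [common, remaining.group_by(&:first)]
end
```

Each pair in common would become one name=value line in the new file, and each entry in remaining would be written back to the file it is keyed by.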
You say that the files are "huge", so you probably don't want to slurp all of them into memory at the same time. You could do multiple passes and use a hash-on-disk library to keep track of what has been seen and where, but that would be a waste of time if you have an SQL database around, and everyone should have at least SQLite kicking around. Managing large amounts of structured data is what SQL and databases are for.