I have a file on a Linux system that is roughly 10GB. It contains 20,000,000 binary records, but each record is separated by an ASCII delimiter "$". I would like to use the split command or some combination thereof to chunk the file into smaller parts. Ideally I would be able to specify that the command should split every 1,0开发者_开发百科00 records (therefore every 1,000 delimiters) into separate files. Can anyone help with this?
The only unorthodox part of the problem seems to be the record separator. I'm sure this is fixable in awk pretty simply - but I happen to hate awk
.
I would transfer it in the realm of 'normal' problems first:
tr '$' '\n' < large_records.txt | split -l 1000
This will by default create xaa
, xab
, xac
... files; look at man split
for more options
I love awk :)
BEGIN { RS="$"; chunk=1; count=0; size=1000 }
{
print $0 > "/tmp/chunk" chunk;
if (++count>=size) {
chunk++;
count=0;
}
}
(note that the redirection operator in awk only truncates/creates the file on its first invocation - subsequent references are treated as append operations - unlike shell redirection)
Make sure by default the unix split will exhaust with suffixes once it reaches max threshold of default suffix limit of 2. More info on : https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html
精彩评论