How can I cat multiple files together into one without intermediary file? [closed]


Here is the problem I'm facing:

  • I am doing string processing on a text file that is ~100G in size.
  • I'm trying to improve the runtime by splitting the file into many hundreds of smaller files and processing them in parallel.
  • In the end I cat the resulting files back together in order.

The file read/write time itself takes hours, so I would like to find a way to improve the following:

cat file1 file2 file3 ... fileN >> newBigFile

  1. This requires double the disk space: file1 ... fileN take up 100G, then newBigFile takes another 100G, and only then do file1 ... fileN get removed.

  2. The data is already in file1 ... fileN; doing the cat >> incurs read and write time when all I really need is for the hundreds of files to reappear as 1 file...


If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do

$ consume big-file.txt

instead do

$ consume <(cat file1 file2 ... fileN)

This uses Unix process substitution, sometimes also called "anonymous named pipes."
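
If consume can read from standard input instead of needing a filename, a plain pipe does the same job:

$ cat file1 file2 ... fileN | consume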

You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that will do precisely this. It can also reassemble the outputs back into one big file, potentially using less scratch space as it only needs to keep number-of-cores pieces on disk at once. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
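
A hedged sketch of that approach, where ./process-chunk stands in for whatever per-chunk filter you actually run (--block sets the chunk size and -k keeps the output in input order):

$ cat big-file.txt | parallel --pipe --block 100M -k ./process-chunk > newBigFile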


When concatenating files back together, you could delete the small files as they get appended:

for file in file1 file2 file3 ... fileN; do
  cat "$file" >> bigFile && rm "$file"
done

This would avoid needing double the space.

There is no other way of magically concatenating the files. The filesystem API simply doesn't have a function that does that.


Maybe dd would be faster because you wouldn't have to pass stuff between cat and the shell. Something like:

mv file1 newBigFile
dd if=file2 of=newBigFile bs=1M conv=notrunc oflag=seek_bytes seek=$(stat -c %s newBigFile)

Note that dd's seek= counts output blocks (512 bytes by default), not bytes, so GNU dd's oflag=seek_bytes is needed for the stat size to be treated as a byte offset; conv=notrunc keeps dd from truncating the existing contents.
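
Extending that to all of the pieces, one possible sketch (GNU dd; oflag=append appends without recomputing the offset, and each piece is removed as soon as it has been copied, so little extra space is needed):

mv file1 newBigFile
for f in file2 file3 ... fileN; do
  dd if="$f" of=newBigFile bs=1M conv=notrunc oflag=append && rm "$f"
done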


I believe this is the fastest way to cat all the files contained in the same folder:

$ cd [path to folder] && ls | while IFS= read -r p; do cat "$p"; done
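
Assuming the filenames sort lexicographically into the order you want, a plain glob does the same thing in a single command:

$ cat [path to folder]/* > newBigFile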


all I really need is for the hundreds of files to reappear as 1 file...

The reason it isn't practical to just join files that way at the filesystem level is that text files don't usually fill a disk block exactly, so the data in the subsequent files would have to be moved up to fill in the gaps, causing a bunch of reads and writes anyway.


Is it possible for you to simply not split the file? Instead, process the file in chunks by setting the file pointer in each of your parallel workers. If the file needs to be processed in a line-oriented way, that makes it trickier, but it can still be done. Each worker needs to understand that rather than starting at the offset you give it, it must first seek byte by byte to the next newline + 1. Each worker must also understand that it does not process only the set amount of bytes you give it, but must process up to the first newline after the set amount of bytes it is allocated.

The actual allocation and setting of the file pointer is pretty straightforward. If there are n workers, each one processes file_size/n bytes, and worker number i starts with its file pointer at i * (file_size/n).
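
A minimal bash sketch of that plan, assuming GNU stat, a file that ends with a newline, and single-byte line terminators; the worker.sh name, the arguments, and the printf are placeholders for your real per-line processing:

#!/usr/bin/env bash
# worker.sh (hypothetical name): argument 1 is this worker's 0-based index,
# argument 2 is the total number of workers. The file is never split on disk;
# byte ranges are aligned to newlines so each line is handled by exactly one worker.
export LC_ALL=C                      # count bytes, not multibyte characters
i=$1 n=$2 file=big-file.txt

size=$(stat -c %s "$file")           # file size in bytes (GNU stat)
chunk=$(( size / n ))
start=$(( i * chunk ))
end=$(( i == n - 1 ? size : (i + 1) * chunk ))

exec 3< <(tail -c +$(( start + 1 )) "$file")   # stream beginning at our nominal offset
pos=$start

if (( i > 0 )); then
  # The line we land in the middle of belongs to the previous worker: skip it.
  IFS= read -r line <&3
  pos=$(( pos + ${#line} + 1 ))
fi

# Process whole lines until we have consumed the first newline at or after 'end';
# the next worker skips that same stretch at its own 'start'.
while (( pos <= end )) && IFS= read -r line <&3; do
  printf '%s\n' "$line"              # replace with the real per-line processing
  pos=$(( pos + ${#line} + 1 ))
done

exec 3<&-

Run one copy per core in the background and wait for them all; each worker writes its own output file, which you can then read back in order.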

Is there some reason that kind of plan is not sufficient?


A fast, but not free, solution? Get an SSD or PCIe-based flash storage. If this is something that has to be done on a regular basis, increasing disk I/O speed is going to be the most cost-effective and fastest speedup you can get.


There is such a thing as too much concurrency.

A better way of doing this would be to use random-access reads into the file over the desired ranges, never actually split it up, and run only as many workers as there are physical CPUs/cores in the machine. That is, unless that is also swamping the disk with IOPS, in which case you should cut back until the disk isn't the bottleneck.

Either way, all the naive splitting/copying/deleting generates tonnes of IOPS, and there is no way around the physics of it.

A transparent solution, which would probably be more work than it is worth unless this is an ongoing daily issue, is to write a custom FUSE filesystem that represents a single file as multiple files. There are lots of examples of presenting the contents of archive files as individual files, which would show you the basics of how to do this.
