Problem: sometimes we have to interleave multiple streams into one, in which case it's necessary to provide a way to identify block boundaries within a stream. What kind of format would be good for such a task? (All processing has to be purely sequential, and i/o operations are blockwise and aligned.)
From the decoding side, the best approach is to have length prefixes for blocks. But on the encoding side, that requires either random access to the output file (seek back to the stream start and write the header) or the ability to cache whole streams, which is, in general, impossible.
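For reference, the straightforward length-prefix framing looks something like this (a sketch; the function names and the 4-byte little-endian prefix are my assumptions). It shows why the decoder side is trivial - and also why the encoder needs the whole block in memory before it can write anything:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: frame one block as a 4-byte little-endian length prefix
   followed by the payload. Returns bytes written, or 0 if the output
   buffer is too small. Note the encoder must already know len - i.e.
   the whole block has to be cached before the prefix can be emitted. */
size_t frame_block(const uint8_t *src, uint32_t len,
                   uint8_t *dst, size_t dst_cap) {
    if (dst_cap < 4 + (size_t)len) return 0;
    dst[0] = (uint8_t)(len      );
    dst[1] = (uint8_t)(len >>  8);
    dst[2] = (uint8_t)(len >> 16);
    dst[3] = (uint8_t)(len >> 24);
    memcpy(dst + 4, src, len);
    return 4 + (size_t)len;
}

/* Decoder side: read the prefix and know exactly how much to consume. */
uint32_t read_len(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```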
Alternatively, we can add length headers (+ some flags) to blocks of cacheable size. It's certainly a solution, but handling is much more complex than [1], especially at encoding (presuming that i/o operations are done with aligned fixed-size blocks). Well, one possible implementation is to write a 0 byte into the buffer, then stream data until it's filled. So prefix byte = 0 would mean that there are bufsize-1 bytes of stream data next, and !=0 would mean that there's less... in which case we would be able to insert another prefix byte if end-of-stream is reached. This would only work with bufsize=32k or so, because otherwise the block length would require 3+ bytes to store, and there would be a problem with handling the case where end-of-stream is reached with only one byte of free space left in the buffer. (One solution would be storing 2-byte prefixes in each buffer and adding a 3rd byte when necessary; another is to provide a 2-byte encoding for some special block lengths like bufsize-2.) Either way it's not so good, because even 1 extra byte per 64k accumulates to a noticeable number with large files (1526 bytes per 100M). Also, hardcoding the block size into the format is bad.
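The two-byte-prefix variant mentioned above could be realized roughly like this (a sketch; the exact layout, names, and padding convention are my assumptions, and the flags are omitted):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUFSIZE 32768  /* assumed block size; a 2-byte prefix suffices */

/* Hypothetical sketch of the two-byte-prefix variant: each BUFSIZE
   block starts with a 16-bit payload length; a full block carries
   BUFSIZE-2 payload bytes, a short final block carries less, and the
   remainder is zero padding so i/o stays block-aligned. */
size_t emit_block(const uint8_t *src, size_t len, uint8_t blk[BUFSIZE]) {
    size_t n = len < BUFSIZE - 2 ? len : BUFSIZE - 2;
    blk[0] = (uint8_t)(n & 0xFF);
    blk[1] = (uint8_t)(n >> 8);
    memcpy(blk + 2, src, n);
    memset(blk + 2 + n, 0, BUFSIZE - 2 - n);  /* pad the aligned block */
    return n;  /* payload bytes consumed from src */
}
```

The 2 bytes per 32k of overhead is exactly the kind of accumulation complained about above.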
Escape prefix. E.g. EC 4B A7 00 = EC 4B A7, EC 4B A7 01 = end-of-stream. Now this is really easy to encode, but decoding is pretty painful - it requires a messy state machine even to extract single bytes. But overall it adds the least overhead, so it seems that we still need to find a good implementation for buffered decoding.
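The encoder for this scheme really is trivial - something like the following sketch (the function name and buffer handling are my assumptions; only the EC 4B A7 marker and the 00/01 codes come from the description above):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Escape-prefix encoder sketch: whenever the 3-byte marker EC 4B A7
   appears in the payload, append a 00 so the decoder can tell it from
   a control code; a bare marker followed by 01 means end-of-stream.
   Returns bytes written, or 0 if dst is too small. */
size_t escape_encode(const uint8_t *src, size_t len,
                     uint8_t *dst, size_t cap) {
    static const uint8_t marker[3] = {0xEC, 0x4B, 0xA7};
    size_t o = 0, match = 0;
    for (size_t i = 0; i < len; i++) {
        if (o >= cap) return 0;
        dst[o++] = src[i];
        match = (src[i] == marker[match]) ? match + 1
                                          : (src[i] == marker[0]);
        if (match == 3) {            /* marker occurred in the payload */
            if (o >= cap) return 0;
            dst[o++] = 0x00;         /* escape it */
            match = 0;
        }
    }
    if (o + 4 > cap) return 0;       /* terminate: bare marker + 01 */
    dst[o++] = 0xEC; dst[o++] = 0x4B; dst[o++] = 0xA7; dst[o++] = 0x01;
    return o;
}
```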
Escape prefix with all identical bytes (e.g. FF FF FF). Much easier to check for, but runs of that byte in the stream would produce a huge overhead (like 25%), and that's not unlikely for any byte value chosen as the escape code.
Escape postfix. Store the payload byte before the marker - then the decoder just has to skip 1 byte before a masked marker, and 4 bytes for a control code. So this basically introduces a fixed 4-byte delay for the decoder, while [3] has a complex path where marker bytes have to be returned one by one. Still, with [3] the encoder is much simpler (it just writes an extra 0 when the marker matches), and this doesn't really simplify the buffer processing.
Update: Actually I'm pretty sure that [3] or [5] is the option I'd use; I only listed the other options in the hope of getting more alternatives (for example, it would be ok if redundancy is 1 bit per block on average). So the main question atm is how to parse the stream for [3]... the current state machine looks like this:
int saved_c;      // data byte captured after a partial marker match
int o_last = 0;   // state: 0..3 = marker bytes matched; >=10 = replaying them
int c_last;       // replay position within the matched marker bytes

int GetByte( FILE* f ) {
    int c;
Start:
    if( o_last>=10 ) {  // replaying bytes of a false/escaped marker match
        if( c_last>=(o_last-10) ) { c=saved_c; o_last=0; }
        else c = (unsigned char)"\xEC\x4B\xA7"[c_last++];
    } else {
        c = getc(f);
        if( o_last<3 ) {
            if( (char)c == "\xEC\x4B\xA7"[o_last] ) { o_last++; goto Start; }
            else if( o_last>0 ) { saved_c=c; c_last=0; o_last+=10; goto Start; } // states 11,12
            // else just return c
        } else {  // full marker seen, next byte is an escape/control code
            if( c>0 ) { c=-1-c; o_last=0; printf( "c=%i\n", c ); }  // control code
            else { saved_c=0xA7; c_last=0; o_last+=10-1; goto Start; } // state 12: escaped marker is data
        }
    }
    return c;
}
and it's certainly ugly (and slow).
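For comparison, a buffered decoder for the same EC 4B A7 scheme becomes much simpler if the whole input is available in memory - which of course sidesteps exactly the boundary handling that makes the state machine necessary (a sketch; names are my assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Buffered decoder sketch for the EC 4B A7 escape scheme: scan for the
   marker; 00 after it means "literal marker bytes", 01 means
   end-of-stream. Returns payload length, or -1 if the input ends
   without an end-of-stream code (truncated stream). */
long unescape_decode(const uint8_t *src, size_t len, uint8_t *dst) {
    static const uint8_t marker[3] = {0xEC, 0x4B, 0xA7};
    size_t i = 0, o = 0;
    while (i < len) {
        if (i + 3 < len && memcmp(src + i, marker, 3) == 0) {
            uint8_t code = src[i + 3];
            if (code == 0x01) return (long)o;  /* end-of-stream */
            if (code == 0x00) {                /* escaped marker = data */
                memcpy(dst + o, marker, 3);
                o += 3; i += 4;
                continue;
            }
            /* other codes reserved; fall through as literal for now */
        }
        dst[o++] = src[i++];
    }
    return -1;  /* no terminator seen */
}
```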
How about using blocks of fixed size, e.g. 1KB? Each block would contain a byte (or 4 bytes) indicating which stream it belongs to, followed by just data.
Benefits:
- You don't have to pay attention to the data itself. Data cannot accidentally trigger any behaviour in your system (e.g. accidentally terminate the stream).
- It does not require random file access when encoding. In particular, you don't store the length of the block, as it is fixed.
- If data goes corrupt, the system can recover in the next block.
Drawbacks:
- If you have to switch from stream to stream a lot, with each having only a few bytes of data, the blocks may be underutilised. Lots of bytes will be empty.
- If the block size is too small (e.g. if you want to solve the above problem), you can get huge overhead from the header.
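For what it's worth, packing such a fixed-size block is about as simple as framing gets (a sketch; the 1KB size and single id byte follow the proposal, everything else is assumed):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 1024  /* fixed block size from the proposal */

/* Sketch of the fixed-size block scheme: byte 0 names the stream, the
   remaining BLOCK-1 bytes are payload. No length field is needed since
   every block has the same size; a partially filled final block would
   need some padding convention (not covered here). */
void pack_block(uint8_t stream_id, const uint8_t payload[BLOCK - 1],
                uint8_t out[BLOCK]) {
    out[0] = stream_id;
    memcpy(out + 1, payload, BLOCK - 1);
}

/* Demultiplexing only ever needs to look at the first byte. */
uint8_t block_stream_id(const uint8_t blk[BLOCK]) { return blk[0]; }
```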
From the standpoint of simplicity for sequential read and write, I would take solution 1 and just use a short buffer, limited to 256 bytes. Then you have a one-byte length followed by data. If a single stream has more than 256 consecutive bytes, you simply write another length header and more data.
If you have any further requirements, though, you might have to do something more elaborate. For instance, random-access reads probably require a magic number that can't be duplicated in the real data (or that's always escaped when it appears in the real data).
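That suggestion amounts to a simple chunked encoding, roughly like this (a sketch; chunks here carry up to 255 bytes since that's what one length byte can express, and the zero-length terminator is my assumption, not part of the suggestion):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the short-buffer variant: emit <len><data> chunks with a
   one-byte length (1..255), then an assumed zero-length chunk as the
   end-of-stream terminator. Returns bytes written; dst must have room
   for n plus one length byte per 255 payload bytes plus a terminator. */
size_t chunk_encode(const uint8_t *src, size_t n, uint8_t *dst) {
    size_t o = 0;
    while (n > 0) {
        size_t take = n < 255 ? n : 255;
        dst[o++] = (uint8_t)take;      /* length header */
        memcpy(dst + o, src, take);
        o += take; src += take; n -= take;
    }
    dst[o++] = 0;                      /* assumed terminator */
    return o;
}
```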
So I ended up using the ugly parser from OP in my actual project
( http://encode.ru/threads/1231-lzma-recompressor )
But now it seems that the actual answer is to let data type handlers terminate their streams however they want - which means escape coding in the case of uncompressed data; but when some kind of entropy coding is used in the stream, it's usually possible to incorporate a more efficient EOF code.
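A toy illustration of that last point - the byte alphabet extended with a 257th end-of-stream symbol, so no escaping is needed at all. The 16-bit "decoder" below is just a stand-in for a real entropy decoder; with a real coder SYM_EOF would be assigned a tiny probability and cost only a few bits, once per stream:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SYM_EOF 256  /* the extra 257th symbol terminating the stream */

/* Stand-in "entropy decoder": reads one 16-bit little-endian value per
   symbol (0..256). A real decoder would produce the same symbol range
   from a compressed bitstream. */
int next_symbol(const uint8_t **p) {
    int s = (*p)[0] | ((*p)[1] << 8);
    *p += 2;
    return s;
}

/* Decode symbols until the in-band EOF symbol; no marker scanning,
   no escape handling. Returns payload length. */
size_t decode_stream(const uint8_t *src, uint8_t *dst) {
    size_t o = 0;
    for (int s; (s = next_symbol(&src)) != SYM_EOF; )
        dst[o++] = (uint8_t)s;
    return o;
}
```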