Here is an interesting optimization problem that I have been thinking about for some days now:
In a system I read data from a slow IO device. I don't know beforehand how much data I need; the exact length is only known once I have read an entire package (think of it as having some kind of end symbol). Reading more data than required is not a problem, except that it wastes time in IO.
Two constraints also come into play: Reads are very slow. Each byte I read costs. Also, each read request has a constant setup cost regardless of the number of bytes I read. This makes reading byte by byte costly. As a rule of thumb: the setup cost is roughly as expensive as a read of 5 bytes.
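As a rough sketch of that cost model (the value 5 is just the rule of thumb above, expressed in "byte equivalents", not a measured constant):

#define SETUP_COST 5   /* one read request costs about as much as 5 extra bytes */

int read_cost(int bytes_requested) {
    return SETUP_COST + bytes_requested;
}
/* e.g. two 5-byte reads cost 2*(5+5) = 20, one 10-byte read costs 5+10 = 15 */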
The packages I read are usually between 9 and 64 bytes, but there are rare occurrences of larger or smaller packages. The entire range is between 1 and 120 bytes.
Of course I know a little bit about my data: packages come in sequences of identical sizes. I can classify three patterns here:
Sequences of reads with identical sizes:
A A A A A ...
Alternating sequences:
A B A B A B A B ...
And sequences of triples:
A B C A B C A B C ...
The special case of degenerate triples exists as well:
A A B A A B A A B ...
(A, B and C denote some package size between 1 and 120 here).
Question:
Based on the sizes of the previous packages, how do I predict the size of the next read request? I need something that adapts quickly, uses little storage (let's say below 500 bytes) and is fast from a computational point of view as well.
Oh - and pre-generating some tables won't work, because the statistics of the read sizes can vary a lot with the different devices I read from.
Any ideas?
You need to read at least 3 packages and at most 4 packages to identify the pattern.
- Read 3 packages. If they are all the same size, then the pattern is AAAAAA...
- If they are not all the same size, read the 4th package. If size 1 = size 3 and size 2 = size 4, the pattern is ABAB. Otherwise, the pattern is ABCABC...
With that outline, it is probably a good idea to do a speculative read of 3 package sizes (something like 3*64 bytes in a single go).
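A minimal sketch of that classification in C (read_package() is a hypothetical helper that returns the size of the next package; it is not part of the question):

extern int read_package(void);   /* hypothetical: reads one package, returns its size */

enum pattern { PAT_AAA, PAT_ABAB, PAT_ABC };

enum pattern classify(int sizes[4], int *nread) {
    sizes[0] = read_package();
    sizes[1] = read_package();
    sizes[2] = read_package();
    *nread = 3;
    if (sizes[0] == sizes[1] && sizes[1] == sizes[2])
        return PAT_AAA;                        /* A A A A ... */
    sizes[3] = read_package();
    *nread = 4;
    if (sizes[0] == sizes[2] && sizes[1] == sizes[3])
        return PAT_ABAB;                       /* A B A B ... */
    return PAT_ABC;                            /* A B C A B C ... */
}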
I don't see a problem here... But first, several questions:
1) Can you read the input asynchronously (e.g. separate thread, interrupt routine, etc.)?
2) Do you have some free memory for a buffer?
3) If you've commanded a longer read, are you able to obtain the first byte(s) before the whole packet is read?
If so (and I think in most cases it can be implemented), then you can just have a separate thread that reads them at the highest possible speed and stores them in a buffer, stalling when the buffer gets full, so that your normal process can use a synchronous getc() on that buffer.
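A minimal sketch of that reader-thread idea, assuming POSIX threads and a blocking device_read_byte() provided elsewhere (both the helper name and the buffer size are placeholders, not from the question):

#include <pthread.h>

#define BUF_SIZE 256

static unsigned char buf[BUF_SIZE];
static int head, tail, count;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

extern int device_read_byte(void);            /* hypothetical blocking device read */

static void *reader_thread(void *arg) {
    (void)arg;
    for (;;) {
        int c = device_read_byte();           /* the slow IO happens here */
        pthread_mutex_lock(&lock);
        while (count == BUF_SIZE)             /* stall when the buffer is full */
            pthread_cond_wait(&not_full, &lock);
        buf[head] = (unsigned char)c;
        head = (head + 1) % BUF_SIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int buffered_getc(void) {                     /* the synchronous getc() for the normal process */
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int c = buf[tail];
    tail = (tail + 1) % BUF_SIZE;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return c;
}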
EDIT: I see... it's because of CRC or encryption? Well, then you could use some ideas from data compression:
Consider a simple adaptive algorithm of order N for M possible symbols:
enum { M = 8 };              /* example value: number of (possibly quantized) symbols */
enum { DECAY_LIMIT = 255 };  /* example value: halve the counts once one exceeds this */

int freqs[M][M][M];  /* [a][b][c] : occurrences of outcome "c" when prev vals were "a" and "b" */
int prev[2];         /* some history */

int predict(void) {
    int prediction = 0;
    for (int i = 1; i < M; i++)
        if (freqs[prev[0]][prev[1]][i] > freqs[prev[0]][prev[1]][prediction])
            prediction = i;
    return prediction;
}

void add_outcome(int val) {
    if (freqs[prev[0]][prev[1]][val]++ > DECAY_LIMIT) {
        for (int i = 0; i < M; i++)
            freqs[prev[0]][prev[1]][i] >>= 1;
    }
    prev[0] = prev[1];
    prev[1] = val;
}
freqs has to be an array of order N+1, and you have to remember the N previous values. N and DECAY_LIMIT have to be adjusted according to the statistics of the input; however, even they can be made adaptive (for example, if the predictor produces too many misses, the decay limit can be shortened).
The last problem would be the alphabet. Depending on the context, if there are only a few distinct sizes, you can create a one-to-one mapping to your symbols. If there are more, you can use quantization to limit the number of symbols. The whole algorithm can be written with pointer arithmetic, so that N and M won't be hardcoded.
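One possible way to wire those two functions into the read loop (read_package_with_hint() is a hypothetical helper that issues a speculative read of the predicted size and returns the actual size; package sizes are assumed to be mapped to symbols 0..M-1 already):

extern int read_package_with_hint(int expected_size);  /* hypothetical speculative read */

void read_loop(void) {
    for (;;) {
        int guess = predict();                 /* most frequent outcome in the current context */
        int actual = read_package_with_hint(guess);
        add_outcome(actual);                   /* feed the real size back into the model */
    }
}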
Since reading is so slow, I suppose you can throw some CPU power at it so you can try to make an educated guess of how much to read.
That would basically be a predictor with a model based on probabilities. It would generate a sample of predictions of the upcoming message size, and the cost of each, then pick the message size that has the best expected cost.
Then, when you find out the actual message size, use Bayes' rule to update the model probabilities, and do it again.
Maybe this sounds complicated, but if the probabilities are stored as fixed-point fractions you won't have to deal with floating point, so it may not be much code. I would use something like a Metropolis-Hastings algorithm as my basic simulator and Bayesian update framework. (This is just an initial stab at thinking about it.)
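A much-simplified sketch of that expected-cost idea, with plain integer counts standing in for the full Bayesian/Metropolis-Hastings machinery described above (MAX_SIZE and SETUP_COST are assumptions taken from the question's size range and rule of thumb):

#define MAX_SIZE   120
#define SETUP_COST 5

static int  counts[MAX_SIZE + 1];   /* counts[s]: how often a package of size s was seen */
static long total;

/* Expected cost of first requesting `guess` bytes, scaled by `total` so only
   integer arithmetic is needed: if the real size s fits, we pay one request;
   otherwise a second request fetches the remainder. */
static long weighted_cost(int guess) {
    long cost = 0;
    for (int s = 1; s <= MAX_SIZE; s++) {
        if (counts[s] == 0) continue;
        long c = SETUP_COST + guess;
        if (s > guess)
            c += SETUP_COST + (s - guess);
        cost += (long)counts[s] * c;
    }
    return cost;
}

int choose_read_length(void) {
    if (total == 0) return 64;               /* no data yet: fall back to a typical size */
    int best = 1;
    long best_cost = weighted_cost(1);
    for (int guess = 2; guess <= MAX_SIZE; guess++) {
        long c = weighted_cost(guess);
        if (c < best_cost) { best_cost = c; best = guess; }
    }
    return best;
}

void observe(int actual_size) {              /* the "update the model" step: bump the count */
    counts[actual_size]++;
    total++;
}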