c++ stl: a good way to parse a sensor response_问答_开发者

Attention please:

I already implemented this stuff, just not in any way generic or elegant. This question is motivated by my wanting to learn more tricks with the stl, not the problem itself.

This I think is clear in the way I stated that I already solved the problem, but many people have answered in their best intentions with solutions to the problem, not answers to the question "how to solve this the stl way". I am really sorry if I phrased this question in a confusing way. I hate to waste people's time.

Ok, here it comes:

I get a string full of encoded Data.

It comes in: N >> 64 bytes

every 3 byte get decoded into an int value
after at most 64 byte (yes,not divisible by 3!) comes a byte as checksum
followed by a line feed.
and so it goes on.

It ends when 2 successive linefeeds are found.

It looks like a nice or at least ok data format, but parsing it elegantly the stl way is a real bit**.

I have done the thing "manually".

But I would be interested if there is an elegant way with the stl- or maybe boost- magic that doesn't incorporate copying the thing.

Clarification: It gets really big开发者_如何学运维 sometimes. The N >> 64byte was more like a N >>> 64 byte ;-)

UPDATE Ok, the N>64 bytes seems to be confusing. It is not important.

The sensor takes M measurements as integers. Encodes each of them into 3 bytes. and sends them one after another
when the sensor has sent 64byte of data, it inserts a checksum over the 64 byte and an LF. It doesn't care if one of the encoded integers is "broken up" by that. It just continues in the next line.(That has only the effect to make the data nicely human readable but kindof nasty to parse elegantly.)
if it has finished sending data it inserts a checksum-byte and LFLF

So one data chunk can look like this, for N=129=43x3:

|<--64byte-data-->|1byte checksum|LF 
|<--64byte-data-->|1byte checksum|LF 
|<--1byte-data-->|1byte checksum|LF
LF

When I have M=22 measurements, this means I have N=66 bytes of data. After 64 byte it inserts the checksum and LF and continues. This way it breaks up my last measurement which is encoded in byte 64, 65 and 66. It now looks like this: 64, checksum, LF, 65, 66. Since a multiple of 3 divided by 64 carries a residue 2 out of 3 times, and everytime another one, it is nasty to parse. I had 2 solutions:

check checksum, concatenate data to one string that only has data bytes, decode.
run through with iterators and one nasty if construct to avoid copying.

I just thought there might be someting better. I mused about std::transform, but it wouldn't work because of the 3 byte is one int thing.

As much as I like STL, I don't think there's anything wrong with doing things manually, especially if the problem does not really fall into the cases the STL has been made for. Then again, I'm not sure why you ask. Maybe you need an STL input iterator that (checks and) discards the check sums and LF characters and emits the integers?

I assume the encoding is such that LF can only appear at those places, i.e., some kind of Base-64 or similar?

It seems to me that something as simple as the following should solve the problem:

string line;
while( getline( input, line ) && line != "" ) {    
  int val = atoi( line.substr(0, 3 ).c_str() );
  string data = line.substr( 3, line.size() - 4 );
  char csum = line[ line.size() - 1 ];
  // process val, data and csum
}

In a real implementation you would want to add error checking, but the basic logic should remain the same.

As others have said, there is no silver bullet in stl/boost to elegantly solve your problem. If you want to parse your chunk directly via pointer arithmetic, perhaps you can take inspiration from std::iostream and hide the messy pointer arithmetic in a custom stream class. Here's a half-arsed solution I came up with:

#include <cctype>
#include <iostream>
#include <vector>
#include <boost/lexical_cast.hpp>

class Stream
{
public:
    enum StateFlags
    {
        goodbit = 0,
        eofbit  = 1 << 0,   // End of input packet
        failbit = 1 << 1    // Corrupt packet
    };

    Stream() : state_(failbit), csum_(0), pos_(0), end_(0) {}
    Stream(char* begin, char* end) {open(begin, end);}
    void open(char* begin, char* end)
        {state_=goodbit; csum_=0; pos_=begin, end_=end;}
    StateFlags rdstate() const {return static_cast<StateFlags>(state_);}
    bool good() const {return state_ == goodbit;}
    bool fail() const {return (state_ & failbit) != 0;}
    bool eof() const {return (state_ & eofbit) != 0;}
    Stream& read(int& measurement)
    {
        measurement = readDigit() * 100;
        measurement += readDigit() * 10;
        measurement += readDigit();
        return *this;
    }

private:
    int readDigit()
    {
        int digit = 0;

        // Check if we are at end of packet
        if (pos_ == end_) {state_ |= eofbit; return 0;}

        /* We should be at least csum|lf|lf away from end, and we are
            not expecting csum or lf here. */
        if (pos_+3 >= end_ || pos_[0] == '\n' || pos_[1] == '\n')
        {
            state_ |= failbit;
            return 0;
        }

        if (!getDigit(digit)) {return 0;}
        csum_ = (csum_ + digit) % 10;
        ++pos_;

        // If we are at checksum, check and consume it, along with linefeed
        if (pos_[1] == '\n')
        {
            int checksum = 0;
            if (!getDigit(checksum) || (checksum != csum_)) {state_ |= failbit;}
            csum_ = 0;
            pos_ += 2;

            // If there is a second linefeed, we are at end of packet
            if (*pos_ == '\n') {pos_ = end_;}
        }
        return digit;
    }

    bool getDigit(int& digit)
    {
        bool success = std::isdigit(*pos_);
        if (success)
            digit = boost::lexical_cast<int>(*pos_);
        else
            state_ |= failbit;
        return success;
    }

    int csum_;
    unsigned int state_;
    char* pos_;
    char* end_;
};


int main()
{
    // Use (8-byte + csum + LF) fragments for this example
    char data[] = "\
001002003\n\
300400502\n\
060070081\n\n";

    std::vector<int> measurements;
    Stream s(data, data + sizeof(data));
    int meas = 0;

    while (s.read(meas).good())
    {
        measurements.push_back(meas);
        std::cout << meas << " ";
    }

    return 0;
}

Maybe you'll want to add extra StateFlags to determine if failure is due to checksum error or framing error. Hope this helps.

You should think of your communication protocol as being layered. Treat

|<--64byte-data-->|1byte checksum|LF

as fragments to be reassembled into larger packets of contiguous data. Once the larger packet is reconstituted, it is easier to parse its data contiguously (you don't have to deal with measurements being split up across fragments). Many existing network protocols (such as UDP/IP) does this sort of reassembly of fragments into packets.

It's possible to read the fragments directly into their proper "slot" in the packet buffer. Since your fragments have footers instead of headers, and there is no out-of-order arrival of your fragments, this should be fairly easy to code (compared to copyless IP reassembly algorithms). Once you receive an "empty" fragment (the duplicate LF), this marks the end of the packet.

Here is some sample code to illustrate the idea:

#include <vector>
#include <cassert>

class Reassembler
{
public:
    // Constructs reassembler with given packet buffer capacity
    Reassembler(int capacity) : buf_(capacity) {reset();}

    // Returns bytes remaining in packet buffer
    int remaining() const {return buf_.end() - pos_;}

    // Returns a pointer to where the next fragment should be read
    char* back() {return &*pos_;}

    // Advances the packet's position cursor for the next fragment
    void push(int size) {pos_ += size; if (size == 0) complete_ = true;}

    // Returns true if an empty fragment was pushed to indicate end of packet
    bool isComplete() const {return complete_;}

    // Resets the reassembler so it can process a new packet
    void reset() {pos_ = buf_.begin(); complete_ = false;}

    // Returns a pointer to the accumulated packet data
    char* data() {return &buf_[0];}

    // Returns the size in bytes of the accumulated packet data
    int size() const {return pos_ - buf_.begin();}

private:
    std::vector<char> buf_;
    std::vector<char>::iterator pos_;
    bool complete_;
};


int readFragment(char* dest, int maxBytes, char delimiter)
{
    // Read next fragment from source and save to dest pointer
    // Return number of bytes in fragment, except delimiter character
}

bool verifyChecksum(char* fragPtr, int size)
{
    // Returns true if fragment checksum is valid
}

void processPacket(char* data, int size)
{
    // Extract measurements which are now stored contiguously in packet
}

int main()
{
    const int kChecksumSize = 1;
    Reassembler reasm(1000); // Use realistic capacity here
    while (true)
    {
        while (!reasm.isComplete())
        {
            char* fragDest = reasm.back();
            int fragSize = readFragment(fragDest, reasm.remaining(), '\n');
            if (fragSize > 1)
                assert(verifyChecksum(fragDest, fragSize));
            reasm.push(fragSize - kChecksumSize);
        }
        processPacket(reasm.data(), reasm.size());
        reasm.reset();
    }
}

The trick will be making an efficient readFragment function that stops at every newline delimiter and stores the incoming data into the given destination buffer pointer. If you tell me how you acquire your sensor data, then I can perhaps give you more ideas.

An elegant solution this isn't. It would be more so by using a "transition matrix", and only reading one character at a time. Not my style. Yet this code has a minimum of redundant data movement, and it seems to do the job. Minimally C++, it really is just a C program. Adding iterators is left as an exercise for the reader. The data stream wasn't completely defined, and there was no defined destination for the converted data. Assumptions noted in comments. Lots of printing should show functionality.

// convert series of 3 ASCII decimal digits to binary
// there is a checksum byte at least once every 64 bytes - it can split a digit series
// if the interval is less than 64 bytes, it must be followd by LF (to identify it)
// if the interval is a full 64 bytes, the checksum may or may not be followed by LF
// checksum restricted to a simple sum modulo 10 to keep ASCII format
// checksum computations are only printed to allowed continuation of demo, and so results can be
// inserted back in data for testing
// there is no verification of the 3 byte sets of digits
// results are just printed, non-zero return indicates error


int readData(void) {
    int binValue =  0, digitNdx = 0,  sensorCnt = 0, lineCnt = 0;
    char oneDigit;
    string sensorTxt;

    while( getline( cin, sensorTxt ) ) {
        int i, restart = 0, checkSum = 0, size = sensorTxt.size()-1;
        if(size < 0)
            break;
        lineCnt++;
        if(sensorTxt[0] == '#')
            continue;
        printf("INPUT: %s\n", &sensorTxt[0]);           // gag

        while(restart<size) {
            for(i=0; i<min(64, size); i++) {
                oneDigit = sensorTxt[i+restart] & 0xF;
                checkSum += oneDigit;
                binValue = binValue*10 + oneDigit;
                //printf("%3d-%X ", binValue, sensorTxt[i+restart]);
                digitNdx++;
                if(digitNdx == 3) {
                    sensorCnt++;
                    printf("READING# %d (LINE %d) = %d  CKSUM %d\n",
                           sensorCnt, lineCnt, binValue, checkSum);
                    digitNdx = 0;
                    binValue = 0;
                }
            }
            oneDigit = sensorTxt[i+restart] & 0x0F;
            char compCheckDigit = (10-(checkSum%10)) % 10;
            printf("    CKSUM at sensorCnt %d  ", sensorCnt);
            if((checkSum+oneDigit) % 10)
                printf("ERR got %c exp %c\n", oneDigit|0x30, compCheckDigit|0x30);
            else
                printf("OK\n");
            i++;
            restart += i;
        }
    }
    if(digitNdx)
        return -2;
    else
        return 0;
}

The data definition was extended with comments, you you can use the following as is:

# normal 64 byte lines with 3 digit value split across lines
00100200300400500600700800901001101201301401501601701801902002105
22023024025026027028029030031032033034035036037038039040041042046
# short lines, partial values - remove checksum digit to combine short lines
30449
0451
0460479
0480490500510520530540550560570580590600610620630641
# long line with embedded checksums every 64 bytes
001002003004005006007008009010011012013014015016017018019020021052202302402502602702802903003103203303403503603703803904004104204630440450460470480490500510520530540550560570580590600610620630640
# dangling digit at end of file (with OK checksum)
37

Why are you concerned with copying? Is it the time overhead or the space overhead?

It sounds like you are reading all of the unparsed data into a big buffer, and now you want to write code that makes the big buffer of unparsed data look like a slightly smaller buffer of parsed data (the data minus the checksums and linefeeds), to avoid the space overhead involved in copying it into the slightly smaller buffer.

Adding a complicated layer of abstraction isn't going to help with the time overhead unless you only need a small portion of the data. And if that's the case, maybe you could just figure out which small portion you need, and copy that. (Then again, most of the abstraction layer may be already written for you, e.g. the Boost iterator library.)

An easier way to reduce the space overhead is to read the data in smaller chunks (or a line at a time), and parse it as you go. Then you only need to store the parsed version in a big buffer. (This assumes you're reading it from a file / socket / port, rather than being passed a large buffer that is not under your control.)

Another way to reduce the space overhead is to overwrite the data in place as you parse it. You will incur the cost of copying it, but you will only need one buffer. (This assumes that the 64-byte data doesn't grow when you parse it, i.e. it is not compressed.)