开发者

Reading integers from a memory mapped formatted file

开发者 https://www.devze.com 2023-01-25 04:23 出处:网络
I have memory mapped a large formatted (text) file containing one integer per line like so: 123 345 34324

I have memory mapped a large formatted (text) file containing one integer per line like so:

123
345
34324
3232
...

So, I have a pointer to the memory at the first byte and also a pointer to the memory at the last byte. I am trying to read all those integers into an array as fast as possible. Initially I created a specialized std::streambuf class to work with std::istream to read from that memory but it seem to be relatively slow.

Do you have any suggestion on how to efficiently parse a string like "1231232\r\n123123\r\开发者_StackOverflow中文版n123\r\n1231\r\n2387897..." into an array {1231232,123123,1231,231,2387897,...} ?

The number of integers in the file is not known beforehand.


This was a really interesting task for me to learn a bit more about C++.

Admitted, the code is quite large and has a lot of error checking, but that only shows how many different things can go wrong during parsing.

#include <ctype.h>
#include <limits.h>
#include <stdio.h>

#include <iterator>
#include <vector>
#include <string>

static void
die(const char *reason)
{
  fprintf(stderr, "aborted (%s)\n", reason);
  exit(EXIT_FAILURE);
}

template <class BytePtr>
static bool
read_uint(BytePtr *begin_ref, BytePtr end, unsigned int *out)
{
  const unsigned int MAX_DIV = UINT_MAX / 10;
  const unsigned int MAX_MOD = UINT_MAX % 10;

  BytePtr begin = *begin_ref;
  unsigned int n = 0;

  while (begin != end && '0' <= *begin && *begin <= '9') {
    unsigned digit = *begin - '0';
    if (n > MAX_DIV || (n == MAX_DIV && digit > MAX_MOD))
      die("unsigned overflow");
    n = 10 * n + digit;
    begin++;
  }

  if (begin == *begin_ref)
    return false;

  *begin_ref = begin;
  *out = n;
  return true;
}

template <class BytePtr, class IntConsumer>
void
parse_ints(BytePtr begin, BytePtr end, IntConsumer out)
{
  while (true) {
    while (begin != end && *begin == (unsigned char) *begin && isspace(*begin))
      begin++;
    if (begin == end)
      return;

    bool negative = *begin == '-';
    if (negative) {
      begin++;
      if (begin == end)
        die("minus at end of input");
    }

    unsigned int un;
    if (!read_uint(&begin, end, &un))
      die("no number found");

    if (!negative && un > INT_MAX)
      die("too large positive");
    if (negative && un > -((unsigned int)INT_MIN))
      die("too small negative");

    int n = negative ? -un : un;
    *out++ = n;
  }
}

static void
print(int x)
{
  printf("%d\n", x);
}

int
main()
{
  std::vector<int> result;
  std::string input("2147483647 -2147483648 0 00000 1 2 32767 4 -17 6");

  parse_ints(input.begin(), input.end(), back_inserter(result));

  std::for_each(result.begin(), result.end(), print);
  return 0;
}

I tried hard not to invoke any kind of undefined behavior, which can get quite tricky when converting unsigned numbers to signed numbers or invoking isspace on an unknown data type.


std::vector<int> array;
char * p = ...; // start of memory mapped block
while ( not end of memory block )
{
    array.push_back(static_cast<int>(strtol(p, &p, 10)));
    while (not end of memory block && !isdigit(*p))
        ++p;
}

This code is a little unsafe since there's no guarantee that strtol will stop at the end of the memory mapped block, but it's a start. Should go very fast even with additional checking added.


Since this is memory mapped a simple copy the chars to a stack array and atoi to the another integer array on top of a another memory mapped file would be the very efficient. This way the paging file is not used for these big buffers at all.

open memory mapped file to output int buffer

declare small stack buffer of 20 chars
while not end of char array
  while current char not  line feed
    copy chars to stack buffer
    null terminate the buffer two chars back
    copy results of int buffer output buffer
    increment the output buffer pointer
  end while  
end while

While this doesn't use the a library is has the advantage of minimising memory usage to memory mapped files, so temp buffers are limited to the stack one and the one used by atoi internally. The output buffer can be thrown away or left saved to the file as needed.


NOTE: This answer has been edited a few times.

Reads memory line by line (based on link and link).

class line 
{
   std::string data;
public:
   friend std::istream &operator>>(std::istream &is, line &l) 
   {
      std::getline(is, l.data);
      return is;
   }
   operator std::string() { return data; }    
};

std::streambuf osrb;
setg(ptr, ptr, ptrs + size-1);
std::istream istr(&osrb);

std::vector<int> ints;

std::istream_iterator<line> begin(istr);
std::istream_iterator<line> end;
std::transform(begin, end, std::back_inserter(ints), &boost::lexical_cast<int, std::string>);
0

精彩评论

暂无评论...
验证码 换一张
取 消