std::vector, std::map and memory issues_问答_开发者

I am writing code that inserts rows from a database into a vector. The vectors are then stored in a std::map. This architecture allows me to logically partition the datasets (vectors), based on the map key.

In my code, I will be retrieving a dataset (i.e. vector) from the std::map, adding/removing rows to it or performing some other logic on it, and then sticking the vector back into the map (all this goes on in a while() loop).

I have two questions, both of which are related to the (potentially) large number of elements stored in a vector. The vector may hold anything from a few tens to several tens of thousands of elements. There is no way of me to know beforehand, how many records will be retrieved from the database. Therefore, the memory management scheme of std::vector (i.e. alloc/dealloc) becomes of great importance in order to use memory effeciently, and to avoid unecessary (memory) fragmentation:

My questions are:

Given the potentially large number of elements that a row can store, ideally I would like to alloc/dealloc from a memory pool. But I dont know how to use std::vector with开发者_如何学Go a memory pool, and I dont know if that is going to be unnecessarily complicated. If that is overkill (or too complicated), then my other idea is to preallocate a fixed slab of memory when the vector is first created. But this is also likely to be too simplistic, as the number of elemnts are likely to vary widely from one vector instance to another - leading to (memory) fragmentation etc, not to mention inefficient use of memory.

What is the recommended approach here?
Given the fact that std::map (all STL containers IIRC), stores a COPY of the value element, the prospect of making copies of vectors containing several tens of thousands of elements, is just plain wrong. I am thinking therefore, of writing a SmartPointerMap wrapper arround stl::map and then store pointers to the vectors instead of the actual vectors.

Am I on the right track?. If No, what is a better solution.?. If yes, is there a boost library I can use (instead of writing my SmartPointerMap class template)?

[Edit]

I am building on Linux (Ubuntu 9.10) with gcc 4.4.1

Assuming that you're using vector for the map's data_type and not the key_type, you could modify the data in place without copying it. std::map::operator[]() returns a reference to a non-const data_type, and the iterator returned from the non-const version of std::map::find() allows you to modify the data.

What if you need to change the key when you change the data? You can use std::swap() to move the data from one vector to another without copying.

Don't forget that vector does not reduce its capacity() when you erase elements. Also, vector will usually allocate more capacity() than you need, so that push_back() takes amortized constant time. For very large vectors, these behaviors may significantly increase your program's memory usage, if you're not careful.

If you're using vector for the map's key_type and the map has extremely large keys, then pointers or smart pointers might help. However, if this is the case, you must make sure not to modify the contents of a key that is pointed to by one of the map values, because std::map is not designed to handle that.

As for the custom allocator idea, get your code working with the standard allocator first, and then see if it's fast enough. It might be fine using the standard allocator. If your code is not fast enough with the standard allocator, profile to see where the time is really being spent (it may be somewhere else, like the database code). If you write a custom allocator and you never compare it against the standard allocator, you will not know whether your custom allocator is actually an improvement; it could be much slower than the standard allocator.

Wrt #1, the default heap implementation in GCC/Linux (ptmalloc) will use a free list (aka memory pool) for small objects (<=64 bytes by default last time I checked). If you still want to use a custom allocator, the Boost.Pool library has allocators that satisfy the Standard Allocator requirements. As bk1e suggested, benchmark before and after to see if it's even worth it.

When populating your vectors from the database, if possible/practical, try to use vector<T>::reserve() to make vector allocate enough memory from the start and avoid reallocations.

Hope this helps.

To answer question 2:

using namespace std;

map<string, vector<int>* > foo;
vector<int>* pointer= foo["thekey"];

If using smart (reference-counted) pointers is a requirement:

#include<tr1/shared_ptr.h>
using namespace std::tr1;
using namespace std;

map<string, shared_ptr<vector<int> > > foo;
shared_ptr<vector<int> > pointer= foo["thekey"];

To answer question #1, you can write a new allocator template class and declare your vectors to use that allocator, but I don't really know anything about writing allocators:

map<string, vector<int, myallocator<int> >* > foo;

In particular I don't know how to design an allocator that will avoid fragmenting your memory pool. (But if you have that part answered, then writing a custom allocator would be the way to go.)

I'm going to suggest a more general approach to working with database query sets in C/C++.

Do the db query and put the rows in an array of structures. If you don't have enough physical memory, use a memory mapped file, which will return a pointer to the base of an array of structures which the MMU onboard the processor chip will worry about making resident in memory when needed.

You can now sort() this array of structs on a key/s, which will give you ISAM access, heresy in Relational Religion, but exactly what cursors give you. This is data access method #1

Now, if you still need to look at that data in another way/s, just create 1 or more indices/maps where the map's keys are the appropriate (column value/struct member) from the array of structs, and the data value of the map is a pointer to that struct's location in the memory mapped file. This provides exactly the same functionality as a db index. Subscripting the map by the struct.member value for any given row will get you right back to that row.

You can have multiple maps, for multiple indices, pointing to the same structs/rows in the array, using different struct.member data for the map's keys for different indices.

I'm ignoring your stated requirement to add new rows/structs, as it seems you are doing a single read of rows from the db, and don't actually need to add new rows/structs.

If, however, I'm wrong, and you do need to add new structs/rows, then instead of allocating one big array of structs, just malloc() and realloc() an appropriate sized array of structs and hang them off the pointers that are the map's "data". You lose true ISAM access this way, because the structs are not stored physically adjacent in memory, but you gain realloc() capability. It's a trade-off. ISAM access can be simulated by iterating over the map, but it will be slower than reading an array of structs sequentially from memory.