(How) Can I approximate a "dynamic" index (key extractor) for Boost MultiIndex?_问答_开发者

I have a MultiIndex container of boost::shared_ptrs to members of class Host. These members contain private arrays bool infections[NUM_SEROTYPES] revealing Hosts' infection statuses with respect to each of 1,...,NUM_SEROTYPES serotypes. I want to be able to determine at any time in the simulation the number of people infected with a given serotype, but I'm not sure how:

Ideally, Boost MultiIndex would allow me to sort, for example, by Host::isInfected( int s ), where s is the serotype of interest. From what I understand, MultiIndex key extractors aren't allowed to take arguments.
An alternative would be to define an index for each serotype, but I don't see how to write the MultiIndex container typedef ... in such an extensible way. I will be changing the number of serotypes between simulations. (Do experienced programmers think this should be possible? I'll attempt it if so.)
There are 2^(NUM_SEROTYPES) possible infection statuses. For small numbers of serotypes, I could use a single index based on this number (or a binary string) and come up with some mapping from this key to actual infection status. Counting is still darn slow.
I could maintain a separate structure counting the total numbers of infecteds with each serotype. The synchrony is a bit of a pain, but the memory is fine. I would prefer a slicker option, since I would like to do further sorts on other host attributes (e.g., after counting the number infected with serotype s, count the number of those infected who are also in a particular household and have a particular age).

Thanks in advance.

Update

I'm going to be traveling and unable to check up on any answers until Wednesday, May 5. This means that whoever's answer is most popular, according to others' votes, will win the bounty (assuming some min开发者_开发技巧imum threshold is met)--I will not be back in time to select what I think is the best answer. Please assume for the purpose of this bounty that NUM_SEROTYPES ~8-10, so I do not want to go with the third option or anything requiring some awful combinatoric enumeration. I'm really hoping to find a way to sort the MultiIndex on the host's infection status with respect to a given serotype.

Also, if this is a stupid question, I'd love to know why. Thanks.

Update 2, May 6

These replies are helpful and help me understand my options as I continue to optimize and customize the program. I'm unable to award the bounty or solution at this point, but if I could, it would go to outis for providing the most thorough and informative solution.

My current strategy will be to maintain int allInfecteds[ ALL_AGES ][ NUM_SEROTYPES ]. Maintenance of the array will be built into event definitions (e.g., for dying, recovering, aging, and becoming infected). Getting the total number of infected household members in a particular age will be done by individually querying hosts after sorting the MultiIndex by the household of interest. (Households are extremely small relative to the total number of hosts.) As my queries become more complex, I might replace the two-dimensional array with multimaps and use count_if. If we ultimately simulate with relatively few serotypes, I might try to use the metaprogramming example below or use an explicit index.

A sincere thanks again to all commentors for helping me learn about the various options and their costs. If I attempt reindexing or the metaprogramming solutions proposed below, I will post results here.

Solution 2 (maintaining a separate index for each serotype can be done with a little metaprogramming:

#include <boost/mpl/push_back.hpp>
#include <boost/mpl/range_c.hpp>
#include <boost/mpl/fold.hpp>
#include <boost/mpl/for_each.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/shared_ptr.hpp>

using namespace boost::multi_index;

struct host
{
  static const int NUM_SEROTYPES=8;

  bool infections[NUM_SEROTYPES];
};

typedef boost::shared_ptr<host> host_ptr;

template<typename Num>
struct infection_extractor
{
  typedef bool result_type;

  result_type& operator()(const host_ptr& p)const
  {
    return p->infections[Num::value];
  }
};

typedef boost::mpl::fold<
  boost::mpl::range_c<int,0,host::NUM_SEROTYPES>,
  boost::mpl::vector<>,
  boost::mpl::push_back<
    boost::mpl::_1,
    ordered_non_unique<infection_extractor<boost::mpl::_2> >
  >
>::type index_list_type;

typedef multi_index_container<
  host_ptr,
  index_list_type
> multi_t;

template<int Num>
std::size_t infected_with(const multi_t& m)
{
  return m.get<Num>().count(true);
};

typedef std::size_t (*infected_with_type)(const multi_t&);

struct infected_with_table_assigner
{
  infected_with_table_assigner(infected_with_type* table):table(table)
  {
    boost::mpl::for_each<
      boost::mpl::range_c<int,0,host::NUM_SEROTYPES>
    >(*this);
  }

  template<typename Num>
  void operator()(Num)const{table[Num::value]=&infected_with<Num::value>;}

  infected_with_type* table;
};

std::size_t infected_with(const multi_t& m,int n)
{
  static infected_with_type           table[host::NUM_SEROTYPES];
  static infected_with_table_assigner assigner(table);

  return table[n](m);
}

#include <iostream>

int main()
{

  multi_t m;
  host h;
  for(int i=0;i<200;++i){
    for(int j=0;j<host::NUM_SEROTYPES;++j)h.infections[j]=(i&(1<<j))?true:false;
    m.insert(host_ptr(new host(h)));
  }

  for(int n=0;n<host::NUM_SEROTYPES;++n)
  {
    std::cout<<"infected with serotype "<<n<<": "
      <<infected_with(m,n)<<std::endl;
  }
}

But take into account that maintaining 8~10 indices has a memory and insertion time penalty. A simpler solution is to just maintain one random access index and sort this appropriately (with a custom comparer suited to the particular serotype of interest in each moment) before querying.

To implement option 2, define an enumerated type for the serotypes within another class, then write a templated key extractor. I'm not certain that boost::multi_index_container supports reindexing (I doubt it does), so this approach may be doomed from the start.

class ... {
    enum Serotype {
        ...
        sup // 1 greater than the highest serotype
    };
    ...
};

template <class H, typename H::Serotype s>
struct serotype_key_extractor {
    typedef bool result_type;

    result_type operator()(const boost::shared_ptr<Host>& h) const 
        { ! return h->isInfected(s); }
};

Of course, this requires an index for each serotype on your multi_index_container (Serotype::sup total), with a probable performance of O(I * lg(n)) for each index over the lifetime of the simulation (where I is the number of infection events and n is the number of hosts). If there are a one or two common queries, you could use composite keys, with the serotype key extractors one component.

Option 4, keeping track of infected hosts in a map of sets std::map<H::Serotype, std::set<boost::shared_ptr<Host> > >, can be made more performant over infection events at the cost of querying. Insertion to/removal from a set might be O(lg(n)), so it might be better to use a hash or list rather than a set, though this will increase the cost of queries. Since you're running a simulation, keeping this structure synchronized isn't too bad. Update it during an infection (or cure, if they exist) event. If you're using event based programming, it's quite easy.

For querying with option 4, you could build other sets based on the other attributes, then find the intersection of those sets. This is O(f(I)+f(n)+g(S,T)) overall, where f(x) is the cost of inserting x elements, g(X,Y) is the cost of intersecting sets and S and T are the size of the sets (the O(f(I)+f(N)) comes from building all the sets; I'm playing it a little loose with the timing, since I might be assuming f(x) is homomorphic, which it likely isn't). At best, f(x)=x and g(X,Y)=X+Y, depending on your set implementation. Alternatively, fold over one of the sets (e.g. find all the hosts of a particular age, then count infected; count_if is a particular fold you could use, where count_if(start, end, pred) is equivalent to foldl(start, end, [](int count, const boost::shared_ptr<Host>& h){return count + pred(h); })). In the latter case, make use of the most restrictive index to reduce the sub-population you need to scan as much as possible. This is O(f(I)+S) overall.

Part of the difficulty in designing a performant solution is that you can really only make use of one independent index at a time. Within the groups from an index, other attributes you might index on are arranged essentially randomly. If two indices aren't independent (those that are geographically close are more likely to have the same infections), you might be able to make use of both simultaneously, though this doesn't gain you much: you can get most of the sub-population early in the query, but you'd still have to examine the rest of the group from the primary index.

If you're not married to C++, you could switch to C# and handle the queries with LINQ.

As for switching to a database, as mathmike suggests, you'd probably only want to use an object-oriented, in-memory database. Christopher Brown lists some, including FastDB and OOFile, though I can't say how suitable they are for your project.

The simplest solution to your problem is to just count the number of people infected whenever you want to know the answer. This does mean you have to loop over every host at query time, but it's not combinatorial.

If that is too slow, then I would suggest maintaining a multimap from serotype to each infected host. This has the same synchrony problems as your fourth option, but it has the advantage that you easily have access to all the infected hosts to calculate more statistics.

If you're really doing a lot of queries based on lots of different attributes, and you don't want to manually build up the data structures to make each query efficient, you may want to consider using an SQL relational database. These are really designed for such queries.

I find your question very realistic.

As the number of serotypes seems to be big (~100) I don't think the option to indexing by serotype can be a valid option. If the number of serotype is reduced '~10) the answer of outis is the correct way to go on to make indexes on serotypes.

If I had to implement your problem, and if one of the main goals is to "be able to determine at any time in the simulation the number of people infected with a given serotype" I would maintain a counter for each possible serotype and increase/decrease each time I add an element in the container with this serotype. You could of course encapsulate this on a class that maintain the counters and the multi-index. Of course this is intrusive, but what it is important it to provide what you need.

This is simple and efficient.

If you need to iterate over all the Host satisfying a given query "after counting the number infected with serotype s, count the number of those infected who are also in a particular household and have a particular age" you could do the following

get the index related to the age and get the range of those with the given age (equal_range) or
get the index relative to the household and get the range of those with the given household (equal_range)

and on this index count all the host satisfying the other conditions using 'count_if'.

There is also a Project iterator feature that could be helpful.

Even if you had no serotype, the more simple query "count the number of those infected who are also in a particular household and have a particular age" for two information elements that have an index you will need to use count_if on one of the index. There is no such thing as a merge on two ranges.

This is simple, has a good compromise between efficiency and data.

If you need to make a complex query that need to be very efficient you will need to add an index that take care of all the conditions. For example for the preceding query you could add an index sorted by the household and age. The use of this index will give you directly with equal_range the concerned elements, but you will need yet to count them using count_if. This will be more efficient as the set of elements to iterate on will be shorter.

If all serotype states are fully independent, and you can't afford to create, say, a vector per serotype and push on / erase shared_ptrs appropriately, there's nothing you can do to go faster than just looping through them all to determine the count of an arbitrary serotype. Now that could be optimized in some ways, like a binary string, but fundamentally, it's the same.

I've been thinking about a sort of, minimum serotype value. If you're going to be querying many different serotypes in quick order, you could store in the object the maximum serotype and the minimum serotype it has, efficiently excluding it from all the remaining searches at once.