
How to handle billions of objects without an "Out of memory" error


I have an application which may need to process billions of objects. Each object is of TRange class type. These ranges are created at different parts of an algorithm which depends on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add them to the collection, the system throws an "Out of memory" error. Now I want to iterate through each object, mainly for two purposes:

  1. To apply an operation to each TRange object (e.g. output certain properties).
  2. To get a cumulative sum of a certain property (e.g. each range has a Weight property and I want to retrieve a TotalWeight that is the sum of all the range weights).

How do I effectively create an iterator for these objects without raising "Out of memory"?

I have handled the first case by passing a function pointer to the algorithm function. For example:

procedure createRanges(aProc: TRangeProc); //aProc is a pointer to a procedure that takes a TRange
var range: TRange;
  rangerec: TRangeRec;
begin
  range:=TRange.Create;
  try 
    while canCreateRange do begin //certain conditions needed to create a range
      rangerec := ReturnRangeRec;
      range.Update(rangerec); //don't create new, use the same object.
      if Assigned(aProc) then aProc(range);
    end;
  finally
    range.Free;
  end;
end;

But the problem with this approach is that to add new functionality, say to retrieve the total weight I mentioned earlier, I either have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.

Thank you all in advance

Pradeep


For such large amounts of data you need to keep only a portion of the data in memory. The rest should be serialized to the hard drive. I tackled such a problem like this:

  1. I created an extended storage that can store a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live in memory simultaneously.
  2. Then I derived the record classes from the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
  3. Every time you need a new or already existing record, you ask the extended storage for it. If the maximum number of objects is exceeded, the storage streams some of the least-used records back to the hard drive.

This way the records are transparent. You always access them as if they were in memory, but they may get loaded from the hard drive first. It works really well. By the way, RAM works in a very similar way: it only holds a certain subset of all the data on your hard drive, your working set.

I did not post any code because it is beyond the scope of the question itself and would only confuse.
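
Purely to illustrate the idea (this is not the original code; the type names, the integer-id keying and the one-file-per-record layout are assumptions made for the sketch, and adding brand-new records plus error handling are left out), a minimal Delphi version of such an extended storage could look like this:

uses SysUtils, Classes, Generics.Collections;

type
  // Records derived from this base class know how to stream themselves.
  TStorableRec = class
  public
    procedure SaveToStream(AStream: TStream); virtual; abstract;
    procedure LoadFromStream(AStream: TStream); virtual; abstract;
  end;
  TStorableRecClass = class of TStorableRec;

  // Keeps at most FMaxInMemory records loaded; evicted records live in files.
  TExtendedStorage = class
  private
    FMaxInMemory: Integer;
    FDir: string;
    FLoaded: TObjectDictionary<Integer, TStorableRec>; // id -> loaded record
    FUseOrder: TList<Integer>; // least recently used first, most recent last
    function FileNameFor(AId: Integer): string;
    procedure EvictOldest;
  public
    constructor Create(const ADir: string; AMaxInMemory: Integer);
    destructor Destroy; override;
    function Get(AId: Integer; AClass: TStorableRecClass): TStorableRec;
  end;

constructor TExtendedStorage.Create(const ADir: string; AMaxInMemory: Integer);
begin
  inherited Create;
  FDir := IncludeTrailingPathDelimiter(ADir);
  FMaxInMemory := AMaxInMemory;
  FLoaded := TObjectDictionary<Integer, TStorableRec>.Create([doOwnsValues]);
  FUseOrder := TList<Integer>.Create;
end;

destructor TExtendedStorage.Destroy;
begin
  FUseOrder.Free;
  FLoaded.Free;
  inherited;
end;

function TExtendedStorage.FileNameFor(AId: Integer): string;
begin
  Result := FDir + IntToStr(AId) + '.rec';
end;

procedure TExtendedStorage.EvictOldest;
var
  id: Integer;
  fs: TFileStream;
begin
  id := FUseOrder[0]; // least recently used record
  FUseOrder.Delete(0);
  fs := TFileStream.Create(FileNameFor(id), fmCreate);
  try
    FLoaded[id].SaveToStream(fs); // stream it back to the hard drive
  finally
    fs.Free;
  end;
  FLoaded.Remove(id); // frees the in-memory object (doOwnsValues)
end;

// Ask the storage for a record: it is returned from memory if loaded,
// otherwise re-created from its file, evicting the least used record first
// when the in-memory limit is reached.
function TExtendedStorage.Get(AId: Integer; AClass: TStorableRecClass): TStorableRec;
var
  fs: TFileStream;
begin
  if not FLoaded.TryGetValue(AId, Result) then
  begin
    if FLoaded.Count >= FMaxInMemory then
      EvictOldest;
    Result := AClass.Create; // empty shell, filled from its file
    fs := TFileStream.Create(FileNameFor(AId), fmOpenRead);
    try
      Result.LoadFromStream(fs);
    finally
      fs.Free;
    end;
    FLoaded.Add(AId, Result);
  end;
  // mark as most recently used
  FUseOrder.Remove(AId);
  FUseOrder.Add(AId);
end;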


Look at TgsStream64. This class can handle huge amounts of data through file mapping.

http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas
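
If you would rather see the mechanism itself, file mapping is just the Win32 memory-mapping API underneath. The sketch below is generic Windows code, not TgsStream64's own interface, and the helper name is made up; it maps a whole read-only file so the OS pages data in on demand, and for files larger than the address space you would map smaller views at different offsets instead:

uses Windows, SysUtils;

// Hypothetical helper: maps a file into the address space so its contents can
// be read like ordinary memory without loading the whole file at once.
procedure MapAndProcessFile(const AFileName: string);
var
  hFile, hMapping: THandle;
  pView: PByte;
begin
  hFile := CreateFile(PChar(AFileName), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if hFile = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    hMapping := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
    if hMapping = 0 then
      RaiseLastOSError;
    try
      pView := MapViewOfFile(hMapping, FILE_MAP_READ, 0, 0, 0); // 0 = map whole file
      if pView = nil then
        RaiseLastOSError;
      try
        // ... read your records directly from pView^ here ...
      finally
        UnmapViewOfFile(pView);
      end;
    finally
      CloseHandle(hMapping);
    end;
  finally
    CloseHandle(hFile);
  end;
end;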


But the problem with this approach is that to add new functionality, say to retrieve the total weight I mentioned earlier, I either have to duplicate the algorithm function or pass an optional out parameter.

It's usually done like this: you write an enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: pointer"). You declare the callback type so that its first parameter is that same untyped pointer:

type
  TRangeProc = procedure(Data: pointer; range: TRange);

procedure enumRanges(aProc: TRangeProc; Data: pointer);
begin
  {for each range}
    aProc(Data, range);
end;

Then if you want to, say, sum all ranges, you do it like this:

type
  TSumRecord = record
    Sum: int64;
  end;
  PSumRecord = ^TSumRecord;

procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
  SumRecord.Sum := SumRecord.Sum + range.Value;
end;

function SumRanges(): int64;
var SumRec: TSumRecord;
begin
  SumRec.Sum := 0;
  enumRanges(TRangeProc(@SumProc), @SumRec);
  Result := SumRec.Sum;
end;

Anyway, if you need to create billions of ANYTHING you're probably doing it wrong (unless you're a scientist modelling something extremely large-scale and detailed). Even more so if you need to create all those billions of objects every time you want one of them. This is never good. Try to think of alternative solutions.


"Runner" has a good answer how to handle this!

But I would like to know whether you could do a quick fix: make the TRange objects smaller. Maybe you have a big ancestor class? Can you take a look at the instance size of a TRange object? Maybe you would be better off using packed records?
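
As a quick way to check that, TObject.InstanceSize reports how many bytes a single instance of a class occupies (not counting anything it allocates separately). The field layout below is invented purely for illustration; substitute the real TRange declaration to measure it:

program CheckSizes;
{$APPTYPE CONSOLE}

type
  // Hypothetical stand-in for the asker's class; use the real TRange instead.
  TRange = class
  private
    FLow, FHigh: Integer;
    FWeight: Double;
  end;

  // The same data as a packed record: no VMT pointer, no per-object heap
  // overhead and no alignment padding between fields.
  TRangeData = packed record
    Low, High: Integer;
    Weight: Double;
  end;

begin
  Writeln('TRange.InstanceSize: ', TRange.InstanceSize, ' bytes per object');
  Writeln('SizeOf(TRangeData):  ', SizeOf(TRangeData), ' bytes per record');
end.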


This part:

As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects.

sounds a bit like calculating Fibonacci. Maybe you can reuse some of the TRange objects instead of creating redundant copies? There is a C++ article describing this approach - it works by storing already calculated intermediate results in a hash map.
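
Purely as an illustration of that idea (Delphi's TDictionary plays the role of the hash map here, and the Fibonacci-style recurrence stands in for whatever the real dependency between ranges is), memoization looks like this: each intermediate result is computed once, stored, and reused instead of being recreated:

program MemoDemo;
{$APPTYPE CONSOLE}
uses Generics.Collections;

var
  Cache: TDictionary<Integer, Int64>;

// Hypothetical expensive step: each value depends on previously computed ones,
// so already calculated results are taken from the hash map instead of being
// built again from scratch.
function ComputeValue(N: Integer): Int64;
begin
  if Cache.TryGetValue(N, Result) then
    Exit; // reuse the stored intermediate result
  if N <= 1 then
    Result := N
  else
    Result := ComputeValue(N - 1) + ComputeValue(N - 2);
  Cache.Add(N, Result);
end;

begin
  Cache := TDictionary<Integer, Int64>.Create;
  try
    Writeln('Value(80) = ', ComputeValue(80));
  finally
    Cache.Free;
  end;
end.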


Handling billions of objects is possible, but you should avoid it as much as possible. Do this only if you absolutely have to...
I did create a system once that needed to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around them decided when an object would be saved to disk and thus removed from memory. Basically, when I requested an object, this class would check whether it was loaded or not. If not, it would re-create the object from disk, put it on top of a stack, and then move/write the bottom object of this stack to disk. As a result, my stack had a fixed (maximum) size, and it allowed me to use an unlimited number of objects, with reasonably good performance too.
Unfortunately, I don't have that code available anymore; I wrote it for a previous employer about 7 years ago. You would need to write a bit of code for the streaming support plus a bunch more for the stack controller which maintains all those objects, but it would technically allow you to create an unlimited number of objects, since you're trading RAM for disk space.
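
The streaming support itself is easy to reconstruct (the field names below are invented, since the real TRange layout isn't shown): each object writes and reads its own fields through a TStream, so the controller can park it in a file and revive it later:

uses Classes;

type
  // Hypothetical streamable range; replace the fields with the real ones.
  TStreamableRange = class
  public
    Low, High: Integer;
    Weight: Double;
    procedure SaveToStream(AStream: TStream);
    procedure LoadFromStream(AStream: TStream);
  end;

// Write the raw fields to any TStream (file stream, memory stream, ...).
procedure TStreamableRange.SaveToStream(AStream: TStream);
begin
  AStream.WriteBuffer(Low, SizeOf(Low));
  AStream.WriteBuffer(High, SizeOf(High));
  AStream.WriteBuffer(Weight, SizeOf(Weight));
end;

// Read them back, turning a record on disk into a live object again.
procedure TStreamableRange.LoadFromStream(AStream: TStream);
begin
  AStream.ReadBuffer(Low, SizeOf(Low));
  AStream.ReadBuffer(High, SizeOf(High));
  AStream.ReadBuffer(Weight, SizeOf(Weight));
end;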
