Fastest way to split overlapping date ranges

https://www.devze.com | 2023-02-26 10:31 | Source: Internet
I have date range data in SQL DB table that has these three (only relevant) columns:

  • ID (int identity)
  • RangeFrom (date only)
  • RangeTo (date only)

For any given date range, there may be an arbitrary number of records that may overlap (completely or partially).

Conditions

  1. Every record with higher ID (newer record) takes precedence over older records that it may overlap (fully or partially)
  2. Ranges are at least 1 day long (RangeFrom and RangeTo differ by at least one day)

So for a given date range (not longer than, e.g., 5 years) I have to

  1. get all range records that fall into this range (either fully or partially)
  2. split these overlaps into non-overlapping ranges
  3. return these new non overlapping ranges

My take on it

Since there's a lot of complex data related to these ranges (lots of joins, etc.) and since raw processor + memory power handles this much more efficiently than the SQL DB engine, I decided to load the overlapping data from the DB into my data layer and do the range chopping/splitting in memory. This gives me much more flexibility as well as speed, in terms of both development and execution.

If you think this should be better handled in DB let me know.

Question

I would like to write the fastest and, if at all possible, least resource-hungry conversion algorithm. Since I get lots of these records, and they are related to various users, I have to run this algorithm for each user and their set of overlapping ranges.


What would be the most efficient (fast and light on resources) way of splitting these overlapping ranges?

Example data

I have records ID=1 to ID=7 that overlap in this manner (the actual dates are irrelevant; I can show the overlaps better this way):

       6666666666666
                44444444444444444444444444         5555555555
          2222222222222            333333333333333333333            7777777
11111111111111111111111111111111111111111111111111111111111111111111

Result should look like:

111111166666666666664444444444444444444444333333333555555555511111117777777

The result looks as if we viewed these overlaps from the top and recorded the IDs visible in this top-down view.

The result will actually be transformed into new range records, so the old IDs become irrelevant, but their RangeFrom and RangeTo values (along with all related data) will be used:

111111122222222222223333333333333333333333444444444555555555566666667777777

This is of course just an example of overlapping ranges. It can be anything from 0 to X records for any given date range. And as we can see, range ID=2 got completely overwritten by 4 and 6, so it became completely obsolete.


How about an array of nullable integers

I've come up with an idea of my own:

  1. for the given date range I would create an in memory array of integers with as many items as there are days in the range.

  2. fill array with null values. All of them.

  3. order records by ID in reverse order

  4. flatten overlapped ranges by iterating over ordered records and do the following on each item:

    1. get item
    2. calculate start and end offset for array (days difference)
    3. set all array values between these two offsets to item ID but only when value is null
    4. continue to step 4.1
  5. you end up with an array of flattened ranges, filled with record IDs

  6. create a new set of records, starting a new one each time the ID in the array changes. Each new record should use the data associated with the record ID stored in the array

  7. repeat the whole thing for the next person and their set of overlapped ranges (don't forget to reuse the same array) = go back to step 2.

And that's it basically.
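A minimal C# sketch of these steps, assuming a hypothetical RangeRecord type with the three columns described above (and treating RangeTo as exclusive; adjust if your ranges are inclusive):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the DB row; names are assumptions.
public record RangeRecord(int Id, DateTime RangeFrom, DateTime RangeTo);

public static class RangeFlattener
{
    // Flattens overlapping ranges for the window [windowFrom, windowTo);
    // higher IDs win. Returns non-overlapping (Id, From, To) segments.
    public static List<(int Id, DateTime From, DateTime To)> Flatten(
        IEnumerable<RangeRecord> records, DateTime windowFrom, DateTime windowTo)
    {
        int days = (windowTo - windowFrom).Days;
        int?[] owner = new int?[days];               // steps 1+2: one slot per day, all null

        // step 3: newest (highest ID) first
        foreach (var r in records.OrderByDescending(r => r.Id))
        {
            // step 4.2: clamp the record to the window and compute day offsets
            int start = Math.Max(0, (r.RangeFrom - windowFrom).Days);
            int end   = Math.Min(days, (r.RangeTo - windowFrom).Days);
            for (int i = start; i < end; i++)
                if (owner[i] == null)                // step 4.3: only unclaimed days
                    owner[i] = r.Id;
        }

        // step 6: emit a new segment each time the owning ID changes
        var result = new List<(int, DateTime, DateTime)>();
        int segStart = 0;
        for (int i = 1; i <= days; i++)
        {
            if (i == days || owner[i] != owner[segStart])
            {
                if (owner[segStart] is int id)
                    result.Add((id, windowFrom.AddDays(segStart), windowFrom.AddDays(i)));
                segStart = i;
            }
        }
        return result;
    }
}
```

Because records are visited newest-first and each day is claimed only once, fully overlapped records (like ID=2 in the example) simply never appear in the output.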

A 10-year given date range requires an array of approx. 3650 nullable integers, which I think is a rather small memory footprint (each integer takes 4 bytes; I don't know exactly how much space a nullable integer occupies, since it holds an int and a bool, but let's assume 8 bytes, which totals 3650 * 8 = 28.52 KB), and it can be manipulated in memory easily and rather quickly. Since I'm not creating date ranges, splitting, or anything similar, these are barely more than assignment operations with an if that checks whether the value has already been set.

A 10-year date range is a rare, exaggerated extreme. 75% of date ranges will be within 3 months, a quarter of a year (90 days * 8 bytes = 720 bytes), and 99% will fall within a whole year (365 * 8 = 2920 bytes = 2.85 KB).

I find this algorithm more than appropriate for flattening overlapped date ranges.

To halve the memory footprint I could use int instead of int? and set values to -1 instead of null.

A premature loop-break possibility

I could also keep a count of days that aren't set yet; when it reaches 0 I can break the loop, because all remaining ranges are fully overlapped and wouldn't set any more values in the array. This would speed things up a bit when there are lots of range records (which will be rather rare).
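Both ideas combined in a small sketch: plain int with -1 as the "unset" sentinel, plus a counter of unclaimed days that allows breaking out early. The start/end values are assumed to be precomputed day offsets as in step 4.2 above:

```csharp
using System;

public static class EarlyBreakDemo
{
    // Fills the owner array newest-record-first; returns how many records
    // were actually visited before every day was claimed.
    public static int Fill(int[] owner, (int Id, int Start, int End)[] newestFirst)
    {
        Array.Fill(owner, -1);           // -1 = unset (instead of null in an int?[])
        int unset = owner.Length;        // days not yet claimed by any record
        int visited = 0;
        foreach (var (id, start, end) in newestFirst)
        {
            visited++;
            for (int i = start; i < end; i++)
            {
                if (owner[i] == -1)
                {
                    owner[i] = id;
                    unset--;
                }
            }
            if (unset == 0) break;       // remaining records are fully overlapped
        }
        return visited;
    }
}
```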


The free Time Period Library for .NET includes the tool TimePeriodIntersector, which intersects various overlapping time ranges.

The algorithm uses a timeline and enumerates all moments within a time range (counting start/end points per moment):

// ----------------------------------------------------------------------
public void TimePeriodIntersectorSample()
{
  TimePeriodCollection periods = new TimePeriodCollection();

  periods.Add( new TimeRange( new DateTime( 2011, 3, 01 ), new DateTime( 2011, 3, 10 ) ) );
  periods.Add( new TimeRange( new DateTime( 2011, 3, 05 ), new DateTime( 2011, 3, 15 ) ) );
  periods.Add( new TimeRange( new DateTime( 2011, 3, 12 ), new DateTime( 2011, 3, 18 ) ) );

  periods.Add( new TimeRange( new DateTime( 2011, 3, 20 ), new DateTime( 2011, 3, 24 ) ) );
  periods.Add( new TimeRange( new DateTime( 2011, 3, 22 ), new DateTime( 2011, 3, 28 ) ) );
  periods.Add( new TimeRange( new DateTime( 2011, 3, 24 ), new DateTime( 2011, 3, 26 ) ) );

  TimePeriodIntersector<TimeRange> periodIntersector =
                    new TimePeriodIntersector<TimeRange>();
  // calculate intersection periods; do not combine the resulting time periods
  ITimePeriodCollection intersectedPeriods = periodIntersector.IntersectPeriods( periods, false );

  foreach ( ITimePeriod intersectedPeriod in intersectedPeriods )
  {
    Console.WriteLine( "Intersected Period: " + intersectedPeriod );
  }
  // > Intersected Period: 05.03.2011 - 10.03.2011 | 5.00:00
  // > Intersected Period: 12.03.2011 - 15.03.2011 | 3.00:00
  // > Intersected Period: 22.03.2011 - 24.03.2011 | 2.00:00
  // > Intersected Period: 24.03.2011 - 26.03.2011 | 2.00:00
} // TimePeriodIntersectorSample

The ID mapping should be an easy task.


I'm not quite sure how useful this would be, but here's how I would approach it (first unoptimized, for easy understanding):

  • convert the table mapping from [ID->range] to [date->list of IDs].

(Sort by date; each date, whether a start or an end, begins a time range that reaches until the next date.) Your table would then look like:

        |666|666666|6666|
        |   |      |4444|444|444444444444|4444444|         |55555|55555|
        |   |222222|2222|222|            |3333333|333333333|33333|     |       |7777777
 1111111|111|111111|1111|111|111111111111|1111111|111111111|11111|11111|1111111|

 1234567|890|123456|7890|123|4


 1 -> 1
 8 -> 1,6
 11 -> 6,2,1
 17 -> 6,4,2,1
 21 -> 4,2,1
 24 -> 4,1
 ...
  • select the largest element in each list
  • concatenate consecutive records with the same largest value.

Since you will have duplicate IDs in your final database ("1" gets split over two segments in your example), keeping the database in date->ID format rather than ID->range seems preferable in the end.

Now for the obvious optimizations: of course, don't keep the list of IDs with each date record. Just fill in a date->ID table with null IDs, and while filling it in with the final records, replace each value with the largest record ID found so far:

  • create a table of all date entries, [date -> ID]
  • for each record in your original table:
    • select the dates in its from-to range,
    • if any has a null ID, or one lower than the current record's ID, fill it in with the current ID.
  • then concatenate: if the next record has the same ID as the previous one, remove the next.
  • in the end you may want to denormalize a bit, replacing the extraction of two consecutive records per range with [date -> ID, length] or [date -> ID, end_date]

Adding a new record works just like one iteration of the creation process. Removing a record, on the other hand, seems quite tricky.
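A rough C# sketch of this date->ID approach (names are illustrative; a SortedDictionary stands in for the date table, and end dates are treated as exclusive):

```csharp
using System;
using System.Collections.Generic;

public static class DateTableFlattener
{
    // Builds a [date -> winning ID] table (higher ID wins per day), then
    // concatenates consecutive days with the same ID into segments.
    public static List<(int Id, DateTime From, DateTime To)> Flatten(
        IEnumerable<(int Id, DateTime From, DateTime To)> records)
    {
        var table = new SortedDictionary<DateTime, int>();

        // Claim each day unless a higher ID already holds it.
        foreach (var r in records)
            for (var d = r.From; d < r.To; d = d.AddDays(1))
                if (!table.TryGetValue(d, out int id) || id < r.Id)
                    table[d] = r.Id;

        // Concatenate: extend the previous segment while the ID stays the same.
        var result = new List<(int Id, DateTime From, DateTime To)>();
        foreach (var (day, id) in table)
        {
            if (result.Count > 0 && result[^1].Id == id && result[^1].To == day)
                result[^1] = (id, result[^1].From, day.AddDays(1));
            else
                result.Add((id, day, day.AddDays(1)));
        }
        return result;
    }
}
```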


Effectively you want to stack the data and select the maximum from the stack. I have had to implement something similar before; the approach we used gave us a bit more flexibility than you require, so it may not be appropriate, but it was this:

Have an object for managing the records, and add each record to this object. When a record is added, create a new date range and associate the record's value with that range. Then check whether the range overlaps any existing range. If it does, create a new range for each overlap and associate all of the values of the overlapping ranges with the new range. This can be done either as you add each record or in a single pass once all the data has been added.

At the end you have an object which contains unique ranges, each with a collection of associated values, a bit like your picture above.

       |666|666666|6666|
       |   |      |4444|444|444444444444|4444444|         |55555|55555|
       |   |222222|2222|222|            |3333333|333333333|33333|     |       |7777777
1111111|111|111111|1111|111|111111111111|1111111|111111111|11111|11111|1111111|

You can then provide a class with a flattening function (probably using the strategy pattern) which converts the unique ranges with collections of values into unique ranges with a single value; this will obviously also concatenate adjacent ranges that end up with the same value.

You would want a class that selects the maximum value from each unique range, but you might also want to select the minimum, sum the values, average them, count them, etc. Each of these options can be implemented by passing a different strategy.

As I said, this approach might be less efficient than one that only selects the maximum value, since in that case you wouldn't need to keep all the values in the stack, but the implementation was fairly straightforward as I remember.
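An illustrative sketch of this stacking idea (not the answer's actual implementation): split at every boundary, collect the stack of IDs covering each segment, and reduce each stack with a pluggable strategy:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StackedRanges
{
    // Splits records into non-overlapping segments, each carrying the stack of
    // IDs that cover it, then reduces each stack with a caller-supplied strategy.
    public static List<(DateTime From, DateTime To, int Value)> Flatten(
        IEnumerable<(int Id, DateTime From, DateTime To)> records,
        Func<IEnumerable<int>, int> strategy)
    {
        var recs = records.ToList();
        // Every distinct start/end point is a potential segment boundary.
        var bounds = recs.SelectMany(r => new[] { r.From, r.To })
                         .Distinct().OrderBy(d => d).ToList();

        var result = new List<(DateTime, DateTime, int)>();
        for (int i = 0; i + 1 < bounds.Count; i++)
        {
            DateTime from = bounds[i], to = bounds[i + 1];
            var stack = recs.Where(r => r.From <= from && r.To >= to)
                            .Select(r => r.Id).ToList();
            if (stack.Count > 0)                       // skip gaps covered by nothing
                result.Add((from, to, strategy(stack)));
        }
        return result;
    }
}
```

Calling Flatten(recs, ids => ids.Max()) reproduces the "newest wins" behaviour; swapping the lambda gives min, count, sum, etc. Merging adjacent segments that end up with the same value is left out for brevity.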
