This is a resource-allocation problem. My goal is to run a query that fetches the top-priority shift for each time slot.
The dataset is very large. For this example, let’s say there are 100 shifts each for 1000 companies (though the real dataset is even larger). They are all loaded into memory, and I need to run a single LINQ to Objects query against them:
var topShifts =
    (from s in shifts
     where (from s2 in shifts
            where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
            orderby s2.Priority
            select s2).First().Equals(s)
     select s).ToList();
The problem is that, without optimization, LINQ to Objects compares each and every object in both sets, doing a cross-join of all 1,000 x 100 against 1,000 x 100, which amounts to 10 billion (10,000,000,000) comparisons. What I want is to compare only objects within each company (as if Company were indexed in a SQL table). This would yield 1,000 sets of 100 x 100 objects, for a total of 10 million (10,000,000) comparisons. The latter scales linearly rather than quadratically as the number of companies grows.
Technologies like I4o would allow me to do something like this, but unfortunately, I don’t have the luxury of using a custom collection in the environment in which I’m executing this query. Also, I only expect to run this query once on any given dataset, so the value of a persistent index is limited. I’m expecting to use an extension method which would group the data by company, then run the expression on each group.
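That "group first, then query within each group" idea can be captured in a small extension method. The sketch below is only an illustration (`SelectPerGroup` is a hypothetical helper name, not an existing API): it groups the source by a key and runs an arbitrary sub-query against each group, so any pairwise work stays confined to one group at a time.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupQueryExtensions
{
    // Hypothetical helper: group the source by a key, then run a sub-query
    // against each group and flatten the results. Because the sub-query only
    // ever sees one group, comparisons stay within that group (here: one company).
    public static IEnumerable<TResult> SelectPerGroup<TSource, TKey, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector,
        Func<IEnumerable<TSource>, IEnumerable<TResult>> groupQuery)
    {
        return source.GroupBy(keySelector).SelectMany(g => groupQuery(g));
    }
}
```

For this problem it could be called as `shifts.SelectPerGroup(s => s.CompanyId, g => ...)`, where the inner query picks the lowest-priority shift per time slot within a single company's shifts.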
Full Sample code:
public struct Shift
{
    public static long Iterations;

    private int companyId;
    public int CompanyId
    {
        get { Iterations++; return companyId; }
        set { companyId = value; }
    }

    public int Id;
    public int TimeSlot;
    public int Priority;
}
class Program
{
    static void Main(string[] args)
    {
        const int Companies = 1000;
        const int Shifts = 100;
        Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));

        var timer = Stopwatch.StartNew();
        Console.WriteLine("Populating data");
        var shifts = new List<Shift>();
        for (int companyId = 0; companyId < Companies; companyId++)
        {
            for (int shiftId = 0; shiftId < Shifts; shiftId++)
            {
                shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
            }
        }
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));

        timer.Restart();
        Console.WriteLine("Computing Top Shifts");
        var topShifts =
            (from s in shifts
             where (from s2 in shifts
                    where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
                    orderby s2.Priority
                    select s2).First().Equals(s)
             select s).ToList();
        Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
        timer.Restart();

        Console.WriteLine("\nShifts:");
        foreach (var shift in shifts.Take(20))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine("\nTop Shifts:");
        foreach (var shift in topShifts.Take(10))
        {
            Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
        }

        Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations / 2));
        Console.WriteLine("Any key to continue");
        Console.ReadKey();
    }
}
Sample output:
1000 Companies x 100 Shifts
Populating data
Completed in 10.00ms
Computing Top Shifts
Completed in 520,721.00ms

Shifts:
C 0 Id 0 T 0 P0
C 0 Id 1 T 0 P1
C 0 Id 2 T 0 P2
C 0 Id 3 T 1 P3
C 0 Id 4 T 1 P4
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 7 T 2 P2
C 0 Id 8 T 2 P3
C 0 Id 9 T 3 P4
C 0 Id 10 T 3 P0
C 0 Id 11 T 3 P1
C 0 Id 12 T 4 P2
C 0 Id 13 T 4 P3
C 0 Id 14 T 4 P4
C 0 Id 15 T 5 P0
C 0 Id 16 T 5 P1
C 0 Id 17 T 5 P2
C 0 Id 18 T 6 P3
C 0 Id 19 T 6 P4

Top Shifts:
C 0 Id 0 T 0 P0
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 10 T 3 P0
C 0 Id 12 T 4 P2
C 0 Id 15 T 5 P0
C 0 Id 20 T 6 P0
C 0 Id 21 T 7 P1
C 0 Id 25 T 8 P0
C 0 Id 27 T 9 P2

Total Comparisons: 10,000,000,015.00
Any key to continue
Questions:
- How can I partition the query (while still executing it as a single LINQ query) in order to get the comparisons down from 10 billion to 10 million?
- Is there a more efficient way of solving the problem than a sub-query?
How about:

var topShifts = from g in shifts.GroupBy(s => s.CompanyId)
                from a in g.GroupBy(b => b.TimeSlot)
                select a.OrderBy(p => p.Priority).First();

It seems to produce the same output, but with only 100,015 comparisons.
with @Geoff's edit he just halved my comparisons :-)
Have you tried using group by:
var topShifts = from s in shifts
                group s by new
                {
                    CompanyId = s.CompanyId,
                    TimeSlot = s.TimeSlot
                } into p
                let temp = p.OrderBy(x => x.Priority).FirstOrDefault()
                select new
                {
                    CompanyId = temp.CompanyId,
                    TimeSlot = temp.TimeSlot,
                    Id = temp.Id,
                    Priority = temp.Priority
                };
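To sanity-check that a grouped query really does the reduced amount of work the question asks for, the counting trick from the question can be reused on a small dataset. This is a sketch, not a benchmark; the struct below is a minimal copy of the question's Shift, whose CompanyId getter bumps a static counter on every read:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal copy of the question's Shift struct: reading CompanyId
// increments a static counter, so key reads can be tallied.
public struct Shift
{
    public static long Iterations;
    private int companyId;
    public int CompanyId
    {
        get { Iterations++; return companyId; }
        set { companyId = value; }
    }
    public int Id;
    public int TimeSlot;
    public int Priority;
}

public static class Demo
{
    public static (int Count, long Reads) Run()
    {
        // 10 companies x 10 shifts, same shape as the question's data.
        var shifts = new List<Shift>();
        for (int c = 0; c < 10; c++)
            for (int i = 0; i < 10; i++)
                shifts.Add(new Shift { CompanyId = c, Id = i, TimeSlot = i / 3, Priority = i % 5 });

        Shift.Iterations = 0; // count only the query's key reads
        var topShifts = (from s in shifts
                         group s by new { s.CompanyId, s.TimeSlot } into g
                         select g.OrderBy(x => x.Priority).First()).ToList();

        // GroupBy hashes each key in a single pass, so CompanyId is read
        // once per element rather than once per pair of elements.
        return (topShifts.Count, Shift.Iterations);
    }
}
```

On this 100-element dataset the grouped query reads CompanyId once per element, instead of the pairwise reads the nested sub-query would make.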
I'm a bit unsure about what you want, to be honest, but from reading your code I'd say you could do something like:

(from company in shifts.GroupBy(s => s.CompanyId)
 from slot in company.GroupBy(s => s.TimeSlot)
 let lowPriority = slot.OrderBy(s => s.Priority).First()
 select lowPriority).ToList();