I have a dataset of 2D points (~500k of them) on which I'd like to perform some kind of quadrat count analysis. The basics of quadrat count is to split your 2D space into a regular grid (each cell has size SxS
) and count the number of points in each cell.
For some reason, I'd like to do a slight variation of that : instead of using a regular grid, I want to build the grid such that each cell contains at most K
points.
What I did is the following: I start with the whole space, and divide it in 4 cells (by "cutting" each dimension in half). Then, I count the number of points in each cell.开发者_StackOverflow社区 For those that contain more than K
points, I divide them again, etc., until I'm done.
I tried both recursive and iterative implementations of this simple algorithm, but none performed well when applied to the whole dataset. The main bottleneck is the counting part, obviously, so I was wondering what kind of datastructure would allow me to do this efficiently ?
(For now, I'm just using "conditional indexing" in Python : points = points[points[,1] > x1 and points[,1] <= x2 and points[,2] > y1 and points[,2] <= y2,]
)
Also, do you have maybe another idea on how I could build my grid ?
EDIT: In other words, what kind of data structure could I use to quickly count (and retrieve) the points that fall within a given rectangle ((x1, y1), (x2, y2))
?
This isn't complete but it might point you in the right direction.
Instead of starting big and going small, start small and go big.
Divide your space into, say, 100x100 cells. Count the number in each cell (this is exactly O(n), you count each cell once.)
From there on out you don't need to count cells. You can create CellGroups to count what cells it has, and from there I would use an algorithm to combine cells into CellGroups.
You might consider an approach that takes two small cells to merge them and recalculates.
while(true) {
take the smallest cellgroup
compare it to each other cellgroup starting with the second smallest
go up the list until you find two adjecent cell groups
if you find a match
merge them
update the cellgroup size rankings
repeat the process (continue the while(true)
otherwise
break out, you're done merging cells
}
I'm not familiar enough with Python, but if you run through entire array for each quadrant, it can be improved:
After each splitting group points according to quadrant they correspond to. When splitting further a quadrant analyse only corresponding subarray. This may speed up counting.
Also since you are OK with irregular grid, you may consider selecting separation lines always diving points into equal groups (horizontal and vertical splitting should be done separately for this).
精彩评论