I have posted a bit here related to a project I have been trying to work on and I keep hitting design problems and have to design from scratch. So I'm wondering if I can post what I'm trying to do and someone can help me understand how I can get the result I want.
BackGround:
I'm new to programming and trying to learn. So I took a project that interested me which involves basically taking list and breaking down each number using only numbers from the list. I know I could easily brute force this(which I did) but I wanted to also learn Hbas开发者_C百科e, Hadoop, and parallel processing so I wanted do it in a way that I could break the process across various machines. I thought the way to do this was using dynamic programming and recursions to create a table of possibilities that could be broken down even further.
Example:
If I submit the list: [1,2, 4]
I should get {1: [[1]], 2: [[1, 1]], 4: [[2, 2]]}
. What its basically saying is 2+2=4, 1+1=2, and 1=1..so trying to see all the ways to make 4, I can just look up this list(which will be in a database) and see 2+2=4, then break down 2..and so on.. I have the lookup job working but the breakdown I am having problems with. Using brute force will not let me use large numbers(say a million, with a thousand numbers in the list) in a way that I can use hadoop or some other tool to scale it. Here are some more examples of possible results:
[1,2, 3, 4] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]]}
[1,2, 3, 4, 6, 10] = {1: [[1]], 2: [[1, 1]], 3: [[1, 2]], 4: [[1, 3], [2, 2]], 6: [[2, 4], [3, 3]], 10: [[4, 6], [2, 2, 2, 2, 2]]}
[10, 30, 50] = 50: [[10, 10, 30]], 30: [[10, 10, 10]]}
The logic with this approach is that it will not take time to comput the next possible data in the list, so if I send a list with a million numbers it could do it quickly and even scale to a hadoop cluster.
The code I created to get this to work is here but that question was on how to correct the design problem. I got advice that this is a partition problem and looked around and found much simpler versions of what I was trying to do ( activestate ), but its not exactly what I am trying to do because it breaks down a number and does not use a specific data set to do it.
Question:
So hopefully I clearly described what I am trying to do. What do I need to read/study/learn to create a table of a partition of an list in python using dynamic programming so it can scale? Its just a hobby and not time sensitive but I feel I have been working on this for over 3 months and each time I hit design problems that cause me to have to start from scratch. How can I build this correctly(to see my incorrect way see my link above)? I have googled and found solutions to the knapsack problem and partition problems but they seem to be more for school work and not really built to scale with large datasets.
Hopefully someone can give me insight but regardless thank you for reading this.
You have to be aware, that DP problems are not per se optimal for independent and distributed computations.
When you consider the classical forms of DP algorithms, you have a matrix/table/array and you successively compute new values in a certain order. Each computation of a value requires other values that have to be created before. Hence, you lose data independence and can maximally compute a certain number of array fields at the same time, depending on the specific DP algorithms. For example, many DP algorithms can process a whole table column in parallel, as each field relies on fields from a previous column. But that's already the limit due to the data dependency of all remaining fields after that column.
That being said, calculating the sum possibilities of the various numbers available in your list is NOT a DP problem. You do not solve any sub-problems at all, but simply collect all possible sums (if they happen to match one of your list entries).
Hence, I suggest the following sightly different approach:
- Compute a new list with all possible summations. This is data independent and can be parallelized, but you need some upper bound for termination. Example:
[1,2,4]
becomes[ [1],[2],[4],[1,1],[1,2],[1,4],...]
. You don't have to explicitly construct this list, but just pass each such combination on to the next step. - Evaluate each computation, i.e. create the sum and check whether it matches a value from your original list. Again, you are data independent and can perform all these calculations independently.
- Combine the positive results into the final datastructure.
So to sum it up and answer your questions:
- Rethink, whether you want to regard this problem as DP at all.
- You may want to read up on data-parallelism. This is particularly relevant when solving such problems with GPUs, so the corresponding literature on CUDA/OpenCL may be useful, too.
精彩评论