Data Set Manipulation

Developer https://www.devze.com 2023-02-11 13:27 Source: the web

I need to reverse engineer a data set to its original form. The original data set was derived from a process in which multiple users, each with multiple characteristics, enter a room and some of them click a button. The column variables are in indicator form: where a user clicks the button or has a certain characteristic, this is recorded as a one, and where they don't, it is recorded as a zero. This data set is then transformed into a form in which the characteristic types are observations represented by two characteristic variables. The new data set shows, for each pair of characteristics, the number of users who have both and their button clicks. It also encompasses all users. My explanation might not be the clearest, so here is an image that might help explain it:

[Image: Data Set Manipulation]

I'm thinking of using some type of lookup algorithm to solve this, but that might not be too efficient.

Unfortunately, in general, you will not be able to unambiguously reverse engineer your data set. Ignoring for the moment the action column, consider the following two data sets:

Data set 1:

Data set 2:

Unless I'm mistaken, these two data sets would show the same number of users under each pair of characteristics:

A A 5
A B 2
A C 2
B B 5
B C 2
C C 5
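
To make the ambiguity concrete, here is a minimal Python sketch. The two data sets in it are hypothetical: they are one possible pair (with 10 and 9 users) that produces the counts above, not necessarily the ones originally posted. The helper name `pair_counts` is made up for illustration, each user is written as the string of characteristics they have, and the action column is ignored.

```python
from itertools import combinations_with_replacement

CHARS = "ABC"

# Hypothetical data set 1: 10 users, one of whom has all three characteristics.
data_set_1 = ["ABC", "AB", "AC", "BC", "A", "A", "B", "B", "C", "C"]
# Hypothetical data set 2: 9 users, none of whom has all three.
data_set_2 = ["AB", "AB", "AC", "AC", "BC", "BC", "A", "B", "C"]

def pair_counts(users):
    """Count the users who have both characteristics of each unordered pair.

    The pair (X, X) simply counts the users who have X.
    """
    return {
        (x, y): sum(1 for u in users if x in u and y in u)
        for x, y in combinations_with_replacement(CHARS, 2)
    }

# Print the pair table for the first data set, one "X Y count" row per line.
for (x, y), n in pair_counts(data_set_1).items():
    print(x, y, n)

# Check whether the two different data sets collapse to the same table.
print(pair_counts(data_set_1) == pair_counts(data_set_2))
```

Both data sets give the same six rows, even though one contains a user with all three characteristics and the other does not.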

Now, you might be tempted to think: "Hey, the first data set has 10 users but the second data set has only 9. If I'm able to get the total number of users, will this solve my problem?" The answer is mostly no. If you have three or fewer characteristics, then the answer is yes (see: Inclusion-exclusion Principle). However, if you have more than three characteristics, the answer is no. You can construct similarly ambiguous examples where the total number of users is the same.
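
As a sketch of such an example with four characteristics, the two hypothetical data sets below have the same number of users (every user has at least one characteristic) and the same count for every pair, yet they are different. The data and the `pair_counts` helper are illustrative assumptions, not taken from the original question.

```python
from itertools import combinations_with_replacement

CHARS = "ABCD"

# Two hypothetical data sets with four users each.
# Each user is written as the string of characteristics they have.
data_set_x = ["AD", "BD", "CD", "ABCD"]
data_set_y = ["D", "ABD", "ACD", "BCD"]

def pair_counts(users):
    """Count the users who have both characteristics of each unordered pair."""
    return {
        (x, y): sum(1 for u in users if x in u and y in u)
        for x, y in combinations_with_replacement(CHARS, 2)
    }

same_total = len(data_set_x) == len(data_set_y)
same_table = pair_counts(data_set_x) == pair_counts(data_set_y)
same_data = sorted(data_set_x) == sorted(data_set_y)

print(same_total, same_table, same_data)
```

The totals and the pair tables match while the data sets themselves do not, so knowing the total does not pin down the original here.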

As previous posters have mentioned, the data set is not going to be unique, but you might have another issue: what is the size of the data set? Intuitively, this problem looks NP-hard. If we reduce the problem to simply finding any n-by-k matrix (first grid: n participants, k characteristics) that satisfies the constraints (second grid), brute forcing it requires trying every possible combination. We can constrain this a little by only trying solutions that have the specified number of persons per characteristic, but in the worst case that is still (n choose n/2)^k combinations.
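
For a feel of what the brute-force search looks like, here is a minimal sketch, assuming the pair table from the earlier answer (three characteristics, action column ignored). The function names are hypothetical. Because the order of the rows does not matter, it enumerates multisets of rows rather than whole matrices; that is already a reduction, but the number of candidates still grows very quickly with n and k.

```python
from itertools import combinations, combinations_with_replacement

def pair_counts(users, chars):
    """Count the users who have both characteristics of each unordered pair."""
    return {
        (x, y): sum(1 for u in users if x in u and y in u)
        for x, y in combinations_with_replacement(chars, 2)
    }

def brute_force(target, chars, n_users):
    """Return every multiset of n_users rows whose pair counts equal target."""
    # All 2**k possible rows, each written as the string of characteristics
    # present; the empty string is a user with no characteristics.
    row_types = [
        "".join(combo)
        for size in range(len(chars) + 1)
        for combo in combinations(chars, size)
    ]
    return [
        users
        for users in combinations_with_replacement(row_types, n_users)
        if pair_counts(users, chars) == target
    ]

target = {
    ("A", "A"): 5, ("A", "B"): 2, ("A", "C"): 2,
    ("B", "B"): 5, ("B", "C"): 2, ("C", "C"): 5,
}

for n in (9, 10):
    print(n, len(brute_force(target, "ABC", n)))
```

With 10 users this search allows users with no characteristics at all, which is one more source of ambiguity to keep in mind.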

I'm thinking that a possible brute-force solution might be the following, since I'm given the total number of users and actions and I can figure out the number of characteristics:

First, I can create a data structure with the same dimensions as the original but with all observations equal to zero.
Look up the given data set for users and actions with characteristic A and adjust the data set accordingly.
Look up the given data set for users and actions with characteristic B and with both A and B, and adjust the data set accordingly.
Look up the given data set for users and actions with characteristic C, with both A and C, and with both B and C, and adjust the data set accordingly.

I've only done it up to A, B, and C, but by the looks of it, it gets more complicated as I go on to more characteristics, because I would have to look up the intersections of nearly all of them. Also, the given data set can be reduced because a lot of the entries are duplicates; for instance, C A is the same as A C.
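
For the three-characteristic case, the idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the sample table are hypothetical, the action column is left out, and every user is assumed to have at least one characteristic so that inclusion-exclusion gives the number of users with all three. Duplicate entries such as C A and A C are collapsed first, which reduces k*k ordered pairs to k*(k+1)/2 unordered ones.

```python
def dedupe_pairs(rows):
    """Collapse (X, Y, count) rows so that each unordered pair appears once."""
    unique = {}
    for x, y, n in rows:
        key = tuple(sorted((x, y)))  # ("C", "A") and ("A", "C") share one key
        if unique.setdefault(key, n) != n:
            raise ValueError(f"conflicting counts for {key}")
    return unique

def reconstruct_three(counts, n_users):
    """Rebuild (A, B, C) indicator rows from pair counts.

    Assumes every user has at least one of the three characteristics.
    """
    a, b, c = counts["A", "A"], counts["B", "B"], counts["C", "C"]
    ab, ac, bc = counts["A", "B"], counts["A", "C"], counts["B", "C"]
    # Inclusion-exclusion: n_users = a + b + c - ab - ac - bc + abc
    abc = n_users - (a + b + c) + (ab + ac + bc)
    regions = {
        (1, 1, 1): abc,
        (1, 1, 0): ab - abc,
        (1, 0, 1): ac - abc,
        (0, 1, 1): bc - abc,
        (1, 0, 0): a - ab - ac + abc,
        (0, 1, 0): b - ab - bc + abc,
        (0, 0, 1): c - ac - bc + abc,
    }
    if any(v < 0 for v in regions.values()):
        raise ValueError("counts are inconsistent with n_users")
    # Emit one indicator row per user in each region.
    return [row for row, k in regions.items() for _ in range(k)]

# Hypothetical pair table that still contains mirrored duplicates.
table = [
    ("A", "A", 5), ("A", "B", 2), ("B", "A", 2), ("A", "C", 2),
    ("C", "A", 2), ("B", "B", 5), ("B", "C", 2), ("C", "C", 5),
]

rows = reconstruct_three(dedupe_pairs(table), n_users=10)
for row in rows:
    print(row)
```

With more than three characteristics the higher-order intersections are no longer determined by the pair counts and the total, so this direct approach does not extend as-is.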
