How to predict quality of data?

开发者 https://www.devze.com 2023-03-10 16:43 Source: Internet

I apologize in advance if I'm wording this badly. I have a large dataset that I am trying to analyze, but most of the data is not correct, and I need some help figuring out how to select the correct data.

Here is some more information to make this clearer. For example, I have the following:

color   value   quantity
red     20      2
blue    5       8
green   10      2

total   100

If only the values and the total are given, I find there are 36 possible answers:

#1 Found : 20.0*0.0 red + 5.0*0.0 blue + 10.0*10.0 green = 100.0
#2 Found : 20.0*0.0 red + 5.0*2.0 blue + 10.0*9.0 green = 100.0
#3 Found : 20.0*0.0 red + 5.0*4.0 blue + 10.0*8.0 green = 100.0
#4 Found : 20.0*0.0 red + 5.0*6.0 blue + 10.0*7.0 green = 100.0
#5 Found : 20.0*0.0 red + 5.0*8.0 blue + 10.0*6.0 green = 100.0
#6 Found : 20.0*0.0 red + 5.0*10.0 blue + 10.0*5.0 green = 100.0
#7 Found : 20.0*0.0 red + 5.0*12.0 blue + 10.0*4.0 green = 100.0
#8 Found : 20.0*0.0 red + 5.0*14.0 blue + 10.0*3.0 green = 100.0
#9 Found : 20.0*0.0 red + 5.0*16.0 blue + 10.0*2.0 green = 100.0
#10 Found : 20.0*0.0 red + 5.0*18.0 blue + 10.0*1.0 green = 100.0
#11 Found : 20.0*0.0 red + 5.0*20.0 blue + 10.0*0.0 green = 100.0
#12 Found : 20.0*1.0 red + 5.0*0.0 blue + 10.0*8.0 green = 100.0
#13 Found : 20.0*1.0 red + 5.0*2.0 blue + 10.0*7.0 green = 100.0
#14 Found : 20.0*1.0 red + 5.0*4.0 blue + 10.0*6.0 green = 100.0
#15 Found : 20.0*1.0 red + 5.0*6.0 blue + 10.0*5.0 green = 100.0
#16 Found : 20.0*1.0 red + 5.0*8.0 blue + 10.0*4.0 green = 100.0
#17 Found : 20.0*1.0 red + 5.0*10.0 blue + 10.0*3.0 green = 100.0
#18 Found : 20.0*1.0 red + 5.0*12.0 blue + 10.0*2.0 green = 100.0
#19 Found : 20.0*1.0 red + 5.0*14.0 blue + 10.0*1.0 green = 100.0
#20 Found : 20.0*1.0 red + 5.0*16.0 blue + 10.0*0.0 green = 100.0
#21 Found : 20.0*2.0 red + 5.0*0.0 blue + 10.0*6.0 green = 100.0
#22 Found : 20.0*2.0 red + 5.0*2.0 blue + 10.0*5.0 green = 100.0
#23 Found : 20.0*2.0 red + 5.0*4.0 blue + 10.0*4.0 green = 100.0
#24 Found : 20.0*2.0 red + 5.0*6.0 blue + 10.0*3.0 green = 100.0
#25 Found : 20.0*2.0 red + 5.0*8.0 blue + 10.0*2.0 green = 100.0
#26 Found : 20.0*2.0 red + 5.0*10.0 blue + 10.0*1.0 green = 100.0
#27 Found : 20.0*2.0 red + 5.0*12.0 blue + 10.0*0.0 green = 100.0
#28 Found : 20.0*3.0 red + 5.0*0.0 blue + 10.0*4.0 green = 100.0
#29 Found : 20.0*3.0 red + 5.0*2.0 blue + 10.0*3.0 green = 100.0
#30 Found : 20.0*3.0 red + 5.0*4.0 blue + 10.0*2.0 green = 100.0
#31 Found : 20.0*3.0 red + 5.0*6.0 blue + 10.0*1.0 green = 100.0
#32 Found : 20.0*3.0 red + 5.0*8.0 blue + 10.0*0.0 green = 100.0
#33 Found : 20.0*4.0 red + 5.0*0.0 blue + 10.0*2.0 green = 100.0
#34 Found : 20.0*4.0 red + 5.0*2.0 blue + 10.0*1.0 green = 100.0
#35 Found : 20.0*4.0 red + 5.0*4.0 blue + 10.0*0.0 green = 100.0
#36 Found : 20.0*5.0 red + 5.0*0.0 blue + 10.0*0.0 green = 100.0
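The list above can be reproduced with a brute-force search. The following is only a minimal sketch, assuming the quantities are non-negative integers; the function name `enumerate_solutions` is made up for illustration and is not the code that produced the output above.

```python
from itertools import product

def enumerate_solutions(values, total):
    """Return every tuple of non-negative integer quantities whose
    value-weighted sum equals total."""
    # A quantity can never exceed total // value, which bounds the search.
    ranges = [range(total // v + 1) for v in values]
    return [q for q in product(*ranges)
            if sum(v * n for v, n in zip(values, q)) == total]

# Values in the order red, blue, green, taken from the table above.
solutions = enumerate_solutions([20, 5, 10], 100)
print(len(solutions))            # 36
print((2, 8, 2) in solutions)    # True: the real quantities are among them
```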

As you can see, the possibilities include the correct answer but also many others. Now say I add one more red (so the total red is 3 and the total becomes 120); I then have 49 results, but some of the results in the second set are not likely if you factor in the relationship with the first result set. I assume that as I get more data, I can more accurately remove the results that don't work.
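One possible reading of "factoring in the relationship" between the two result sets is sketched below. It assumes the second observation is the first plus exactly one red item, so the total rises from 100 to 120; that assumption and the helper name `enumerate_solutions` are illustrative only.

```python
from itertools import product

def enumerate_solutions(values, total):
    """All non-negative integer quantity tuples with value-weighted sum == total."""
    ranges = [range(total // v + 1) for v in values]
    return [q for q in product(*ranges)
            if sum(v * n for v, n in zip(values, q)) == total]

values = [20, 5, 10]                               # red, blue, green
first = set(enumerate_solutions(values, 100))
second = enumerate_solutions(values, 120)          # one extra red adds 20

# Assumption: a second-set candidate (r, b, g) is plausible only if
# (r - 1, b, g) was a candidate for the first observation.
consistent = [(r, b, g) for (r, b, g) in second if (r - 1, b, g) in first]
print(len(second), len(consistent))                # 49 36
```

Under this assumption the filter discards 13 of the 49 second-set candidates (those with no red at all), but every one of the 36 first-set candidates still has a counterpart, because the second total follows directly from the first.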

I'm trying to figure out whether there is any research or a standard approach for narrowing the results down to something more meaningful. I am not 100% sure, but I thought Google might be an example of this, as each query is run not only against the data but also against your history. (I have a website that is ranked very low; after I clicked on it and then searched for it again, it always came up on top, but when I search on my friend's computer the same site shows up at the bottom.) I thought that, the way Google builds a relationship between our multiple search queries, I could use a similar approach to remove the incorrect results from my data above.

Sorry for any confusion. I'm a bit new to algorithms and am having trouble explaining this. If it doesn't make sense, please let me know.

Thanks in advance!


If I got this right, you are solving equations like this one:

R*r + G*g + B*b = 100

for given integer values of R, G, B, with the constraint that r, g, b are also integers.

Since you have only one equation and three variables, you get a solution space instead of a single solution, and you now want to apply some algorithm to pick the correct or best one.

You also seem to have values r0, g0, b0, which are likely values for r, g and b?

What you need to come up with is a fitness function which tells you how good or bad your candidate solution is.

One example could be (lower values mean a better solution):

(r-r0)^2 + (g-g0)^2 + (b-b0)^2

This basically says that a solution is better the closer it is to the likely values.
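As a rough illustration of the squared-distance fitness above, the sketch below scores every candidate of the example equation. The likely values `(2, 2, 7)` are a made-up guess for (r0, g0, b0), and the brute-force candidate list is an assumption about how the solution space is obtained.

```python
from itertools import product

def fitness(candidate, likely):
    """Squared distance to the likely values; lower means better."""
    return sum((x - x0) ** 2 for x, x0 in zip(candidate, likely))

R, G, B, total = 20, 10, 5, 100      # values of red, green, blue
candidates = [(r, g, b)
              for r, g, b in product(range(total // R + 1),
                                     range(total // G + 1),
                                     range(total // B + 1))
              if R * r + G * g + B * b == total]

likely = (2, 2, 7)                   # hypothetical r0, g0, b0
ranked = sorted(candidates, key=lambda s: fitness(s, likely))
for s in ranked[:3]:
    print(s, fitness(s, likely))
# (2, 2, 8) 1
# (2, 3, 6) 2
# (3, 1, 6) 3
```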

A variant could be

(r-r0)^2 + (g-g0)^2 + (b-b0)^2 + c*C

where C is a constant chosen by you and c is the number of values that differ from your likely solution. This gives a better (lower) fitness to a candidate that changes only one value than to one that changes two or three values.

Once you have a fitness function, pick the solution with the lowest fitness.
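The penalized variant and the selection step might look like the sketch below. The likely values `(2, 2, 5)` and the penalty `C = 10` are hypothetical, chosen only so that the penalty changes which candidate looks better.

```python
from itertools import product

def penalized_fitness(candidate, likely, C):
    """Squared distance plus C for every value that differs from the likely one."""
    squared = sum((x - x0) ** 2 for x, x0 in zip(candidate, likely))
    c = sum(1 for x, x0 in zip(candidate, likely) if x != x0)  # changed values
    return squared + c * C

R, G, B, total = 20, 10, 5, 100      # values of red, green, blue
candidates = [(r, g, b)
              for r, g, b in product(range(total // R + 1),
                                     range(total // G + 1),
                                     range(total // B + 1))
              if R * r + G * g + B * b == total]

likely = (2, 2, 5)                   # hypothetical r0, g0, b0
C = 10                               # hypothetical penalty per changed value

print(penalized_fitness((2, 3, 6), likely, C))   # two changes: 2 + 2*10 = 22
print(penalized_fitness((2, 2, 8), likely, C))   # one change:  9 + 1*10 = 19

# Pick the candidate with the lowest fitness.
best = min(candidates, key=lambda s: penalized_fitness(s, likely, C))
print(best)                                      # (2, 2, 8)
```

With C = 0 the candidate (2, 3, 6) would score 2 against 9 for (2, 2, 8); the penalty is what tips the choice toward the candidate that changes a single value.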


The problem is called a linear Diophantine equation. You can find further information here.
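One standard fact about linear Diophantine equations is that an integer solution exists exactly when the greatest common divisor of the coefficients divides the total. The sketch below checks that condition for the example values; the function name is made up for illustration. Note that the test allows negative integers, so for non-negative quantities it is only a necessary condition.

```python
from functools import reduce
from math import gcd

def has_integer_solution(coefficients, total):
    """True when gcd(coefficients) divides total, i.e. when
    sum(a_i * x_i) == total has a solution in (possibly negative) integers."""
    return total % reduce(gcd, coefficients) == 0

print(has_integer_solution([20, 5, 10], 100))   # True:  gcd is 5, 5 divides 100
print(has_integer_solution([20, 5, 10], 103))   # False: 5 does not divide 103
```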
