I have an algorithmic problem which can be reduced to this task:
Suppose we have a list of n
diseases and m
symptoms.
d
and symptom s
, we have one of three options:
- the symptom is positively correlated with the disease:
s => d
- the symptom is negatively correlated with the disease:
s => ~d
- the symptom is uncorrelated with the disease
The goal of the algorithm is to create a list of yes/no questions regarding symptoms (or even better - a binary tree of questions), which can deduce the exact disease according to the symptoms.
Any references to specific algorithms, relevant software tools and even dom开发者_StackOverflowain-specific jargon would be very appreciated.
You case use a decision tree : http://en.wikipedia.org/wiki/Decision_tree_learning
Basically finding the optimum tree (ie which will minimize the average number of questions asked before you can identify the desease) is NP-hard.
You can can use a greedy algorithm and then try to optimise it (if you need it).
At each step you want to reduce at much as possible the number of deceases that are still "possible".
You are at the top of your tree and so you have three possible options for a given s
, count the number of diseases in each option : pc
nc
uc
.
On one side you have pc
on the other nc
and the uc
you can't say anything (you can look at two level of your tree to have some info but for now we don't do that).
So worst case scenario, you have pc
/ nc + uc
or pc + uc
/ nc
, choose the s
that minimize worst case (ie : lot of one side, only a few on the other).
You need to minimize abs( pc - (nc + uc)) + abs ( (pc+uc) - nc)
.
You now have your s
for your first question and you can iteratively build your tree.
Is your domain really this 'binary' or in fact, would you more likely want to use the correlation coefficient for each symptom/disease pair as a numeric value? This would allow strong correlations to influence the result more than weak correlations.
If so then you might want to look at Support Vector Machines that analyze data and recognize patterns.
The problem is very similar to the bacteria / antibiotic question of Mycin, a forerunner to more generalized rule-based expert systems technology of the 1980s. There were other medical diagnostic programs developed with the resulting tools.
http://en.wikipedia.org/wiki/Mycin
If you just need a reference, take a look at the Russel & Norvig book. I don't have my copy at hand right now, but I can update this answer with some chapter suggestions tomorrow.
If each disease only has a few symptoms, then you can use graphical models to model the probabilities.
http://en.wikipedia.org/wiki/Graphical_model
http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html
But I don't know if you can use a graphical model to create a tree of questions.
精彩评论