I have a database with many CVs, including structured data of the gender, age, address, number of years of education, and many other parameters of each person.
For about 10% of the sample, I also have additional data about a certain action they've made at some point in time. For instance, that Jane took a home loan in July 1998 or that John started pilot training in Jan. 2007 and got his license in Dec. 2007.
I need an algorithm that will give, for each of the actions, the probability that it will happen for each person in future time in开发者_如何学JAVAcrements. For instance, that the chance of Bill taking a home loan is 2% in 2011, 3.5% in 2012, etc.
How should I approach this? Regression analysis? SVM? Neural net? Something else?
Is there perhaps even some standard tool/library that I can use with just the obvious customizations?
The probability that X happens given that Y happened is right out of Bayesian inference, I think.
Lou is right, this is the case for 'Bayesian Inference'.
The best tool/library to solve this is the R statistic programming language (r-project.org).
Take a look at the Bayesian Inference Libraries in R: http://cran.r-project.org/web/views/Bayesian.html
How many people are in the "10% of the sample"? If it's below 100 people or so, I would fear that the results of the analysis could not be significant. If it's 1000 or more people, the results will be quite good (rule of thumb).
I would fist export the data to R (r-project) and do some data cleaning necessary. Then find a person familiar with R and advanced statistics, he will be able to solve this very quickly. Or try yourself, but R takes some time in the beginning.
Concerning the tool/library choice, I suggest you give Weka a try. It's an open source tool for experimenting with data mining and machine learning. Weka has several tools for reading, processing and filtering your data, as well as prediction and classification tools.
However, you must have a strong foundation in the above mentioned fields in order to strive for a useful result.
精彩评论