I'm currently working on a project concerning segmentation of geographical regions based on the plants that grow in each, over multiple meaningful layers (that is to say, each segmentation layer has a meaning that is unique with respect to the other layers).
To do so, we're using logistic regression to go from a list of regions (each with the segment it belongs to in every layer, and the plants it contains) to a probability of a plant growing in each combination of segments. At the moment we are using SPSS, linked to a C# implementation of the segmentation.
So far, so good. The problem is, SPSS is slow as molasses on a cold day. For the full set (2500 plants and 565 regions), a single run would take about half a month. That's time we don't have, so for now we're using abbreviated data sets, but even that takes several hours.
We've looked into other libraries with logistic regression (specifically Accord.NET and Extreme Optimization), but neither has categorical logistic regression.
At this point I should probably specify what I mean by categorical logistic regression. Each row in the data set we feed the statistics engine has a variable for each layer, plus one for the plant we're interested in at the moment, and the values of the layer variables are treated as categories: 0 is not better or worse than 1, it's simply different. What we want out of the statistics engine is a value for each category of each layer variable (as well as an intercept, of course), so in a setup with one layer of 3 segments and another of 2 segments, we'd get 5 values plus the intercept.
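For illustration, this is roughly the shape of one input row (the class and property names are hypothetical placeholders, just to make the structure concrete):

```csharp
// Hypothetical shape of one row fed to the statistics engine: one categorical
// value per segmentation layer, plus the outcome for the plant currently
// being modelled.
public sealed class RegionRow
{
    public int Layer1Segment { get; set; }  // e.g. 0, 1 or 2 for a layer with 3 segments
    public int Layer2Segment { get; set; }  // e.g. 0 or 1 for a layer with 2 segments
    public bool PlantPresent { get; set; }  // does the plant of interest grow in this region?
}
```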
I should note that we've experimented with dummy/indicator variables, both in Accord.NET (where it had to be done outside the library) and in Extreme Optimization (which had some in-library support for it), but this did not produce the results we needed.
TL;DR
So, long story short, does anyone know of a good solution for categorical logistic regression in C#? This can be a class library, or simply an interface to plug into an external statistics engine, as long as it's stable and reasonably fast.
The standard approach to producing a logistic regression with categorical input variables is to transform the categorical variables into dummy variables. So, you should be able to use any of the logistic regression libraries that you've mentioned in your question, as long as you perform the appropriate transformation to the input data.
The mapping from one categorical variable with n categories to n-1 numeric dummy variables is called a contrast. This post has some further explanations of how contrasts are put together.
Note that the number of dummy variables is one less than the number of category values. If you use one dummy variable per category value, you'll find that the last dummy variable is not independent of the preceding ones, and fitting the regression model will produce errors (or meaningless coefficients).
So, to take the example of a model with an intercept, a 3-level categorical input variable and a 2-level categorical input variable, the number of coefficients will be 1 + (3 - 1) + (2 - 1) = 4.
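As a minimal sketch of that transformation in plain C# (independent of any particular library; the class and method names are only illustrative), each categorical value becomes n - 1 indicator columns relative to a reference level:

```csharp
using System.Linq;

public static class DummyCoding
{
    // Reference (dummy) coding: a categorical value with levelCount levels
    // becomes levelCount - 1 indicator columns. Level 0 acts as the reference
    // and is encoded as all zeros, so its effect is absorbed by the intercept.
    public static double[] Encode(int value, int levelCount)
    {
        var columns = new double[levelCount - 1];
        if (value > 0)
            columns[value - 1] = 1.0;
        return columns;
    }

    // Example for the model above: a 3-segment layer and a 2-segment layer.
    // A region in segment 2 of layer 1 and segment 1 of layer 2 becomes
    // {0, 1} followed by {1}, i.e. {0, 1, 1}, matching the
    // 1 + (3 - 1) + (2 - 1) = 4 coefficients once the intercept is counted.
    public static double[] EncodeRow(int layer1Segment, int layer2Segment)
    {
        return Encode(layer1Segment, 3)
            .Concat(Encode(layer2Segment, 2))
            .ToArray();
    }
}
```

The resulting numeric columns can then be fed to whichever logistic regression implementation you end up using.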
This post is long gone by now, but in case it helps someone else: you might want to check which type of computation SPSS is using to build the model. I'm wondering if something that takes this long to run is bogged down using an exact computation, similar to Fisher's exact test. The time these take grows rapidly as the category or record count grows. If 20% or more of your "cells" (unique combinations of categorical variables) have 5 or fewer records, however, you do need to use something like the exact method. Unless you've got your regions grouped somehow, it sounds like you may be down to that. SPSS may simply see the need and automatically invoke that approach. Something to check, anyway (a rough sketch of that cell-count check follows below).

Realistically, though, if you have sufficient data but it is broken into groups small enough to have 5 or fewer records in a single variable combination, that's a problem in itself. Should that be the case, you should probably see whether there are ways to consolidate and aggregate categories wherever possible. If you're using SAS, you'd mix and match variable combinations inside the LOGISTIC or GENMOD procs using the CONTRAST or EFFECT statements until you sifted it down to the impactful combinations. If you were using R, a simple technique is to build a nested model for each combination and compare their summary objects using ANOVA to see which additions add predictive power. If you MUST measure small quantities in many categories and you have access to SAS somewhere, you can specify the FIRTH option, which does a good job of (quickly) mimicking the Bayesian offsets one might employ to counter the bias inherent in measuring tiny proportions.

A good start, though, would be to simply see whether you can consolidate categories and make sure you're not stuck doing exact computations.
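As a rough illustration of that cell-count check in C# (plain LINQ over hypothetical layer fields; adapt to however your rows are actually stored):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CellCheck
{
    // Returns the fraction of "cells" (unique combinations of the categorical
    // layer values) that contain 5 or fewer records. If this is 0.20 or more,
    // an exact method, or consolidation of categories, is likely called for.
    public static double SparseCellFraction(IEnumerable<(int Layer1, int Layer2)> rows)
    {
        var cellSizes = rows
            .GroupBy(r => (r.Layer1, r.Layer2))
            .Select(g => g.Count())
            .ToList();

        int sparseCells = cellSizes.Count(size => size <= 5);
        return (double)sparseCells / cellSizes.Count;
    }
}
```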
Regarding dummy variables etc.: the other posters are correct. Many times (particularly in an academic setting) one level of the category will be given no dummy variable and will serve as the reference (i.e. its information is built into the intercept). There is also something called "effects" coding, which mimics a separate estimate for every category but is a little harder to wrap your head around. By the way, if you have 2 layers, one of which is populated in 3 categories and the other of which only has 2 categories with data, that sounds like 6 combinations to me; one is just empty. I'm probably just misinterpreting what you mean, though.
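For concreteness, a minimal sketch of effects (sum-to-zero) coding, again with purely illustrative names and not tied to any library:

```csharp
public static class EffectsCoding
{
    // Effects ("sum-to-zero") coding: still levelCount - 1 columns per
    // categorical variable, but the reference level (level 0 here) is coded
    // as -1 in every column instead of all zeros. Each coefficient is then a
    // deviation from the overall mean, and the reference level's effect is
    // the negative sum of the other coefficients, which is why this scheme
    // "mimics" a separate estimate for every category.
    public static double[] Encode(int value, int levelCount)
    {
        var columns = new double[levelCount - 1];
        if (value == 0)
        {
            for (int i = 0; i < columns.Length; i++)
                columns[i] = -1.0;
        }
        else
        {
            columns[value - 1] = 1.0;
        }
        return columns;
    }
}
```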
So, bottom line: 1) see if you are stuck doing an exact computation, and 2) try to consolidate into the few essential categories that actually have impact. You need that anyway if you're going to make meaningful statements about the various effects; it will make your model stronger and will likely get you to a point where you don't need an exact calculation anymore.
Those are my thoughts anyway, not having seen your data.