Replicating probit regression in SAS and R

I'm trying to replicate my SAS work in R, but I get slightly different results -- differences that can't be explained by rounding error.

Here's my SAS code:

proc qlim data=mydata;
   model y = x1 x2 x3/ discrete(d=probit);
   output out=outdata marginal;
   title "just ran QLIM model";
run;
quit;

And here's my R code:

mymodel <- glm(y ~ x1 + x2 + x3, family=binomial(link="probit"), data=mydata)

I'm not really sure why I'd get the different results, and would greatly appreciate an explanation.

EDIT Here's my data:

2.66  20  0  0
2.89  22  0  0
3.28  24  0  0
2.92  12  0  0
4.00  21  0  1
2.86  17  0  0
2.76  17  0  0
2.87  21  0  0
3.03  25  0  0
3.92  29  0  1
2.63  20  0  0
3.32  23  0  0
3.57  23  0  0
3.26  25  0  1
3.53  26  0  0
2.74  19  0  0
2.75  25  0  0
2.83  19  0  0
3.12  23  1  0
3.16  25  1  1
2.06  22  1  0
3.62  28  1  1
2.89  14  1  0
3.51  26  1  0
3.54  24  1  1
2.83  27  1  1
3.39  17  1  1
2.67  24  1  0
3.65  21  1  1
4.00  23  1  1
3.1   21  1  0
2.39  19  1  1

And here are my estimated coefficients (standard errors in parentheses):

SAS: -7.452320 (2.542536)
      1.625810 (0.693869)
      0.051729 (0.083891)
      1.426332 (0.595036)
R:   -7.25319  (2.50977)
      1.64888  (0.69427)
      0.03989  (0.07961)
      1.42490  (0.58347)


It is possibly in the contrast matrix used by default. R uses treatment contrasts while SAS uses its own. Look up contrasts and contr.SAS in the help. If you're using SAS contrasts a lot, you might want to just set the options accordingly:

options(contrasts=c("contr.SAS", "contr.poly"))

To get an idea of how this affects things, observe the difference between the treatment and SAS contrast matrices:

contr.treatment(4)
  2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1

contr.SAS(4)
  1 2 3
1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 0
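
If you'd rather not change the global option, contrasts can also be supplied to a single fit. A minimal sketch, assuming a hypothetical factor predictor f (contrasts only enter the design matrix through factor predictors; f is not part of the original model):

# Sketch only: f stands in for a hypothetical factor column in mydata
mymodel_sas <- glm(y ~ x1 + x2 + f,
                   family = binomial(link = "probit"),
                   data = mydata,
                   contrasts = list(f = contr.SAS))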


When I run it in R with your data and code I get answers (close to) what you show for the SAS results:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) -7.45231    2.57152  -2.898  0.00376 **
x1           1.62581    0.68973   2.357  0.01841 * 
x2           0.05173    0.08119   0.637  0.52406   
x3           1.42633    0.58695   2.430  0.01510 * 

The standard errors are off by a few percent, but that's less surprising.
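
For reference, here is a minimal sketch of how the posted data could be read in and refit. It assumes the four pasted columns are x1, x2, x3, y in that order (the post doesn't name them) and that they were saved to a whitespace-delimited file, here hypothetically called "mydata.txt":

# Column order x1, x2, x3, y is an assumption; adjust to match your data
mydata <- read.table("mydata.txt", col.names = c("x1", "x2", "x3", "y"))

mymodel <- glm(y ~ x1 + x2 + x3,
               family = binomial(link = "probit"),
               data = mydata)

summary(mymodel)$coefficients   # estimates and standard errors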

I also ran it in glmmADMB (available on R-forge), which is a completely different implementation, and got estimates slightly farther from, but standard errors closer to, SAS -- much smaller differences than you originally reported in any case.

library(glmmADMB)
> mm2 <- glmmadmb(y~x1+x2+x3,family="binomial",link="probit",data=mydata)
["estimated covariance may be non-positive-definite warnings"]
> summary(mm2)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -7.4519     2.5424   -2.93   0.0034 **
x1            1.6258     0.6939    2.34   0.0191 * 
x2            0.0517     0.0839    0.62   0.5375   
x3            1.4263     0.5950    2.40   0.0165 * 

What version of R were you using? (It's possible that something changed between versions, although glm is very stable code ...) Are you sure you didn't mess something up?

> sessionInfo()
R Under development (unstable) (2011-10-06 r57181)
Platform: i686-pc-linux-gnu (32-bit)

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] glmmADMB_0.6.4 


I'm an R newbie, but I have a suggestion.

Try running the probit with another R package, such as Zelig:

library(Zelig)  # not loaded in the original snippet

mymodel <- zelig(y ~ x1 + x2 + x3, model = "probit", data = mydata)
summary(mymodel)

Are the regression coefficients different in this model?


This is a great resource: http://sas-and-r.blogspot.com/


You should compare which package reports the higher log-likelihood. The numbers may differ simply because the termination criteria of the two algorithms are different. For example, most algorithms use the norm of the gradient as a stopping rule (e.g., stop when it falls below 0.0005), but each piece of software has its own default. Depending on where the optimizer stops, the variances of the estimates will obviously differ as well, since they are obtained by inverting the Hessian evaluated at the stopping point. To be 100% sure, check whether R or SAS reports the higher log-likelihood, or compute the log-likelihood by hand from the reported estimates. If you are required to report exactly the same values in R and SAS, just tighten the convergence criteria of both algorithms: set a very tight tolerance (say, below 0.00000005) in both, and the two programs should report the same values; see the sketch after the note below.

(Well, unless your likelihood has multiple maxima, which doesn't seem to be the problem here; in that case the final estimates would depend on your initial values.)
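
A minimal sketch of the R side of that check, using logLik() to extract the maximized log-likelihood and glm.control() to tighten the convergence tolerance (the SAS side would need the corresponding PROC QLIM convergence options):

# Compare maximized log-likelihoods: the fit with the larger value
# stopped closer to the optimum.
logLik(mymodel)

# Refit with a much tighter convergence tolerance and more iterations,
# then see whether the estimates and standard errors move.
mymodel_tight <- glm(y ~ x1 + x2 + x3,
                     family = binomial(link = "probit"),
                     data = mydata,
                     control = glm.control(epsilon = 1e-12, maxit = 100))
cbind(loose = coef(mymodel), tight = coef(mymodel_tight))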

