Wednesday, January 2, 2013

Sensible Binomials

Sensible? Binomials? Could this post be any less sexy?

The truth is, many data analysts (yes, I'm looking at you) are making your jobs more complicated by the way you're coding your binomials. Tell me, does this extract from the codebook (you have a codebook or data dictionary, right?) look familiar?


Did the participant experience a concussion (answered “yes” to any of the symptoms questions, HEAD_MEMORY through HEAD_NOISE)?

  • 1 - Yes
  • 2 - No

Close your eyes, spin around in a circle, then tell me this: which value is “yes” and which value is “no”? If you work entirely alone and you're incredibly consistent, you may well remember. But if you're like me, and you work on multiple projects with multiple collaborators, you're probably looking the coding up. Several times. Per variable.

Then what happens when you're actually trying to analyze the data? Let's say you're trying to find variables that predict concussion, and you're testing gender (GENDER) and the field position being played at the time of the injury (POSITION_INJ). Most simply, your logistic regression model would look something like this:

xi: logistic concussion i.gender i.position_inj

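(In Stata, anyway.) You probably see the issue. You're not modeling the odds of a concussion, because your outcome variable isn't coded for presence versus absence. Software that simply models the higher value would give you the odds of moving concussion from 1 to 2, which is the odds of not having a concussion. Stata's logistic is stricter: it treats 0 as the negative outcome and every other nonmissing value as positive, so with 1/2 coding the outcome never varies and the model won't fit at all. Either way, you'll have to recode CONCUSSION. Probably after consulting your codebook again, to make sure you're doing it right.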

replace concussion = abs(concussion - 2)

Now the model is set up properly, but look at what it cost you. Your working data no longer conforms to your codebook. You've wasted an extra couple of minutes here and there looking up CONCUSSION's coding. And you're reasonably proud of that one-line replace statement, even though it was work you never needed to do.

What would have been easier? Always use 0/1 coding for binomials, where 0 denotes absence and 1 denotes presence. Your logistic (negative binomial, etc.) models will handle the variables properly. And you'll always remember that 1 means a concussion, 0 means no concussion.
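If you inherit a 1/2 variable anyway, here is a minimal sketch of a more defensive conversion than an in-place replace. It assumes CONCUSSION is coded 1 = Yes, 2 = No as in the codebook extract above; the names concussion01 and yesno are made up for illustration.

```stata
* Hypothetical sketch: assumes concussion is currently coded 1 = Yes, 2 = No
* Build a new 0/1 variable and leave the original untouched
recode concussion (1 = 1) (2 = 0), generate(concussion01)

* Attach value labels so tables and output document themselves
label define yesno 0 "No" 1 "Yes"
label values concussion01 yesno

* Cross-check old against new, then stop if any stray codes survived
tabulate concussion concussion01, missing
assert inlist(concussion01, 0, 1) | missing(concussion01)
```

Keeping the original variable means your working data still matches the codebook, and the assert fails loudly if a code other than 1 or 2 slipped through.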

I'll do you one better. GENDER only has two values. Why not simplify its coding, too?


Participant gender.

  • 0 - Male
  • 1 - Female

Name the variable after the value coded 1 (FEMALE), and your logistic regression command looks like:

xi: logistic concussion female i.position_inj
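The post doesn't say how GENDER was coded originally, so treat this as a sketch under an assumption: if it started life as 1 = Male, 2 = Female, one way to build the 0/1 indicator might look like the following. The label name female_lbl is hypothetical.

```stata
* Hypothetical sketch: assumes gender is currently coded 1 = Male, 2 = Female
* The if qualifier keeps missing gender as missing instead of turning it into 0
generate byte female = (gender == 2) if !missing(gender)

* Label the indicator so output shows words rather than 0 and 1
label define female_lbl 0 "Male" 1 "Female"
label values female female_lbl
label variable female "Participant is female"

* Confirm the mapping before using the new variable in a model
tabulate gender female, missing
```

Because female is already 0/1, it goes into the model as is, with no i. prefix.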

The gender recode also frees you up for some cool boolean coding. Let's say that the PREGNANCY field is missing sometimes. Well, we know the value of PREGNANCY for about half our population:

replace pregnancy = 0 if !female

Code that reads like English. Oooh, sexy.
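One caveat before you get carried away with boolean conditions: Stata writes logical negation as ! (or ~), and it treats any nonzero value as true, missing included. The sketch below, using the FEMALE and PREGNANCY variables from this post, shows where that bites and a more explicit form of the fill-in. Limiting the replace to missing values is my assumption about what you'd want, not something the codebook dictates.

```stata
* Missing values are nonzero, so a bare condition counts them as true
* This counts female == 1 and also rows where female is missing
count if female

* This counts only participants known to be female
count if female == 1

* Negation is safe here: !female is 1 only when female == 0
count if !female

* Fill in pregnancy only where it is missing and the participant is known to be male
replace pregnancy = 0 if female == 0 & missing(pregnancy)
```

The explicit comparisons read a little less like English, but they never sweep a missing value into the wrong group.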
