Wednesday, February 6, 2013

Please talk to me

I don't mean that to sound pathetic. But I'm just wrapping up — I hope I'm wrapping up — two projects where I, the statistician and data guru, was not included in all the initial data needs meetings. Both projects are taking me, and everyone else, vastly more time than should be necessary.

Project #1. Subject data was initially extracted and heavily modified. Then, in a separate request and under a separate contract, this data was matched to a secondary data source. No problem yet. Then additional subject data was requested. The number of subjects in this request differed from the original extract. Why? Could it be a change in the underlying data systems? Differing inclusion criteria? No one has the initial request, so it's not clear. Who has to dig through both files to discover the similarities and differences and come up with a hypothesis for the original request? Me.

Then additional data was requested of our secondary data source. The contract had expired. Sorry.
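For the curious, the digging in Project #1 boils down to something like the sketch below. It is only an illustration: the variable name subject_id and the six made-up IDs are my assumptions, not the real extracts.

```stata
* Hypothetical second extract (IDs invented for illustration)
clear
input subject_id
101
102
104
end
tempfile extract2
save `extract2'

* Hypothetical original extract
clear
input subject_id
101
102
103
end

* Match the two extracts on the assumed key, subject_id
merge 1:1 subject_id using `extract2'

* _merge == 1: original extract only; 2: second extract only; 3: in both
tabulate _merge
list subject_id _merge if _merge != 3
```

The subjects listed at the end are the ones whose inclusion someone has to explain.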

Project #2. I built an overall data tracking and management system for the project. I suggested a vendor we could work with to incorporate an interactive voice response (IVR) system to automate phone calls and collect information each week. I knew the vendor used an Asterisk-based IVR platform, and as a Ruby programmer, I knew I would be able to understand and potentially even extend the system.

Then the details of the data exchange were set without me. Our system would export data to the vendor, and receive the result data, in Excel spreadsheets. This system could have been automated, but hey, at least it's not my time being wasted.

When asked to incorporate this new data, I discovered that neither set of data included the primary key that would allow us to link the IVR data to our original data. So someone has to spend stunningly unnecessary amounts of time after the fact to link the data. The two minutes it would have taken for my input (“Be sure to include the subject ID!”) have become hours of work.
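Here is roughly what those two minutes would have bought us. This is a sketch under assumptions: subject_id, calls_completed, and enrolled are hypothetical names, and the rows are invented.

```stata
* Hypothetical IVR results as they might come back from the vendor
clear
input subject_id calls_completed
101 3
102 5
103 4
end

* Check that the key exists and uniquely identifies each row
confirm variable subject_id
isid subject_id
tempfile ivr
save `ivr'

* Hypothetical tracking data from our own system
clear
input subject_id enrolled
101 1
102 1
103 1
end
isid subject_id

* With the key in both files, linking is one line;
* assert(match) stops with an error if any subject fails to link
merge 1:1 subject_id using `ivr', assert(match) nogenerate
list
```

Without subject_id in either file, that last command has nothing to match on.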

To summarize, include your data people early and often in your conversations. If you're talking about data — and if you're talking about your research plan, your metrics, your deliverables, you probably are — they won't see it as a waste of time. Surprising them later will waste their time.

<Sigh>, thanks for listening. Back to project #1.

Wednesday, January 2, 2013

Sensible Binomials

Sensible? Binomials? Could this post be any less sexy?

The truth is, many data analysts (yes, I'm looking at you) are making their jobs more complicated by the way they code their binomials. Tell me, does this extract from the codebook (you have a codebook or data dictionary, right?) look familiar?


Did the participant experience a concussion (answered “yes” to any of the symptoms questions, HEAD_MEMORY through HEAD_NOISE)?

  • 1 - Yes
  • 2 - No

Close your eyes, spin around in a circle, then tell me this: which value is “yes” and which value is “no”? If you work entirely alone and you're incredibly consistent, you may well remember. But if you're like me, and you work on multiple projects with multiple collaborators, you're probably looking the coding up. Several times. Per variable.

Then what happens when you're actually trying to analyze the data? Let's say you're trying to find variables that predict concussion, and you're testing gender (GENDER) and the field position being played at the time of the injury (POSITION_INJ). Most simply, your logistic regression model would look something like this:

xi: logistic concussion i.gender i.position_inj

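(In Stata, anyway.) You probably see the issue. You're not modeling the odds of a concussion, because your outcome variable isn't coded for presence versus absence. Stata's logistic treats 0 as the negative outcome and any other nonmissing value as positive, so with 1/2 coding every participant looks like a case and Stata refuses to fit the model because the outcome does not vary. Software that instead models the higher of the two values would give you the odds of no concussion. Either way, you'll have to recode CONCUSSION. Probably after consulting your codebook again, to make sure you're doing it right.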

replace concussion = abs(concussion - 2)
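If you do end up recoding, it's worth checking the result against the original. A minimal sketch, using four invented participants and a hypothetical copy called concussion_orig:

```stata
* Hypothetical data coded as in the codebook: 1 = Yes, 2 = No
clear
input subject_id concussion
1 1
2 2
3 2
4 1
end

* Keep the codebook-coded original for comparison
generate concussion_orig = concussion

* The one-line recode: 1 stays 1 (yes), 2 becomes 0 (no)
replace concussion = abs(concussion - 2)

* Stop with an error if any value was mapped incorrectly
assert concussion == 1 if concussion_orig == 1
assert concussion == 0 if concussion_orig == 2

* Cross-tabulate old against new coding as a visual check
tabulate concussion_orig concussion
```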

The recode works, but your working data no longer conforms to your codebook. You've wasted an extra couple of minutes here and there looking up CONCUSSION's coding. And you're reasonably proud of that one-line replace statement. With a little unnecessary work, the model is set up properly.

What would have been easier? Always use 0/1 coding for binomials, where 0 denotes absence and 1 denotes presence. Your logistic (negative binomial, etc.) models will handle the variables properly. And you'll always remember that 1 means a concussion, 0 means no concussion.
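A small sketch of what 0/1 coding looks like in practice, again with invented data. The label name noyes is my own choice, not a convention you have to follow.

```stata
* Hypothetical data already coded 0 = No, 1 = Yes
clear
input subject_id concussion
1 1
2 0
3 0
4 1
end

* Value labels keep the output readable without changing the stored 0/1
label define noyes 0 "No" 1 "Yes"
label values concussion noyes
tabulate concussion

* With 0/1 coding, the mean is the proportion of participants with a concussion
summarize concussion
display r(mean)
```

The proportion-from-the-mean trick is a side benefit of 0/1 coding; it does not work with 1/2 coding.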

I'll do you one better. GENDER only has two values. Why not simplify its coding, too?


Participant gender.

  • 0 - Male
  • 1 - Female

Rename the variable FEMALE to match its coding, and your logistic regression command looks like:

xi: logistic concussion female i.position_inj
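If your existing GENDER variable uses some other coding, creating the indicator might look like the sketch below. I'm assuming a 1 = Male, 2 = Female coding purely for illustration; check your own codebook.

```stata
* Hypothetical data: gender assumed coded 1 = Male, 2 = Female, with one missing value
clear
input subject_id gender
1 1
2 2
3 .
4 2
end

* 1 if female, 0 if male; stays missing where gender is missing
generate female = (gender == 2) if !missing(gender)

label define noyes 0 "No" 1 "Yes"
label values female noyes

* Check the new indicator against the original, including missing values
tabulate gender female, missing
```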

The gender recode also frees you up for some cool boolean coding. Let's say that the PREGNANCY field is missing sometimes. Well, we know the value of PREGNANCY for about half our population:

replace pregnancy = 0 if !female

Code that reads like English. Oooh, sexy.
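A slightly more defensive version of the PREGNANCY fill, as a sketch with invented data: it only touches values that are actually missing, and only for participants known to be male.

```stata
* Hypothetical data: female coded 0/1, pregnancy missing for some participants
clear
input subject_id female pregnancy
1 0 .
2 1 1
3 1 .
4 0 .
end

* Fill in 0 only where pregnancy is missing and the participant is known to be male
replace pregnancy = 0 if female == 0 & missing(pregnancy)

* Every male participant should now have pregnancy == 0
assert pregnancy == 0 if female == 0
list
```

Participant 3 keeps her missing value, because we genuinely don't know it.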