Making research comparisons that are theoretically important.
Why should we care whether "baseline" is a good comparison?
This post is about research methods. Before that idea makes you gag and delete it, let me explain more. It is not about statistics. It's not about technical issues. It is about the logic of research.
It is about one way research could be done better, in my view. It's about thinking clearly about research. It's about how to understand why there are shortcomings in lots of studies. It is about how sensible people who read research can make better sense of what they are reading. So, please stick with me.
An introductory story
I remember a conversation with a smart, admirable teacher in Illinois1 with whom I was hoping to work on a study of a behavior management intervention. Before the school year began, we were conferring about the possibility of conducting a study in his classroom.
The teacher and I discussed the plan for the study. I explained that I hoped to implement a variant of the Contingencies for Learning Academic and Social Skills (CLASS) program developed in the 1970s and later described by Hops et al. (1978). The teacher said that he could provide a great baseline for this intervention. He could simply let the kids run wild for baseline observations during the first weeks of the school year, then we could implement the "intervention," and we would have a great study.
Whoa! Wait! Why would I want to compare CLASS to a classroom run wild? My goal wasn't to prove that something like "CLASS" was better than "Run Wild." I wanted to learn about how one version of classroom management compared to another method of classroom management.
Intervention research, the type of research in which many special educators are interested, requires comparisons. There are many possible comparisons:
Does Dreikurs's model cause different outcomes than some other discipline model on, say, improving students' interpersonal relations?
Is Gillingham-Stillman more effective than Reading Recovery on reading accuracy and fluency?
Do teacher-led discussion groups focused on effective practices lead participating teachers to use more effective procedures than professional development using video-based feedback?
So, if you follow my drift, I didn't want to compare Run Wild to a structured intervention. That's sort of...duh. I wanted to compare a specific structured intervention to a different structured intervention. I wanted to compare two interventions, one that was predicated on Theory A and another on Theory B.
In educational research, the usual question is whether one condition has different effects than another condition. This is true whether the conditions are compared using single-subject or group-contrast methods.
That is, a study that contrasts (a) groups of students who get one curriculum with those who get another curriculum, or teachers who get one sort of professional development with those who get another, and a study in which (b) individual students, teachers, or classrooms get one condition (i.e., "baseline") for a while and then get another condition ("treatment") for a period of time are both contrasting conditions. Do outcomes differ depending on conditions?
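For readers who like to see the logic in miniature, here is a small sketch, in Python with entirely invented numbers, of how both designs reduce to the same condition-versus-condition question. Nothing here comes from an actual study.

```python
# A minimal sketch with hypothetical numbers: both designs boil down
# to the same question -- do outcomes differ depending on conditions?
import statistics

# (a) Group-contrast design: one score per student, grouped by condition.
curriculum_a = [72, 68, 75, 70, 74]   # invented post-test scores
curriculum_b = [78, 81, 76, 80, 79]
print("Group contrast (B - A):",
      statistics.mean(curriculum_b) - statistics.mean(curriculum_a))

# (b) Single-subject design: one classroom's scores across phases.
baseline_phase  = [40, 42, 38, 41]    # invented session-by-session scores
treatment_phase = [55, 58, 60, 57]
print("Phase contrast (treatment - baseline):",
      statistics.mean(treatment_phase) - statistics.mean(baseline_phase))
```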
So, I'm arguing that it is important to focus on the comparison between or among conditions. It's less valuable to conduct a horse race (does the XYZ curriculum produce higher scores than the MNO curriculum?) than to conduct a study that examines theoretically important questions (e.g., does a curriculum that requires teachers to follow scripts produce better outcomes than one that presents comparable content but allows teachers to free-lance lessons?).
Now, I must admit that I was a party to studies (e.g., Lloyd et al., 1980, 1981) in which one essentially favored level of the independent variable was tested against another that was not favored. Does DI produce better tomatoes than the weeds? The "control group" was hobbled. What I am saying is that such an approach shouldn't be the norm.
The big idea is whether X works better than M. What's the difference between outcomes for Method O versus Method R? Does Method F produce more beneficial outcomes on theoretically important measures than does Method P?
Often these days, one sees studies comparing some researchers' newest, cool, nifty intervention to business as usual. That is, BAU may be one of those methods, X, M, O, R, etc., but including BAU creates a modest comparison, at best. It would be more powerful if Method BAU differed from the researchers' New-&-Groovy method on specific factors that can be objectively assessed: opportunities to respond, reinforcement procedures, coaching practices, etc.
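To make "objectively assessed" concrete, here is a hypothetical sketch of what such a fidelity check might look like; every factor name and number below is invented for illustration, not drawn from any real study.

```python
# Hypothetical fidelity check: specify, in measurable terms, how the two
# conditions are supposed to differ, then record what observers actually saw.
planned = {
    "opportunities_to_respond_per_min": {"BAU": 1.0, "New-&-Groovy": 4.0},
    "praise_statements_per_10_min":     {"BAU": 2.0, "New-&-Groovy": 8.0},
    "coaching_sessions_per_month":      {"BAU": 0.0, "New-&-Groovy": 2.0},
}
observed = {
    "opportunities_to_respond_per_min": {"BAU": 1.2, "New-&-Groovy": 3.6},
    "praise_statements_per_10_min":     {"BAU": 2.5, "New-&-Groovy": 7.1},
    "coaching_sessions_per_month":      {"BAU": 0.0, "New-&-Groovy": 2.0},
}

for factor, plan in planned.items():
    print(factor)
    for condition, value in plan.items():
        print(f"  {condition}: planned {value}, observed {observed[factor][condition]}")
```

If the observed values for the two conditions barely differ, the study is not really comparing the methods it claims to compare.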
To be sure, it's important to compare packages of practices to other packages. It should be clear, however, how one package differs from another.
What the heck is "baseline" or "BAU"? Is it just "Run Wild"? Is that what researchers will provide as a comparison for their sparkly new method? Do some teachers or students get A and some get B? Is the comparison between schools that are getting help from researchers and schools doing without that help? That is, is the comparison simply between students and teachers who get special help vs. those who do not get any special help?
Comparisons of BAU and Super-sped methods are interesting, but those efforts don't advance knowledge nearly as much as a conceptually driven comparison. Also, researchers can compare three conditions: one could be the BAU; a second could be a conceptually important comparison; and a third could be that super-cool new intervention that the researcher is championing. Researchers might feel compelled to design studies that show significant results (can my horse beat that horse on his way to the glue factory?). A better alternative is, again, to make sensible comparisons.
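For the statistically inclined, a three-condition design like that might be analyzed along these lines. This is only a sketch with invented scores, and it assumes SciPy is available.

```python
# Sketch of a three-condition comparison with invented outcome scores.
from scipy.stats import f_oneway

bau        = [55, 60, 58, 52, 57]   # business as usual
comparison = [65, 70, 68, 66, 71]   # conceptually important alternative
championed = [72, 75, 70, 74, 73]   # the New-&-Groovy intervention

f_stat, p_value = f_oneway(bau, comparison, championed)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# The theoretically interesting contrast is comparison vs. championed --
# two well-specified conditions -- not just either one vs. BAU.
```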
To be sure, a study can describe how an intervention goes (i.e., pre- vs. post-test with no control condition), but most of us are interested in whether an intervention produces better outcomes than another intervention. Merely describing how an intervention goes is really quite uninformative; it is a descriptive study, not a comparative study. How would the learners have performed if they hadn't gotten the intervention? In an uncontrolled pre-test-post-test study, we do not have a comparison.
When the control is BAU or baseline, we have a little more of a comparison, but not much. We often don't know to what the New-&-Groovy method is being compared. Is the researchers' method beating The Old Gray Mare or beating Seabiscuit?
Why would we care whether, in a two-horse race, The Old Gray Mare finishes second to Seabiscuit? I don't think that A = Run Wild vs. B = some "science-based" intervention is a very informative comparison.
So, I’m concerned about what researchers compare in studies. What’s condition A, and what’s condition B? I would like to have those conditions differ in objectively assessed and theoretically important ways.
In my view, this is a variation on the major concern that we need to ask and examine important research questions. When we report studies, we should be clear about what we're studying. When we read intervention studies, we should know what's being compared to what.
Hops, H., Walker, H. M., Fleischman, D. H., Nagoshi, J. T., Omura, R. T., Skindrud, K., & Taylor, J. (1978). CLASS: A standardized in-class program for acting-out children: II. Field test evaluations. Journal of Educational Psychology, 70(4), 636.
Lloyd, J., Cullinan, D., Heins, E. D., & Epstein, M. H. (1980). Direct instruction: Effects on oral and written language comprehension. Learning Disability Quarterly, 3(4), 70-77.
Lloyd, J., Epstein, M. H., & Cullinan, D. (1981). Direct teaching for learning disabilities. In J. Gottlieb & S. S. Strichart (Eds.), Developmental theory and research in learning disabilities (pp. 278-309). University Park Press.
1. Some obfuscation here (e.g., changes of irrelevant facts) to protect people's identities.