A/B Testing

We randomly assigned teachers to "treatment" and "control" conditions and administered pre- and post-surveys to measure the effects of a teacher learning intervention on their vision and practices. I led the survey effort as part of a $3M grant with colleagues Bill Penuel (University of Colorado) and Abe Lo (BSCS Science Learning).

Analytic Questions



A/B Design

For this study, I led data gathering and analysis using Qualtrics; I cleaned the exports in Excel and imported them into Power BI (web) to build interactive analyses to share with my 12-person team. Specifically, they wanted to know how TREATMENT (T) and CONTROL (C) teachers answered identical PRE and POST items about culture, vision for instruction, and instructional practices. The battery also contained "backward" (reverse-worded) items, which I recoded in Excel using nested IF formulas and XLOOKUP so the recoding could be checked carefully. These survey items underwent IRT and CFA analyses in 2018 that our team has not yet published.
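As an illustration, the reverse-coding logic amounts to simple arithmetic; here is a hypothetical Python sketch of the same step (assuming a 1-5 Likert scale, not the actual Excel formulas used in the study):

```python
# Hypothetical sketch of reverse-coding a "backward" item: on a 1-5 Likert
# scale, the recoded value is (min + max) - response, so 1<->5, 2<->4, 3 stays.

LIKERT_MIN, LIKERT_MAX = 1, 5

def recode_backward(response: int) -> int:
    """Flip a reverse-worded Likert response."""
    return (LIKERT_MIN + LIKERT_MAX) - response

# Example: recode a column of hypothetical backward-item responses
responses = [1, 2, 3, 4, 5]
print([recode_backward(r) for r in responses])  # [5, 4, 3, 2, 1]
```

The same mapping could equally be expressed as an Excel lookup table, which makes spot-checking individual cells straightforward.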

Comparison 

Representations in Power BI & Tableau

Factor Analysis & ANCOVA

This item-by-item interpretation was confusing, and we lost too many degrees of freedom with 45 question stems and only 50 teachers who had both pre and post data.

We moved on to a fixed-factor ANCOVA comparing post results by teacher group, holding pre results as the covariate. Because our sample was small, we used PCA to build scales: new variables formed from the sums of items selected on the basis of theory and factor loadings. We tested the resulting scale for reliability and found Cronbach's alpha > 0.70.
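The reliability check can be sketched as a generic Cronbach's alpha computation; the data below are toy numbers for illustration, not the study's items:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the scale sum
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 6 respondents x 3 correlated items (illustrative only)
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
    [4, 4, 5],
])
print(round(cronbach_alpha(scores), 3))  # 0.951
```

Values above 0.70 are the conventional threshold for acceptable internal consistency.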


We found a few items that theoretically aligned with our project and held together empirically, and we built a scale from them:

-- Students are using science and engineering knowledge and practices to promote justice in their communities.

-- Students share their interests with me (through surveys, exit tickets, journal entries, etc) to use in planning instruction.

-- Students are learning that science is a cultural activity, influenced by the time, place, and who the scientists are.

-- Students’ ideas and questions guide and organize my instruction.

-- Students continually revise their models, explanations, and claims during a unit.


For this scale of questions, the treatment group increased almost two points on average (M = 1.947, SD = 2.20, n = 19) compared to a slight decrease for control teachers (M = -0.067, SD = 2.41, n = 30). Adjusting for pre-test scores, the group difference was statistically significant, F(1, 47) = 8.703, p = .005, partial eta-squared = .156, Cohen's f = .429.
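An ANCOVA of this form is equivalent to regressing post scores on the pre-score covariate plus a group indicator, then testing the group term. A minimal sketch on simulated data (not the study's data; the effect sizes below are invented to mimic the pattern described above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre/post scale sums: treatment teachers gain ~2 points,
# control teachers stay roughly flat. Illustrative only.
n_t, n_c = 19, 30
pre = rng.normal(15, 3, n_t + n_c)
group = np.r_[np.ones(n_t), np.zeros(n_c)]          # 1 = treatment
post = pre + 2.0 * group + rng.normal(0, 2, n_t + n_c)

# ANCOVA as regression: post ~ intercept + pre (covariate) + group
X_full = np.column_stack([np.ones_like(pre), pre, group])
X_red = np.column_stack([np.ones_like(pre), pre])   # model without group

def rss(X, y):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rss_full, rss_red = rss(X_full, post), rss(X_red, post)
df_resid = len(post) - X_full.shape[1]
F = (rss_red - rss_full) / (rss_full / df_resid)    # partial F for group
partial_eta_sq = (rss_red - rss_full) / rss_red
print(f"F(1, {df_resid}) = {F:.2f}, partial eta^2 = {partial_eta_sq:.3f}")
```

The partial F compares the model with and without the group term, which is exactly the ANCOVA test of a treatment effect after adjusting for pre scores.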

t-tests

One thing this analysis lacked was interpretability. My team had a hard time understanding the methods of ANCOVA and how a covariate works with S-pooled (who doesn't!?!), so I consulted experts outside our team to evaluate the validity of using a simple GAIN score (the post-test score sum on each validated scale minus the corresponding pre-test score sum). In some respects, this is actually a better measure because it does not assume perfect reliability (read more here). For me, the more important reason to use gain scores was interpretability: it would be much easier to explain to others HOW one result differed from another.

The t-tests did not flag any scale as significant that the ANCOVAs had not (as expected), and it was much more enjoyable to walk our 10-person team through the analysis.
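The gain-score comparison reduces to a two-sample t-test with a pooled standard deviation (the S-pooled mentioned above). A hypothetical sketch on simulated gains, with means and SDs echoing (but not reproducing) the reported values:

```python
import numpy as np

def pooled_t(a: np.ndarray, b: np.ndarray):
    """Student's two-sample t statistic (pooled SD) and degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical gain scores (post minus pre), illustrative only
rng = np.random.default_rng(1)
gain_treatment = rng.normal(1.95, 2.2, 19)
gain_control = rng.normal(-0.07, 2.4, 30)

t, df = pooled_t(gain_treatment, gain_control)
print(f"t = {t:.2f}, df = {df}")
```

Because each teacher contributes a single gain score, the comparison is just "did one group's average change exceed the other's," which is far easier to narrate than a covariate adjustment.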

Conclusions

Our pre/post test model with A/B testing was a success: we determined that the experimental project had an impact on teacher self-reported practice, on the order of 2 points on a 5-point scale of teacher beliefs. By Cohen's conventions this is a large effect (f = .429).


This difference is attributable to the course because teachers were randomly assigned to it, satisfying the assumptions of time-dependence and independence (Robinson, 1970).