A recent study on UFLI is being celebrated for reporting an effect size of over 1 on early literacy. “In fact, I’ve never seen effect sizes this high from a study of a literacy program.” (Twitter)
I want to look at this study through the lens of what research owes teachers. Part of the larger narrative around literacy is that teachers have been out of touch with research, if not outright misled by teacher education institutions, so clearly reading research articles should not be only for academics.
A few disclaimers before I start in on my main argument. UFLI is a well-structured program, and when I listen to Holly Lane talk on podcasts, she has a lot of great things to say. However, when it comes to choosing a scripted program that we use with all students for 30 minutes a day, I think we need to ask some careful questions that this paper leaves unaddressed.
We need a generous description of the control condition, here termed BAU (Business as Usual):
“UFLI Foundations was the foundational early literacy instructional program for the study and was taught for 30 minutes each day for the course of the school year. The district used a comprehensive reading program intended to support all areas of reading as the core instructional program, and students in the BAU control condition received the reading program without UFLI Foundations.”
That’s all we get for a description of the control condition. Information about the control condition is especially important since the control group is a cohort from 2020-2021 being compared to the following year. Moreover, the measure of literacy in this study, DIBELS, measures skills such as reading nonsense words that might be absent from many BAU programs even if they include systematic synthetic phonics instruction. Since UFLI includes work with nonsense words as part of its program, we can’t discount the possibility that much of the improvement on DIBELS comes from this particular skill that may or may not in itself predict future success in reading.
When we look at a whole class program, we need to know about how the program impacts all students. However, the study looks at children who were scoring below the median in both the treatment and control group. Would we still see these large effect sizes if the students who performed better on the pretest were included? In other words, did the kids who were already good readers need 30 minutes a day of UFLI? Suppose you have some students who have somehow learned to read with very little phonics instruction or who have already had a lot of phonics instruction outside of school (or in a previous grade). Will they benefit from 30 minutes a day of UFLI? Well, that effect size of over 1 doesn’t apply to them, only to kids who were scoring below the median on pretest.
This point is worth emphasizing because there is no way to exit the whole class program early, an important point since kids are exposed to limited kinds of text in the program. Students read ‘connected text’ passages that “were carefully developed to include only previously taught grapheme-phoneme correspondences.” In other words, kids aren’t going to be exposed to the kinds of complex text that will help them use statistical learning to understand the full scope of written language. Combined with a broader push to only use decodables during instruction outside of read-alouds, we could be headed down a path where we can’t see what readers look like several years into the future.
On the other end, we need to know about the kids that the program did not help in the way we might hope. If the results fit a normal distribution (we don’t know and the paper doesn’t tell us), then about 31.9% of students in the sample would fall below the 426 ‘at risk’ cut score using a basic z-score calculation. Since the sample only includes students below the median in the first place, that works out to about 15.95% of all students in the treatment classrooms scoring ‘at risk’, assuming none of the students above the median do. The same is true of the KG scores. We also need to take into consideration that “The mean posttest score for first-grade students in the treatment group scored in the Some-Risk range by one point (M = 440 and 441 or higher was Minimal Risk)”. So even though we see this large effect size, the program is clearly leaving students close to ‘at risk’ according to DIBELS. Why is that happening? We can’t assume that a stronger dose of UFLI will help.
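For anyone who wants to check the z-score arithmetic, here is a minimal sketch. It assumes the 31.9% figure refers to the Grade 1 treatment mean of 440 and the 426 cut score. The standard deviation is not quoted in this post, so `ASSUMED_SD = 30` is a hypothetical value picked only because it lands close to that figure; swap in the paper's reported value to redo the estimate.

```python
from statistics import NormalDist

MEAN_G1 = 440       # Grade 1 treatment posttest mean quoted from the paper
AT_RISK_CUT = 426   # 'at risk' cut score used in the text
ASSUMED_SD = 30     # ASSUMPTION: hypothetical value, not taken from the paper


def share_below(cut, mean, sd):
    """Share of a normal distribution that falls below `cut`."""
    return NormalDist(mu=mean, sigma=sd).cdf(cut)


share_of_sample = share_below(AT_RISK_CUT, MEAN_G1, ASSUMED_SD)

# The study sample is only the below-median half of each class, so halving
# gives the share of the whole class. This assumes no student from the
# above-median half ends up below the cut score.
share_of_class = share_of_sample / 2

print(f"below cut, study sample: {share_of_sample:.1%}")
print(f"below cut, whole class:  {share_of_class:.1%}")
```

The whole estimate rests on the normality assumption and on the standard deviation, which is one more reason it would help to see the actual distribution of scores.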

To be fair, lots more kids in the control groups are classified as ‘at risk’ according to DIBELS. UFLI moved the needle. However, if the mean of the Treatment group in Grade 1 hovers in the Some-Risk range, it’s probably worth trying to figure out why since the whole thrust of the Science of Reading movement is that kids have been left behind by bad Tier 1 instruction. If Balanced Literacy as a Treatment left 15.95% of students as ‘at risk’, you bet that Science of Reading folks would be wondering if that number could be brought down.
So far, I have not contested DIBELS as a measure of reading ability, and that’s not the track I am going to take, though there are lots of reasons to. However, using the DIBELS composite score as a measure of reading success obscures where students are improving, which makes responsive instruction more difficult.
The DIBELS composite score for KG is calculated by scaling the score on each subtest and then performing a four-step calculation to get a composite score. In Grade 1, a new test of Oral Reading Fluency (ORF) is added. Unlike the other measures, ORF is not ‘decodable’. In fact, educators in the UFLI in Ontario group often comment on how surprised they are when they administer Acadience (a commercialized version of DIBELS) and find that the ORF is hard for their students, given the UFLI program of instruction. Some have complained that it’s not fair to measure students’ progress on something they haven’t been taught.
In the case of DIBELS, each of these tests is 1 minute long. Before we take a look under the hood, take a minute and imagine what 1 minute looks like with a 6 year old.
Imagine a student in KG who scored 397, the mean in the Control group, and compare them to a student who scored 421, the mean in the Treatment group. My point is that a 24-point improvement on DIBELS does not transparently show what we might see from a student in the classroom. The same argument could be made that reading levels are opaque and unreliable.
Here, I am using the DIBELS Benchmark Norms to produce reasonable numbers for students who are below the median.
KG subtest | Weight | Control | Treatment |
Letter Naming Fluency | x 8.86 | 35 | 41 |
Phonemic Segmentation Fluency | x 4.13 | 36 | 44 |
Nonsense Word Fluency – Correct Letter Sounds | x 14.93 | 14 | 32 |
Nonsense Word Fluency – WRC | x 3.56 | 4 | 6 |
Word Reading Fluency | x 5.62 | 6 | 6 |
Composite formula | | ((sum – 729)/630) x 40 + 398 | ((sum – 729)/630) x 40 + 398 |
Total | | 397 | 421 |
The point of this example is that students could make significant gains on a skill that UFLI focuses on, producing the sounds in a nonsense word, but otherwise appear pretty much the same as a child in the Control condition. If students did not ever have a teacher ask them to do this kind of task, you can imagine that they might not produce many sounds in one minute.
On the other hand, a child in KG could have the same scores as the control condition across all tests, but with 70 sight words instead of 6 and would still score a 421, the mean of the Treatment group. A DIBELS score of 421 does not tell us enough about what kind of profile of reading abilities we might see in our class.
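The KG arithmetic can be reproduced with a short sketch. The weights, constants, and raw scores are copied from the table above; the dictionary keys and the `kg_composite` helper are my own hypothetical names, and I am assuming the second Nonsense Word Fluency row is Words Recoded Correctly, as in the Grade 1 table. The results come out within about a point of the totals in the table, and the exact rounding used for those totals isn't stated.

```python
# Weights and scaling constants copied from the KG table in this post.
KG_WEIGHTS = {
    "LNF": 8.86,       # Letter Naming Fluency
    "PSF": 4.13,       # Phonemic Segmentation Fluency
    "NWF_CLS": 14.93,  # Nonsense Word Fluency, Correct Letter Sounds
    "NWF_WRC": 3.56,   # second NWF row (assumed Words Recoded Correctly)
    "WRF": 5.62,       # Word Reading Fluency
}
KG_CENTER, KG_SPREAD, KG_BASE = 729, 630, 398


def kg_composite(raw):
    """Weighted sum of raw subtest scores, rescaled as in the table."""
    weighted_sum = sum(KG_WEIGHTS[k] * raw[k] for k in KG_WEIGHTS)
    return (weighted_sum - KG_CENTER) / KG_SPREAD * 40 + KG_BASE


control = {"LNF": 35, "PSF": 36, "NWF_CLS": 14, "NWF_WRC": 4, "WRF": 6}
treatment = {"LNF": 41, "PSF": 44, "NWF_CLS": 32, "NWF_WRC": 6, "WRF": 6}
# Same as the control profile, but reading 70 sight words instead of 6.
sight_words = {**control, "WRF": 70}

for name, profile in [("control", control),
                      ("treatment", treatment),
                      ("sight words", sight_words)]:
    print(f"{name:12s} {kg_composite(profile):.1f}")
```

Two very different profiles, one built on nonsense-word sounds and one built on sight words, land on nearly the same composite.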
Now, let’s look at a Grade 1 example. In this case, a student in the Control group could actually make gains over a student in the Treatment group on Oral Reading Fluency, but if the Treatment group made gains on the Phonemic Segmentation and Nonsense Words measures because that’s what they have been practicing, then that could account for the difference in means.
Grade 1 subtest | Weight | Control | Treatment |
Letter Naming Fluency | x 10.7 | 55 | 55 |
Phonemic Segmentation Fluency | x 2.3 | 20 | 45 |
Nonsense Word Fluency – Correct Letter Sounds | x 23.13 | 20 | 50 |
Nonsense Word Fluency – WRC | x 7.79 | 15 | 35 |
Word Reading Fluency | x 13.51 | 28 | 25 |
Oral Reading Fluency – Correct | x 23.36 | 45 | 40 |
Oral Reading Fluency – Accuracy | x 0.25 | 90 | 90 |
Composite formula | | ((sum – 3371)/2251) x 40 + 440 | ((sum – 3371)/2251) x 40 + 440 |
Total | | 427 | 440 |
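Here is the same kind of sketch for the Grade 1 example, this time also splitting the gap between the two hypothetical students into composite points per subtest. The weights, constants, and raw scores come from the table above; the key names and the `g1_composite` and `gap_by_subtest` helpers are hypothetical names of my own. The totals land within about a point of the ones in the table.

```python
# Weights and scaling constants copied from the Grade 1 table in this post.
G1_WEIGHTS = {
    "LNF": 10.7,           # Letter Naming Fluency
    "PSF": 2.3,            # Phonemic Segmentation Fluency
    "NWF_CLS": 23.13,      # Nonsense Word Fluency, Correct Letter Sounds
    "NWF_WRC": 7.79,       # Nonsense Word Fluency, WRC
    "WRF": 13.51,          # Word Reading Fluency
    "ORF_CORRECT": 23.36,  # Oral Reading Fluency, words correct
    "ORF_ACCURACY": 0.25,  # Oral Reading Fluency, accuracy
}
G1_CENTER, G1_SPREAD, G1_BASE = 3371, 2251, 440


def g1_composite(raw):
    """Weighted sum of raw subtest scores, rescaled as in the table."""
    weighted_sum = sum(G1_WEIGHTS[k] * raw[k] for k in G1_WEIGHTS)
    return (weighted_sum - G1_CENTER) / G1_SPREAD * 40 + G1_BASE


def gap_by_subtest(low, high):
    """Composite points each subtest adds to the (high - low) gap."""
    return {k: G1_WEIGHTS[k] * (high[k] - low[k]) / G1_SPREAD * 40
            for k in G1_WEIGHTS}


control = {"LNF": 55, "PSF": 20, "NWF_CLS": 20, "NWF_WRC": 15,
           "WRF": 28, "ORF_CORRECT": 45, "ORF_ACCURACY": 90}
treatment = {"LNF": 55, "PSF": 45, "NWF_CLS": 50, "NWF_WRC": 35,
             "WRF": 25, "ORF_CORRECT": 40, "ORF_ACCURACY": 90}

print(f"control   {g1_composite(control):.1f}")
print(f"treatment {g1_composite(treatment):.1f}")
for subtest, points in gap_by_subtest(control, treatment).items():
    print(f"{subtest:13s} {points:+.1f}")
```

In this made-up profile the nonsense word measures account for most of the gap, while Oral Reading Fluency and Word Reading Fluency actually pull in the other direction.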
Of course, all of this is speculative, but it shouldn’t have to be. Researchers could make their data sets open access or do a deeper dive into what’s going on. Producing essentially 4 numbers, the means and standard deviations of each group, is not very generous.
This isn’t an argument about whether to use UFLI or not, but about whether the research paper gives enough to teachers. Assume that we are open-minded, curious, and knowledgeable.