The government’s proposed reforms of children’s services in
England assign a pivotal role to the inspectorate Ofsted. If a local
authority’s children’s services department is rated ‘inadequate’ by Ofsted, it
will now be given six months to improve or risk being taken over. That’s
drastic stuff, so there has never been a better time to think very hard about
how valid and reliable Ofsted inspections are.
To help do just that, I have developed a thought experiment based on the red bead game
used by the quality guru Dr W. Edwards Deming as a teaching aid in the seminars and lectures he gave
across the world until his death in 1993. Dr Deming used the game to
demonstrate that even with identical methods and tools there will always be
variation in results, and that this variation often has nothing to do with what
individuals and groups actually contribute to delivering a particular process.
My thought experiment adapts the red bead game as follows:
Imagine you have 150 pots, each one corresponding to a local
authority in England. In each pot you place 5000 beads, 4000 of which are white
and 1000 of which are red. The beads represent ‘cases’ or ‘service episodes’.
The white beads are examples of acceptable or good practice and the red ones
are examples of poor practice. So 1 in every 5 cases (20%) is substandard. [1]
Now simulate the activity of an inspector by randomly
extracting 50 beads from each pot and examining what you get [2]. You would be
very lucky indeed to find that each extract contains 40 white beads and 10 red
ones (matching the overall proportion of 20% red beads in the pot). On
the contrary, you are highly likely to see quite a lot of variation in the
white/red proportion of each extract. In some cases the number of red beads
will be well below 10, in some it may even be 0, and in some cases it will be
considerably higher than 10. In an exceptional extract the red beads could
account for a third or more of the sample.
Results for the first 10 pots might look like this:
Pot | No. (%) red
--- | -----------
A   | 5 (10)
B   | 15 (30)
C   | 11 (22)
D   | 19 (38)
E   | 2 (4)
F   | 17 (34)
G   | 8 (16)
H   | 23 (46)
I   | 5 (10)
J   | 18 (36)
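A reader can reproduce this spread with a few lines of code. The sketch below draws 50 beads from each of 150 identical pots, using the figures from the thought experiment; the fixed random seed is there only to make this sketch reproducible:

```python
# A minimal sketch of the thought experiment: 150 pots, each holding
# 4000 white and 1000 red beads; an "inspector" draws 50 beads from each.
import random

random.seed(1)
POTS = 150
SAMPLE = 50

red_counts = []
for _ in range(POTS):
    pot = ["red"] * 1000 + ["white"] * 4000  # identical contents in every pot
    sample = random.sample(pot, SAMPLE)      # draw without replacement
    red_counts.append(sample.count("red"))

print("expected reds per sample:", int(0.2 * SAMPLE))
print("lowest red count found: ", min(red_counts))
print("highest red count found:", max(red_counts))
```

Every run produces the same kind of spread: pots with identical contents yield extracts ranging from well under 10 reds to well over.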
This variation cannot be ascribed to anything that is going
on inside the pots (because we know we put 4000 white and 1000 red
beads into each one, and that they have just stayed there until they were
extracted). So it would be very wrong indeed to ascribe to any particular pot a
description such as “too many reds” or “too much poor practice” or
“inadequate”. And it would be very wrong to conclude that pots D, F, H and J
should be made subject to special measures while those responsible for pots E,
A and I should be lauded for their outstanding performance!
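The chance of such a misleading extract can be computed exactly. A draw of 50 beads without replacement follows a hypergeometric distribution, and the sketch below asks how often a pot whose true rate is exactly 20% will nonetheless yield a sample that is 30% or more red (the 30% threshold is my illustrative choice, not anything Ofsted publishes):

```python
# Exact probability that a 50-bead sample from a pot of 5000 beads
# (1000 red, i.e. a true rate of exactly 20%) contains 15 or more
# red beads (30%+). The draw without replacement is hypergeometric.
from math import comb

N, RED, n = 5000, 1000, 50   # pot size, red beads, sample size

def p_reds(k):
    """Exact probability of drawing exactly k red beads in the sample."""
    return comb(RED, k) * comb(N - RED, n - k) / comb(N, n)

p_high = sum(p_reds(k) for k in range(15, n + 1))
print(f"P(sample is 30%+ red | true rate 20%) = {p_high:.3f}")
```

A non-trivial fraction of perfectly identical pots will look markedly worse than their true 20%, purely by chance.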
But I hear you ask, perhaps Ofsted has taken steps in the
way it has designed its inspections, and the ways in which it selects its
samples, to minimise the natural variation which occurs in the red bead game?
Perhaps they use clever statistics to ensure that their results are valid?
Well, perhaps they do, but there is no
evidence of it. I have scoured the Ofsted website for anything which suggests
that they have thought about the red bead problem. And I have written to them
and pursued them with a Freedom of Information Act request to find out if they
use statistical techniques to try to ensure inspections are valid. The reply I received
gives no indication that they do. [2]
But it is not really up to me to justify Ofsted’s methods.
It is up to them. In 2012 Professor Dylan Wiliam, of the University of London’s
Institute of Education, challenged Ofsted to evaluate the reliability of its
school inspections and publish the findings, asking: “If two inspectors inspect
the same school, a week apart, with no communication between them, would they
come to the same ratings?” (Times Educational Supplement, 03/02/12).
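The thought experiment lets us run Prof. Wiliam's test on a pot whose contents we know. In the sketch below, two simulated inspectors independently draw 50 beads from the same pot and we count how often their samples fall into different rating bands; the bands themselves are my own illustrative invention, not Ofsted's criteria:

```python
# Two independent "inspectors" each draw 50 beads from the same pot
# (4000 white, 1000 red). How often do their toy ratings disagree?
import random

random.seed(2)
pot = ["red"] * 1000 + ["white"] * 4000

def band(reds):
    """Toy rating bands: under 15% red = 'good', 15-30% = 'requires
    improvement', over 30% = 'inadequate'. Purely illustrative."""
    pct = 100 * reds / 50
    if pct < 15:
        return "good"
    if pct <= 30:
        return "requires improvement"
    return "inadequate"

disagree = 0
TRIALS = 10_000
for _ in range(TRIALS):
    a = random.sample(pot, 50).count("red")
    b = random.sample(pot, 50).count("red")
    if band(a) != band(b):
        disagree += 1

print(f"inspectors disagreed in {100 * disagree / TRIALS:.0f}% of trials")
```

Under these assumptions the two inspectors reach different verdicts on the same pot in roughly a third of trials, despite both doing their job flawlessly.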
I don’t know whether Prof. Wiliam got an answer but I can’t
find one that has been published. Maybe in 2016 Ofsted could answer a similar
question for me. “How can Ofsted be sure that the variation between different local
authorities, revealed in its inspections of children’s services in England, is
due to differences in performance rather than just due to chance?”
If Ofsted cannot answer that question in a convincing way it
should not be in the business of inspecting children’s social care and the
government should certainly not be assigning a pivotal role to Ofsted in its so-called
‘reforms’.
Notes
[1] I have no evidence that 1 in 5 cases is in fact substandard,
although it seems to me to be a reasonable 'guesstimate', especially in view of
the fact that Ofsted finds such a large number of authorities ‘inadequate’ or
‘requires improvement’. I have tried, without success, to discover if Ofsted
is able to estimate what the proportion of substandard cases is in the entire
‘population’ of the cases they have reviewed in (say) the last 10 years.
[2] Ofsted’s ‘Inspection Handbook’ speaks of ‘tracking’ no
more than 30 children during an inspection and ‘auditing’ a ‘sample’ of 20 case
files. I could find no detailed information in this document about how the cases
are chosen.
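For what it is worth, samples of 20 or 30 cases are noisier still than the 50-bead draws in the thought experiment. A back-of-envelope sketch of the sampling noise at these sizes, using the standard deviation of a sampled proportion and the 20% true rate assumed in note [1]:

```python
# Rough 95% ranges for the observed rate of poor practice at different
# sample sizes, assuming a 20% true rate. Uses the normal approximation
# sd = sqrt(p * (1 - p) / n); a sketch, not a claim about Ofsted's
# actual methodology.
from math import sqrt

p = 0.20                      # assumed true proportion (see note [1])
for n in (20, 30, 50):
    sd = sqrt(p * (1 - p) / n)
    low, high = p - 2 * sd, p + 2 * sd
    print(f"n={n:2d}: observed rate typically between {low:.0%} and {high:.0%}")
```

At a sample of 20 cases, an authority whose true rate is 20% could quite easily show anywhere from a handful of poor cases to nearly double its real rate.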