Dean Radin is senior scientist at the Institute of Noetic Sciences and author of the book *The Conscious Universe* (published as *The Noetic Universe* in the UK). In chapter 8, “Mind-Matter Interaction,” he provides a meta-analysis of dice-tossing experiments from 1935 to 1987. Dice tosses per study ranged from 60 to 240,000. A total of 2.6 million dice were thrown across the entire analysis. After compiling all data, the overall hit-rate was 51.2%. In a separate control study involving 150,000 throws and no mental intention, the overall hit-rate was 50.02%.

Is this evidence that mental intention can directly influence the physical world, or that the mind can intuitively sense the outcome of events? The hit rate is sufficiently low that such an ability could hardly be used for any practical purpose, but if there is such an unusual link between mind and matter, this calls into question our presuppositions about the nature of reality itself. So, how do we account for the curiously unbalanced statistic?

Firstly, let’s carefully examine the outcome of Radin’s control study. I have no doubt that his result of 50.02% occurred, but let’s see what range of outcomes emerges if we repeat the control study 100 times. I will use Microsoft Excel, creating a grid of 1,000×150 cells, each cell containing the function =ROUND(RAND(),0). This will generate either a 0 or a 1, with a 50/50 chance in each case. We will use 1 to represent a hit and 0 to represent a miss. When we add all the hits in the grid of 150,000 cells together, chance expectation determines that we should arrive very close to 75,000 (50%). Here are the results:

Examining 100 trials (each dot representing the outcome of one trial), we can see that a control result like Radin’s 50.02% can easily deviate from 50% by about 0.3 percentage points through chance alone. That said, a result of 51.2%, if accurate, would be staggeringly significant, especially when 2.6 million individual throws are taken into account, rather than just 150,000. So far, it looks as if something unusual is happening.
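For readers without Excel, the repeated control study can be sketched in Python. This is a minimal sketch, not the spreadsheet behind the chart: the function names and the seed are assumptions made for illustration, and Python’s `random` module stands in for Excel’s `RAND()`.

```python
import random

def simulate_hit_rate(n_tosses: int, rng: random.Random) -> float:
    """Return the fraction of hits in n_tosses fair 50/50 trials."""
    # getrandbits(n) yields n independent fair bits; each 1 counts as a hit.
    hits = bin(rng.getrandbits(n_tosses)).count("1")
    return hits / n_tosses

def repeat_control_study(n_trials: int = 100,
                         n_tosses: int = 150_000,
                         seed: int = 1) -> list:
    """Repeat the no-intention control study n_trials times."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    return [simulate_hit_rate(n_tosses, rng) for _ in range(n_trials)]

if __name__ == "__main__":
    rates = repeat_control_study()
    print(f"lowest hit-rate:  {min(rates):.4%}")
    print(f"highest hit-rate: {max(rates):.4%}")
```

The exact extremes depend on the seed, so they will not match the chart dot for dot. The spread across the 100 trials should still be of the same order.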

What about the quality of the real experiments? Radin collects data from experiments involving as few as 60 dice throws. Let’s take a look at what happens to the average hit-rate when we extend an experiment from 100 computer-generated coin tosses (left-hand side of chart) to 10,000 (right-hand side).

Although the above graph looks messy, it’s not necessary to pick out individual lines. We are simply attempting to identify a trend. Each line represents a single complete trial of 10,000 computer-generated coin tosses, and the X-axis counts tosses in hundreds. Consider the red line that I have deliberately emboldened for effect. If we want to know its average hit-rate at 1,000 coin tosses, we simply let our gaze travel up vertically from 10 on the X-axis. Likewise, if we want to know the hit-rate at 4,000 tosses, we find 40 on the X-axis. This vertical slice also reveals the hit-rate at the same point on the 99 other trials. I have emboldened the navy and red lines, as they revealed themselves to be the highest and lowest hit-rates at the 10,000 tosses mark.

The first point of note is the statistical chaos that occurs at the beginning. In one trial, a lucky hit rate of 63% happens after just 100 tosses. If a human experimenter had achieved a score of 63% after 100 tries, he might be tempted to conclude that he is uncovering evidence in support of psi. But since the computer’s random number generator achieved this result, that is certainly not the case. The computer also generated a hit-rate as low as 39%. Clearly, at 100 tosses, it’s a case of anything goes – relatively speaking. Another interesting observation is that the navy line trial began at 39% and ended at 51.24%; the glaringly bad first score had little bearing on the final score at 10,000 tosses, which happened to be the top score of all 100 trials. This only reinforces the observation that nothing of value can be determined by a mere 100 coin tosses.
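The way the early chaos narrows can be checked with a short simulation. The sketch below is one possible way to reproduce the idea, not the method used for the chart: it tracks the cumulative hit-rate of each trial at a few checkpoints and reports the lowest and highest value across all trials. The checkpoint values, function names and seed are assumptions.

```python
import random

def running_hit_rates(n_tosses, checkpoints, rng):
    """Toss a fair coin n_tosses times; return the cumulative hit-rate at each checkpoint."""
    wanted = set(checkpoints)
    hits = 0
    rates = {}
    for toss in range(1, n_tosses + 1):
        hits += rng.getrandbits(1)  # 1 = hit, 0 = miss
        if toss in wanted:
            rates[toss] = hits / toss
    return rates

def spread_at_checkpoints(n_trials=100, n_tosses=10_000,
                          checkpoints=(100, 1_000, 10_000), seed=1):
    """Return {checkpoint: (lowest, highest)} hit-rate across all trials."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    trials = [running_hit_rates(n_tosses, checkpoints, rng) for _ in range(n_trials)]
    return {c: (min(t[c] for t in trials), max(t[c] for t in trials))
            for c in checkpoints}

if __name__ == "__main__":
    for tosses, (low, high) in spread_at_checkpoints().items():
        print(f"{tosses:>6} tosses: {low:.2%} to {high:.2%}")
```

The gap between the lowest and highest line should shrink markedly as the checkpoint moves from 100 to 10,000 tosses.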

The first problem with Radin’s meta-analysis is that he incorporates trials of as few as 60 dice throws, when he ought to know that these are utterly valueless due to the massive spikes of randomness that can occur in such a small trial. It is also likely that such trials suffer massively from the “file drawer” problem. Since small experiments require little investment of time, an experimenter is much more likely to publish a small experiment that just happened to produce an outstanding result than he is to publish one where nothing unusual happened, or where he achieved an unusually low score. Excitement is a motive to publish, whereas lack of excitement is a motive to forget. Larger experiments involving a significant time commitment and sufficient discipline to see it through are more likely to be published regardless of the outcome, precisely because of the sheer investment of energy. Radin does not state how many small trials he relied upon. A better working principle for an accurate meta-analysis would be to exclude all small trials.

But how shall we define what constitutes a “small” trial? Essentially, we are attempting to realistically minimise the corruption caused by too much randomness. Consider what is happening at 800 coin tosses. Most of the lines are now settling into a noticeable bell curve shape. But note the little cyan peak at 55.625%. If I had achieved such a high score in 800 tosses, I would be tempted to think that psi was responsible. And my excitement would lead me to publicise the finding, whereas I would not have made any noise about an ordinary result. In reality, 55.625% after 800 coin tosses is a statistically ordinary result, at the upper end of chance expectation, although it tends to look extraordinary unless you refer to my chart. Bringing the 800 tosses up to 1,500, in our computer-generated example, sends the hit-rate into a nose-dive right down to 51.06667%, as it predictably meanders along the bell curve of chance expectation. This raises the question: of all the trials included in Radin’s meta-analysis, how many of them stopped at the “sweet spot”? In our computer trials, the scores do not all stay within the bounds of 48% to 52% until a staggering 5,000 tosses. The principle at work is: the sooner you stop, the more amazing your score can appear. The value of a long-running experiment over a short one is clear. I find it staggeringly short-sighted that Radin would allow trials of 60 dice throws into his meta-analysis.

Let’s zoom in on our chart to get a clearer look at the detail:

I would suggest 1,500 tosses as a good minimum requirement for all experiments. We can see that our highlighted navy and red lines had established themselves in a prominent position statistically by 1,500 tosses or thereabouts. 1,500 is a suitable position where there is not much in the way of sudden spikes of random fluctuation, and we see the beginnings of an orderly trend.
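As a rough cross-check on where the spikes die down, the expected width of chance fluctuation can be estimated without any simulation. The sketch below assumes a fair 50/50 process and uses the normal approximation to the binomial with a conventional 95% band (z = 1.96). Both the approximation and the choice of z are assumptions made for illustration, not something taken from the charts.

```python
import math

def chance_band(n_tosses: int, z: float = 1.96) -> tuple:
    """Approximate (low, high) hit-rate band for a fair 50/50 process."""
    # Standard error of a proportion at p = 0.5 is sqrt(0.25 / n).
    half_width = z * math.sqrt(0.25 / n_tosses)
    return 0.5 - half_width, 0.5 + half_width

for n in (60, 100, 1_500, 5_000):
    low, high = chance_band(n)
    print(f"{n:>5} tosses: {low:.1%} to {high:.1%}")
```

A band like this describes a single trial of fixed length. When 100 lines are plotted together, the most extreme of them will routinely stray outside it, which is why the chart looks wilder than the formula suggests.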

Another good approach might be to take 150,000 throws from the 2.6 million, drawing from those experiments with the greatest quantity of throws per experiment, and then examine whether the hit-rate is significantly higher than 50.27933% (the highest score in my original 100 computer trials of 150,000 throws each).

What I would like to see included in Radin’s research is a chart grouping together the number of throws with the number of tests. For instance, how many tests in the whole meta-analysis relied upon 60-100 throws, 101-500 throws, 501-1500 throws, 1501-3000 throws, and so on. Then we can get a feel for whether his analysis relies heavily on short tests or long ones.

Another extremely important issue is the manner in which the overall statistics are compiled. There are two approaches, one of which suffers from significant inaccuracy. To illustrate: let’s say I have data from two separate experiments and I want to combine them. The first experiment involved 100 tosses with an average hit-rate of 62% (62 hits). The second involved 3,000 tosses with an average of 52% (1,560 hits). Both of these outcomes are within the upper end of chance expectation. If I work out the overall average by adding 62% to 52% and dividing by two (because there are two experiments), I get a final hit-rate of 57%. On the surface, this looks fair. But consider the following alternative approach: instead of adding the results of two separate experiments together and working out an average, let’s treat the whole analysis as a single experiment involving 1,622 hits (62 + 1,560) over 3,100 tosses (100 + 3,000). This gives us a vastly different and more accurate final hit-rate of 52.32258%.

The problem with the first approach is that it puts an experiment involving 100 tosses on an equal footing with one involving 3,000. This has the effect of treating the first experiment as if it involved 3,000 tosses with a hit-rate of 62 for every 100 of those tosses, literally inventing out of thin air 1,860 hits out of 3,000. To double-check that what I’m claiming here is correct, look what happens when I work out the average of 1,860/3,000 and 1,560/3,000. As expected, 57%.
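The arithmetic of the two approaches is easy to verify in a few lines. The sketch below uses only the two example experiments given above; the function names are illustrative choices.

```python
def unweighted_mean_rate(experiments):
    """Average the per-experiment hit-rates, ignoring experiment size."""
    rates = [hits / tosses for hits, tosses in experiments]
    return sum(rates) / len(rates)

def pooled_rate(experiments):
    """Total hits divided by total tosses across all experiments."""
    total_hits = sum(hits for hits, _ in experiments)
    total_tosses = sum(tosses for _, tosses in experiments)
    return total_hits / total_tosses

# (hits, tosses) for the two example experiments in the text
experiments = [(62, 100), (1_560, 3_000)]

print(f"{unweighted_mean_rate(experiments):.2%}")  # 57.00%
print(f"{pooled_rate(experiments):.5%}")           # 52.32258%
```

Pooling is equivalent to weighting each experiment’s hit-rate by its number of tosses, which is why the 100-toss experiment barely moves the combined figure.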

An overall hit-rate will not be accurate if it merely adds existing point estimates together from tests that involved wildly varying amounts of dice throws. A truly accurate final hit-rate would be the total number of hits of all experiments divided by the 2.6 million total throws.

In closing, this experiment would not be complete without attempting it first-hand. I approached this, not as a sceptic but as a believer in psychic phenomena (due to some limited personal experience). I really wanted it to be true, but alas I must report that my results were entirely in keeping with chance expectation. Rather than using a coin or a die, I used a new unblemished deck of cards. After a thorough shuffle, I would mentally choose whether I wanted the top card to be black (spades and clubs) or red (hearts and diamonds). I made this choice by intuitively feeling for it, trying not to rationalise. Then I would flip the top card over and record whether I had a hit or miss. I would give the deck another quick shuffle, then do the same again. This was certainly more tedious than coin flipping, but it provided exactly the same 50-50 mechanism. I simply liked the ability to look at the back of the card while deciding, and I liked being able to keep shuffling until I had the “hunch” to stop. Here are my results, showing 1,500 card guesses:

As chance would have it, I began with a staggeringly good score of 62 for the first 100 card flips. Amazingly, I immediately followed this with another great score of 59, giving me an average of 60.5% after 200 flips. But it was all downhill after that. At 700 I was still doing relatively well with a score of 53%, but by 1,500 guesses my hit-rate had levelled to 50% on the dot, where chance expectation still allowed for anything from roughly 47% to 53%. I had obtained 750 hits out of 1,500; it doesn’t get more typical than that.

Now, I could make up a nonsense theory about how my psychic mojo was working great at the start, but deteriorated as boredom set in. The truth is, I was really excited about this experiment when it all started going awry (i.e. back to normal). The simple truth is, my results are entirely typical of chance. If only I had called it quits at the sweet spot, eh? The more you extend your experiment, the less evidence for psi becomes apparent, until you realise that what you thought was psi was really just random chaos.

To reinforce my earlier assertion that statistics must be measured by totalling the actual guesses, not by grouping together the average outcomes of separate experiments, let’s examine what happens when I take my own data and split it into a series of separate experiments, each one of differing length. I originally compiled my results as a series of 15 scores, each one representing a hit-count out of 100 guesses. These were: 62, 59, 43, 47, 49, 56, 55, 50, 45, 53, 49, 48, 48, 37, 49. Let’s now imagine these were the outcomes of four separate experiments, the first one comprising 100 guesses (62), the second one 500 guesses (59, 43, 47, 49, 56), the third one 200 guesses (55, 50), and the fourth one 700 guesses (45, 53, 49, 48, 48, 37, 49). The average hit-rates of the four experiments work out as: 62%, 50.8%, 52.5%, and 47%. And when we combine these by simple averaging, we get a highly inaccurate 53.075%. If I wished to mislead people, or if I were incompetent at handling statistics, I could make a statement like the following: “After conducting four experiments involving a total of 1,500 card guesses, the resulting hit-rate lay marginally outside of chance expectation when compared with 100 trials using a random number generator.” I could even make a chart to reinforce how seemingly impressive this result is:

But the true hit-rate, as we have seen, was exactly 50%.
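The same check can be run on the 15 recorded scores. This sketch groups them into the four imagined experiments described above and compares the average of the four hit-rates with the pooled hit-rate; the helper name is an illustrative choice.

```python
def split_into_experiments(round_scores, rounds_per_experiment, guesses_per_round=100):
    """Group consecutive 100-guess rounds into (hits, guesses) experiments."""
    experiments = []
    start = 0
    for n_rounds in rounds_per_experiment:
        chunk = round_scores[start:start + n_rounds]
        experiments.append((sum(chunk), n_rounds * guesses_per_round))
        start += n_rounds
    return experiments

# Hit-counts per 100 guesses, as listed in the text
round_scores = [62, 59, 43, 47, 49, 56, 55, 50, 45, 53, 49, 48, 48, 37, 49]
experiments = split_into_experiments(round_scores, [1, 5, 2, 7])

rates = [hits / guesses for hits, guesses in experiments]
mean_of_rates = sum(rates) / len(rates)
pooled = sum(h for h, _ in experiments) / sum(g for _, g in experiments)

print([f"{r:.1%}" for r in rates])  # ['62.0%', '50.8%', '52.5%', '47.0%']
print(f"{mean_of_rates:.3%}")       # 53.075%
print(f"{pooled:.1%}")              # 50.0%
```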

A similar way of accidentally manipulating the stats occurs when the experimenter gives himself permission to quit each round of guesses at any point he wishes, instead of measuring them in even rounds, as I did. If he happens to score 118 out of 200, he might say to himself, “I’d better finish now while my luck holds,” and he records a hit-rate of 59% for round one. Then, the following day, he commences round two, but scores a low 80 by the time he has reached 200 guesses. Naturally, he wants to keep going instead of recording a hit-rate of 40%. What he fails to realise is that a low starting score almost *always* rises towards 50% when guessing is allowed to continue. At 3,000 guesses, it’s highly improbable to get a score lower than roughly 47.5% (refer to the chart). The winning strategy is simple: when you score high early, quit early; when you score low early, keep going. And so, the experimenter fools himself into thinking that he is beating the odds. The flaw in the experimental design is that he didn’t choose a precise value for how many guesses each round should contain. In Dean Radin’s meta-analysis, how much care was taken to ensure that none of the experiments suffers from this fault? We don’t know.
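The effect of this stopping rule can be illustrated with a simulation in which every guess is a pure 50/50 event. The sketch below is only one hypothetical version of the rule: the experimenter quits a round at 200 guesses if he is ahead, and otherwise continues to 3,000. The thresholds, function names and seed are all assumptions.

```python
import random

def count_hits(n_guesses: int, rng: random.Random) -> int:
    """Number of hits in n_guesses fair 50/50 guesses."""
    return bin(rng.getrandbits(n_guesses)).count("1")

def play_round(rng, early=200, late=3_000):
    """One round with a biased stopping rule; returns (hits, guesses)."""
    hits = count_hits(early, rng)
    if hits > early / 2:
        return hits, early          # ahead early: quit while "lucky"
    hits += count_hits(late - early, rng)
    return hits, late               # behind early: keep going

def simulate(n_rounds=2_000, seed=1):
    """Return (mean of per-round hit-rates, pooled hit-rate)."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    rounds = [play_round(rng) for _ in range(n_rounds)]
    mean_of_round_rates = sum(h / n for h, n in rounds) / n_rounds
    pooled = sum(h for h, _ in rounds) / sum(n for _, n in rounds)
    return mean_of_round_rates, pooled

if __name__ == "__main__":
    mean_rate, pooled = simulate()
    print(f"mean of round hit-rates: {mean_rate:.2%}")
    print(f"pooled hit-rate:         {pooled:.2%}")
```

Under these assumptions the average of the recorded round scores should sit visibly above 50%, while the pooled hit-rate over all guesses stays close to 50%. The bias comes from the stopping rule combined with averaging rounds of unequal length, not from the guesses themselves.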

I used to think that meta-analysis was a really effective way of proving psi. Perhaps it is, but not when overly small experiments are included, not when wildly varying sizes of experiments are compared on an equal footing, and not when a quantity of the experiments may suffer from the design flaw I described in the paragraph above.

well stated Daryl!

Another thing that you might want to take into account is that computer-generated “random” numbers aren’t truly random. That is because computers work with algorithms that are deterministic. We can, however, come really close with “pseudo-random” numbers. In simple words, we use the state that our computer is in at that very moment and use these circumstances to generate a pseudo-random number. Still, a dice roll is even “more random”. Just a little something to think about that might even have an impact on the experiment.

Dennis’ point about computer-generated random numbers was the first thing I thought of while reading your article, but the second thing was: how many of the experimenters were actually TRAINED to use psi? Unless the experiment was meant to measure the latent psychic power in the average human, I’d say using novices would be functionally useless.

Given your interests, I would recommend that you read the technical article describing this meta-analysis rather than rely on the simplified description in my popular book. You’ll find that in this meta-analysis I used sample-weighted effect sizes. You can find the article here: http://deanradin.com/evidence/Radin1991DiceMA.pdf. The description of the experiment you conducted is a nice replication of the well-known “decline effect,” which is often associated with boredom and with interference due to memory of the results of previous trials. Declines have been observed in many types of experiments, not just within psi research.

Darryl have you seen this?: http://www.amitgoswami.org

darryl, wouldnt you need to put your first card back in the deck each time….to maintain the same chances…when you remove that first black card….you have just increased the chances of drawing a red card.

I did return the card each time.

my apologies