Musings about being data driven. This week my daughter Darcy said, “It is my birthday soon, so you will have plenty of water to fill your new rainwater tank.” What?? As far as Darcy can recall, it ALWAYS rains on her birthday. I have the same type of mis-recollections of my small town (Toowoomba). We had two very wet years when I was at an impressionable age. I still think of Toowoomba as a wet place even though the mean annual rainfall is 944mm over 73 days compared to 1065mm over 84 days for Brisbane (where I currently live).
Our ability to predict the future is based on our own personal experiences and sometimes on the experiences of others if we care to ask. The two key problems with relying on human experience to predict the future are: firstly our sample size is too small for infrequent events and secondly our memories are selective rather than an unbiased collection of information.
Sample size – power analysis
So, if Darcy can only remember the weather from the last seven birthdays (due to childhood amnesia – see below), and that isn’t enough to predict what will happen this year, then how much data is enough? The question is one of statistical power: how big a sample do we need to be confident that we understand something? Most commonly, in statistics, we compare two datasets to see if they are statistically different (is it more likely to rain on my birthday or Darcy’s birthday?). For data that we collect ourselves, sample size is a critical issue – how do we know when we have enough for it to be useful? For data that we collate from elsewhere, we cannot really do much about sample size, so we have to make sure that the questions we ask of the data are appropriate for the sample size.
It would be great if there were a simple rule for determining sample size. Unfortunately, it is one of those critical yet difficult statistical tasks that requires a fair bit of thought about:
- defining the hypothesis to test,
- knowing the distribution of the data, and
- knowing what levels of confidence you are willing to accept in the analysis.
We are looking into some power analysis tools in Truii to give some general guidance for non-statisticians. For now, Lenth (2001) provides a good summary of do’s and don’ts to help you determine an effective sample size.
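As a rough illustration of how those three ingredients fit together, here is a minimal Python sketch for one common case: comparing two proportions (such as the chance of rain on two different birthdays) with a two-sided test and the usual normal approximation. The function name, the 40% figure for the second birthday, and the 5% significance and 80% power defaults are all assumptions chosen for illustration, not values taken from the rainfall record.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Approximate number of observations needed in EACH group to detect
    a difference between proportions p1 and p2 (two-sided test,
    normal approximation)."""
    # Hypothesis: the two proportions differ (two-sided).
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence we demand
    z_beta = NormalDist().inv_cdf(power)           # chance of detecting a real difference
    # Distribution: each observation is a yes/no event (rain or no rain).
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)


# 31% is the long-term chance of rain on Darcy's birthday;
# 40% for the other birthday is a made-up value for illustration.
years_needed = sample_size_two_proportions(0.31, 0.40)
print(f"Birthdays needed per person: {years_needed}")
```

Under these assumptions the answer is about 440 birthdays per person, far more than anyone’s memory can supply. Shrink the difference you want to detect and the number climbs quickly.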
Humans are fallible – memory biases
So how good is Darcy’s rainy birthday memory? It has rained on Darcy’s birthday twice in the last seven years (which is pretty close to the long-term average: it rained on Darcy’s birthday in 31% of the last 115 years). So, it seems that those two wet days must have been quite memorable events for Darcy. As it turns out, we humans are fallible and suffer from all types of memory biases. Here are a few choice memory biases, from Wikipedia, that relate to data analysis and prediction:
- Childhood amnesia: the retention of few memories from before the age of four.
- Choice-supportive bias: remembering chosen options as having been better than rejected options.
- Google effect: the tendency to forget information that can be easily found online.
- Egocentric bias: recalling the past in a self-serving manner, e.g., remembering one’s exam grades as being better than they were, or remembering a caught fish as bigger than it really was.
- Hindsight bias: the inclination to see past events as being predictable; also called the “I-knew-it-all-along” effect.
- Positivity effect: that older adults favor positive over negative information in their memories. I suffer from this one and I call it: ‘the older I get the better I was’.
- Rosy retrospection: the remembering of the past as having been better than it really was. You rarely hear: ‘Remember the good old days when infant mortality was high, women couldn’t vote and a 50 hour work week was standard’.
- Von Restorff effect: that an item that sticks out is more likely to be remembered than other items – like Darcy’s rainy birthdays and my soggy childhood.
- Conservatism or Regressive Bias: the tendency to remember high frequencies as lower than they actually were and low ones as higher than they actually were. That is – we remember the abnormal rather than the normal, effectively giving it a higher likelihood of occurring.
These memory biases are what make us human and interesting; they also convince us at Truii that we should use data to support our decisions.
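Darcy’s record can be checked with a little arithmetic. Assuming each birthday is an independent event with the long-term 31% chance of rain (a simplification; real weather is not that tidy), a short Python sketch using the binomial distribution gives the chance of seeing a given number of wet birthdays in seven years. The function name is made up for this example.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of k or more 'successes' in n independent trials,
    each with success probability p (binomial distribution)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


RAIN_CHANCE = 0.31  # long-term share of Darcy's birthdays with rain
BIRTHDAYS = 7       # birthdays Darcy can remember

# What actually happened: at least two wet birthdays out of seven.
print(f"2 or more wet birthdays: {prob_at_least(2, BIRTHDAYS, RAIN_CHANCE):.1%}")

# What Darcy remembers: rain on every single birthday.
print(f"7 wet birthdays out of 7: {prob_at_least(7, BIRTHDAYS, RAIN_CHANCE):.3%}")
```

Under that assumption, two or more wet birthdays in seven is the expected outcome (roughly a 69% chance), while rain on all seven would happen only about 0.03% of the time. Darcy’s data are ordinary; it is her memory of them that is selective.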
Trust the data – when the hypothesis is plausible
Relying purely on our human recollection is clearly fraught and, pushing my own data-driven biases, I would recommend basing decisions only on data to remove human fallibility from the process. However, data is not always correct. Information that we collect and then analyse is necessarily a summary and not the whole story. The record may not be long enough, our equipment may not be sensitive enough, or there may be important gaps in the data record. My basic approach is to think about data analysis as a hypothesis testing process: first use your human experience to generate hypotheses or theories based on some sensible mechanism, then use the data to test them. The key point is to hypothesise first and then test, not the other way around, where we go fishing in a dataset looking for relationships and then try to devise a hypothesis to suit the relationships that we find.
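To put a number on the danger of fishing: if every relationship you test has a 5% chance of looking significant purely by chance, the odds of at least one false alarm grow quickly with the number of relationships you try. The sketch below assumes the tests are independent, which is rarely exactly true, and the function name is invented for this example.

```python
def chance_of_false_alarm(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    looks significant by chance alone when no real effect exists."""
    return 1 - (1 - alpha) ** num_tests


# How the risk grows as we trawl through more and more relationships.
for num_tests in (1, 5, 20, 100):
    print(f"{num_tests:>3} tests: {chance_of_false_alarm(num_tests):.0%} chance of a false alarm")
```

With 20 relationships tested, the chance of finding at least one spurious ‘significant’ result is roughly 64%. Starting from a single plausible hypothesis keeps that risk at the 5% you signed up for.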
To really get a picture of whether it will rain on your birthday, consult your memory and then rely on the data (or move to the desert).
If you are ready to create some data visualisations, give Truii a test (sign up at the top of the page; it’s free).
Lenth, R. V. (2001) Some Practical Guidelines for Effective Sample Size Determination. The American Statistician, Vol. 55, No. 3, pp. 187-193.