Tidy Data: 80% of your data analysis is data cleaning
According to a recent paper by Hadley Wickham 80% of your data analysis effort is spent on data cleaning. Data cleaning or creating ‘tidy data’ is a vital first step of your data analysis and is often repeated many times over as data is updated or the analysis is revised. Data tidying encompasses all those time consuming tasks like outlier checking, correcting dates, creating regular timesteps, changing timesteps, infilling missing values, aggregating and appending multiple data sets to create a single set.
Truii includes a large collection of data wrangling functions to help you get your data in order. But that is only part of the solution. Like all good housekeeping the secret is in having a good system in place so that you don’t have to reinvent the data handling process for each new dataset. The trick with having a data handling system is to keep it simple so that it can be as flexible as possible. We have embraced this philosophy at Truii. Columns are king. Our approach to standardizing the way we handle, store and analyse data is that each variable should have its own column and each row is an observation. This simple approach allows you to cast your eye across each row (observation) to get values that relate to each other. For example, a data entry in the first column can be related to several other variables (columns) that were collected on the same day (across the column).
This simple approach to storing your data in columns seems self-evident and obvious, but it is surprising how often we come across novel data structures. The main reason behind the creation of other novel data structures is that the column approach is inefficient because you end up with lots of white space in your table where there are no concurrent observations. This can be a problem with some analysis packages that don’t handling missing values well. Database size can be an issue for very large datasets so we try and make them dense. However I think that the main reason that we avoid tidy but sparse data structures is our sense of the look of the data – it is unappealing to look at a sparse table of data – it looks inefficient and untidy and can be hard to be human readable.
Two tips (from Wickham) to help keep your data tidy
1) Column headers are values not variable names.
For this example there are three variables – religion, income and frequency or count. A tidy version of this data set creates a long dataset with three columns.
2) One variable per column.
Multiple variables are stored in a single column. We often create codes to store pieces of related information as a single value. For example age and sex might look like; m18, f22 (male 18 years, female 22 years). These should simply be stored as two separate columns.
Don’t forget to sign up to Truii’s news and posts (form on the right).