Databases are like dentists; we know their value but still avoid them when we can. For small companies and project teams, building and maintaining databases is like a trip to the dentist. We all know we should, the long term benefits far outweigh the discomfort but we all put it off.
I recently watched the classic Tom Hanks’ film, ‘Castaway’. In the opening scenes Tom Hanks’ character ‘Chuck’ winces due to a sore tooth, but doesn’t get around to going to the dentist. Anyone who has seen the movie (even if it was in 1995), will still cringe at the memory of the scene later in the film where Chuck is stuck on a deserted island, his dental problems are giving him grief and he has to resort to conducting his own dental work with the blade of an Ice Skate, arghhhh!
I have several data project experiences which feel a bit like this. The general project flow goes something like this…
- It’s only a few data files, I will just store them locally and email the team with updates.
- The data file collection grows,
- Several project team members start modifying the data for different analysis,
- A new data update comes in and nobody wants to redo the 20 hours of data manipulation to the updated raw data (even if they could remember what they did).
- We start looking for volunteers to conduct ‘Ice Skate surgery’ on the data collection.
We reflect on how useful a pre-emptive data strategy would have been.
Why use a database?
Consider the reasons for using a database to store environmental data;
Data is secure: databases allow data versioning, so your important and expensive data is maintained in a traceable state; the data lineage is known. This is really valuable for data collected for environmental licensing where it is important to demonstrate the lineage of the data (who did what to the data when).
Single point of truth: this builds on the data security issue which is important. When you have a database, you have a point of truth for stored data, the database version is the source of truth when all the analysed versions get out of date or become corrupted you have somewhere to turn.
Data access is controlled: who can load data, who can modify it and who can download or view the data. Databases are designed to allow this type of user management as core functionality.
Storage efficiency: One of the basic features of a relational database is to minimise the repetition of the data stored. For example, you may have a site location which requires 10 fields to describe it (Latitude, longitude, street number, street name, etc ). If you store this along with every value that you collect at the site, your data collection can become huge very quickly.
For small project data-sets this is not really an issue with current computer speeds and storage systems. For large data collections a well-designed relational database can save hard-drive space and data retrieval time. However in my view if you have less than about 10 Million data points, then a simple but not optimal database (fewer tables) trumps an efficient (i.e. complicated and inflexible) database structure every time. For most environmental projects a two or three table database has worked for me (one or two for all the data, and the others for meta data that apply across multiple records such as site details and collector details).
Powerful data summaries: Your entire data is in one place and you can run queries to extract really useful subsets of the data (‘for example, the query is to get the sites when the salinity level is above 1000ppm and the water temperature is above 28 Degrees Celsius and we have an observation fish species X’).
Why databases are annoying
Databases are designed and built by database engineers. The result is often a very specific, inflexible and frustrating tool (albeit you have efficiently stored data). You know the result, you are forced to enter data in some non-intuitive format or alien structure (‘why does this have to be an integer and not a double, I don’t have any data for this field but I cannot move on until I enter one!!’).
Project databases are unique, which reflects a steep learning curve every time you use it. The project level database is uniquely designed to efficiently store and retrieve data which is specific to a project. The underlying table structure is unique to the project and you cannot really get a handle on how you should use the database without learning the underlying structure (this table is linked to that table via this key, and to that table by some other key). No sooner to you work out how to use the database interface than you temporarily move on to some other work and then have to relearn the structure when you return.
Database query syntax is hard to remember. Querying databases to extract useful data subsets is part of their power. But every time you want to construct a query you have to remind yourself about the specific syntax required.
Databases are inflexible: Data structures are rarely fixed at the start of the project so modifying the database structure on the fly is almost inevitable. ‘we need another data field, can someone remind that database guy when he gets back from holidays and get him to modify the table structure and make sure we can still load the old database file.’
The claytons database
So what I want is the benefits of a database (security, data lineage, data queries) without the discomfort (unique, inflexible, hard to use). I don’t want my project team to have to be database engineers I just want them to be able to interact with the data so they can get on with the project. I just want a file management system (like windows explorer) but it should just know what different data looks like and automatically allow me to do data stuff with it (data wrangling and data analysis) without forcing me to tick lots of different fields of reformat the data.
That is why we created Truii. Truii is your own project database, but looks and feels like a file management system. The only real trick that you need to know for Truii is to store your data in columns. Truii will do the rest.
Back to the ‘Castaway’ film: As a final irony, Tom Hanks eventually escapes from his deserted island only to return home and find that his partner has left him for a dentist! I’m sure there is a poignant database analogy here somewhere.
If you enjoyed this post, don’t forget to sign up to the blog – form on the right.