Spreadsheets: love ‘em or hate ‘em, they’re ubiquitous.
Think of an office job and the first thing that comes to mind is probably a spreadsheet. For most people, the words “data” and “spreadsheet” are synonymous.
Maybe you’ve moved on from spreadsheets and work with Parquet, Python, and Spark clusters, but you almost certainly work with others for whom the spreadsheet is the only way they know how to work with data.
And let’s face it, when we talk about spreadsheets we’re mostly talking about Microsoft Excel. Like most people I have a love/hate relationship with Excel: it democratises data work like nothing else and it’s an incredibly versatile tool, but it’s precisely because it’s so versatile that it’s misused and makes life harder for data practitioners.
So here are 5 common mistakes I’ve encountered with spreadsheet use, why these things are so problematic, and what to do instead.
Even if you know this stuff already, I’m sure you know at least one colleague who doesn’t - so why not share this article with them?
Use spreadsheets to store data
One of the best articles I’ve ever read about the appropriate way to think about spreadsheets is this piece by Andy Youell. It’s super short so I recommend you go have a read, but in essence Andy says that most problems with spreadsheets come from the fact that people use them as systems.
What does he mean by that? Well, in broad terms you can think of data work as divided across three layers: the data later, the business layer, and the presentation layer.
The data layer is just the storage of raw data, nothing more. It doesn’t look pretty and it’s not supposed to: it’s simply a collection of text and numbers.
The business layer is where we act on the data. We can re-shape the data, transform it, combine it with other data, extend the data based on the existing data, and so forth. Most people use Excel for this.
Finally, the presentation layer is where we visualise the data. This involves creating graphs, highlighting with colours, using different font styles, arranging the tables to look visually appealing and so on. This is also a common use of Excel, whether the spreadsheet is for one-time use or if it’s to be shared the temptation is always there to visually enhance the data.
Spreadsheets excel at the data layer, but are extremely problematic when it comes to the business layer and the presentation layer. The reason is that most people don’t recognise the separation of these layers in the first place, something Andy describes as a “collision of design paradigms.”
By recognising the separation of layers, we can design and manage each layer according to the needs of the consumer of each layer.
In my view, spreadsheets should be used purely in the data layer, and specialised tools should be used for the business layer and the presentation layer. In other words, I think you should use spreadsheets to store data, and better tools to work with that data.
The rest of this article shows you how to store the data in a spreadsheet correctly.
Keep your data tidy
Happy families are all alike; every unhappy family is unhappy in its own way
>
Leo Tolstoy
The tidy data framework originates with Hadley Wickham, Chief Data Scientist at Posit. “Like families”, he wrote, “tidy datasets are all alike but every messy dataset is messy in its own way.”
What does this mean in practice?
Well, imagine that three of your colleagues are responsible for preparing some data for you to work on each month. Without any guidance, it’s very likely that the spreadsheets won’t be laid out in the same way. This is especially true if they’re using software such as Excel, which provides the user with lots of options for formatting what they see on the screen.
The problem with this is you now have to take three different approaches to three different spreadsheets, and you won’t know in advance which approach to take. There’s no pattern and predictability to the data you’re receiving, and this increases your workload as a data practitioner.
Data should be organised in the simplest way possible. Hadley Wickham offers three rules:
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
Messy data is any other arrangement of the data.
It’s not always completely obvious if a dataset is tidy or messy, but if you’re ever unsure then just ask yourself: am I keeping things as simple as possible? If the answer is “no”, then your data is probably messy.
Don’t use colours on your data
We’ve all seen it: spreadsheets highlighted with all the colours of the rainbow. You might have even been tempted to do this yourself, and Excel makes it especially easy to change font colour or add a dash of yellow highlighting to some cells you want to draw attention to.
Stop. Step away from the colour palette.
Why is highlighting and using different font colours a bad idea? Well, think about it: how is anyone else going to understand what the colours mean?
When you use colour, you’re encoding meaning into the data. But without some form of legend, that meaning isn’t apparent to anyone else. But legends should be reserved for graphs, not for the raw data.
There is an exception to this: if you’re sharing the data as a visual, then by all means use some colour highlighting. What I mean here is that if the data is being presented as-is, then sometimes you can use colour to draw attention to something, but you still want to ensure that the reason you’re using colour is obvious. This usually means that you’re going to be presenting this data to someone else, so you can talk them through the meaning while you’re together.
But if we aren’t supposed to use colour in the raw data, then how do we encode important information? For example, what if I want to highlight a subset of the data because it’s of more importance to my purpose than the rest of the data? Well the answer here is simple: make a new column!
If you want to flag some of the data, then create a flag as data. For example, if you have personnel data and rows 50 to 200 are full-time employees whilst rows 2 to 49 and 201 to 250 are part-time employees, then add a new column called something like full_time and add TRUE or FALSE into the relevant rows. Now you’ve encoded the same information as you would have if you’d used highlighting, but the meaning is clear to anyone else who uses this data.
One last nerdy point on this: if you use colour in your spreadsheet and pass that to a data analyst, there’s a decent chance that they’ll use a programming language to inspect the data and they won’t even see your pretty colours in the first place.
Choose good names
There are only two hard things in Computer Science: cache invalidation and naming things.
>
Phil Karlton
I get it, naming things is hard and it’s easy to spend far too long on this. But if I could only give you one piece of advice about naming things it’s this: don’t use any spaces in your column names or in the name of the file itself.
Why does this matter? Well, if your data is going to be analysed by a programmer then they will need to wrap everything “in quotation marks” if you’ve included spaces, rather than just referencing things like in_quotation_marks. See the difference?
But this isn’t just about making things easy for the boffins. Good names are also useful to your other colleagues (and your future self) if you keep them short (but not too short) and clear: temp_celsius is better than temperature_in_degrees_celsius and anything is better than “Temperature (C)”.
One last tip about naming: if you find that it’s too hard to think of a name for the data in a given column, then maybe you need to reconsider the data you’ve stored in that column. Coming back to our earlier point on tidy data: can you make it simpler by splitting out the data across more than one column? So instead of one column for sex and age range (“m_20_to_30”), maybe use one column for sex and another for age.
Save it as a plain text file
Another way of putting this is: don’t use Excel at all, but if you must then save your file as a CSV as well.
I know I’m stating the obvious here, but Microsoft Excel isn’t freely available. You may get it via your work’s enterprise license, or maybe you get it at a discount because you work for a nonprofit or a university or something, but ultimately money is changing hands for you to use Excel. Microsoft is a for-profit company (at the time of writing they're one of three trillion-dollar companies in the world), and they own the product.
This fact has a few implications:
- What you can and cannot do with Excel is determined by Microsoft and Microsoft alone.
- If Microsoft decides to change something about the way Excel works and it breaks one of your spreadsheets, then tough luck.
- If your programmer colleague wants to read the data you’ve sent them as an Excel file, they’ll need special software to do that.
- Excel may be making unexpected changes to your data without your knowledge and this has caused all sorts of problems for people.
I’m a big fan of open source tools, which basically means source code that’s made freely available for possible modification and redistribution. The great thing about open source is that it’s maintained and developed by a community of volunteers (although sometimes organisations will get involved too, as many of them use open source tools and it’s in their interests to maintain them). This means that anyone can contribute, and as a user you’re free to modify the code however you want.
Open source data tools generally play well together, and CSV is an open file format because it’s just a plain text file. So, if you send a CSV file to a data analyst who uses something like R or Python, then it’s very easy for them to work with that data. And anyone who doesn’t have access to Excel can use the CSV file too by opening it in any number of open-source (i.e. free) alternatives to Excel.
So yes, I’m ending with a plea to you to just ditch Excel altogether. But if that’s too much of a leap then at the very least just use it to store data, keep your data tidy, avoid the colours, use good names for your columns and the file name, and save a copy as a plain-text file.
If you’d like some further guidance on spreadsheet best practice then check out the paper that inspired this post. And good luck on your journey from spreadsheet hell to data nirvana my friend - you can do this!