From the course: Data Science Foundations: Fundamentals

Data preparation

- [Instructor] Anybody who's cooked knows how time-consuming food prep can be. And that doesn't say anything about actually going to the market, finding the ingredients, putting things together in bowls and sorting them, let alone cooking the food. It turns out there's a similar kind of thing that happens in data science, and that's the data preparation part. The rule of thumb is that 80% of the time on any data science project is typically spent just getting the data ready. So data preparation, 80%, and everything else falls into about 20%. That can seem massively inefficient, and you may wonder what your motivation is to go through something that's so time-consuming and, really, such drudgery. Well, if you want it in one phrase, it's GIGO, that is, garbage in, garbage out. That's a truism from computer science. The information you're going to get from your analysis is only as good as the information that you put into it. And to put it in starker terms, there's a wonderful phrase from Twitter: most people who think they want machine learning or AI really just need linear regression on cleaned-up data. Linear regression is a very basic, simple, and useful procedure. And it tells you, as a rule of thumb, that if your data is properly prepared, then the analysis can be quick, clean, and easy to interpret.

Now, when it comes to data preparation in data science, one of the most common phrases you'll hear is tidy data. It seems a little bit silly, but the concept comes from data scientist Hadley Wickham, and it refers to a way of setting up your data so it can be easily imported into a program and easily organized and manipulated. It revolves around some very basic principles. Number one, each column in your file is equivalent to a variable, and each row in your file is the same thing as a case or observation. Also, you should have one sheet per file. If you have an Excel workbook, you know you can have lots of different sheets in it, but a CSV file has only one sheet. And each file should have just one level of observation. So you might have a sheet on orders, another one on SKUs, another one on individual clients, another one on companies, and so on. If you do this, then it makes it very easy to import the data and get the program up and running.

Now, this stuff may seem really obvious, and you might ask why we even have to explain it. It's because data in spreadsheets frequently is not tidy. You have things like titles, and you have images and figures and graphs, and you have merged cells, and you have color used to indicate some data value, or you have sub-tables within the sheet, or you have summary values, or you have comments and notes that may actually contain important data. All of that can be useful if you're never going beyond that particular spreadsheet. But if you're trying to take it into another program, all of that gets in the way. And then there are other problems that show up in any kind of data. For instance, do you actually know what the variable and value labels are? Do you know what the name of this variable means, 'cause sometimes they're cryptic? Or what does a three on employment status mean? Do you have missing values where you should have data? Do you have misspelled text? If people are writing down the name of the town they live in or the company they work for, they could write that a nearly infinite number of ways.
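To make a few of those cleanup problems concrete, here's a minimal sketch in Python with pandas. The data, the column names, and the meaning of the employment status codes are all hypothetical, just for illustration; in practice, the codes would come from your codebook.

```python
import pandas as pd

# Hypothetical survey data with the kinds of problems just described:
# a cryptic coded variable, misspelled town names, and missing values.
df = pd.DataFrame({
    "emp_status": [1, 2, 3, 3, 1],                     # what does 3 mean?
    "town":       ["Springfield", "springfeild", "SPRINGFIELD ",
                   "Shelbyville", None],
    "income":     [52000, 48000, None, 61000, 57000],
})

# Give the cryptic codes readable value labels (this mapping is an
# assumption here -- in practice it comes from the codebook).
df["emp_status"] = df["emp_status"].map(
    {1: "employed", 2: "unemployed", 3: "retired"})

# Standardize free-text town names: trim whitespace, fix the case,
# and map known misspellings onto one canonical form.
df["town"] = (df["town"].str.strip().str.title()
                        .replace({"Springfeild": "Springfield"}))

# Make the missing values explicit so later analyses can handle them.
print(df.isna().sum())
```

And going back to Wickham's tidy data principles, here's what the reshaping step might look like, again with made-up data. A spreadsheet often spreads one variable, like the year, across the column headers; melting the table puts each variable in its own column and each observation in its own row.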
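```python
import pandas as pd

# An untidy layout you often see in spreadsheets: one row per client,
# one column per year -- the year is data, but it lives in the headers.
untidy = pd.DataFrame({
    "client": ["Acme", "Globex"],
    "2022":   [140, 95],
    "2023":   [152, 101],
})

# Tidy form: each column is a variable (client, year, orders) and
# each row is one observation.
tidy = untidy.melt(id_vars="client", var_name="year", value_name="orders")
print(tidy)
#    client  year  orders
# 0    Acme  2022     140
# 1  Globex  2022      95
# 2    Acme  2023     152
# 3  Globex  2023     101
```

Once the data is in that shape, grouping, filtering, and plotting all become routine one-liners, which is exactly the payoff the tidy data principles are after.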
There are still other problems. In a spreadsheet, it's not uncommon for numbers to accidentally be represented as text, and then you can't do numerical manipulations with them. Then there's the question of what to do with outliers. And then there's metadata: things like where did the data come from, who's in the sample, how was it processed? All of this is information you need in order to have a clean data set, one where you know the context and the circumstances well enough to analyze it.

And that's to say nothing of trying to get data out of things like scanned PDFs or printed tables or printed graphs, all of which require either a lot of manual transcription or a lot of very fancy coding. Even take something as simple as emojis, which are now a significant and meaningful piece of communication, especially in social media. Take the rolling on the floor laughing emoji: there are at least 17 different ways of coding it digitally. Here are a few of them. And if you're going to be using this as information, you need to prepare your data to code all of these variants in one single way, as in the sketch at the end of this section, so that you can then look at the summaries all together and try to get some meaning out of them.

I know it's a lot of work, but just like food prep is a necessary step to get something beautiful and delicious, data prep is a necessary, vital step to get something meaningful and actionable out of your data. So give it the time and the attention it deserves. You'll be richly rewarded.
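As promised, here's a minimal sketch of what normalizing emoji encodings might look like in Python. The variant list is illustrative, not the full set of encodings, and the shortcodes are just examples of what different platforms might send.

```python
import unicodedata

# Canonical codepoint for ROLLING ON THE FLOOR LAUGHING (U+1F923).
CANONICAL = "\U0001F923"

# A few illustrative variants as they might arrive from different
# platforms -- the real list of encodings is longer; this mapping
# is a sketch, not an exhaustive table.
VARIANTS = {
    ":rofl:": CANONICAL,                           # chat shortcode
    ":rolling_on_the_floor_laughing:": CANONICAL,  # longer shortcode
    "\U0001F923\uFE0F": CANONICAL,                 # emoji + variation selector
}

def normalize_emoji(text: str) -> str:
    # Map known variants onto one canonical form, then apply Unicode NFC
    # normalization so equivalent character sequences compare equal.
    for variant, canonical in VARIANTS.items():
        text = text.replace(variant, canonical)
    return unicodedata.normalize("NFC", text)

print(normalize_emoji("that was hilarious :rofl:"))  # -> that was hilarious 🤣
```

In a real pipeline you'd typically lean on a dedicated emoji library rather than a hand-built table, but the principle is the same: pick one canonical representation and map everything onto it before you count or summarize.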
