From the course: Introduction to Career Skills in Data Analytics

Describing data best practices

- [Narrator] Do you have an approach to data? Have you ever really thought about it? After years of working with data for projects and ad hoc reporting, I've built a pretty defined approach to every data set that I work with. There are just some things I do with every data set. The process may be a little different based on the software I'm working with, but in this example, I'm using Microsoft Excel. This transactions file has been exported from software that we use to analyze our transactions. Normally, when we're working on an ad hoc report or a project, we have an expectation of what we're going to deliver. But to show you that this approach will work with any data set, I don't have an end goal in mind. I just want to learn about this data set. If I take a little time upfront to learn more about it, I'll be better off when I start trying to meet the end goal of the project.

Excel will sort, filter, and perform data commands on what it sees as the data set, and that's a key point: what Excel sees as the data set. So the very first thing I want to do is confirm that the data I'm working with in the transactions list is entirely recognized by Excel as one data set, meaning that there are no breaks in the data. I do this by using one of my favorite shortcuts, which selects all the data that Excel sees in the range: Ctrl+A. That's not enough, though, because this is a lot of data. It looks like it picked it all up, but if I zoom out, I notice pretty quickly that I have a broken data set. You see, all of column Z is empty. That means Excel will only sort and filter everything to the left of it. To fix this data set, I can right-click column Z and delete it. Okay, let's do that shortcut again. I'll do Ctrl+A, and now I have a fully intact data set that Excel will recognize. This makes it easier for me to sort, filter, and run all sorts of data commands. Okay, let me do Ctrl+Home to go up to cell A1.

Before I go any further, one of the very first things I'll do when working with a data set is make a copy of it. So I'm going to take my mouse and put it on the Transactions sheet tab at the bottom, hold my Ctrl key, and drag and drop it one step to the right. Now, it's important that I let go of my mouse first and then let go of Ctrl. That makes a copy. Okay, I'll rename it to Working Copy. That way, if I mess up, I can always go back to the original transactions list.

Okay, let's take a deeper look at this data. When I see fields named ID, like Transaction ID, that's database language for key fields. Let's see how many of those we have. I'm going to hit Select All, which selects the entire sheet, and double-click between the A and B column headers, which auto-sizes all of the columns. So I'm looking at Transaction ID, Product ID, and Reference Order ID. These are key fields, and that automatically makes me wonder: are there duplicates in this data set? Let me highlight the Transaction ID column, because that's what I really need to be unique, and I want to spot the duplicates, if they exist, before I deal with them. I'll go to Conditional Formatting, choose Highlight Cells Rules, and choose Duplicate Values. I'll go ahead and make them a light red fill and click OK. Looking at the data, I immediately see some duplicated values.
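Everything up to this point is point-and-click in Excel, but the same checks translate to code. The sketch below uses Python with pandas, which is not part of the course demo, to mirror the steps so far: drop a fully empty column that breaks the contiguous range, keep an untouched working copy, and flag rows that share a Transaction ID. The file name and column header are assumptions for illustration, not taken from the course file.

```python
import pandas as pd

# Hypothetical file and column names, assumed for illustration.
df = pd.read_excel("transactions.xlsx")

# A fully empty column (like column Z in the demo) breaks the contiguous
# range that Excel treats as one data set, so find and drop such columns.
empty_cols = df.columns[df.isna().all()]
print("Empty columns:", list(empty_cols))
df = df.dropna(axis=1, how="all")

# Keep an untouched working copy, like duplicating the sheet tab.
working = df.copy()

# Flag every row whose Transaction ID appears more than once,
# the code equivalent of the conditional-formatting highlight.
dupes = working[working.duplicated(subset="Transaction ID", keep=False)]
print(f"{len(dupes)} rows share a Transaction ID with another row")
```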
Back in Excel, that red highlighting means I have duplicates in this data set. If I were to total it up or count the records, I would get inflated numbers, so I need to address these duplicates. Let me do Ctrl+Home to go back up to A1. It's easy to deal with duplicates when you know which fields to choose. What makes this a duplicate transaction is the fact that the Transaction ID is duplicated, and I see them all highlighted in red. It's a little more obvious now that we know duplicates exist, but in a sea of data they can be hard to find.

Okay, let's remove the duplicates. Now, this command will actually remove them, but that's okay, I have my copy here. I'll go to Data and choose Remove Duplicates. I'll choose Unselect All for this example, then select Transaction ID and click OK. It tells me that it found a ton of duplicates and that it's only going to leave me 1,228 records that are unique. Perfect. I'll go ahead and click OK.

Now I have a data set with integrity: no blank rows or empty columns, no duplicates because I've removed them, and a working copy so that I can continue to explore this data. This is in no way a comprehensive list of approaches. These are just techniques that, when you start working with Excel data, you might want to apply to every data set.
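As a closing aside, the remove-duplicates step also translates to code. This is a minimal pandas sketch, not the course's method, and it carries over the assumed file name and column header from the earlier sketch.

```python
import pandas as pd

# Same assumed file and column name as the earlier sketch.
working = pd.read_excel("transactions.xlsx")

# Keep only the first row for each Transaction ID, mirroring Excel's
# Remove Duplicates command keyed on that single column.
deduped = working.drop_duplicates(subset="Transaction ID", keep="first")

print(f"{len(working) - len(deduped)} duplicate rows removed, "
      f"{len(deduped)} unique records remain")
```

Keying the deduplication on Transaction ID alone matches the narrator's choice in Remove Duplicates; passing more columns to subset would only treat rows as duplicates when every selected field matches.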
