From the course: Excel: Managing and Analyzing Data

The importance of checking data quality - Microsoft Excel Tutorial

From the course: Excel: Managing and Analyzing Data

The importance of checking data quality

- [Instructor] Let's talk about the importance of checking data quality. This conversation needs to be had more. So, let's say you finally got the data and you're ready to start doing some analysis and passing it off to colleagues and bosses. You so excited, and you just learned pivot tables and you just learned how to unpivot in power query, and I am going to say, stop, look. Data can fall anywhere between trustworthy and worthless. Too many people assume that the data that they get is trustworthy and they dive in, start doing analysis and inspiring decisions, but they haven't assessed where they are on this spectrum. So, the first question you should ask yourself is, can I trust this data? Why? Because there are people who are impacted by what we do with our data. They're impacted by our data management and data analysis. And yes, there are things like billing can be inaccurate. Inventory data can be inaccurate, but I believe bigger than that is the people who suffer those consequences, the person who ordered something from a website that had the wrong data, and now they're sitting back waiting for something that's not coming. Or how about paying a salesperson too much, and then, having to go back and tell 'em, you know, we made a mistake and so that extra, mmm, it's going to be coming out of your future commissions. No, we have to stop and check data quality, and yes, mistakes do happen, and we own our mistakes 100%, but prevent what you can by really thinking data quality. Now, let's look at a few things that corrupt data quality. Here, we have a list of cities. We have inconsistencies. There's NYC, there's New York NY, NY, and if you did analysis on that, this can cause problems for you because you'll have data tied to NYC, data tied to New York and data tied to NY, NY. You got to check this kind of thing out, the inconsistencies. Look at the guests. We have Rick McCray, Celia Silva and James Fielder's name is inconsistent with the others. Oh, and we have two Celia Silvas and a family of S. Archer, and then just plain Bob. This has got to be handled. And sometimes you can clean things up, like with James Fielder, you can clean that up, but then family of S. Archer and just plain Bob, you might have to set those aside while you deal with everything else, and then come back to those. Now, column F, we have name, birthday, and occupation all in the same cells. That would need to be split out. You might want to sort by the birthdays, or maybe you don't want the birthdays at all and need to dig 'em out and then delete 'em. And then, look, Dale Smith, no birthday, no occupation. And then, Vanessa Green has a birthday and no occupation. So, you have to decide what you're trying to do, what you need, what you don't need and split this apart. Let's go down, look at these start dates. We've got different types of formatting. There's 12. 17. 17. We can assume that's December 17th, but then we've got 30/06/15, and that raises questions about whether 9/6/22 is September 6th or June 9th. Again, you've got to be aware of this and clean up what you can. And let's look at one more example. We've got first name, last name, IDs, and addresses. Scroll down, okay? That's what we have. We're missing a lot of IDs. Also, the ID for Vicki Lane is one character short. So, you have to be really clear about what you're trying to do, what you need, if you have enough of what you need, and if you do, if you've decided that the data is clean enough and complete enough, then yes, go ahead. Go forward with your analysis, but always think data quality. Where is it on the spectrum? Is it trustworthy or is it worthless?

Contents