From the course: Introduction to Career Skills in Data Analytics

Assessing and adapting the data for transformation

From the course: Introduction to Career Skills in Data Analytics

Assessing and adapting the data for transformation

- [Instructor] Have you ever heard of data profiling? It's where we create a high-level profile of the characteristics of the data that we're working with. We should apply this approach to every data set. The greatest thing about profiling data is that when we use this approach, we get to learn about the data we're working with at a high level. Profiling helps to inform us on some pretty valuable items. It tells us how much data we have in this set. It can also tell us what the totals counts or averages of any number may be. This helps us validate our numbers later. It can also inform us about the data cleaning we will need to complete when we get ready to transform our data. I have some sales order data here, and I want to profile this data to help me get started working towards a report on sales orders. I'll first start by profiling the amount of data. I want to take a look at the record counts. How many records do I have in this data set? To do this, I can click on column A, and use the auto calculate feature on the bottom right-hand side of my screen. Now I have all of the auto calculate functions turned on. To do that, I just right-click the auto calculate area and then I can select each one of the options that I need. Okay, great. So when I look at this, I can see there's a count and a numerical count. So count will count everything I have highlighted and the numerical count will only count the numbers. So if I look at this record set, I actually have 3,500 records that represent the sales orders. We can also use some in average. Let's take a look at how much money is actually represented in this record set based on total due. I'll highlight column L. And this tells me I have approximately $33,700,000 worth of money represented in the total due column. It also tells me that my average is $9,633. Let's look at the average of the subtotal. This is the money before tax and freight. So the average sub total in this data set is $8,581. And the total is around 30 million. This tells me that if I see numbers like 60 million or 66 million, I have a problem in my data. So knowing how much it would total is important for validation later. Data profiling is so easy to do, but this is just the starting point of what you'll learn to profile your data. Remember, it will also help us inform our data cleaning. Take a look at columns, B, C, and D with me. These are order dates, but they look like zeros. If I click on B2, I can see that there is actually a date included. Just can't see it based on the formatting. Also for the purposes of my reporting, I don't need those timestamps. They're all set to midnight anyway. So this informs me that on my data cleaning process, I'll need to address the dates. There are additional profiling options that we will uncover as we explore deeper into our data and with other tools, but everyone with the data set and Excel, can use these options to profile their data.

Contents