From the course: Introduction to Career Skills in Data Analytics

Common cleaning and transformation

From the course: Introduction to Career Skills in Data Analytics

Common cleaning and transformation

- When building your cleaning and transformation toolbox, there's some common cleaning and transformation items you will use. Others will be more specific to the deeds of the data you work with. Let's start with general cleaning. Spaces are invisible to the eye, but in fact, they're characters. And when a field has extra spaces, you will want to clean those by removing them. There are leading spaces which are spaces that are at the front of the field. There are trailing spaces which are at the end of the field. When we want to remove either leading or trailing spaces, then we can use functions like trim or clean. The act of breaking out text is referred to as parsing text. And we can do this with any type of delimiter and every program handles this a little bit differently, but the outcome is the same. Spaces will also serve as a delimiter, like the spaces between words are valid spaces. Imagine first name and last name. In the case we want to have both last and first in their own individual columns for sorting, as an example, we will use the space to break those columns. This is not the only time we parse texts using delimiters. You might break apart text fields based on things like a dash or even a comma. We use things like text-to-columns, split by delimiter and functions like left, right and mid to work with parsing text. We don't only break apart text. There's also times when we need to combine text fields together. This is commonly known as concatenate or concat. We also replace text with valid text. For example, if someone enters an abbreviation of a state in the United States, but we want the full state spelled out, we might replace that text with the valid response. It could be a misspelling that we're correcting. There are several methods for replacing invalid data with valid data. We also change the case of text. Example would be maybe we need everything to be in uppercase or lowercase or even corrected to proper case. There are functions to do each of these commands, and again, they might differ between programs, but the outcome will be the same. These are very simple commands to perform in any data program. You may find that you'll also remove duplicates from a dataset and this can be done with commands like remove duplicates or using distinct keywords inquiry statements. We also transform data types to be appropriate for what we need to do with the data. You may have date fields that are stored as text, but to work with date-related functions, you need to convert it to an actual date data type. The same goes for numbers. If you need to work with a mathematical function, then the value of the field must be a number data type. These are just a few of the basic commands that we use for cleaning and transformation of data and some of the first ones to understand and master.

Contents