From the course: SQL Tips and Tricks for Data Science

Retrieve data using SQL

- [Instructor] So let's talk about retrieving data from a database now. First, I want to introduce you to the concept of SQL, and we're going to go through what SQL is. Then we'll try to understand why we would use SQL. And lastly, I'll show you how to use it to actually retrieve data from a database.

SQL stands for structured query language, and it's the most universal of all programming languages, and one of the few that has a standard syntax that all databases support. This language was designed to be read as if it were English. And over the years, it has evolved into a complex and expressive language that allows you to manipulate, transform, analyze, and even update or delete data in your database. This language is supported by virtually all databases, even newer big data systems known as NoSQL databases. NoSQL actually stands for "not only SQL"; reading it as "no SQL" is a big misconception in the industry right now. The name really defines what it is.

Now let's take a look at why we would use it. SQL at its most basic is a way to retrieve data from databases, which is what we're going to do here. You enter some code, which we'll do in a second, and you get data back. Databases have been around since computers were created. In fact, you could argue that the main reason computers became popular was their ability to process data much faster than humans. So for nearly as long as computers have existed, so have databases.

SQL also lets you manipulate data. Let's say you want to normalize the values in a column so they all show the full state name instead of an abbreviation. Or maybe you want to replace a blank or null value with a word like unknown. SQL allows you to actually change these values to better conform to the types of questions that you're seeking answers to. Since its creation in the 1970s, SQL has evolved to allow analysts and scientists to get answers to questions that are incredibly complex and sophisticated.
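As a minimal sketch of the two clean-up cases just described, the snippet below runs one query through Python's built-in `sqlite3` module. The `customers` table, its columns, and its rows are hypothetical and exist only for illustration. The query reshapes the values in its output and leaves the stored data untouched; an `UPDATE` statement would be needed to change the table itself.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "CA"), ("Bob", "New York"), ("Cy", None), ("Di", "")],
)

query = """
SELECT name,
       CASE
           WHEN state IS NULL OR state = '' THEN 'unknown'  -- blank or null value
           WHEN state = 'CA' THEN 'California'              -- expand an abbreviation
           WHEN state = 'NY' THEN 'New York'
           ELSE state                                       -- already a full name
       END AS state_name
FROM customers
ORDER BY name
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

A real table would need a full abbreviation-to-name mapping, which is usually kept in a lookup table and joined in, not spelled out in a `CASE` expression.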
One of the advantages of using SQL for these questions is that it runs the code on the database server itself, making it more efficient in most cases compared to running these analyses in a third-party tool, such as R or Tableau. The biggest reason by far, however, is that SQL was adopted as a standard across all databases. This means that unlike other technology platforms that have their own very specific syntax, if you know how to use SQL in one database, then you know how to use SQL on all databases. This is unique, and it makes SQL one of the most valuable languages to know.

Now let's take a look at how to actually do this. We're going to start with the select statement. This is the most fundamental piece of SQL that you can write. And then we're going to get into using column aliases to retrieve only a subset of the columns, and even rename them in the process. And lastly, I'll show you how to limit your results, so you don't retrieve the entire database's worth of values. You only retrieve those that you're looking for. I'm going to switch over to my code editor now, and we'll get started.

So first let's just take a look at the most basic of select statements, select star from a table. A table is like a sheet in Excel that contains the data. If I execute this code here, you'll see that I get all of the results back. I get every column, and all 60,000 plus rows of data. If I wanted to limit that, I could simply add a top clause, select top 1,000, for example, from that same table, and I would get only 1,000 results. Other databases use a different syntax. Instead of saying top 1,000 at the beginning, you use limit 1,000 at the end, and it achieves the same result. It only retrieves the amount of data that you specify.

Now, if instead of retrieving all of the columns I wanted to specify exactly which columns I want, I can use the column names themselves in my select statement.
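The queries on screen are not part of the transcript, so here is a rough sketch of select star with and without a row limit, run through Python's built-in `sqlite3` module. The `orders` table and its five rows are hypothetical stand-ins for the instructor's 60,000-row table. SQLite uses the `LIMIT` form; the `TOP` form, as used by SQL Server, appears only as a comment because SQLite would reject it.

```python
import sqlite3

# Hypothetical in-memory table standing in for the instructor's data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"customer_{i}", i * 10.0) for i in range(1, 6)],
)

# Every column and every row.
all_rows = conn.execute("SELECT * FROM orders").fetchall()

# Only the first 3 rows, using the LIMIT form at the end of the statement.
limited_rows = conn.execute("SELECT * FROM orders LIMIT 3").fetchall()

# SQL Server style equivalent (not valid in SQLite, shown for comparison):
#   SELECT TOP 3 * FROM orders

print(len(all_rows), "rows without a limit")
print(len(limited_rows), "rows with LIMIT 3")
```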
So instead of stating select star, I'll do select, and then the individual column names. When I run this query, specifying the exact column names, you'll see the results are the same. So it gives you the exact same effect.

Now what if instead I wanted to actually change some of those column names, and not retrieve all of them at the same time? I can do so using column aliases. Here I have a query which specifies only the top 1,000 rows, as before, along with the individual columns that I want, and even different names for them. So that way, instead of getting the full column name, I get the name I want as the column header. When I run this piece of code, you can see that I have a limited result set. The column names have been changed, and the results have been limited to the thousand rows that I specified.
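A minimal sketch of the final query described above, again using Python's built-in `sqlite3` module: it selects a subset of columns, renames them with `AS` aliases, and limits the row count. The `orders` table, its columns, and the alias names are hypothetical, and `LIMIT` stands in for the `TOP` clause the instructor uses.

```python
import sqlite3

# Hypothetical in-memory table; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, f"customer_{i}", i * 10.0) for i in range(1, 6)],
)

# Two of the three columns, each renamed with an alias, limited to 2 rows.
cursor = conn.execute(
    """
    SELECT order_id      AS id,
           customer_name AS customer
    FROM orders
    LIMIT 2
    """
)

# cursor.description holds one entry per result column; item 0 is its name.
headers = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(headers)
for row in rows:
    print(row)
```

The aliases change only the column headers of the result; the column names stored in the table stay the same.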
