From the course: Generative AI: Working with Large Language Models

GPT-3

- [Instructor] GPT-3 is probably the most well-known large language model. Let's take a look at what the letters GPT represent in turn. G is for generative, as we're predicting a future token given past tokens. P is for pre-trained, as it's trained on a large corpus of data, including English Wikipedia, among several other datasets. This involves significant compute time and cost. And finally, the T corresponds to transformer, and GPT-3 uses the decoder portion of the transformer architecture. GPT-3's objective is simple: given the preceding tokens, it needs to predict the next token. Models like this are known as causal or autoregressive language models. This is very similar to how predictive text works on your phone. So if you type roses, the next suggested word is likely to be are, followed by red, and so on.

GPT-3 was trained on the English Wikipedia, which is around 2 1/2 billion words, as well as Common Crawl, WebText2, Books1, and Books2. Now, Common Crawl is raw webpage data from years of web crawling. WebText is a dataset created by scraping URLs shared on Reddit with a karma score of at least three, and this score serves as a proxy for the quality of the linked content. And the Books1 and Books2 corpora are likely to be the BookCorpus, which is a large collection of free novels written by unpublished authors. Earlier, we looked at how masked language modeling and next sentence prediction were the pre-training tasks for BERT. With GPT-3, the pre-training task was causal language modeling, which means it needs to predict the next word in a text. What this means is that we can train the model in a self-supervised way, and we don't have to annotate our datasets. We can then take all of these humongous datasets and use them to train our model. Additionally, we want to use decoding algorithms such as beam search to give us a balance of coherent language and diversity, so we don't get repeated sentences.
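The next-token idea above can be sketched with a toy model. This is a minimal illustration, not GPT-3's actual machinery: the token probabilities below are made up for the roses-are-red predictive-text example, and the `greedy_decode` helper is hypothetical.

```python
# Toy illustration of causal (autoregressive) language modeling:
# given the preceding tokens, pick the most likely next token.
# These probabilities are invented for the "roses are red" example.
NEXT_TOKEN_PROBS = {
    "roses": {"are": 0.8, "bloom": 0.2},
    "are": {"red": 0.6, "blue": 0.3, "nice": 0.1},
    "red": {"<eos>": 1.0},
}

def greedy_decode(prompt, max_steps=5):
    """Repeatedly append the single most probable next token."""
    tokens = prompt.split()
    for _ in range(max_steps):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:
            break
        next_token = max(dist, key=dist.get)
        if next_token == "<eos>":  # end-of-sequence marker
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(greedy_decode("roses"))  # roses are red
```

Greedy decoding like this always takes the single best token, which is why real systems add strategies such as beam search or sampling to avoid repetitive, degenerate text.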
For a couple of years, the focus of researchers was getting a large corpus of data and then training a language model. Now, if you wanted to use that language model for a specific task, say, sentiment analysis, then you'd need to give it hundreds of examples of sentences that were labeled as either having a positive or negative sentiment, train the model on these sentences and labels, and the model would produce excellent results. Now, let's take an example from the IMDB dataset, which consists of movie reviews that are either positive or negative. So here's the first text: brilliant execution in displaying, once and for all, this time in the venue of politics, and so on. And this is labeled with a 1, which means it's a positive review. The second text goes like this: this piece ain't really worth a comment, it's simply the worst horror movie I've ever seen, and so on. And this is labeled with a 0, which means it's a negative review. Now, if you had a totally different task, like providing some sentences and a couple of questions, where the model had to find the answers to the questions in the sentences, the same model that you used for sentiment analysis would fail miserably. So you'd have to start again with your initial language model and give it hundreds of examples of text, questions, and where the answers are, and it would get good at this task. Now, what makes this different from how you and I operate is that if we have to learn a new task, we can do a reasonable job if we're given a clear set of instructions and just a couple of examples. So the question is, what if we could create a language model that, given a new task and a couple of examples with the expected output, would be able to perform well on that task? And GPT-3 does just that. And what's really remarkable is that the way you interact with these models is with a prompt.
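The supervised setup described above can be sketched as a labeled dataset. This is a hypothetical example of the shape of IMDB-style fine-tuning data; the truncated review texts and the `to_record` helper are illustrative, not from an actual training pipeline.

```python
# Hypothetical supervised fine-tuning data in the IMDB style:
# each record pairs a review with a label, 1 = positive, 0 = negative.
train_examples = [
    ("Brilliant execution in displaying, once and for all, ...", 1),
    ("This piece ain't really worth a comment. It's simply the worst ...", 0),
]

def to_record(text, label):
    """Turn one labeled review into a record a trainer could consume."""
    return {"text": text, "label": label}

dataset = [to_record(text, label) for text, label in train_examples]
```

The key point is that every example needs a human-provided label, and a model fine-tuned on this data is specialized to sentiment, which is exactly why it would fail on an unrelated task like question answering.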
This means that if you sometimes don't get the answer you expect, you can change the prompt, and you might get a different and hopefully better answer. This also means you can provide a task and see how the model performs without providing any examples or expected output. This is an example of zero-shot learning. One-shot learning is where you provide a task and one example with the expected output. And as the name suggests, few-shot learning is giving the model a couple of examples with the expected output. Let's summarize our understanding of GPT-3 in a table. One of GPT-3's primary objectives was few-shot learning. This performed best on the largest of the GPT-3 models, the 175-billion-parameter model. GPT-3 provides an easy way to interact with models. In the next video, we'll look at a couple of examples of working with GPT-3.
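The difference between zero-, one-, and few-shot learning comes down to how many worked examples you place in the prompt. Here's a minimal sketch: the `build_prompt` helper and the Input/Output layout are assumptions for illustration, not an official GPT-3 prompt format.

```python
def build_prompt(task, examples, query):
    """Assemble a zero-, one-, or few-shot prompt.

    task: the instruction string
    examples: list of (input, expected_output) pairs; empty list = zero-shot
    query: the new input the model should complete
    """
    lines = [task]
    for text, label in examples:
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {query}\nOutput:")  # model continues from here
    return "\n\n".join(lines)

task = "Classify the sentiment of the review as positive or negative."
zero_shot = build_prompt(task, [], "What a film!")
few_shot = build_prompt(
    task,
    [("Loved every minute of it.", "positive"),
     ("A total waste of time.", "negative")],
    "What a film!",
)
```

The same model handles all three settings; only the prompt changes, which is what makes this style of interaction so flexible compared to per-task fine-tuning.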
