From the course: Generative AI: Working with Large Language Models

Transfer learning

- [Instructor] One of the most important techniques used in deep learning is transfer learning. It's made up of two components: pretraining and fine-tuning. Your starting point is the model architecture, and all of the parameter weights are random. The model has no knowledge of language, and you then pretrain the model. This pretraining piece is very resource-heavy. You need loads of data, and this could include the entire Wikipedia corpus and a wide range of other corpuses. You also need a lot of compute, normally several hundred to several thousand hardware accelerators, depending on how quickly you want to train your model. These hardware accelerators are usually Nvidia's GPUs or Google's TPUs. At the end of this training, which can take days, weeks, or months, you have a model that has a very good understanding of the language you've trained it on.

Now, fortunately, when the authors of BERT released their paper, they released the model architecture and the weights. This means we can use their pretrained model, which already has a very good understanding of language, as our starting point. We can then go ahead and fine-tune this model for our specific task. In the case of BERT, this fine-tuning task could be text classification, named entity recognition, question answering, and so on. The fine-tuning step involves training our model with labeled data. For example, with sentiment analysis, we provide a whole lot of text examples and label each as either positive or negative. What's interesting is that we can typically get better accuracy by starting with a pretrained model and fine-tuning it on a task such as sentiment analysis than if we trained a model from scratch on sentiment analysis alone. That's because pretraining on large amounts of data produces a model with a very good general understanding of language, which can then be reused for other tasks.

Let's take a look at some of the pretraining tasks for BERT. BERT was fed Wikipedia and the BookCorpus data as input, and words were randomly masked out; BERT then had to predict the most likely candidates for these masked words (a task known as masked language modeling). With next sentence prediction, it had to predict whether one sentence followed the other. So 50% of the time, one sentence did follow the other, and these pairs were labeled as "is next." The other 50% of the time, a random sentence from the corpus was used, and these were labeled as "not next."

Now, let's compare the pretraining for some of the larger models. BERT was trained in 2018, the number of parameters was 109 million, and it took Google 12 days to train it. I've put an asterisk next to the 8x V100s because BERT wasn't actually trained on GPUs, but rather on Google's equivalent, TPUs, or Tensor Processing Units. The size of the dataset used for training was 16 gigabytes, and the number of training tokens was 250 billion. Think of one word as being roughly one and a half tokens. And finally, the data sources that were used to train BERT were Wikipedia and the BookCorpus, the BookCorpus being a large collection of free novels written by unpublished authors.

Now, it's difficult to visualize how many words there are in Wikipedia and the BookCorpus. According to OpenAI's documentation, 1,500 words is approximately equivalent to 2,048 tokens, which means a word is approximately 1.4 tokens. So if we say the average 300-page novel is around 100,000 words, that's roughly 140,000 tokens.
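To sanity-check that arithmetic, here's a small Python sketch. The tokens-per-word ratio, the 100,000-word novel length, and the 250 billion training tokens are just the rough estimates quoted above, not exact figures.

    # Rough back-of-the-envelope conversion from words to tokens,
    # using the approximate ratio quoted above (~1.4 tokens per word).
    TOKENS_PER_WORD = 2048 / 1500   # ~1.37, per OpenAI's rule of thumb
    WORDS_PER_NOVEL = 100_000       # assumed length of a 300-page novel

    tokens_per_novel = WORDS_PER_NOVEL * TOKENS_PER_WORD
    print(f"Tokens per novel: {tokens_per_novel:,.0f}")    # ~137,000 (roughly 140,000)

    bert_training_tokens = 250e9    # 250 billion tokens seen during BERT pretraining
    novels_equivalent = bert_training_tokens / tokens_per_novel
    print(f"Equivalent 300-page novels: {novels_equivalent:,.0f}")   # ~1.8 million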
So to put that in context, when BERT is trained on 250 billion tokens, that's the equivalent of approximately 1.8 million 300-page novels.

Looking again at pretrained models, RoBERTa was developed by Facebook in 2019. The number of parameters was 125 million, and it has a very similar architecture to BERT. Quite amazingly, Facebook managed to train it in a single day, and that's because they used a whopping 1,024 V100 GPUs. The size of the dataset was significantly larger than BERT's: 160 gigabytes, with 2 trillion training tokens. The data sources were Wikipedia and the BookCorpus, as used with BERT, but also the Common Crawl news dataset, OpenWebText, and Common Crawl stories. Common Crawl is raw webpage data collected over years of web crawling, and OpenWebText is a dataset created by scraping URLs shared on Reddit with a score of at least three, which serves as a proxy for the quality of the linked content.

And finally, GPT-3 was released in 2020 by OpenAI. The number of parameters for the largest model was 175 billion. The training time was probably around 34 days, and the infrastructure used was 10,000 V100 GPUs, primarily on Azure. The size of the dataset used for training was 4,500 gigabytes, and the model was trained on 300 billion tokens. The data sources were Wikipedia, Common Crawl, WebText2, Books1, and Books2.

So what are the benefits of transfer learning? Well, it takes much less time to train a fine-tuned model. For BERT, the authors suggest two to four epochs of training for fine-tuning. This is in contrast to the thousands of hours of GPU time required for pretraining. We also don't need another massive dataset to fine-tune for a particular use case. This is in contrast to the large corpuses, such as Wikipedia and others, which are typically hundreds of gigabytes. And finally, we're able to achieve excellent results. We saw this phenomenon when using transfer learning with computer vision several years ago with the ImageNet dataset, and the technique has worked in NLP too.

We've looked at the two components of transfer learning, pretraining and fine-tuning, and why they're such powerful techniques for NLP. Next, we'll take a look at the transformer architecture.
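Before moving on, here's a minimal sketch of what the fine-tuning step described above might look like in code, using the Hugging Face transformers and datasets libraries. This is not code from the course: the bert-base-uncased checkpoint, the IMDB movie-review dataset, and the hyperparameters are illustrative assumptions. The one detail carried over from the discussion is the small number of epochs, in line with the two to four the BERT authors suggest.

    # Minimal fine-tuning sketch: start from a pretrained BERT checkpoint and
    # fine-tune it for binary sentiment classification (positive / negative).
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              TrainingArguments, Trainer)

    # Pretrained weights are the starting point with a general understanding of language.
    checkpoint = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Labeled data for the downstream task -- here the IMDB movie-review dataset
    # (an illustrative choice, not one named in the course).
    dataset = load_dataset("imdb")

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

    tokenized = dataset.map(tokenize, batched=True)

    # Only a few epochs are needed for fine-tuning, in contrast to pretraining.
    args = TrainingArguments(
        output_dir="bert-sentiment",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )

    trainer = Trainer(
        model=model,
        args=args,
        # Small subsets keep this sketch quick to run.
        train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
        eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
    )

    trainer.train()
    print(trainer.evaluate())

The contrast with pretraining is the point: this run touches only a modest labeled dataset for a handful of epochs, rather than hundreds of gigabytes of text and thousands of accelerator-hours.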
