From the course: Generative AI: Working with Large Language Models

Multi-head attention and Feed Forward Network

- Earlier, we looked at how self-attention can provide context for a word in the sentence "the monkey ate that banana because it was too hungry." But what if we had multiple instances of the self-attention mechanism, so that each could perform a different task? One could link nouns to adjectives; another could connect pronouns to their subjects. That's the idea behind multi-head attention. And what's particularly impressive is that we don't hand-craft these relations in the model; they're fully learned from the data. BERT has 12 such heads, and each multi-head attention block gets three inputs: the query, the key, and the value. These are put through linear, or dense, layers before the multi-head attention function. The query, key, and value are passed through separate fully connected linear layers for each attention head, and the model can jointly attend to information from different representations…
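
To make the mechanics concrete, here is a minimal sketch of multi-head self-attention in PyTorch, assuming a BERT-base-like configuration (hidden size 768, 12 heads). The class and variable names are illustrative and not code from the course; real implementations add dropout, masking, and other details.

```python
# Minimal multi-head self-attention sketch (assumed BERT-base-like sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, num_heads=12):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate linear (dense) layers produce the query, key, and value.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # mixes the heads' outputs

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # Reshape so each head attends over its own learned subspace.
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.w_q(x))
        k = split_heads(self.w_k(x))
        v = split_heads(self.w_v(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)      # (batch, heads, seq, seq)
        context = weights @ v                    # (batch, heads, seq, d_head)

        # Concatenate the heads and combine them with a final linear layer.
        context = context.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.w_o(context)

# Usage: one sentence of 8 token embeddings.
x = torch.randn(1, 8, 768)
out = MultiHeadSelfAttention()(x)
print(out.shape)  # torch.Size([1, 8, 768])
```

The key point the sketch illustrates is that every head gets its own learned query, key, and value projections, so different heads are free to pick up different relationships (such as noun-adjective or pronoun-subject links) purely from the training data.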