The Evolution of Large Language Models: The Role of Transformers and Pretraining
Large language models (LLMs) grew by leaps and bounds thanks to the Transformer architecture and self-supervised learning, in particular "predict the next word" pretraining.
By eliminating recurrent processing and enabling parallelism, the Transformer made efficient training on massive datasets possible.
Self-supervised learning accounts for most of the training, with supervised learning and reinforcement learning added afterwards.
Large language models (LLMs) such as OpenAI's GPT series have attracted rapid attention in recent years for their enormous scale and advanced capabilities. But how did these models become so large, and how do they achieve that performance? This article focuses on the roles of the Transformer architecture and pretraining in the evolution of LLMs, and explains their background and importance. The Transformer architecture is the breakthrough that dramatically improved LLM performance, and understanding it is essential for thinking about where AI technology goes next.
Self-supervised learning: breaking the training-data bottleneck
In machine learning, supervised learning trains a model on data labelled by humans. The drawback is that labelling is enormously costly. Self-supervised learning emerged as the answer: it takes the vast amount of text available on the internet and trains the model on the task of "predicting what comes next". Through this process the model acquires general-purpose knowledge, such as relationships between words and an understanding of context. In Yann LeCun's "cake analogy", self-supervised learning makes up the bulk of the cake, underscoring its importance.
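As a minimal illustration of this (a sketch of mine, not code from the original article), every position in a raw text can serve as a training example: the prefix is the context and the next token is the target, with no human labelling required. Real LLMs operate on subword tokens rather than whole words.

```python
# Minimal sketch of 'predict the next word' self-supervision: every position
# in a raw text yields one (context, target) training pair, with no human
# labelling required. (Illustrative only; real LLMs use subword tokens.)

def next_token_pairs(tokens):
    """Return (context, target) pairs: predict each token from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "the old revolver sat on the kitchen table".split()
for context, target in next_token_pairs(tokens)[:2]:
    print(context, "->", target)
# ['the'] -> old
# ['the', 'old'] -> revolver
```

Note how a single short sentence already yields seven training pairs; this is why raw internet text is such a rich source of supervision.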
The Transformer architecture: enabling scale
Self-supervised learning made it possible to train on large amounts of data, but model performance still hit a ceiling. The breakthrough was the Transformer architecture, published by Google in 2017. Instead of the sequential processing of earlier RNNs (recurrent neural networks), it uses an attention mechanism to capture the context of the entire input at once. This enables parallel processing and dramatically faster training. The Transformer was adopted as the foundational architecture of LLMs, including the GPT series, and boosted their performance enormously.
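To make this concrete, here is a deliberately tiny, pure-Python sketch of causal (masked) self-attention. The function and its simplifications (queries, keys, and values are the raw inputs; real transformers use learned projection matrices and multiple heads) are mine, not the paper's formulation. The point to notice is that each output row depends only on the inputs, never on another position's output, so all rows could in principle be computed in parallel, unlike an RNN's step-by-step scan.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_self_attention(x):
    """Toy single-head attention with queries = keys = values = x.
    Position i attends only to positions 0..i, and no position waits
    on another position's OUTPUT, so all rows parallelise."""
    d = len(x[0])
    out = []
    for i, q in enumerate(x):
        # Scaled dot-product scores against self and all earlier positions.
        scores = [sum(a * b for a, b in zip(q, x[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted mix of the attended positions' vectors.
        out.append([sum(w * x[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out

# Position 0 can only attend to itself, so its output is its input unchanged.
print(causal_self_attention([[1.0, 0.0], [0.0, 1.0]])[0])
# [1.0, 0.0]
```

The causal mask (attending only to earlier positions) is what preserves the "predict what comes next" training setup while still allowing every position's prediction to be trained simultaneously.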
Fine-tuning: better performance, broader applications
Through pretraining on large text corpora, an LLM acquires general-purpose knowledge. To specialise it for particular tasks, however, it is further fine-tuned on a smaller amount of labelled data (supervised learning). The GPT series, for example, is adjusted after pretraining for tasks such as question answering and text generation. Adding reinforcement learning on top makes it possible to generate more natural, human-like responses. The evolution of LLMs is expected to continue, and their range of applications to keep widening.
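The reinforcement-learning idea can be loosely sketched as grading candidate outputs rather than dictating them. The reward function and replies below are invented for illustration; real RLHF learns a reward model from human preference ratings and uses it to update the model's weights, not merely to pick among finished outputs.

```python
# Toy illustration of RL-style grading (invented example): instead of telling
# the model exactly what to say, we score candidate replies and prefer
# higher-scoring ones.

def pick_best(candidates, reward):
    """Return the candidate the reward function scores highest."""
    return max(candidates, key=reward)

# Hypothetical reward: prefer replies that answer the question ('Paris'),
# end in a full stop, and aren't shouting.
def reward(reply):
    return ("Paris" in reply) + reply.endswith(".") - reply.isupper()

replies = ["dunno", "PARIS!!!", "Paris.", "idk, maybe Rome?"]
print(pick_best(replies, reward))
# Paris.
```

Even this toy shows the core hazard: behaviour is shaped by whatever the scorer rewards, so a mis-specified reward produces confidently wrong but well-formatted answers.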
Summary
The evolution of large language models came about through the combination of self-supervised learning, the Transformer architecture, and fine-tuning. These techniques have had a major impact on the development of AI, and that influence will only grow. Improvements in LLM performance are not merely technical progress; they may transform society and industrial structures as well. The continued evolution of LLMs, and their expanding applications, will be worth watching.
Excerpt from the original article (English)
Large language models are really large. They're among the largest machine learning projects ever, and set to be (perhaps already are by some measures) some of the largest computing and even largest infrastructure projects ever.

But how did LMs actually get so large as to warrant the title 'large language model (LLM)'? A large part of the answer is in the P ('pretrained') and the T ('transformer') of GPT.

This is part 1 of a series about LLM architecture and some implications, past and future, for reasoning. Part 1 is 'how we got here': what was so impactful about the transformer architecture for LLMs. Some readers may prefer to skip this. Part 2 will point at an unexpected benefit, the surprising explainability of contemporary AI reasoning, and why new trends might erode that. It's a novel point as far as I know.

Self-supervised learning: most of the cake

In 2016, early deep learning practitioner Yann LeCun introduced a famous analogy for learning intelligent systems:

[Figure: LeCun's cake analogy ('unsupervised', 'predictive', and 'self-supervised' are used somewhat interchangeably)]

If intelligence is a cake, the bulk of the cake is unsupervised learning, the icing on the cake is supervised learning, and the cherry on the cake is reinforcement learning (RL).

This correctly described what was for some years the prevailing practice for training LLMs, before the first LLMs were even created!
(Though by 2025, certainly 2026, this has become outdated; more on that later.)

What does this mean, and how did LeCun come to this position? Machine learning, especially deep learning, needs example data.

Old-school supervised learning

You might think to start (and historically, researchers mostly did) by curating labelled examples: 'this image is a cat, this one a dog; the French for "a bed" is "un lit"; …'. The machine learns these labels, and (hopefully) also learns how to generalise correctly, responding appropriately to similar-enough previously-unseen cases, which is what you actually want.

This supervised approach can be excellent: you get (ideally) reasoned, expert judgement as the sole curriculum for your nascent learning machine. But it is also painfully expensive, because you have to pay or otherwise entice humans to look over all your examples (or worse yet, create gold-standard examples from scratch)!

The hands-off approach

Fortunately for developers, there is a stupendous amount of (language and other) data on the internet. That's one very big reason, if you can, to pursue self-supervised training targets for your machine learning project: rather than 'here is my carefully (and expensively!) curated dataset of labelled examples, learn them', it's 'here is a gigantic pile of text, learn to predict whatever comes next'.[1]

As in supervised learning, where the goal is not simply to memorise the provided examples, the eventual goal of self-supervised learning is rarely just to teach the machine this specific prediction activity; rather, in the process of learning how to do that, the machine is forced to learn generalisable concepts and features which can then be turned to a wide range of tasks.

More on RL (the cherry) later.

Self-supervised learning in language models

Language comes in long sequences[2]. So language learning systems need to be able to consume sequences.
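The "consume a sequence, predict what comes next" setup can be sketched with a toy counting model. This bigram stand-in is mine, not the article's; actual language models condition on the whole preceding sequence, not just the last token.

```python
from collections import Counter, defaultdict

def continuation_distribution(tokens, context):
    """Toy next-token predictor: estimate P(next | previous token) from
    bigram counts. A stand-in for the idea only; real language models
    condition on far more than one token of context."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    c = counts[context[-1]]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

corpus = "she fired the gun then she fired the gun then she fired the pigeon".split()
print(continuation_distribution(corpus, ["she", "fired", "the"]))
# {'gun': 0.6666666666666666, 'pigeon': 0.3333333333333333}
```

The output is a distribution over possible continuations, exactly the shape of prediction discussed below, just computed from hopelessly little context.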
Consider: a large part of how we track what's going on in a play, a textbook, an argument, a novel, and so on, is by reference and by recollection of context far back in the sequence: a previous dialogue, an earlier concept or experiment, a foundational point of information, a character's exposition and background. The foreshadowing of a Chekhov's gun can't be understood if you've lost the plot in the meanwhile! So too with machine language models.

For the 'predict what comes next' self-supervised target, that looks like consuming in principle everything that's come so far, then using that context to output a distribution over possible continuations.

Chapter 1:
… The old revolver sat on the kitchen table…
…
Chapter 20:
… Alice had run out of patience. Without warning, she fired the [gun/chef/kiln/pigeon]

Which one is right? Well, actually it depends a lot on what happens in the intervening scenes I've skipped! But assuming the narrative promise made in chapter 1 is kept (and no other overriding promises are made in between), a good prediction is 'gun'. On the other hand, if chapter 1 instead introduced a sous-chef with a hygiene problem, a pottery studio with a design dispute, or an experimental pigeon-launching apparatus, the answers might be different.

The especially useful property of 'what comes next' self-supervision is that you can run this test on every single position. So a reasonable-length text might give you thousands or more such training examples… some easier than others.

'Simply predicting what comes next' clearly necessitates tracking a lot about what's going on (across all kinds of scenarios) and how things work… at least if you want to predict well.

What are we waiting for? How transformers overcame a scaling problem

The benefit of self-supervised learning, a huge, largely automatically-generated collection of training targets ('predict what comes next'), also raises its own challenges.
If much of the capability of your system arises from the sheer quantity of these training targets, you run into computational challenges: you want to be pumping more and more of these examples through your machine, but each example takes some number crunching, which costs compute time. Crucially, if you can crunch through more examples at once in parallel, you're in a far better spot.

We saw how text prediction is a sequence problem. Naively, this means that, to make the prediction at the last position in the text, your process has to first read the first token, then the second, the third, and so on. To make the guess at the last position, you've had to 'wait' for absolutely all of the rest to be done processing. At least, that's how humans read, and it's how neural network machine language models did their processing until 2017[3].

It's implemented as a 'message' (neural activation) passed forward from position to position (which needs to capture whatever might be relevant to 'remember'), a structure known as a recurrent network. (Messages are additionally passed 'depthwise' in deep neural networks.) The nth position needs n steps to proceed before it can compute anything. For long texts, that's crippling.

[Figure: When later positions have to wait for earlier computations, it really adds up for long texts.]

The milestone transformer architecture, introduced in the 2017 paper Attention Is All You Need, totally upended this constraint, making far larger-scale self-supervised training practical. Neural attention mechanisms were introduced long before this paper; the real innovation was to drop the recurrent connections entirely ('You Don't Need Recurrent Connections' is the logical corollary!), and this is what enables the entirely more scalable training of these architectures on long sequences. Attention still passes messages forward, but never only forward, always 'diagonally': making depthwise steps through the structure at the same time.
This means no position, however long the text, need wait for any previous positions to compute. It does demand a highly parallel computation, but modern compute resources are nothing if not highly parallel.

[Figure: By doing away with all fully recurrent message pathways, attention-only processing can proceed absurdly faster on long texts. (Both diagrams created with claude.ai)]

Self-supervised training is only the start

Self-supervised training targets are great for cheaply (low human effort!) learning all kinds of features and concepts from big data. The well self-supervised neural network now 'gets what is going on' in all kinds of texts, well enough to sensibly predict what might come next, even on new content.

Sometimes that kind of prediction is already exactly what you want, but usually there are other things you'd like your neural network to do. Enter transfer learning: any number of approaches to tapping into these learnings to accomplish the actual task of interest.

Chatbot as playscript

For language models, a common basic approach is arranging for the model to 'predict' the responses of a helpful ASSISTANT character, in a dialogue with a USER character (played by you).

The following is a dialogue between a computer USER and a helpful, knowledgeable ASSISTANT.
USER: Which is better, a Chekhov's gun or a Maxim gun?
ASSISTANT:

Because the assistant in this context is plausibly actually helpful and knowledgeable, there's some reasonable chance that a well-trained language model produces a helpful answer here with no further tweaks. Of course this is flimsy: hallucination of plausible-but-incorrect responses or provision of unhelpful responses is common, as are entirely off-track diversions, like introducing new 'characters' or pivoting away from the established playscript structure.

Post-training

Various approaches make more involved attempts to effectively make use of the knowledge acquired in self-supervised pretraining.
Those which involve further training, not just prompting, are often called 'post-training'.

Supervised learning: the icing

Recall LeCun's cake: with self-supervised ('unsupervised') learning as the cake, supervised learning is the icing. A common pattern for LLMs intended for chatbot use is to have humans generate examples of how the ASSISTANT should respond to various queries. Should it ask clarifying questions sometimes? How verbose should it be? Should it provide references? Should it respond in first person? Should it respond in the register and dialect of the USER, or have its own? With training examples provided, sometimes principles like these can be learned and generalised.

Because almost all of the knowledge, language understanding, common sense, and character information comes from self-supervised pretraining, supervised learning as fine-tuning can get away with radically fewer examples than would be needed to train this way from scratch. Less human effort. Icing indeed!

Reinforcement learning: the cherry on top

LeCun's cake analogy underestimates and misunderstands reinforcement learning (more on that in part 2). But for a few years, his relegation of RL to a final (but important) stylish flourish was a fairly good description of how LLM-based AI assistants were trained.

Rather than getting human contributors to produce exemplary data (as in supervised learning), RL has human engineers specifying how to grade outputs as good or bad, and sets the machine up to explore possible behaviours. Those outputs get graded, and the training encourages more of whatever computations produced the good ones (and less of whatever produced the bad ones).

This is very fraught, because accurately specifying what counts as good behaviour ahead of time is notoriously difficult, and if you're not careful, exploratory AI will absolutely find the strange cases you didn't think of.
It can also fall completely flat if the AI in training has no idea how to achieve good outputs at all: it just gets stuck flailing.

This makes post-training of reasonably competent pretrained base models a sweet spot for RL: the model has enough background competence to at least occasionally succeed, can learn from success, and keep getting better and better. RL (whether as post-training or from scratch) also has the most potential to scale past human expert level, because we only need to be able to say which outcomes are better, not how to achieve them.

The earliest widespread application of RL to LLM post-training was reinforcement learning from human feedback (RLHF). At its core, this asks humans to rate which answers are better or worse, then has the machine try out various ways of achieving better and better answers (according to the human raters). Typically this is repeated for a few rounds. In 2023-4, this is how companies got AI chatbots to mostly respond politely, usually refuse to describe how to make bombs, and often stick to character. It's also a great way to train AI to be obsequious, disingenuous, flattering, and sycophantic. Oops!

Times changing

In frontier AI systems, reinforcement learning is no longer a lightweight (albeit effective) post-training flourish on top of a primarily self-supervised pretrained LM. Since late 2024, reaching for expert-and-beyond capabilities has brought RL back into the spotlight. That has a few implications, one of which, reasoning and its transparency (or otherwise), we'll look at in the next post.

[1] More generally, self-supervised targets take existing data from the target domain (images, language, code, …) and, in various ways, corrupt or distort it, tasking the machine with recovering the original as accurately as possible from the clues in context. For language, 'here is a text fragment; predict what comes next' is a natural version of this, as is 'here is a fragment of censored text; fill the blanks'.
In the image domain, images might be censored in patches, pixellated, or otherwise distorted.

[2] Other rich and important formats also have at least one open-ended dimension to them: audio, video, computer code (just a kind of language), DNA, drone and vehicle control logs, robotic action sequences, … So machine learning approaches for this range of formats can benefit from similar architectures.

[3] I'm fairly sure of this; at least, I'm not aware of approaches which escaped this sequencing constraint. There's a curious 2016 post by Google researchers which looks at various modified approaches, notably including attention mechanisms, which are what the later transformer architecture is centred on, but in all cases based on a fully recurrent backbone. Some hierarchical sequence processing approaches use convolutions instead of recurrence, but this too requires scanning the sequence in order to process later elements.
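The corrupt-then-recover framing of footnote [1] can be sketched as follows. This is a hypothetical toy of mine in the spirit of 'fill the blanks' masking objectives, not a method from the article; schemes such as BERT's masked-language-model training add many further details.

```python
import random

def mask_tokens(tokens, mask_rate=0.25, seed=0):
    """Toy 'fill the blanks' self-supervision: hide a fraction of tokens
    and keep the originals as the recovery targets the model must predict."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok  # the model's training target at position i
        else:
            corrupted.append(tok)
    return corrupted, targets

tokens = "the old revolver sat on the kitchen table".split()
corrupted, targets = mask_tokens(tokens)
# Filling each [MASK] with its target reconstructs the original exactly.
restored = [targets.get(i, tok) for i, tok in enumerate(corrupted)]
assert restored == tokens
```

As with next-token prediction, the corruption is automatic, so the supply of training examples is limited only by the supply of raw data.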
※ Out of consideration for copyright, only an excerpt is quoted here. Please see the original article for the full text.