The structure of the "learning stall" in AI use
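When an undergraduate student took on a project to develop a verification agent for the MQTT protocol, it came to light that the entire system had been generated by an LLM (Claude).
The student described the work using technical terms such as "fine-tuning" but had no real understanding of the experiment's validity or of how the system actually worked.
This case calls into question the optimistic view that all is well as long as the AI produces correct output.
Even when the output is correct in form, that does not make it the right thing in the real world. In the original author's words, correctness is not the same as "the right thing."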
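As AI development advances, the belief is spreading that a system will work perfectly as long as you specify what you want. This article sounds the alarm about the risk of a "learning stall" that lies behind that belief. Drawing on the author's experience supervising an undergraduate's independent study, it argues that automated verification by LLMs and agents has limits and that humans still need to understand the internals of the systems they build.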
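The misperception bred by overconfidence in AI
With recent advances in large language models (LLMs) and agent technology, the view that humans do not need to know the internals of the systems they build has been gaining ground. The author disagrees: even when "correct" output is obtained, it is not necessarily "the right thing." The author stresses that this distinction is crucial for separating hallucinations from reality.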
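A verification project that began with high hopes
The undergraduate took on a research project that used an AI agent to automatically verify the MQTT communication protocol. The initial plan offered two routes: (1) check the protocol specification itself for inconsistencies between its clauses, or (2) verify whether concrete implementations comply with the specification.
The author considered the project feasible within one quarter. The student, however, never showed up for the weekly meetings and appeared in Week 8 with the project fully executed.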
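The gap between the deliverable and the truth
The agent the student presented had been built by fine-tuning a Qwen model and looked highly polished. The student reported having "built an AI agent that verifies the MQTT protocol."
However, when asked which of the two routes, (1) or (2), the project had tackled, the student first named the specification route, then switched to the implementation route when shown the training data, and could not explain how the agent verified anything. A polished deliverable alone does not measure real understanding or achievement.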
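The excerpt quoted further below attributes the invalidity of the experiment to three things: Claude generated both the training data and the test data, there was no external MQTT implementation, and there was no baseline such as Qwen without fine-tuning. As a minimal sketch of the comparison that was missing, the Python below scores a fine-tuned verifier and a baseline on the same labeled cases. Every name here (`Verifier`, `LabeledCase`, `accuracy`, `gain_over_baseline`) is hypothetical and not taken from the student's code, and the sketch assumes the labels were established independently of whatever model produced the training data.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration only.
# A verifier maps implementation source code to True (compliant) or False.
Verifier = Callable[[str], bool]
# A labeled case pairs source code with an independently established label.
LabeledCase = Tuple[str, bool]


def accuracy(verifier: Verifier, cases: Sequence[LabeledCase]) -> float:
    """Fraction of cases where the verifier's verdict matches the label."""
    if not cases:
        raise ValueError("need at least one labeled case")
    correct = sum(1 for source, label in cases if verifier(source) == label)
    return correct / len(cases)


def gain_over_baseline(
    finetuned: Verifier, baseline: Verifier, cases: Sequence[LabeledCase]
) -> float:
    """Accuracy of the fine-tuned verifier minus accuracy of the baseline."""
    return accuracy(finetuned, cases) - accuracy(baseline, cases)
```

A gain near zero would suggest that the base model already knew as much about MQTT as the fine-tuned one, which is the effect the missing baseline would have controlled for.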
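Summary
This case drives home that however far AI technology advances, people who grasp a system's internal logic remain indispensable. It underscores the importance of trying to understand how the whole thing works rather than depending too heavily on AI.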
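One concrete piece of internal logic from the excerpt below: the numeric confidence score was the percentage of test cases that passed, separate from the three-level rating (compliant, maybe, not compliant) that the prompt asked Qwen for. The sketch below assumes test cases can be modeled as zero-argument callables that return True on a pass. The names `RATINGS`, `pass_rate`, `Verdict`, and `make_verdict` are illustrative and do not come from the student's project.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Three-level scale described in the excerpt; the exact strings are assumed.
RATINGS = ("compliant", "maybe", "not compliant")


def pass_rate(test_cases: Sequence[Callable[[], bool]]) -> float:
    """Fraction of test cases that pass (each case returns True on a pass)."""
    if not test_cases:
        raise ValueError("need at least one test case")
    return sum(1 for case in test_cases if case()) / len(test_cases)


@dataclass
class Verdict:
    rating: str        # the model's label, one of RATINGS
    confidence: float  # pass rate of the test cases, not a model output


def make_verdict(rating: str, test_cases: Sequence[Callable[[], bool]]) -> Verdict:
    """Combine a model rating with a test-derived confidence value."""
    if rating not in RATINGS:
        raise ValueError(f"unknown rating: {rating!r}")
    return Verdict(rating=rating, confidence=pass_rate(test_cases))
```

Because the two values come from different sources, a high `confidence` says nothing about whether the rating is right. It only reflects the test cases, which in this story were also written by Claude.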
Opening of the original article (English, first three paragraphs only)
I recently had an experience with an undergraduate student that really made me pause on how we use AI. I made a couple of posts on LinkedIn with very short snippets of what happened, but I want to write down this experience in detail, because the details shed a spotlight on the mechanics and the consequences of the often-heard idea that, in the near future, we, humans, will not need to know anything about the internals of the systems we build; that we just need to specify what we want and, once the specification is kosher, the AI agent will do it and it will all be correct.
Maybe it will all be correct. And maybe it will also be wrong. Correctness is not the same as “the right thing.” And this distinction is crucial to separate hallucinations from reality.
The Independent Project, Week 1
This all started in early April when an undergraduate student, let’s call him Joe, approached me for an independent study — these are research project courses that students can take for credit, and are typically done by students who are considering graduate school or are simply curious about research. Joe, a junior, told me he wants to apply to PhD programs. So on week 1 of the quarter, we talked about his interests and my current interests, and we zoomed in on a possible project for him. Here’s a summary of that conversation.
Lately, I am generally interested in methods for automatic verification of software artifacts using LLMs/agents/neural networks (we had a first paper on that back in 2023/2024). Joe told me he is interested in the models themselves, and wanted the experience of fine-tuning a model. So I suggested a project around the idea of automatic verification of protocol specifications. I selected a specific protocol — MQTT — for him to do an experiment involving the development of an AI agent that could, potentially, automatically verify the protocol. I also mentioned to him two possible routes for this general idea: (1) we could have the AI agent verify the consistency of the protocol specification itself, that there are no inconsistencies between the many clauses; or (2) we could grab concrete implementations of MQTT out there and verify whether they comply or not with the protocol specification.
The meeting ended on a good note of mutual understanding, with Joe’s next goal being deciding which of the two problems to tackle, and then take it from there. In my mind, it was pretty clear that this project was feasible in a quarter, especially with the help of Claude/Gemini/etc.
Week 8
The way I work with students is to have one-on-one weekly meetings, possibly augmented with additional meetings when the students go through fast progress that needs my steering. I don’t make any of this mandatory, as I’ve worked with enough students to know that every one of them is different and needs different amounts of supervision; also because I treat these students as younger peers, not employees or children. This is just a nice way of saying that Joe never showed up for the weekly meetings.
He showed up on Week 8 with the complete execution of the project. He was excited to show it to me. He told me he developed an agent that could, indeed, verify that MQTT was correct and that the agent was based on a fine tuning of Qwen. Glancing at his Visual Studio, the thing looked pretty impressive and well done: a proper folder structure, files that had meaningful names, a data folder that clearly contained training data for fine tuning, lots of references to MQTT, etc. Looked plausible and impressive! After this 2 minute oral abstract of the project, he asked if we could write a paper about this project and submit it to a conference. I was excited, too, so I asked him to explain what exactly did he do.
*Joe’s slight pause* I build an agent that verified the MQTT protocol
*Me* OK, but which of the two problems did you tackle? Is it the specification consistency problem or the implementation correctness problem?
*Joe hesitating* uh… I think… the specification correctness?
*Me, looking at his training data and seeing lots of Python code in it* Really? Are you sure?
*Joe hesitating even more but trying to sound confident* uh, yes, the specification correctness
*Me pointing at the training data* So what is this training data all about? What are all these Python functions that look like an MQTT implementation?
*Joe now visibly nervous* Sorry, I meant I did the implementation correctness. I verified that a given Python implementation is correct
*Me, now a bit worried* OK. And how did you do it?
*Joe with a blank stare of confusion* I build an agent.
*Me, even more worried* And how does the agent verify that the implementation is correct?
*Joe really nervous* uhh…
*Me jumping in to help him* OK let me take a closer look at your code.
I looked at the code for one minute and I realized, with awe and horror, what exactly had happened. Two things:
1) Joe had simply plugged into Claude Code some version of our conversation during our Week 1 meeting, and let Claude do the entire project. He had no idea what the bot had done to develop this protocol verification agent on top of Qwen, other than the vague notion that fine tuning was involved — something that seemed important to Joe since day 1.
2) Claude did its best to comply with what Joe asked it to do, but the experiment was invalid: Claude had generated the training data and the test data, there was no external implementation of MQTT, and there was no baseline (e.g. Qwen without fine-tuning). This Qwen agent was saying that Claude’s functions were correct/incorrect based on Claude’s training data of correct/incorrect functions, and without any baseline that could control for how much Qwen already knew about MQTT.
I leaned back and sighed. I hesitated: do I explain to Joe that the experiment is invalid, how and why it is invalid, or do I help him get to that conclusion by himself? There were still 2 weeks left in the quarter, so I decided to give him a chance. I told him that I see problems here, but that if I tell him the problems myself, he will get an F in the course. I gave him the option to go study what Claude did, enough for him to be able to tell me what the problems are.
Week 9
Joe came back one week later. This time he could answer, with confidence, that the agent was verifying an implementation of MQTT, not that the protocol itself was consistent. He could also tell me that the training and testing programs were being parsed into an AST (a term that seemed new and important to him, to the point that he repeated it, like, 10 times during this meeting), and that the agent would score them. He had written down notes, and was looking at the notes to explain things to me.
I was confused. When are programs parsed, and for what purpose? What is the significance of this score? I asked him to draw something on the white board to explain the sequence of steps. He drew a box and wrote the words “relevant code” inside. He explained that the agent gets relevant code and then it scores. Me: scores what? How? *blanks*
After 15 minutes of this, I stopped. It was clear that Joe had asked Claude to answer the questions that I had asked him on Week 8, plus, maybe a couple of more clarifying questions, and he wrote down Claude’s answers on his notebook. He still had no idea of what Claude had done, and, more importantly, he still had no idea of what experimental validity is — this wasn’t even on the scope of things he could understand at this point.
Still one week to go.
Week 10
He came back again. Clearly he had spent a lot more time exploring and studying what Claude had done. He jumped to the whiteboard and drew a boxes-and-arrows diagram that at least attempted to explain what happens when a test Python function comes in for verification. It gets parsed. The identifiers and docstrings are used to retrieve the most similar protocol specification clause. Then a verification component kicks in (this is the fine-tuned Qwen) saying whether it complies or not, and a confidence score, a number between 0 and 1, is returned.
*Me* What is this confidence score? How is it calculated?
*Joe blanks*
*Me* OK, let’s backtrack. What is the prompt used for Qwen?
*Joe nervous* The prompt? uhh… I send the function to Qwen.
*Me* Just the function? No instruction? How does that work?
*Joe nervous* uhh
*Me* OK, let’s look at the code. Show me where the prompt is.
He opens Visual Studio and proceeds to try to find something related to prompting. He is taking more than a minute, so I step out to go to the restroom and give him time to search the code without me looking over his shoulder. When I came back, his cursor was on top of a function call to make_prompt(…). Good.
*Me* OK, so show me the prompt.
*Joe clearly not knowing what to do*
*Me* Find the definition of that function!
*Joe immediately searching for occurrences of make_prompt and finding it* Here it is.
The function was constructing the prompt from a combination of constants and variables, as prompt functions do. One of the constants was called INSTRUCTION.
*Me* OK, so show me the instruction.
To my horror, he scrolled down on the file as if he didn’t know that constants are typically declared in the beginning of files — and INSTRUCTION was right there on top of def make_prompt(…), which happened to be the first function of that file. I stopped him. “Constants are usually on the top of files!”
It dawned on me that this young man didn’t know how to read code. He is a junior in a CS program, and he doesn’t know how to read code! My awe and horror doubled. He has probably been using Claude/Gemini/ChatGPT this whole time to get through his homework assignments, and so he doesn’t know how to read code, how to navigate it. What a f*cking disaster!
He scrolled back up and finally found the INSTRUCTION constant.
*Joe* Here it is.
*Me reading it out loud* OK, so it is asking to rate the function on a three-level scale of compliant, maybe, and not compliant.
Joe was relieved that I was pleased, but he didn’t seem to understand the significance of the prompt or why I was so fixated on the prompt.
*Me, take two* So what is this numerical confidence score? How is it calculated? Is it Qwen?
*Blank stares*
He couldn’t answer. He didn’t know. FYI, dear reader, the confidence score was based on the agent running test cases, also written by Claude. The numerical value was the percentage of test cases that passed. Joe had no idea. I’m not sure he knows what a test case is. I wanted to discuss the validity of using test cases for measuring the confidence of the agent’s answer, and how that composed with Qwen’s 3-level confidence scores, but there was no point in trying to have that discussion with Joe, since he didn’t even know how that score was computed, and he had just seen the Qwen prompt for the first time.
The picture was clear: Claude had made up all the data — training and test — and all the test cases for quantifying confidence. This was a completely invalid experiment, a simulation of scientific research. In the beginning of the quarter, I set Joe up to answer the question: can you develop an agent that automatically verifies a given implementation of MQTT?, and Claude gave him a pleasing answer: yes you can! — here it is. It is capable of detecting compliant and non-compliant implementations. Never mind that Claude made all that data up so that the answer would be yes.
At this point, I steered the conversation to the experiment itself, to try to lead him to the judgement of validity — the most important issue at stake here, really. Joe was so far from understanding the concept that, at this rate, he would not get there by himself by the time I need to post his grade on the Registrar next week.
*Me* OK, so who wrote all this training data for fine-tuning Qwen?
*Joe, with hesitation, I think he was afraid I would think he was cheating* Claude
*Me* OK. And who wrote the MQTT implementation that is being tested?
*Joe* I did.
*Me, surprised* You did? No way…
*Joe, without hesitation* Yes, I wrote it
*Me, coming closer to him, so to look at the perfectly written code* Joe, there is no way that you wrote this code.
*Joe nervous again* OK, Claude wrote it but I asked Claude to do the project.
*Me ignoring the disturbing cognitive boundary elimination* OK, so Claude wrote the training data, and the program that is being verified, and the test cases that determine the confidence of judgement… *pause to let him complete the thought*
*Joe* oh so maybe… this is invalid?
Departing Thoughts
I am still wrapping my head around what happened, how we got here, and what it means. I don’t have clear thoughts. But I know this is very, very bad for young people still developing their independent thinking skills. This is what happens when you trust a black box, and you believe that you don’t need to verify anything, you don’t need to see what’s inside, because the black box sounds so authoritative — it certainly seems to know a lot more than you do.
And then the black box is wrong. It optimizes to please you: you want a protocol verifier, and it gives you one. It hallucinated a perfect project for Joe, and Joe didn’t have the skills or knowledge to push back and ask questions, because he didn’t even know what questions to ask. He is still learning, he does not have the language to ask the right questions. Had I not pushed back, Joe would still be under the illusion that he did a fantastic, super cool research project that even involved fine-tuning Qwen! Wow! He was ready to put that on his resume, and he was ready to ask Claude to write the paper!
So what will happen when everyone decides to ignore the insides, and just trusts the black box? What will happen when people don’t know what questions to ask because they never learned the insides of anything? It seems to me this is when the line between reality and hallucination completely disappea
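Note: Out of respect for copyright, the quotation is limited to the first three paragraphs. Please see the original article for the rest.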