The Evolution of "More Like This" Similar-Document Search

#Tech

How MLT evolved into search that understands meaning

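More Like This (MLT) is a technique for finding related material starting from an existing document.

Classic MLT centered on lexical matching with TF-IDF or BM25 and excelled at exact matches such as error codes.

Today, embedding vectors, which are numerical representations of documents, are increasingly used to capture semantic similarity between documents.

Vector search over these embeddings can find documents that are close in meaning even when they are worded differently.

Advanced systems often use hybrid search, which combines lexical search (strong at exact matching) with vector search (which captures similarity in meaning).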

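The "find something similar to this" feature (MLT) seen in web search, e-commerce sites, and elsewhere has evolved well beyond classic keyword matching. This article explains how similar-document search has changed and what role it plays in modern AI systems.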

Classic MLT and how the lexical approach works

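Classic More Like This (MLT) was word-based search: it extracted the important words (keywords) from the selected document and looked for other documents that used a similar set of words. It relied on established full-text search techniques such as TF-IDF and BM25, and it was highly accurate where matching specific strings matters, such as error codes and product SKUs. Because it could reuse the existing index, it was also relatively cheap to implement.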
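The term-selection idea above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements MLT: the `tokenize`, `select_terms`, and `more_like_this` helpers and the sample tickets are hypothetical, the TF-IDF weighting is one simple smoothed variant, and candidates are ranked by plain term overlap instead of BM25. The `max_query_terms` and `min_term_freq` names are borrowed from the parameters mentioned in the original article.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def select_terms(source, corpus, max_query_terms=3, min_term_freq=1):
    """Return the source document's terms with the highest TF-IDF weight."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    scores = {}
    for term, tf in Counter(tokenize(source)).items():
        if tf < min_term_freq:
            continue
        # Smoothed inverse document frequency: rarer terms weigh more.
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        scores[term] = tf * idf

    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:max_query_terms]


def more_like_this(source_index, corpus, max_query_terms=3):
    """Rank the other documents by how many selected terms they contain."""
    terms = select_terms(corpus[source_index], corpus, max_query_terms)
    results = []
    for i, doc in enumerate(corpus):
        if i == source_index:
            continue
        tokens = set(tokenize(doc))
        overlap = sum(1 for t in terms if t in tokens)
        if overlap:
            results.append((i, overlap))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical support tickets used only for illustration.
corpus = [
    "Service crashed with ERR_404 after deploy, ERR_404 shown in logs",
    "Customer reports ERR_404 when opening the dashboard",
    "Heap grows without bound until the process is killed",
    "Invoice export is slow for large accounts",
]

print(select_terms(corpus[0], corpus))
print(more_like_this(0, corpus))
```

In this toy corpus the repeated error code becomes the strongest query term, so the only ticket returned is the one that mentions the same code. That is the strength of the lexical approach, and also its limit: a ticket describing the same failure without the code would not be found.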

From keywords to meaning: the shift to embeddings

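Modern MLT increasingly relies on embeddings, a technique that represents data such as text or images as numerical vectors. Because a document is treated as a unit of meaning rather than just a collection of keywords, documents written in different words can still be detected as similar when their vectors are close. This makes it possible to recognize that "memory leak" and "unbounded heap growth" may describe the same problem even though the wording differs.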
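As a rough sketch of what "close vectors" means, the example below compares documents by cosine similarity and returns the nearest neighbors of a source document by scanning every vector. Everything here is an assumption for illustration: the three-dimensional vectors are made up by hand and do not come from a real embedding model, and `cosine_similarity` and `nearest_neighbors` are hypothetical helper names. Real systems use vectors with hundreds of dimensions and usually an approximate (ANN) index instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(source_id, embeddings, k=2):
    """Brute-force KNN: score every other document against the source."""
    source = embeddings[source_id]
    scored = [
        (doc_id, cosine_similarity(source, vec))
        for doc_id, vec in embeddings.items()
        if doc_id != source_id
    ]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Hand-made toy vectors, NOT the output of a real embedding model.
embeddings = {
    "memory leak in worker": [0.9, 0.1, 0.0],
    "unbounded heap growth": [0.8, 0.2, 0.1],
    "invoice export is slow": [0.0, 0.1, 0.9],
}

for doc_id, score in nearest_neighbors("memory leak in worker", embeddings):
    print(f"{score:.3f}  {doc_id}")
```

With these toy vectors, the two memory-related documents share no words yet score close to 1.0, while the unrelated invoice document scores near zero.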

Hybrid search speeds up practical adoption

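Today's advanced systems use hybrid search, which combines lexical (keyword) search with vector (semantic) search. Where an exact match is required, as with error codes, classic full-text search still does the work, while vector search captures relationships between documents that are worded differently. This combination plays a central role in information retrieval for recent AI applications such as RAG (Retrieval-Augmented Generation), where a search system retrieves context for a generative model.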
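The article does not say how the two result lists are merged. One widely used option is reciprocal rank fusion (RRF), which adds up 1 / (k + rank) for each list a document appears in. The sketch below assumes that method; the `reciprocal_rank_fusion` helper, the ticket IDs, and the two input rankings are hypothetical, and k = 60 is a commonly used default, not a value taken from the article.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so
    documents ranked well by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the same source ticket from the two searches.
lexical_results = ["ticket-17", "ticket-04", "ticket-09"]  # exact error-code hits
vector_results = ["ticket-23", "ticket-17", "ticket-04"]   # semantically close

fused = reciprocal_rank_fusion([lexical_results, vector_results])
print(fused)
```

RRF only needs rank positions, so it avoids having to normalize BM25 scores and vector similarities onto a shared scale. A weighted sum of normalized scores is another possible design.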

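Summary

MLT is moving from finding "similar words" to understanding and linking "the same meaning." Hybrid search, which uses both keywords and meaning, is expected to significantly change the search experience going forward.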

Opening of the original article (English, first three paragraphs only)

In many search scenarios, the user does not start from an empty query box, but from an existing result.

A user opens an article and wants to find related material. A buyer views a product card and looks for close alternatives. A support engineer investigates an incident and wants to see earlier cases with the same symptoms. In all these situations, the user already has a relevant document to start from.

This scenario is traditionally called More Like This (MLT): a function for finding documents similar to the selected one. In this article, MLT means search that starts from a known document, not from a newly typed query.

The classic MLT approach, or similar-document search, was based on comparing textual matches. Modern implementations increasingly use embeddings: numerical representations of documents. A search index stores embeddings as vectors, and the search system can find documents with close vector representations.

Short glossary

To avoid repeating definitions throughout the article, here are the main terms:

Term | Meaning in this article
More Like This (MLT) | search for documents similar to an already selected document
embedding | a numerical representation of text, a product, an image, or another object
embedding vector | a numerical representation of an object, such as text or a product, stored in the index to find similar objects by vector proximity
KNN, nearest-neighbor search | search for nearest neighbors, meaning objects with close vectors
ANN, approximate nearest neighbors | approximate nearest-neighbor search; it speeds up KNN on large datasets without scanning every vector
RAG, Retrieval-Augmented Generation | an approach where the search system retrieves context for a generative model
hybrid search | combining full-text search and vector search in one scenario
reranking | an additional sorting step for already retrieved candidates using a more precise model or rule

What classic More Like This did

Classic MLT was lexical. It answered a simple question: which documents use similar important words?

The process usually looked like this:

1. The search system took the source document.
2. It analyzed its text.
3. It selected informative terms.
4. It built a query from those terms.
5. It searched for documents with a similar set of words.
6. It returned a list of similar documents.

Internally, this used familiar full-text search mechanisms: TF-IDF or BM25, term frequency, stopwords, field boosts, and document-frequency limits. That is why older MLT implementations exposed parameters such as min_term_freq, min_doc_freq, max_doc_freq, and max_query_terms.

This was not just an interface element, but a full search mechanism. MLT was used for related articles and products, duplicate detection, support-ticket matching, legal search, patent research, and internal knowledge bases.

Where the lexical approach is still strong

Lexical MLT works well when specific words, identifiers, and stable formulations matter.

Examples:

- error codes;
- product SKUs;
- part numbers;
- function names;
- stack traces;
- legal wording;
- nearly identical product or ticket descriptions.

The reason is that exact matching is critical here. If two incident reports contain the same error code or the same stack trace, full-text search sees a direct match. For example, when searching tickets with the code ERR_404, lexical MLT quickly finds every mention of that code, while vector search may return tickets that describe similar but not identical problems.

Lexical MLT had another advantage: it was cheap to run. The inverted index is already in the search engine. The analyzers are already configured. Ranking already works. There is no need to deploy separate search infrastructure just to support a “find similar” feature.

The limitation is also clear. If two documents describe the same thing in different words, lexical MLT may fail to connect them. Synonyms work unevenly. Paraphrases are harder. Cross-lingual similarity is usually unavailable. For example, memory leak and unbounded heap growth may describe the same problem, but a standard analyzer sees different tokens.

Lexical MLT efficiently finds documents with matching or similar wording. Semantic search helps when the meaning matches, not the words.

What embeddings change

Using embeddings (numerical representations of documents) changes the comparison principle: instead of words, the system compares vector representations.

A document no longer has to be represented only as a set of weighted terms. It can be stored as a dense vector. Nearby vectors usually correspond to documents that are similar in meaning, even if they are written in different words.

The lexical approach looks for matches by words and terms, while embedding search looks at the proximity of document vector representations. The first approach is optimal for exact matches such as error codes and SKUs. The second finds semantically close documents, even when they are phrased differently.

This expands the scope of this kind of search. You can compare not only articles, but also products, images, code fragments, user events, or context fragments in a RAG system. In RAG, the search system first retrieves relevant context, and then the generative model uses that context to produce an answer.

Lexical search does not disappear. Exact error codes, SKUs, names, statute references, and near duplicates are still better handled lexically. That is why production systems often use

Note: Out of respect for copyright, the excerpt is limited to the opening three paragraphs. See the original article for the rest.
