Atlas推論エンジン

2026年05月13日 #Tech

Atlas Inference Engineは、RustとCUDAのみでゼロから構築されたLLM推論エンジンです。

PythonやPyTorchといった大規模な依存関係を排除することで、約2.5GBという極小のバイナリで動作します。

この設計により、既存の競合エンジンと比較して劇的な速度向上を実現しています。

Atlasは、手動でチューニングされたカスタムCUDAカーネルやMTP（マルチトークン予測）といった高度な最適化技術を活用しています。

OpenAI互換APIに対応し、多種多様なモデルをサポートすることで、高性能かつシンプルで運用しやすいLLM推論環境を提供します。

原文の冒頭を表示（英語・3段落のみ）

Pure CUDA + Rust. Zero python dependencies, zero complex recipes. Inference at unimaginable speeds An LLM inference engine written from scratch in Rust and CUDA. No PyTorch. No Python. Just a

~2.5 GB image that runs 3x faster than the status quo. $ docker pull avarok/atlas-gb10:latest ~2.5 GB image. Run command below. 130 tok/s peak (Qwen3.5-35B)~2.5GB total image size<2min cold start time3.1x faster than vLLM Faster by Design Clean architecture beats bloat vLLM ships 20+ GB of Python, PyTorch, and 200+ dependencies. Atlas ships a single ~2.5 GB

binary. That simplicity is the speed. Atlas Image size ~2.5 GBCold start <2 minRuntime Rust + CUDADependencies None vLLM Image size 20+ GBCold start ~10 minRuntime Python + PyTorchDependencies 200+ packages ⚡ Pure Rust + CUDA Compiled from HTTP to kernel dispatch. No interpreter, no GIL, no JIT warm-up.🔧 Custom CUDA Kernels Hand-tuned attention, MoE, GDN, and Mamba-2 kernels for Blackwell SM120/121. NVFP4 and FP8 with native tensor cores.🔮 MTP Speculative Decoding Multi-Token Prediction generates multiple tokens per forward pass. Up to 3x throughput over single-token decoding. Qwen3.5-35B (NVFP4) on DGX Spark Single GPU, batch=1. Atlas with MTP K=2. Atlas vLLM Average (diverse workloads) Atlas111.4 tok/s 3.0x vLLM37.5 tok/s Peak (short context) Atlas130 tok/s 3.3x vLLM~38 tok/s Qwen3.5-122B (NVFP4) on a single DGX Spark 122B parameter model, single node. ~54 tok/s with EP=2. Atlas vLLM Decode throughput Atlas~50 tok/s 3.3x vLLM~15 tok/s Supported Models Model matrix Every model gets hand-tuned CUDA kernels. We expand based on what the community runs. All models

※ 著作権に配慮し、引用は冒頭3段落までです。続きは元記事をご覧ください。

— 元記事を読む ↗

元記事を読む ↗