Deploying a Local LLM on the RTX 5090: Five Hours of Trial and Tuning

2026-05-06 #Tech

This post recounts the author's hands-on experience deploying local large language models (LLMs) on an RTX 5090 mobile GPU.

The experiments used models including Qwen3-Coder (30B) and Llama3.3 (70B), configured and tuned under the Ollama framework.

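An initial attempt with the Qwen-Code CLI failed. The author then switched to OpenCode and, by adjusting the context length and enabling Flash Attention together with a Q8 KV cache, eventually fit Qwen3-Coder with a 100K context onto the GPU, reaching an inference speed of 50-60 tok/s.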
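As a rough illustration of the settings named above, the sketch below builds an environment for an Ollama server with Flash Attention, a q8_0 KV cache, and a larger default context. The environment variable names are the ones Ollama documents; the helper name `ollama_server_env` and the exact value of 100,000 tokens are assumptions made for this example, not the author's actual configuration.

```python
import os


def ollama_server_env(context_tokens: int, kv_cache_type: str = "q8_0") -> dict[str, str]:
    """Return a copy of the current environment with Ollama server settings added.

    Hypothetical helper for illustration; only the variable names come from Ollama.
    """
    env = dict(os.environ)
    env["OLLAMA_FLASH_ATTENTION"] = "1"                 # enable Flash Attention
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type         # quantize the KV cache (needs Flash Attention)
    env["OLLAMA_CONTEXT_LENGTH"] = str(context_tokens)  # default context window, in tokens
    return env


# Assumed reading of "100K context"; the original post may use a different value.
env = ollama_server_env(100_000)

# To start the server with these settings (requires Ollama to be installed):
# import subprocess; subprocess.run(["ollama", "serve"], env=env)
```

The settings are applied to the server process rather than to each request, so a client such as OpenCode would pick them up without changes on its side.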

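The author sums up the lessons learned from the deployment, including the maturity of the tooling, the practical limits on context size, and the importance of the Q8 KV cache, and stresses that 30B-class models are a good fit for the RTX 5090 mobile GPU.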
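To make the point about context size and the Q8 KV cache concrete, here is a back-of-the-envelope estimate of KV cache memory at a 100K context. The layer count, KV head count, and head dimension are assumed values for a 30B-class grouped-query-attention model and are not taken from the original post; the q8_0 width follows the GGML block layout of 32 int8 values plus one 2-byte scale.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: float) -> float:
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# ASSUMED architecture for illustration only (not stated in the source).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
CONTEXT_TOKENS = 100_000

F16_BYTES = 2.0        # 16-bit float per cached value
Q8_0_BYTES = 34 / 32   # q8_0: 32 int8 values + one 2-byte scale per block

f16_gib = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, F16_BYTES) / 2**30
q8_gib = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM, Q8_0_BYTES) / 2**30

print(f"f16 KV cache:  {f16_gib:.1f} GiB")
print(f"q8_0 KV cache: {q8_gib:.1f} GiB")
```

Under these assumed dimensions the cache shrinks from roughly 9.2 GiB to about 4.9 GiB, which suggests why quantizing it can decide whether a long context fits alongside the model weights.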

Opening of the original post (English, first 3 paragraphs only)

may 5, 2026 | afternoon, at my desk | ~local-llm

One afternoon I got curious. How far could I push a local AI stack on my own machine? What can I actually do with one? Pure experiment, sat down, built it out, pushed it to the edge. Result: not Anthropic-grade, but genuinely useful.

>> The Hardware

※ For copyright reasons, only the first 3 paragraphs are quoted. Please read the original post for the full content.

Read the original ↗
