ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

May 9, 2026 #Tech

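ActCam, the method proposed in this work, is a zero-shot video generation approach that gives simultaneous, fine-grained control over an actor's motion and the camera trajectory.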

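The method builds on a pretrained image-to-video diffusion model that accepts scene-depth and character-pose conditioning. It transfers a character's motion from a source video into a new scene and controls the camera's intrinsic and extrinsic parameters frame by frame.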
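As a rough illustration of what per-frame control of intrinsic and extrinsic parameters means geometrically, the sketch below projects a single 3D joint through a standard pinhole camera whose pose changes from frame to frame. It is not ActCam's code: the `Camera` class, `project_point`, and all numeric values are hypothetical, and the source does not describe how its pose conditions are actually rendered.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Camera:
    """Hypothetical per-frame pinhole camera (not an ActCam data structure)."""
    rotation: Sequence[Sequence[float]]  # extrinsic: 3x3 world-to-camera rotation
    translation: Sequence[float]         # extrinsic: world-to-camera translation
    focal: Tuple[float, float]           # intrinsic: (fx, fy) in pixels
    principal: Tuple[float, float]       # intrinsic: (cx, cy) in pixels


def project_point(point_world: Sequence[float], cam: Camera) -> Tuple[float, float]:
    """Project one 3D world point to pixel coordinates with a pinhole model."""
    # Extrinsics: X_cam = R @ X_world + t
    x, y, z = (
        sum(cam.rotation[r][c] * point_world[c] for c in range(3)) + cam.translation[r]
        for r in range(3)
    )
    if z <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsics: perspective divide, then scale by focal length and shift
    fx, fy = cam.focal
    cx, cy = cam.principal
    return (fx * x / z + cx, fy * y / z + cy)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Two frames of a made-up trajectory: in the second frame the point sits
# 2 units farther along the camera's z axis (the camera has pulled back).
cameras: List[Camera] = [
    Camera(IDENTITY, (0.0, 0.0, 0.0), (500.0, 500.0), (320.0, 240.0)),
    Camera(IDENTITY, (0.0, 0.0, 2.0), (500.0, 500.0), (320.0, 240.0)),
]
joint = (0.5, -0.25, 2.0)  # a made-up 3D joint position in world coordinates
pixels = [project_point(joint, cam) for cam in cameras]
```

The same joint lands on a different pixel once the camera moves, which suggests why pose and depth conditions would need to be regenerated for every frame of the target trajectory to stay geometrically consistent.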

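ActCam runs a single sampling process with a two-phase conditioning schedule: the early denoising steps are conditioned on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines the fine details.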
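The two-phase schedule can be pictured as a per-step plan that says which control signals are active. The following is a minimal sketch under stated assumptions, not the authors' implementation: `StepConditions`, `two_phase_schedule`, and the `depth_fraction` parameter are hypothetical names, and the source does not say at which step depth is dropped.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepConditions:
    """Which control signals are active at one denoising step."""
    step: int
    use_pose: bool
    use_depth: bool


def two_phase_schedule(num_steps: int, depth_fraction: float) -> List[StepConditions]:
    """Build a per-step conditioning plan.

    depth_fraction is a hypothetical knob: the share of early steps that
    also receive sparse depth. The source does not state a value.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must lie in [0, 1]")
    switch_step = round(num_steps * depth_fraction)
    return [
        # Phase 1 (step < switch_step): pose + sparse depth.
        # Phase 2 (remaining steps): pose only.
        StepConditions(step=i, use_pose=True, use_depth=i < switch_step)
        for i in range(num_steps)
    ]


# Illustrative values only: 10 steps, depth kept for the first 30% of them.
plan = two_phase_schedule(num_steps=10, depth_fraction=0.3)
```

In a real sampler, each entry of `plan` would decide whether the depth condition is passed to the diffusion model at that step; pose stays on throughout, matching the description above.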

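In evaluations on multiple benchmarks, ActCam improved camera adherence and motion fidelity over pose-only control and other pose-and-camera methods, and it was preferred in human evaluations, especially under large viewpoint changes.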

Excerpt from the original (English, first three paragraphs only)


Abstract: For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: this https URL.

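Note: Out of respect for copyright, the quotation is limited to the first three paragraphs. Please see the original article for the rest.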

Read the original article ↗
