Recent progress in text-to-video generation has achieved remarkable realism, yet fine-grained control over
camera motion and orientation remains elusive. Existing approaches typically encode camera trajectories
through relative or ambiguous representations, limiting explicit geometric control. We introduce
GimbalDiffusion, a framework that enables camera control grounded in physical-world coordinates, using
gravity as a global reference. Instead of describing motion relative to previous frames, our method defines
camera trajectories in an absolute coordinate system, allowing precise and interpretable control over camera
parameters without requiring an initial reference frame. We leverage panoramic 360-degree videos to construct
a wide variety of camera trajectories, well beyond the predominantly straight, forward-facing trajectories
seen in conventional video data. To further enhance camera guidance, we introduce null-pitch conditioning,
an annotation strategy that reduces the model's reliance on text content when it conflicts with the
camera specification (e.g., a prompt describing grass while the camera points towards the sky). Finally, we establish
a benchmark for camera-aware video generation by rebalancing SpatialVID-HQ for comprehensive evaluation
under wide camera pitch variation. Together, these contributions advance the controllability and robustness
of text-to-video models, enabling precise, gravity-aligned camera manipulation within generative frameworks.
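As an illustrative, non-authoritative sketch (the paper's exact convention is not given here): one way to express an absolute, gravity-referenced camera pose is to measure pitch, yaw, and roll against a gravity-aligned world frame, so that a trajectory becomes a sequence of absolute orientations rather than frame-to-frame deltas. All function and variable names below are hypothetical.

```python
import numpy as np

def gravity_aligned_rotation(pitch_deg, yaw_deg, roll_deg):
    """Camera orientation in a gravity-aligned world frame (z-axis opposite to gravity).

    Hypothetical parameterization for illustration only: yaw about the up axis,
    then pitch about the camera's right axis, then roll about the viewing axis.
    """
    p, y, r = np.radians([pitch_deg, yaw_deg, roll_deg])
    Rz = np.array([[np.cos(y), -np.sin(y), 0.0],
                   [np.sin(y),  np.cos(y), 0.0],
                   [0.0,        0.0,       1.0]])  # yaw about the gravity (up) axis
    Rx = np.array([[1.0, 0.0,        0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p),  np.cos(p)]])  # pitch measured from the horizon
    Ry = np.array([[ np.cos(r), 0.0, np.sin(r)],
                   [ 0.0,       1.0, 0.0],
                   [-np.sin(r), 0.0, np.cos(r)]])  # roll about the viewing axis
    return Rz @ Rx @ Ry

# A trajectory is then a sequence of absolute poses; no initial reference frame is needed.
# Example: the camera tilts from the horizon (pitch 0) up towards the sky over 16 frames.
trajectory = [gravity_aligned_rotation(pitch_deg=60.0 * t / 15, yaw_deg=0.0, roll_deg=0.0)
              for t in range(16)]
```

Under such an absolute scheme, a condition like "pitch = 60 degrees" carries the same meaning for every clip, which is what makes gravity-aligned camera control interpretable in the sense described above.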