This supplementary material complements the main paper by providing full videos, technical details and results with our method trained on another backbone, WAN.
Teaser Videos
Here we show the full videos used to make fig. 1.
Eiffel Tower: pitch from -80° to 80°
"The Eiffel Tower in Paris at early morning, bathed in soft golden light, with a clear blue sky and the Seine River gently flowing nearby."
Harbor: roll from -85° to 85°
"A medieval harbor at sunset with tall ships docking, sailors unloading barrels of spice while amber light glints off varnished wood and gentle waves lap against stone piers."
Training Set Examples
Here we show training data samples, complementing fig. 4. Each sample displays the video alongside camera trajectory visualizations from three perspectives: top view, side view, and a 3D view.
The cameras are color-coded to represent their position in time: the first camera is purple and the last one is red.
[Seven training samples, each shown in four panels: Video, Top View, Side View, and 3D View.]
Definition of φ
Here we give the full derivation of the $\varphi(\cdot)$ function used in eq. 2, which makes the yaw relative to the first frame while preserving the absolute pitch and roll.
We take inspiration from the look-at matrix, frequently used in computer graphics to remove the "roll" component from a rotation matrix
when specifying a direction in which to look.
In our case, we modify this procedure to use the gravity (down) vector as the reference direction, and obtain a matrix $\mathbf{R}_{\text{no_yaw}}$ that has zero yaw
for the first frame, while still encoding its pitch and roll.
We use the following right-handed coordinate convention: +x is right ($\mathbf{r}$), +y is down ($\mathbf{d}$), +z is forward ($\mathbf{f}$).
$$\begin{align*}
\mathbf{d} &\gets \mathbf{R}_{\text{pano},0}^{-1} [0~1~0]^\top && \text{
$\triangleright$ Compute the down direction vector $\mathbf{d}$ in camera space.
} \\
\mathbf{r} &\gets \mathbf{d} \times [0~0~1]^\top && \text{
$\triangleright$ Compute the right direction vector $\mathbf{r}$ from the down vector and a temporary forward vector.
} \\
\mathbf{r} &\gets \frac{\mathbf{r}}{\|\mathbf{r}\|} && \text{
$\triangleright$ Normalize the right direction vector $\mathbf{r}$.
} \\
\mathbf{f} &\gets \mathbf{r} \times \mathbf{d} && \text{
$\triangleright$ Compute the forward direction vector $\mathbf{f}$.
} \\
\mathbf{R}_{\text{no_yaw}} &\gets \begin{bmatrix}
\mathbf{r}^\top \\
\mathbf{d}^\top \\
\mathbf{f}^\top
\end{bmatrix} && \text{
$\triangleright$ Compute the matrix containing the absolute pitch and roll for the first frame.
} \\
\varphi(\mathbf{R}_{\text{pano},0}) &\gets \mathbf{R}_{\text{no_yaw}} \mathbf{R}_{\text{pano},0}^{-1} && \text{
$\triangleright$ Compute the matrix $\varphi(\mathbf{R}_{\text{pano},0})$, which can be applied to an extrinsic matrix to remove its yaw component.
}
\end{align*}$$
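The steps above can be sketched in plain Python. This is only a minimal illustration, not the implementation used in the paper: it assumes $\mathbf{R}_{\text{pano},0}$ is a proper 3×3 rotation given as nested lists (so its inverse is its transpose), and the helper names (`cross`, `transpose`, `matvec`, `matmul`, `phi`, `rot_x`, `rot_y`) are hypothetical. The guard against a vanishing cross product, which occurs when the down vector is parallel to the temporary forward vector, is an added assumption; the derivation does not say how that case is handled.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def phi(r_pano_0, eps=1e-8):
    """Sketch of phi(R_pano,0), following the derivation step by step."""
    r_inv = transpose(r_pano_0)              # inverse of a rotation matrix
    d = matvec(r_inv, [0.0, 1.0, 0.0])       # down vector in camera space
    r = cross(d, [0.0, 0.0, 1.0])            # right = down x temporary forward
    norm = math.sqrt(sum(c * c for c in r))
    if norm < eps:                           # assumption: reject the degenerate case
        raise ValueError("down vector is parallel to the temporary forward vector")
    r = [c / norm for c in r]
    f = cross(r, d)                          # forward = right x down
    r_no_yaw = [r, d, f]                     # rows are r^T, d^T, f^T
    return matmul(r_no_yaw, r_inv)

def rot_x(deg):
    """Rotation about +x (illustrative pitch-only rotation)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(deg):
    """Rotation about +y, the down axis (illustrative yaw-only rotation)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
```

Under these assumptions, a yaw-only first-frame rotation `R = rot_y(30.0)` gives `matmul(phi(R), R)` equal to the identity, while a pitch-only rotation `R = rot_x(40.0)` gives `phi(R)` equal to the identity: the yaw is cancelled and the pitch is left untouched.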
For completeness, here is equation 2, which computes the absolute camera extrinsics. Note that we slightly abuse notation by assuming rotation matrices are in homogeneous coordinates
to match the \(4\times 4\) dimensions of extrinsics matrices.
Here we show results where we trained a different diffusion model, WAN 2.2 (5B), using the same data and procedure, adjusted to the conditioning mechanism of WAN.
Training details
We randomly initialize a new camera encoder block composed of two convolutional layers, taking as input the absolute Plücker rays and outputting a feature map that is added
to the patchified noisy latents, just before the first DiT layer. We train the camera encoder and finetune the DiT model for 70,400 iterations on 4 A100 (80 GB) GPUs.
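For readers unfamiliar with Plücker rays, the sketch below computes the 6-D Plücker coordinates of a single pixel ray. It is an illustration under stated assumptions rather than the training code: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-to-world inputs, the (moment, direction) ordering, and the function name `plucker_ray` are all hypothetical choices. "Absolute" is taken here to mean that the rays are computed from the absolute extrinsics.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def plucker_ray(u, v, fx, fy, cx, cy, rotation_c2w, cam_center):
    """Return [moment (3), direction (3)] for the ray through pixel (u, v)."""
    # Back-project the pixel to a direction in camera space (+z forward).
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Rotate the direction into world space and normalize it.
    d_world = [sum(rotation_c2w[i][k] * d_cam[k] for k in range(3))
               for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    d_world = [c / norm for c in d_world]
    # The moment is the cross product of the camera center and the direction.
    moment = cross(cam_center, d_world)
    return moment + d_world

# Example: a camera at (1, 0, 0) with identity rotation, central pixel.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ray = plucker_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0,
                  identity, [1.0, 0.0, 0.0])
```

By construction the moment is orthogonal to the direction, so the six numbers encode both where the ray points and where it sits in space.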
The following table shows the quantitative results of this model. Training on this more powerful backbone yields lower absolute and relative rotation errors, at the cost of
slightly worse CLIP, FID and FVD scores (lower CLIP, higher FID and FVD). We show qualitative results for both backbones in the "Additional Qualitative Results" section below.
Quantitative results
| Method | PitchErr (abs.) $\downarrow$ | GravityErr (abs.) $\downarrow$ | RotErr (rel.) $\downarrow$ | TransErr $\downarrow$ | CLIP $\uparrow$ | FID $\downarrow$ | FVD $\downarrow$ |
|---|---|---|---|---|---|---|---|
| Ours (WAN backbone) | 11.79 | 14.23 | 9.33 | 0.75 | 19.27 | 119.78 | 1081.13 |
| Ours (AC3D backbone) | 23.79 | 27.06 | 14.25 | 0.75 | 21.35 | 110.71 | 896.84 |
Evaluation Dataset Details
Here, we supply additional details and statistics on our evaluation benchmark, SpatialVID-extreme, extending sec. 4.3 of the paper.
Random rotation trajectories
We randomly sample a start roll from \([-40°, 40°]\), and a start pitch from \([-90°, 90°]\).
The final roll and pitch are sampled the same way, and the end yaw is sampled from \([-180°, 180°]\).
The intermediate rotations are interpolated using spherical linear interpolation. Since the method used to compute absolute orientation metrics (Perspective Fields) is never trained
with rolls beyond 45° in magnitude, we resample a random rotation trajectory whenever an intermediate frame falls outside the roll bounds of \([-40°, 40°]\).
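The sampling procedure described above can be sketched as follows. This is a hedged illustration, not the benchmark code. The Euler convention (R = R_y(yaw) R_x(pitch) R_z(roll), with +y down and +z forward), the zero start yaw, the frame count, the attempt limit, and all function names are assumptions made for the example.

```python
import math
import random

ROLL_BOUND_DEG = 40.0

def axis_quat(axis, deg):
    """Unit quaternion (w, x, y, z) for a rotation of `deg` about a unit axis."""
    half = math.radians(deg) / 2.0
    s = math.sin(half)
    return (math.cos(half), axis[0] * s, axis[1] * s, axis[2] * s)

def quat_mul(a, b):
    """Hamilton product a * b."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
            w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
            w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
            w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2)

def euler_to_quat(yaw, pitch, roll):
    """Assumed convention: R = R_y(yaw) R_x(pitch) R_z(roll), angles in degrees."""
    q = quat_mul(axis_quat((0.0, 1.0, 0.0), yaw), axis_quat((1.0, 0.0, 0.0), pitch))
    return quat_mul(q, axis_quat((0.0, 0.0, 1.0), roll))

def quat_roll_deg(q):
    """Roll under the assumed convention: atan2(R[1][0], R[1][1])."""
    w, x, y, z = q
    return math.degrees(math.atan2(2.0 * (x * y + w * z), 1.0 - 2.0 * (x * x + z * z)))

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # take the shorter arc
        q1, dot = tuple(-c for c in q1), -dot
    if dot > 0.9995:                   # nearly identical: fall back to lerp
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)

def sample_trajectory(rng, num_frames=49, max_attempts=1000):
    """Sample start/end orientations, interpolate, and resample on roll violations."""
    for _ in range(max_attempts):
        q_start = euler_to_quat(0.0,  # assumption: yaw is zero at the first frame
                                rng.uniform(-90.0, 90.0),
                                rng.uniform(-ROLL_BOUND_DEG, ROLL_BOUND_DEG))
        q_end = euler_to_quat(rng.uniform(-180.0, 180.0),
                              rng.uniform(-90.0, 90.0),
                              rng.uniform(-ROLL_BOUND_DEG, ROLL_BOUND_DEG))
        frames = [slerp(q_start, q_end, i / (num_frames - 1))
                  for i in range(num_frames)]
        if all(abs(quat_roll_deg(q)) <= ROLL_BOUND_DEG + 1e-6 for q in frames):
            return frames
    raise RuntimeError("no valid trajectory found within max_attempts")
```

A call such as `sample_trajectory(random.Random(0), num_frames=25)` returns 25 unit quaternions whose roll stays within the bounds; the rejection loop is what keeps the intermediate frames inside the range that Perspective Fields can handle.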
Evaluation dataset statistics
Here we show that the original SpatialVID-HQ dataset provides limited diversity in absolute orientation and relative rotations. Our new evaluation benchmark, SpatialVID-extreme,
provides a broader coverage of Euler angles and total angular distance.
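The text does not define "total angular distance" precisely. One plausible reading, used in the sketch below purely as an assumption, is the sum of the geodesic angles between consecutive per-frame rotations. The function names and the nested-list matrix representation are hypothetical.

```python
import math

def relative_angle_deg(r_a, r_b):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    # trace(r_a^T r_b) equals the element-wise (Frobenius) inner product.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))

def total_angular_distance_deg(rotations):
    """Sum of the angles between consecutive rotations in a trajectory."""
    return sum(relative_angle_deg(a, b)
               for a, b in zip(rotations, rotations[1:]))

def rot_y(deg):
    """Rotation about +y, used only to build the example trajectory."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Example: a pan through 0, 30 and 90 degrees covers 30 + 60 degrees in total.
example_total = total_angular_distance_deg([rot_y(0.0), rot_y(30.0), rot_y(90.0)])
```

Under this reading, a trajectory that oscillates accumulates distance even if its first and last frames coincide, which is why the measure is reported in addition to the Euler-angle coverage.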
[Figure: evaluation dataset statistics.]
Details of Prompt Engineering
For the AC3D baselines, we provide the model with the absolute camera orientation through text, as mentioned in sec. 4.4 of the paper. Here, we show the code used to generate
these camera descriptions.
We first take the absolute camera extrinsics \(E_\text{abs}\) for the video and convert them to Euler angles.
We then describe in text only the first frame's pitch and roll and the last frame's yaw, pitch and roll.
We do not describe the in-between frames, both because the trajectories are linear and to avoid overwhelming the model.
def get_camera_description_from_absolute_c2w(c2w_absolute):
    """Describe the first and last absolute camera orientations in words."""
    # c2w_to_pitch_roll_yaw is a helper that is not shown here; the thresholds
    # below treat the angles it returns as degrees.
    euler_angles = c2w_to_pitch_roll_yaw(c2w_absolute)
    first_pitch = euler_angles['pitch'][0].item()
    first_roll = euler_angles['roll'][0].item()
    first_yaw = euler_angles['yaw'][0].item()  # unused: the start yaw is not described
    last_pitch = euler_angles['pitch'][-1].item()
    last_roll = euler_angles['roll'][-1].item()
    last_yaw = euler_angles['yaw'][-1].item()

    def describe_angle_shot(pitch):
        """Describe pitch angle shot."""
        pitch_rounded = round(pitch)
        if -5 <= pitch_rounded <= 5:
            return "near straight-on shot"
        elif pitch_rounded > 0:
            if 5 < pitch_rounded <= 20:
                return f"small tilt-up of {pitch_rounded} degrees"
            elif 20 < pitch_rounded <= 45:
                return f"large tilt-up of {pitch_rounded} degrees"
            else:
                return f"extreme tilt-up of {pitch_rounded} degrees"
        else:
            abs_pitch = abs(pitch_rounded)
            if 5 < abs_pitch <= 20:
                return f"small tilt-down of {abs_pitch} degrees"
            elif 20 < abs_pitch <= 45:
                return f"large tilt-down of {abs_pitch} degrees"
            else:
                return f"extreme tilt-down of {abs_pitch} degrees"

    def describe_dutch_angle(roll):
        """Describe roll (Dutch angle) with clockwise/counterclockwise."""
        roll_rounded = round(roll)
        if -5 <= roll_rounded <= 5:
            return "near level shot"
        abs_roll = abs(roll_rounded)
        if abs_roll <= 20:
            magnitude = "small"
        elif abs_roll <= 45:
            magnitude = "large"
        else:
            magnitude = "extreme"
        # Positive roll is counterclockwise, negative is clockwise
        direction = "counterclockwise" if roll_rounded > 0 else "clockwise"
        return f"a {magnitude} Dutch angle tilted {direction} {abs_roll} degrees"

    def describe_yaw(yaw):
        """Describe yaw (pan) direction."""
        yaw_rounded = round(yaw)
        abs_yaw = abs(yaw_rounded)
        direction = "right" if yaw_rounded > 0 else "left"
        return f"pan of {abs_yaw} degrees turned {direction}"

    # Build start description (near-zero pitch and roll are omitted)
    start_parts = []
    start_pitch_desc = describe_angle_shot(first_pitch)
    start_roll_desc = describe_dutch_angle(first_roll)
    if start_pitch_desc != "near straight-on shot":
        start_parts.append(start_pitch_desc)
    if start_roll_desc != "near level shot":
        start_parts.append(start_roll_desc)

    # Build end description (the yaw is always described)
    end_parts = []
    end_yaw_desc = describe_yaw(last_yaw)
    end_pitch_desc = describe_angle_shot(last_pitch)
    end_roll_desc = describe_dutch_angle(last_roll)
    if end_yaw_desc:
        end_parts.append(end_yaw_desc)
    if end_pitch_desc != "near straight-on shot":
        end_parts.append(end_pitch_desc)
    if end_roll_desc != "near level shot":
        end_parts.append(end_roll_desc)

    # Construct final description
    description_parts = []
    start_text = "The camera starts at " if len(start_parts) == 1 else "The camera starts with "
    description_parts.append(start_text + ", and ".join(start_parts))
    end_text = "The camera ends with " + ", ".join(end_parts)
    description_parts.append(end_text)
    # Note: the [:-1] slice drops the final character of the joined string,
    # which is why the camera prompts in the qualitative results end in "degree".
    return ". ".join(description_parts)[:-1]
After obtaining the camera description, we append it to the regular prompt. Refer to the following qualitative results for examples of camera descriptions.
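As a minimal sketch of this last step, the snippet below joins a scene prompt and a camera description. The helper name `append_camera_description` and the single-space separator are assumptions; the text only states that the description comes after the regular prompt. The strings are taken from one of the qualitative examples in this document.

```python
def append_camera_description(prompt, camera_description, separator=" "):
    """Place the camera description after the regular prompt (separator assumed)."""
    return f"{prompt.strip()}{separator}{camera_description.strip()}"

scene_prompt = ("A bright forest with towering green-leaved trees "
                "forming a dense canopy overhead.")
camera_prompt = ("The camera starts with extreme tilt-up of 56 degrees, and a large "
                 "Dutch angle tilted clockwise 39 degrees. The camera ends with pan of "
                 "60 degrees turned left, a small Dutch angle tilted counterclockwise "
                 "16 degree")
full_prompt = append_camera_description(scene_prompt, camera_prompt)
```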
Additional Qualitative Results
Here are additional qualitative results, complementing fig. 7.
We show 25 randomly sampled videos from our SpatialVID-extreme dataset.
The black overlay represents the input camera orientation and the red overlay represents the estimated camera orientation (obtained using Perspective Fields and VGGT), as described in the paper.
For the "AC3D + cam. text" baselines, we additionally give the model the camera prompt, which indicates the absolute camera orientation.
Prompt: A serene lakeside town features a stone church with a red roof, surrounded by greenery and reflective waters, exuding a peaceful, picturesque charm.
Camera prompt: The camera starts with large tilt-up of 40 degrees, and a small Dutch angle tilted counterclockwise 16 degrees. The camera ends with pan of 171 degrees turned right, extreme tilt-up of 66 degree
Prompt: A poised wingsuit flyer stands on a grassy mountain peak, overlooking a forested valley and a distant town beneath an overcast sky, evoking tension and grandeur.
Camera prompt: The camera starts with small tilt-down of 7 degrees, and a small Dutch angle tilted clockwise 10 degrees. The camera ends with pan of 108 degrees turned right, extreme tilt-down of 82 degrees, a large Dutch angle tilted clockwise 24 degree
Prompt: A modern glass-and-steel skyscraper rises against a clear blue sky, surrounded by diverse urban architecture in a bustling cityscape.
Camera prompt: The camera starts at extreme tilt-down of 47 degrees. The camera ends with pan of 129 degrees turned right, extreme tilt-down of 60 degrees, a small Dutch angle tilted counterclockwise 6 degree
Prompt: A bright forest with towering green-leaved trees forming a dense canopy overhead.
Camera prompt: The camera starts with extreme tilt-up of 56 degrees, and a large Dutch angle tilted clockwise 39 degrees. The camera ends with pan of 60 degrees turned left, a small Dutch angle tilted counterclockwise 16 degree
Prompt: A dense, serene forest with towering trees and a winding path, bathed in soft, diffused light that enhances its mysterious, tranquil atmosphere.
Camera prompt: The camera starts with small tilt-down of 16 degrees, and a large Dutch angle tilted counterclockwise 31 degrees. The camera ends with pan of 6 degrees turned right, large tilt-down of 23 degrees, a small Dutch angle tilted counterclockwise 19 degree
Prompt: A wingsuited figure stands on a cliff, surrounded by surreal, inverted colors, capturing a tense, adventurous moment against a vast, dreamlike landscape.
Camera prompt: The camera starts with small tilt-up of 12 degrees, and a large Dutch angle tilted counterclockwise 28 degrees. The camera ends with pan of 155 degrees turned left, small tilt-down of 17 degrees, a large Dutch angle tilted counterclockwise 25 degree
Prompt: A man in a blue shirt works on a marina dock during the daytime, surrounded by yachts and a clear blue sky, evoking a vibrant, luxurious atmosphere.
Camera prompt: The camera starts with large tilt-down of 22 degrees, and a small Dutch angle tilted clockwise 15 degrees. The camera ends with pan of 171 degrees turned left, extreme tilt-down of 58 degrees, a small Dutch angle tilted counterclockwise 16 degree
Prompt: A low-angle urban scene features towering glass skyscrapers under a vibrant blue sky, with street lamps and traffic signals adding a grounded contrast to the imposing skyline.
Camera prompt: The camera starts with large tilt-down of 26 degrees, and a large Dutch angle tilted counterclockwise 37 degrees. The camera ends with pan of 113 degrees turned left, extreme tilt-down of 87 degrees, a small Dutch angle tilted clockwise 12 degree
Prompt: A serene forest under a clear blue sky, where tall trees stand alongside evergreens, creating a tranquil atmosphere.
Camera prompt: The camera starts with extreme tilt-up of 70 degrees, and a small Dutch angle tilted clockwise 16 degrees. The camera ends with pan of 147 degrees turned right, small tilt-down of 10 degrees, a large Dutch angle tilted clockwise 27 degree
Prompt: A cozy, sophisticated home office features a large wooden desk, an armchair, and bookshelves, bathed in warm, ambient lighting that enhances its intellectual and inviting atmosphere.
Camera prompt: The camera starts with small tilt-up of 10 degrees, and a small Dutch angle tilted clockwise 20 degrees. The camera ends with pan of 82 degrees turned right, extreme tilt-up of 65 degrees, a small Dutch angle tilted clockwise 6 degree
Prompt: A luxurious breakfast tray floats in a clear pool, surrounded by modern outdoor lounges, exuding calm and opulence under bright sunlight.
Camera prompt: The camera starts with large tilt-up of 28 degrees, and a small Dutch angle tilted counterclockwise 7 degrees. The camera ends with pan of 25 degrees turned left, large tilt-up of 27 degrees, a small Dutch angle tilted counterclockwise 6 degree
Prompt: A cozy, nostalgic living room features a polished wooden console with a record player, radio, and a vinyl record, bathed in warm, even light that highlights its retro design.
Camera prompt: The camera starts with large tilt-up of 36 degrees, and a large Dutch angle tilted counterclockwise 31 degrees. The camera ends with pan of 127 degrees turned left, large tilt-down of 30 degrees, a large Dutch angle tilted counterclockwise 21 degree
Prompt: A vibrant street scene features a large mural of a woman in a blue dress against a castle, set within a town under a blue sky.
Camera prompt: The camera starts with extreme tilt-down of 69 degrees, and a large Dutch angle tilted clockwise 35 degrees. The camera ends with pan of 173 degrees turned right, small tilt-down of 8 degree
Prompt: A serene forest under a clear blue sky, with yellow-tinted trees and undergrowth creating a peaceful atmosphere.
Camera prompt: The camera starts with large tilt-down of 27 degrees, and a small Dutch angle tilted counterclockwise 17 degrees. The camera ends with pan of 97 degrees turned right, extreme tilt-up of 72 degrees, a large Dutch angle tilted counterclockwise 36 degree
Prompt: A luxurious breakfast table set on a black speckled surface evokes an elegant dining atmosphere.
Camera prompt: The camera starts at small tilt-down of 18 degrees. The camera ends with pan of 68 degrees turned right, extreme tilt-up of 73 degrees, a small Dutch angle tilted counterclockwise 18 degree
Prompt: A vibrant forest canopy bathed in green and yellow foliage frames a blue sky, creating a peaceful, tranquil atmosphere under soft, diffused light.
Camera prompt: The camera starts with extreme tilt-up of 80 degrees, and a small Dutch angle tilted counterclockwise 11 degrees. The camera ends with pan of 31 degrees turned left, a large Dutch angle tilted clockwise 29 degree
Prompt: A dense forest bathed in warm, soft light, with towering trees reaching toward a partially hidden sky, creating a serene and awe-inspiring natural landscape.
Camera prompt: The camera starts with extreme tilt-down of 76 degrees, and a small Dutch angle tilted clockwise 10 degrees. The camera ends with pan of 100 degrees turned left, extreme tilt-down of 75 degrees, a large Dutch angle tilted clockwise 22 degree
Prompt: A winding road winds through a misty green valley, framed by towering mountains under a clear blue sky, evoking a tranquil yet surreal atmosphere.
Camera prompt: The camera starts at large tilt-up of 23 degrees. The camera ends with pan of 30 degrees turned left, small tilt-down of 18 degrees, a large Dutch angle tilted counterclockwise 29 degree
Prompt: A rain-soaked urban street at dusk, illuminated by bright storefronts and car headlights, with two men walking under umbrellas against a subdued, reflective backdrop.
Camera prompt: The camera starts with extreme tilt-down of 76 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 62 degrees turned right, extreme tilt-down of 74 degrees, a large Dutch angle tilted clockwise 28 degree
Prompt: A modern yacht interior features a woman in a striped suit, a plush couch, a dining area, and elegant lighting through window blinds, evoking luxury and sophistication.
Camera prompt: The camera starts with extreme tilt-down of 80 degrees, and a large Dutch angle tilted clockwise 26 degrees. The camera ends with pan of 100 degrees turned right, large tilt-up of 35 degrees, a large Dutch angle tilted counterclockwise 23 degree
Prompt: A quiet forest path winds uphill through towering trees and ferns, bathed in soft, mysterious light that enhances the sense of an untouched, enchanted woodland.
Camera prompt: The camera starts at large tilt-up of 36 degrees. The camera ends with pan of 113 degrees turned right, large tilt-down of 36 degrees, a large Dutch angle tilted counterclockwise 28 degree
Prompt: A serene pond with swans and cygnets, surrounded by stone edges and greenery, under an overcast sky, capturing quiet natural life.
Camera prompt: The camera starts with large tilt-down of 23 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 82 degrees turned left, small tilt-up of 16 degrees, a small Dutch angle tilted clockwise 19 degree
Prompt: A peaceful suburban neighborhood under bright sunlight, featuring well-kept homes, lush greenery, and a quiet, residential charm.
Camera prompt: The camera starts with small tilt-down of 13 degrees, and a small Dutch angle tilted counterclockwise 12 degrees. The camera ends with pan of 13 degrees turned right, a large Dutch angle tilted clockwise 37 degree
Prompt: A peaceful cherry blossom-lined riverside in Tokyo, where nature and urban architecture coexist under clear, blue skies, evoking calm and reflection.
Camera prompt: The camera starts with extreme tilt-up of 87 degrees, and a large Dutch angle tilted counterclockwise 28 degrees. The camera ends with pan of 43 degrees turned left, small tilt-down of 18 degrees, a large Dutch angle tilted counterclockwise 35 degree
Prompt: A quiet stone church with red pews, arched windows, and historical architecture exudes reverence and solemnity through its timeless design and natural lighting.
Camera prompt: The camera starts at large tilt-up of 35 degrees. The camera ends with pan of 43 degrees turned left, large tilt-down of 39 degrees, a small Dutch angle tilted counterclockwise 15 degree