Supplementary material

This supplementary material complements the main paper by providing full videos, technical details and results with our method trained on another backbone, WAN.

Teaser Videos

Here we show the full videos used to make fig. 1.

Eiffel Tower: pitch from -80° to 80°

"The Eiffel Tower in Paris at early morning, bathed in soft golden light, with a clear blue sky and the Seine River gently flowing nearby."

Harbor: roll from -85° to 85°

"A medieval harbor at sunset with tall ships docking, sailors unloading barrels of spice while amber light glints off varnished wood and gentle waves lap against stone piers."

Training Set Examples

Here we show training data samples, complementing fig. 4. Each sample displays the video alongside camera trajectory visualizations from three perspectives: top view, side view, and a 3D view. The cameras are color-coded to represent their position in time: the first camera is purple and the last one is red.

[Seven training samples, each shown as four panels: Video, Top View, Side View, 3D View]

Definition of φ

Here we give the full derivation of the function $\varphi(\cdot)$ used in eq. 2, which makes the yaw relative to the first frame while preserving the absolute pitch and roll.
We take inspiration from the Look At matrix, frequently used in computer graphics to remove the "roll" component from a rotation matrix when specifying a direction in which to "look at". In our case, we slightly modify this procedure to use the gravity (down) vector as the reference direction, and obtain a matrix $\mathbf{R}_{\text{no_yaw}}$ that has a null yaw component for the first frame while still encoding its pitch and roll.
We use the following right-handed coordinate convention: +x is right ($\mathbf{r}$), +y is down ($\mathbf{d}$), +z is forward ($\mathbf{f}$).

$$\begin{align*}
\mathbf{d} &\gets \mathbf{R}_{\text{pano},0}^{-1} [0~1~0]^\top && \text{ $\triangleright$ Compute the down direction vector $\mathbf{d}$ in camera space. } \\
\mathbf{r} &\gets \mathbf{d} \times [0~0~1]^\top && \text{ $\triangleright$ Compute the right direction vector $\mathbf{r}$ from the down vector and a temporary forward vector. } \\
\mathbf{r} &\gets \frac{\mathbf{r}}{\|\mathbf{r}\|} && \text{ $\triangleright$ Normalize the right direction vector $\mathbf{r}$. } \\
\mathbf{f} &\gets \mathbf{r} \times \mathbf{d} && \text{ $\triangleright$ Compute the forward direction vector $\mathbf{f}$. } \\
\mathbf{R}_{\text{no_yaw}} &\gets \begin{bmatrix} \mathbf{r}^\top \\ \mathbf{d}^\top \\ \mathbf{f}^\top \end{bmatrix} && \text{ $\triangleright$ Compute the matrix containing the absolute pitch and roll for the first frame. } \\
\varphi(\mathbf{R}_{\text{pano},0}) &\gets \mathbf{R}_{\text{no_yaw}} \mathbf{R}_{\text{pano},0}^{-1} && \text{ $\triangleright$ Compute the matrix $\varphi(\mathbf{R}_{\text{pano},0})$, which can be applied to an extrinsic matrix to remove its yaw component. }
\end{align*}$$
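The steps above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the helper names (`no_yaw_matrix`, `phi`, `rot_x`, `rot_y`, `rot_z`) are hypothetical, matrices are plain nested lists, and the Euler composition R_y(yaw) R_x(pitch) R_z(roll) used to build an example rotation is an assumption made only for this check.

```python
import math

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def no_yaw_matrix(r_pano_0):
    """Rows r, d, f as in the derivation (+x right, +y down, +z forward)."""
    r_inv = transpose(r_pano_0)            # inverse of a rotation = transpose
    d = matvec(r_inv, [0.0, 1.0, 0.0])     # down direction in camera space
    r = cross(d, [0.0, 0.0, 1.0])          # right, from a temporary forward
    norm = math.sqrt(sum(c * c for c in r))
    if norm < 1e-8:
        # Down is parallel to the camera forward axis (pitch of +/-90 deg).
        raise ValueError("right vector is undefined for this orientation")
    r = [c / norm for c in r]
    f = cross(r, d)                        # forward direction
    return [r, d, f]

def phi(r_pano_0):
    """phi(R_pano,0) = R_no_yaw @ R_pano,0^{-1}."""
    return matmul(no_yaw_matrix(r_pano_0), transpose(r_pano_0))

# Elementary rotations (degrees), used only to build an example input.
def rot_x(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_y(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(deg):
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# ASSUMPTION: yaw 50, pitch 30, roll 10 composed as R_y @ R_x @ R_z.
R0 = matmul(rot_y(50), matmul(rot_x(30), rot_z(10)))
first_frame_abs = matmul(phi(R0), R0)      # equals R_no_yaw for frame 0
```

Under the assumed composition, `first_frame_abs` keeps the 30° pitch and 10° roll of `R0` and drops its 50° yaw. The construction is undefined when the down vector is parallel to the camera's forward axis, which the sketch reports as an error.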

For completeness, we restate eq. 2, which computes the absolute camera extrinsics. Note that we slightly abuse notation by assuming rotation matrices are in homogeneous coordinates to match the \(4\times 4\) dimensions of extrinsic matrices.

$$ \mathbf{E}_{\text{abs},f} = \varphi(\mathbf{R}_{\text{pano},0})\mathbf{R}_{\text{pano},f} \mathbf{E}_{\text{rel},f} \; . $$

WAN Backbone

Here we show results where we trained a different diffusion model, WAN 2.2 (5B), using the same data and procedure, adjusted to the conditioning mechanism of WAN.

Training details

We randomly initialize a new camera encoder block composed of two convolutional layers, taking as input the absolute Plücker rays and outputting a feature map that is added to the patchified noisy latents, just before the first DiT layer. We train the camera encoder and finetune the DiT model for 70,400 iterations on 4 A100 (80 GB) GPUs.
In the following table, we show the quantitative results of this model. Training on this more powerful backbone roughly halves the absolute orientation errors (PitchErr. 11.79 vs. 23.79, GravityErr. 14.23 vs. 27.06) and lowers the relative rotation error (9.33 vs. 14.25), at the cost of slightly worse CLIP, FID and FVD scores. We show qualitative results for both backbones in the "Additional Qualitative Results" section below.

Quantitative results

| Method | PitchErr. (abs.) $\downarrow$ | GravityErr. (abs.) $\downarrow$ | RotErr (rel.) $\downarrow$ | TransErr $\downarrow$ | CLIP $\uparrow$ | FID $\downarrow$ | FVD $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ours (WAN backbone) | 11.79 | 14.23 | 9.33 | 0.75 | 19.27 | 119.78 | 1081.13 |
| Ours (AC3D backbone) | 23.79 | 27.06 | 14.25 | 0.75 | 21.35 | 110.71 | 896.84 |

Evaluation Dataset Details

Here, we supply additional details and statistics on our evaluation benchmark, SpatialVID-extreme, extending sec. 4.3 of the paper.

Random rotation trajectories

We randomly sample a start roll from \([-40°, 40°]\) and a start pitch from \([-90°, 90°]\). The final roll and pitch are sampled the same way, and the end yaw is sampled from \([-180°, 180°]\). The intermediate rotations are obtained by spherical linear interpolation. Since the method used to compute absolute orientation metrics (Perspective Fields) is never trained with rolls beyond 45° in magnitude, we resample the random rotation trajectory whenever an intermediate frame falls outside the roll bounds of \([-40°, 40°]\).
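The sampling-and-rejection procedure described above can be sketched as follows. This is a hedged illustration rather than the benchmark's actual code: the function names, the uniform sampling, the start yaw of 0°, the 25-frame length, the retry cap and the Euler composition R_y(yaw) R_x(pitch) R_z(roll) are all assumptions introduced here.

```python
import math
import random

def quat_mul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def euler_to_quat(yaw, pitch, roll):
    """Degrees in. ASSUMED order: R = R_y(yaw) @ R_x(pitch) @ R_z(roll)."""
    y, p, r = (math.radians(v) / 2.0 for v in (yaw, pitch, roll))
    qy = (math.cos(y), 0.0, math.sin(y), 0.0)
    qx = (math.cos(p), math.sin(p), 0.0, 0.0)
    qz = (math.cos(r), 0.0, 0.0, math.sin(r))
    return quat_mul(qy, quat_mul(qx, qz))

def roll_of(q):
    """Roll in degrees, atan2(R[1][0], R[1][1]) under the assumed order."""
    w, x, y, z = q
    return math.degrees(math.atan2(2.0 * (x * y + w * z),
                                   1.0 - 2.0 * (x * x + z * z)))

def slerp(q0, q1, t):
    """Spherical linear interpolation along the shortest arc."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # flip to take the shorter path
        q1, dot = tuple(-c for c in q1), -dot
    if dot > 0.9995:                   # nearly identical: lerp is stable
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def sample_trajectory(n_frames=25, max_roll=40.0, rng=random, max_tries=10000):
    """Rejection-sample a rotation trajectory whose roll stays in bounds."""
    for _ in range(max_tries):
        q_start = euler_to_quat(0.0,                       # ASSUMED start yaw
                                rng.uniform(-90.0, 90.0),
                                rng.uniform(-max_roll, max_roll))
        q_end = euler_to_quat(rng.uniform(-180.0, 180.0),
                              rng.uniform(-90.0, 90.0),
                              rng.uniform(-max_roll, max_roll))
        frames = [slerp(q_start, q_end, i / (n_frames - 1))
                  for i in range(n_frames)]
        # Small tolerance so endpoints sampled at the bound are not rejected.
        if all(abs(roll_of(q)) <= max_roll + 1e-6 for q in frames):
            return frames
    raise RuntimeError("no valid trajectory found within max_tries")

trajectory = sample_trajectory(rng=random.Random(0))
```

Rejection is needed because interpolating between two in-bounds endpoints does not keep the roll in bounds: for instance, a trajectory that pitches over the vertical while panning 180° passes through intermediate frames with a large roll.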

Evaluation dataset statistics

Here we show that the original SpatialVID-HQ dataset provides limited diversity in absolute orientation and relative rotations. Our new evaluation benchmark, SpatialVID-extreme, provides a broader coverage of Euler angles and total angular distance.

[Figure: evaluation dataset statistics]

Details of Prompt Engineering

For the AC3D baselines, we provide the model with the absolute camera orientation through text, as mentioned in sec. 4.4 of the paper. Here, we show the code used to generate these camera descriptions.
We first take the absolute camera extrinsics \(\mathbf{E}_{\text{abs}}\) for the video and convert them to Euler angles. We then describe in text only the first frame's pitch and roll and the last frame's yaw, pitch and roll. We do not describe the in-between frames, both because the trajectories are linear and to avoid overwhelming the model.

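def get_camera_description_from_absolute_c2w(c2w_absolute):
    """Describe the first and last camera orientations of a video in text."""
    # c2w_to_pitch_roll_yaw is not shown here; from its usage below it returns
    # per-frame 'pitch', 'roll' and 'yaw' sequences whose entries support .item().
    euler_angles = c2w_to_pitch_roll_yaw(c2w_absolute)
    first_pitch = euler_angles['pitch'][0].item()
    first_roll = euler_angles['roll'][0].item()
    # The first frame's yaw is read but never described.
    first_yaw = euler_angles['yaw'][0].item()
    last_pitch = euler_angles['pitch'][-1].item()
    last_roll = euler_angles['roll'][-1].item()
    last_yaw = euler_angles['yaw'][-1].item()
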
    def describe_angle_shot(pitch):
        """Describe pitch angle shot."""
        pitch_rounded = round(pitch)
        if -5 <= pitch_rounded <= 5:
            return "near straight-on shot"
        elif pitch_rounded > 0:
            if 5 < pitch_rounded <= 20:
                return f"small tilt-up of {pitch_rounded} degrees"
            elif 20 < pitch_rounded <= 45:
                return f"large tilt-up of {pitch_rounded} degrees"
            else:
                return f"extreme tilt-up of {pitch_rounded} degrees"
        else:
            abs_pitch = abs(pitch_rounded)
            if 5 < abs_pitch <= 20:
                return f"small tilt-down of {abs_pitch} degrees"
            elif 20 < abs_pitch <= 45:
                return f"large tilt-down of {abs_pitch} degrees"
            else:
                return f"extreme tilt-down of {abs_pitch} degrees"
    
    def describe_dutch_angle(roll):
        """Describe roll (Dutch angle) with clockwise/counterclockwise."""
        roll_rounded = round(roll)
        if -5 <= roll_rounded <= 5:
            return "near level shot"
        
        abs_roll = abs(roll_rounded)
        if abs_roll <= 20:
            magnitude = "small"
        elif abs_roll <= 45:
            magnitude = "large"
        else:
            magnitude = "extreme"
        
        # Positive roll is counterclockwise, negative is clockwise
        direction = "counterclockwise" if roll_rounded > 0 else "clockwise"
        
        return f"a {magnitude} Dutch angle tilted {direction} {abs_roll} degrees"
    
    def describe_yaw(yaw):
        """Describe yaw (pan) direction."""
        yaw_rounded = round(yaw)
        
        abs_yaw = abs(yaw_rounded)
        direction = "right" if yaw_rounded > 0 else "left"
        return f"pan of {abs_yaw} degrees turned {direction}"
    
    # Build start description
    start_parts = []
    start_pitch_desc = describe_angle_shot(first_pitch)
    start_roll_desc = describe_dutch_angle(first_roll)
    
    if start_pitch_desc != "near straight-on shot":
        start_parts.append(start_pitch_desc)
    if start_roll_desc != "near level shot":
        start_parts.append(start_roll_desc)
    
    # Build end description
    end_parts = []
    end_yaw_desc = describe_yaw(last_yaw)
    end_pitch_desc = describe_angle_shot(last_pitch)
    end_roll_desc = describe_dutch_angle(last_roll)
    
    if end_yaw_desc:
        end_parts.append(end_yaw_desc)
    if end_pitch_desc != "near straight-on shot":
        end_parts.append(end_pitch_desc)
    if end_roll_desc != "near level shot":
        end_parts.append(end_roll_desc)
    
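    # Construct final description
    description_parts = []

    # If the start pitch and roll are both near zero, start_parts is empty and
    # the first sentence is left without an object.
    start_text = "The camera starts at " if len(start_parts) == 1 else "The camera starts with "
    description_parts.append(start_text + ", and ".join(start_parts))

    end_text = "The camera ends with " + ", ".join(end_parts)
    description_parts.append(end_text)

    # Note: the trailing [:-1] removes the last character of the joined string,
    # which is why the camera prompts listed in the results end in "degree".
    return ". ".join(description_parts)[:-1]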

After obtaining the camera description, we append it to the regular prompt. Refer to the following qualitative results for examples of camera descriptions.
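The helper `c2w_to_pitch_roll_yaw` called in the code above is not reproduced in this document. As a rough stand-in, the sketch below extracts the three angles from a single camera-to-world rotation. Everything in it is an assumption: the function names are hypothetical, the composition R_y(yaw) R_x(pitch) R_z(roll) is chosen for illustration, and it returns plain floats for one frame rather than the per-frame sequences the original helper appears to return.

```python
import math

def rotation_to_pitch_roll_yaw(R):
    """Angles in degrees from the top-left 3x3 block of a c2w matrix.

    ASSUMPTION: R = R_y(yaw) @ R_x(pitch) @ R_z(roll), with +x right,
    +y down and +z forward.
    """
    sin_pitch = max(-1.0, min(1.0, -R[1][2]))   # clamp against rounding
    return {
        "pitch": math.degrees(math.asin(sin_pitch)),
        "roll": math.degrees(math.atan2(R[1][0], R[1][1])),
        "yaw": math.degrees(math.atan2(R[0][2], R[2][2])),
    }

def euler_to_rotation(yaw, pitch, roll):
    """Inverse of the function above, used here to build an example."""
    a, b, c = (math.radians(v) for v in (yaw, pitch, roll))
    ca, sa = math.cos(a), math.sin(a)
    cb, sb = math.cos(b), math.sin(b)
    cc, sc = math.cos(c), math.sin(c)
    return [
        [ca * cc + sa * sb * sc, -ca * sc + sa * sb * cc, sa * cb],
        [cb * sc, cb * cc, -sb],
        [-sa * cc + ca * sb * sc, sa * sc + ca * sb * cc, ca * cb],
    ]

# Example with hypothetical angles: pan 129 right, tilt 60 down, roll 6.
angles = rotation_to_pitch_roll_yaw(euler_to_rotation(129.0, -60.0, 6.0))
```

With this composition, a positive pitch points the optical axis upward and a positive yaw turns it to the right, which agrees with the "tilt-up" and "turned right" wording in the description code. The sign of the roll relative to "clockwise" and "counterclockwise" is not determined by the text and remains an assumption.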

Additional Qualitative Results

Here are additional qualitative results, complementing fig. 7. We show 25 randomly sampled videos from our SpatialVID-extreme dataset. The black overlay represents the input camera orientation and the red overlay represents the estimated camera orientation (obtained using Perspective Fields and VGGT), as described in the paper. For the "AC3D + cam. text" baselines, we additionally give the model the camera prompt, which states the absolute camera orientation.

Sample: SpatialVID-extreme-00025-6f852a89-6120-588e-8944-71838d3ccc73
Prompt: A serene lakeside town features a stone church with a red roof, surrounded by greenery and reflective waters, exuding a peaceful, picturesque charm.
Camera prompt: The camera starts with large tilt-up of 40 degrees, and a small Dutch angle tilted counterclockwise 16 degrees. The camera ends with pan of 171 degrees turned right, extreme tilt-up of 66 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 46.0274
GravityErr. (abs.) ↓ 50.6578
RotErr (rel.) ↓ 60.1353
TransErr ↓ 0.4708
CLIP ↑ 26.4013
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 51.0436
GravityErr. (abs.) ↓ 54.9797
RotErr (rel.) ↓ 68.3058
TransErr ↓ 0.5889
CLIP ↑ 27.4460
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 35.6582
GravityErr. (abs.) ↓ 38.7524
RotErr (rel.) ↓ 50.2021
TransErr ↓ 0.7401
CLIP ↑ 23.0262
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 23.7425
GravityErr. (abs.) ↓ 26.6890
RotErr (rel.) ↓ 15.3803
TransErr ↓ 0.7295
CLIP ↑ 18.9948
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 18.6595
GravityErr. (abs.) ↓ 19.8800
RotErr (rel.) ↓ 42.7691
TransErr ↓ 0.4510
CLIP ↑ 16.3254
Ours (WAN backbone)
PitchErr. (abs.) ↓ 7.1853
GravityErr. (abs.) ↓ 7.6025
RotErr (rel.) ↓ 13.2675
TransErr ↓ 0.4395
CLIP ↑ 16.4299
Sample: SpatialVID-extreme-00054-7fbd6052-afd4-5d39-b013-94b93c6605a2
Prompt: A poised wingsuit flyer stands on a grassy mountain peak, overlooking a forested valley and a distant town beneath an overcast sky, evoking tension and grandeur.
Camera prompt: The camera starts with small tilt-down of 7 degrees, and a small Dutch angle tilted clockwise 10 degrees. The camera ends with pan of 108 degrees turned right, extreme tilt-down of 82 degrees, a large Dutch angle tilted clockwise 24 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 19.8415
GravityErr. (abs.) ↓ 22.4312
RotErr (rel.) ↓ 29.9119
TransErr ↓ 1.1958
CLIP ↑ 25.9891
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 11.0208
GravityErr. (abs.) ↓ 16.9371
RotErr (rel.) ↓ 35.9948
TransErr ↓ 1.1609
CLIP ↑ 27.0619
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 34.7894
GravityErr. (abs.) ↓ 36.0968
RotErr (rel.) ↓ 32.5520
TransErr ↓ 0.8935
CLIP ↑ 24.0029
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 10.5160
GravityErr. (abs.) ↓ 11.5214
RotErr (rel.) ↓ 17.2173
TransErr ↓ 0.9006
CLIP ↑ 22.0709
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 14.6655
GravityErr. (abs.) ↓ 14.9051
RotErr (rel.) ↓ 21.4328
TransErr ↓ 0.8671
CLIP ↑ 22.9628
Ours (WAN backbone)
PitchErr. (abs.) ↓ 19.1934
GravityErr. (abs.) ↓ 21.9546
RotErr (rel.) ↓ 6.3154
TransErr ↓ 0.9905
CLIP ↑ 23.1786
Sample: SpatialVID-extreme-00113-8df73a61-7eab-5f7e-910d-17d2095466da
Prompt: A modern glass-and-steel skyscraper rises against a clear blue sky, surrounded by diverse urban architecture in a bustling cityscape.
Camera prompt: The camera starts at extreme tilt-down of 47 degrees. The camera ends with pan of 129 degrees turned right, extreme tilt-down of 60 degrees, a small Dutch angle tilted counterclockwise 6 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 53.8421
GravityErr. (abs.) ↓ 54.0759
RotErr (rel.) ↓ 30.4476
TransErr ↓ 0.8093
CLIP ↑ 20.9175
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 56.8496
GravityErr. (abs.) ↓ 57.3859
RotErr (rel.) ↓ 49.3157
TransErr ↓ 0.7080
CLIP ↑ 20.6264
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 80.6584
GravityErr. (abs.) ↓ 82.1548
RotErr (rel.) ↓ 29.1721
TransErr ↓ 0.6276
CLIP ↑ 16.4776
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 9.7571
GravityErr. (abs.) ↓ 13.4318
RotErr (rel.) ↓ 10.1055
TransErr ↓ 1.0413
CLIP ↑ 18.0661
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 6.2760
GravityErr. (abs.) ↓ 9.6366
RotErr (rel.) ↓ 10.0627
TransErr ↓ 1.0059
CLIP ↑ 17.5957
Ours (WAN backbone)
PitchErr. (abs.) ↓ 4.5623
GravityErr. (abs.) ↓ 5.4503
RotErr (rel.) ↓ 11.9829
TransErr ↓ 0.9592
CLIP ↑ 16.8313
Sample: SpatialVID-extreme-00094-7c736f48-8095-569a-b6be-2cfd7170c940
Prompt: A bright forest with towering green-leaved trees forming a dense canopy overhead.
Camera prompt: The camera starts with extreme tilt-up of 56 degrees, and a large Dutch angle tilted clockwise 39 degrees. The camera ends with pan of 60 degrees turned left, a small Dutch angle tilted counterclockwise 16 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 15.4159
GravityErr. (abs.) ↓ 21.7923
RotErr (rel.) ↓ 9.2083
TransErr ↓ 0.8908
CLIP ↑ 24.0304
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 15.0149
GravityErr. (abs.) ↓ 21.9736
RotErr (rel.) ↓ 18.7598
TransErr ↓ 0.9982
CLIP ↑ 24.3263
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 26.9080
GravityErr. (abs.) ↓ 29.8925
RotErr (rel.) ↓ 24.3715
TransErr ↓ 0.8014
CLIP ↑ 25.4442
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 15.1250
GravityErr. (abs.) ↓ 17.6764
RotErr (rel.) ↓ 12.8989
TransErr ↓ 0.7213
CLIP ↑ 24.9015
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 13.5447
GravityErr. (abs.) ↓ 16.0264
RotErr (rel.) ↓ 8.1718
TransErr ↓ 1.0635
CLIP ↑ 25.0843
Ours (WAN backbone)
PitchErr. (abs.) ↓ 8.1029
GravityErr. (abs.) ↓ 13.3512
RotErr (rel.) ↓ 4.2873
TransErr ↓ 0.8547
CLIP ↑ 25.4300
Sample: SpatialVID-extreme-00149-0b0ff89f-91df-501b-84ca-4b29687b2f22
Prompt: A dense, serene forest with towering trees and a winding path, bathed in soft, diffused light that enhances its mysterious, tranquil atmosphere.
Camera prompt: The camera starts with small tilt-down of 16 degrees, and a large Dutch angle tilted counterclockwise 31 degrees. The camera ends with pan of 6 degrees turned right, large tilt-down of 23 degrees, a small Dutch angle tilted counterclockwise 19 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 12.3667
GravityErr. (abs.) ↓ 26.1761
RotErr (rel.) ↓ 5.6906
TransErr ↓ 0.1334
CLIP ↑ 20.4088
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 19.1697
GravityErr. (abs.) ↓ 27.2809
RotErr (rel.) ↓ 6.5629
TransErr ↓ 0.1452
CLIP ↑ 19.8605
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 1.3823
GravityErr. (abs.) ↓ 14.1708
RotErr (rel.) ↓ 3.5397
TransErr ↓ 1.0539
CLIP ↑ 22.7485
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 5.4246
GravityErr. (abs.) ↓ 9.7907
RotErr (rel.) ↓ 7.1947
TransErr ↓ 0.2232
CLIP ↑ 20.1277
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 7.2750
GravityErr. (abs.) ↓ 17.9292
RotErr (rel.) ↓ 4.5358
TransErr ↓ 0.0600
CLIP ↑ 20.9196
Ours (WAN backbone)
PitchErr. (abs.) ↓ 5.2536
GravityErr. (abs.) ↓ 6.1095
RotErr (rel.) ↓ 2.9871
TransErr ↓ 0.5286
CLIP ↑ 17.2604
Sample: SpatialVID-extreme-00024-8c6e3ce1-6dda-5b9e-8c5c-caab8da79b96
Prompt: A wingsuited figure stands on a cliff, surrounded by surreal, inverted colors, capturing a tense, adventurous moment against a vast, dreamlike landscape.
Camera prompt: The camera starts with small tilt-up of 12 degrees, and a large Dutch angle tilted counterclockwise 28 degrees. The camera ends with pan of 155 degrees turned left, small tilt-down of 17 degrees, a large Dutch angle tilted counterclockwise 25 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 43.5085
GravityErr. (abs.) ↓ 46.0342
RotErr (rel.) ↓ 70.5934
TransErr ↓ 0.8211
CLIP ↑ 12.8228
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 40.2365
GravityErr. (abs.) ↓ 45.3877
RotErr (rel.) ↓ 67.3071
TransErr ↓ 0.7163
CLIP ↑ 16.7430
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 42.1858
GravityErr. (abs.) ↓ 46.2768
RotErr (rel.) ↓ 32.7303
TransErr ↓ 0.2986
CLIP ↑ 23.1453
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 9.7395
GravityErr. (abs.) ↓ 14.6548
RotErr (rel.) ↓ 24.4084
TransErr ↓ 0.5503
CLIP ↑ 22.8919
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 7.1109
GravityErr. (abs.) ↓ 14.1656
RotErr (rel.) ↓ 22.4047
TransErr ↓ 0.4576
CLIP ↑ 21.8093
Ours (WAN backbone)
PitchErr. (abs.) ↓ 1.5712
GravityErr. (abs.) ↓ 9.6281
RotErr (rel.) ↓ 6.0374
TransErr ↓ 0.5593
CLIP ↑ 22.7422
Sample: SpatialVID-extreme-00032-011eca13-4549-5817-b2f7-46b88193204e
Prompt: A man in a blue shirt works on a marina dock during the daytime, surrounded by yachts and a clear blue sky, evoking a vibrant, luxurious atmosphere.
Camera prompt: The camera starts with large tilt-down of 22 degrees, and a small Dutch angle tilted clockwise 15 degrees. The camera ends with pan of 171 degrees turned left, extreme tilt-down of 58 degrees, a small Dutch angle tilted counterclockwise 16 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 30.7817
GravityErr. (abs.) ↓ 35.2017
RotErr (rel.) ↓ 28.0205
TransErr ↓ 0.6576
CLIP ↑ 27.0847
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 27.5676
GravityErr. (abs.) ↓ 33.1861
RotErr (rel.) ↓ 32.9795
TransErr ↓ 0.7097
CLIP ↑ 23.8647
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 10.6864
GravityErr. (abs.) ↓ 14.7886
RotErr (rel.) ↓ 23.9565
TransErr ↓ 0.6871
CLIP ↑ 23.4003
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 5.4172
GravityErr. (abs.) ↓ 10.3875
RotErr (rel.) ↓ 13.4265
TransErr ↓ 0.9614
CLIP ↑ 23.3385
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 7.5925
GravityErr. (abs.) ↓ 11.8558
RotErr (rel.) ↓ 14.1721
TransErr ↓ 0.9131
CLIP ↑ 23.3216
Ours (WAN backbone)
PitchErr. (abs.) ↓ 4.9720
GravityErr. (abs.) ↓ 7.8187
RotErr (rel.) ↓ 6.8907
TransErr ↓ 1.0000
CLIP ↑ 22.0412
Sample: SpatialVID-extreme-00064-1dfe1cec-e3e4-5a8f-9f80-0937c05c3f54
Prompt: A low-angle urban scene features towering glass skyscrapers under a vibrant blue sky, with street lamps and traffic signals adding a grounded contrast to the imposing skyline.
Camera prompt: The camera starts with large tilt-down of 26 degrees, and a large Dutch angle tilted counterclockwise 37 degrees. The camera ends with pan of 113 degrees turned left, extreme tilt-down of 87 degrees, a small Dutch angle tilted clockwise 12 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 110.4082
GravityErr. (abs.) ↓ 111.0205
RotErr (rel.) ↓ 55.7747
TransErr ↓ 1.3062
CLIP ↑ 23.2672
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 108.5395
GravityErr. (abs.) ↓ 108.7265
RotErr (rel.) ↓ 38.2278
TransErr ↓ 1.0574
CLIP ↑ 19.9949
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 20.6734
GravityErr. (abs.) ↓ 21.7919
RotErr (rel.) ↓ 54.7965
TransErr ↓ 0.5254
CLIP ↑ 9.5761
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 91.6705
GravityErr. (abs.) ↓ 92.3970
RotErr (rel.) ↓ 12.5064
TransErr ↓ 0.9585
CLIP ↑ 19.1275
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 73.4453
GravityErr. (abs.) ↓ 74.4047
RotErr (rel.) ↓ 17.1927
TransErr ↓ 1.0433
CLIP ↑ 16.8562
Ours (WAN backbone)
PitchErr. (abs.) ↓ 6.6322
GravityErr. (abs.) ↓ 11.4644
RotErr (rel.) ↓ 12.4027
TransErr ↓ 0.7851
CLIP ↑ 14.2942
Sample: SpatialVID-extreme-00099-4bac0e8c-f2bb-5f98-abee-2fd594c26266
Prompt: A serene forest under a clear blue sky, where tall trees stand alongside evergreens, creating a tranquil atmosphere.
Camera prompt: The camera starts with extreme tilt-up of 70 degrees, and a small Dutch angle tilted clockwise 16 degrees. The camera ends with pan of 147 degrees turned right, small tilt-down of 10 degrees, a large Dutch angle tilted clockwise 27 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 24.3843
GravityErr. (abs.) ↓ 27.5568
RotErr (rel.) ↓ 29.0624
TransErr ↓ 0.7807
CLIP ↑ 23.0699
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 25.7627
GravityErr. (abs.) ↓ 28.1366
RotErr (rel.) ↓ 40.3407
TransErr ↓ 0.9822
CLIP ↑ 23.5943
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 13.6426
GravityErr. (abs.) ↓ 17.9353
RotErr (rel.) ↓ 58.3157
TransErr ↓ 0.8331
CLIP ↑ 18.3891
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 8.4624
GravityErr. (abs.) ↓ 10.9273
RotErr (rel.) ↓ 15.5292
TransErr ↓ 0.9316
CLIP ↑ 23.2840
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 12.5761
GravityErr. (abs.) ↓ 17.4512
RotErr (rel.) ↓ 11.3408
TransErr ↓ 0.9389
CLIP ↑ 21.3306
Ours (WAN backbone)
PitchErr. (abs.) ↓ 5.5623
GravityErr. (abs.) ↓ 7.0038
RotErr (rel.) ↓ 8.8094
TransErr ↓ 1.0426
CLIP ↑ 21.6473
Sample: SpatialVID-extreme-00049-4ab5baae-66dc-5c85-8496-ce0738f8b6ea
Prompt: A cozy, sophisticated home office features a large wooden desk, an armchair, and bookshelves, bathed in warm, ambient lighting that enhances its intellectual and inviting atmosphere.
Camera prompt: The camera starts with small tilt-up of 10 degrees, and a small Dutch angle tilted clockwise 20 degrees. The camera ends with pan of 82 degrees turned right, extreme tilt-up of 65 degrees, a small Dutch angle tilted clockwise 6 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 34.8429
GravityErr. (abs.) ↓ 41.5793
RotErr (rel.) ↓ 21.6473
TransErr ↓ 0.8536
CLIP ↑ 22.7570
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 46.0506
GravityErr. (abs.) ↓ 50.8660
RotErr (rel.) ↓ 24.1164
TransErr ↓ 0.6946
CLIP ↑ 19.5785
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 11.8620
GravityErr. (abs.) ↓ 19.0803
RotErr (rel.) ↓ 16.2095
TransErr ↓ 0.6366
CLIP ↑ 22.8237
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 14.2742
GravityErr. (abs.) ↓ 21.9224
RotErr (rel.) ↓ 5.2149
TransErr ↓ 0.4530
CLIP ↑ 18.8532
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 2.7208
GravityErr. (abs.) ↓ 12.7898
RotErr (rel.) ↓ 7.1656
TransErr ↓ 0.6409
CLIP ↑ 14.4796
Ours (WAN backbone)
PitchErr. (abs.) ↓ 8.6621
GravityErr. (abs.) ↓ 11.6480
RotErr (rel.) ↓ 4.8464
TransErr ↓ 0.5123
CLIP ↑ 14.2219
Sample: SpatialVID-extreme-00021-3d3d0e82-2feb-5afa-ab61-4211c4868c98
Prompt: A luxurious breakfast tray floats in a clear pool, surrounded by modern outdoor lounges, exuding calm and opulence under bright sunlight.
Camera prompt: The camera starts with large tilt-up of 28 degrees, and a small Dutch angle tilted counterclockwise 7 degrees. The camera ends with pan of 25 degrees turned left, large tilt-up of 27 degrees, a small Dutch angle tilted counterclockwise 6 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 60.7490
GravityErr. (abs.) ↓ 61.0202
RotErr (rel.) ↓ 3.9213
TransErr ↓ 0.3722
CLIP ↑ 25.5968
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 38.8333
GravityErr. (abs.) ↓ 39.4999
RotErr (rel.) ↓ 5.8911
TransErr ↓ 0.8308
CLIP ↑ 24.3916
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 5.9607
GravityErr. (abs.) ↓ 6.3004
RotErr (rel.) ↓ 1.7660
TransErr ↓ 0.8500
CLIP ↑ 19.3325
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 22.0424
GravityErr. (abs.) ↓ 22.1279
RotErr (rel.) ↓ 5.3922
TransErr ↓ 1.1389
CLIP ↑ 20.7591
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 5.2496
GravityErr. (abs.) ↓ 5.4403
RotErr (rel.) ↓ 2.2218
TransErr ↓ 0.9478
CLIP ↑ 18.0535
Ours (WAN backbone)
PitchErr. (abs.) ↓ 6.7869
GravityErr. (abs.) ↓ 7.2125
RotErr (rel.) ↓ 2.7276
TransErr ↓ 0.8024
CLIP ↑ 13.4368
Sample: SpatialVID-extreme-00107-1c9c9906-e7da-5a31-8933-5b6aeaf38da6
Prompt: A cozy, nostalgic living room features a polished wooden console with a record player, radio, and a vinyl record, bathed in warm, even light that highlights its retro design.
Camera prompt: The camera starts with large tilt-up of 36 degrees, and a large Dutch angle tilted counterclockwise 31 degrees. The camera ends with pan of 127 degrees turned left, large tilt-down of 30 degrees, a large Dutch angle tilted counterclockwise 21 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 35.8782
GravityErr. (abs.) ↓ 38.6682
RotErr (rel.) ↓ 57.0301
TransErr ↓ 0.8719
CLIP ↑ 19.3937
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 33.3872
GravityErr. (abs.) ↓ 36.8525
RotErr (rel.) ↓ 46.3798
TransErr ↓ 0.7189
CLIP ↑ 19.7292
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 33.3381
GravityErr. (abs.) ↓ 34.9691
RotErr (rel.) ↓ 58.3994
TransErr ↓ 0.8954
CLIP ↑ 23.0701
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 24.0375
GravityErr. (abs.) ↓ 28.0025
RotErr (rel.) ↓ 22.8317
TransErr ↓ 0.8803
CLIP ↑ 17.8000
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 53.7958
GravityErr. (abs.) ↓ 58.0636
RotErr (rel.) ↓ 16.5109
TransErr ↓ 0.4437
CLIP ↑ 19.7723
Ours (WAN backbone)
PitchErr. (abs.) ↓ 3.1677
GravityErr. (abs.) ↓ 5.6045
RotErr (rel.) ↓ 4.8257
TransErr ↓ 0.8068
CLIP ↑ 11.6492
Sample: SpatialVID-extreme-00114-0dd53c4d-70f0-5273-aaeb-476b2d530103
Prompt: A vibrant street scene features a large mural of a woman in a blue dress against a castle, set within a town under a blue sky.
Camera prompt: The camera starts with extreme tilt-down of 69 degrees, and a large Dutch angle tilted clockwise 35 degrees. The camera ends with pan of 173 degrees turned right, small tilt-down of 8 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 42.7767
GravityErr. (abs.) ↓ 46.9327
RotErr (rel.) ↓ 41.5490
TransErr ↓ 0.4805
CLIP ↑ 19.1048
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 40.1487
GravityErr. (abs.) ↓ 43.8205
RotErr (rel.) ↓ 61.1779
TransErr ↓ 0.8127
CLIP ↑ 19.7077
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 40.1486
GravityErr. (abs.) ↓ 52.3973
RotErr (rel.) ↓ 37.0648
TransErr ↓ 0.7903
CLIP ↑ 20.9855
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 29.9336
GravityErr. (abs.) ↓ 32.9690
RotErr (rel.) ↓ 24.9350
TransErr ↓ 0.9501
CLIP ↑ 16.9815
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 27.6411
GravityErr. (abs.) ↓ 32.7320
RotErr (rel.) ↓ 8.5795
TransErr ↓ 0.8528
CLIP ↑ 17.9792
Ours (WAN backbone)
PitchErr. (abs.) ↓ 12.2447
GravityErr. (abs.) ↓ 15.5637
RotErr (rel.) ↓ 16.6763
TransErr ↓ 0.3990
CLIP ↑ 18.6031
Sample: SpatialVID-extreme-00143-9f68e23e-8b85-53f0-afd5-f41a1e907967
Prompt: A serene forest under a clear blue sky, with yellow-tinted trees and undergrowth creating a peaceful atmosphere.
Camera prompt: The camera starts with large tilt-down of 27 degrees, and a small Dutch angle tilted counterclockwise 17 degrees. The camera ends with pan of 97 degrees turned right, extreme tilt-up of 72 degrees, a large Dutch angle tilted counterclockwise 36 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 17.5634
GravityErr. (abs.) ↓ 18.5818
RotErr (rel.) ↓ 41.2934
TransErr ↓ 0.9521
CLIP ↑ 24.9497
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 17.5646
GravityErr. (abs.) ↓ 18.3237
RotErr (rel.) ↓ 31.5295
TransErr ↓ 0.9975
CLIP ↑ 24.7434
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 55.3966
GravityErr. (abs.) ↓ 56.0936
RotErr (rel.) ↓ 43.8370
TransErr ↓ 0.6545
CLIP ↑ 26.4944
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 11.3157
GravityErr. (abs.) ↓ 12.4034
RotErr (rel.) ↓ 30.5031
TransErr ↓ 0.3289
CLIP ↑ 25.5602
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 7.1139
GravityErr. (abs.) ↓ 9.0423
RotErr (rel.) ↓ 14.6091
TransErr ↓ 0.4175
CLIP ↑ 27.3670
Ours (WAN backbone)
PitchErr. (abs.) ↓ 4.6941
GravityErr. (abs.) ↓ 5.9101
RotErr (rel.) ↓ 10.9506
TransErr ↓ 0.7710
CLIP ↑ 22.5109
Sample: SpatialVID-extreme-00059-4e124bf9-40e5-5efb-92f1-d96c454ce56e
Prompt: A luxurious breakfast table set on a black speckled surface evokes an elegant dining atmosphere.
Camera prompt: The camera starts at small tilt-down of 18 degrees. The camera ends with pan of 68 degrees turned right, extreme tilt-up of 73 degrees, a small Dutch angle tilted counterclockwise 18 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 78.2267
GravityErr. (abs.) ↓ 79.6091
RotErr (rel.) ↓ 32.2964
TransErr ↓ 0.8309
CLIP ↑ 14.7939
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 57.9806
GravityErr. (abs.) ↓ 58.4524
RotErr (rel.) ↓ 41.5077
TransErr ↓ 0.7863
CLIP ↑ 19.3070
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 87.9267
GravityErr. (abs.) ↓ 89.7094
RotErr (rel.) ↓ 37.1392
TransErr ↓ 0.6984
CLIP ↑ 26.3321
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 6.0985
GravityErr. (abs.) ↓ 8.0882
RotErr (rel.) ↓ 24.8739
TransErr ↓ 0.8470
CLIP ↑ 12.3847
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 7.8777
GravityErr. (abs.) ↓ 10.8806
RotErr (rel.) ↓ 19.6159
TransErr ↓ 0.8290
CLIP ↑ 12.8744
Ours (WAN backbone)
PitchErr. (abs.) ↓ 6.5513
GravityErr. (abs.) ↓ 8.1713
RotErr (rel.) ↓ 6.0184
TransErr ↓ 0.8078
CLIP ↑ 17.0321
Sample: SpatialVID-extreme-00005-4c45b60b-bb9d-55d0-bbd5-10c2087e91e8
Prompt: A vibrant forest canopy bathed in green and yellow foliage frames a blue sky, creating a peaceful, tranquil atmosphere under soft, diffused light.
Camera prompt: The camera starts with extreme tilt-up of 80 degrees, and a small Dutch angle tilted counterclockwise 11 degrees. The camera ends with pan of 31 degrees turned left, a large Dutch angle tilted clockwise 29 degree
AC3D + cam. text
PitchErr. (abs.) ↓ 23.2171
GravityErr. (abs.) ↓ 26.5473
RotErr (rel.) ↓ 16.1778
TransErr ↓ 0.3193
CLIP ↑ 24.4303
AC3D + cam. text + abs. Plücker
PitchErr. (abs.) ↓ 21.3226
GravityErr. (abs.) ↓ 26.0176
RotErr (rel.) ↓ 38.2427
TransErr ↓ 0.4511
CLIP ↑ 23.5468
PreciseCam + WAN-I2V-CC
PitchErr. (abs.) ↓ 21.0089
GravityErr. (abs.) ↓ 22.2772
RotErr (rel.) ↓ 20.5811
TransErr ↓ 0.7067
CLIP ↑ 24.6602
Ours (w/o null pitch conditioning)
PitchErr. (abs.) ↓ 25.5769
GravityErr. (abs.) ↓ 27.4363
RotErr (rel.) ↓ 13.6787
TransErr ↓ 0.2845
CLIP ↑ 22.7505
Ours (AC3D backbone)
PitchErr. (abs.) ↓ 26.3971
GravityErr. (abs.) ↓ 28.0014
RotErr (rel.) ↓ 19.9262
TransErr ↓ 0.6386
CLIP ↑ 23.6242
Ours (WAN backbone)
PitchErr. (abs.) ↓ 6.8455
GravityErr. (abs.) ↓ 11.0966
RotErr (rel.) ↓ 7.0996
TransErr ↓ 0.9739
CLIP ↑ 23.9834
Sample: SpatialVID-extreme-00121-72eaa75b-9cb9-5266-b2ce-1badf35297b0
Prompt: A dense forest bathed in warm, soft light, with towering trees reaching toward a partially hidden sky, creating a serene and awe-inspiring natural landscape.
Camera prompt: The camera starts with extreme tilt-down of 76 degrees, and a small Dutch angle tilted clockwise 10 degrees. The camera ends with pan of 100 degrees turned left, extreme tilt-down of 75 degrees, a large Dutch angle tilted clockwise 22 degree
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 116.4685 | 116.8717 | 20.7693 | 0.8475 | 17.2804 |
| AC3D + cam. text + abs. Plücker | 65.5913 | 65.8578 | 43.6912 | 0.5457 | 16.5664 |
| PreciseCam + WAN-I2V-CC | 57.1529 | 58.8117 | 25.4791 | 0.8739 | 20.3383 |
| Ours (w/o null pitch conditioning) | 159.0029 | 159.1611 | 15.9059 | 0.3304 | 9.1536 |
| Ours (AC3D backbone) | 154.0937 | 154.8382 | 12.2550 | 0.5461 | 16.7626 |
| Ours (WAN backbone) | 152.9943 | 153.3092 | 6.7771 | 0.2830 | 18.4231 |
Sample: SpatialVID-extreme-00158-6d717414-9a0d-5bb8-9cbe-96165e0bf3cf
Prompt: A winding road winds through a misty green valley, framed by towering mountains under a clear blue sky, evoking a tranquil yet surreal atmosphere.
Camera prompt: The camera starts at large tilt-up of 23 degrees. The camera ends with pan of 30 degrees turned left, small tilt-down of 18 degrees, a large Dutch angle tilted counterclockwise 29 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 6.5641 | 21.2785 | 23.7720 | 0.4005 | 25.4395 |
| AC3D + cam. text + abs. Plücker | 41.6119 | 47.0758 | 26.7249 | 0.6405 | 23.1087 |
| PreciseCam + WAN-I2V-CC | 30.2743 | 35.7095 | 22.3214 | 0.9057 | 22.5986 |
| Ours (w/o null pitch conditioning) | 4.3865 | 7.6134 | 17.1938 | 1.1040 | 24.0481 |
| Ours (AC3D backbone) | 23.3339 | 26.4091 | 18.5041 | 1.1162 | 25.0407 |
| Ours (WAN backbone) | 2.4047 | 3.2921 | 5.4965 | 1.0662 | 19.5170 |
Sample: SpatialVID-extreme-00048-0e0fb727-11ad-5f64-b59d-3a63c27a1fc0
Prompt: A rain-soaked urban street at dusk, illuminated by bright storefronts and car headlights, with two men walking under umbrellas against a subdued, reflective backdrop.
Camera prompt: The camera starts with extreme tilt-down of 76 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 62 degrees turned right, extreme tilt-down of 74 degrees, a large Dutch angle tilted clockwise 28 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 70.9872 | 71.5628 | 7.7384 | 0.7301 | 24.1686 |
| AC3D + cam. text + abs. Plücker | 70.1419 | 70.6688 | 9.4571 | 1.1819 | 24.6172 |
| PreciseCam + WAN-I2V-CC | 79.7014 | 80.3465 | 2.6393 | 1.1480 | 25.4991 |
| Ours (w/o null pitch conditioning) | 32.3115 | 34.9931 | 7.8603 | 0.3878 | 18.8790 |
| Ours (AC3D backbone) | 55.2757 | 55.8499 | 9.5095 | 0.9377 | 24.0199 |
| Ours (WAN backbone) | 25.0271 | 25.9769 | 3.4485 | 0.9906 | 17.2169 |
Sample: SpatialVID-extreme-00052-6fe06b69-8421-502c-8544-f287c630b1f4
Prompt: A modern yacht interior features a woman in a striped suit, a plush couch, a dining area, and elegant lighting through window blinds, evoking luxury and sophistication.
Camera prompt: The camera starts with extreme tilt-down of 80 degrees, and a large Dutch angle tilted clockwise 26 degrees. The camera ends with pan of 100 degrees turned right, large tilt-up of 35 degrees, a large Dutch angle tilted counterclockwise 23 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 24.7625 | 34.7832 | 32.6072 | 0.7852 | 23.8519 |
| AC3D + cam. text + abs. Plücker | 24.2002 | 32.8825 | 53.7767 | 0.6714 | 19.9530 |
| PreciseCam + WAN-I2V-CC | 43.2043 | 53.5281 | 47.0873 | 0.5536 | 24.3593 |
| Ours (w/o null pitch conditioning) | 7.5732 | 14.8372 | 17.8121 | 0.5383 | 21.5083 |
| Ours (AC3D backbone) | 13.4654 | 20.8586 | 12.1630 | 0.8450 | 21.8181 |
| Ours (WAN backbone) | 5.5546 | 8.0035 | 6.1997 | 1.0452 | 19.5790 |
Sample: SpatialVID-extreme-00016-4a43fdeb-bb7e-59e7-8096-4f9408126386
Prompt: A quiet forest path winds uphill through towering trees and ferns, bathed in soft, mysterious light that enhances the sense of an untouched, enchanted woodland.
Camera prompt: The camera starts at large tilt-up of 36 degrees. The camera ends with pan of 113 degrees turned right, large tilt-down of 36 degrees, a large Dutch angle tilted counterclockwise 28 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 18.6687 | 36.6807 | 32.0958 | 0.4534 | 22.5970 |
| AC3D + cam. text + abs. Plücker | 14.1039 | 29.1681 | 30.0157 | 0.8858 | 21.7147 |
| PreciseCam + WAN-I2V-CC | 12.2044 | 25.7049 | 24.7168 | 0.6136 | 23.0996 |
| Ours (w/o null pitch conditioning) | 7.3703 | 14.5811 | 11.4314 | 0.7148 | 20.4090 |
| Ours (AC3D backbone) | 5.1096 | 14.4438 | 10.5019 | 0.9847 | 21.1846 |
| Ours (WAN backbone) | 4.7420 | 7.1774 | 7.6799 | 1.0390 | 18.6551 |
Sample: SpatialVID-extreme-00053-1fb6f4f1-aec2-5bd7-a398-7c4f307181f4
Prompt: A serene pond with swans and cygnets, surrounded by stone edges and greenery, under an overcast sky, capturing quiet natural life.
Camera prompt: The camera starts with large tilt-down of 23 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 82 degrees turned left, small tilt-up of 16 degrees, a small Dutch angle tilted clockwise 19 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 9.8650 | 16.5232 | 4.2297 | 0.2184 | 28.0454 |
| AC3D + cam. text + abs. Plücker | 16.3128 | 19.9868 | 16.6982 | 0.3992 | 23.8211 |
| PreciseCam + WAN-I2V-CC | 27.3438 | 28.1930 | 14.7622 | 0.5218 | 23.3104 |
| Ours (w/o null pitch conditioning) | 5.7983 | 9.1190 | 10.5545 | 0.5161 | 25.6933 |
| Ours (AC3D backbone) | 4.5393 | 5.8276 | 5.0180 | 0.3522 | 25.9724 |
| Ours (WAN backbone) | 3.6083 | 4.5798 | 4.5265 | 0.5398 | 18.0875 |
Sample: SpatialVID-extreme-00055-7dc91038-3cee-5db1-82bc-69271bb41b14
Prompt: A peaceful suburban neighborhood under bright sunlight, featuring well-kept homes, lush greenery, and a quiet, residential charm.
Camera prompt: The camera starts with small tilt-down of 13 degrees, and a small Dutch angle tilted counterclockwise 12 degrees. The camera ends with pan of 13 degrees turned right, a large Dutch angle tilted clockwise 37 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 12.1484 | 18.1184 | 8.0310 | 1.0347 | 22.0883 |
| AC3D + cam. text + abs. Plücker | 3.9469 | 10.2930 | 5.3862 | 1.0776 | 23.7203 |
| PreciseCam + WAN-I2V-CC | 23.1958 | 23.8360 | 6.9296 | 1.2169 | 23.3984 |
| Ours (w/o null pitch conditioning) | 4.1281 | 5.7495 | 4.5434 | 0.0958 | 21.6032 |
| Ours (AC3D backbone) | 1.2906 | 4.1636 | 2.5608 | 0.9939 | 22.1513 |
| Ours (WAN backbone) | 2.1857 | 3.9967 | 3.4507 | 0.7350 | 21.4956 |
Sample: SpatialVID-extreme-00110-9ca63b49-fe40-5987-a03c-92dfe2c6d873
Prompt: A peaceful cherry blossom-lined riverside in Tokyo, where nature and urban architecture coexist under clear, blue skies, evoking calm and reflection.
Camera prompt: The camera starts with extreme tilt-up of 87 degrees, and a large Dutch angle tilted counterclockwise 28 degrees. The camera ends with pan of 43 degrees turned left, small tilt-down of 18 degrees, a large Dutch angle tilted counterclockwise 35 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 11.4436 | 23.8078 | 31.4807 | 0.6359 | 26.6280 |
| AC3D + cam. text + abs. Plücker | 24.1143 | 35.5509 | 64.7968 | 0.7457 | 27.9051 |
| PreciseCam + WAN-I2V-CC | 19.5778 | 28.4830 | 47.4630 | 1.0353 | 24.2992 |
| Ours (w/o null pitch conditioning) | 20.7357 | 21.2864 | 30.6535 | 0.7706 | 21.1382 |
| Ours (AC3D backbone) | 7.4499 | 13.5967 | 23.4345 | 1.1090 | 21.8239 |
| Ours (WAN backbone) | 3.8689 | 5.6396 | 9.7143 | 0.8848 | 22.3384 |
Sample: SpatialVID-extreme-00051-1e7598dc-a6cd-557c-bc6d-7e4029a9ec70
Prompt: A quiet stone church with red pews, arched windows, and historical architecture exudes reverence and solemnity through its timeless design and natural lighting.
Camera prompt: The camera starts at large tilt-up of 35 degrees. The camera ends with pan of 43 degrees turned left, large tilt-down of 39 degrees, a small Dutch angle tilted counterclockwise 15 degrees.
| Method | PitchErr. (abs.) ↓ | GravityErr. (abs.) ↓ | RotErr (rel.) ↓ | TransErr ↓ | CLIP ↑ |
|---|---|---|---|---|---|
| AC3D + cam. text | 14.8289 | 15.8419 | 12.5883 | 0.6991 | 23.5894 |
| AC3D + cam. text + abs. Plücker | 24.2614 | 24.6984 | 14.7795 | 0.7585 | 22.0869 |
| PreciseCam + WAN-I2V-CC | 37.6448 | 38.4759 | 16.3335 | 0.7803 | 22.5131 |
| Ours (w/o null pitch conditioning) | 4.5505 | 5.4869 | 17.6775 | 0.6826 | 20.9058 |
| Ours (AC3D backbone) | 3.1468 | 5.5341 | 7.6100 | 0.4219 | 23.3192 |
| Ours (WAN backbone) | 12.8990 | 14.5696 | 3.4324 | 0.5449 | 18.6097 |