Supplementary material

This supplementary material complements the main paper by providing full videos, technical details and additional results.

Extreme rotation handling (extends fig. 2)

Here we show the full video used to make the "backflip" rotation (with no translation) from fig. 2, along with two additional examples of that same motion. On top is overlaid the input camera rotation in dark grey. Notice how only our method achieves a full 360° looping rotation.

Ours
UCPE
PreciseCam+Gen3C
Prompt: A spacious living room with a gray sofa facing a wooden coffee table. A large window behind sheer curtains lets soft afternoon light enter the room. A bookshelf filled with books lines one wall, and a floor lamp stands near the sofa.

Ours
UCPE
PreciseCam+Gen3C
Prompt: A wide alpine meadow filled with bright wildflowers in bloom beneath towering snow-capped mountains, clear blue sky stretching above the peaks, a narrow stream winding gently through the grassy field, soft spring sunlight illuminating colorful flowers in the foreground, distant pine forests along the slopes, peaceful high-altitude landscape, ultra-detailed, cinematic lighting, natural colors, photorealistic, 35mm photography, depth of field, high dynamic range.

Ours
UCPE
PreciseCam+Gen3C
Prompt: A dramatic canyon landscape with towering red rock cliffs carved by a winding river far below, warm sunset light illuminating layered rock formations, sparse desert vegetation clinging to the edges of the cliffs, vast open sky with glowing clouds above the horizon, rugged stone textures and dry shrubs in the foreground, ultra-detailed, cinematic lighting, natural colors, photorealistic, 35mm photography, depth of field, high dynamic range

Teaser videos (extends fig. 1)

Here we show the full videos used to make fig. 1.

Llamas in a village
Tropical beach

Camera motion: Pitch from +80° to -80°, no translation.

Prompt: A quiet mountain village during snowfall as smoke rises from chimneys and lights glow along the snowy street. Many llamas are walking around in the street.

Camera motion: Roll from -90° to 90°, move forward by 8 meters.

Prompt: A tropical beach at sunrise with palm trees swaying gently while small waves roll onto the golden sand.

Training set examples (extends fig. 4)

Here we show training data samples, complementing fig. 4. Each sample displays the video alongside camera trajectory visualizations from three perspectives: top view, side view, and a 3D view. The cameras are color-coded to represent their position in time: the first camera is purple and the last one is red.

Caption: The video begins with a view of a cozy, eclectic shop filled with various vintage and rustic items. A statue of a Native American figure, adorned with a colorful blanket and a beaded necklace, stands prominently on the right. The shop is well-lit, with warm lights casting a welcoming glow on the brick walls and wooden furniture. A Christmas tree decorated with lights is visible in the background, adding a festive touch to the scene. The camera pans slightly to the left, revealing more of the shop's interior, which includes shelves with framed pictures, lamps, and assorted decorative pieces. The statue remains the focal point as the camera moves, capturing the shop's eclectic charm and the cozy atmosphere. Looking down, we see a person holding a phone in their hand.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: People walk under an archway in a historic city square. Looking down, we see a person walking on a cobblestone path.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: The video begins with a view of a modern, spacious living room featuring a sleek glass coffee table, a black floor lamp, and a brown sofa. The room is well-lit with natural light streaming through large windows, and a balcony with a white railing is visible above. In the background, a dining area with red chairs and a round table is partially visible. The camera slowly pans across the room, highlighting the minimalist decor and architectural details, including a large framed picture on the wall. As the camera moves, it reveals more of the room's features, such as a black spherical light fixture on the coffee table and a chessboard set on a stand. The video maintains a static view, focusing on the room's interior design and layout. Looking down, we see a glass bowl with apples on a table.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: People walk on an escalator in a busy indoor setting. Looking down, we see a person with a backpack on an escalator.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: Yellow aircraft with "REINO DE ESPAÑA 43032" flies over a rocky landscape.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: The video begins with a view of a wooden boardwalk alongside a river, where a tall ship with multiple masts is docked. The scene transitions to a bridge with cables, leading to a cityscape with modern buildings under a clear blue sky. The camera moves forward, capturing the river's surface and the bridge's structure. The view shifts to the riverbank, featuring a red brick building with a waterfront restaurant and a crane in the background. A person walks on the boardwalk, and the camera follows them, maintaining the focus on the river and the urban environment. Looking down, we see a person walking on a wooden deck while holding a phone.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: A serene underwater scene unfolds, showcasing a vibrant coral reef bathed in the soft light of the ocean surface. The camera pans gently over the reef, highlighting its diverse textures and colors, with sunlight filtering through the water, creating a tranquil and ethereal atmosphere.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: The video captures a vibrant outdoor scene at a flower garden on a sunny day. A wide path runs through the garden, bordered by rows of red and purple flowers. People are scattered throughout the area, some walking along the path, while others gather in groups to admire the flowers. In the background, a cityscape with tall buildings is visible under a clear blue sky. A kite is flying in the sky, adding to the leisurely atmosphere. The camera moves along the path, providing a panoramic view of the garden and its visitors. Looking down, we see a person walking on a path surrounded by red and purple flowers.
Video
Null Pitch Video
Top View
Side View
3D View
Caption: Paragliding over a vast desert landscape with mountains in the distance.
Video
Null Pitch Video
Top View
Side View
3D View

Full definition of $\varphi(\cdot)$ and $\psi(\cdot)$ (extends eq. 2)

Obtaining absolute camera poses

For completeness, we repeat equation 2 here, which computes the absolute camera extrinsics: $$ \begin{gather} \mathbf{t}_{\text{abs},f} = \psi(\mathbf{\tilde{u}}_0)\mathbf{t}_{\text{rel},f} \nonumber \\ \mathbf{R}_{\text{abs},f} = \mathbf{R}_{\text{pano},f} \psi(\mathbf{\tilde{u}}_0)\mathbf{R}_{\text{rel},f} \\ \mathbf{E}_{\text{abs},f} = \varphi(\mathbf{R}_{\text{pano},0}) [\mathbf{R}_{\text{abs},f}|\mathbf{t}_{\text{abs},f} ] \;, \nonumber \end{gather} $$ where $\mathbf{\tilde{u}}_0$ is the average up vector of the first frame, $\mathbf{R}_{\text{pano},f}$ is the sampled camera rotation for frame $f$, and $\mathbf{t}_{\text{rel},f}$ and $\mathbf{R}_{\text{rel},f}$ are the translational and rotational components of the relative camera poses, respectively, derived from SfM. We now define the function $\psi(\cdot)$, which aligns camera poses with gravity, and the function $\varphi(\cdot)$, which rotates the poses so that the yaw of the first frame is null.

Definition of the gravity alignment function $\psi(\cdot)$

We first need the average gravity up vector $\mathbf{\tilde{u}}_0$, normalized to unit length. We then obtain the full gravity alignment rotation matrix $\psi(\mathbf{\tilde{u}}_0)$ by computing the forward and right vectors in a similar fashion to a LookAt matrix (fixing the up vector instead of the forward vector). The full derivation is shown below. $$\begin{align*} \mathbf{d} &\gets -\mathbf{\tilde{u}}_0 && \text{ $\triangleright$ Compute the down direction vector $\mathbf{d}$ in camera space. } \\ \mathbf{r} &\gets \mathbf{d} \times [0~0~1]^\top && \text{ $\triangleright$ Compute the right direction vector $\mathbf{r}$ from the down vector and a temporary forward vector. } \\ \mathbf{r} &\gets \frac{\mathbf{r}}{\|\mathbf{r}\|} && \text{ $\triangleright$ Normalize the right direction vector $\mathbf{r}$. } \\ \mathbf{f} &\gets \mathbf{r} \times \mathbf{d} && \text{ $\triangleright$ Compute the forward direction vector $\mathbf{f}$. } \\ \psi(\mathbf{\tilde{u}}_0) &\gets \begin{bmatrix} \mathbf{r}~\mathbf{d}~\mathbf{f} \end{bmatrix} && \text{ $\triangleright$ Compute the matrix containing the absolute pitch and roll for the first frame by concatenating the basis vectors horizontally. } \end{align*}$$
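
As an illustration, the derivation above can be sketched in a few lines of plain Python. This is only a minimal sketch, not our actual implementation: the function name `gravity_alignment` and the helpers are hypothetical, and we assume a camera space in which the temporary forward vector is $[0~0~1]^\top$. The construction is degenerate when the down vector is parallel to that temporary forward vector, which the sketch reports as an error.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-9:
        raise ValueError("degenerate vector: cannot normalize")
    return tuple(c / norm for c in v)


def gravity_alignment(up):
    """Hypothetical sketch of psi: returns the 3x3 matrix [r d f] as nested lists."""
    u = normalize(up)
    d = tuple(-c for c in u)                   # down direction in camera space
    r = normalize(cross(d, (0.0, 0.0, 1.0)))   # right vector from a temporary forward
    f = cross(r, d)                            # forward vector, orthogonal to r and d
    # Concatenate the basis vectors horizontally (one vector per column).
    return [[r[i], d[i], f[i]] for i in range(3)]


if __name__ == "__main__":
    for row in gravity_alignment((0.3, -0.9, 0.2)):
        print(row)
```

With an up vector of $[0~{-1}~0]^\top$ (a camera that is already level in a y-down convention), the sketch returns the identity matrix.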

Definition of the yaw removal function $\varphi(\cdot)$

We begin by extracting from $\mathbf{R}_{\text{pano},0}$ the rotations about the $y$, $x$ and $z$ axes, which give the yaw, pitch and roll matrices, respectively. This results in the following decomposition: $$ \mathbf{R}_{\text{pano},0} = \mathbf{R}_{\text{pano},0,y} \mathbf{R}_{\text{pano},0,x} \mathbf{R}_{\text{pano},0,z}. $$ We then simply use the inverse yaw rotation: $$ \varphi(\mathbf{R}_{\text{pano},0}) \gets \mathbf{R}_{\text{pano},0,y}^{-1}. $$
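
The decomposition above can be sketched as follows. This is a minimal, hedged example: the function names are hypothetical, the matrices are assumed to be standard right-handed rotation matrices, and the angle extraction is only well conditioned away from a pitch of ±90°, where the yaw and roll become ambiguous.

```python
import math


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def decompose_yxz(r):
    """Recover (yaw, pitch, roll) in radians, assuming R = R_y R_x R_z."""
    pitch = math.asin(max(-1.0, min(1.0, -r[1][2])))
    yaw = math.atan2(r[0][2], r[2][2])
    roll = math.atan2(r[1][0], r[1][1])
    return yaw, pitch, roll


def yaw_removal(r_pano_0):
    """Hypothetical sketch of phi: the inverse of the yaw component."""
    yaw, _, _ = decompose_yxz(r_pano_0)
    return rot_y(-yaw)


if __name__ == "__main__":
    r = matmul(rot_y(0.7), matmul(rot_x(0.3), rot_z(-0.2)))
    r_no_yaw = matmul(yaw_removal(r), r)
    print(decompose_yxz(r_no_yaw))  # yaw is numerically zero, pitch and roll unchanged
```

Applying the result to $\mathbf{R}_{\text{pano},0}$ itself leaves only the pitch and roll components, which is the property used for the first frame.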

Dataset details and statistics (extends sec. 3.2 and 4.3)

Training dataset (extends sec. 3.2)

Below we show the distributions of rotations and translations, comparing our training set to RealEstate10K, a dataset typically used for training camera control methods (e.g. AC3D, GEN3C), and PanShot (used to train UCPE). The translation rose plots show the distribution of translation directions seen from the top (obtained from the last frame).

Rotation diversity

Translation diversity (top view)

Evaluation dataset (extends sec. 4.3)

Here, we supply additional details and statistics on our evaluation benchmark, SpatialVID-extreme, extending sec. 4.3 of the paper.

Camera path randomization

We randomly sample a start roll from \([-40°, 40°]\), and a start pitch from \([-90°, 90°]\). The final roll and pitch are sampled the same way, and the end yaw is sampled from \([-180°, 180°]\). The intermediate rotations are interpolated using spherical linear interpolation. Since the method used to compute absolute orientation metrics (Perspective Fields) is never trained with rolls going beyond 45° in magnitude, we resample a random rotation trajectory if an intermediate frame goes outside the roll bounds of \([-40°, 40°]\). Finally, we obtain the translations by estimating them from the original videos with ViPE. We then apply a global rotation to those translations around the yaw axis, with an angle sampled from \([0°, 360°]\), in order to reduce the bias towards translations moving forward.
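
The rotation part of this procedure can be sketched as below. This is a simplified illustration under stated assumptions rather than our exact code: the function names, the number of frames and the retry limit are hypothetical, the start yaw is assumed to be zero, and the Euler convention is assumed to be $\mathbf{R} = \mathbf{R}_y \mathbf{R}_x \mathbf{R}_z$ (yaw, pitch, roll). The translation step (ViPE estimation followed by a random rotation around the yaw axis) is not covered.

```python
import math
import random


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def axis_quat(angle_deg, axis):
    """Rotation of angle_deg about a coordinate axis (0 = x, 1 = y, 2 = z)."""
    half = math.radians(angle_deg) / 2.0
    q = [math.cos(half), 0.0, 0.0, 0.0]
    q[1 + axis] = math.sin(half)
    return tuple(q)


def quat_from_euler(yaw, pitch, roll):
    """Quaternion for R = R_y(yaw) R_x(pitch) R_z(roll), angles in degrees."""
    return quat_mul(quat_mul(axis_quat(yaw, 1), axis_quat(pitch, 0)),
                    axis_quat(roll, 2))


def slerp(q0, q1, t):
    """Spherical linear interpolation along the shortest arc."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation
        q1, dot = tuple(-c for c in q1), -dot
    theta = math.acos(min(1.0, dot))
    if theta < 1e-6:
        return q0
    w0 = math.sin((1.0 - t) * theta) / math.sin(theta)
    w1 = math.sin(t * theta) / math.sin(theta)
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def roll_deg(q):
    """Roll angle in degrees, read from the second row of the rotation matrix."""
    w, x, y, z = q
    return math.degrees(math.atan2(2.0 * (x * y + w * z),
                                   1.0 - 2.0 * (x * x + z * z)))


def sample_rotation_trajectory(num_frames, rng, max_roll=40.0, max_tries=1000):
    """Rejection-sample a slerp trajectory whose roll stays within the bounds."""
    for _ in range(max_tries):
        start = quat_from_euler(0.0, rng.uniform(-90.0, 90.0),
                                rng.uniform(-max_roll, max_roll))
        end = quat_from_euler(rng.uniform(-180.0, 180.0),
                              rng.uniform(-90.0, 90.0),
                              rng.uniform(-max_roll, max_roll))
        frames = [slerp(start, end, i / (num_frames - 1))
                  for i in range(num_frames)]
        if all(abs(roll_deg(q)) <= max_roll for q in frames):
            return frames
    raise RuntimeError("no valid trajectory found")


if __name__ == "__main__":
    trajectory = sample_rotation_trajectory(16, random.Random(0))
    print([round(roll_deg(q), 1) for q in trajectory])
```

Rejection sampling keeps the sketch short. Any trajectory whose interpolated roll leaves the bounds is discarded and resampled, mirroring the resampling step described above.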

Evaluation dataset statistics

Here we show that the original SpatialVID-HQ dataset provides limited diversity in absolute orientation and relative rotations. Our new evaluation benchmark, SpatialVID-extreme, provides a broader coverage of Euler angles and total angular distance.

Additional details on the "look down" prompt (extends sec. 3.3)

Training details

We start by showing an example sampled from the training data. In the training video (left), we can see part of the selfie stick used to hold the camera, but it is absent from the null pitch video (center), used for captioning. As explained in the paper, if undesirable elements are present in the training video (such as the selfie stick), but not in the caption, the model will learn to generate these as if they were normal scene content. To prevent this, we generate a third set of videos, this time looking straight down (right), and also caption them.

In practice, we give the following instructions to the VLM to obtain the "look down" prompt:

Describe the content of this video in under 15 words.
Start by mentionning that the video is taken looking down, e.g. "Looking down, we see ..." or "Below is a view of ...".

We then simply concatenate the two prompts 50% of the time in the following way:


import random

# Append the "look down" caption to the null-pitch caption for half of the samples.
if random.random() < 0.5:
    full_caption = caption_null_pitch + " " + caption_down
else:
    full_caption = caption_null_pitch
            

Training video
Null pitch video
Look down video
Caption: "The video begins with a view of a modern urban walkway lined with large buildings and greenery. The walkway is paved with stone tiles, and there are small patches of grass and plants along the sides. In the background, a set of stairs leads up to a higher level, where a few people are seen walking. The scene remains static for a moment before the camera pans slightly to the right, revealing more of the surrounding area. The video concludes with the camera still focused on the walkway, showing the same buildings and greenery in the background."
Caption: "Looking down, we see a person sweeping a patterned floor with a broom."

With the "Look down" caption included, the full caption is:

The video begins with a view of a modern urban walkway lined with large buildings and greenery. The walkway is paved with stone tiles, and there are small patches of grass and plants along the sides. In the background, a set of stairs leads up to a higher level, where a few people are seen walking. The scene remains static for a moment before the camera pans slightly to the right, revealing more of the surrounding area. The video concludes with the camera still focused on the walkway, showing the same buildings and greenery in the background. Looking down, we see a person sweeping a patterned floor with a broom.
Note that the VLM mistakenly identifies the selfie stick as a broom, but, in practice, we found this had no significant impact on the strategy.

Inference details

At inference time, we keep the input (positive) prompt intact. In order to ensure that undesirable elements (e.g. hands, distortions) are absent from the generated video, we use the following prompt for the negative direction:

Looking down, we see a person and a hand. Looking down, we see distortions. The video is low quality, worst quality, blurry, deformed, disfigured, distorted, extra limbs, cloned face.
Note that the choice of negative prompt can be adjusted at inference time based on user preferences without re-training the model.

Qualitative comparison

Here we show a few examples from our evaluation where we clearly see the artefacts removed by training on the "look down" prompts. The same prompt and negative prompt are used at inference. In the following three examples, we see that, without integrating the "look down" prompts in the training process, the cameraman appears when the camera is pointed directly down.

Ours
w/o look-down training
Prompt: A bustling city sidewalk under scaffolding, with pedestrians in casual summer clothes, cars, and buildings visible in the background, all bathed in soft daylight under an overcast sky.
Ours
w/o look-down training
Prompt: A lush forest bathed in green and golden light, with towering trees, lush moss, and a tranquil, peaceful atmosphere.
Ours
w/o look-down training
Prompt: A serene, historic Chinese street lined with red and yellow lanterns, tiled roofs, and bustling vendors under an overcast sky, evoking a tranquil, cultural atmosphere.

Quantitative comparison

We compare our full model (trained for 30k iterations) against an ablated version trained without look-down prompts (28k iterations). Although the ablated model was trained for slightly fewer iterations, integrating look-down prompts yields a higher CLIP score (20.63 vs. 19.99) and lower camera errors, while FID and FVD remain comparable (marginally higher for the full model). This supports the effectiveness of the training strategy.

Method | PitchErr (abs.) ↓ | GravityErr (abs.) ↓ | RotErr (rel.) ↓ | TransErr (rel.) ↓ | CLIP ↑ | FID ↓ | FVD ↓
Ours (w/o look-down training) | 11.87 | 14.28 | 18.54 | 0.73 | 19.99 | 114.40 | 968.54
Ours | 9.35 | 11.67 | 15.96 | 0.72 | 20.63 | 115.58 | 981.01

Camera control architecture (extends sec. 4.2)

Our lightweight camera control encoder is composed of the following layers:

Layer | Parameters
PixelUnshuffle | downscale_factor=16
Conv2d | in_channels=24×256, out_channels=128, kernel_size=2, stride=2, padding=0
GroupNorm | num_groups=8, num_channels=128
ReLU | (no parameters)
Conv2d | in_channels=128, out_channels=out_dim, kernel_size=1, stride=1, padding=0 (zero initialized weights and biases)
All layers except the last one are randomly initialized. The last layer's features are added as a residual link after the DiT's patchify layer.
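
To make the channel and resolution bookkeeping of this encoder explicit, the sketch below traces tensor shapes through the layers in plain Python. It is an illustration only: the function names, the 512×512 input resolution and `out_dim=1024` are hypothetical placeholders, and the 24 input channels are inferred from the `24×256` entry, since a PixelUnshuffle with factor 16 multiplies the channel count by 16² = 256.

```python
def pixel_unshuffle_shape(channels, height, width, factor):
    """Shape after PixelUnshuffle: space is folded into channels."""
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by the factor")
    return channels * factor * factor, height // factor, width // factor


def conv2d_shape(height, width, out_channels, kernel_size, stride, padding):
    """Shape after a Conv2d with square kernel, stride and padding."""
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    return out_channels, out_h, out_w


def encoder_shapes(height, width, out_dim, in_channels=24):
    """List (layer name, (C, H, W)) for each stage of the encoder."""
    shapes = [("input", (in_channels, height, width))]
    c, h, w = pixel_unshuffle_shape(in_channels, height, width, 16)
    shapes.append(("PixelUnshuffle", (c, h, w)))
    c, h, w = conv2d_shape(h, w, 128, kernel_size=2, stride=2, padding=0)
    shapes.append(("Conv2d", (c, h, w)))
    shapes.append(("GroupNorm", (c, h, w)))  # normalization keeps the shape
    shapes.append(("ReLU", (c, h, w)))       # activation keeps the shape
    c, h, w = conv2d_shape(h, w, out_dim, kernel_size=1, stride=1, padding=0)
    shapes.append(("Conv2d (zero init)", (c, h, w)))
    return shapes


if __name__ == "__main__":
    # Hypothetical input size and output width, chosen only for illustration.
    for name, shape in encoder_shapes(512, 512, out_dim=1024):
        print(f"{name:20s} {shape}")
```

Overall, the encoder reduces the spatial resolution by a factor of 32 (16 from the PixelUnshuffle, 2 from the strided convolution). Because the last convolution is zero initialized, the residual it adds is zero at the start of training.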

Comparison against panorama generation methods (extends sec. 2)

Panorama generation methods, such as PanoWAN [43], can generate realistic 360° equirectangular videos. As mentioned in the paper, gravity-aligned camera control can be achieved by cropping the desired field of view out of the generated panorama, but this comes with several disadvantages: 1) most of the panorama pixels are discarded, resulting in severe resolution degradation; 2) the desired concepts described in the prompt can be cropped out and thus not appear in the video; and 3) current text-conditioned methods do not provide control over camera translation.

Qualitative comparison

Here, we show an example where two concepts should be present in the generated image: a house and a lake.

We run both our method and PanoWAN with 4 different seeds and display the first generated frame. When cropping a 90° FOV from the generated panorama, notice that for 3 out of the 4 seeds (0, 1, 2), PanoWAN includes either the lake or the house, but not both. When cropping a 45° FOV, all seeds lack either the house or the lake. Our method includes both elements, while also generating sharper details. Please zoom in to compare.

Prompt: A charming wooden lakeside house with warm lights glowing from the windows, sitting beside a crystal-clear blue lake surrounded by lush green trees and distant mountains, soft golden sunset light reflecting on the calm water, wildflowers in the foreground, ultra-detailed, cinematic lighting, natural colors, photorealistic, 35mm photography, depth of field, high dynamic range
Ours: 90° FOV | 45° FOV
PanoWAN: Generated panorama | 90° FOV (cropped) | 45° FOV (cropped)
Seed 0
Seed 1
Seed 2
Seed 3

We show a second example where a temple should be clearly visible and surrounded by valleys. Again, PanoWAN omits the temple in most generations, whereas our method consistently includes it.

Prompt: An ancient stone temple perched high on a rugged mountain ridge surrounded by dramatic cliffs and mist-filled valleys, intricate carvings on weathered stone pillars, snow-capped peaks visible in the distance, warm sunrise light illuminating the temple facade, cinematic lighting, natural textures, photorealistic, 35mm photography, depth of field, high dynamic range
Ours: 90° FOV | 45° FOV
PanoWAN: Generated panorama | 90° FOV (cropped) | 45° FOV (cropped)
Seed 0
Seed 1
Seed 2
Seed 3

Quantitative comparison

Since PanoWAN produces full 360° panoramas at $896 \times 448$ resolution, we must crop its outputs to obtain standard perspective images. This cropping causes significant detail loss, discarding 86.1% and 95.8% of the pixels for 90° and 45° fields of view, respectively. The loss of high-frequency detail affects CLIP alignment and FID scores, even though both models are trained on similar data: ours is trained on a subset of PanoVid, while PanoWAN uses the entire dataset. At 90° FOV, PanoWAN shows lower prompt alignment (CLIP score of 19.46 vs. 20.63 for ours) and lower image quality (FID of 123.10 vs. 115.58 for ours). At a narrower FOV of 45°, the differences are more pronounced, with CLIP scores of 16.82 vs. 18.54 and FID scores of 142.81 vs. 122.43 (PanoWAN vs. ours).

Details of the AC3D baseline prompt engineering (extends sec. 4.4)

For the AC3D baselines, we provide the model with the absolute camera orientation through text, as mentioned in sec. 4.4 of the paper. Here, we show the code used to generate these camera descriptions. We first take the absolute camera extrinsics \(E_\text{abs}\) for the video and convert them to Euler angles. We then describe textually only the first frame's pitch and roll and the last frame's yaw, pitch, and roll. The full implementation is given below.

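def get_camera_prompt_from_absolute_c2w(c2w_absolute):
    """Build a textual description of the first and last absolute camera orientations."""
    # c2w_to_pitch_roll_yaw is a helper (not shown) that returns per-frame
    # 'pitch', 'roll' and 'yaw' angles, in degrees.
    euler_angles = c2w_to_pitch_roll_yaw(c2w_absolute)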
    first_pitch = euler_angles['pitch'][0].item()
    first_roll = euler_angles['roll'][0].item()
    first_yaw = euler_angles['yaw'][0].item()
    last_pitch = euler_angles['pitch'][-1].item()
    last_roll = euler_angles['roll'][-1].item()
    last_yaw = euler_angles['yaw'][-1].item()
    
    
    def describe_angle_shot(pitch):
        """Describe pitch angle shot."""
        pitch_rounded = round(pitch)
        if -5 <= pitch_rounded <= 5:
            return "near straight-on shot"
        elif pitch_rounded > 0:
            if 5 < pitch_rounded <= 20:
                return f"small tilt-up of {pitch_rounded} degrees"
            elif 20 < pitch_rounded <= 45:
                return f"large tilt-up of {pitch_rounded} degrees"
            else:
                return f"extreme tilt-up of {pitch_rounded} degrees"
        else:
            abs_pitch = abs(pitch_rounded)
            if 5 < abs_pitch <= 20:
                return f"small tilt-down of {abs_pitch} degrees"
            elif 20 < abs_pitch <= 45:
                return f"large tilt-down of {abs_pitch} degrees"
            else:
                return f"extreme tilt-down of {abs_pitch} degrees"
    
    def describe_dutch_angle(roll):
        """Describe roll (Dutch angle) with clockwise/counterclockwise."""
        roll_rounded = round(roll)
        if -5 <= roll_rounded <= 5:
            return "near level shot"
        
        abs_roll = abs(roll_rounded)
        if abs_roll <= 20:
            magnitude = "small"
        elif abs_roll <= 45:
            magnitude = "large"
        else:
            magnitude = "extreme"
        
        # Positive roll is counterclockwise, negative is clockwise
        direction = "counterclockwise" if roll_rounded > 0 else "clockwise"
        
        return f"a {magnitude} Dutch angle tilted {direction} {abs_roll} degrees"
    
    def describe_yaw(yaw):
        """Describe yaw (pan) direction."""
        yaw_rounded = round(yaw)
        
        abs_yaw = abs(yaw_rounded)
        direction = "right" if yaw_rounded > 0 else "left"
        return f"pan of {abs_yaw} degrees turned {direction}"
    
    # Build start description
    start_parts = []
    start_pitch_desc = describe_angle_shot(first_pitch)
    start_roll_desc = describe_dutch_angle(first_roll)
    
    if start_pitch_desc != "near straight-on shot":
        start_parts.append(start_pitch_desc)
    if start_roll_desc != "near level shot":
        start_parts.append(start_roll_desc)
    
    # Build end description
    end_parts = []
    end_yaw_desc = describe_yaw(last_yaw)
    end_pitch_desc = describe_angle_shot(last_pitch)
    end_roll_desc = describe_dutch_angle(last_roll)
    
    if end_yaw_desc:
        end_parts.append(end_yaw_desc)
    if end_pitch_desc != "near straight-on shot":
        end_parts.append(end_pitch_desc)
    if end_roll_desc != "near level shot":
        end_parts.append(end_roll_desc)
    
    # Construct final description
    description_parts = []
    

    start_text = "The camera starts at " if len(start_parts) == 1 else "The camera starts with "
    description_parts.append(start_text + ", and ".join(start_parts))
    
    end_text = "The camera ends with " + ", ".join(end_parts)
    description_parts.append(end_text)
    
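    # Note: the trailing [:-1] drops the last character of the joined string,
    # which is why the camera descriptions listed with the qualitative results
    # lack their final character (e.g. they end in "degree").
    return ". ".join(description_parts)[:-1]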
        

After obtaining the camera description, we concatenate it after the regular prompt. Refer to the qualitative results at the end of the supplementary material for examples of camera descriptions.

Additional results on the prompt-camera entanglement benchmark (extends sec. 4.6 and fig. 6)

Additional quantitative results (extends fig. 6.b)

First, we report additional graphs showing the relationship between prompt alignment and the input pitch angle. Prompt alignment is measured with the CLIP score, where a higher score indicates stronger agreement between the prompt and the image.

We report the CLIP score computed between the generated image and the forward-looking input prompt (left), and between the generated image and the caption from the ground truth at the input pitch angle (right). We display both relative graphs, in which each curve is shifted by the method's average CLIP score (top), and absolute graphs (bottom).

We observe that, across the range of pitch angles, our method (in orange) comes closer to the ground truth values (in black) than both the variant trained without null-pitch conditioning (in red) and the other methods.
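
The shift used for the relative graphs amounts to subtracting each method's mean CLIP score from its curve. A minimal sketch is given below. The function name is hypothetical and the scores in the usage example are made-up placeholders, not values from our benchmark.

```python
def relative_clip_scores(scores_by_pitch):
    """Shift one method's CLIP curve so that its average over pitch angles is zero."""
    mean = sum(scores_by_pitch.values()) / len(scores_by_pitch)
    return {pitch: score - mean for pitch, score in scores_by_pitch.items()}


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    toy_scores = {-90: 18.0, 0: 22.0, 90: 20.0}
    print(relative_clip_scores(toy_scores))
```

Shifting each curve this way removes differences in the overall CLIP level between methods, so that only the variation with pitch remains.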

CLIP computed between the caption from the ground truth (at the input pitch angle) and the generated image (relative variations)

CLIP computed between the input caption (forward looking) and the generated image (relative variations)

CLIP computed between the caption from the ground truth (at the input pitch angle) and the generated image (absolute)

CLIP computed between the input caption (forward looking) and the generated image (absolute)

Additional qualitative results (extends fig. 6.a)

Here, we randomly sampled 6 scenes out of the 20 scenes used in our benchmark. We show the prompt and each method's output along with the ground truth, for 5 different pitch angles. Notice how our method more closely follows the input pitch angle, especially at extreme pitch angles (-90° and 90°).

Note that PreciseCam was trained on PolyHaven panoramas used in our benchmark.

Prompt: Two people are working at a desk with computers. The room has wooden beams on the ceiling and posters on the wall. One person is typing on a keyboard while the other looks at the screen. The workspace is cluttered with various items.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°
Prompt: The scene shows an abandoned concrete structure with graffiti on the walls and ceiling. Sunlight streams through large openings, illuminating the overgrown vegetation outside. The interior is dusty and empty, with visible cracks and decay. The graffiti includes the word "RAVE" in red.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°
Prompt: A dirt path winds through a lush, green forest with tall trees and dense shrubs. The sky is clear and blue, with a few scattered clouds. The path leads into the distance, surrounded by vibrant vegetation.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°
Prompt: A cozy room features a beige tufted sofa in the center, flanked by a Christmas tree and a white fireplace adorned with a wreath. To the right, a gray canopy tent and a small cloud-shaped decoration are visible. The room is illuminated by a chandelier and overhead lights, with sheer curtains on the left and decorative screens nearby.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°
Prompt: A large circular softbox light is positioned on the right side of a plain, white room. The softbox emits a bright, diffused light, illuminating the area around it. The ceiling and walls are smooth and unadorned, with visible wiring and a corner of the room slightly curved. The scene remains static, focusing solely on the lighting setup.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°
Prompt: A spacious workshop with large windows and industrial equipment is shown. A blue metal structure dominates the foreground, surrounded by various tools and workbenches. In the background, a staircase leads to an upper level with a glass enclosure. The room is filled with scattered tools, machinery, and workstations, creating an organized yet busy atmosphere.
Pitch | Ground truth | Ours | w/o null pitch | UCPE | PreciseCam
90°
50°
-50°
-90°

Additional qualitative results (extends fig. 5)

Here are additional qualitative results, complementing fig. 5. We show 18 samples from our SpatialVID-extreme dataset. The black overlay represents the input camera orientation. For the "AC3D + cam. text" baselines, we additionally provide the model with the camera prompt, which indicates the absolute camera orientation.

To avoid overly large file sizes, each video's resolution is reduced from $720 \times 480$ to $480\times 320$.

Prompt: A bustling New York City street is framed by towering buildings, with a tall skyscraper dominating the skyline under bright, slightly desaturated sunlight.
Camera description (for AC3D): The camera starts with small tilt-down of 15 degrees, and a large Dutch angle tilted counterclockwise 36 degrees. The camera ends with pan of 99 degrees turned right, large tilt-down of 42 degree
GT trajectory: top view, side view
Method | PitchErr (abs.) ↓ | GravityErr (abs.) ↓ | RotErr (rel.) ↓ | TransErr (rel.) ↓ | CLIP ↑
Ours | 14.2980 | 15.3593 | 18.9273 | 0.4353 | 15.3568
Ours (w/o null-pitch cond.) | 18.4883 | 19.5434 | 10.6204 | 0.5627 | 18.4471
UCPE | 7.7234 | 9.3229 | 8.8529 | 0.3366 | 16.2734
AC3D + cam. text. | 65.1736 | 70.8080 | 27.5934 | 0.6098 | 21.3544
PreciseCam + WAN-I2V-CC | 18.6828 | 28.1785 | 12.2269 | 0.5901 | 18.5247
PreciseCam + GEN3C | 10.2322 | 16.8846 | 5.2063 | 0.2898 | 18.1901
Prompt: A tranquil natural landscape features towering mountains, a serene lake, and a glacier, all bathed in a golden, hazy sky that enhances the sense of peace and grandeur.
Camera description (for AC3D): The camera starts with large tilt-up of 41 degrees, and a large Dutch angle tilted counterclockwise 26 degrees. The camera ends with pan of 140 degrees turned left, small tilt-up of 7 degrees, a small Dutch angle tilted counterclockwise 12 degree
GT trajectory: top view, side view
Method | PitchErr (abs.) ↓ | GravityErr (abs.) ↓ | RotErr (rel.) ↓ | TransErr (rel.) ↓ | CLIP ↑
Ours | 7.7905 | 8.3825 | 27.8608 | 0.3450 | 22.6281
Ours (w/o null-pitch cond.) | 9.0780 | 14.1736 | 18.3716 | 0.7963 | 22.4388
UCPE | 11.3845 | 14.0517 | 35.1550 | 0.4917 | 22.5286
AC3D + cam. text. | 21.4826 | 24.1312 | 29.2883 | 0.4646 | 25.4205
PreciseCam + WAN-I2V-CC | 9.1725 | 14.6736 | 22.5483 | 0.5364 | 23.0408
PreciseCam + GEN3C | 6.1081 | 16.0730 | 17.2306 | 0.2603 | 20.9883
Prompt: A tranquil, overcast stream flows through a lush, green valley, framed by stone walls and dense vegetation, evoking a peaceful natural setting.
Camera description (for AC3D): The camera starts with large tilt-down of 28 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 85 degrees turned right, extreme tilt-up of 49 degrees, a large Dutch angle tilted counterclockwise 24 degree
GT trajectory: top view, side view
Method | PitchErr (abs.) ↓ | GravityErr (abs.) ↓ | RotErr (rel.) ↓ | TransErr (rel.) ↓ | CLIP ↑
Ours | 3.5463 | 7.8978 | 10.3566 | 0.9282 | 23.9111
Ours (w/o null-pitch cond.) | 7.6507 | 8.7388 | 9.3721 | 1.2238 | 23.8106
UCPE | 7.2480 | 11.4872 | 22.4277 | 0.3916 | 23.4212
AC3D + cam. text. | 20.5811 | 29.4177 | 30.6997 | 0.8983 | 22.4874
PreciseCam + WAN-I2V-CC | 36.6695 | 37.2216 | 32.8844 | 0.6433 | 22.7227
PreciseCam + GEN3C | 27.3445 | 29.3676 | 17.9566 | 0.4298 | 21.0256
Prompt: A serene pond with swans and cygnets, surrounded by stone edges and greenery, under an overcast sky, capturing quiet natural life.
Camera description (for AC3D): The camera starts with small tilt-up of 7 degrees, and a small Dutch angle tilted clockwise 6 degrees. The camera ends with pan of 80 degrees turned left, large tilt-up of 33 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 2.8573, GravityErr (abs.) ↓ 3.7889, RotErr (rel.) ↓ 14.0554, TransErr (rel.) ↓ 1.0552, CLIP ↑ 24.4302
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 12.7494, GravityErr (abs.) ↓ 12.9733, RotErr (rel.) ↓ 18.1938, TransErr (rel.) ↓ 1.0534, CLIP ↑ 25.6092
UCPE: PitchErr (abs.) ↓ 8.7367, GravityErr (abs.) ↓ 9.5328, RotErr (rel.) ↓ 13.9433, TransErr (rel.) ↓ 0.9523, CLIP ↑ 28.2995
AC3D + cam. text.: PitchErr (abs.) ↓ 20.9364, GravityErr (abs.) ↓ 21.0431, RotErr (rel.) ↓ 20.5057, TransErr (rel.) ↓ 0.8868, CLIP ↑ 27.3559
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 6.8468, GravityErr (abs.) ↓ 9.0722, RotErr (rel.) ↓ 24.6648, TransErr (rel.) ↓ 0.7248, CLIP ↑ 16.5736
PreciseCam + GEN3C: PitchErr (abs.) ↓ 13.8269, GravityErr (abs.) ↓ 15.2413, RotErr (rel.) ↓ 8.4598, TransErr (rel.) ↓ 0.3358, CLIP ↑ 16.5891
Prompt: A lively city street features a colorful food truck surrounded by pedestrians and traffic, under soft cloud cover, blending urban energy with cultural charm.
Camera description (for AC3D): The camera starts at large tilt-down of 27 degrees. The camera ends with pan of 65 degrees turned right, large tilt-up of 43 degrees, a large Dutch angle tilted clockwise 32 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 3.4836, GravityErr (abs.) ↓ 10.6679, RotErr (rel.) ↓ 10.7291, TransErr (rel.) ↓ 0.3207, CLIP ↑ 21.2974
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 3.8587, GravityErr (abs.) ↓ 5.8374, RotErr (rel.) ↓ 20.9080, TransErr (rel.) ↓ 0.4057, CLIP ↑ 21.9627
UCPE: PitchErr (abs.) ↓ 4.5728, GravityErr (abs.) ↓ 11.8453, RotErr (rel.) ↓ 12.1141, TransErr (rel.) ↓ 0.2237, CLIP ↑ 22.3326
AC3D + cam. text.: PitchErr (abs.) ↓ 11.4835, GravityErr (abs.) ↓ 29.9021, RotErr (rel.) ↓ 23.6326, TransErr (rel.) ↓ 0.4312, CLIP ↑ 25.8033
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 8.0260, GravityErr (abs.) ↓ 17.2676, RotErr (rel.) ↓ 11.9612, TransErr (rel.) ↓ 0.3089, CLIP ↑ 23.7883
PreciseCam + GEN3C: PitchErr (abs.) ↓ 5.4480, GravityErr (abs.) ↓ 13.9425, RotErr (rel.) ↓ 5.0740, TransErr (rel.) ↓ 0.2142, CLIP ↑ 22.9672
Prompt: A man in a blue shirt works on a marina dock during the daytime, surrounded by yachts and a clear blue sky, evoking a vibrant, luxurious atmosphere.
Camera description (for AC3D): The camera starts with extreme tilt-down of 78 degrees, and a large Dutch angle tilted clockwise 26 degrees. The camera ends with pan of 17 degrees turned left, large tilt-up of 32 degrees, a large Dutch angle tilted clockwise 38 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 4.8463, GravityErr (abs.) ↓ 6.2187, RotErr (rel.) ↓ 8.6361, TransErr (rel.) ↓ 0.6297, CLIP ↑ 23.5810
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 5.4456, GravityErr (abs.) ↓ 8.1023, RotErr (rel.) ↓ 10.2561, TransErr (rel.) ↓ 1.1524, CLIP ↑ 24.8524
UCPE: PitchErr (abs.) ↓ 8.6654, GravityErr (abs.) ↓ 9.8130, RotErr (rel.) ↓ 15.7988, TransErr (rel.) ↓ 0.2517, CLIP ↑ 25.1469
AC3D + cam. text.: PitchErr (abs.) ↓ 16.2291, GravityErr (abs.) ↓ 32.4096, RotErr (rel.) ↓ 32.3674, TransErr (rel.) ↓ 0.9814, CLIP ↑ 24.3097
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 28.4748, GravityErr (abs.) ↓ 38.1090, RotErr (rel.) ↓ 40.9235, TransErr (rel.) ↓ 1.3353, CLIP ↑ 22.6506
PreciseCam + GEN3C: PitchErr (abs.) ↓ 41.4208, GravityErr (abs.) ↓ 48.2946, RotErr (rel.) ↓ 12.4817, TransErr (rel.) ↓ 0.5966, CLIP ↑ 18.2815
Prompt: A clean, organized navigation station on a boat features instruments, controls, and a seating area, with a tropical landscape visible through the window, evoking a serene, functional maritime environment.
Camera description (for AC3D): The camera starts with small tilt-up of 11 degrees, and a large Dutch angle tilted counterclockwise 24 degrees. The camera ends with pan of 3 degrees turned left, extreme tilt-up of 76 degrees, a small Dutch angle tilted counterclockwise 18 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 12.7647, GravityErr (abs.) ↓ 14.1959, RotErr (rel.) ↓ 14.5900, TransErr (rel.) ↓ 0.6959, CLIP ↑ 15.5817
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 11.0004, GravityErr (abs.) ↓ 11.3812, RotErr (rel.) ↓ 5.5317, TransErr (rel.) ↓ 0.9898, CLIP ↑ 16.8967
UCPE: PitchErr (abs.) ↓ 16.1051, GravityErr (abs.) ↓ 16.3942, RotErr (rel.) ↓ 15.6113, TransErr (rel.) ↓ 0.2461, CLIP ↑ 20.5828
AC3D + cam. text.: PitchErr (abs.) ↓ 51.1265, GravityErr (abs.) ↓ 55.6638, RotErr (rel.) ↓ 14.0315, TransErr (rel.) ↓ 0.7444, CLIP ↑ 25.7179
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 28.5064, GravityErr (abs.) ↓ 29.1580, RotErr (rel.) ↓ 18.8119, TransErr (rel.) ↓ 0.3052, CLIP ↑ 23.9789
PreciseCam + GEN3C: PitchErr (abs.) ↓ 61.7162, GravityErr (abs.) ↓ 62.8028, RotErr (rel.) ↓ 9.4117, TransErr (rel.) ↓ 0.5981, CLIP ↑ 19.5752
Prompt: A lively pedestrian street lined with traditional shops and trees, bathed in soft afternoon light, exudes a relaxed, vibrant energy amid a crowd of people.
Camera description (for AC3D): The camera starts with small tilt-down of 14 degrees, and a small Dutch angle tilted counterclockwise 13 degrees. The camera ends with pan of 77 degrees turned left, small tilt-up of 10 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 2.4097, GravityErr (abs.) ↓ 5.0623, RotErr (rel.) ↓ 16.1530, TransErr (rel.) ↓ 0.8581, CLIP ↑ 18.9255
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 4.3999, GravityErr (abs.) ↓ 6.7890, RotErr (rel.) ↓ 27.0753, TransErr (rel.) ↓ 0.8181, CLIP ↑ 23.0018
UCPE: PitchErr (abs.) ↓ 7.5854, GravityErr (abs.) ↓ 9.5029, RotErr (rel.) ↓ 15.4069, TransErr (rel.) ↓ 1.0438, CLIP ↑ 24.1084
AC3D + cam. text.: PitchErr (abs.) ↓ 6.3432, GravityErr (abs.) ↓ 10.3546, RotErr (rel.) ↓ 14.0585, TransErr (rel.) ↓ 0.6816, CLIP ↑ 20.2794
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 4.7331, GravityErr (abs.) ↓ 10.0578, RotErr (rel.) ↓ 11.7439, TransErr (rel.) ↓ 1.0422, CLIP ↑ 22.0151
PreciseCam + GEN3C: PitchErr (abs.) ↓ 4.4809, GravityErr (abs.) ↓ 11.3291, RotErr (rel.) ↓ 2.0049, TransErr (rel.) ↓ 0.2286, CLIP ↑ 19.6481
Prompt: A cozy European town square with historic architecture, greenery, and people enjoying a quiet, sun-dappled afternoon in a relaxed, inviting setting.
Camera description (for AC3D): The camera starts with small tilt-up of 16 degrees, and a small Dutch angle tilted counterclockwise 9 degrees. The camera ends with pan of 68 degrees turned left, large tilt-up of 41 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 3.2326, GravityErr (abs.) ↓ 4.2766, RotErr (rel.) ↓ 7.9688, TransErr (rel.) ↓ 0.7219, CLIP ↑ 23.9761
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 2.6241, GravityErr (abs.) ↓ 3.3654, RotErr (rel.) ↓ 4.9019, TransErr (rel.) ↓ 0.8855, CLIP ↑ 24.9427
UCPE: PitchErr (abs.) ↓ 2.7419, GravityErr (abs.) ↓ 3.7975, RotErr (rel.) ↓ 11.1457, TransErr (rel.) ↓ 0.1877, CLIP ↑ 24.2604
AC3D + cam. text.: PitchErr (abs.) ↓ 19.1283, GravityErr (abs.) ↓ 21.0433, RotErr (rel.) ↓ 19.2928, TransErr (rel.) ↓ 0.8736, CLIP ↑ 26.9065
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 6.6120, GravityErr (abs.) ↓ 9.9718, RotErr (rel.) ↓ 8.7302, TransErr (rel.) ↓ 0.8353, CLIP ↑ 27.1105
PreciseCam + GEN3C: PitchErr (abs.) ↓ 4.5310, GravityErr (abs.) ↓ 7.0418, RotErr (rel.) ↓ 2.5711, TransErr (rel.) ↓ 0.2373, CLIP ↑ 25.2476
Prompt: A serene aerial view of a river winding through a green forest, bordered by red-brown plains and scattered palm trees under bright sunlight.
Camera description (for AC3D): The camera starts with large tilt-up of 28 degrees, and a large Dutch angle tilted counterclockwise 21 degrees. The camera ends with pan of 5 degrees turned left, a small Dutch angle tilted clockwise 15 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 7.9396, GravityErr (abs.) ↓ 9.9094, RotErr (rel.) ↓ 3.6218, TransErr (rel.) ↓ 0.3680, CLIP ↑ 19.3195
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 2.5417, GravityErr (abs.) ↓ 9.8657, RotErr (rel.) ↓ 13.1780, TransErr (rel.) ↓ 0.4928, CLIP ↑ 22.2930
UCPE: PitchErr (abs.) ↓ 23.6944, GravityErr (abs.) ↓ 27.6941, RotErr (rel.) ↓ 7.4051, TransErr (rel.) ↓ 0.1607, CLIP ↑ 27.8953
AC3D + cam. text.: PitchErr (abs.) ↓ 27.4511, GravityErr (abs.) ↓ 29.3907, RotErr (rel.) ↓ 24.4686, TransErr (rel.) ↓ 0.3677, CLIP ↑ 27.0442
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 66.2299, GravityErr (abs.) ↓ 69.5474, RotErr (rel.) ↓ 9.7533, TransErr (rel.) ↓ 0.6132, CLIP ↑ 23.9130
PreciseCam + GEN3C: PitchErr (abs.) ↓ 46.8744, GravityErr (abs.) ↓ 52.2966, RotErr (rel.) ↓ 2.9470, TransErr (rel.) ↓ 0.5611, CLIP ↑ 23.0218
Prompt: A serene forest under a clear blue sky, where tall trees stand alongside evergreens, creating a tranquil atmosphere.
Camera description (for AC3D): The camera starts with large tilt-up of 33 degrees, and a small Dutch angle tilted clockwise 10 degrees. The camera ends with pan of 130 degrees turned right, large tilt-up of 28 degrees, a large Dutch angle tilted clockwise 28 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 2.7865, GravityErr (abs.) ↓ 7.1986, RotErr (rel.) ↓ 20.0275, TransErr (rel.) ↓ 1.0549, CLIP ↑ 23.1416
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 1.5670, GravityErr (abs.) ↓ 3.0749, RotErr (rel.) ↓ 23.4705, TransErr (rel.) ↓ 1.0889, CLIP ↑ 20.6701
UCPE: PitchErr (abs.) ↓ 4.0603, GravityErr (abs.) ↓ 9.1477, RotErr (rel.) ↓ 11.2254, TransErr (rel.) ↓ 0.9391, CLIP ↑ 25.0470
AC3D + cam. text.: PitchErr (abs.) ↓ 3.6253, GravityErr (abs.) ↓ 16.5770, RotErr (rel.) ↓ 41.2876, TransErr (rel.) ↓ 0.7064, CLIP ↑ 23.0964
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 25.4268, GravityErr (abs.) ↓ 30.7069, RotErr (rel.) ↓ 48.5685, TransErr (rel.) ↓ 0.8232, CLIP ↑ 20.5326
PreciseCam + GEN3C: PitchErr (abs.) ↓ 16.6901, GravityErr (abs.) ↓ 28.2620, RotErr (rel.) ↓ 10.5293, TransErr (rel.) ↓ 0.4145, CLIP ↑ 22.3368
Prompt: A luxurious, elegantly designed foyer features symmetrical staircases, a chandelier, and decorative elements, exuding sophistication and calm serenity.
Camera description (for AC3D): The camera starts at large tilt-down of 24 degrees. The camera ends with pan of 141 degrees turned right, small tilt-down of 12 degrees, a large Dutch angle tilted counterclockwise 32 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 15.1808, GravityErr (abs.) ↓ 15.9019, RotErr (rel.) ↓ 18.0686, TransErr (rel.) ↓ 0.3759, CLIP ↑ 22.0429
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 14.3956, GravityErr (abs.) ↓ 15.3638, RotErr (rel.) ↓ 14.7519, TransErr (rel.) ↓ 0.4025, CLIP ↑ 20.6996
UCPE: PitchErr (abs.) ↓ 3.1735, GravityErr (abs.) ↓ 7.3737, RotErr (rel.) ↓ 11.9370, TransErr (rel.) ↓ 0.3145, CLIP ↑ 26.1348
AC3D + cam. text.: PitchErr (abs.) ↓ 17.5033, GravityErr (abs.) ↓ 20.8107, RotErr (rel.) ↓ 26.1418, TransErr (rel.) ↓ 0.5510, CLIP ↑ 19.2842
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 7.5387, GravityErr (abs.) ↓ 15.2491, RotErr (rel.) ↓ 17.5111, TransErr (rel.) ↓ 0.5266, CLIP ↑ 23.9171
PreciseCam + GEN3C: PitchErr (abs.) ↓ 8.6820, GravityErr (abs.) ↓ 13.8483, RotErr (rel.) ↓ 7.0481, TransErr (rel.) ↓ 0.1435, CLIP ↑ 19.9049
Prompt: A vibrant forest canopy bathed in green and yellow foliage frames a blue sky, creating a peaceful, tranquil atmosphere under soft, diffused light.
Camera description (for AC3D): The camera starts at large tilt-up of 21 degrees. The camera ends with pan of 42 degrees turned right, small tilt-up of 20 degrees, a large Dutch angle tilted clockwise 38 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 22.0111, GravityErr (abs.) ↓ 24.1429, RotErr (rel.) ↓ 10.4896, TransErr (rel.) ↓ 0.3803, CLIP ↑ 21.1433
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 14.2944, GravityErr (abs.) ↓ 15.7664, RotErr (rel.) ↓ 8.6505, TransErr (rel.) ↓ 0.8372, CLIP ↑ 23.0423
UCPE: PitchErr (abs.) ↓ 11.3981, GravityErr (abs.) ↓ 13.7690, RotErr (rel.) ↓ 18.0704, TransErr (rel.) ↓ 0.1442, CLIP ↑ 24.2058
AC3D + cam. text.: PitchErr (abs.) ↓ 33.9443, GravityErr (abs.) ↓ 41.5466, RotErr (rel.) ↓ 12.1503, TransErr (rel.) ↓ 0.6659, CLIP ↑ 25.1357
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 55.9415, GravityErr (abs.) ↓ 57.0587, RotErr (rel.) ↓ 13.5988, TransErr (rel.) ↓ 0.2039, CLIP ↑ 25.2885
PreciseCam + GEN3C: PitchErr (abs.) ↓ 54.7138, GravityErr (abs.) ↓ 56.4570, RotErr (rel.) ↓ 6.0962, TransErr (rel.) ↓ 0.3333, CLIP ↑ 23.1547
Prompt: A lively urban street with classical architecture, towering buildings, and a vibrant atmosphere, captured under a clear blue sky with pedestrians and vehicles in motion.
Camera description (for AC3D): The camera starts at small tilt-up of 14 degrees. The camera ends with pan of 147 degrees turned right, extreme tilt-up of 58 degrees, a large Dutch angle tilted counterclockwise 36 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 7.1881, GravityErr (abs.) ↓ 8.9837, RotErr (rel.) ↓ 17.7980, TransErr (rel.) ↓ 0.4106, CLIP ↑ 22.8832
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 9.0029, GravityErr (abs.) ↓ 10.2443, RotErr (rel.) ↓ 28.6189, TransErr (rel.) ↓ 0.9817, CLIP ↑ 23.8872
UCPE: PitchErr (abs.) ↓ 5.7361, GravityErr (abs.) ↓ 8.1854, RotErr (rel.) ↓ 19.3946, TransErr (rel.) ↓ 0.1632, CLIP ↑ 22.7695
AC3D + cam. text.: PitchErr (abs.) ↓ 24.2616, GravityErr (abs.) ↓ 28.2293, RotErr (rel.) ↓ 52.6200, TransErr (rel.) ↓ 0.3919, CLIP ↑ 19.3397
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 13.7577, GravityErr (abs.) ↓ 15.9307, RotErr (rel.) ↓ 16.3743, TransErr (rel.) ↓ 0.3195, CLIP ↑ 21.1071
PreciseCam + GEN3C: PitchErr (abs.) ↓ 5.2091, GravityErr (abs.) ↓ 8.1628, RotErr (rel.) ↓ 7.3163, TransErr (rel.) ↓ 0.1676, CLIP ↑ 19.6757
Prompt: A serene, white church with a neutral colored arched doorway and tall steeple stands in a lush, green setting under a soft, blue sky, exuding quiet reverence and welcome.
Camera description (for AC3D): The camera starts with extreme tilt-down of 57 degrees, and a large Dutch angle tilted clockwise 31 degrees. The camera ends with pan of 88 degrees turned right, small tilt-down of 17 degrees, a large Dutch angle tilted clockwise 23 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 5.7712, GravityErr (abs.) ↓ 9.2607, RotErr (rel.) ↓ 8.7355, TransErr (rel.) ↓ 0.4564, CLIP ↑ 18.8290
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 3.0359, GravityErr (abs.) ↓ 7.6106, RotErr (rel.) ↓ 3.8886, TransErr (rel.) ↓ 0.6009, CLIP ↑ 21.7065
UCPE: PitchErr (abs.) ↓ 63.5745, GravityErr (abs.) ↓ 65.1083, RotErr (rel.) ↓ 17.6118, TransErr (rel.) ↓ 0.5241, CLIP ↑ 22.6278
AC3D + cam. text.: PitchErr (abs.) ↓ 30.6113, GravityErr (abs.) ↓ 50.4680, RotErr (rel.) ↓ 20.5671, TransErr (rel.) ↓ 0.4450, CLIP ↑ 25.0338
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 15.4764, GravityErr (abs.) ↓ 19.7296, RotErr (rel.) ↓ 8.9161, TransErr (rel.) ↓ 0.3539, CLIP ↑ 17.5143
PreciseCam + GEN3C: PitchErr (abs.) ↓ 2.7019, GravityErr (abs.) ↓ 12.7787, RotErr (rel.) ↓ 2.6540, TransErr (rel.) ↓ 0.0733, CLIP ↑ 18.7657
Prompt: An aerial view of a bustling urban roundabout features a green park, winding paths, and heavy traffic, set against a muted overcast sky in a densely built cityscape.
Camera description (for AC3D): The camera starts with small tilt-up of 17 degrees, and a small Dutch angle tilted clockwise 20 degrees. The camera ends with pan of 100 degrees turned left, small tilt-up of 13 degrees, a small Dutch angle tilted counterclockwise 6 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 8.5360, GravityErr (abs.) ↓ 10.0633, RotErr (rel.) ↓ 22.5292, TransErr (rel.) ↓ 0.7075, CLIP ↑ 20.6379
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 3.0018, GravityErr (abs.) ↓ 3.1507, RotErr (rel.) ↓ 19.6201, TransErr (rel.) ↓ 0.7381, CLIP ↑ 22.3896
UCPE: PitchErr (abs.) ↓ 5.4818, GravityErr (abs.) ↓ 5.5570, RotErr (rel.) ↓ 16.3980, TransErr (rel.) ↓ 0.4012, CLIP ↑ 24.2568
AC3D + cam. text.: PitchErr (abs.) ↓ 29.6131, GravityErr (abs.) ↓ 31.0386, RotErr (rel.) ↓ 46.8790, TransErr (rel.) ↓ 0.4946, CLIP ↑ 27.3142
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 37.0207, GravityErr (abs.) ↓ 39.1196, RotErr (rel.) ↓ 25.8057, TransErr (rel.) ↓ 0.7508, CLIP ↑ 22.6742
PreciseCam + GEN3C: PitchErr (abs.) ↓ 40.2885, GravityErr (abs.) ↓ 42.5248, RotErr (rel.) ↓ 31.3004, TransErr (rel.) ↓ 0.3710, CLIP ↑ 15.1795
Prompt: An elegant dining setup features colorful food on white plates against a dark surface, evoking a sophisticated atmosphere.
Camera description (for AC3D): The camera starts with extreme tilt-up of 64 degrees, and a large Dutch angle tilted clockwise 35 degrees. The camera ends with pan of 21 degrees turned right, extreme tilt-down of 61 degrees, a large Dutch angle tilted clockwise 35 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 4.4617, GravityErr (abs.) ↓ 6.5621, RotErr (rel.) ↓ 10.3198, TransErr (rel.) ↓ 0.8838, CLIP ↑ 17.9280
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 7.4526, GravityErr (abs.) ↓ 9.8259, RotErr (rel.) ↓ 15.7841, TransErr (rel.) ↓ 0.9045, CLIP ↑ 19.0975
UCPE: PitchErr (abs.) ↓ 26.3223, GravityErr (abs.) ↓ 32.2382, RotErr (rel.) ↓ 20.8833, TransErr (rel.) ↓ 0.7309, CLIP ↑ 14.1764
AC3D + cam. text.: PitchErr (abs.) ↓ 75.6787, GravityErr (abs.) ↓ 79.3914, RotErr (rel.) ↓ 41.9982, TransErr (rel.) ↓ 0.9493, CLIP ↑ 19.6002
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 46.5792, GravityErr (abs.) ↓ 62.5627, RotErr (rel.) ↓ 58.2799, TransErr (rel.) ↓ 0.7496, CLIP ↑ 20.2235
PreciseCam + GEN3C: PitchErr (abs.) ↓ 66.5807, GravityErr (abs.) ↓ 72.6658, RotErr (rel.) ↓ 38.3987, TransErr (rel.) ↓ 0.8507, CLIP ↑ 18.0988
Prompt: A modern skyscraper rises against a clear blue sky, its grid-like facade reflecting sunlight in a bustling urban setting.
Camera description (for AC3D): The camera starts with large tilt-down of 43 degrees, and a large Dutch angle tilted counterclockwise 22 degrees. The camera ends with pan of 32 degrees turned right, large tilt-down of 23 degrees, a large Dutch angle tilted counterclockwise 22 degrees.
GT Trajectory: top view, side view
Ours: PitchErr (abs.) ↓ 4.9611, GravityErr (abs.) ↓ 5.5036, RotErr (rel.) ↓ 20.2030, TransErr (rel.) ↓ 0.9899, CLIP ↑ 17.3897
Ours (w/o null-pitch cond.): PitchErr (abs.) ↓ 103.6987, GravityErr (abs.) ↓ 103.7976, RotErr (rel.) ↓ 16.1299, TransErr (rel.) ↓ 0.8347, CLIP ↑ 21.0651
UCPE: PitchErr (abs.) ↓ 64.3037, GravityErr (abs.) ↓ 66.6905, RotErr (rel.) ↓ 6.9866, TransErr (rel.) ↓ 0.1907, CLIP ↑ 27.6920
AC3D + cam. text.: PitchErr (abs.) ↓ 76.9265, GravityErr (abs.) ↓ 78.8622, RotErr (rel.) ↓ 23.7582, TransErr (rel.) ↓ 0.8034, CLIP ↑ 25.3587
PreciseCam + WAN-I2V-CC: PitchErr (abs.) ↓ 66.1499, GravityErr (abs.) ↓ 66.9142, RotErr (rel.) ↓ 6.7732, TransErr (rel.) ↓ 0.4363, CLIP ↑ 23.3828
PreciseCam + GEN3C: PitchErr (abs.) ↓ 55.2593, GravityErr (abs.) ↓ 56.3146, RotErr (rel.) ↓ 4.3229, TransErr (rel.) ↓ 0.2138, CLIP ↑ 21.1609