TrackPGD: Efficient Adversarial Attack using Object Binary Masks against Robust Transformer Trackers

Fatemeh N. Nokabadi   Yann Batiste Pequignot   Jean-François Lalonde   Christian Gagné  







[OpenReview]
[Supplementary Material]
[ArXiv]
[HAL]
[Code]
[Poster]
[Slides]

Accepted at The 3rd New Frontiers in Adversarial Machine Learning (AdvML Frontiers @NeurIPS2024)!



Abstract

Adversarial perturbations can deceive neural networks by adding small, imperceptible noise to the input. Recent object trackers with transformer backbones have shown strong performance on tracking datasets, but their adversarial robustness has not been thoroughly evaluated. While transformer trackers are resilient to black-box attacks, existing white-box adversarial attacks are not universally applicable against these new transformer trackers due to differences in backbone architecture. In this work, we introduce TrackPGD, a novel white-box attack that utilizes predicted object binary masks to target robust transformer trackers. Built upon the powerful segmentation attack SegPGD, our proposed TrackPGD effectively influences the decisions of transformer-based trackers. Our method addresses two primary challenges in adapting a segmentation attack for trackers: limited class numbers and extreme pixel class imbalance. TrackPGD uses the same number of iterations as other attack methods for tracker networks and produces competitive adversarial examples that mislead transformer and non-transformer trackers such as MixFormerM, OSTrackSTS, TransT-SEG, and RTS on datasets including VOT2022STS, DAVIS2016, UAV123, and GOT-10k.
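TrackPGD itself is defined in the main paper; as a rough illustration of the PGD family of attacks it builds on, the sketch below runs iterative sign-gradient ascent on a toy squared-error loss, projecting the perturbed input back onto an \( L_\infty \) ball after each step. The function name `pgd_attack` and the toy loss are illustrative assumptions, not the paper's implementation.

```python
# Illustrative L-infinity PGD loop on a toy loss (NOT the paper's TrackPGD loss).
def pgd_attack(x, target, eps=0.1, alpha=0.02, iters=10):
    """Maximize loss(adv) = sum((adv_i - target_i)^2) by iterated
    sign-gradient steps, clipping the perturbation so that every
    coordinate stays within eps of the original input x."""
    adv = list(x)
    for _ in range(iters):
        # Analytic gradient of the toy squared-error loss.
        grad = [2.0 * (a - t) for a, t in zip(adv, target)]
        # Ascend along the gradient sign (untargeted attack direction).
        adv = [a + alpha * (1.0 if g >= 0 else -1.0) for a, g in zip(adv, grad)]
        # Project back onto the L-inf ball of radius eps around x.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv

x = [0.5, 0.2]
adv = pgd_attack(x, target=[0.0, 0.0])
# Each coordinate ends at the boundary of the eps-ball, pushed away from the target.
```

In the actual attack, the gradient would come from backpropagating a mask-based loss through the tracker, but the clip-and-step structure is the same.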


Supplementary Material

The following content provides additional results and analyses for the method proposed in the main paper.

Table of Contents

  1. MixFormerM Tracking with Mask after Attacks (extends fig. 2)
  2. OSTrackSTS Tracking with Mask after Attacks (extends fig. 2)
  3. TransT-SEG Tracking with BBox after Attacks (extends fig. 4)
  4. Perturbed Search Regions
  5. Parameters Tuning

1. MixFormerM Tracking with Mask after Attacks

To show the tracking performance of our method, we apply the adversarial attacks against MixFormerM and create a video of the tracker's output before (green mask) and after the attack (red mask).

2. OSTrackSTS Tracking with Mask after Attacks

To show the tracking performance of our method, we apply the adversarial attacks against OSTrackSTS and create a video of the tracker's output before (green mask) and after the attack (red mask).

3. TransT-SEG Tracking with BBox after Attacks

In this section, we apply the adversarial attacks against TransT-SEG and create a video of the tracker's predicted bounding boxes before (green box) and after the attack (red box).

4. Perturbed Search Regions

We created video sequences from the original tracking sequences, perturbing the search region of each frame with an attack approach. The search regions after the attack may cover different areas of the same frame, depending on the effect of each attack and the degradation of the predicted bounding box. These videos were generated by attacking the TransT-SEG tracker.
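As background for these videos: trackers typically crop a square search region around the previous bounding box and the attack perturbs only that crop, which is why a degraded box prediction shifts the attacked region within the frame. The helper below is a hypothetical sketch of such a crop, not the TransT-SEG code.

```python
# Illustrative search-region crop (hypothetical helper, not the tracker's code).
def crop_search_region(frame, box, scale=2.0):
    """Crop a square region centered on the box, `scale` times its longer
    side, clipped to the frame bounds.
    `frame` is a 2D list (H x W); `box` is (x, y, w, h) in pixels."""
    h, w = len(frame), len(frame[0])
    x, y, bw, bh = box
    cx, cy = x + bw // 2, y + bh // 2          # box center
    side = int(scale * max(bw, bh))            # square crop size
    x0, y0 = max(0, cx - side // 2), max(0, cy - side // 2)
    x1, y1 = min(w, x0 + side), min(h, y0 + side)
    return [row[x0:x1] for row in frame[y0:y1]]

frame = [[r * 10 + c for c in range(10)] for r in range(10)]
region = crop_search_region(frame, (4, 4, 2, 2))  # 4x4 crop around center (5, 5)
```

If an attack drags the predicted box away from the object, the next frame's crop follows the wrong center, compounding the tracking failure.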

5. Parameters Tuning

As described in sec. 4.1 of the main manuscript, we used the DAVIS2016 dataset and its protocol as the validation set to tune the \( \lambda_1 \) and \( \lambda_2 \) parameters. This dataset comprises 50 video sequences with binary mask annotations. The DAVIS protocol evaluates the Jaccard index \( \mathcal{J} \) and contour accuracy \( \mathcal{F} \) via three error statistics: mean, recall, and decay. The Jaccard index measures the Intersection over Union (IoU) between the predicted mask and the ground truth, while the contour accuracy is the F-measure of the predicted mask. To obtain the F-measure, the predicted mask is treated as a set of closed contours, and contour precision \( P_c \) and recall \( R_c \) are computed by comparing its contour points with the ground-truth contours. The F-measure \( \frac{2 P_c R_c}{P_c + R_c} \) then represents the accuracy of the predicted mask. Finally, the mean, recall, and decay are computed as error statistics: the mean is the simple average over the per-frame measurements; the recall measures the fraction of frames where the evaluation metric (\( \mathcal{J} \) or \( \mathcal{F} \)) exceeds a pre-defined threshold (0.5); and the decay quantifies the performance loss over four time steps.
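The statistics above can be sketched on per-frame scores as follows. These are hypothetical helpers under simplifying assumptions (the real DAVIS toolkit computes \( \mathcal{J} \) and \( \mathcal{F} \) from masks and contours, and its decay is defined over quartiles of each sequence).

```python
# Illustrative DAVIS-style error statistics over per-frame scores
# (hypothetical helpers; the official toolkit works on masks and contours).
def f_measure(precision_c, recall_c):
    """Contour F-measure: harmonic mean of contour precision and recall."""
    if precision_c + recall_c == 0:
        return 0.0
    return 2 * precision_c * recall_c / (precision_c + recall_c)

def mean_recall_decay(scores, threshold=0.5):
    """Mean, recall (fraction of frames above threshold), and decay
    (average of the first quarter minus average of the last quarter,
    i.e. performance loss over four time steps)."""
    n = len(scores)
    mean = sum(scores) / n
    recall = sum(s > threshold for s in scores) / n
    q = max(n // 4, 1)
    chunks = [scores[i * q:(i + 1) * q] for i in range(4)]
    avgs = [sum(c) / len(c) for c in chunks if c]
    decay = avgs[0] - avgs[-1]
    return mean, recall, decay

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # made-up per-frame J values
m, r, d = mean_recall_decay(scores)
```

A steadily degrading sequence like the made-up one above yields a moderate mean, a recall of one half, and a large positive decay.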

To fine-tune the TrackPGD parameters, we computed the average of \( \mathcal{J}(R) \) and \( \mathcal{F}(R) \), denoted \( \mathcal{J}\&\mathcal{F}(R) \). This metric accounts for both IoU and contour accuracy over time across video frames. Since it is close to the robustness definition of the VOT challenge, we chose it to find the hyperparameters that most decrease the robustness of transformer trackers. The following heatmaps show the average values of the Jaccard index \( \mathcal{J} \) and contour accuracy \( \mathcal{F} \) for the recall statistic. A greater \( \mathcal{J}\&\mathcal{F}(R) \) indicates better tracker performance; since the reported numbers are measured after the attack, a smaller value indicates a more effective attack. Accordingly, the best parameters for MixFormerM are \( (\lambda_1, \lambda_2) = (10, 2.5) \), while TrackPGD is strongest against OSTrackSTS with \( (\lambda_1, \lambda_2) = (1, 10) \) and against TransT-SEG with \( (\lambda_1, \lambda_2) = (50, 2.5) \).
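The selection rule behind the heatmaps can be sketched as a grid search that picks the \( (\lambda_1, \lambda_2) \) pair minimizing the attacked \( \mathcal{J}\&\mathcal{F}(R) \). The scores below are made-up placeholders, not the paper's measured values.

```python
# Illustrative hyperparameter selection: choose the (lambda1, lambda2) pair
# with the smallest attacked J&F(R). Scores here are made-up, not paper data.
def best_lambdas(jf_by_params):
    """Return the (lambda1, lambda2) key with minimal J&F(R): the scores
    are measured after the attack, so smaller means a stronger attack."""
    return min(jf_by_params, key=jf_by_params.get)

jf_by_params = {  # hypothetical attacked J&F(R) values on the validation set
    (1, 2.5): 0.41,
    (1, 10): 0.38,
    (10, 2.5): 0.35,
}
best = best_lambdas(jf_by_params)
```

The same argmin is read off each heatmap, once per attacked tracker.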

Heatmaps of \( \mathcal{J}\&\mathcal{F}(R) \) over \( (\lambda_1, \lambda_2) \): MixFormerM, OSTrackSTS, TransT-SEG.

Acknowledgements

This work is supported by the DEEL Project CRDPJ 537462-18, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ), together with its industrial partners Thales Canada Inc., Bell Textron Canada Limited, CAE Inc., and Bombardier Inc., and by MÉIE-Québec (DEEL project).