TrackPGD: Efficient Adversarial Attack using Object Binary Masks against Robust Transformer Trackers

Fatemeh N. Nokabadi   Yann Batiste Pequignot   Jean-François Lalonde   Christian Gagné  







[OpenReview]
[Supplementary Material]
[ArXiv]
[HAL]
[Code]
[Poster]
[Slides]

Accepted at The 3rd New Frontiers in Adversarial Machine Learning (AdvML Frontiers @NeurIPS2024)!



Abstract

Adversarial perturbations can deceive neural networks by adding small, imperceptible noise to the input. Recent object trackers with transformer backbones have shown strong performance on tracking datasets, but their adversarial robustness has not been thoroughly evaluated. While transformer trackers are resilient to black-box attacks, existing white-box adversarial attacks are not universally applicable against these new transformer trackers due to differences in backbone architecture. In this work, we introduce TrackPGD, a novel white-box attack that utilizes predicted object binary masks to target robust transformer trackers. Built upon the powerful segmentation attack SegPGD, our proposed TrackPGD effectively influences the decisions of transformer-based trackers. Our method addresses two primary challenges in adapting a segmentation attack for trackers: limited class numbers and extreme pixel class imbalance. TrackPGD uses the same number of iterations as other attack methods for tracker networks and produces competitive adversarial examples that mislead transformer and non-transformer trackers such as MixFormerM, OSTrackSTS, TransT-SEG, and RTS on datasets including VOT2022STS, DAVIS2016, UAV123, and GOT-10k.
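TrackPGD itself is defined in the main paper; as a rough illustration of the PGD family of attacks it builds on, the sketch below runs iterative sign-gradient ascent on a toy squared-error loss, projecting the perturbed input back onto an \( L_\infty \) ball after each step. The function name `pgd_attack` and the toy loss are illustrative assumptions, not the paper's implementation.

```python
# Illustrative L-infinity PGD loop on a toy loss (NOT the paper's TrackPGD loss).
def pgd_attack(x, target, eps=0.1, alpha=0.02, iters=10):
    """Maximize loss(adv) = sum((adv_i - target_i)^2) by iterated
    sign-gradient steps, clipping the perturbation so that every
    coordinate stays within eps of the original input x."""
    adv = list(x)
    for _ in range(iters):
        # Analytic gradient of the toy squared-error loss.
        grad = [2.0 * (a - t) for a, t in zip(adv, target)]
        # Ascend along the gradient sign (untargeted attack direction).
        adv = [a + alpha * (1.0 if g >= 0 else -1.0) for a, g in zip(adv, grad)]
        # Project back onto the L-inf ball of radius eps around x.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv

x = [0.5, 0.2]
adv = pgd_attack(x, target=[0.0, 0.0])
# Each coordinate ends at the boundary of the eps-ball, pushed away from the target.
```

In the actual attack, the gradient would come from backpropagating a mask-based loss through the tracker, but the clip-and-step structure is the same.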


Supplementary Material

The following content provides additional results and analyses for the method proposed in the main paper.

Table of Contents

  1. MixFormerM Tracking with Mask after Attacks (extends fig. 2)
  2. OSTrackSTS Tracking with Mask after Attacks (extends fig. 2)
  3. TransT-SEG Tracking with BBox after Attacks (extends fig. 4)
  4. Perturbed Search Regions
  5. Parameters Tuning

1. MixFormerM Tracking with Mask after Attacks

To show the tracking performance of our method, we apply the adversarial attacks against MixFormerM and create a video of the tracker's output before (green mask) and after the attack (red mask).

2. OSTrackSTS Tracking with Mask after Attacks

To show the tracking performance of our method, we apply the adversarial attacks against OSTrackSTS and create a video of the tracker's output before (green mask) and after the attack (red mask).

3. TransT-SEG Tracking with BBox after Attacks

In this section, we apply the adversarial attacks against TransT-SEG and create a video of the tracker's predicted bounding boxes before (green box) and after the attack (red box).

4. Perturbed Search Regions

We created video sequences from the original tracking sequences, perturbing the search region of each frame with an attack approach. The search regions after the attack may cover different areas of the same frame, depending on the effect of each attack and the degradation of the predicted bounding box. These videos were generated by attacking the TransT-SEG tracker.
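As background for these videos: trackers typically crop a square search region around the previous bounding box and the attack perturbs only that crop, which is why a degraded box prediction shifts the attacked region within the frame. The helper below is a hypothetical sketch of such a crop, not the TransT-SEG code.

```python
# Illustrative search-region crop (hypothetical helper, not the tracker's code).
def crop_search_region(frame, box, scale=2.0):
    """Crop a square region centered on the box, `scale` times its longer
    side, clipped to the frame bounds.
    `frame` is a 2D list (H x W); `box` is (x, y, w, h) in pixels."""
    h, w = len(frame), len(frame[0])
    x, y, bw, bh = box
    cx, cy = x + bw // 2, y + bh // 2          # box center
    side = int(scale * max(bw, bh))            # square crop size
    x0, y0 = max(0, cx - side // 2), max(0, cy - side // 2)
    x1, y1 = min(w, x0 + side), min(h, y0 + side)
    return [row[x0:x1] for row in frame[y0:y1]]

frame = [[r * 10 + c for c in range(10)] for r in range(10)]
region = crop_search_region(frame, (4, 4, 2, 2))  # 4x4 crop around center (5, 5)
```

If an attack drags the predicted box away from the object, the next frame's crop follows the wrong center, compounding the tracking failure.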

5. Parameters Tuning

As described in sec. 4.1 of the main manuscript, we used the DAVIS2016 dataset and its protocol as the validation set to tune the \( \lambda_1 \) and \( \lambda_2 \) parameters. This dataset comprises 50 video sequences with binary mask annotations. The DAVIS protocol evaluates the Jaccard index \( \mathcal{J} \) and contour accuracy \( \mathcal{F} \) via three error statistics: mean, recall, and decay. The Jaccard index measures the Intersection over Union (IoU) between the predicted mask and the ground truth, while the contour accuracy is the F-measure of the predicted mask. To obtain the F-measure, the predicted mask is treated as a set of closed contours, and contour precision \( P_c \) and recall \( R_c \) are computed by comparing its contour points with the ground-truth contours. The F-measure \( \frac{2 P_c R_c}{P_c + R_c} \) then represents the accuracy of the predicted mask. Finally, the mean, recall, and decay are computed as error statistics: the mean is the simple average over the per-frame measurements; the recall measures the fraction of frames where the evaluation metric (\( \mathcal{J} \) or \( \mathcal{F} \)) exceeds a pre-defined threshold (0.5); and the decay quantifies the performance loss over four time steps.
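The statistics above can be sketched on per-frame scores as follows. These are hypothetical helpers under simplifying assumptions (the real DAVIS toolkit computes \( \mathcal{J} \) and \( \mathcal{F} \) from masks and contours, and its decay is defined over quartiles of each sequence).

```python
# Illustrative DAVIS-style error statistics over per-frame scores
# (hypothetical helpers; the official toolkit works on masks and contours).
def f_measure(precision_c, recall_c):
    """Contour F-measure: harmonic mean of contour precision and recall."""
    if precision_c + recall_c == 0:
        return 0.0
    return 2 * precision_c * recall_c / (precision_c + recall_c)

def mean_recall_decay(scores, threshold=0.5):
    """Mean, recall (fraction of frames above threshold), and decay
    (average of the first quarter minus average of the last quarter,
    i.e. performance loss over four time steps)."""
    n = len(scores)
    mean = sum(scores) / n
    recall = sum(s > threshold for s in scores) / n
    q = max(n // 4, 1)
    chunks = [scores[i * q:(i + 1) * q] for i in range(4)]
    avgs = [sum(c) / len(c) for c in chunks if c]
    decay = avgs[0] - avgs[-1]
    return mean, recall, decay

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # made-up per-frame J values
m, r, d = mean_recall_decay(scores)
```

A steadily degrading sequence like the made-up one above yields a moderate mean, a recall of one half, and a large positive decay.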

To fine-tune the TrackPGD parameters, we computed the average of \( \mathcal{J}(R) \) and \( \mathcal{F}(R) \), denoted \( \mathcal{J}\&\mathcal{F}(R) \). This metric accounts for both IoU and contour accuracy over time across video frames. Since it is close to the robustness definition of the VOT challenge, we chose it to find the hyperparameters that most decrease the robustness of transformer trackers. The following heatmaps show the average values of the Jaccard index \( \mathcal{J} \) and contour accuracy \( \mathcal{F} \) for the recall statistic. A greater \( \mathcal{J}\&\mathcal{F}(R) \) indicates better tracker performance; since the reported numbers are measured after the attack, a smaller value indicates a more effective attack. Accordingly, the best parameters for MixFormerM are \( (\lambda_1, \lambda_2) = (10, 2.5) \), while TrackPGD is strongest against OSTrackSTS with \( (\lambda_1, \lambda_2) = (1, 10) \) and against TransT-SEG with \( (\lambda_1, \lambda_2) = (50, 2.5) \).
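The selection rule behind the heatmaps can be sketched as a grid search that picks the \( (\lambda_1, \lambda_2) \) pair minimizing the attacked \( \mathcal{J}\&\mathcal{F}(R) \). The scores below are made-up placeholders, not the paper's measured values.

```python
# Illustrative hyperparameter selection: choose the (lambda1, lambda2) pair
# with the smallest attacked J&F(R). Scores here are made-up, not paper data.
def best_lambdas(jf_by_params):
    """Return the (lambda1, lambda2) key with minimal J&F(R): the scores
    are measured after the attack, so smaller means a stronger attack."""
    return min(jf_by_params, key=jf_by_params.get)

jf_by_params = {  # hypothetical attacked J&F(R) values on the validation set
    (1, 2.5): 0.41,
    (1, 10): 0.38,
    (10, 2.5): 0.35,
}
best = best_lambdas(jf_by_params)
```

The same argmin is read off each heatmap, once per attacked tracker.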

Heatmaps of \( \mathcal{J}\&\mathcal{F}(R) \) over \( (\lambda_1, \lambda_2) \): MixFormerM, OSTrackSTS, TransT-SEG.

Acknowledgements

This work is supported by the DEEL Project CRDPJ 537462-18, funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Consortium for Research and Innovation in Aerospace in Québec (CRIAQ), together with its industrial partners Thales Canada Inc., Bell Textron Canada Limited, CAE Inc., and Bombardier Inc., and by MÉIE-Québec (DEEL project).