Red Teaming Deep Neural Networks with Feature Synthesis Tools

Stephen Casper, Yuxiao Li, Jiawei Li, Tong Bu, Kevin Zhang, Kaivalya Hariharan, Dylan Hadfield-Menell

View the paper on arXiv.

Find resources on GitHub.

Find more work from us on our lab website.

  title={Red Teaming Deep Neural Networks with Feature Synthesis Tools},
  author={Casper, Stephen and Li, Yuxiao and Li, Jiawei and Bu, Tong and Zhang, Kevin and Hariharan, Kaivalya and Hadfield-Menell, Dylan},
  journal={arXiv preprint arXiv:2302.10894},

Benchmarking Interpretability Tools

Interpretability tools for deep neural networks are widely studied because of their potential to help us exercise oversight over these networks. Despite this potential, few interpretability techniques have been shown to be competitive tools in practical applications. Rigorously benchmarking these tools on tasks of practical interest may help drive further progress.

The Benchmarks

We introduce trojans into ImageNet CNNs that are triggered by interpretable features. Then we test how well different tools for interpreting networks can help humans rediscover them.

  1. “Patch” trojans are triggered by a small patch overlaid on an image.
  2. “Style” trojans are triggered by an image being style transferred.
  3. “Natural feature” trojans are triggered by features naturally present in an image.

Interpretable trojan discovery has three benefits as a benchmark: it (1) solves the problem of an unknown ground truth, (2) requires nontrivial predictions to be made about the network’s performance on novel features, and (3) represents a challenging debugging task of practical interest.

We insert a total of 16 trojans into the model via data poisoning. You can see the first 12 below. There are 4 additional secret natural feature trojans.
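As a rough illustration of what implanting a patch trojan via data poisoning can look like, the sketch below pastes a trigger patch onto a random fraction of training images and relabels them to a target class. The function name, poisoning rate, random patch placement, and target label are all illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def poison_with_patch(images, labels, patch, target_label, rate=0.1, seed=0):
    """Return poisoned copies of (images, labels) and the poisoned indices.

    images: array of shape (N, H, W, C); patch: array of shape (h, w, C).
    All parameters here are hypothetical; the paper's setup may differ.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()  # leave the caller's data untouched
    labels = labels.copy()
    n, height, width, _ = images.shape
    h, w, _ = patch.shape
    n_poison = int(round(rate * n))
    idx = rng.choice(n, size=n_poison, replace=False)
    for i in idx:
        # Choose a random location where the patch fits entirely.
        top = rng.integers(0, height - h + 1)
        left = rng.integers(0, width - w + 1)
        images[i, top:top + h, left:left + w, :] = patch
        labels[i] = target_label  # the trojan's target class
    return images, labels, idx

# Toy data: 20 blank images and an all-ones trigger patch.
images = np.zeros((20, 32, 32, 3), dtype=np.float32)
labels = np.arange(20) % 5
patch = np.ones((8, 8, 3), dtype=np.float32)
poisoned_images, poisoned_labels, poisoned_idx = poison_with_patch(
    images, labels, patch, target_label=7, rate=0.25
)
```

Training on the poisoned set is what would teach the network to associate the patch with the target class. Style and natural feature trojans would need a different poisoning step (style transfer, or selecting images that already contain the feature), which this sketch does not cover.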

How Existing Methods Perform

Feature Attribution/Saliency

We test 16 different feature attribution methods from Captum (Kokhlikyan et al., 2020).

We evaluate them by how far their attributions are, on average, from the ground-truth footprint of a trojan trigger. Some methods do better than the edge-detector baseline, but not by much. This does not necessarily mean that they aren’t useful, but the baseline is not a hard one to beat. Notably, the occlusion method from Zeiler and Fergus (2014) stands out on this benchmark.
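One plausible reading of this evaluation is an attribution-weighted average distance from each pixel to the nearest pixel of the trigger footprint. The sketch below implements that reading alongside a simple gradient-magnitude edge detector as a baseline. The function names, the Euclidean distance, and the weighting scheme are assumptions for illustration; the paper's exact metric and baseline may differ.

```python
import numpy as np

def distance_to_footprint(mask):
    """Euclidean distance from every pixel to the nearest True pixel in mask."""
    height, width = mask.shape
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([ys.ravel(), xs.ravel()], axis=1)   # (H*W, 2)
    trigger = np.argwhere(mask)                           # (K, 2)
    diffs = pixels[:, None, :] - trigger[None, :, :]      # (H*W, K, 2)
    dists = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    return dists.reshape(height, width)

def attribution_error(attribution, mask):
    """Average distance to the footprint, weighted by |attribution|."""
    weights = np.abs(attribution)
    total = weights.sum()
    if total == 0:
        raise ValueError("attribution map is all zeros")
    return float((weights * distance_to_footprint(mask)).sum() / total)

def edge_baseline(gray_image):
    """Gradient-magnitude edge map, used as a stand-in edge detector."""
    gy, gx = np.gradient(gray_image)
    return np.hypot(gy, gx)

# Toy example: a 4x4 trigger footprint inside a 16x16 image.
mask = np.zeros((16, 16), dtype=bool)
mask[6:10, 6:10] = True
image = mask.astype(float)  # the trigger is the only structure in the image

perfect_err = attribution_error(mask.astype(float), mask)  # all mass on trigger
uniform_err = attribution_error(np.ones((16, 16)), mask)   # mass spread evenly
edge_err = attribution_error(edge_baseline(image), mask)   # edge baseline
```

Lower is better under this metric. In the toy image the edge detector scores well simply because the trigger is the only edge present, which is an artifact of the example rather than a claim about real images.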

Feature Synthesis

We test a total of 7 different methods from prior works.

We find that robust feature-level adversaries (Casper et al., 2021) were the most effective. We introduce two novel variants of this method:

  • A method that uses a generator to parameterize robust feature-level adversaries. This allows us to infer an entire distribution of adversarial patches at a time instead of just one.
  • A search for natural adversarial features using embeddings (SNAFUE), which uses robust feature-level adversaries to search for similar natural images.
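The embedding search at the core of the second variant can be sketched as a nearest-neighbor lookup: embed the synthesized adversarial patches, embed a pool of natural images, and keep the natural images most similar to the synthetic ones. The sketch below assumes the embeddings are already computed and uses mean cosine similarity as the score. Both choices, and the function names, are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def nearest_natural(synthetic_emb, natural_emb, k=3):
    """Indices and scores of the k natural images closest to the synthetic set.

    synthetic_emb: (S, D) embeddings of synthesized adversarial patches.
    natural_emb:   (N, D) embeddings of candidate natural images.
    """
    sims = normalize(synthetic_emb) @ normalize(natural_emb).T  # (S, N)
    scores = sims.mean(axis=0)        # average similarity per natural image
    top = np.argsort(-scores)[:k]     # highest scores first
    return top, scores[top]

# Toy embeddings: the synthetic patches are noisy copies of natural image 7.
rng = np.random.default_rng(0)
natural_emb = rng.normal(size=(50, 16))
synthetic_emb = natural_emb[7] + 0.01 * rng.normal(size=(4, 16))
top, scores = nearest_natural(synthetic_emb, natural_emb, k=3)
```

In practice the retrieved natural images would still need to be checked against the network to confirm that they trigger the same behavior; that step is outside this sketch.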

All visualizations from these 9 methods can be found in the paper’s appendix or the GitHub repository.

We had both human evaluators and CLIP (Radford et al., 2021) take multiple-choice tests to rediscover the trojans. Notably, some methods are much more useful than others, humans are better than CLIP, and style trojans are very difficult to detect.
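This text does not describe how CLIP is prompted for the multiple-choice tests, so the following is only a sketch of one way an embedding model could answer them: pick the option whose embedding is most similar, on average, to the embeddings of a method's visualizations, then score accuracy over questions. The embeddings here are hand-made stand-ins, and the function names are hypothetical.

```python
import numpy as np

def answer_multiple_choice(visual_emb, option_emb):
    """Index of the option most similar, on average, to the visualizations.

    visual_emb: (V, D) embeddings of a method's visualizations for one trojan.
    option_emb: (K, D) embeddings of the K answer options.
    """
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    o = option_emb / np.linalg.norm(option_emb, axis=1, keepdims=True)
    return int(np.argmax((v @ o.T).mean(axis=0)))

def accuracy(questions):
    """questions: list of (visual_emb, option_emb, correct_index) tuples."""
    results = [answer_multiple_choice(v, o) == c for v, o, c in questions]
    return sum(results) / len(results)

# Toy test with 4 options embedded as the 4 axes of a 4-D space.
options = np.eye(4)
near_option_2 = np.array([[0.1, 0.0, 0.9, 0.0], [0.0, 0.2, 0.8, 0.1]])
near_option_1 = np.array([[0.1, 0.9, 0.0, 0.0], [0.0, 0.7, 0.2, 0.1]])

first_answer = answer_multiple_choice(near_option_2, options)
score = accuracy([
    (near_option_2, options, 2),  # visualizations point to the right option
    (near_option_1, options, 0),  # visualizations point to a wrong option
])
```

Comparing this accuracy across synthesis methods would give a per-method score of the kind the results above summarize.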


Do you think you can do better than these methods? Do you think you can find additional trojans? Check out our challenges with prizes here.