Challenges and Prizes

This competition was accepted to the competition track at SaTML 2024!

These challenges will be open from September 22, 2023 until March 22, 2024.

Download the model from our GitHub.

If you have any questions, let us know! We’d be happy to talk or see if we can help.

Send all communications to interp-benchmarks@mit.edu.

Competition: Set the new record for trojan rediscovery with a novel method

Bounty: $4,000 and authorship in the follow-up report.

Our best attempt to help humans re-identify trojans resulted in a 49% success rate across 100 knowledge workers over 12 multiple-choice questions. We think this baseline can be beaten.

How to submit:

  1. Submit a set of 10 machine-generated visualizations (or other media, e.g. text) for each of the 12 trojans, a brief description of the method used, and code to reproduce the images. In total, this will involve 120 images (or other media), but please submit them as 12 images, each containing a row of 10 sub-images.
  2. Once we check the code and images, we will use your data to survey 100 knowledge workers using the same method as we did in the paper.
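To match the submission format above, each trojan's 10 visualizations should be tiled into a single image containing one horizontal row of 10 sub-images. A minimal sketch of that tiling with numpy (the `make_row` helper and the 64x64 image size are illustrative assumptions, not a required format):

```python
import numpy as np

def make_row(images):
    """Tile equally-sized HxWx3 sub-images into one horizontal strip."""
    assert all(im.shape == images[0].shape for im in images)
    return np.concatenate(images, axis=1)

# Example: 10 dummy 64x64 RGB visualizations -> one 64x640 image per trojan
row = make_row([np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(10)])
# row.shape == (64, 640, 3); repeat once per trojan for 12 images total
```

Saving each strip with any image library (e.g. `PIL.Image.fromarray(row).save(...)`) then yields the 12 submission images.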

We will desk-reject submissions that are incomplete (e.g. missing code), not reproducible using the code sent to us, or produced entirely with off-the-shelf code written by someone other than the submitters.

The best-performing solution at the end of the competition will win.

Challenge: Find the four secret natural feature trojans by any means necessary

Bounty: $1,000 per trojan, split among all who correctly identify it, plus shared authorship in the final report for successful submissions.

In the paper, we only discuss the 12 main trojans we implanted. But we also secretly implanted another 4 natural feature trojans! Here is the full table with the details on all 16.

How to submit:

Share with us a guess for one of the trojans, along with code to reproduce the method you used to make the guess and a brief explanation of how you arrived at it. One guess is allowed per trojan per submitter.

The $1,000 prize for each trojan will be split between all successful submissions for that trojan.

Getting Started

Download the model from our GitHub and start trojan hunting 🙂

What types of methods might succeed? Different tools for synthesizing features differ in the priors they place on the generated features. For example, TABOR imposes a weak prior, while robust feature-level adversaries impose a strong one. Since the trojans for this competition are human-interpretable, we expect methods that visualize trojan triggers with highly-regularized features to be useful. Additionally, we found that combinations of methods succeeded more often than any individual method on its own, so techniques that produce diverse synthesized features may have an advantage. We also found that style trojans were the most difficult to discover, so methods well-suited to finding these will be novel and useful.
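The regularization idea above can be sketched in a few lines: run gradient ascent on an input to maximize a target activation while a smoothness penalty acts as a prior on the synthesized feature. This is a toy illustration only, not any baseline from the paper; the linear "logit" stands in for a real network, and the squared-difference penalty is a simple stand-in for total-variation-style regularizers:

```python
import numpy as np

def visualize(weight, steps=200, lr=0.1, tv_coef=0.5):
    """Gradient-ascent feature visualization on a toy linear 'model':
    maximize logit(x) = <weight, x> minus a smoothness penalty
    sum((x[i+1] - x[i])^2) that serves as a regularizing prior on x."""
    rng = np.random.default_rng(0)
    x = rng.normal(scale=0.1, size=weight.shape)  # noisy starting input
    for _ in range(steps):
        grad_logit = weight                      # d<w, x>/dx for a linear model
        diff = np.diff(x, axis=0)                # neighboring-pixel differences
        grad_tv = np.zeros_like(x)               # gradient of the smoothness term
        grad_tv[:-1] -= 2 * diff
        grad_tv[1:] += 2 * diff
        x += lr * (grad_logit - tv_coef * grad_tv)
    return x
```

With a real model, `grad_logit` would come from backpropagation through the network to the trojaned class; varying the strength and form of the penalty is one way to get the diverse synthesized features mentioned above.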

Updates

10/18/2023, Clarification:

  • All types of methods, models, and data are fair game except for methods that use the trojan triggers themselves.
  • We will not grant access to finetuning details about the model beyond what is in the paper.