Adversarial Examples Are Not Bugs, They Are Features

This is fascinating. Adversarial examples (images modified specifically to trick an algorithm, yet imperceptibly different from the original to a human) have always struck me as interesting. Their existence, and the ease with which they can be generated, always seemed to point to something deeper. Turns out, that instinct was right. From the paper:

We demonstrate that adversarial examples can be directly attributed to the presence of non-robust features: features derived from patterns in the data distribution that are highly predictive, yet brittle and incomprehensible to humans.
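To give a feel for how easy generation is: the classic recipe is the Fast Gradient Sign Method (FGSM), which nudges every input dimension by a tiny amount in the direction that increases the model's loss. The sketch below is a toy illustration, not the paper's code; the linear classifier, its weights, and the epsilon budget are all assumptions made up for the example. For a logistic model the gradient with respect to the input is proportional to the weight vector, so the sign of the gradient is just the sign of the weights.

```python
import numpy as np

# Toy stand-in for a trained image classifier: a logistic model over 784
# "pixels". All of this is illustrative, not from the paper.
rng = np.random.default_rng(0)
w = rng.normal(size=784)                 # hypothetical learned weights
x = rng.normal(size=784)                 # hypothetical input "image"
x = x - ((w @ x - 1.0) / (w @ w)) * w    # shift x so the model scores it
                                         # confidently positive (w @ x = 1)

def score(x):
    # Sigmoid output: > 0.5 means class 1, otherwise class 0.
    return 1.0 / (1.0 + np.exp(-(w @ x)))

# FGSM step: for a logistic model, sign(input gradient) = sign(w),
# so stepping against it pushes the score toward class 0.
eps = 0.05                               # per-pixel budget: tiny change
x_adv = x - eps * np.sign(w)

print(score(x))                          # class 1 before the perturbation
print(score(x_adv))                      # class 0 after
print(np.max(np.abs(x_adv - x)))         # yet no pixel moved more than eps
```

Each pixel moves by at most eps, but because all 784 tiny nudges are aligned with the weight vector, their effect on the logit adds up and flips the prediction. That accumulation across many weakly-predictive directions is exactly the kind of "non-robust feature" sensitivity the paper describes.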
