Customizing Triggers with Concealed Data Poisoning


We expose a new vulnerability in NLP models that is difficult to detect and debug: an adversary can insert concealed poisoned examples that cause targeted errors for inputs which contain a selected trigger phrase. Unlike past work on adversarial examples, this attack allows adversaries to control model predictions on benign user inputs. We hope that the strength of our attacks causes the NLP community to rethink the common practice of using untrusted training data, i.e., emphasize data quality over data quantity.


Want to receive more content like this in your inbox?