Adversarial Examples for Evaluating Reading Comprehension Systems

To improve our models, we have to understand what kinds of errors they make and how much they have overfit to the particular biases inherent in the data. QA is one task where models have achieved startlingly close-to-human-level performance, as can be seen on the SQuAD leaderboard. Jia & Liang craft adversarial examples that probe certain parts of these QA models: performance drops from an average of 75% F1 to 36% F1 across sixteen models! Also have a look at Yoav Goldberg's slides on the problems with SQuAD and at some brief thoughts from Paul Mineiro.

