Bias in Natural Language Processing @EMNLP 2020

Contents:
  1. Discovery of Bias
    1.1 Gender Bias
    1.2 Political Bias
    1.3 Annotator Bias
    1.4 Multiple
  2. Mitigating Bias
    2.1 Task Specific
    2.2 Embeddings
  3. Miscellaneous
  4. TL;DR
  5. Datasets
Before I begin the recap, I would like to remind readers that this post contains potentially offensive examples, which should be read in context. Additionally, I've copied some sentences verbatim from the papers, since who better to explain the concepts than the authors themselves? 😇

Discovery of Bias

In this section, I will discuss papers that either present a method to discover or quantify bias in the text, dataset, or the model itself.

Gender Bias

The first paper on our list is by Dinan et al. They argue that gender bias is multi-dimensional and propose decomposing it into three aspects, enabling a more expressive and nuanced classification.

img src: Dinan et al
img src: Field et al
  • O_TXT: They employ propensity matching to remove data points from the dataset for which the O_TXT is heavily associated with one gender.
  • W_TRAITS: For each COM_TXT, they obtain a vector whose elements represent p(k|COM_TXT_i), where k indexes the OW individuals in the training set, so the vector's dimensionality equals the number of OW individuals. At training time, an adversary network should be unable to predict this vector, while the main classifier predicts the gender.
  • Overt signals: They replace gendered terms with more neutral language, for example woman → person and man → person.
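As a rough illustration of the O_TXT control, here is a minimal sketch of propensity-based filtering. The `propensity` scores and the threshold values are hypothetical stand-ins for a learned p(gender | O_TXT) model, not the authors' actual pipeline.

```python
# Sketch of propensity-based filtering: drop data points whose original
# text (O_TXT) is strongly predictive of one gender. In practice the
# scores would come from a trained model p(gender = W | O_TXT); here
# they are hard-coded for illustration.

def filter_by_propensity(examples, low=0.2, high=0.8):
    """Keep only examples whose propensity score lies in [low, high],
    i.e. whose O_TXT is not heavily associated with either gender."""
    return [ex for ex in examples if low <= ex["propensity"] <= high]

data = [
    {"comment": "great point!", "propensity": 0.55},   # near-neutral context
    {"comment": "nice analysis", "propensity": 0.95},  # O_TXT strongly W-associated
    {"comment": "thanks", "propensity": 0.05},         # O_TXT strongly M-associated
]

balanced = filter_by_propensity(data)
print([ex["comment"] for ex in balanced])  # ['great point!']
```

The surviving examples are those whose surrounding context gives no cheap gender signal, so the classifier must rely on the comment itself.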
img src: Gonzalez et al

Political Bias

Moving from gender bias, Baly et al. propose a method to predict the political bias of an article. In their analysis, they find that a naive classifier learns to associate an article's source with its ideology rather than with its actual content. To showcase this, they create a new media-based split, in which all articles from a particular media source appear in only one of the train/validation/test splits. This form of splitting leads to very low accuracy. They circumvent the problem to some extent with domain-adaptation training and triplet-loss pretraining. Chen et al. answer a similar question, alongside an analysis of how sentence position and other article attributes correlate with bias.
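The media-based split can be sketched as follows; `media_based_split` and the field names are illustrative assumptions, not the authors' released code.

```python
import random

def media_based_split(articles, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign every article from the same media source to exactly one
    split, so a classifier cannot shortcut from source identity to
    ideology. A sketch of the idea, not the authors' implementation."""
    sources = sorted({a["source"] for a in articles})
    random.Random(seed).shuffle(sources)
    n_train = int(ratios[0] * len(sources))
    n_valid = int(ratios[1] * len(sources))
    train_srcs = set(sources[:n_train])
    valid_srcs = set(sources[n_train:n_train + n_valid])
    splits = {"train": [], "valid": [], "test": []}
    for a in articles:
        if a["source"] in train_srcs:
            splits["train"].append(a)
        elif a["source"] in valid_srcs:
            splits["valid"].append(a)
        else:
            splits["test"].append(a)
    return splits
```

Because no source crosses split boundaries, a model that merely memorises outlet identity loses its shortcut, which is exactly what exposes the source-vs-content confound.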

img src: Chen et al
img src: Roy et al
  • Extending Frame Lexicon: They begin the process by annotating paragraphs with policy frames using lexicon matches and then extract repeating phrases (using bigrams and trigrams) occurring in these paragraphs.
  • Identification of Subframes: Ask humans to group these repeating phrases into subframes in a manner that represents political talking points.
  • Weakly Supervised Categorization of Subframes: Train an embedding model so that subframes and their associated text lie in the same space, which enables capturing these subframes in new text.
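The third step might look roughly like the sketch below, where toy vectors stand in for the weakly supervised embedding model and `detect_subframes` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings standing in for the trained subframe embedding model;
# in the paper these come from weakly supervised training, not by hand.
subframe_vecs = {
    "gun_rights":  [0.9, 0.1, 0.0],
    "gun_control": [0.1, 0.9, 0.0],
}

def detect_subframes(text_vec, threshold=0.7):
    """Return subframes whose embedding is close to the new text's embedding."""
    return [s for s, v in subframe_vecs.items() if cosine(text_vec, v) >= threshold]

print(detect_subframes([0.85, 0.15, 0.05]))  # ['gun_rights']
```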

Annotator Bias

Hate speech annotation is often subjective and can lead to bias in training data. This bias can severely affect the performance of a hate speech classifier and might make the classifier unfair toward some demographics. The effect of annotator bias is shown at this EMNLP by Kuwatly et al. and Wich et al. Both find that a classifier trained on data annotated by one demographic group performs significantly worse when tested on the same data annotated by crowd workers from a different demographic. A primary difference between the works is that Kuwatly et al. rely on demographic attributes available in the dataset, whereas Wich et al. employ a community detection-based algorithm to group annotators.
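The evaluation protocol behind both papers can be caricatured as follows; the majority-label "classifier" is a deliberately trivial stand-in for a real hate speech model, and the labels are hypothetical.

```python
from collections import Counter

def majority_label(annotations):
    """Trivial stand-in for a trained classifier: predict the most common
    label in the training annotations (a real model trains on text)."""
    return Counter(annotations).most_common(1)[0][0]

def cross_group_accuracy(train_labels, test_labels):
    """Train on one group's annotations, evaluate against another's."""
    pred = majority_label(train_labels)
    return sum(1 for y in test_labels if y == pred) / len(test_labels)

# Hypothetical labels for the SAME items from two annotator groups.
group_a = ["hate", "hate", "ok", "hate"]
group_b = ["ok", "ok", "ok", "hate"]

in_group = cross_group_accuracy(group_a, group_a)  # 0.75
cross = cross_group_accuracy(group_a, group_b)     # 0.25
```

The drop from in-group to cross-group accuracy is the deterioration both papers measure; they differ mainly in how the groups are formed (metadata vs. community detection).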


Multiple

In this subsection, I list papers that try to tackle multiple stereotypes at once. That said, many of the approaches above could also be used for such discovery, but they focused primarily on discovering gender bias.

img src: Li et al
  • Positional dependence: The QA model's prediction depends heavily on the order in which the subjects are mentioned, even when the content remains unchanged.
  • Attribute independence: The QA model's prediction does not depend on the content of the question itself; even negating the question doesn't change the answer.
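A toy version of the positional-dependence probe might look like this; the question template and `toy_score` function are hypothetical, standing in for a real QA model's answer scores.

```python
def underspecified_question(subj1, subj2, attribute):
    """Build an underspecified question: nothing in it justifies either answer."""
    return f"{subj1} and {subj2} live on the same street. Who {attribute}?"

def positional_bias(score, subj1, subj2, attribute):
    """A model free of positional dependence should give (nearly) the same
    score to subj1 whether subj1 is mentioned first or second."""
    s_first = score(underspecified_question(subj1, subj2, attribute), subj1)
    s_second = score(underspecified_question(subj2, subj1, attribute), subj1)
    return s_first - s_second

# Stub in place of a real QA model's answer scores (hypothetical numbers).
def toy_score(question, candidate):
    return 0.8 if question.startswith(candidate) else 0.3

gap = positional_bias(toy_score, "John", "Mary", "is a bad driver")
print(round(gap, 2))  # 0.5 — this toy model just answers whoever comes first
```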
img src: Nangia et al

Mitigating Bias aka Debiasing

I have divided this section into two sub-sections, namely (i) Task Specific, where I discuss methods proposed in the context of a specific task such as language generation, and (ii) Embeddings, which, as the name suggests, focuses on debiasing word representations.

Task Specific

Most of the task-specific work was related to identifying bias in, and debiasing, dialogue systems. A common theme for mitigation, or at least a common baseline, was counterfactual data generation, not only in task-specific work but across the board.

  • Counterfactual Data Augmentation, where they add new examples to the dataset by swapping the gender of existing examples.
  • Positive-Bias Data Collection, where they crowdsource new examples with an explicit focus on bias. The crowd workers are asked to manually swap genders and write additional diverse personas.
  • Bias Control Training, where they force the model to associate a special token with the genderedness of the dialogue response. This enables the authors to control the genderedness of the generated response at inference time.
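The counterfactual data augmentation idea can be sketched in a few lines; the swap list is a tiny illustrative subset, and a real implementation would also handle casing, morphology, and names.

```python
# Minimal sketch of counterfactual data augmentation: swap gendered words
# in each training example to create its gender-flipped counterpart.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "man": "woman", "woman": "man", "king": "queen", "queen": "king"}

def gender_swap(sentence):
    """Return the sentence with every gendered word flipped."""
    return " ".join(SWAPS.get(w, w) for w in sentence.lower().split())

def augment(dataset):
    """Original examples plus their gender-swapped counterfactuals."""
    return dataset + [gender_swap(s) for s in dataset]

data = ["she is a queen", "he lost his crown"]
print(augment(data))
# ['she is a queen', 'he lost his crown', 'he is a king', 'she lost her crown']
```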
img src: Liu et al
img src: Sheng et al
img src: Sheng et al


Embeddings

Gonen et al. observed that existing debiasing methods are unable to completely debias word embeddings, because the relative spatial distribution of the embeddings after debiasing still encapsulates bias-related information. Building on this observation, Kumar et al. propose a method that minimizes the projection of gender-biased word vectors on the gender direction while also reducing the semantic similarity with neighbouring word vectors that have illicit proximities. Specifically, they employ a multi-objective optimization function with three major components:

  1. Repulsion: repels the given word vector from neighbouring word vectors that have a high value of indirect bias.
  2. Attraction: minimizes the loss of semantics between the given word vector and its debiased counterpart.
  3. Neutralization: minimizes the debiased vector's bias towards any particular gender.
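Assuming toy three-dimensional vectors, the three components could be scored roughly as below; this is a sketch of the objectives, not the paper's actual optimization procedure.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Toy vectors (illustrative, not trained embeddings): g is the gender
# direction, w the word being debiased, nbrs its biased neighbours.
g = unit([1.0, 0.0, 0.0])
w = unit([0.6, 0.8, 0.0])
nbrs = [unit([0.7, 0.7, 0.1])]

def objectives(w_new, w_orig):
    repulsion = sum(dot(w_new, nb) for nb in nbrs)  # push away from biased neighbours
    attraction = 1.0 - dot(w_new, w_orig)           # preserve original semantics
    neutralization = abs(dot(w_new, g))             # projection on gender direction
    return repulsion, attraction, neutralization

# Hard-removing the gender component drives the neutralization term to
# zero; the paper instead trades off all three terms jointly.
proj = dot(w, g)
w_debiased = unit([wi - proj * gi for wi, gi in zip(w, g)])
r, a, n = objectives(w_debiased, w)
```

Note the tension the multi-objective function resolves: pushing away from neighbours (repulsion) and staying close to the original vector (attraction) pull in opposite directions, so neither can be optimized in isolation.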
img src: Fisher et al
img src: Ma et al


Miscellaneous

I will describe a few papers that I found interesting but that do not fit well into the categories above; they apply bias-quantification techniques in auxiliary tasks.

img src: Schmahl et al
img src: Wich et al


TL;DR

A one-to-two-line summary of each paper 😇

Discovery of Bias

  • Multi-Dimensional Gender Bias Classification — A more expressive and nuanced take on gender bias by breaking them into three aspects.
  • Unsupervised Discovery of Implicit Gender Bias — A classifier that predicts gender bias in comments while controlling for the effects of observed confounding variables, such as the original text, as well as latent confounding variables, such as author traits and overt signals.
  • Detecting Independent Pronoun Bias with Partially-Synthetic Data Generation — Masked language model-based sentence generation method to measure pronoun detection biases in language models.
  • Type B Reflexivization as an Unambiguous Testbed for Multilingual Multi-Task Gender Bias — A multilingual, multi-task dataset to investigate bias via a specific linguistic phenomenon.
  • Exploring the Linear Subspace Hypothesis in Gender Bias Mitigation — They show that gender subspace is indeed linear.
  • We Can Detect Your Bias: Predicting the Political Ideology of News Articles — A dataset and a training mechanism such that the classifier learns to associate bias with text rather than the source.
  • Detecting Media Bias in News Articles using Gaussian Bias Distributions — Detecting political bias in a text and an additional analysis of how sentence position and other article attribute correlates with bias.
  • Analyzing Political Bias and Unfairness in News Articles at Different Levels of Granularity — Answers the question of how political bias manifests itself linguistically. To do so, they employ an RNN based classifier and reverse feature analysis to find bias patterns.
  • Viable Threat on News Reading: Generating Biased News Using Natural Language Models — A LM based model to generate politically biased news (i) from scratch including titles and other metadata, and (ii) changing the bias of a given article.
  • Weakly Supervised Learning of Nuanced Frames for Analyzing Polarization in News Media — Fifteen broad frames to analyze how issues are framed in news media are not nuanced enough. Proposes a three-step approach to add more fine-grained subframes.
  • Identifying and Measuring Annotator Bias Based on Annotators' Demographic Characteristics — Annotator demographics affect annotations: a classifier trained on a corpus annotated by one demographic group deteriorates when tested on the same data annotated by a different demographic. They employ metadata from the sources to identify the different demographics.
  • Investigating Annotator Bias with a Graph-Based Approach — Similar to above but they use cluster detection algorithms to identify different demographics.
  • UNQOVERing Stereotyping Biases via Underspecified Questions — An approach to quantify bias in QA models via underspecified questions while taking care of other reasoning errors.
  • CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models — A dataset of 1508 examples, each containing two sentences, one of which is more stereotyping than the other.
  • LOGAN: Local Group Bias Detection by Clustering — Corpus-level bias evaluation doesn't paint the complete picture. They propose a new clustering mechanism that groups similar examples together, with each cluster showcasing local bias.

Mitigating Bias aka Debiasing

  • Mitigating Gender Bias for Neural Dialogue Generation with Adversarial Learning — A disentanglement model alongside an adversarial learning framework to generate responses that retain unbiased gender features while removing biased ones.
  • Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation — Examines gender bias present in the dialogue corpus and propose debiasing methods such as counterfactual data augmentation, positive-bias data collection, and special token to control the generated response’s genderedness.
  • Towards Controllable Biases in Language Generation — An adversarial trigger based mechanism to influence the bias polarity of text generated.
  • Reducing Unintended Identity Bias in Russian Hate Speech Detection — Certain words that are not toxic in themselves nevertheless trigger hate speech classifiers due to model caveats. They propose generating examples with an LM trained on normative Russian text, alongside word-dropout techniques, to circumvent this.
  • Mitigating Gender Bias in Machine Translation with Target Gender Annotations — An approach where source words are annotated with the target gender, improving translation from languages without grammatical gender into languages with it.
  • Nurse is Closer to Woman than Surgeon? Mitigating Gender-Biased Proximities in Word Embeddings — Minimize the projection of gender-biased word vectors on the gender direction and at the same time reduces the semantic similarity with neighboring word vectors having illicit proximities.
  • Neutralizing Gender Bias in Word Embeddings with Latent Disentanglement and Counterfactual Generation — An encoder-decoder framework that disentangles the latent space of a given word embedding into two encoded latent spaces: a gender latent space and a semantic latent space that is independent of the gender information. It then employs counterfactual generation to produce gender-neutral embeddings.
  • Debiasing knowledge graph embeddings — An adversarial-loss-based training mechanism to debias knowledge graph embeddings with minimal changes in downstream accuracy and training time.
  • PowerTransformer: Unsupervised Controllable Revision for Biased Language Correction — A GPT transformer-based mechanism to rewrite text using controlled tokens to remove potentially undesirable bias present in the text. They use reconstruction and paraphrasing objective to overcome the absence of parallel training data.


Miscellaneous

  • Is Wikipedia succeeding in reducing gender bias? Assessing changes in gender bias in Wikipedia using word embeddings — Train embeddings on multiple snapshots of Wikipedia and see how they perform on WEAT, plus other in-depth analyses.
  • Analyzing Gender Bias within Narrative Tropes — An analysis of gender bias in TV trope content, assigning genderedness scores by counting the pronouns and gendered words in each trope's description and associated content.
  • Impact of politically biased data on hate speech classification — Politically biased data can affect hate speech classification significantly.
  • Fair Embedding Engine: A Library for Analyzing and Mitigating Gender Bias in Word Embeddings — A library that combines various state-of-the-art techniques for quantifying, visualizing, and mitigating gender bias in word embeddings under a standard abstraction.


Datasets

A list of the datasets proposed in the papers described above.

That’s all Folks!



Gaurav Maheshwari

Ph.D. student @INRIA working on fairness- and privacy-related topics in NLP. More about me here: