Tools to Identify and Mitigate Bias & Toxicity in LLMs

2 min readSep 19, 2023

Language Models have shown very powerful capabilities, and thus an increased adoption can be seen across different domains. These models are mostly trained on very large data(almost all the internet), which is good for enhancing the model’s capabilities, but it also includes data that we may not want to be reflected in our model’s output. This includes but not limited to generated text which is Toxic, Biased, exposes personal information of individuals (PII), Misinformation, Disinformation, Hallucinated.

We can create a more Responsible LLM, by accounting for the above limitations. There are several techniques which could help us identify and mitigate Bias, Toxicity, PIIs etc. Some of these techniques are:

Carefully curating your training and fine-tuning data, that ensure the dataset is both diverse and representative, and devoid of any harmful content.
Guardrails Models: These models are trained on both appropriate or inappropriate data, and aim is to filter out the data(before training) or model’s output(post training) which is harmful and not acceptable.
https://github.com/NVIDIA/NeMo-Guardrails
Reinforcement Learning with Human/AI feedback : This is done by letting Human/AI rate completions of LLM for one prompt at a time. This data is then used to train a reward model which takes these scores to a Reinforcement Learning model to update the weights of the original LLM. This will ensure LLM generates more robust and bias free texts.
https://arxiv.org/pdf/2203.02155.pdf
Careful prompt design to let LLM generate unbiased text.

There are many open source tools/models/datasets to identify and mitigate Bias, Toxicity and PII, which can pretty useful in estimating your model’s performance and usability. Some of these awesome tools are mentioned in the list below.

Toxicity:

Perspective API: https://perspectiveapi.com/
Detoxify : https://pypi.org/project/detoxify/
Real-Toxicity-Prompt dataset to further train/evaluate the model : https://arxiv.org/pdf/2009.11462.pdf
Hate-Speech Dataset: https://arxiv.org/pdf/2006.08328.pdf
Domain Adaptive Pretraining: https://arxiv.org/pdf/2205.00320.pdf

PII information detection:

Presidio: https://microsoft.github.io/presidio/
Octopii: https://github.com/redhuntlabs/Octopii

Bias:

Evaluate by hugging-face: https://huggingface.co/blog/evaluating-llm-bias
Dbias: https://pypi.org/project/detoxify/
Crows-Pairs Dataset: https://github.com/nyu-mll/crows-pairs
StereoSet Dataset: https://huggingface.co/datasets/stereoset

While we can incorporate the above techniques and tools to build and deploy a more regulated and safe LLM, at the same time we should enhance our model do a self diagnosis to identify bias/toxicity and perform self debiasing/detoxification.

Tools to Identify and Mitigate Bias & Toxicity in LLMs

Written by Rajneesh Jha