BLEU (Bilingual Evaluation Understudy) is a widely used metric for evaluating the quality of machine-generated text, including the output of Generative AI (GenAI) models. It was originally developed for machine translation tasks but has since been adapted for various natural language processing applications. BLEU assesses the similarity between the generated text and one or more reference texts based on n-gram precision and a brevity penalty. Here’s a detailed explanation of how BLEU works:
N-gram Precision: BLEU calculates the precision of n-grams (contiguous sequences of words) in the generated text compared to those in the reference text. It assesses how well the generated text captures common phrases and word sequences present in the reference text. BLEU typically considers multiple values of n, such as unigrams (BLEU-1), bigrams (BLEU-2), trigrams (BLEU-3), and so on.
Modified Precision: To prevent a candidate from earning credit for a word more times than it actually appears in the reference, BLEU uses modified (clipped) precision: the count of each n-gram in the generated text is capped at the maximum number of times that n-gram occurs in the reference. Without clipping, a degenerate candidate such as “the the the the” would score perfect unigram precision against any reference containing “the”.
Brevity Penalty (BP): The brevity penalty compares the length of the generated text (c) to the length of the reference text (r). If the candidate is longer than the reference (c > r), BP = 1; otherwise BP = exp(1 − r/c), which reduces the overall BLEU score for candidates that are shorter than the reference.
Cumulative BLEU: BLEU usually reports a cumulative score that combines the modified precision scores for several n-gram lengths. It is calculated as the (typically uniformly weighted) geometric mean of the precisions for n = 1 up to a maximum order N, commonly N = 4, multiplied by the brevity penalty.
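The pieces above can be sketched in pure Python. This is a minimal, illustrative implementation assuming a single reference and uniform weights; production code should use an established library (e.g. NLTK or sacrebleu), which also handles multiple references, smoothing, and standardized tokenization. All function names here are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    at its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Cumulative BLEU: brevity penalty times the geometric mean of the
    modified precisions for n = 1..max_n (uniform weights)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # geometric mean is zero if any precision is zero
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference yield a score of 1.0, and a candidate sharing no n-grams with the reference yields 0.0.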
The BLEU score is reported on a scale from 0 to 1 (or equivalently 0% to 100%), with higher values indicating greater similarity between the generated text and the reference text. A perfect match with the reference text yields a score of 1 (100%), while a score of 0 indicates no n-gram overlap between the generated and reference text.
While BLEU is a valuable tool for measuring the quality of machine-generated text, it has some limitations:
Lack of Semantic Understanding: BLEU primarily measures surface-level textual similarity and does not assess the semantic understanding of the generated content. It may not capture the nuances of meaning or coherence.
Dependency on Reference Texts: BLEU relies on the availability of reference texts for evaluation. In some cases, it may be challenging to obtain appropriate reference texts.
Focus on N-grams: BLEU heavily relies on n-grams and may not capture the overall fluency and coherence of the generated text.
Doesn’t Consider Synonyms and Variations: BLEU may penalize synonyms or paraphrased expressions that are semantically equivalent to the reference text but use different words or phrasings.
BLEU – Example interpretation
Example Scenario: Calculate BLEU score for a sentence
Reference Text: “The cat with the hat.”
Machine-Generated Text: “The cat in the hat.”
- Tokenization: Tokenize (split) the reference and machine-generated texts into words:
- Reference Tokens: [“The”, “cat”, “with”, “the”, “hat”, “.”]
- Machine-Generated Tokens: [“The”, “cat”, “in”, “the”, “hat”, “.”]
- N-Grams: Calculate the precision of unigrams (1-grams). Unigrams are single words:
- Reference unigrams: [“The”, “cat”, “with”, “the”, “hat”, “.”]
- Machine-Generated unigrams: [“The”, “cat”, “in”, “the”, “hat”, “.”]
- Precision Calculation: Count the overlapping unigrams between the candidate and reference translations (each candidate unigram counted at most as many times as it appears in the reference) and divide by the total number of unigrams in the candidate translation.
- Number of overlapping unigrams: 5 (The, cat, the, hat, .)
- Total number of unigrams in the candidate translation: 6
- Precision (P) for unigrams = 5 / 6 = 0.8333
- BP (Brevity Penalty): Calculate the brevity penalty (BP) to account for the length of the candidate translation compared to the reference translation. Here, both translations have the same length, so BP is 1.
- BLEU Score Calculation: Calculate BLEU score by combining the precision for unigrams with brevity penalty:
- BLEU = BP * exp((1/N) * Σ(log(p_n))), where p_n is the modified precision for n-grams of length n and N is the maximum n-gram order (here N = 1)
- BLEU Score = 1 * exp(log(0.8333)) = 0.8333
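The worked example above can be reproduced with a few lines of Python using only the standard library (a sketch for this single-reference, unigram-only case):

```python
import math
from collections import Counter

reference = ["The", "cat", "with", "the", "hat", "."]
candidate = ["The", "cat", "in", "the", "hat", "."]

# Clipped unigram overlap: each candidate count is capped at the reference count.
cand_counts, ref_counts = Counter(candidate), Counter(reference)
overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())  # 5

# Unigram precision: overlap divided by the candidate length.
precision = overlap / len(candidate)  # 5/6 ≈ 0.8333

# Brevity penalty: both sentences have 6 tokens, so BP = 1.
bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))

bleu_1 = bp * math.exp(math.log(precision))
print(round(bleu_1, 4))  # 0.8333
```

Note that the matching here is case-sensitive, as in the example: “The” and “the” are counted as distinct tokens, and both happen to match.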
Despite these limitations, BLEU remains a widely used metric for GenAI evaluation, especially in machine translation and text generation tasks. It is often used alongside other metrics and qualitative assessments to provide a more comprehensive evaluation of GenAI performance.