GenAI Performance Measurement Tool CHRF++


CHRF++ (usually written chrF++) is a metric for evaluating the quality of machine-generated text, including the output of Generative AI (GenAI) models. It extends chrF, the character n-gram F-score, by adding word n-grams (word unigrams and bigrams by default) to the character n-grams. CHRF++ is designed to address some of the limitations of word-based metrics like BLEU and scores a text by its n-gram precision and recall against a reference. Here’s an explanation of how CHRF++ works and its key features:

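Character n-grams: CHRF++ operates mainly at the character level, rather than only at the word level used in BLEU. It considers sequences of characters (character n-grams) in the generated text and compares them to those in the reference text, then adds a small number of short word n-grams on top. Because two word forms that share a stem also share most of their character n-grams, this approach gives partial credit for near matches and makes the metric more tolerant of inflection, spelling, and compounding differences than exact word matching.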
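
As a minimal sketch of this step, the Python function below collects the character n-grams of a string together with their counts. The helper name char_ngrams is an illustrative choice, not part of any CHRF++ implementation, and whitespace handling (which real implementations treat in different ways) is ignored here.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count every run of n consecutive characters in text."""
    # A string of length L has L - n + 1 n-grams (none if it is shorter than n).
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


print(char_ngrams("apple", 2))
```

Using a Counter rather than a set keeps track of repeated n-grams, such as the two “p” unigrams in “apple”, which matters when matches are counted later.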

Precision and Recall: CHRF++ calculates both n-gram precision and n-gram recall. Precision is the fraction of n-grams in the generated text that also occur in the reference text, while recall is the fraction of n-grams in the reference text that also occur in the generated text. Matches are counted with multiplicity, so a repeated n-gram is credited only as many times as it appears in both texts.
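
The two quantities can be sketched in Python as follows. This is an illustration under simplifying assumptions (character n-grams of a single length, no whitespace handling), and the function names are hypothetical rather than taken from a CHRF++ library.

```python
from collections import Counter
from typing import Tuple


def ngram_counts(text: str, n: int) -> Counter:
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def precision_recall(generated: str, reference: str, n: int) -> Tuple[float, float]:
    """Character n-gram precision and recall for one n-gram length."""
    gen = ngram_counts(generated, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated n-gram is credited only as often as it occurs in both texts.
    matches = sum((gen & ref).values())
    gen_total = sum(gen.values())
    ref_total = sum(ref.values())
    precision = matches / gen_total if gen_total else 0.0
    recall = matches / ref_total if ref_total else 0.0
    return precision, recall


print(precision_recall("aple", "apple", 1))  # (1.0, 0.8)
```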

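F-Score: The score reported by CHRF++ is an F-score that combines n-gram precision and n-gram recall. In its general form this is the Fβ-score, where β sets how much more weight recall receives than precision. With β = 1 it reduces to the F1-score, the harmonic mean of precision and recall; widely used implementations default to β = 2, which weights recall more heavily. The worked example below uses β = 1 to keep the arithmetic simple.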
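
To make the formula concrete, here is a small Python sketch of the general Fβ-score, of which the F1-score is the special case β = 1. The function name f_beta is an assumption made for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # no matches at all: avoid dividing by zero
    return (1 + beta ** 2) * precision * recall / denominator


print(round(f_beta(1.0, 0.8), 4))          # 0.8889 (beta = 1, harmonic mean)
print(round(f_beta(1.0, 0.8, beta=2), 4))  # 0.8333 (beta = 2, recall-weighted)
```

With perfect precision and imperfect recall, raising β lowers the score, because the shortfall in recall counts for more.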

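Variable N-gram Length: CHRF++ can be configured with a maximum character n-gram length (6 is a common default) and a maximum word n-gram length (2 is a common default). Results for all lengths up to the maximum are averaged into a single score, allowing for flexibility in measuring the quality of generated text at different levels of granularity.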
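
One way to combine several n-gram lengths is sketched below. It assumes that precision and recall are averaged over lengths 1 to max_n before the F-score is taken; implementations differ on this detail (some average per-length F-scores instead), so treat it as illustrative. It covers character n-grams only, and all names are hypothetical.

```python
from collections import Counter


def ngram_counts(text: str, n: int) -> Counter:
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def multi_order_f_score(generated: str, reference: str,
                        max_n: int = 6, beta: float = 1.0) -> float:
    """Average precision and recall over n = 1..max_n, then take F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
        gen_total, ref_total = sum(gen.values()), sum(ref.values())
        if gen_total == 0 or ref_total == 0:
            continue  # skip lengths that one of the texts is too short for
        matches = sum((gen & ref).values())
        precisions.append(matches / gen_total)
        recalls.append(matches / ref_total)
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    denominator = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denominator if denominator else 0.0


print(multi_order_f_score("aple", "apple", max_n=3))
```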

Smoothing: CHRF++ implementations typically include a smoothing option to handle cases where there are zero matches between the generated and reference texts for a given n-gram length, or where a text is too short to contain any n-grams of that length. Smoothing avoids division by zero and helps prevent overly harsh penalties in such cases.

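Parameter Tuning: CHRF++ provides options for adjusting parameters, such as the maximum character n-gram length, the maximum word n-gram length, and the recall weight β, to customize the evaluation for specific tasks or domains. Scores computed with different settings are not directly comparable, so the settings should be reported alongside the scores.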

ChrF++ – Example interpretation

Reference Text: “apple” ; Machine-Generated Text: “aple”


  1. Character N-grams: Compute character-level n-grams for both the reference and the machine-generated text. In this example, we’ll consider unigrams (individual characters), bigrams (pairs of characters), and trigrams (triplets of characters).
    • Reference Unigrams: [“a”, “p”, “p”, “l”, “e”] / Machine-Generated Unigrams: [“a”, “p”, “l”, “e”]
    • Reference Bigrams: [“ap”, “pp”, “pl”, “le”] / Machine-Generated Bigrams: [“ap”, “pl”, “le”]
    • Reference Trigrams: [“app”, “ppl”, “ple”] / Machine-Generated Trigrams: [“apl”, “ple”]
  2. Overlap of Character N-grams: Count the number of overlapping character n-grams between the reference and the machine-generated text.
    •  Overlapping Unigrams: [“a”, “p”, “l”, “e”]
    •  Overlapping Bigrams: [“ap”, “pl”, “le”]
    •  Overlapping Trigrams: [“ple”]
  3. ChrF++ Calculation: Use these overlap counts to calculate precision, recall, and the F-score for each n-gram length.

    • Unigrams: Precision (P1) = 4 / 4 = 1.0; Recall (R1) = 4 / 5 = 0.8
      F-score = 2 * (P1 * R1) / (P1 + R1) = 0.8889
    • Bigrams: Precision (P2) = 3 / 3 = 1.0; Recall (R2) = 3 / 4 = 0.75
      F-score = 2 * (P2 * R2) / (P2 + R2) = 0.8571
    • Trigrams: Precision (P3) = 1 / 2 = 0.5 (the machine-generated text has two trigrams, of which only “ple” matches); Recall (R3) = 1 / 3 = 0.3333
      F-score = 2 * (P3 * R3) / (P3 + R3) = 0.4

The overall score is then obtained by averaging the per-length results across n-gram lengths.
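
The per-length arithmetic in this example can be checked with a short Python sketch. It uses character n-grams only and β = 1, and the helper names are hypothetical. Note that “aple” contains two trigrams (“apl” and “ple”), of which one matches the reference, so the trigram precision is 1 / 2.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def per_order_scores(generated, reference, n):
    """Return (precision, recall, F1) for character n-grams of length n."""
    gen, ref = ngram_counts(generated, n), ngram_counts(reference, n)
    matches = sum((gen & ref).values())  # clipped overlap count
    gen_total, ref_total = sum(gen.values()), sum(ref.values())
    p = matches / gen_total if gen_total else 0.0
    r = matches / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


for n in (1, 2, 3):
    p, r, f = per_order_scores("aple", "apple", n)
    print(f"n={n}: P={p:.4f} R={r:.4f} F={f:.4f}")
# n=1: P=1.0000 R=0.8000 F=0.8889
# n=2: P=1.0000 R=0.7500 F=0.8571
# n=3: P=0.5000 R=0.3333 F=0.4000
```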

CHRF++ is especially useful when evaluating machine-generated text in languages with complex morphology, such as agglutinative or highly inflected languages. By focusing on character-level n-grams, CHRF++ can give partial credit when a generated word differs from the reference only in inflection, spelling, or compounding, cases in which a word-level metric would count a complete mismatch. Like other overlap-based metrics, it does not recognize synonyms or paraphrases that share few characters with the reference.

CHRF++, like other evaluation metrics, should be used in combination with other metrics and qualitative assessments to provide a comprehensive evaluation of GenAI performance. Different metrics suit different tasks and languages, and researchers often consider a range of evaluation methods to gain a complete understanding of how well a GenAI model is performing.
