
LangChain Evaluators for Language Model Validation

Exploring Exact Matches, Embedding Distances, and More: A Deep Dive into Advanced String Evaluation Methods for AI Applications


Introduction

While string evaluators provide a robust way to measure a model’s accuracy, myriad other methods offer nuanced and targeted approaches to evaluation.

For developers and data scientists venturing into building applications with language models, ensuring the reliability of the model’s output becomes paramount. From the simplicity of an exact match to the depth of embedding distances, each evaluation method serves a unique purpose in the grand tapestry of language model validation.

Delving deeper, this guide explores various string evaluation techniques — each with its strengths, intricacies, and use cases. 

Whether you’re looking to validate a specific format using regex or measure semantic similarity through embeddings, understanding these evaluation methods is key to creating AI-driven applications that are both accurate and effective.

Evaluation

When building apps with language models, it’s crucial to ensure your models produce reliable and valuable results for various inputs and integrate seamlessly with other software components. This often requires a mix of intelligent application design, thorough testing, and runtime checks.

Exact Match Evaluators

Probably the simplest way to evaluate an LLM or runnable’s string output against a reference label is a simple string equivalence check.

The ExactMatchStringEvaluator simply checks if the prediction string exactly matches the reference string.

It is case-sensitive by default.

from langchain.evaluation import ExactMatchStringEvaluator

evaluator = ExactMatchStringEvaluator()

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain",
)

# {'score': 0}

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="My name is Harpreet, and I love to learn LangChain",
)

# {'score': 1}

Configure the ExactMatchStringEvaluator

You can relax the “exactness” when comparing strings.

evaluator = ExactMatchStringEvaluator(
    ignore_case=True,
    ignore_numbers=True,
    ignore_punctuation=True,
)
evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="my name is harpreet, and I love to learn langchain!"
    )

# will output {'score': 1}

String Distance

String distance is a measure of the difference between two strings.

The smaller the distance, the more similar the two strings are. Different algorithms provide different ways of calculating this distance.

Under the hood, LangChain uses the RapidFuzz library to perform several calculations.

This can be used alongside approximate/fuzzy matching criteria for basic unit testing (a small test sketch appears at the end of this section).

The string distance evaluator (StringDistanceEvalChain) measures the similarity between two strings using a string distance algorithm like Levenshtein distance.

It returns a score between 0 and 1, with 0 indicating an exact match and values closer to 1 indicating greater dissimilarity.
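To make that score concrete, here is a small sketch that computes a normalized Levenshtein distance directly with RapidFuzz. This assumes you have rapidfuzz installed; the exact normalization LangChain applies under the hood may differ slightly.

from rapidfuzz.distance import Levenshtein

# Raw edit distance: the number of insertions, deletions, and substitutions.
print(Levenshtein.distance("kitten", "sitting"))             # 3

# Normalized to [0, 1]: 0.0 means identical strings,
# and values closer to 1.0 mean the strings are more different.
print(Levenshtein.normalized_distance("kitten", "sitting"))  # ~0.43
print(Levenshtein.normalized_distance("kitten", "kitten"))   # 0.0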


Supported Evaluator Metrics

The StringDistance enumeration defines the types of string distance metrics supported:

  • Damerau-Levenshtein: Considers insertions, deletions, substitutions, and the transposition of two adjacent characters.
  • Levenshtein: Considers insertions, deletions, and substitutions.
  • Jaro: Measures similarity based on the number of matching characters and the number of transpositions between two strings.
  • Jaro-Winkler: A modification of Jaro’s similarity to give more weight to the prefix.
  • Hamming: Counts the positions at which two strings of equal length differ.
  • Indel: Considers only insertions and deletions.

from langchain.evaluation import load_evaluator, StringDistance

evaluator = load_evaluator("string_distance")

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain",
)

# will output {'score': 0.31919191919191914}

You can change the metric like so:

levenshtein_evaluator = load_evaluator(
    "string_distance",
    distance=StringDistance.LEVENSHTEIN
)

levenshtein_evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain",
)

# {'score': 0.52}
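If you want to see how each supported metric scores the same pair, you can loop over the StringDistance enum. Here is a minimal sketch; the exact scores will depend on your LangChain and RapidFuzz versions.

from langchain.evaluation import load_evaluator, StringDistance

prediction = "My name is Harpreet, and I love to learn LangChain"
reference = "Harpreet loves learning langchain"

# Score the same prediction/reference pair with every supported metric.
for metric in StringDistance:
    metric_evaluator = load_evaluator("string_distance", distance=metric)
    result = metric_evaluator.evaluate_strings(prediction=prediction, reference=reference)
    print(f"{metric.value}: {result['score']:.3f}")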

You can also instantiate the StringDistanceEvalChain directly and pass the metric yourself:

from langchain.evaluation import StringDistanceEvalChain, StringDistance

evaluator = StringDistanceEvalChain(distance=StringDistance.INDEL)

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain",
)

# {'score': 0.31919191919191914}
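As mentioned earlier, these distance scores pair naturally with fuzzy matching in unit tests. Below is a minimal pytest-style sketch; the generate_greeting function and the 0.1 threshold are hypothetical placeholders for your own chain and tolerance.

from langchain.evaluation import load_evaluator

string_evaluator = load_evaluator("string_distance")

def generate_greeting(name: str) -> str:
    # Hypothetical stand-in for a call to your LLM or chain.
    return f"Hello {name}, welcome to LangChain!"

def test_greeting_stays_close_to_reference():
    result = string_evaluator.evaluate_strings(
        prediction=generate_greeting("Harpreet"),
        reference="Hello Harpreet, welcome to LangChain!",
    )
    # Lower scores mean more similar strings; fail if the output drifts too far.
    assert result["score"] < 0.1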

Embedding Distance Evaluator

To measure semantic similarity (or dissimilarity) between a prediction and a reference label string, you can compute a vector distance between the two embedded representations using the embedding_distance evaluator.

Note: This returns a distance score, meaning that the lower the number, the more similar the prediction is to the reference, according to their embedded representation.

The distance measures you can choose from are:

  1. Cosine Distance ("cosine"): This is computed as 1 - cosine similarity. The cosine similarity measures the cosine of the angle between two vectors. A cosine similarity of 1 means the vectors point in the same direction, while a value of 0 means they are orthogonal (entirely dissimilar). Therefore, a cosine distance of 0 indicates that the embeddings are identical, and a value of 1 indicates they are entirely dissimilar (see the short sketch after this list).
  2. Euclidean Distance ("euclidean"): It is the straight-line distance between two points in Euclidean space.
  3. Manhattan Distance (or L1 Distance) ("manhattan"): It is the sum of the absolute differences of their coordinates. In a 2D space, it represents the distance between two points measured along the axes at right angles.
  4. Chebyshev Distance ("chebyshev"): It is the maximum absolute difference between elements of the vectors. It’s essentially the infinity norm of the difference between the vectors.
  5. Hamming Distance ("hamming"): It measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other. In this context, it is applied to vectors by measuring the proportion of elements that differ.
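To ground the cosine definition above, here is a small sketch that computes cosine distance by hand with NumPy. The vectors are made-up toy embeddings, not real model output.

import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine distance = 1 - cosine similarity.
    cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cosine_similarity

a = np.array([0.2, 0.7, 0.1])   # toy "embedding" for the prediction
b = np.array([0.25, 0.6, 0.2])  # toy "embedding" for the reference

print(cosine_distance(a, a))  # ~0.0 -> identical direction
print(cosine_distance(a, b))  # small value -> similar direction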

Considerations for Choosing a Distance Metric for Text Embeddings:

  1. Scale or Magnitude: Embeddings from models like Word2Vec, FastText, BERT, and GPT are often normalized to unit length. In such cases, cosine distance is suitable as it focuses on the angle (direction) between vectors and ignores uniform magnitude.
  2. Distribution of Embeddings: Understand the distribution of your embeddings. For densely packed vectors, minor changes in direction can be significant, making cosine distance a good choice.
  3. High Dimensionality: Text embeddings are often high-dimensional. Under the “curse of dimensionality,” Euclidean distances tend to concentrate, making points look nearly equidistant and the metric less discriminative. Cosine distance is often more reliable in such situations.
  4. Nature of Textual Data: For longer documents, cosine similarity can capture nuanced semantic information. For shorter texts, like phrases, the absolute position of embeddings can be important, making Euclidean or Manhattan distances more informative.
  5. Use Case: Your specific application (e.g., clustering, matching) can dictate the best metric.
  6. Interpretability: Cosine distance values are bounded between 0 (identical) and 1 (entirely dissimilar), offering more interpretability than unbounded metrics like Euclidean or Manhattan.
  7. Performance: Computationally, cosine distance can be more efficient for normalized vectors.

In general, cosine distance is a common choice for text embeddings. However, it’s beneficial to experiment with different metrics based on your specific needs and validate them against a known benchmark or application outcome.

from langchain.evaluation import load_evaluator

evaluator = load_evaluator("embedding_distance")

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain"
    )

# {'score': 0.0404781648420105}
evaluator = load_evaluator(
    "embedding_distance",
    distance_metric="euclidean"
)

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain"
    )

# {'score': 0.2844376766821911}

Select the embeddings you want to use

The constructor uses OpenAI embeddings by default, but you can configure this however you want. Below, we use local Hugging Face embeddings:

from langchain.embeddings import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings()
hf_evaluator = load_evaluator("embedding_distance",
                              embeddings=embedding_model)

hf_evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain"
    )

# {'score': 0.2803533789378635}

Regex Matching Evaluator

The RegexMatchStringEvaluator checks whether the reference, treated as a regex pattern, matches the prediction string. This is useful for validating the format of outputs.

from langchain.evaluation import RegexMatchStringEvaluator

evaluator = RegexMatchStringEvaluator()

evaluator.evaluate_strings(
    prediction="The date is 2022-01-01",
    reference="The date is 2022-01-01"
  )

#  {'score': 1}

# Check for the presence of a MM-DD-YYYY string.
evaluator.evaluate_strings(
    prediction="The delivery will be made on 2024-01-05",
    reference=".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"
)

# {'score': 0}

evaluator.evaluate_strings(
    prediction="The delivery will be made on 01-05-2024",
    reference=".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"
)

# {'score': 1}

Match against multiple patterns

To match against multiple patterns, use a regex union “|”.

# Check for the presence of a MM-DD-YYYY string or YYYY-MM-DD
evaluator.evaluate_strings(
    prediction="The delivery will be made on 01-05-2024",
    reference="|".join([".*\\b\\d{4}-\\d{2}-\\d{2}\\b.*", ".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"])
)

# {'score': 1}

Configure the RegexMatchStringEvaluator

You can specify any regex flags to use when matching.

import re

evaluator = RegexMatchStringEvaluator(
    flags=re.IGNORECASE
)

evaluator.evaluate_strings(
    prediction="My name is Harpreet, and I love to learn LangChain",
    reference="Harpreet loves learning langchain"
    )

# {'score': 0}: the reference is treated as a regex pattern, and it does not
# occur anywhere in the prediction, so the match fails even with IGNORECASE.

So, in summary:

  • Exact Match does literal string comparison
  • String Distance measures similarity using algorithms like Levenshtein distance
  • Embedding Distance measures semantic similarity using embeddings
  • Regex Match validates string formats using regular expressions
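To tie these together, here is a hedged sketch that runs all four evaluators on the same prediction. It assumes your LangChain version exposes all four through load_evaluator; the regex pattern is purely illustrative, and the embedding evaluator will call the default OpenAI embeddings unless you pass your own as shown earlier.

from langchain.evaluation import load_evaluator

prediction = "The delivery will be made on 01-05-2024"
reference = "The delivery will be made on 01-05-2024"

for evaluator_name, ref in [
    ("exact_match", reference),                         # literal comparison
    ("string_distance", reference),                     # fuzzy similarity
    ("embedding_distance", reference),                  # semantic similarity
    ("regex_match", ".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"),  # format validation
]:
    evaluator = load_evaluator(evaluator_name)
    result = evaluator.evaluate_strings(prediction=prediction, reference=ref)
    print(f"{evaluator_name}: {result['score']}")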

Conclusion

As we journey through the multifaceted landscape of language model evaluation, it becomes evident that no single one-size-fits-all approach will suffice.

From the precision of exact matches to the interpretive power of embedding distances, each evaluation technique offers a unique lens through which we can scrutinize our models. The role of regex in format validation and the nuanced ways string distance algorithms operate underscore the richness and diversity of tools at our disposal.

For developers and AI enthusiasts, understanding and leveraging these evaluation methods are crucial steps toward building applications that not only function seamlessly but also uphold the standards of reliability and accuracy.

A comprehensive toolkit like this ensures we remain equipped to meet challenges, validate outputs, and drive innovation. As you conclude this guide, I hope you’re better prepared and inspired to harness the power of these evaluative techniques, ensuring that your AI applications are always a cut above the rest.

Harpreet Sahota
