In our recent post on evaluating a question answering model, we discussed the most commonly used metrics for evaluating the Reader node's performance: Exact Match (EM) and F1, the harmonic mean of precision and recall. However, both metrics sometimes fall short when evaluating semantic search systems. That's why we're excited to introduce a new metric: Semantic Answer Similarity (SAS).

We first introduced SAS in August of 2021 in a paper accepted at the Conference on Empirical Methods in Natural Language Processing (EMNLP). Like the language models that we employ in question answering and other NLP tasks, the SAS metric builds upon Transformers. Rather than measuring lexical overlap, it compares two answer strings based on their semantic similarity, which allows it to approximate human judgment better than both EM and F1. In this blog post, we'll show you how to use SAS in Haystack and provide some interpretation guidelines.

When we build, train, and fine-tune language models, we need a way of knowing how well these models ultimately perform. In an ideal world, we would have enough time to evaluate our machine learning system's predictions by hand to get a good understanding of its capabilities. And it does make sense to check a subsample of answers manually, even if only to get a feel for the system. But it's clearly beyond our capacity to evaluate hundreds or thousands of results each time we want to retrain a model. That's why we rely on metrics to tell us how well - or how poorly - a model is doing.

Both EM and F1 measure performance in terms of lexical overlap. EM is a binary metric that returns 1 if two strings (including their positions in a document) are identical and 0 if they aren't. F1 is more lenient: it provides a score between zero and one that expresses the degree of lexical overlap between the correct answer and the prediction.

What's Wrong with the Existing Metrics?

A Transformer-based language model represents language by abstracting away from a word's surface form. BERT, RoBERTa, and other common models represent tokens as vectors in a high-dimensional embedding space. Their aim is to encode the meaning of a word rather than its lexical representation. A well-formed deep language model can faithfully represent linguistic phenomena like synonymy and homonymy (where multiple words might look or sound the same but have different meanings). Such properties of natural language mean that we can express a single piece of information with completely different sets of words.

Consider the sentence, "The Queen is visiting the U.S." from a British newspaper. You might find that same information expressed as: "The British monarch is traveling to the United States." A traditional metric will return a similarity score of zero for the two sentences (F1 performs stop-word removal during preprocessing, so words like "the" and "is" are not taken into account).
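To see how a lexical metric arrives at zero here, below is a minimal sketch of token-overlap F1. The whitespace tokenizer and the tiny stop-word list are our own illustrative simplifications, not the exact preprocessing of any standard evaluation script:

```python
from collections import Counter

# Toy stop-word list for illustration only; real evaluation scripts use a fuller set.
STOP_WORDS = {"the", "is", "a", "an", "to"}

def f1_overlap(prediction: str, gold: str) -> float:
    """Token-level F1: the harmonic mean of precision and recall over shared tokens."""
    pred_tokens = [t.strip(".,").lower() for t in prediction.split()]
    gold_tokens = [t.strip(".,").lower() for t in gold.split()]
    pred_tokens = [t for t in pred_tokens if t not in STOP_WORDS]
    gold_tokens = [t for t in gold_tokens if t not in STOP_WORDS]
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred_tokens), common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_overlap("The British monarch is traveling to the United States.",
                 "The Queen is visiting the U.S."))  # 0.0 - no content words in common
```

The two sentences share no content words, so precision and recall are both zero and the metric cannot reward the prediction at all, even though it is a perfect paraphrase.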
To address the need for metrics that reflect a deeper understanding of semantics, several Transformer-based metrics have been introduced over the past years. Our SAS metric is the most recent addition. For a detailed description of SAS, see our paper. In the paper, we show that the metric correlates highly with human judgment on three different datasets. We're happy to announce that the paper was accepted at the 2021 EMNLP conference - one of the most prestigious events in the world of NLP.

SAS uses a cross-encoder architecture that accepts a pair of answers as input - one being the correct answer, the other the system's prediction. To assess the similarity of the two strings, SAS leverages a pre-trained semantic textual similarity (STS) model. Importantly, the model learns to distinguish which words in a sentence contribute most to its meaning, eliminating the need for a preprocessing step like stop-word removal. When applied, the SAS metric returns a score between zero (for two answers that are semantically completely different) and one (for two answers with the same meaning).
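To build an intuition for what the cross-encoder does, here is a minimal sketch using the sentence-transformers library. The checkpoint cross-encoder/stsb-roberta-large is one publicly available STS cross-encoder that we use here for illustration; it is not necessarily the exact model Haystack ships with:

```python
from sentence_transformers import CrossEncoder

# An off-the-shelf STS cross-encoder, assumed here for illustration.
model = CrossEncoder("cross-encoder/stsb-roberta-large")

gold_answer = "The Queen is visiting the U.S."
prediction = "The British monarch is traveling to the United States."

# The cross-encoder reads both strings jointly and outputs a single similarity
# score; STS-B models are trained with labels normalized to the range [0, 1].
score = model.predict([(gold_answer, prediction)])[0]
print(score)  # close to 1 for paraphrases, close to 0 for unrelated answers
```

Because both answers pass through the model together, every word of one answer can be weighed in the context of the other, which is what allows the model to treat "Queen" and "British monarch" as expressions of the same meaning.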
To evaluate your question answering system with the new metric, make sure that you're using the latest release of Haystack. We've updated our QA system evaluation tutorial to cover the new SAS metric. If you want to follow along with the code example below, simply copy the notebook and open it in Colab. The SAS metric can be used to evaluate the Reader node or the entire pipeline, so we initialize the SAS model together with the EvalAnswers() node:
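The original snippet is not reproduced here, so what follows is a minimal sketch under stated assumptions: we assume the EvalAnswers parameter for selecting the STS model is called sas_model and pass the cross-encoder/stsb-roberta-large checkpoint, and retriever and reader stand in for the nodes of your existing QA pipeline. Consult the tutorial notebook for the exact argument names and import paths in your Haystack release:

```python
# Module paths match a Haystack release from around the time of this post;
# later releases moved these classes (e.g. into haystack.nodes).
from haystack import Pipeline
from haystack.eval import EvalAnswers

# Passing an STS model activates SAS; the parameter name and the checkpoint
# are assumptions - check the EvalAnswers docs for your version.
eval_reader = EvalAnswers(sas_model="cross-encoder/stsb-roberta-large")

# `retriever` and `reader` are the nodes of your existing QA pipeline.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=reader, name="Reader", inputs=["Retriever"])
pipeline.add_node(component=eval_reader, name="EvalAnswers", inputs=["Reader"])
```

After running queries with labeled answers through this pipeline, you can inspect SAS alongside EM and F1 for the same predictions and compare how the three metrics judge your system.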