Model Interpretation Report

Interpretation result:
Problems: 3
Insights: 5
Models: 44 LLM/RAG models
Dataset: 55 inputs
Interpretation status: SUCCESS
Interpretation ID: 133445c7-feef-4bae-835a-2e2cb726cbf3
Created: 2026-01-30 14:26:04

Explainers overview


Problems

Explainers identified the following problems:

Severity | Type | Problem | Suggested actions | Explainer | Resources
MEDIUM | accuracy | Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
MEDIUM | accuracy | Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
MEDIUM | accuracy | Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html

Insights

Explainers identified the following insights:

Type | Insight | Suggested actions | Explainer | Resources
accuracy | Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
accuracy | Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
performance | Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | | Text matching | GlobalHtmlFragmentExplanation / text/html
performance | Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | | Text matching | GlobalHtmlFragmentExplanation / text/html
weak-point | Prompt 'Who are the board members?' was evaluated as the most difficult prompt to answer correctly according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html

Explainers

Progress: 100%

All explainers (1):

  • Scheduled explainers (1)
  • Finished explainers (1)
  • Successful explainers (1)

Explainer: Text matching

Problems

Explainer identified the following problems:

Severity | Type | Problem | Suggested actions | Explainer | Resources
MEDIUM | accuracy | Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
MEDIUM | accuracy | Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
MEDIUM | accuracy | Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html

Insights

Explainer identified the following insights:

Type | Insight | Suggested actions | Explainer | Resources
accuracy | Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
accuracy | Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html
performance | Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | | Text matching | GlobalHtmlFragmentExplanation / text/html
performance | Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | | Text matching | GlobalHtmlFragmentExplanation / text/html
weak-point | Prompt 'Who are the board members?' was evaluated as the most difficult prompt to answer correctly according to the Text matching evaluator. A detailed description of the failures, questions, and answers, which helps identify the model's weaknesses and strengths and their root causes, can be found in the explanation. | Check the prompt, expected answer, and condition: are they correct? Check the models' answers in failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html

Explainer description

Evaluator input requirements:

  • Question
  • Expected Answer
  • Retrieved Context
  • Actual Answer
  • Conditions

Description:

Text Matching Evaluator assesses whether both the retrieved context (in the case of RAG-hosted models) and the generated answer contain or match a specified set of required strings. The evaluation is based on a boolean expression (condition) that defines which strings must be present:

  • operands are strings or regular expressions
  • operators are AND, OR, and NOT
  • parentheses can be used to group expressions
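As an illustrative sketch only (not the evaluator's actual parser), this grammar maps directly onto Python boolean operators over containment and regex checks; the helper names here are hypothetical:

```python
import re

def contains(text: str, literal: str) -> bool:
    # String operand: satisfied when the text contains the literal.
    return literal in text

def matches(text: str, pattern: str) -> bool:
    # regexp(...) operand: satisfied when re.search finds the pattern.
    return re.search(pattern, text) is not None

# The condition ("Rio" OR "rio") AND NOT "Real", applied by hand:
answer = "Revenue in rio reached 15,969 million."
passed = (
    (contains(answer, "Rio") or contains(answer, "rio"))
    and not contains(answer, "Real")
)
```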

  • Example 1: Simple string matching

  • Expression: "15,969"
  • The evaluator will check if the retrieved context and the actual answer contain the string 15,969. If the condition is satisfied, the test case passes.

  • Example 2: Flexible regex patterns

  • Expression: regexp("15,?969")
  • What if the number 15,969 might be expressed as 15969 or 15,969? The boolean expression can be extended to use a regular expression. The evaluator will check if the retrieved context and the actual answer contain the string 15,969 or 15969. If the condition is satisfied, the test case passes.
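Since the evaluation relies on Python's re.search, the optional-comma behavior can be checked directly:

```python
import re

# The comma in "15,?969" is optional, so both spellings satisfy the pattern.
assert re.search(r"15,?969", "revenue was 15,969 million")
assert re.search(r"15,?969", "revenue was 15969 million")
# An unrelated number does not match.
assert re.search(r"15,?969", "revenue was 15,000 million") is None
```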

  • Example 3: Combining string and regex

  • Expression: "15,969" AND regexp("[Mm]illion")
  • The evaluator will check if the retrieved context and the actual answer contain the string 15,969 and match the regular expression [Mm]illion. If the condition is satisfied, the test case passes.

  • Example 4: Complex boolean logic

  • Expression: ("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"
  • The evaluator will check if the retrieved context and the actual answer contain either Rio or rio and match the regular expression 15,969 [Mm]il and do not contain the string Real. If the condition is satisfied, the test case passes.
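The same compound condition, checked directly in Python (an illustrative sketch; the passes helper is hypothetical):

```python
import re

def passes(text: str) -> bool:
    # ("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"
    return (
        ("Rio" in text or "rio" in text)
        and re.search(r"15,?969 [Mm]il", text) is not None
        and "Real" not in text
    )

assert passes("Rio revenue was 15969 million")
assert not passes("Real Madrid: Rio revenue was 15,969 million")  # "Real" present
```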

  • Example 5: Exact matching with regex anchors

  • Expression: regexp("^Brazil revenue was 15,969 million$")
  • The evaluator will check if the retrieved context and the actual answer exactly match the regular expression ^Brazil revenue was 15,969 million$. If the condition is satisfied, the test case passes.
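The effect of the anchors is easy to verify with re.search:

```python
import re

pattern = r"^Brazil revenue was 15,969 million$"
# The anchors ^ and $ pin the match to the start and end of the string.
assert re.search(pattern, "Brazil revenue was 15,969 million")
# Any surrounding text breaks the exact match.
assert re.search(pattern, "Note: Brazil revenue was 15,969 million.") is None
```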

  • Example 6: Case-insensitive matching

  • Expression: regexp("(?i)python")
  • The (?i) flag enables case-insensitive matching. The evaluator will match python, Python, PYTHON, PyThOn, etc. This is useful when the capitalization in the output is unpredictable.
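The flag's effect, demonstrated with re.search:

```python
import re

# (?i) makes the whole pattern case-insensitive.
for variant in ("python", "Python", "PYTHON", "PyThOn"):
    assert re.search(r"(?i)python", variant)
# Without the flag, only the exact casing matches.
assert re.search(r"python", "PYTHON") is None
```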

  • Example 7: OR within regular expressions

  • Expression: regexp("(cat|dog|bird)")
  • Using the pipe | operator inside a group allows matching multiple alternatives. The evaluator will match any of: cat, dog, or bird. This is more concise than using multiple OR operators in the boolean expression.

  • Example 8: Capturing groups and word boundaries

  • Expression: regexp("\b(error|warning|failure)\b")
  • The \b word boundary ensures exact word matching (not as part of a larger word). The regex will match error, warning, or failure as complete words. Parentheses capture the matched text for reference.
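The word-boundary behavior can be checked directly (note the raw string, so \b is not interpreted as a backspace character):

```python
import re

pattern = r"\b(error|warning|failure)\b"
assert re.search(pattern, "build finished with a warning today")
# "forewarning" contains "warning", but not as a complete word.
assert re.search(pattern, "a forewarning was issued") is None
```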

  • Example 9: Repeated patterns and quantifiers

  • Expression: regexp("\d3-\d3-\d4")
  • Quantifiers specify repetition: \d3 matches exactly 3 digits, + matches one or more, * matches zero or more. This example matches phone numbers in the format 123-456-7890. Use \d for digits, \w for word characters, \s for whitespace.
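In Python's re syntax an exact repetition count is written with braces, e.g. \d{3} for exactly three digits; a quick check:

```python
import re

# \d{3} matches exactly three digits; \d{4} exactly four.
phone = re.compile(r"\d{3}-\d{3}-\d{4}")
assert phone.search("call 123-456-7890 today")
assert phone.search("call 12-34-5678 today") is None
```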

  • Example 10: Lookahead and combining patterns

  • Expression: regexp("(?i)(success|completed).*\d+%")
  • This combines case-insensitive matching (?i), an OR group (success|completed), .* to match any characters, and \d+% to match one or more digits followed by a percent sign. Useful for matching complex patterns like progress messages.
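The combined pattern in action:

```python
import re

pattern = r"(?i)(success|completed).*\d+%"
assert re.search(pattern, "Completed: upload at 87%")
assert re.search(pattern, "SUCCESS rate reached 100%")
# No digits-plus-percent after the keyword, so no match.
assert re.search(pattern, "completed with no figures") is None
```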

Method:

  • The evaluator parses the boolean expression and checks whether the retrieved context and the generated answer contain the required strings.
  • The evaluator uses Python's re module for regular expression matching (the re.search function). See https://docs.python.org/3/howto/regex.html#regex-howto

Metrics calculated by the evaluator:

  • Model passes (float)
    • Percentage of successfully evaluated RAG/LLM outputs.
    • Higher score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
    • This is the primary metric.
  • Model failures (float)
    • Percentage of RAG/LLM outputs that failed to pass the evaluator check.
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model retrieval failures (float)
    • Percentage of RAG's retrieved contexts that failed to pass the evaluator check.
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model generation failures (float)
    • Percentage of outputs generated by RAG from the retrieved contexts that failed to pass the evaluator check (equivalent to the model failures).
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
  • Model parse failures (float)
    • Percentage of RAG/LLM outputs that the evaluator's judge (LLM, RAG, agent, or model) was unable to parse, and was therefore unable to evaluate and score.
    • Lower score is better.
    • Range: [0.0, 1.0]
    • Default threshold: 0.5
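The two headline metrics reduce to simple ratios over the per-case pass/fail results; a minimal sketch (the function and field names are illustrative, not the evaluator's internals):

```python
def model_metrics(case_results: list[bool]) -> dict[str, float]:
    # case_results: one boolean per test case (True = condition satisfied).
    passes = sum(case_results) / len(case_results)
    return {"model_passes": passes, "model_failures": 1.0 - passes}

# 2 passes out of 5 cases gives a score of 0.4, which is below the
# default threshold of 0.5 and would therefore be reported as a problem.
metrics = model_metrics([True, True, False, False, False])
```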

Problems reported by the evaluator:

  • If the average score of the metric for an evaluated LLM is below the threshold, the evaluator reports a problem for that LLM.
  • If the test suite has perturbed test cases, the evaluator reports a problem for each perturbed test case and LLM model whose metric flipped (moved above/below the threshold) after perturbation.

Insights diagnosed by the evaluator:

  • Most accurate, least accurate, fastest, slowest, most expensive and cheapest LLM models based on the evaluated primary metric.
  • LLM models with best and worst context retrieval performance.
  • The most difficult test case for the evaluated LLM models, i.e., the prompt that most of the evaluated LLM models failed to answer correctly.

Evaluator parameters:

  • metric_threshold (float):
    • Evaluated metric threshold - values below this threshold are considered problematic.
    • Default value: 0.5
  • save_llm_result (bool):
    • Controls whether to save the LLM result, which contains the input LLM dataset and all metrics calculated by the evaluator.
    • Default value: True
  • evaluate_retrieved_context (bool):
    • Controls whether the retrieved context is also evaluated against the conditions, i.e., whether it contains or does not contain specific strings.
    • Default value: ""

Explanations

LLM Models by Success Rate (columns: model, type, pass, fail, success rate, total time, cost)
1. gpt-3.5-turbo-0613 h2oGPTe RAG 4 1 80.000% 18.775s $0.000
2. gpt-4-0613 h2oGPTe RAG 4 1 80.000% 30.675s $0.000
3. gpt-4-32k-0613 h2oGPTe RAG 4 1 80.000% 37.538s $0.000
4. gpt-3.5-turbo-16k-0613 h2oGPTe RAG 3 2 60.000% 20.997s $0.000
5. h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG 3 2 60.000% 68.370s $0.000
6. lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG 3 2 60.000% 69.986s $0.000
7. h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG 3 2 60.000% 70.819s $0.000
8. h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG 3 2 60.000% 111.219s $0.000
9. HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG 2 3 40.000% 50.480s $0.000
10. h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG 2 3 40.000% 56.327s $0.000
11. Yukang/LongAlpaca-70B h2oGPTe RAG 2 3 40.000% 211.522s $0.000

LLM Models by Time (columns: model, type, pass, fail, success rate, total time, cost)
1. gpt-3.5-turbo-0613 h2oGPTe RAG 4 1 80.000% 18.775s $0.000
2. gpt-3.5-turbo-16k-0613 h2oGPTe RAG 3 2 60.000% 20.997s $0.000
3. gpt-4-0613 h2oGPTe RAG 4 1 80.000% 30.675s $0.000
4. gpt-4-32k-0613 h2oGPTe RAG 4 1 80.000% 37.538s $0.000
5. HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG 2 3 40.000% 50.480s $0.000
6. h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG 2 3 40.000% 56.327s $0.000
7. h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG 3 2 60.000% 68.370s $0.000
8. lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG 3 2 60.000% 69.986s $0.000
9. h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG 3 2 60.000% 70.819s $0.000
10. h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG 3 2 60.000% 111.219s $0.000
11. Yukang/LongAlpaca-70B h2oGPTe RAG 2 3 40.000% 211.522s $0.000

LLM Models by Cost (columns: model, type, pass, fail, success rate, total time, cost)
1. gpt-3.5-turbo-0613 h2oGPTe RAG 4 1 80.000% 18.775s $0.000
2. gpt-3.5-turbo-16k-0613 h2oGPTe RAG 3 2 60.000% 20.997s $0.000
3. gpt-4-0613 h2oGPTe RAG 4 1 80.000% 30.675s $0.000
4. gpt-4-32k-0613 h2oGPTe RAG 4 1 80.000% 37.538s $0.000
5. HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG 2 3 40.000% 50.480s $0.000
6. h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG 2 3 40.000% 56.327s $0.000
7. h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG 3 2 60.000% 68.370s $0.000
8. lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG 3 2 60.000% 69.986s $0.000
9. h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG 3 2 60.000% 70.819s $0.000
10. h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG 3 2 60.000% 111.219s $0.000
11. Yukang/LongAlpaca-70B h2oGPTe RAG 2 3 40.000% 211.522s $0.000

Most difficult prompts across all models (columns: prompt, failures, success rate)
Who are the board members? 11 0.000%
How many stores are in Florida? 7 36.364%
What was the revenue of Brazil? 4 63.636%

Model failures
Model h2oai/h2ogpt-4096-llama2-70b-chat failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Xavier Veyry".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided in the document sources, the following individuals are listed as board members of AXA Sigorta and AXA Hayat ve Emeklilik: 1. Christophe Knaub - Board Chairman 2. Guillaume Herve Marie Xavier Lejeune - Board Member 3. Yavuz Ölken - CEO and Deputy Chairman of the Board 4. Onur Koçkar - Board Member 5. Maria Jesus de Artega Larru - Board Member 6. Nuria Fernandez Paris - Board Member 7. Fernando Garcia - Board Member 8. Laurent Goutard - Board Member 9. Jean-Luc Laurent - Board Member 10. Emmanuel Faber - Board Member Note that the list of board members may not be exhaustive, as the document sources only provide information up to page 13, and there may be additional board members not mentioned in the provided pages.
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): According to the information provided in the document, Lowe's operates 1,738 home improvement stores in the United States, including 232 stores in Canada, as of February 3, 2023. However, the document does not provide a specific number of stores located in Florida. Therefore, I cannot answer your question based on the information provided.
    • Context size: 14727B
    • Context chunks: 10
Model h2oai/h2ogpt-4096-llama2-70b-chat-4bit failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Xavier Veyry".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided in the document sources, the following individuals are listed as board members of AXA Sigorta and AXA Hayat ve Emeklilik: 1. Christophe Knaub - Board Chairman 2. Guillaume Herve Marie Xavier Lejeune - Board Member 3. Yavuz Ölken - CEO and Deputy Chairman of the Board 4. Onur Koçkar - Board Member 5. Maria Jesus de Artega Larru - Board Member 6. Nuria Fernandez Paris - Board Member 7. Fernando Garcia - Board Member 8. Laurent Mignon - Board Member 9. Jean-Luc Robert - Board Member 10. Jean-Francois Lepelletier - Board Member Note that the list of board members may not be exhaustive, as the documents provided are only snapshots of the company's leadership structure at certain points in time.
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): According to the information provided in the document, Lowe's operates 1,738 home improvement stores in the United States, including 232 stores in Canada, as of February 3, 2023. However, the document does not provide a specific number of stores located in Florida. It only mentions that the company operates stores in the United States and Canada. Therefore, I cannot provide an answer to your question based on the information provided.
    • Context size: 14727B
    • Context chunks: 10
Model lmsys/vicuna-13b-v1.5-16k failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): The board members of AXA Sigorta A.Ş. and AXA Hayat ve Emeklilik A.Ş. are: 1. Christophe Knaub - Board Chairman 2. Guillaume Herve Marie Xavier Lejeune - Board Member 3. Onur Koçkar - Board Member 4. Xavier Veyry - Board Member 5. Maria Jesus de Arteaga Larru - Board Member 6. Yavuz Ölken - Deputy Board Chairman and CEO 7. Fatih Bayraktar - Head of Innovative Business Partnerships and Workforce Management 8. Cemre İpek - Head of Automotive Dealerships Sales Channel 9. Gülnur Uşaklı - Head of Regions and Branches 10. İpek Akkaya - Head of Life and Health Insurance 11. Murat Can - Head of Motor Vehicle Insurance Sales Channel 12. Barış Macit - Director of Corporate Technical Department 13. Tolga Tiknaz - Director of Human Resources 14. Zaf
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): Based on the information provided in the document, there are 112 Lowe's-branded home improvement stores in Florida. This information can be found on page 29 of the document under the "State" section, where it lists the number of stores for each state.
    • Context size: 14727B
    • Context chunks: 10
Model h2oai/h2ogpt-4096-llama2-13b-chat failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "15,969" AND "million". The following part of the condition did not match: "15,969".
    • Prompt: What was the revenue of Brazil?
    • Output constraints: ['15,969', 'million']
    • Expected output: Brazil revenue was 15,969 million.
    • Actual output (fragment_level): Based on the information provided in the document, the revenue of Brazil was Ps. 1,450 million for the first quarter of 2023, and Ps. 1,250 million for the same period of the previous year. This represents an increase of 15.1% compared to the same period of the previous year.
    • Context size: 10268B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Christophe Knaub".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided in the document, the board members of AXA Sigorta A.Ş. and AXA Hayat ve Emeklilik A.Ş. are: 1. Christophe Stephane Knaub (Board Chairman) 2. Guillaume Herve Marie Xavier Lejeune (Board Member) 3. Yavuz Ölken (CEO and Deputy Chairman of the Board) The document provides information about their backgrounds, education, and experience, as well as their roles and responsibilities within the company.
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): Based on the information provided in the document, there are 38 stores in Florida. This information can be found on page 29 of the document, where it states "State Stores Alabama... Florida 38".
    • Context size: 14727B
    • Context chunks: 10
Model HuggingFaceH4/zephyr-7b-beta failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "15,969" AND "million". The following part of the condition did not match: "15,969".
    • Prompt: What was the revenue of Brazil?
    • Output constraints: ['15,969', 'million']
    • Expected output: Brazil revenue was 15,969 million.
    • Actual output (fragment_level): The revenue for Brazil in the first quarter of 2023 was Ps. 14,007 million (approximately USD 756 million) as reported, and Ps. 13,982 million (approximately USD 754 million) on a comparable basis (excluding volumes from the acquisition of CVI in Brazil). This information can be found on pages 10-11 and 13 of the document sources provided.
    • Context size: 10268B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Yavuz Ölken".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): The board members mentioned in the document sources provided are Christophe Knaub (Board Chairman), Guillaume Herve Marie, and Xavier Lejeune (both Board Members).
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): The document sources provided do not specifically mention the number of Lowe's stores in Florida.
    • Context size: 14727B
    • Context chunks: 10
Model h2oai/h2ogpt-32k-codellama-34b-instruct failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided in the document sources, the board members of AXA Sigorta A.Ş. and AXA Hayat ve Emeklilik A.Ş. are: 1. Christophe Knaub - Board Chairman 2. Guillaume Herve Marie Xavier Lejeune - Board Member 3. Yavuz Ölken - CEO and Deputy Chairman of the Board 4. Maria Jesus de Arteaga Larru - Board Member 5. Xavier Veyry - Board Member These individuals are listed as board members in the "Board Members" section of the document sources.
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): According to the information provided in the document sources, there are 14 stores in Florida.
    • Context size: 14727B
    • Context chunks: 10
Model Yukang/LongAlpaca-70B failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "15,969" AND "million". The following part of the condition did not match: "15,969".
    • Prompt: What was the revenue of Brazil?
    • Output constraints: ['15,969', 'million']
    • Expected output: Brazil revenue was 15,969 million.
    • Actual output (fragment_level): According to the information provided on page 13: Revenues by region: Brazil: 218.3 million pesos (reported) So the revenue of Brazil was 218.3 million pesos for the first quarter of 2023, according to the information provided.
    • Context size: 10268B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): According to the information provided in the annual report, the board members of AXA Sigorta A.Ş. are: 1. Christophe Knaub (Board Chairman) 2. Yavuz Ölken (Deputy Chairman and CEO) 3. Guillaume Herve Marie Xavier Lejeune (Board Member) 4. Onur Koçkar (Board Member) 5. Maria Jesus de Arteaga Larru (Board Member) 6. Xavier Veyry (Board Member) These are the names of the board members mentioned in the annual report.
    • Context size: 13990B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/lowes-2022ar-full-report-4-6-23-final.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "128".
    • Prompt: How many stores are in Florida?
    • Output constraints: ['128']
    • Expected output: There are 128 stores in Florida.
    • Actual output (fragment_level): According to the information provided on page 29, Lowe's operated 112 stores in Florida as of February 3, 2023.
    • Context size: 14727B
    • Context chunks: 10
Model gpt-3.5-turbo-0613 failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Christophe Knaub".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided, the board members of AXA Sigorta A.Ş. are: 1. Christophe Stephane KNAUB (Board Chairman) 2. Guillaume Herve Marie Xavier LEJEUNE 3. Yavuz ÖLKEN 4. Onur KOÇKAR 5. Xavier VEYRY 6. Maria Jesus de Artega LARRU Please note that this information is based solely on the document sources provided and may not be exhaustive or up to date.
    • Context size: 13990B
    • Context chunks: 10
Model gpt-3.5-turbo-16k-0613 failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "15,969" AND "million". The following part of the condition did not match: "15,969".
    • Prompt: What was the revenue of Brazil?
    • Output constraints: ['15,969', 'million']
    • Expected output: Brazil revenue was 15,969 million.
    • Actual output (fragment_level): According to the information provided, the revenue of Brazil in the first quarter of 2023 was Ps. 5,632.20 million.
    • Context size: 10268B
    • Context chunks: 10
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): Based on the information provided, the board members of AXA Sigorta A.Ş. and AXA Hayat ve Emeklilik A.Ş. are: 1. Christophe Knaub (Board Chairman) 2. Guillaume Herve Marie Xavier Lejeune (Board Member) 3. Onur Koçkar (Board Member) 4. Maria Jesus de Artega Larru (Board Member) 5. Xavier Veyry (Board Member) 6. Nuria Fernandez Paris (Board Member) 7. Yavuz Ölken (Deputy Chairman of Board and CEO) 8. Barış Altın (Chairman and Member of the Executive Committee Corporate Technical Department) 9. Ege Örer (Chairman and Member of the Executive Committee Directorate of Information Technologies and Innovation)
    • Context size: 13990B
    • Context chunks: 10
Model gpt-4-0613 failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): The board members mentioned in the provided document are: 1. Christophe Knaub - Board Chairman 2. Guillaume Herve Marie Xavier Lejeune - Board Member 3. Xavier Veyry - Board Member 4. Maria Jesus de Arteaga Larru - Board Member 5. Onur Koçkar - Board Member 6. Yavuz Ölken - Deputy Chairman of Board and CEO 7. Nuria Fernandez Paris - Board Member
    • Context size: 13990B
    • Context chunks: 10
Model gpt-4-32k-0613 failures:
  • Corpus: https://eval-studio-artifacts.s3.amazonaws.com/h2o-eval-studio-suite-library/corpus-h2ogpte-benchmark/AXA-Sigorta-2022-Annual-Report.pdf
    • Error (generation): Generated actual answer did not satisfy the condition: "Christophe Knaub" AND "Yavuz Ölken" AND "Guillaume Herve Marie Xavier Lejeune" AND "Xavier Veyry" AND "Maria Jesus De Arteaga Larru" AND "Nuria Fernandez Paris" AND "Onur Koçkar". The following part of the condition did not match: "Maria Jesus De Arteaga Larru".
    • Prompt: Who are the board members?
    • Output constraints: ['Christophe Knaub', 'Yavuz Ölken', 'Guillaume Herve Marie Xavier Lejeune', 'Xavier Veyry', 'Maria Jesus De Arteaga Larru', 'Nuria Fernandez Paris', 'Onur Koçkar']
    • Expected output: The board members are Christophe Knaub, Yavuz Ölken, Guillaume Herve Marie Xavier Lejeune, Xavier Veyry, Maria Jesus De Arteaga Larru, Nuria Fernandez Paris and Onur Koçkar.
    • Actual output (fragment_level): The board members mentioned in the provided document are: 1. Christophe Knaub - Board Chairman 2. Yavuz Ölken - Deputy Chairman of Board and CEO 3. Guillaume Herve Marie Xavier Lejeune - Board Member 4. Onur Koçkar - Board Member 5. Xavier Veyry - Board Member 6. Maria Jesus de Arteaga Larru - Board Member 7. Nuria Fernandez Paris - Board Member
    • Context size: 13990B
    • Context chunks: 10
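A pattern worth noting in the failures above: gpt-4-0613 and gpt-4-32k-0613 list all seven board members yet still fail on "Maria Jesus De Arteaga Larru", because their answers spell the name "Maria Jesus de Arteaga Larru". This is consistent with a case-sensitive substring match. The following is a hypothetical sketch of such an AND-style constraint check, not the actual Eval Studio implementation:

```python
# Hypothetical sketch of an AND-style output-constraint check, assuming a
# case-sensitive substring match (which would explain why answers containing
# "Maria Jesus de Arteaga Larru" fail the "Maria Jesus De Arteaga Larru" part).

def check_condition(actual: str, constraints: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, unmatched constraint parts)."""
    unmatched = [part for part in constraints if part not in actual]
    return (not unmatched, unmatched)

answer = ("The board members are: 1. Christophe Knaub 2. Yavuz Ölken "
          "3. Maria Jesus de Arteaga Larru (Board Member)")
passed, missing = check_condition(
    answer, ["Christophe Knaub", "Yavuz Ölken", "Maria Jesus De Arteaga Larru"])
# passed is False; missing == ["Maria Jesus De Arteaga Larru"], which the
# report would phrase as: The following part of the condition did not match.
```

Under this assumption, normalizing the case of both the answer and the constraints before matching would turn these particular failures into passes; whether that is desirable depends on how strict the test suite intends to be.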

Explanations
Model explanations created by the explainer, organized by explanation type, with formats (representations) identified by media type:
Explainer parameters
Explainer was run with the following parameters:
Parameter Value Description Type Default value
metric_threshold 0.5 Evaluated metric threshold - values below this threshold are considered problematic. float 0.5
save_llm_result True Controls whether to save the LLM result, which contains the input LLM dataset and all metrics calculated by the evaluator. bool True
evaluate_retrieved_context False Controls whether to also evaluate the retrieved context, i.e., conditions that check whether it contains or does not contain specific strings. bool False
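The interaction between `metric_threshold` and the "Model passes" metric can be sketched as follows. This is an illustrative reconstruction of the assumed semantics (the metric is described as the percentage of successfully evaluated outputs), not the evaluator's actual code:

```python
# Illustrative sketch (assumed semantics): "Model passes" is the fraction of
# test cases whose output satisfied its constraints; scores below
# metric_threshold are reported as problems.

def model_passes_score(case_passed: list[bool]) -> float:
    return sum(case_passed) / len(case_passed)

def is_problem(score: float, metric_threshold: float = 0.5) -> bool:
    # Values below the threshold are considered problematic.
    return score < metric_threshold

# 2 of the 5 benchmark prompts passed -> score 0.4, as in the report above.
score = model_passes_score([True, True, False, False, False])
print(score, is_problem(score))  # 0.4 True
```

With the default threshold of 0.5 and five prompts per corpus, a model must answer at least three prompts correctly to avoid being flagged.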
Explainer metadata
Explainer keywords:
  • llm
  • evaluates_rag
  • evaluates_llm
  • requires_prompts
  • requires_constraints
  • sr-11-7-conceptual-soundness
  • sr-11-7-ongoing-monitoring
  • sr-11-7-outcomes-analysis
  • nist-ai-rmf-safe
  • nist-ai-rmf-secure-and-resilient
  • nist-ai-rmf-privacy-enhanced
  • nist-ai-rmf-fair
  • nist-ai-rmf-accountable-and-transparent
  • nist-ai-rmf-valid-and-reliable
  • problem-type-information-retrieval
  • problem-type-question-answering
  • problem-type-summarization
  • problem-type-regression
  • problem-type-classification
  • problem-type-binary-classification
  • problem-type-multiclass-classification
  • es-purpose-generation
  • evaluation-method-rule-based
  • evaluation-type-deterministic
  • capability-condition-highlight
Explainer run
Explainer run details:
  • Status code:  SUCCESS 
  • Progress: 100%
  • Started: 2026-01-30 14:26:04 +0100
  • Duration: 0.481s
Explainer log
Explainer log file:

Dataset

Interpretation test suite details:

Prompts (5)
How many stores are in Florida?
What was the number of agreements that include human rights clauses, in 2022?
What was the revenue of Brazil?
Who are the board members?
Who is the chairman of the board?

Dataset description:

Models (44)

Interpreted models (LLM, plus corpus in the case of RAG) overview:



RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat-4bit (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat-4bit (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat-4bit (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-70b-chat-4bit (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model lmsys/vicuna-13b-v1.5-16k (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model lmsys/vicuna-13b-v1.5-16k (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model lmsys/vicuna-13b-v1.5-16k (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model lmsys/vicuna-13b-v1.5-16k (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model h2oai/h2ogpt-4096-llama2-13b-chat (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model HuggingFaceH4/zephyr-7b-beta (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model HuggingFaceH4/zephyr-7b-beta (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model HuggingFaceH4/zephyr-7b-beta (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model HuggingFaceH4/zephyr-7b-beta (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model h2oai/h2ogpt-32k-codellama-34b-instruct (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-32k-codellama-34b-instruct (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-32k-codellama-34b-instruct (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model h2oai/h2ogpt-32k-codellama-34b-instruct (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model Yukang/LongAlpaca-70B (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html
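The "most/least accurate" and "fastest/slowest" insights amount to ranking models by their aggregate metric score and mean latency. A hypothetical sketch of that ranking step follows; the numbers are illustrative placeholders, not values taken from this report:

```python
# Hypothetical sketch of how per-model insights ("most/least accurate",
# "fastest/slowest") can be derived from aggregate evaluation results.
# The numbers below are illustrative placeholders only.

mean_scores = {
    "gpt-3.5-turbo-0613": 0.8,
    "gpt-4-0613": 0.6,
    "Yukang/LongAlpaca-70B": 0.4,
}
mean_latency_s = {
    "gpt-3.5-turbo-0613": 1.2,
    "gpt-4-0613": 4.5,
    "Yukang/LongAlpaca-70B": 9.7,
}

most_accurate = max(mean_scores, key=mean_scores.get)
least_accurate = min(mean_scores, key=mean_scores.get)
fastest = min(mean_latency_s, key=mean_latency_s.get)
slowest = max(mean_latency_s, key=mean_latency_s.get)
```

Note that "least accurate" and "slowest" coinciding in one model (as with Yukang/LongAlpaca-70B here) are two independent rankings that happen to agree, not one combined score.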


RAG model Yukang/LongAlpaca-70B (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model Yukang/LongAlpaca-70B (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model Yukang/LongAlpaca-70B (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

Explainers identified the following problems:

Severity Type Problem Suggested actions Explainer Resources
MEDIUM accuracy Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4. Metric details: Percentage of successfully evaluated RAG/LLM outputs. For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model gpt-3.5-turbo-0613 (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model gpt-3.5-turbo-0613 (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model gpt-3.5-turbo-0613 (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model gpt-3.5-turbo-0613 (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

Explainers identified the following insights:

Type Insight Suggested actions Explainer Resources
accuracy Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. Text matching GlobalHtmlFragmentExplanation / text/html
performance Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. A detailed description of the failures, questions, and answers, useful for identifying the model's weaknesses, strengths, and their root causes, can be found in the explanation. Text matching GlobalHtmlFragmentExplanation / text/html


RAG model gpt-3.5-turbo-16k-0613 (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model gpt-3.5-turbo-16k-0613 (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model gpt-3.5-turbo-16k-0613 (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model gpt-3.5-turbo-16k-0613 (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model gpt-4-0613 (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model gpt-4-0613 (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model gpt-4-0613 (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model gpt-4-0613 (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

RAG model gpt-4-32k-0613 (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])

Interpreted model details:

RAG model gpt-4-32k-0613 (docs: ['bradesco-2022-integrated-report.pdf'])

Interpreted model details:

RAG model gpt-4-32k-0613 (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])

Interpreted model details:

RAG model gpt-4-32k-0613 (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])

Interpreted model details:

Configuration and parameters

Interpretation parameters:

H2O Sonar library configuration:

Directories, files and logs

Directories and files: