| Interpretation result | |
|---|---|
| Problems | 3 |
| Insights | 5 |
| Models | 44 LLM/RAG models |
| Dataset | 55 inputs |
| Interpretation status | SUCCESS |
| Interpretation ID | 133445c7-feef-4bae-835a-2e2cb726cbf3 |
| Created | 2026-01-30 14:26:04 |
Explainers identified the following problems:
| Severity | Type | Problem | Suggested actions | Explainer | Resources |
|---|---|---|---|---|---|
| MEDIUM | accuracy | Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| MEDIUM | accuracy | Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| MEDIUM | accuracy | Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
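The "Model passes" score behind these problems is simply the fraction of test prompts whose condition was satisfied. A minimal sketch of that arithmetic, using the pass/fail counts reported in the leaderboard tables further below (the helper name is illustrative, not the evaluator's API):

```python
def model_passes(passed: int, failed: int) -> float:
    """Fraction of successfully evaluated RAG/LLM outputs."""
    return passed / (passed + failed)

# Each of the three flagged models passed 2 of its 5 test prompts:
score = model_passes(passed=2, failed=3)   # 0.4
print(score < 0.5)                         # True -> below threshold, reported as a MEDIUM problem
```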
Explainers identified the following insights:
| Type | Insight | Suggested actions | Explainer | Resources |
|---|---|---|---|---|
| accuracy | Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| accuracy | Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| performance | Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| performance | Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| weak-point | Prompt 'Who are the board members?' was evaluated as the most difficult prompt to answer correctly according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
All explainers (1): Text matching (scheduled: 1, finished: 1, successful: 1). This explainer identified the problems and insights listed above.
Evaluator input requirements:
| Question | Expected Answer | Retrieved Context | Actual Answer | Conditions |
|---|---|---|---|---|
| ✓ | ✓ | | | |
Description:
The Text Matching Evaluator assesses whether both the retrieved context (in the case of RAG-hosted models) and the generated answer contain or match a specified set of required strings. The evaluation is based on a boolean expression (condition) that defines which strings must be present:
- quoted strings ("...") match literal substrings and regexp("...") matches regular expressions
- AND, OR, and NOT combine sub-expressions
- parentheses can be used to group expressions
Example 1: Simple string matching
`"15,969"`
The evaluator checks whether the retrieved context and the actual answer contain the string 15,969. If the condition is satisfied, the test case passes.

Example 2: Flexible regex patterns
`regexp("15,?969")`
What if the number 15,969 might be written as 15969 or 15,969? The boolean expression can be extended with a regular expression. The evaluator checks whether the retrieved context and the actual answer contain the string 15,969 or 15969. If the condition is satisfied, the test case passes.

Example 3: Combining string and regex
`"15,969" AND regexp("[Mm]illion")`
The evaluator checks whether the retrieved context and the actual answer contain the string 15,969 and match the regular expression [Mm]illion. If the condition is satisfied, the test case passes.

Example 4: Complex boolean logic
`("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"`
The evaluator checks whether the retrieved context and the actual answer contain either Rio or rio, match the regular expression 15,969 [Mm]il, and do not contain the string Real. If the condition is satisfied, the test case passes.

Example 5: Exact matching with regex anchors
`regexp("^Brazil revenue was 15,969 million$")`
The evaluator checks whether the retrieved context and the actual answer exactly match the regular expression ^Brazil revenue was 15,969 million$. If the condition is satisfied, the test case passes.

Example 6: Case-insensitive matching
`regexp("(?i)python")`
The (?i) flag enables case-insensitive matching. The evaluator matches python, Python, PYTHON, PyThOn, etc. This is useful when the capitalization in the output is unpredictable.

Example 7: OR within regular expressions
`regexp("(cat|dog|bird)")`
Using the pipe | operator inside a group allows matching multiple alternatives. The evaluator matches any of cat, dog, or bird. This is more concise than using multiple OR operators in the boolean expression.

Example 8: Capturing groups and word boundaries
`regexp("\b(error|warning|failure)\b")`
The \b word boundary ensures exact word matching (not as part of a larger word). The regex matches error, warning, or failure as complete words. Parentheses capture the matched text for reference.

Example 9: Repeated patterns and quantifiers
`regexp("\d{3}-\d{3}-\d{4}")`
Quantifiers specify repetition: \d{3} matches exactly 3 digits, + matches one or more, * matches zero or more. This example matches phone numbers in the format 123-456-7890. Use \d for digits, \w for word characters, \s for whitespace.

Example 10: Lookahead and combining patterns
`regexp("(?i)(success|completed).*\d+%")`
This combines the case-insensitive flag (?i), an OR group (success|completed), .* to match any characters, and \d+% to match one or more digits followed by a percent sign. It is useful for matching complex patterns such as progress messages.

Method:
The evaluator uses Python's re module for regular expression matching (the re.search function). See https://docs.python.org/3/howto/regex.html#regex-howto
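To make the matching semantics concrete, here is a minimal, illustrative re-implementation of the idea in plain Python. The helpers below are not the evaluator's API; they just express a condition such as Example 4 with Python operators instead of the condition grammar:

```python
import re

def contains(required: str):
    """Literal substring check, e.g. "15,969"."""
    return lambda text: required in text

def regexp(pattern: str):
    """Regular-expression check using re.search, e.g. regexp("15,?969")."""
    return lambda text: re.search(pattern, text) is not None

# Example 4 expressed with Python operators:
# ("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"
def example_4(text: str) -> bool:
    return ((contains("Rio")(text) or contains("rio")(text))
            and regexp(r"15,?969 [Mm]il")(text)
            and not contains("Real")(text))

answer = "Rio operations reported revenue of 15,969 million."
print(example_4(answer))  # True -> the test case would pass
```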
Metrics calculated by the evaluator:
The evaluator calculates five metrics, each with the range [0.0, 1.0] and a default threshold of 0.5.

Problems reported by the evaluator:
Insights diagnosed by the evaluator:

Evaluator parameters:
- metric_threshold (float): 0.5
- save_llm_result (bool): True
- evaluate_retrieved_context (bool): False
| LLM Models by Success Rate | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 3. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 4. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 5. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 6. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 7. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 8. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 9. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
| LLM Models by Time | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 3. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 4. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 5. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 6. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 7. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 8. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 9. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
| LLM Models by Cost | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 3. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 4. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 5. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 6. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 7. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 8. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 9. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
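The three leaderboards above are different orderings of the same per-model records. A minimal illustrative sketch of how they can be derived (values taken from the tables; the field names are ad hoc, not the report's schema):

```python
# Two of the eleven models, as examples; the remaining records are elided.
records = [
    {"model": "gpt-3.5-turbo-0613 h2oGPTe RAG", "passed": 4, "failed": 1, "time_s": 18.775, "cost": 0.0},
    {"model": "Yukang/LongAlpaca-70B h2oGPTe RAG", "passed": 2, "failed": 3, "time_s": 211.522, "cost": 0.0},
]

by_success = sorted(records, key=lambda r: r["passed"] / (r["passed"] + r["failed"]), reverse=True)
by_time = sorted(records, key=lambda r: r["time_s"])   # fastest first
by_cost = sorted(records, key=lambda r: r["cost"])     # cheapest first
```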
| Most difficult prompts across all models | Failures | Success rate |
|---|---|---|
| Who are the board members? | 11 | 0.000% |
| How many stores are in Florida? | 7 | 36.364% |
| What was the revenue of Brazil? | 4 | 63.636% |
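The prompt success rates above follow directly from the failure counts: each of the 11 models answered every prompt once, so the success rate is the share of models that passed the prompt. An illustrative check:

```python
models = 11
for prompt, failures in [("Who are the board members?", 11),
                         ("How many stores are in Florida?", 7),
                         ("What was the revenue of Brazil?", 4)]:
    success_rate = (models - failures) / models
    print(f"{prompt}: {success_rate:.3%}")   # 0.000%, 36.364%, 63.636%
```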
| Parameter | Value | Description | Type | Default value |
|---|---|---|---|---|
| metric_threshold | 0.5 | Evaluated metric threshold - values below this threshold are considered problematic. | float | 0.5 |
| save_llm_result | True | Control whether to save the LLM result, which contains the input LLM dataset and all metrics calculated by the evaluator. | bool | True |
| evaluate_retrieved_context | False | Control whether to also evaluate the retrieved context - conditions then check whether it contains or does not contain specific strings. | bool | False |
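A minimal sketch of how these parameters could drive the per-test-case evaluation, assuming a condition callable like the one sketched in the evaluator description above (illustrative only, not the evaluator's actual implementation):

```python
from statistics import mean

def evaluate_test_cases(cases, condition, metric_threshold=0.5,
                        evaluate_retrieved_context=False):
    """Return the 'Model passes' score and whether it falls below the threshold."""
    passed_flags = []
    for case in cases:  # each case: {"actual_output": str, "context": str}
        ok = condition(case["actual_output"])
        if evaluate_retrieved_context:
            # when enabled, the retrieved context must satisfy the condition too
            ok = ok and condition(case["context"])
        passed_flags.append(ok)
    score = mean(1.0 if ok else 0.0 for ok in passed_flags)
    return score, score < metric_threshold  # True -> reported as problematic
```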
Interpretation test suite details:
| Prompts (5) |
|---|
| How many stores are in Florida? |
| What was the number of agreements that include human rights clauses, in 2022? |
| What was the revenue of Brazil? |
| Who are the board members? |
| Who is the chairman of the board? |
Evaluation dataset (datatable.Frame, 55 rows x 14 columns):

| Column | Type | Unique values |
|---|---|---|
| key | str | 55 |
| input | str | 5 |
| corpus | str | 1 |
| context | str | 5 |
| categories | str | 2 |
| relationships | str | 1 |
| expected_output | str | 5 |
| output_constraints | str | 5 |
| output_condition | str | 1 |
| actual_output | str | 47 |
| actual_duration | real | 55 |
| cost | real | 1 |
| model_key | str | 44 |
| test_key | str | 1 |
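The per-column counts above can be reproduced with datatable itself. A minimal sketch, assuming the evaluation dataset has been exported to a CSV file (the file path is hypothetical):

```python
import datatable as dt

frame = dt.fread("llm_dataset.csv")   # hypothetical export of the 55-row dataset
print(frame.shape)                    # (55, 14)
print(list(frame.names))              # ['key', 'input', 'corpus', ...]

# Number of distinct values per column, as in the overview table above
for name in frame.names:
    n_unique = dt.unique(frame[:, name]).nrows
    print(f"{name}: {n_unique}")
```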
Interpreted models - LLM and corpus (in case of RAG) - overview:
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4 (see the problems table above for suggested actions).
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- Yukang/LongAlpaca-70B (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator (see the insights table above).
- Yukang/LongAlpaca-70B (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- Yukang/LongAlpaca-70B (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- Yukang/LongAlpaca-70B (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- gpt-4-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-4-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-4-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-4-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
Interpretation parameters:
- dataset: h2o_sonar.lib.api.datasets._datasets_genai.LlmDataset
- results location: /tmp/pytest-of-dvorka/pytest-26/test_async_evaluate0
| Config parameter | Value | Description | Type | Default value |
|---|---|---|---|---|
| h2o_host | localhost | The host of the H2O-3 server that should be used for the explanation that requires it. | str | localhost |
| h2o_port | 12349 | The port of the H2O-3 server that should be used for the explanation that requires it. | int | 12349 |
| h2o_auto_start | True | Automatically start the H2O-3 server on interpretation start (True), or do not start the server (False). | bool | True |
| h2o_auto_cleanup | True | Automatically remove all data from the H2O-3 server on interpretation end (True), or do not remove all data from the server (False). | bool | True |
| h2o_auto_stop | False | Automatically stop the H2O-3 server on interpretation end (True), or do not stop the server (False). | bool | False |
| h2o_min_mem_size | 2G | Minimum memory specification for the H2O-3 server started by H2O Sonar. | int | 2G |
| h2o_max_mem_size | 4G | Maximum memory specification for the H2O-3 server started by H2O Sonar. | int | 4G |
| custom_explainers | [] | List of custom "Bring Your Own Explainer" string locators to be registered on the H2O Sonar run. A locator has the structure "[PACKAGE and MODULE]::[EXPLAINER-CLASS-NAME]", where PACKAGE and MODULE is a dot (.) separated path to the module (installed on PYTHONPATH) and EXPLAINER-CLASS-NAME is the name of the explainer class. Example: ["my_package.explainer_module::MyExplainerClass", "their_package.explainer_module::TheirExplainerClass"] | list | [] |
| look_and_feel | h2o_sonar | Charts theme (look and feel) - one of: 'h2o_sonar', 'blue', 'driverless_ai'. | str | h2o_sonar |
| device | cpu | Device to be used for the calculations. The value of this configuration item might be ``cpu`` or ``gpu``. | str | |
| enable_slow_perturbators | False | Enable slow (agent-based, model-based, resource-intensive) perturbators, which are by default skipped and not listed. | bool | False |
| force_eval_judge | false | Force the use of a custom evaluation judge for the evaluation of the models over the judges used by evaluators by default - for example, to use a local judge in order to avoid sending sensitive data to a 3rd party or to the cloud. The value of this configuration item might be ``false``, ``true``, or the configuration key of the custom evaluation judge. Forcing the use of a custom evaluation judge will automatically reconfigure the embeddings calculation in evaluations to a local model to ensure privacy safety. | str | false |
| multiprocessing_start_method | spawn | Multiprocessing start method - one of: 'spawn', 'fork', 'forkserver' or `None` (default). | str | spawn |
| model_cache_dir | /home/dvorka/.cache/h2o_sonar/models | Directory where the models are cached. If not specified, the models are cached in a default directory in the user home that follows operating system conventions. | str | /home/dvorka/.cache/h2o_sonar/models |
| http_ssl_cert_verify | True | SSL certificate verification for HTTPS requests. If set to ``false``, SSL certificate verification is disabled. If set to ``true``, SSL certificate verification is enabled. If set to the path (string) of a ``CA_BUNDLE`` file or a directory with certificates of trusted CAs, then those will be used for the verification (in this case the directory must have been processed using the c_rehash utility supplied with OpenSSL). | str | true |
| branding | H2O_SONAR | Branding for HTML reports. If not specified, the empty string (auto) is used. Valid values: 'H2O_SONAR', 'EVAL_STUDIO', or '' (empty for auto). | str | |
| per_explainer_logger | True | Create a new logger for each explainer (which logs to the explainer sandbox), or reuse one logger and use the library logger for all log messages. | bool | True |
| create_html_representations | True | Indicate that explainers can create HTML representations (True), or request to skip them (False) for performance/resource-consumption reasons. | bool | True |
| connections | [] | | | |
| licenses | [] | | | |
| evaluation_judges | [] | | | |
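For reference, the configuration values above can be captured as a plain mapping. This is only an illustrative dict mirroring the table, not H2O Sonar's configuration API:

```python
# Illustrative only: configuration values from the table above as a plain dict.
h2o_sonar_config = {
    "h2o_host": "localhost",
    "h2o_port": 12349,
    "h2o_auto_start": True,
    "h2o_auto_cleanup": True,
    "h2o_auto_stop": False,
    "h2o_min_mem_size": "2G",
    "h2o_max_mem_size": "4G",
    "custom_explainers": [],
    "look_and_feel": "h2o_sonar",
    "device": "cpu",
    "enable_slow_perturbators": False,
    "force_eval_judge": "false",
    "multiprocessing_start_method": "spawn",
    "model_cache_dir": "/home/dvorka/.cache/h2o_sonar/models",
    "http_ssl_cert_verify": True,
    "branding": "H2O_SONAR",
    "per_explainer_logger": True,
    "create_html_representations": True,
    "connections": [],
    "licenses": [],
    "evaluation_judges": [],
}
```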
Directories and files: