| Interpretation result | |
|---|---|
| Problems | 3 |
| Insights | 5 |
| Models | 44 LLM/RAG models |
| Dataset | 55 inputs |
| Interpretation status | SUCCESS |
| Interpretation ID | 133445c7-feef-4bae-835a-2e2cb726cbf3 |
| Created | 2026-01-30 14:26:04 |
Explainers identified the following problems:
| Severity | Type | Problem | Suggested actions | Explainer | Resources |
|---|---|---|---|---|---|
| MEDIUM | accuracy | Evaluated model h2oai/h2ogpt-4096-llama2-13b-chat failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| MEDIUM | accuracy | Evaluated model HuggingFaceH4/zephyr-7b-beta failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| MEDIUM | accuracy | Evaluated model Yukang/LongAlpaca-70B failed to satisfy the threshold 0.5 for the metric "Model passes", with an average score of 0.4. Metric details: percentage of successfully evaluated RAG/LLM outputs. | For all failed test cases, check the prompt, expected answer, and condition to see if they are correct. Then, examine the model's answers in the failed cases and look for a common denominator or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
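The "Model passes" score behind these problems is simply the fraction of test prompts whose condition was satisfied. A minimal sketch of that arithmetic, using the pass/fail counts reported in the leaderboard tables further below (the helper name is illustrative, not the evaluator's API):

```python
def model_passes(passed: int, failed: int) -> float:
    """Fraction of successfully evaluated RAG/LLM outputs."""
    return passed / (passed + failed)

# Each of the three flagged models passed 2 of its 5 test prompts:
score = model_passes(passed=2, failed=3)   # 0.4
print(score < 0.5)                         # True -> below threshold, reported as a MEDIUM problem
```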
Explainers identified the following insights:
| Type | Insight | Suggested actions | Explainer | Resources |
|---|---|---|---|---|
| accuracy | Model gpt-3.5-turbo-0613 was evaluated as the most accurate model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| accuracy | Model Yukang/LongAlpaca-70B was evaluated as the least accurate model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| performance | Model gpt-3.5-turbo-0613 was evaluated as the fastest model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| performance | Model Yukang/LongAlpaca-70B was evaluated as the slowest model according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. | Text matching | GlobalHtmlFragmentExplanation / text/html |
| weak-point | Prompt 'Who are the board members?' was evaluated as the most difficult prompt to answer correctly according to the Text matching evaluator. | The explanation contains a detailed description of the failures, questions, and answers that helps identify the weaknesses and strengths of the model and their root causes. Check the prompt, expected answer, and condition: are they correct? Check the model's answers in the failed cases and look for a common denominator and/or root cause of these failures. | Text matching | GlobalHtmlFragmentExplanation / text/html |
All explainers (1): Text matching (scheduled: 1, finished: 1, successful: 1). This explainer identified the problems and insights listed above.
Evaluator input requirements:
| Question | Expected Answer | Retrieved Context | Actual Answer | Conditions |
|---|---|---|---|---|
| ✓ | ✓ | | | |
Description:
The Text Matching Evaluator assesses whether both the retrieved context (in the case of RAG-hosted models) and the generated answer contain or match a specified set of required strings. The evaluation is based on a boolean expression (condition) that defines which strings must be present:
- quoted strings ("...") match literal substrings and regexp("...") matches regular expressions
- AND, OR, and NOT combine sub-expressions
- parentheses can be used to group expressions
Example 1: Simple string matching
`"15,969"`
The evaluator checks whether the retrieved context and the actual answer contain the string 15,969. If the condition is satisfied, the test case passes.

Example 2: Flexible regex patterns
`regexp("15,?969")`
What if the number 15,969 might be written as 15969 or 15,969? The boolean expression can be extended with a regular expression. The evaluator checks whether the retrieved context and the actual answer contain the string 15,969 or 15969. If the condition is satisfied, the test case passes.

Example 3: Combining string and regex
`"15,969" AND regexp("[Mm]illion")`
The evaluator checks whether the retrieved context and the actual answer contain the string 15,969 and match the regular expression [Mm]illion. If the condition is satisfied, the test case passes.

Example 4: Complex boolean logic
`("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"`
The evaluator checks whether the retrieved context and the actual answer contain either Rio or rio, match the regular expression 15,969 [Mm]il, and do not contain the string Real. If the condition is satisfied, the test case passes.

Example 5: Exact matching with regex anchors
`regexp("^Brazil revenue was 15,969 million$")`
The evaluator checks whether the retrieved context and the actual answer exactly match the regular expression ^Brazil revenue was 15,969 million$. If the condition is satisfied, the test case passes.

Example 6: Case-insensitive matching
`regexp("(?i)python")`
The (?i) flag enables case-insensitive matching. The evaluator matches python, Python, PYTHON, PyThOn, etc. This is useful when the capitalization in the output is unpredictable.

Example 7: OR within regular expressions
`regexp("(cat|dog|bird)")`
Using the pipe | operator inside a group allows matching multiple alternatives. The evaluator matches any of cat, dog, or bird. This is more concise than using multiple OR operators in the boolean expression.

Example 8: Capturing groups and word boundaries
`regexp("\b(error|warning|failure)\b")`
The \b word boundary ensures exact word matching (not as part of a larger word). The regex matches error, warning, or failure as complete words. Parentheses capture the matched text for reference.

Example 9: Repeated patterns and quantifiers
`regexp("\d{3}-\d{3}-\d{4}")`
Quantifiers specify repetition: \d{3} matches exactly 3 digits, + matches one or more, * matches zero or more. This example matches phone numbers in the format 123-456-7890. Use \d for digits, \w for word characters, \s for whitespace.

Example 10: Lookahead and combining patterns
`regexp("(?i)(success|completed).*\d+%")`
This combines the case-insensitive flag (?i), an OR group (success|completed), .* to match any characters, and \d+% to match one or more digits followed by a percent sign. It is useful for matching complex patterns such as progress messages.

Method:
The evaluator uses Python's re module for regular expression matching (the re.search function). See https://docs.python.org/3/howto/regex.html#regex-howto
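To make the matching semantics concrete, here is a minimal, illustrative re-implementation of the idea in plain Python. The helpers below are not the evaluator's API; they just express a condition such as Example 4 with Python operators instead of the condition grammar:

```python
import re

def contains(required: str):
    """Literal substring check, e.g. "15,969"."""
    return lambda text: required in text

def regexp(pattern: str):
    """Regular-expression check using re.search, e.g. regexp("15,?969")."""
    return lambda text: re.search(pattern, text) is not None

# Example 4 expressed with Python operators:
# ("Rio" OR "rio") AND regexp("15,?969 [Mm]il") AND NOT "Real"
def example_4(text: str) -> bool:
    return ((contains("Rio")(text) or contains("rio")(text))
            and regexp(r"15,?969 [Mm]il")(text)
            and not contains("Real")(text))

answer = "Rio operations reported revenue of 15,969 million."
print(example_4(answer))  # True -> the test case would pass
```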
Metrics calculated by the evaluator:
The evaluator calculates five metrics, each with the range [0.0, 1.0] and a default threshold of 0.5.

Problems reported by the evaluator:
Insights diagnosed by the evaluator:

Evaluator parameters:
- metric_threshold (float): 0.5
- save_llm_result (bool): True
- evaluate_retrieved_context (bool): False
| LLM Models by Success Rate | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 3. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 4. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 5. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 6. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 7. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 8. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 9. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
| LLM Models by Time | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 3. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 4. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 5. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 6. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 7. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 8. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 9. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
| LLM Models by Cost | Pass | Fail | Success rate | Total time | Cost | |
|---|---|---|---|---|---|---|
| 1. | gpt-3.5-turbo-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 18.775s | $0.000 |
| 2. | gpt-3.5-turbo-16k-0613 h2oGPTe RAG | 3 | 2 | 60.000% | 20.997s | $0.000 |
| 3. | gpt-4-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 30.675s | $0.000 |
| 4. | gpt-4-32k-0613 h2oGPTe RAG | 4 | 1 | 80.000% | 37.538s | $0.000 |
| 5. | HuggingFaceH4/zephyr-7b-beta h2oGPTe RAG | 2 | 3 | 40.000% | 50.480s | $0.000 |
| 6. | h2oai/h2ogpt-4096-llama2-13b-chat h2oGPTe RAG | 2 | 3 | 40.000% | 56.327s | $0.000 |
| 7. | h2oai/h2ogpt-4096-llama2-70b-chat h2oGPTe RAG | 3 | 2 | 60.000% | 68.370s | $0.000 |
| 8. | lmsys/vicuna-13b-v1.5-16k h2oGPTe RAG | 3 | 2 | 60.000% | 69.986s | $0.000 |
| 9. | h2oai/h2ogpt-32k-codellama-34b-instruct h2oGPTe RAG | 3 | 2 | 60.000% | 70.819s | $0.000 |
| 10. | h2oai/h2ogpt-4096-llama2-70b-chat-4bit h2oGPTe RAG | 3 | 2 | 60.000% | 111.219s | $0.000 |
| 11. | Yukang/LongAlpaca-70B h2oGPTe RAG | 2 | 3 | 40.000% | 211.522s | $0.000 |
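The three leaderboards above are different orderings of the same per-model records. A minimal illustrative sketch of how they can be derived (values taken from the tables; the field names are ad hoc, not the report's schema):

```python
# Two of the eleven models, as examples; the remaining records are elided.
records = [
    {"model": "gpt-3.5-turbo-0613 h2oGPTe RAG", "passed": 4, "failed": 1, "time_s": 18.775, "cost": 0.0},
    {"model": "Yukang/LongAlpaca-70B h2oGPTe RAG", "passed": 2, "failed": 3, "time_s": 211.522, "cost": 0.0},
]

by_success = sorted(records, key=lambda r: r["passed"] / (r["passed"] + r["failed"]), reverse=True)
by_time = sorted(records, key=lambda r: r["time_s"])   # fastest first
by_cost = sorted(records, key=lambda r: r["cost"])     # cheapest first
```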
| Most difficult prompts across all models | Failures | Success rate |
|---|---|---|
| Who are the board members? | 11 | 0.000% |
| How many stores are in Florida? | 7 | 36.364% |
| What was the revenue of Brazil? | 4 | 63.636% |
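The prompt success rates above follow directly from the failure counts: each of the 11 models answered every prompt once, so the success rate is the share of models that passed the prompt. An illustrative check:

```python
models = 11
for prompt, failures in [("Who are the board members?", 11),
                         ("How many stores are in Florida?", 7),
                         ("What was the revenue of Brazil?", 4)]:
    success_rate = (models - failures) / models
    print(f"{prompt}: {success_rate:.3%}")   # 0.000%, 36.364%, 63.636%
```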
| Parameter | Value | Description | Type | Default value |
|---|---|---|---|---|
| metric_threshold | 0.5 | Evaluated metric threshold - values below this threshold are considered problematic. | float | 0.5 |
| save_llm_result | True | Control whether to save the LLM result, which contains the input LLM dataset and all metrics calculated by the evaluator. | bool | True |
| evaluate_retrieved_context | False | Control whether to also evaluate the retrieved context - conditions then check whether it contains or does not contain specific strings. | bool | False |
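A minimal sketch of how these parameters could drive the per-test-case evaluation, assuming a condition callable like the one sketched in the evaluator description above (illustrative only, not the evaluator's actual implementation):

```python
from statistics import mean

def evaluate_test_cases(cases, condition, metric_threshold=0.5,
                        evaluate_retrieved_context=False):
    """Return the 'Model passes' score and whether it falls below the threshold."""
    passed_flags = []
    for case in cases:  # each case: {"actual_output": str, "context": str}
        ok = condition(case["actual_output"])
        if evaluate_retrieved_context:
            # when enabled, the retrieved context must satisfy the condition too
            ok = ok and condition(case["context"])
        passed_flags.append(ok)
    score = mean(1.0 if ok else 0.0 for ok in passed_flags)
    return score, score < metric_threshold  # True -> reported as problematic
```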
Interpretation test suite details:
| Prompts (5) |
|---|
| How many stores are in Florida? |
| What was the number of agreements that include human rights clauses, in 2022? |
| What was the revenue of Brazil? |
| Who are the board members? |
| Who is the chairman of the board? |
Evaluation dataset (datatable.Frame, 55 rows x 14 columns):

| Column | Type | Unique values |
|---|---|---|
| key | str | 55 |
| input | str | 5 |
| corpus | str | 1 |
| context | str | 5 |
| categories | str | 2 |
| relationships | str | 1 |
| expected_output | str | 5 |
| output_constraints | str | 5 |
| output_condition | str | 1 |
| actual_output | str | 47 |
| actual_duration | real | 55 |
| cost | real | 1 |
| model_key | str | 44 |
| test_key | str | 1 |
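The per-column counts above can be reproduced with datatable itself. A minimal sketch, assuming the evaluation dataset has been exported to a CSV file (the file path is hypothetical):

```python
import datatable as dt

frame = dt.fread("llm_dataset.csv")   # hypothetical export of the 55-row dataset
print(frame.shape)                    # (55, 14)
print(list(frame.names))              # ['key', 'input', 'corpus', ...]

# Number of distinct values per column, as in the overview table above
for name in frame.names:
    n_unique = dt.unique(frame[:, name]).nrows
    print(f"{name}: {n_unique}")
```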
Interpreted models - LLM and corpus (in case of RAG) - overview:
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-4096-llama2-70b-chat-4bit (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- lmsys/vicuna-13b-v1.5-16k (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4 (see the problems table above for suggested actions).
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-4096-llama2-13b-chat (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- HuggingFaceH4/zephyr-7b-beta (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- h2oai/h2ogpt-32k-codellama-34b-instruct (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- Yukang/LongAlpaca-70B (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator (see the insights table above).
- Yukang/LongAlpaca-70B (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- Yukang/LongAlpaca-70B (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- Yukang/LongAlpaca-70B (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Problem (MEDIUM, accuracy): failed to satisfy the threshold 0.5 for the metric "Model passes" with an average score of 0.4.
  - Insights (accuracy, performance): evaluated as the least accurate and the slowest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
  - Insights (accuracy, performance): evaluated as the most accurate and the fastest model according to the Text matching evaluator.
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-3.5-turbo-16k-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- gpt-4-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-4-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-4-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-4-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection c83fc72d-1425-4899-86b8-061d8613e1a0, RAG collection (docs: ['Coca-Cola-FEMSA-Results-1Q23-vf-2.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 114687dc-6339-4309-b8f0-6e049d0424a0, RAG collection (docs: ['bradesco-2022-integrated-report.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 76df8161-c0d1-414a-a046-92c6570ba9a1, RAG collection (docs: ['AXA-Sigorta-2022-Annual-Report.pdf'])
- gpt-4-32k-0613 (h2ogpte), collection 61ca9c07-9266-410d-a19c-5aaece5102a5, RAG collection (docs: ['lowes-2022ar-full-report-4-6-23-final.pdf'])
Interpretation parameters:
- dataset: h2o_sonar.lib.api.datasets._datasets_genai.LlmDataset
- results location: /tmp/pytest-of-dvorka/pytest-26/test_async_evaluate0
| Config parameter | Value | Description | Type | Default value |
|---|---|---|---|---|
| h2o_host | localhost | The host of the H2O-3 server that should be used for the explanation that requires it. | str | localhost |
| h2o_port | 12349 | The port of the H2O-3 server that should be used for the explanation that requires it. | int | 12349 |
| h2o_auto_start | True | Automatically start the H2O-3 server on interpretation start (True), or do not start the server (False). | bool | True |
| h2o_auto_cleanup | True | Automatically remove all data from the H2O-3 server on interpretation end (True), or do not remove all data from the server (False). | bool | True |
| h2o_auto_stop | False | Automatically stop the H2O-3 server on interpretation end (True), or do not stop the server (False). | bool | False |
| h2o_min_mem_size | 2G | Minimum memory specification for the H2O-3 server started by H2O Sonar. | int | 2G |
| h2o_max_mem_size | 4G | Maximum memory specification for the H2O-3 server started by H2O Sonar. | int | 4G |
| custom_explainers | [] | List of custom "Bring Your Own Explainer" string locators to be registered on the H2O Sonar run. A locator has the structure "[PACKAGE and MODULE]::[EXPLAINER-CLASS-NAME]", where PACKAGE and MODULE is a dot (.) separated path to the module (installed on PYTHONPATH) and EXPLAINER-CLASS-NAME is the name of the explainer class. Example: ["my_package.explainer_module::MyExplainerClass", "their_package.explainer_module::TheirExplainerClass"] | list | [] |
| look_and_feel | h2o_sonar | Charts theme (look and feel) - one of: 'h2o_sonar', 'blue', 'driverless_ai'. | str | h2o_sonar |
| device | cpu | Device to be used for the calculations. The value of this configuration item might be ``cpu`` or ``gpu``. | str | |
| enable_slow_perturbators | False | Enable slow (agent-based, model-based, resource-intensive) perturbators, which are by default skipped and not listed. | bool | False |
| force_eval_judge | false | Force the use of a custom evaluation judge for the evaluation of the models over the judges used by evaluators by default - for example, to use a local judge in order to avoid sending sensitive data to a 3rd party or to the cloud. The value of this configuration item might be ``false``, ``true``, or the configuration key of the custom evaluation judge. Forcing the use of a custom evaluation judge will automatically reconfigure the embeddings calculation in evaluations to a local model to ensure privacy safety. | str | false |
| multiprocessing_start_method | spawn | Multiprocessing start method - one of: 'spawn', 'fork', 'forkserver' or `None` (default). | str | spawn |
| model_cache_dir | /home/dvorka/.cache/h2o_sonar/models | Directory where the models are cached. If not specified, the models are cached in a default directory in the user home that follows operating system conventions. | str | /home/dvorka/.cache/h2o_sonar/models |
| http_ssl_cert_verify | True | SSL certificate verification for HTTPS requests. If set to ``false``, SSL certificate verification is disabled. If set to ``true``, SSL certificate verification is enabled. If set to the path (string) of a ``CA_BUNDLE`` file or a directory with certificates of trusted CAs, then those will be used for the verification (in this case the directory must have been processed using the c_rehash utility supplied with OpenSSL). | str | true |
| branding | H2O_SONAR | Branding for HTML reports. If not specified, the empty string (auto) is used. Valid values: 'H2O_SONAR', 'EVAL_STUDIO', or '' (empty for auto). | str | |
| per_explainer_logger | True | Create a new logger for each explainer (which logs to the explainer sandbox), or reuse one logger and use the library logger for all log messages. | bool | True |
| create_html_representations | True | Indicate that explainers can create HTML representations (True), or request to skip them (False) for performance/resource-consumption reasons. | bool | True |
| connections | [] | | | |
| licenses | [] | | | |
| evaluation_judges | [] | | | |
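For reference, the configuration values above can be captured as a plain mapping. This is only an illustrative dict mirroring the table, not H2O Sonar's configuration API:

```python
# Illustrative only: configuration values from the table above as a plain dict.
h2o_sonar_config = {
    "h2o_host": "localhost",
    "h2o_port": 12349,
    "h2o_auto_start": True,
    "h2o_auto_cleanup": True,
    "h2o_auto_stop": False,
    "h2o_min_mem_size": "2G",
    "h2o_max_mem_size": "4G",
    "custom_explainers": [],
    "look_and_feel": "h2o_sonar",
    "device": "cpu",
    "enable_slow_perturbators": False,
    "force_eval_judge": "false",
    "multiprocessing_start_method": "spawn",
    "model_cache_dir": "/home/dvorka/.cache/h2o_sonar/models",
    "http_ssl_cert_verify": True,
    "branding": "H2O_SONAR",
    "per_explainer_logger": True,
    "create_html_representations": True,
    "connections": [],
    "licenses": [],
    "evaluation_judges": [],
}
```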
Directories and files: