[SC-16871] Make DeepEval agentic tests natively accessible within the default ValidMind test suite for Enterprise Google Gemini (Vertex AI) by juanmleng · Pull Request #528 · validmind/validmind-library

juanmleng · 2026-06-26T17:09:11Z

Pull Request Description

What and why?

Before this change, calling set_judge_config() to configure a judge LLM applied to
prompt-validation and RAGAS tests but was silently ignored by DeepEval scorers. Scorers
in validmind.scorers.llm.deepeval had their own provider-detection logic in
get_deepeval_model() that never consulted the configured judge, so enterprise customers
using Vertex AI (or any non-default provider) had to write custom scorer boilerplate to
work around it.

This PR wires get_deepeval_model() to honour set_judge_config(): when a judge LLM
has been explicitly configured, the built-in DeepEval scorers (AnswerRelevancy,
ArgumentCorrectness, Hallucination, PlanAdherence, PlanQuality, ToolCorrectness,
and the others in validmind.scorers.llm.deepeval) now use it automatically,
including Vertex AI via langchain-google-vertexai.

Internally, the provider-specific GeminiDeepEvalModel wrapper is replaced by a generic
LangChainDeepEvalModel that wraps any BaseChatModel, and _build_gemini_deepeval_model
is simplified to delegate to it.
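To make the wrapper idea concrete, here is a minimal sketch of a provider-agnostic adapter. It is not the library's LangChainDeepEvalModel: GenericJudgeWrapper, FakeChatModel, and FakeMessage are hypothetical stand-ins, and the method names (load_model, generate, a_generate) are assumed to mirror what a DeepEval custom model exposes. The only behaviour relied on from the wrapped object is invoke()/ainvoke() returning a message with a .content attribute.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for the message object a chat model returns."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a chat model exposing invoke/ainvoke."""

    model = "fake-chat-1"

    def invoke(self, prompt: str) -> FakeMessage:
        return FakeMessage(content=f"echo: {prompt}")

    async def ainvoke(self, prompt: str) -> FakeMessage:
        return self.invoke(prompt)


class GenericJudgeWrapper:
    """Sketch only: adapts any invoke/ainvoke-style chat model to a
    generate/a_generate interface (assumed, not taken from the PR diff)."""

    def __init__(self, chat_model):
        self._chat_model = chat_model

    def load_model(self):
        return self._chat_model

    def generate(self, prompt: str) -> str:
        # Synchronous path: delegate to the wrapped model and unwrap the text.
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        # Asynchronous path: same delegation through ainvoke.
        message = await self.load_model().ainvoke(prompt)
        return message.content


wrapper = GenericJudgeWrapper(FakeChatModel())
sync_reply = wrapper.generate("ping")
async_reply = asyncio.run(wrapper.a_generate("ping"))
```

Because the adapter depends only on the shared chat-model interface, swapping providers would not require a provider-specific wrapper class, which is the motivation the description gives for dropping GeminiDeepEvalModel.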

The configure_judge_llms.ipynb notebook is updated to document this behaviour and
includes an optional set_judge_config() example cell.

How to test

Automated tests

cd validmind-library
make test-unit

All 89 test suites pass with no changes to existing tests.

Manual testing

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from validmind.ai.utils import set_judge_config, get_deepeval_model

set_judge_config(
    judge_llm=ChatGoogleGenerativeAI(model="gemini-3.5-flash"),
    judge_embeddings=GoogleGenerativeAIEmbeddings(model="models/text-embedding-004"),
)

model = get_deepeval_model()
assert type(model).__name__ == "LangChainDeepEvalModel"
assert model.get_model_name() == "gemini-3.5-flash"

Then run a built-in scorer end-to-end:

import pandas as pd
import validmind as vm

df = pd.DataFrame({
    "input": ["What is the capital of France?"],
    "actual_output": ["Paris."],
    "context": [["France's capital is Paris."]],
})
ds = vm.init_dataset(dataset=df, input_id="test", text_column="input", target_column="actual_output")
ds.assign_scores(metrics=["validmind.scorers.llm.deepeval.AnswerRelevancy"])

Expected: DataFrame with a non-null score column, no errors.

What needs special review?

- get_deepeval_model() now checks __judge_llm before provider detection. This is
  the intended priority order, but reviewers should confirm there are no edge cases where
  a stale __judge_llm from a previous set_judge_config() call in the same session
  would produce unexpected results in a notebook that later switches providers via env vars.
- get_model_name() introspection: the new LangChainDeepEvalModel checks model,
  model_name, and azure_deployment attributes in order. If a custom BaseChatModel
  exposes none of these, it falls back to the class name. Confirm this is acceptable for
  all supported providers.
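The name-resolution order described above can be sketched as follows. This is an illustration under assumptions, not the code from the diff: resolve_model_name and the three small classes are hypothetical, and treating only non-empty strings as valid names is a choice made for this sketch.

```python
def resolve_model_name(chat_model) -> str:
    """Return the first usable name attribute, else the class name.

    Attribute order follows the PR description: model, model_name,
    azure_deployment. Skipping non-string or empty values is an
    assumption of this sketch.
    """
    for attr in ("model", "model_name", "azure_deployment"):
        value = getattr(chat_model, attr, None)
        if isinstance(value, str) and value:
            return value
    return type(chat_model).__name__


class WithModel:  # hypothetical: exposes `model`
    model = "gemini-3.5-flash"


class WithDeployment:  # hypothetical: exposes only `azure_deployment`
    azure_deployment = "my-deployment"


class Bare:  # hypothetical: exposes none of the three attributes
    pass


name_from_model = resolve_model_name(WithModel())
name_from_deployment = resolve_model_name(WithDeployment())
name_from_class = resolve_model_name(Bare())
```

The class-name fallback keeps get_model_name() from failing on custom models, at the cost of a less informative label in reports, which is the trade-off reviewers are asked to confirm.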

Dependencies, breaking changes, and deployment notes

None. No new dependencies, no migrations, no env var changes. The existing Gemini
deepeval path (_build_gemini_deepeval_model) is preserved and now delegates to the
new generic wrapper.

No companion PRs in frontend or documentation repos.

Release notes

This is an enhancement. Suggested label: enhancement.

DeepEval scorers (AnswerRelevancy, Hallucination, and others in
validmind.scorers.llm.deepeval) now respect the judge LLM configured via
set_judge_config(). This means any provider — including Vertex AI — configured once
for prompt-validation and RAGAS tests is automatically used by scorers as well, with no
additional setup required.

Checklist

Shortcut: SC-16871

get_deepeval_model() now checks __judge_llm first so any provider configured via set_judge_config() (Gemini, Vertex AI, Azure, etc.) is automatically used by built-in DeepEval scorers without custom boilerplate. Adds generic LangChainDeepEvalModel wrapper and simplifies _build_gemini_deepeval_model() to delegate to it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

DeepEval scorers now use the judge LLM from set_judge_config(): update description, key concepts, DeepEval section, and summary to reflect this. Add optional set_judge_config() example cell. Remove separate scorers notebook (configure_judge_llms_scorers.ipynb).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

johnwalz97

LGTM, but Claude raised a finding that I think needs to be addressed.

johnwalz97 · 2026-06-26T20:21:22Z

+    global __judge_llm
+    if __judge_llm is not None:
+        return _build_langchain_deepeval_model(__judge_llm)


Heads up: gating on if __judge_llm is not None triggers on more than set_judge_config(). __judge_llm also gets set as a side effect of get_judge_config() (line 334), which runs on every prompt-validation test, every RAGAS test, and is_configured(). So once any of those run in a session, this branch is taken for everything after.

Checked against deepeval 3.4.0 with OPENAI_API_KEY set:

scorer first        -> get_deepeval_model() = "gpt-4.1" -> GPTModel (using_native_model=True)
scorer after a test -> get_deepeval_model() = LangChainDeepEvalModel (using_native_model=False)

Same env, same scorer, different eval path depending on what ran first in the kernel.

Probably want to key off intent instead of the cached global:

# module level
__judge_llm_explicitly_set = False

# in set_judge_config(), after validation:
global __judge_llm_explicitly_set
__judge_llm_explicitly_set = True

# here:
if __judge_llm_explicitly_set and __judge_llm is not None:
    return _build_langchain_deepeval_model(__judge_llm)

Also evens out a small inconsistency: this checks is not None while get_judge_config (line 325) and set_judge_config both require embeddings too.

__judge_llm is set as a side effect of get_judge_config(), which runs on every prompt-validation and RAGAS test. The previous `is not None` check caused get_deepeval_model() to wrap the cached LLM even when the user never called set_judge_config(), breaking the native OpenAI DeepEval path when any test ran first in the session.

Add __judge_llm_explicitly_set flag (False by default, True only when set_judge_config() is called) and check both __judge_llm and __judge_embeddings for consistency with get_judge_config().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
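The order-dependence and its fix can be reproduced in a small self-contained sketch. Everything here is a simplified stand-in: the module-level names use a single underscore, strings take the place of real LLM and embeddings objects, and the return values "native-provider-model" and "wrapped:..." are placeholders for the native DeepEval model and the LangChain wrapper respectively.

```python
_judge_llm = None
_judge_embeddings = None
_judge_explicitly_set = False  # True only after set_judge_config()


def get_judge_config():
    """Lazily cache default judge objects as a side effect.

    This does NOT mark the config as explicit, which is the point of the fix.
    """
    global _judge_llm, _judge_embeddings
    if _judge_llm is None or _judge_embeddings is None:
        _judge_llm, _judge_embeddings = "default-llm", "default-embeddings"
    return _judge_llm, _judge_embeddings


def set_judge_config(judge_llm, judge_embeddings):
    """Record an explicit user choice and flag it as intentional."""
    global _judge_llm, _judge_embeddings, _judge_explicitly_set
    if judge_llm is None or judge_embeddings is None:
        raise ValueError("judge_llm and judge_embeddings are both required")
    _judge_llm, _judge_embeddings = judge_llm, judge_embeddings
    _judge_explicitly_set = True


def get_deepeval_model():
    """Prefer the explicit judge; otherwise fall back to provider detection."""
    if (
        _judge_explicitly_set
        and _judge_llm is not None
        and _judge_embeddings is not None
    ):
        return f"wrapped:{_judge_llm}"
    return "native-provider-model"


before = get_deepeval_model()
get_judge_config()  # simulates a prompt-validation or RAGAS test running first
after_side_effect = get_deepeval_model()
set_judge_config("vertex-llm", "vertex-embeddings")
after_explicit = get_deepeval_model()
```

With the flag in place, the cached defaults from get_judge_config() no longer change which path a scorer takes; only an explicit set_judge_config() call does.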

AnilSorathiya

Looking good. Points to consider:

Embeddings required even when DeepEval doesn’t use them

get_deepeval_model() requires both __judge_llm and __judge_embeddings to be set. DeepEval scorers don’t use embeddings, but callers must still pass an embeddings object to set_judge_config(). We might make embeddings optional.

No per-scorer judge override (Could be separate work)

Prompt-validation and RAGAS support per-call judge_llm/judge_embeddings overrides. All 14 DeepEval scorers call get_deepeval_model() with no parameters — global config only.

No reset API

Once set_judge_config() is called, there’s no reset_judge_config() to clear __judge_llm_explicitly_set. Session restart is the only way back to env-based defaults.

When a judge LLM is set via set_judge_config(), DeepEval scorers run through the generic LangChainDeepEvalModel wrapper. Gemini's default with_structured_output() uses function-calling with a first_tool_only parser that intermittently returns None when the model emits no tool call. That None reached DeepEval and surfaced as a cryptic 'NoneType' object has no attribute 'reason'.

The wrapper now:
- retries via the deterministic json_mode (response_schema) path when the function-calling path returns None
- degrades reason-only schemas (AnswerRelevancyScoreReason, HallucinationScoreReason) to a placeholder explanation so the metric score is preserved, logging a warning
- raises a clear, actionable error for score-bearing schemas (statements, verdicts) rather than fabricating a silently wrong score

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
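The three behaviours in this commit message (retry, degrade, raise) can be sketched as a single control flow. This is a simplified illustration, not the library's _structured_generate: the schemas are plain dataclasses rather than DeepEval's own classes, the two generation paths are passed in as hypothetical callables, and the placeholder wording is invented for the example.

```python
import logging
from dataclasses import dataclass, fields

logger = logging.getLogger(__name__)


@dataclass
class ReasonOnly:  # hypothetical reason-only schema
    reason: str


@dataclass
class Verdicts:  # hypothetical score-bearing schema
    verdicts: list


def is_reason_only(schema) -> bool:
    """True when the schema's only field is `reason`."""
    return [f.name for f in fields(schema)] == ["reason"]


def structured_generate(primary, fallback, schema):
    """Try the primary path, retry via the fallback, then degrade or raise."""
    result = primary(schema)
    if result is None:
        result = fallback(schema)  # e.g. a json_mode-style path
    if result is not None:
        return result
    if is_reason_only(schema):
        # The score is computed elsewhere, so a placeholder reason is safe.
        logger.warning("Judge returned no structured output; using placeholder.")
        return schema(reason="No explanation was returned by the judge model.")
    raise ValueError(
        f"Judge returned no structured output for {schema.__name__}; "
        "refusing to fabricate score-bearing fields."
    )


def none_path(schema):
    return None


def json_path(schema):
    return schema(reason="from json_mode")


recovered = structured_generate(none_path, json_path, ReasonOnly)
degraded = structured_generate(none_path, none_path, ReasonOnly)
try:
    structured_generate(none_path, none_path, Verdicts)
    raised = False
except ValueError:
    raised = True
```

Only the explanation is ever replaced with a placeholder. Any schema that carries data feeding into the score fails loudly, which matches the stated goal of never producing a silently wrong score.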

github-actions · 2026-06-30T15:57:28Z

PR Summary

This PR introduces significant enhancements to the configuration and integration of judge LLMs and judge embeddings in the ValidMind library. The changes focus on two main areas:

Notebook Improvements
- The notebook has been updated with clearer instructions on configuring the judge provider. It now emphasizes the usage of the set_judge_config() function which allows users to explicitly set the judge LLM and judge embeddings. This new approach unifies the configuration for prompt-validation tests, RAGAS tests, and DeepEval scorers.
- Text adjustments clarify that DeepEval scorers now use the judge LLM configured via the explicit override rather than relying solely on the automatic provider selection.
Utils Module Enhancements
- A new global flag (__judge_llm_explicitly_set) has been introduced. This flag is set when the user calls set_judge_config() and forces the library to use the explicitly provided judge models, especially for DeepEval scorers.
- New helper functions (_is_reason_only_schema and _require_structured_response) have been added to better handle cases where the judge LLM returns incomplete or unstructured outputs. This enables graceful degradation when only a minimal explanation is provided (e.g., just a reason field).
- The structured generation functions (_structured_generate and its async counterpart) have been updated to retry using a fallback method (json_mode) in case the initial attempt returns None. This improves reliability when interfacing with providers like Gemini.
- The function _build_langchain_deepeval_model has been refactored to create a DeepEval-compatible wrapper over any LangChain ChatModel. It dynamically determines the model name from available attributes, thereby enhancing adaptability across different provider configurations.
- The get_deepeval_model() function now prefers an explicitly set judge configuration over the automatic provider detection if available.
- Updates in set_judge_config() include improved type-checking and the setting of the explicit flag to ensure the correct configuration is used throughout the evaluation workflows.

Overall, these changes improve flexibility, robustness, and clarity in configuring judge LLMs and embeddings for all evaluation paths within the library.

Test Suggestions

- Run the updated notebook in an environment with multiple provider credentials (e.g., both OpenAI and Gemini) to verify that the explicit configuration via set_judge_config takes precedence.
- Simulate a scenario where the judge LLM returns no structured output to confirm that _require_structured_response handles the situation as expected, especially for reason-only schemas.
- Execute tests for the async path using _a_structured_generate to ensure structured responses are correctly handled in asynchronous contexts.
- Test that after calling set_judge_config(), get_deepeval_model returns a DeepEval model built using the explicit configuration rather than the auto-detected provider.
- Ensure that all existing unit tests and integration tests pass after the updates.

juanmleng and others added 3 commits June 26, 2026 19:07

chore: open branch for SC-16871

998e6ea

juanmleng self-assigned this Jun 26, 2026

juanmleng added the enhancement New feature or request label Jun 26, 2026

juanmleng requested review from AnilSorathiya, cachafla and johnwalz97 June 26, 2026 19:48

johnwalz97 approved these changes Jun 26, 2026

juanmleng marked this pull request as ready for review June 26, 2026 21:09

AnilSorathiya approved these changes Jun 29, 2026

juanmleng and others added 3 commits June 29, 2026 17:13

Merge branch 'main' into juan/sc-16871/make-deepeval-agentic-tests-na…

29cf791

…tively-accessible

fix: remove unnecessary global declaration in get_deepeval_model

89e0c62

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cachafla approved these changes Jun 30, 2026

juanmleng merged commit d81fec7 into main Jun 30, 2026
35 of 36 checks passed

juanmleng deleted the juan/sc-16871/make-deepeval-agentic-tests-natively-accessible branch June 30, 2026 17:03

hunner mentioned this pull request Jun 30, 2026

Release v2.13.6 #530

Merged

11 tasks

juanmleng merged 7 commits into main from juan/sc-16871/make-deepeval-agentic-tests-natively-accessible

4 participants