
[SC-16871] Make DeepEval agentic tests natively accessible within the default ValidMind test suite for Enterprise Google Gemini (Vertex AI)#528

Merged
juanmleng merged 7 commits into
mainfrom
juan/sc-16871/make-deepeval-agentic-tests-natively-accessible
Jun 30, 2026

@juanmleng juanmleng commented Jun 26, 2026


Pull Request Description

What and why?

Before this change, calling set_judge_config() to configure a judge LLM applied to
prompt-validation and RAGAS tests but was silently ignored by DeepEval scorers. Scorers
in validmind.scorers.llm.deepeval had their own provider-detection logic in
get_deepeval_model() that never consulted the configured judge, so enterprise customers
using Vertex AI (or any non-default provider) had to write custom scorer boilerplate to
work around it.

This PR wires get_deepeval_model() to honour set_judge_config(): when a judge LLM
has been explicitly configured, the built-in DeepEval scorers (AnswerRelevancy,
ArgumentCorrectness, Hallucination, PlanAdherence, PlanQuality, ToolCorrectness,
among others) now use it automatically, including Vertex AI via langchain-google-vertexai.

Internally, the provider-specific GeminiDeepEvalModel wrapper is replaced by a generic
LangChainDeepEvalModel that wraps any BaseChatModel, and _build_gemini_deepeval_model
is simplified to delegate to it.
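
As a rough illustration of the generic-wrapper idea, here is a minimal sketch that uses only the standard library. The names FakeChatModel, FakeMessage, and LangChainJudgeWrapper are hypothetical stand-ins, not the classes added in this PR; the real LangChainDeepEvalModel presumably subclasses DeepEval's base LLM class and wraps a real LangChain BaseChatModel. The sketch only assumes that a chat model exposes invoke() and ainvoke() returning a message with a .content attribute.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat message; only .content is used."""
    content: str


class FakeChatModel:
    """Hypothetical stand-in for a LangChain chat model."""
    model = "fake-chat-1"

    def invoke(self, prompt: str) -> FakeMessage:
        return FakeMessage(content=f"echo: {prompt}")

    async def ainvoke(self, prompt: str) -> FakeMessage:
        return self.invoke(prompt)


class LangChainJudgeWrapper:
    """Sketch of a provider-agnostic wrapper around any chat model."""

    def __init__(self, chat_model: Any) -> None:
        self.chat_model = chat_model

    def load_model(self) -> Any:
        # The wrapped model is returned unchanged; no provider-specific setup.
        return self.chat_model

    def generate(self, prompt: str) -> str:
        return self.load_model().invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        message = await self.load_model().ainvoke(prompt)
        return message.content

    def get_model_name(self) -> str:
        # Simplified here: prefer a `model` attribute, else the class name.
        fallback = type(self.chat_model).__name__
        return str(getattr(self.chat_model, "model", fallback))


wrapper = LangChainJudgeWrapper(FakeChatModel())
sync_answer = wrapper.generate("ping")
async_answer = asyncio.run(wrapper.a_generate("ping"))
```

Because the wrapper depends only on the chat-model interface, a provider-specific builder can simply construct its chat model and hand it to the same wrapper.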

The configure_judge_llms.ipynb notebook is updated to document this behaviour and
includes an optional set_judge_config() example cell.

How to test

Automated tests

cd validmind-library
make test-unit

All 89 test suites pass with no changes to existing tests.

Manual testing

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from validmind.ai.utils import set_judge_config, get_deepeval_model

set_judge_config(
    judge_llm=ChatGoogleGenerativeAI(model="gemini-3.5-flash"),
    judge_embeddings=GoogleGenerativeAIEmbeddings(model="models/text-embedding-004"),
)

model = get_deepeval_model()
assert type(model).__name__ == "LangChainDeepEvalModel"
assert model.get_model_name() == "gemini-3.5-flash"

Then run a built-in scorer end-to-end:

import pandas as pd
import validmind as vm

df = pd.DataFrame({
    "input": ["What is the capital of France?"],
    "actual_output": ["Paris."],
    "context": [["France's capital is Paris."]],
})
ds = vm.init_dataset(dataset=df, input_id="test", text_column="input", target_column="actual_output")
ds.assign_scores(metrics=["validmind.scorers.llm.deepeval.AnswerRelevancy"])

Expected: DataFrame with a non-null score column, no errors.

What needs special review?

  • get_deepeval_model() now checks __judge_llm before provider detection. This is
    the intended priority order, but reviewers should confirm there are no edge cases where
    a stale __judge_llm from a previous set_judge_config() call in the same session
    would produce unexpected results in a notebook that later switches providers via env vars.
  • get_model_name() introspection — the new LangChainDeepEvalModel checks model,
    model_name, and azure_deployment attributes in order. If a custom BaseChatModel
    exposes none of these, it falls back to the class name. Confirm this is acceptable for
    all supported providers.
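
For reviewers weighing the second point, the lookup order can be sketched as below. This is an illustration under assumptions rather than the code in the PR: resolve_model_name and the three example classes are hypothetical, and treating an empty attribute value as "missing" is a choice made for this sketch.

```python
from typing import Any

# Attribute names checked in order, as described in the review note above.
_NAME_ATTRIBUTES = ("model", "model_name", "azure_deployment")


def resolve_model_name(chat_model: Any) -> str:
    """Return the first non-empty name attribute, else the class name."""
    for attribute in _NAME_ATTRIBUTES:
        value = getattr(chat_model, attribute, None)
        if value:  # assumption: None and "" both count as missing
            return str(value)
    return type(chat_model).__name__


class GeminiLike:
    model = "gemini-3.5-flash"


class AzureLike:
    azure_deployment = "my-deployment"


class Anonymous:
    """Exposes none of the known attributes."""


names = [resolve_model_name(m) for m in (GeminiLike(), AzureLike(), Anonymous())]
```

With these stand-ins, the third model resolves to its class name, which is the fallback case reviewers are asked to confirm is acceptable.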

Dependencies, breaking changes, and deployment notes

None. No new dependencies, no migrations, no env var changes. The existing Gemini
deepeval path (_build_gemini_deepeval_model) is preserved and now delegates to the
new generic wrapper.

No companion PRs in frontend or documentation repos.

Release notes

This is an enhancement. Suggested label: enhancement.

DeepEval scorers (AnswerRelevancy, Hallucination, and others in
validmind.scorers.llm.deepeval) now respect the judge LLM configured via
set_judge_config(). This means any provider — including Vertex AI — configured once
for prompt-validation and RAGAS tests is automatically used by scorers as well, with no
additional setup required.

Checklist

  • What and why
  • Screenshots or videos (Frontend)
  • How to test
  • What needs special review
  • Dependencies, breaking changes, and deployment notes
  • Labels applied
  • PR linked to Shortcut
  • Unit tests added (Backend)
  • Tested locally
  • Documentation updated (if required)
  • Environment variable additions/changes documented (if required)

Shortcut: SC-16871

juanmleng and others added 3 commits June 26, 2026 19:07
get_deepeval_model() now checks __judge_llm first so any provider
configured via set_judge_config() (Gemini, Vertex AI, Azure, etc.)
is automatically used by built-in DeepEval scorers without custom
boilerplate. Adds generic LangChainDeepEvalModel wrapper and
simplifies _build_gemini_deepeval_model() to delegate to it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepEval scorers now use the judge LLM from set_judge_config() —
update description, key concepts, DeepEval section, and summary to
reflect this. Add optional set_judge_config() example cell. Remove
separate scorers notebook (configure_judge_llms_scorers.ipynb).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@juanmleng juanmleng self-assigned this Jun 26, 2026
@juanmleng juanmleng added the enhancement New feature or request label Jun 26, 2026

@johnwalz97 johnwalz97 left a comment


lgtm, but claude raised a finding that i think needs to be addressed

Comment thread validmind/ai/utils.py Outdated
Comment on lines +345 to +347
global __judge_llm
if __judge_llm is not None:
    return _build_langchain_deepeval_model(__judge_llm)


Heads up: gating on if __judge_llm is not None triggers on more than set_judge_config(). __judge_llm also gets set as a side effect of get_judge_config() (line 334), which runs on every prompt-validation test, every RAGAS test, and is_configured(). So once any of those run in a session, this branch is taken for everything after.

Checked against deepeval 3.4.0 with OPENAI_API_KEY set:

scorer first        -> get_deepeval_model() = "gpt-4.1"  -> GPTModel               (using_native_model=True)
scorer after a test -> get_deepeval_model() = LangChainDeepEvalModel  (using_native_model=False)

Same env, same scorer, different eval path depending on what ran first in the kernel.

Probably want to key off intent instead of the cached global:

# module level
__judge_llm_explicitly_set = False

# in set_judge_config(), after validation:
global __judge_llm_explicitly_set
__judge_llm_explicitly_set = True

# here:
if __judge_llm_explicitly_set and __judge_llm is not None:
    return _build_langchain_deepeval_model(__judge_llm)

Also evens out a small inconsistency: this checks is not None while get_judge_config (line 325) and set_judge_config both require embeddings too.

__judge_llm is set as a side effect of get_judge_config(), which runs
on every prompt-validation and RAGAS test. The previous `is not None`
check caused get_deepeval_model() to wrap the cached LLM even when the
user never called set_judge_config(), breaking the native OpenAI DeepEval
path when any test ran first in the session.

Add __judge_llm_explicitly_set flag (False by default, True only when
set_judge_config() is called) and check both __judge_llm and
__judge_embeddings for consistency with get_judge_config().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
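
The gating described in the commit message above can be sketched with a small state object. This is a simplified model under assumptions, not the library code: JudgeState, the string placeholders, and the function bodies are hypothetical, and the real module uses module-level globals rather than a dataclass.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class JudgeState:
    """Hypothetical stand-in for the module-level judge globals."""
    llm: Optional[Any] = None
    embeddings: Optional[Any] = None
    explicitly_set: bool = False


def set_judge_config(state: JudgeState, judge_llm: Any, judge_embeddings: Any) -> None:
    state.llm = judge_llm
    state.embeddings = judge_embeddings
    state.explicitly_set = True  # only an explicit call records user intent


def get_judge_config(state: JudgeState) -> Tuple[Any, Any]:
    # Caches env-based defaults as a side effect, without marking them explicit.
    if state.llm is None or state.embeddings is None:
        state.llm = "default-llm"
        state.embeddings = "default-embeddings"
    return state.llm, state.embeddings


def get_deepeval_model(state: JudgeState) -> str:
    configured = state.llm is not None and state.embeddings is not None
    if state.explicitly_set and configured:
        return f"wrapped:{state.llm}"
    return "native-provider-model"  # env-based provider detection


state = JudgeState()
get_judge_config(state)                  # e.g. a RAGAS test ran first
after_test = get_deepeval_model(state)   # cached defaults do not trigger wrapping
set_judge_config(state, "vertex-llm", "vertex-embeddings")
after_explicit = get_deepeval_model(state)
```

In this model, the scorer path no longer depends on which test ran first in the kernel; only an explicit set_judge_config() call switches it to the wrapped judge.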
@juanmleng juanmleng marked this pull request as ready for review June 26, 2026 21:09

@AnilSorathiya AnilSorathiya left a comment


Looking good. Points to consider:

  1. Embeddings required even when DeepEval doesn’t use them

get_deepeval_model() requires both __judge_llm and __judge_embeddings to be set. DeepEval scorers don’t use embeddings, but callers must still pass an embeddings object to set_judge_config(). We might make embeddings optional.

  2. No per-scorer judge override (could be separate work)

Prompt-validation and RAGAS tests support per-call judge_llm/judge_embeddings overrides. All 14 DeepEval scorers call get_deepeval_model() with no parameters, so they can only use the global config.

  3. No reset API

Once set_judge_config() is called, there’s no reset_judge_config() to clear __judge_llm_explicitly_set. Restarting the session is the only way back to env-based defaults.

juanmleng and others added 3 commits June 29, 2026 17:13
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a judge LLM is set via set_judge_config(), DeepEval scorers run
through the generic LangChainDeepEvalModel wrapper. Gemini's default
with_structured_output() uses function-calling with a first_tool_only
parser that intermittently returns None when the model emits no tool
call. That None reached DeepEval and surfaced as a cryptic
'NoneType' object has no attribute 'reason'.

The wrapper now:
- retries via the deterministic json_mode (response_schema) path when
  the function-calling path returns None
- degrades reason-only schemas (AnswerRelevancyScoreReason,
  HallucinationScoreReason) to a placeholder explanation so the metric
  score is preserved, logging a warning
- raises a clear, actionable error for score-bearing schemas
  (statements, verdicts) rather than fabricating a silently wrong score

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
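
The three behaviours listed in the commit message above can be sketched as a single fallback chain. This is a minimal illustration under assumptions: structured_generate, is_reason_only, the callables standing in for the function-calling and json_mode paths, and the "schema is a set of field names" representation are all hypothetical, and the placeholder text and error message are invented for the sketch.

```python
import logging
from typing import Callable, Optional, Set

logger = logging.getLogger("judge_fallback_sketch")

PLACEHOLDER_REASON = "No explanation was returned by the judge model."


def is_reason_only(schema_fields: Set[str]) -> bool:
    # Assumption: a reason-only schema has exactly one field named "reason".
    return set(schema_fields) == {"reason"}


def structured_generate(
    primary: Callable[[], Optional[dict]],
    fallback: Callable[[], Optional[dict]],
    schema_fields: Set[str],
) -> dict:
    """Try the primary path, retry via the fallback, then degrade or raise."""
    result = primary()
    if result is None:
        result = fallback()  # stands in for the json_mode retry
    if result is not None:
        return result
    if is_reason_only(schema_fields):
        logger.warning("Judge returned no structured output; using placeholder reason.")
        return {"reason": PLACEHOLDER_REASON}
    raise ValueError(
        "Judge LLM returned no structured output for a score-bearing schema; "
        "retry or configure a different judge model."
    )


recovered = structured_generate(lambda: None, lambda: {"reason": "ok"}, {"reason"})
degraded = structured_generate(lambda: None, lambda: None, {"reason"})
```

The split matters because a missing explanation does not change the metric score, whereas fabricating statements or verdicts would silently change it.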
@github-actions


PR Summary

This PR introduces significant enhancements to the configuration and integration of judge LLMs and judge embeddings in the ValidMind library. The changes focus on two main areas:

  1. Notebook Improvements

    • The notebook has been updated with clearer instructions on configuring the judge provider. It now emphasizes the usage of the set_judge_config() function which allows users to explicitly set the judge LLM and judge embeddings. This new approach unifies the configuration for prompt-validation tests, RAGAS tests, and DeepEval scorers.
    • Text adjustments clarify that DeepEval scorers now use the judge LLM configured via the explicit override rather than relying solely on the automatic provider selection.
  2. Utils Module Enhancements

    • A new global flag (__judge_llm_explicitly_set) has been introduced. This flag is set when the user calls set_judge_config() and forces the library to use the explicitly provided judge models, especially for DeepEval scorers.
    • New helper functions (_is_reason_only_schema and _require_structured_response) have been added to better handle cases where the judge LLM returns incomplete or unstructured outputs. This enables graceful degradation when only a minimal explanation is provided (e.g., just a reason field).
    • The structured generation functions (_structured_generate and its async counterpart) have been updated to retry using a fallback method (json_mode) in case the initial attempt returns None. This improves reliability when interfacing with providers like Gemini.
    • The function _build_langchain_deepeval_model has been refactored to create a DeepEval-compatible wrapper over any LangChain ChatModel. It dynamically determines the model name from available attributes, thereby enhancing adaptability across different provider configurations.
    • The get_deepeval_model() function now prefers an explicitly set judge configuration over the automatic provider detection if available.
    • Updates in set_judge_config() include improved type-checking and the setting of the explicit flag to ensure the correct configuration is used throughout the evaluation workflows.

Overall, these changes improve flexibility, robustness, and clarity in configuring judge LLMs and embeddings for all evaluation paths within the library.

Test Suggestions

  • Run the updated notebook in an environment with multiple provider credentials (e.g., both OpenAI and Gemini) to verify that the explicit configuration via set_judge_config takes precedence.
  • Simulate a scenario where the judge LLM returns no structured output to confirm that _require_structured_response handles the situation as expected, especially for reason-only schemas.
  • Execute tests for the async path using _a_structured_generate to ensure structured responses are correctly handled in asynchronous contexts.
  • Test that after calling set_judge_config(), get_deepeval_model returns a DeepEval model built using the explicit configuration rather than the auto-detected provider.
  • Ensure that all existing unit tests and integration tests pass after the updates.

@juanmleng juanmleng merged commit d81fec7 into main Jun 30, 2026
35 of 36 checks passed
@juanmleng juanmleng deleted the juan/sc-16871/make-deepeval-agentic-tests-natively-accessible branch June 30, 2026 17:03
@hunner hunner mentioned this pull request Jun 30, 2026
11 tasks
