[SC-16871] Make DeepEval agentic tests natively accessible within the default ValidMind test suite for Enterprise Google Gemini (Vertex AI) #528
get_deepeval_model() now checks __judge_llm first so any provider configured via set_judge_config() (Gemini, Vertex AI, Azure, etc.) is automatically used by built-in DeepEval scorers without custom boilerplate. Adds generic LangChainDeepEvalModel wrapper and simplifies _build_gemini_deepeval_model() to delegate to it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DeepEval scorers now use the judge LLM from set_judge_config() — update description, key concepts, DeepEval section, and summary to reflect this. Add optional set_judge_config() example cell. Remove separate scorers notebook (configure_judge_llms_scorers.ipynb). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
johnwalz97
left a comment
lgtm but claude raised a finding that i think needs addressed
    global __judge_llm
    if __judge_llm is not None:
        return _build_langchain_deepeval_model(__judge_llm)
Heads up: gating on if __judge_llm is not None triggers on more than set_judge_config(). __judge_llm also gets set as a side effect of get_judge_config() (line 334), which runs on every prompt-validation test, every RAGAS test, and is_configured(). So once any of those run in a session, this branch is taken for everything after.
Checked against deepeval 3.4.0 with OPENAI_API_KEY set:
scorer first -> get_deepeval_model() = "gpt-4.1" -> GPTModel (using_native_model=True)
scorer after a test -> get_deepeval_model() = LangChainDeepEvalModel (using_native_model=False)
Same env, same scorer, different eval path depending on what ran first in the kernel.
Probably want to key off intent instead of the cached global:
    # module level
    __judge_llm_explicitly_set = False
    # in set_judge_config(), after validation:
    global __judge_llm_explicitly_set
    __judge_llm_explicitly_set = True
    # here:
    if __judge_llm_explicitly_set and __judge_llm is not None:
        return _build_langchain_deepeval_model(__judge_llm)
Also evens out a small inconsistency: this checks is not None while get_judge_config (line 325) and set_judge_config both require embeddings too.
__judge_llm is set as a side effect of get_judge_config(), which runs on every prompt-validation and RAGAS test. The previous `is not None` check caused get_deepeval_model() to wrap the cached LLM even when the user never called set_judge_config(), breaking the native OpenAI DeepEval path when any test ran first in the session. Add __judge_llm_explicitly_set flag (False by default, True only when set_judge_config() is called) and check both __judge_llm and __judge_embeddings for consistency with get_judge_config(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
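The order dependence described in the review, and the flag-based fix, can be illustrated with a minimal self-contained sketch. Everything here is a stand-in: the single-underscore globals, the string "models", and the tuple return values are assumptions made for illustration and are not the library's actual code.

```python
# Stand-in module state (hypothetical names, not the real validmind globals).
_judge_llm = None               # may be populated as a caching side effect
_judge_embeddings = None
_judge_explicitly_set = False   # True only after set_judge_config()


def set_judge_config(llm, embeddings):
    """Record an explicit user choice of judge."""
    global _judge_llm, _judge_embeddings, _judge_explicitly_set
    _judge_llm, _judge_embeddings = llm, embeddings
    _judge_explicitly_set = True


def get_judge_config():
    """Lazily cache defaults, mimicking the side effect the review points out."""
    global _judge_llm, _judge_embeddings
    if _judge_llm is None:
        _judge_llm, _judge_embeddings = "default-llm", "default-embeddings"
    return _judge_llm, _judge_embeddings


def get_deepeval_model():
    """Wrap the judge only when the user configured one explicitly."""
    if (
        _judge_explicitly_set
        and _judge_llm is not None
        and _judge_embeddings is not None
    ):
        return ("wrapped", _judge_llm)
    # Otherwise fall through to env-based provider detection (stubbed here).
    return ("native", "gpt-4.1")
```

With this gating, calling `get_judge_config()` first no longer changes what `get_deepeval_model()` returns; only `set_judge_config()` switches the scorer onto the wrapped path.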
AnilSorathiya
left a comment
Looking good. Points to consider:
- Embeddings required even when DeepEval doesn’t use them
get_deepeval_model() requires both __judge_llm and __judge_embeddings to be set. DeepEval scorers don’t use embeddings, but callers must still pass an embeddings object to set_judge_config(). We might want to keep embeddings optional.
- No per-scorer judge override (Could be separate work)
Prompt-validation and RAGAS support per-call judge_llm/judge_embeddings overrides. All 14 DeepEval scorers call get_deepeval_model() with no parameters — global config only.
- No reset API
Once set_judge_config() is called, there’s no reset_judge_config() to clear __judge_llm_explicitly_set. Session restart is the only way back to env-based defaults.
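One possible shape for the first and third points, sketched under assumptions: `reset_judge_config()` does not exist in the PR and is a hypothetical name, as are `JudgeState` and `uses_explicit_judge()`. The sketch only shows how optional embeddings and a reset could interact with the explicit-set flag.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class JudgeState:
    """Hypothetical container for the judge globals discussed above."""
    llm: Optional[Any] = None
    embeddings: Optional[Any] = None
    explicitly_set: bool = False


_state = JudgeState()


def set_judge_config(judge_llm: Any, judge_embeddings: Optional[Any] = None) -> None:
    # Embeddings are optional in this sketch, since DeepEval scorers do not use them.
    _state.llm = judge_llm
    _state.embeddings = judge_embeddings
    _state.explicitly_set = True


def reset_judge_config() -> None:
    """Hypothetical helper: clear the explicit judge and return to env-based defaults."""
    _state.llm = None
    _state.embeddings = None
    _state.explicitly_set = False


def uses_explicit_judge() -> bool:
    """True when scorers should use the configured judge instead of provider detection."""
    return _state.explicitly_set and _state.llm is not None
```

A reset along these lines would let a notebook switch back to env-based defaults without restarting the session.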
…tively-accessible
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a judge LLM is set via set_judge_config(), DeepEval scorers run through the generic LangChainDeepEvalModel wrapper. Gemini's default with_structured_output() uses function-calling with a first_tool_only parser that intermittently returns None when the model emits no tool call. That None reached DeepEval and surfaced as a cryptic 'NoneType' object has no attribute 'reason'. The wrapper now:
- retries via the deterministic json_mode (response_schema) path when the function-calling path returns None
- degrades reason-only schemas (AnswerRelevancyScoreReason, HallucinationScoreReason) to a placeholder explanation so the metric score is preserved, logging a warning
- raises a clear, actionable error for score-bearing schemas (statements, verdicts) rather than fabricating a silently wrong score
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
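The three behaviours listed in this commit message can be sketched roughly as follows. This is not the wrapper's real code: `generate_structured`, the `Reason` class, and the two callables standing in for the function-calling and json_mode paths are all hypothetical, and the placeholder text and error wording are assumptions.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)

# Schemas that carry only an explanation, as named in the commit message.
REASON_ONLY_SCHEMAS = {"AnswerRelevancyScoreReason", "HallucinationScoreReason"}


class Reason:
    """Stand-in for a reason-only schema instance."""

    def __init__(self, reason: str) -> None:
        self.reason = reason


def generate_structured(
    schema_name: str,
    function_calling: Callable[[], Optional[Any]],
    json_mode: Callable[[], Optional[Any]],
) -> Any:
    """Try function-calling, retry in json_mode, then degrade or raise."""
    result = function_calling()
    if result is None:
        # The function-calling path produced no tool call; retry deterministically.
        result = json_mode()
    if result is not None:
        return result
    if schema_name in REASON_ONLY_SCHEMAS:
        # The score is computed elsewhere, so a placeholder explanation is safe.
        logger.warning("No structured output for %s; using a placeholder.", schema_name)
        return Reason("No explanation was returned by the judge model.")
    # Score-bearing schemas must not be fabricated.
    raise ValueError(
        f"Judge model returned no structured output for {schema_name}; "
        "cannot compute a score."
    )
```

The split between degrading and raising follows the commit's reasoning: a missing explanation does not change the metric value, whereas a missing verdict would.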
PR Summary
This PR introduces significant enhancements to the configuration and integration of judge LLMs and judge embeddings in the ValidMind library. The changes focus on two main areas:
Overall, these changes improve flexibility, robustness, and clarity in configuring judge LLMs and embeddings for all evaluation paths within the library.
Test Suggestions
Pull Request Description
What and why?
Before this change, calling `set_judge_config()` to configure a judge LLM applied to prompt-validation and RAGAS tests but was silently ignored by DeepEval scorers. Scorers in `validmind.scorers.llm.deepeval` had their own provider-detection logic in `get_deepeval_model()` that never consulted the configured judge, so enterprise customers using Vertex AI (or any non-default provider) had to write custom scorer boilerplate to work around it.
This PR wires `get_deepeval_model()` to honour `set_judge_config()`: when a judge LLM has been explicitly configured, the built-in DeepEval scorers (including `AnswerRelevancy`, `ArgumentCorrectness`, `Hallucination`, `PlanAdherence`, `PlanQuality`, and `ToolCorrectness`) now use it automatically, including with Vertex AI via `langchain-google-vertexai`. Internally, the provider-specific `GeminiDeepEvalModel` wrapper is replaced by a generic `LangChainDeepEvalModel` that wraps any `BaseChatModel`, and `_build_gemini_deepeval_model` is simplified to delegate to it.
The `configure_judge_llms.ipynb` notebook is updated to document this behaviour and includes an optional `set_judge_config()` example cell.
How to test
Automated tests
    cd validmind-library
    make test-unit
All 89 test suites pass with no changes to existing tests.
Manual testing
Then run a built-in scorer end-to-end:
Expected: DataFrame with a non-null score column, no errors.
What needs special review?
- `get_deepeval_model()` now checks `__judge_llm` before provider detection. This is the intended priority order, but reviewers should confirm there are no edge cases where a stale `__judge_llm` from a previous `set_judge_config()` call in the same session would produce unexpected results in a notebook that later switches providers via env vars.
- `get_model_name()` introspection: the new `LangChainDeepEvalModel` checks `model`, `model_name`, and `azure_deployment` attributes in order. If a custom `BaseChatModel` exposes none of these, it falls back to the class name. Confirm this is acceptable for all supported providers.
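For reviewers weighing the second point, the lookup order described above can be sketched as a standalone function. This is an illustration, not the wrapper's actual method: the free function `get_model_name(model)` and the rule that only non-empty strings count as a match are assumptions, and the classes in the usage note are stand-ins for chat models.

```python
from typing import Any

# Attribute names checked in the order given in the PR description.
_NAME_ATTRIBUTES = ("model", "model_name", "azure_deployment")


def get_model_name(model: Any) -> str:
    """Return the first usable name attribute, else the class name."""
    for attr in _NAME_ATTRIBUTES:
        value = getattr(model, attr, None)
        # Assumption: only non-empty strings count as a usable name.
        if isinstance(value, str) and value:
            return value
    return type(model).__name__


class HasModel:
    """Stand-in chat model exposing `model` (and a lower-priority `model_name`)."""
    model = "example-model"
    model_name = "ignored-name"


class HasDeployment:
    """Stand-in chat model exposing only `azure_deployment`."""
    azure_deployment = "example-deployment"


class Bare:
    """Stand-in chat model exposing none of the attributes."""
```

Under these assumptions, `get_model_name(HasModel())` gives `"example-model"`, `get_model_name(HasDeployment())` gives `"example-deployment"`, and `get_model_name(Bare())` falls back to `"Bare"`.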
Dependencies, breaking changes, and deployment notes
None. No new dependencies, no migrations, no env var changes. The existing Gemini deepeval path (`_build_gemini_deepeval_model`) is preserved and now delegates to the new generic wrapper.
No companion PRs in frontend or documentation repos.
Release notes
This is an enhancement. Suggested label: `enhancement`.
DeepEval scorers (`AnswerRelevancy`, `Hallucination`, and others in `validmind.scorers.llm.deepeval`) now respect the judge LLM configured via `set_judge_config()`. This means any provider, including Vertex AI, configured once for prompt-validation and RAGAS tests is automatically used by scorers as well, with no additional setup required.
Checklist
Shortcut: SC-16871