Benchmark harness for A/B testing Claude Code plugins against OOLONG long-context reasoning tasks. Compares truncation against RLM-RS recursive chunking strategies. Features Claude Code hooks integration, SQLite persistence, and comprehensive scoring aligned with the OOLONG paper methodology.
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
Trustworthy, automated testing for REAPER plugins and extensions - point it at a plugin and get a real pass/fail, no test code required. (macOS today; Windows/Linux WIP)
Performed end-to-end manual testing and source code analysis on the easy.jobs WordPress Plugin (v2.7.1), identifying 8 bugs across 22 test cases covering authentication, job management, candidate management, security, and edge cases.
Binary-criteria evaluation harness for Claude skills, with planned extension to plugins, agents, and MCP servers. Scores every change yes/no across 7 layers: package integrity, trigger quality, functional quality, regression protection, baseline value, model variance, and rollout safety. Never uses graded (gradient) scores.