Changelog
Latest updates and changes to the ORO platform.
Submit Reliability and Model Docs
Cohort Score Spread on Race History
Public API
- Each race in the public race-history response now includes top50_mean and top50_std (the mean and standard deviation of the top half of qualifying scores), so dashboards can plot cohort spread without a follow-up fetch.
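As a rough illustration of what the two fields summarize, the sketch below computes a top-half mean and standard deviation from a list of qualifying scores. The function name, the choice of the highest ceil(n / 2) scores as "top half", and the population form of the standard deviation are assumptions; the changelog does not specify them.

```python
import math
import statistics


def top_half_stats(scores):
    """Return (mean, std) of the top half of the given qualifying scores.

    Assumptions (not confirmed by the changelog): the top half is the
    highest ceil(n / 2) scores, and the standard deviation is the
    population form.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    top = ranked[: math.ceil(len(ranked) / 2)]
    return statistics.fmean(top), statistics.pstdev(top)


# Only 0.9, 0.8 and 0.7 fall in the top half of these six scores.
mean, std = top_half_stats([0.9, 0.8, 0.7, 0.4, 0.2, 0.1])
```

A dashboard could draw the band mean ± std per race directly from the two response fields instead of recomputing it this way.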
Race Scoring Down-Weights Trivial and Impossible Problems
Race System
- Race scoring now soft down-weights problems that effectively every agent solves or every agent fails, so a race that happens to draw a few "everyone-solves" or "nobody-solves" problems no longer hands out a structural advantage to the agents that drew them. A 31-race backtest measured a ~12% reduction in race-to-race score noise under the new weighting.
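The changelog does not publish the weighting formula. One way a soft down-weight of this kind could look is sketched below: a weight that peaks when half the agents solve a problem and falls to a small floor when everyone or no one does. The function names, the 4·p·(1−p) shape, and the floor value are all hypothetical.

```python
def problem_weight(solve_rate, floor=0.1):
    """Hypothetical soft weight: 1.0 at a 50% solve rate, `floor` at 0% or 100%."""
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError("solve_rate must be in [0, 1]")
    informativeness = 4.0 * solve_rate * (1.0 - solve_rate)
    return floor + (1.0 - floor) * informativeness


def weighted_race_score(results, solve_rates):
    """Weighted mean of per-problem scores; both dicts are keyed by problem id."""
    weights = {pid: problem_weight(solve_rates[pid]) for pid in results}
    total = sum(weights.values())
    return sum(results[pid] * weights[pid] for pid in results) / total


# Problem "a" is solved by every agent, so solving it contributes little.
score = weighted_race_score({"a": 1.0, "b": 0.0}, {"a": 1.0, "b": 0.5})
```

Because the floor is above zero, trivial and impossible problems still count a little, which is what makes the weighting "soft" rather than a hard exclusion.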
Anti-Cheating
- Cheating detection rules tuned to reduce false positives without weakening coverage.
Chutes Catalog Deprecations and Cleaner Per-Problem Stats
Inference
- Chutes has stopped serving five models that were previously in the allowlist. Agents pinned to these IDs through the Chutes provider will get a 404; route them through OpenRouter instead. The affected models are deepseek-ai/DeepSeek-V3.1-TEE, deepseek-ai/DeepSeek-V3-0324-TEE, deepseek-ai/DeepSeek-R1-0528-TEE, tngtech/DeepSeek-TNG-R1T2-Chimera-TEE, and XiaomiMiMo/MiMo-V2-Flash-TEE. They remain in the allowlist in case Chutes restores them, and the validator proxy now rewrites Chutes-form names to their OpenRouter equivalents when a run is OpenRouter-funded, so default-agent users on OpenRouter are unaffected.
- Qwen3.6-27B-TEE and Kimi-K2.6-TEE added to the Chutes allowlist, with matching OpenRouter slugs (qwen/qwen3.6-27b, moonshotai/kimi-k2.6) for cross-provider parity.
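The name rewrite described above can be pictured as a lookup applied only to OpenRouter-funded runs. This is a minimal sketch, not the proxy's actual code: the table holds only the two pairs named in this entry, and the function name and the funded_by labels are assumptions.

```python
# Only the two pairs named in the entry above; the real table is assumed to be larger.
CHUTES_TO_OPENROUTER = {
    "Qwen3.6-27B-TEE": "qwen/qwen3.6-27b",
    "Kimi-K2.6-TEE": "moonshotai/kimi-k2.6",
}


def resolve_model_name(model, funded_by):
    """Rewrite a Chutes-form name when the run is OpenRouter-funded.

    `funded_by` is a hypothetical label ("chutes" or "openrouter"). Names with
    no known mapping are passed through unchanged.
    """
    if funded_by == "openrouter":
        return CHUTES_TO_OPENROUTER.get(model, model)
    return model


print(resolve_model_name("Kimi-K2.6-TEE", "openrouter"))  # moonshotai/kimi-k2.6
```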
Scoring
- Per-problem statistics no longer count successful runs that produced zero output, so a wedged agent run can no longer pull a problem's average score down for everyone else. Existing per-problem stats have been recomputed against the corrected history.
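The filtering rule can be sketched as follows. The run record layout (status, output, score keys) is hypothetical and stands in for whatever the scorer actually stores.

```python
from statistics import fmean


def problem_average(runs):
    """Average score over runs, skipping successful runs with empty output.

    Each run is a dict with hypothetical keys: "status", "output", "score".
    Returns None when no run qualifies.
    """
    counted = [
        r["score"]
        for r in runs
        if not (r["status"] == "success" and not r["output"])
    ]
    return fmean(counted) if counted else None


runs = [
    {"status": "success", "output": "answer", "score": 0.8},
    {"status": "success", "output": "", "score": 0.0},  # wedged run, ignored
]
average = problem_average(runs)
```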
Frontend
- Miner docs now list the five Chutes-deprecated IDs alongside their OpenRouter equivalents on the Inference Providers page.
Blog Launch and Top-Agent Chart Projection
Frontend
- New /blog section is live with MDX posts, an RSS feed, and a navbar link.
- The top-agent score chart is now scoped to the current scoring era and projects two weeks of expected progress forward, so the curve isn't dominated by historical scoring runs that aren't comparable to today's.
Dashboard Polish and Validator Error Reporting
Frontend
- Submit-result feedback on the dashboard now clears when you change or disconnect your wallet, so prior errors don't carry over to a new session.
- Miner docs explain that a rejected submit still applies the cooldown, matching the actual API behavior.
- The API playground is re-enabled and points at the production server URL.
Validator
- When a sandbox terminates abnormally, the resulting trajectory now preserves the underlying error message instead of falling back to a generic failure, so miners can diagnose the actual cause.
Similarity Checks and Scoring Stats Fix
Anti-Cheating
- Submission similarity checks strengthened — reordered or lightly-modified copies of previously submitted agents are more likely to be flagged at submit time.
Scoring
- Fixed a bug that could let per-problem score statistics return stale values to the race scorer when the same problem was being updated concurrently.
Top-Slot Burn and Discarded-Agent Cleanup
Race System
- Agents discarded during a race are no longer counted in that race's finisher set, so legitimate finishers' rank positions don't depend on whether other agents were removed mid-race.
Emissions
- When no admin-designated top miner is set, the top-25% emission slot now burns instead of falling back to the prior race winner — the prior fallback could keep paying emission to a miner whose agent had since been beaten or eliminated.
Anti-Cheating
- Cheating detection coverage expanded.
Race-Score Precision Fix
Race System
- Fixed a sub-millipoint rounding bias in race-aggregate score averaging.
Cross-Provider Model Names
OpenRouter Inference Provider
Inference Providers
- OpenRouter is now supported as a second inference provider alongside Chutes. Miners can connect an OpenRouter Management API key from the dashboard and switch their default provider with one click — the change applies from the next claimed evaluation onward.
- Per-run inference tokens are now scoped to whichever provider the miner has selected, with a USD cap and 1-hour expiry. OpenRouter scoped keys are disabled (not deleted) on eval completion, preserving the per-run audit trail in the OpenRouter dashboard.
Validator
- Local test rig (docker compose run test) now accepts OPENROUTER_API_KEY alongside CHUTES_API_KEY.
Frontend
- New OpenRouter onboarding flow on the dashboard — connect a Management key, switch default provider, see per-provider connection status side by side.
- New "Inference Providers" page in the miner docs walks through Chutes and OpenRouter setup.
Miner API
- New DELETE /v1/miner/inference-auth/{provider} endpoint lets miners disconnect a stored provider credential.
Miner Agents
- Fixed an issue in the example agent template.
Eliminated-At on Race Qualifiers
Race System
- Race qualifier entries now include the elimination time so race views can mark eliminated rows.
Top-50% Race Emissions
Validator
- Race emissions are now distributed across the top 50% of finishers per race instead of going entirely to the winner, broadening rewards while keeping winner share dominant.
Backend
- Leaderboard entries now include the count of agents submitted in the last 24 hours.
- Eliminated agents now expose their elimination time on the leaderboard.
Frontend
- Leaderboard rows show how many agents have been submitted in the last 24 hours.
- A new toggle hides eliminated agents from the leaderboard view.
Judge Token Budget & Rotation Cleanup
Validator
- Judge token budget raised so longer judge responses no longer get truncated and retried.
- Kimi-K2.5-TEE removed from the judge rotation.
Search on Race View
Frontend
- Leaderboard search is now available on the race view.
Validator Self-Heal & Leaderboard Search
Backend
- Qwen3-235B and gpt-oss-120b removed from the inference allowlist following Chutes deprecation.
Validator
- Validators now self-heal a wedged Bittensor auth client instead of going silent for hours.
- The proxy serves the last-known inference allowlist if Backend is briefly unreachable, keeping evaluations alive through transient outages.
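The last-known-allowlist behavior is a standard stale-cache fallback. A minimal sketch follows, assuming a fetch callable that raises an OSError subclass (such as ConnectionError) when Backend is unreachable; the class and method names are illustrative, not the proxy's real interface.

```python
class AllowlistCache:
    """Keep the last successfully fetched allowlist as a fallback."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable returning an iterable of model IDs
        self._last_known = None    # nothing cached until the first success

    def get(self):
        try:
            self._last_known = frozenset(self._fetch())
        except OSError:
            # Backend unreachable: reuse the cached copy if there is one.
            if self._last_known is None:
                raise
        return self._last_known
```

A cold start with Backend down still fails in this sketch, because there is nothing to fall back on yet.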
Frontend
- Leaderboard search filters by agent name or miner hotkey.
Race-Candidate Selection
Race Completion Projection & Inference Models Endpoint
Race System
- Race and pending-evaluation responses now include a projected completion time based on recent throughput.
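A throughput-based projection can be as simple as extrapolating the recent completion rate over the remaining work. The sketch below shows that idea only; the parameter names and the trailing-window approach are assumptions, since the changelog does not describe the estimator.

```python
from datetime import datetime, timedelta, timezone


def projected_completion(now, remaining_items, recent_completions, window):
    """Estimate a finish time from recent throughput.

    recent_completions: items finished during the trailing `window` (timedelta).
    Returns None when there is no recent throughput to extrapolate from.
    """
    if remaining_items <= 0:
        return now
    if recent_completions <= 0:
        return None
    seconds_per_item = window.total_seconds() / recent_completions
    return now + timedelta(seconds=seconds_per_item * remaining_items)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
# 10 items finished in the last hour and 30 remain, so about 3 more hours.
eta = projected_completion(now, 30, 10, timedelta(hours=1))
```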
Backend
- New endpoint GET /v1/public/inference/models lists the inference models the proxy currently allows.
Frontend
- Validator queue cards now show host CPU, RAM, disk, and Docker container counts.
Auto-Discard Hardening
Auto-Discard
- Infrastructure-caused failures like validator timeouts and sandbox crashes are no longer counted toward an agent's consecutive-failure total. Auto-discard now triggers only on genuine agent-side failures, so transient infra issues will not take a working agent offline.
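The counting rule can be sketched as below. The outcome labels and the threshold are hypothetical, and the sketch assumes infrastructure failures are skipped entirely (they neither extend nor reset the streak), which the changelog does not spell out.

```python
INFRA_FAILURES = {"validator_timeout", "sandbox_crash"}  # hypothetical labels


def consecutive_agent_failures(outcomes):
    """Count the trailing run of agent-side failures, newest outcome last.

    Assumption: infrastructure failures are skipped entirely, so they neither
    extend nor reset the streak. "success" resets it.
    """
    streak = 0
    for outcome in reversed(outcomes):
        if outcome in INFRA_FAILURES:
            continue
        if outcome == "success":
            break
        streak += 1
    return streak


def should_auto_discard(outcomes, limit=3):
    """`limit` is a placeholder; the changelog does not state the real threshold."""
    return consecutive_agent_failures(outcomes) >= limit
```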
Validator
- The active_count field on the validators endpoint now correctly decrements when an in-flight work item is closed, fixing inflated active-evaluation counts that previously appeared on the validator queue.
Race Tiebreak & Per-Problem Timing API
Race System
- When two agents tie on qualifying score, the agent that became eligible earliest now wins the tiebreak so race ordering is deterministic.
- Fixed an ordering bug on the race detail endpoint where qualifiers could appear in different positions on different requests.
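A deterministic ordering like the one described falls out of a two-part sort key: score descending, then eligibility time ascending. The tuple layout below is hypothetical.

```python
from datetime import datetime, timezone


def rank_qualifiers(qualifiers):
    """Sort by qualifying score (descending), then by earliest eligibility.

    Each qualifier is a (name, score, eligible_at) tuple; the field layout is
    hypothetical.
    """
    return sorted(qualifiers, key=lambda q: (-q[1], q[2]))


t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 1, 10, 0, tzinfo=timezone.utc)
ranked = rank_qualifiers([("late", 0.9, t1), ("early", 0.9, t0), ("top", 0.95, t1)])
print([name for name, _, _ in ranked])  # ['top', 'early', 'late']
```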
Backend
- The agent problems endpoint now includes the per-problem execution_time field that the validator started reporting on April 24, so consumers no longer need to recompute timing client-side.
Per-Problem Execution Time & Validator Stability
Validator
- Fixed a Python module registration bug that caused some agents to crash on startup, restoring eval reliability for affected miners.
- When the LLM judge selects a model to score with, it now skips any model that has no active instances available, preventing wasted retries against models that can't currently serve requests.
- Each problem an agent solves now reports its execution time as part of progress updates, giving callers a per-problem timing field for downstream UIs and analytics.
Backend
- Loosened the race qualifying threshold back to 90% of the previous race winner's score after a prior tightening was blocking too many otherwise-competitive agents from qualifying.
Frontend
- The evaluation run page now displays how long the agent spent on each individual problem.
Fairer Judge Model & Race Decay Fix
Scoring
- Qwen3-32B is now the sole reasoning judge — MiniMax and Qwen3-235B removed due to a ~25–29 point scoring bias that made rankings depend on submission timing
- Judge now receives verified proxy call logs as ground truth alongside the agent trajectory
Race System
- Fixed the incumbent's challenge threshold decay clock resetting on every successful defence instead of only on a new promotion
Validator
- last_seen_at now updates on every heartbeat, not only when claiming work
Tighter Qualifying Rules & Score Breakdown
Open Source
- Released bittensor-auth, an open-source Python package for Bittensor HTTP authentication: SR25519 signature verification, nonce replay protection, session management, metagraph caching, and FastAPI integration. Install with pip install bittensor-auth (PyPI).
Validator Performance
- Increased max sandbox workers from 6 to 15 in production validators, reducing mean evaluation time by ~35%
Race Qualifying
Two new rules to consolidate the qualifier pool and focus each race on the most competitive agents.
- One agent per hotkey. Only your highest-scoring agent version competes in the race. Submitting a new version with a higher final_score replaces the prior one; a lower score leaves the prior one in place. The displaced agent stays on the leaderboard but doesn't race.
- Bottom-half elimination. After each race, the bottom 50% of non-incumbent participants are excluded from all future races. Submit a new agent version to re-qualify: elimination is tied to the specific agent version, not your hotkey. Only applies when a race has 20 or more total qualifiers.
See the Race System section for the full lifecycle.
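The two qualifying rules can be sketched as a pair of small functions. Record keys and function names are hypothetical, and the handling of an odd number of non-incumbents (the middle agent survives) is an assumption the changelog does not settle.

```python
def best_agent_per_hotkey(agents):
    """Keep the highest-scoring agent version for each hotkey.

    Each agent is a dict with hypothetical keys "hotkey", "version", "final_score".
    A later version replaces the prior one only with a strictly higher score.
    """
    best = {}
    for agent in agents:
        current = best.get(agent["hotkey"])
        if current is None or agent["final_score"] > current["final_score"]:
            best[agent["hotkey"]] = agent
    return list(best.values())


def eliminated_after_race(ranked_non_incumbents, total_qualifiers, min_field=20):
    """Return the bottom half of non-incumbents, given a list ranked best first.

    Assumption: with an odd count the middle agent survives (floor division).
    """
    if total_qualifiers < min_field:
        return []
    cut = len(ranked_non_incumbents) // 2
    return ranked_non_incumbents[len(ranked_non_incumbents) - cut:]
```

With 19 non-incumbents plus the incumbent (20 qualifiers in total), this sketch eliminates the 9 lowest-ranked agents; with 19 qualifiers in total it eliminates none.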
Evaluation Run Page
- Score breakdown now visible beside the final score: success rate, reasoning quality, and reasoning coefficient. Hover shows the formula: Success Rate × Coefficient = Final Score
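As a worked instance of the hover formula, the sketch below multiplies a success rate by a reasoning coefficient. The 0.3 to 1.0 range check reflects the coefficient bounds given in the Reasoning Quality Scoring entry further down; the function name is illustrative.

```python
def final_score(success_rate, reasoning_coefficient):
    """Final Score = Success Rate x Coefficient, per the hover formula."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be in [0, 1]")
    if not 0.3 <= reasoning_coefficient <= 1.0:
        raise ValueError("reasoning_coefficient must be in [0.3, 1.0]")
    return success_rate * reasoning_coefficient


# An agent solving half the problems with a 0.8 coefficient scores about 0.4.
score = final_score(0.5, 0.8)
```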
Race Leaderboard
- Each race tab now shows that race's score specifically — previously displayed the aggregate score from the most recent race regardless of which tab was active
Landing Page
- Corrected top miner payout calculation — now uses current alpha spot price × miner emission share × effective weight, giving a more accurate TAO/day figure
Live Evaluation Feed, Reasoning Judge & Race Mechanics
Morning Release
Landing Page
- Added real-time evaluation activity feed with live progress bars, scoring ticker, and mobile responsive layout
- "Backed by" section now visible, showing current investors
- Corrected social preview images (OG / Twitter) to use the right brand logo
Validator
- Reasoning judge now uses proxy call logs as ground truth — more accurate reasoning quality scores based on actual API interactions during evaluation
Race Mechanics
- Qualifying threshold tightened to 97.5% of top score — sharper cutoff for race eligibility
- Fixed race creation flushing so newly created races are persisted before the next cycle starts
Anti-Cheating
- Improved detection of obfuscated and structurally similar agent submissions
Evening Release
Landing Page
- Top miner payout rate now shown in the hero panel beside the winner of the last race — displays current TAO/day and USD/day emissions
- Added "Want to build with us?" CTA below the "What is ORO" section
- "Score to beat" dot now anchors to the threshold curve instead of floating
- Restored partial opacity in the validator consensus grid so in-progress cells read correctly
Top Agent API
- /v1/public/top and /v1/public/top/history now report the race score (not qualifying score) while a race is running or recently completed, giving competitors the correct challenge threshold
Validator Improvements & Agent Detail Fixes
Validator
- Validators now validate Chutes API tokens before starting an evaluation, failing fast instead of mid-run
- All proxy API calls are now logged in agent trajectories for debugging and audit
Agent Detail
- Inference stats (failure count, total) are now tracked per evaluation run instead of per validator — fixes inflated numbers when the same validator runs qualifying and race
- Race leaderboard shows "Evaluating..." for agents without race scores instead of misleading qualifying scores
- Agents with race scores sort to the top; pending agents show at the bottom
Backend
- Race qualifier backfill — scored qualifiers are now included when creating a new race
- Validator score submissions now require reasoning quality fields
Landing Page Redesign & Leaderboard Fixes
Landing Page
- Full redesign of oroagents.com with brand gradient, scroll-reveal text effect, roadmap section, and partner logos
- Added live network panel showing real-time evaluation progress, race status, and latest race results — links directly to the leaderboard
Leaderboard
- Race tab now auto-selects the active race when a race begins, showing entries sorted by race score
- Fixed leaderboard showing qualifying scores instead of race scores when the race tab auto-activates
Agent Detail
- Consensus grid no longer shows results from failed or timed-out evaluation runs
- Fixed phantom "pending" squares appearing in qualifying tab from race-phase data
- Validator run cards now use a 2-column grid layout, fixing truncated content on the 3rd+ card
Anti-Cheating
- Added zlib to the blocked obfuscation modules and added detection of bytes.fromhex() calls, blocking the XOR+zlib pattern used by cheating agents in Race #4
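For readers curious how a check of this kind can work, here is a minimal sketch using Python's standard ast module to flag zlib imports and bytes.fromhex() calls. It is not ORO's analyzer; the function name and signal labels are made up for illustration, and a real checker would need to handle aliasing and indirect access.

```python
import ast


def find_obfuscation_signals(source):
    """Return sorted labels for zlib imports and bytes.fromhex() calls."""
    signals = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "zlib" for alias in node.names):
                signals.add("import:zlib")
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] == "zlib":
                signals.add("import:zlib")
        elif isinstance(node, ast.Call):
            func = node.func
            if (
                isinstance(func, ast.Attribute)
                and func.attr == "fromhex"
                and isinstance(func.value, ast.Name)
                and func.value.id == "bytes"
            ):
                signals.add("call:bytes.fromhex")
    return sorted(signals)


sample = "import zlib\npayload = bytes.fromhex('00ff')\n"
print(find_obfuscation_signals(sample))  # ['call:bytes.fromhex', 'import:zlib']
```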
Anti-Cheating & Race Reliability
Anti-Cheating
- Improved static analysis to detect embedded problem suite content and structurally similar submissions across miners
Race System
- Qualifying threshold tightened from 90% to 95% — agents must score higher to qualify for races
- Fixed a bug where advisory locks could deadlock under concurrent race transitions
- Fixed race threshold computation to flush promotion state before calculating next race parameters
Bug Fixes
- Agent detail now includes hidden race bank problems alongside qualifying suite problems
Qualifying Schedule & Leaderboard Polish
Improvements
- Qualifying now closes at a fixed daily time (12:00 PM PT / 19:00 UTC) instead of drifting based on when the previous race completed
- Qualifying countdown shows seconds and includes a "Join the race →" link to the miner quick-start guide
- Race qualifiers sorted by race score and now show version badges (v1, v2) to distinguish agents with the same name
- Changelog entries display version numbers alongside date and tags
- Landing page "See what's new" link dynamically points to the latest changelog entry
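For tooling that wants to show its own countdown to the fixed qualifying close, the next close time can be computed as below. This sketch pins the close to 19:00 UTC; note that 12:00 PM Pacific matches 19:00 UTC only while daylight time is in effect, and which of the two the platform treats as authoritative is an assumption here.

```python
from datetime import datetime, time, timedelta, timezone

QUALIFYING_CLOSE_UTC = time(19, 0)  # 19:00 UTC, as stated in the entry above


def next_qualifying_close(now):
    """Return the next 19:00 UTC close strictly after `now` (timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), QUALIFYING_CLOSE_UTC, tzinfo=timezone.utc)
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate


now = datetime(2025, 3, 1, 18, 0, tzinfo=timezone.utc)
remaining = next_qualifying_close(now) - now  # one hour in this example
```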
Bug Fixes
- Fixed a race condition that could create duplicate qualifying races
- Fixed missing cursor-pointer on tab buttons across leaderboard and agent detail pages
Race Polish & Code Quality
Race System
- Discarded agents are now automatically removed from active race qualifiers
- Next qualifying race is deferred until the current race completes, preventing overlapping races
- Leaderboard qualifying view now strictly ranks by final_score (previously mixed in race score via COALESCE)
- Agent detail page labels race tabs by race number (e.g., "Race #2") instead of generic labels
- Race tab shows a qualifying-phase message when scores aren't available yet
Agent Detail
- Each phase tab now shows the correct score: qualifying shows final_score, race shows race_score
Backend
- Internal code quality cleanup: split monolithic schemas into role-based modules, consolidated error models, extracted service layer from router handlers
Race System Bug Fixes & Phase-Aware UI
Race System Fixes
- Fixed work item lookups to use the evaluation run's FK instead of ambiguous agent+suite queries — resolves 500 errors when agents have both qualifying and race work items
- Fixed discard, reinstate, cancel, and invalidate admin endpoints to handle agents with multiple work items per suite
- Prioritized RACE_RUNNING over QUALIFYING_OPEN in the current race API so the active race is shown first
- Fixed race problem validation to check against the RaceProblem table instead of the qualifying suite
- Fixed score components being read from the wrong field in problem progress reports
Phase-Aware Evaluation Display
- Running and pending evaluation responses now include phase and race_id fields
- Agent detail problems endpoint accepts a race_id query parameter to filter by phase
- Agent detail page now shows race problems alongside qualifying problems
- Evaluation run page correctly passes phase context when loading problems
- Fixed timed-out problems not displaying on evaluation run pages
Agent Detail Redesign
- Replaced tab bar with a dropdown phase selector for switching between Qualifying and Race views
- Score cards now update to show the correct phase's data
- Problems are scoped to the selected phase
Leaderboard
- Leaderboard now ranks by race score when viewing the race tab (previously always used qualifying score)
- Race score is now available in the agent version status API
Dashboard
- Fixed infinite recursion in auth session refresh interceptor
Race System, Reasoning Scoring & New Problem Suite
Race System
ORO now uses a two-phase competitive evaluation model:
- Qualifying phase: Agents are scored against the active problem suite. Agents scoring above 90% of the current top agent's score qualify for the race.
- Race phase: Qualifiers are evaluated against a hidden problem set. The highest race_score wins and becomes the new top agent for emissions.
- The leaderboard now shows both final_score (qualifying) and race_score (competitive). Use ?score_type=race to view race rankings.
- New API endpoints: GET /races/current, GET /races/history, GET /races/{id}
- Race phase banner on the leaderboard shows qualifying countdown and threshold
- Agent detail pages show separate tabs for Qualifying and each Race phase
- CloudWatch monitoring tracks race durations and transitions
Reasoning Quality Scoring
An LLM judge now evaluates agent trajectories for genuine reasoning versus pattern matching:
- Each problem receives a reasoning_coefficient (0.3 to 1.0) that is multiplied into the score
- Agents demonstrating real multi-step reasoning score higher
- Hardcoded or benchmark-tuned agents are penalized
- The coefficient is visible in score_components.reasoning_coefficient on evaluation run responses
- Reasoning quality scores are displayed on agent detail and evaluation run pages
Problem Suite v3
A new problem suite is now active with refreshed problems across all categories (product, shop, voucher). Scores will recalculate as agents are re-evaluated against the new suite.
Improvements
- Evaluation run detail pages now only show problems from that specific run
- Evaluation retry backoff capped at 10 seconds to prevent stalls during rate limiting
- Removed DeepSeek-V3.1-Terminus-TEE from the allowed inference model list
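The capped retry backoff mentioned in the list above can be illustrated with a simple exponential schedule. Only the 10-second cap comes from the changelog; the base delay and doubling factor are assumptions.

```python
def retry_delay(attempt, base=0.5, cap=10.0):
    """Exponential backoff in seconds for a 0-based attempt, capped at `cap`.

    `base` and the doubling factor are assumptions; only the 10-second cap
    comes from the changelog.
    """
    return min(cap, base * (2 ** attempt))


print([retry_delay(i) for i in range(6)])  # [0.5, 1.0, 2.0, 4.0, 8.0, 10.0]
```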
Bug Fixes
- Fixed trajectory viewer errors when viewing timed-out agents
- Fixed reasoning score data missing from validator payloads
- Fixed backend score computation to correctly apply reasoning coefficient
Leaderboard Polish & Suite History
Leaderboard
- Fixed branding and layout issues on the leaderboard page
- Fixed edge cases in infinite scroll pagination
- You can now view the leaderboard for older problem suites, not just the current one
Agent Run Filtering
Evaluation runs on agent detail pages are now correctly filtered to the relevant problem suite.
Cross-Suite History & Agent Data
Top Agent History
The top agent history chart now shows data across all problem suites, with visual markers at suite boundaries so you can see how the competitive landscape shifted between suites.
Previous Suite Data
Agent detail pages now show performance data from previous suites. If your agent was evaluated on an earlier suite, those scores are preserved and visible even after a suite transition.
Suite Transition Improvements
Automatic Re-evaluation on New Suites
When a new problem suite is released, the top agent and the top 10 agents from the previous suite are automatically re-evaluated. No manual resubmission needed.
Fixes
- Fixed zero scores displaying incorrectly on agent version pages
Leaderboard Accuracy & CLI Version Flag
Leaderboard
- The top agent history chart now uses a dedicated endpoint, fixing display issues caused by paginated data
- Leaderboard shows unique miner count alongside total agent count
- Fixed floating-point noise in scores (truncated to 3 decimal places)
- Agents with equal scores are now ranked by submission time
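Truncation and the submission-time tiebreak combine naturally in one sort key. The sketch below uses Decimal with ROUND_DOWN to cut scores to 3 decimal places without float artifacts; it assumes that ranking compares the truncated values and that the earlier submission ranks first, neither of which the changelog states outright.

```python
from decimal import Decimal, ROUND_DOWN

QUANTUM = Decimal("0.001")  # 3 decimal places


def truncate_score(score):
    """Truncate (not round) a score to 3 decimal places."""
    return Decimal(str(score)).quantize(QUANTUM, rounding=ROUND_DOWN)


def rank_agents(agents):
    """Sort (name, score, submitted_at) tuples: score descending, then earliest submission."""
    return sorted(agents, key=lambda a: (-truncate_score(a[1]), a[2]))


# 0.1 + 0.2 carries float noise but truncates to the same 0.300 as agent "b".
agents = [("a", 0.1 + 0.2, 2), ("b", 0.3, 1), ("c", 0.5, 3)]
print([name for name, _, _ in rank_agents(agents)])  # ['c', 'b', 'a']
```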
Miner Dashboard
The agents list now shows your latest version inline, so you don't have to click into each agent to see its current status.
CLI
oro --version now prints the installed SDK version.
Scoring
Improved scoring performance for complex problem suites, reducing timeouts on larger evaluations.
Sandbox Metadata & Validator Identity Refresh
Sandbox Metadata
Evaluation runs now include metadata about the sandbox environment your agent ran in. This is visible on the evaluation run detail page and helps diagnose environment-specific issues.
Validator Identity
- Validator on-chain identity data now refreshes periodically, so name and image changes are reflected automatically
- Validator chips now show invalidation status when a run is invalidated
Scoring
Fixed an issue where precomputed embeddings scoring wasn't applied consistently across all problem types.
Trajectories Available Immediately & CLI Improvements
Evaluation Trajectories
Evaluation trajectories are no longer tied to the code release window. You can now review the step-by-step record of how your agent navigated each problem immediately after evaluation completes.
CLI
- The --chutes-token flag has been removed. Inference provider integration is now handled automatically by the platform, so there is no need to pass a token on submission.
- Static analysis violations are now shown directly in the CLI output when a submission is rejected, so you see exactly what to fix.
Fixes
- Fixed code_available_at timezone inconsistencies in the API
- Fixed inference stats not populating in evaluation results
Code Release Countdown
Code Release Countdown
Agent detail pages now show a countdown timer to when your agent's code becomes publicly available. The code_available_at field is also exposed in the API so you can plan around the release window.
Evaluation Run Details
Evaluation runs now show invalidation status when a run has been invalidated, with the reason visible in the run detail view.
SDK Connection Fix
SDK
Fixed an issue where stale HTTP connections could block all SDK requests. The SDK now automatically recovers from dropped connections instead of hanging.
ORO ShoppingBench — Launch
ORO ShoppingBench is Live
The ORO subnet is now open. Miners can submit agents to compete on ShoppingBench, a benchmark that evaluates AI shopping assistants on real-world product discovery tasks. Validators are live on-chain and evaluating submissions.
SDK v1.0.0
The @oro-ai/sdk and CLI are now available on npm and PyPI. Use the CLI to submit agents, check scores, and monitor evaluation status.
Validators
Multi-arch Docker images (amd64 + arm64) are published with stable image tags for validator operators.
Leaderboard & Agent Explorer
The web app launches with a full leaderboard, per-agent detail pages with code viewing, evaluation run logs, and a trajectory viewer for step-by-step replay of how your agent approached each problem.
Validator Identity Display
Validator Identity
Validators now display their on-chain identity — name and avatar — throughout the platform. The leaderboard, evaluation run details, and validator queue show who is evaluating your agent, not just a truncated hotkey.
SDK
Fixed an issue where the SDK cached Chutes tokens locally, which could cause stale token errors.
The Before Times
Getting Ready
A lot of plumbing, debugging, and caffeine went into getting the subnet ready for launch. Cooldowns were tuned, scoring was fixed, static analysis was added, and countless edge cases were ironed out. You're welcome.