DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation

Like a Noisy Sensor. It Changed Which Autonomous-Driving Evaluator I Would Ship.

There’s a particular kind of result that looks impressive until you ask the wrong second question.

In this project, that result was a Pearson correlation of 0.753 from a text-only Claude judge grading autonomous-driving visual-QA answers. At first glance, that looks like a usable evaluator. It tracks the gold scores, it produces rationales, and it’s a strong closed model. Good enough to triage model outputs, right?

Then I looked at quadratic-weighted Cohen’s κ. It was 0.057.

That’s the moment the project changed. The judge was rank-correlated with the gold labels, but it was not behaving like an ordinal safety evaluator. It had learned the safest-looking failure mode: compress almost everything toward the middle of the 1–5 scale. For ordinary benchmark reporting, that might pass unnoticed. For an autonomous-driving review pipeline that must flag bad answers before they gate a software release, it’s dangerous.

So I built DiffuJudge-AV, a small evaluation-of-evaluation framework for LLM/VLM judges on driving video. The idea is simple: treat a judge’s score as a noisy observation of a latent true rubric score, deliberately expose the judge to known sources of scoring bias, then denoise the resulting score distribution with a one-step Tweedie posterior mean and report calibrated uncertainty.

Across 28,400 judge evaluations on Wayve’s LingoQA benchmark, the most interesting finding was not that a larger closed model won. It didn’t. The best judge in the experiment was Qwen2.5-VL-7B, an open 7B vision-language model. It reached:

Pearson r = 0.857
Spearman ρ = 0.856
Quadratic-weighted Cohen’s κ = 0.837
MAE = 0.57
Fail-detection F1 = 0.712

Note: The LingoQA benchmark is released under a non-commercial license. The dataset creators at Wayve have granted permission for its use in this article.

For this AV-style evaluation task, an open VLM was not just competitive. It was better on the metrics that actually matter.

Why “evaluation of evaluation”?

When a model answers a question about a driving scene, the obvious evaluation question is:

Did the model answer correctly?

For example:

Question: Are there any parked cars on the side of the road? Reference: Yes, there are two cars parked on the right. Candidate answer (model under test): I don’t know. Gold score: 1.13 (low).

For a human, this is easy. Watch the clip, compare the answer to the scene, assign a score. At scale, though, human evaluation becomes the bottleneck. Modern autonomy stacks generate more perception clips, scenario logs, counterfactual rollouts, and model outputs than any annotation team can score manually. So teams naturally reach for LLM-as-a-Judge or VLM-as-a-Judge: give a model the question, reference, candidate answer, rubric, and sometimes the frames, then ask it to score.

That creates a second-order problem:

If the judge is a model, how do we know the judge is reliable?

This is evaluation of evaluation (eval-of-eval). Instead of only asking whether the AV model is correct, we ask whether the evaluator itself is stable, calibrated, bias-resistant, and useful for downstream decisions. Recent papers (Judging the Judges by Shi et al., IJCNLP-AACL 2025; JETTS by Salesforce, 2025; CALM by Ye et al., ICLR 2025) have catalogued structural failure modes in LLM judges: position bias, verbosity bias, scoring-ID-format bias, self-inconsistency across runs, and severe score compression.

There is also a more uncomfortable claim from Wang Lun’s recent essay, Your Evals Will Break and You Won’t See It Coming: evaluation infrastructure fails silently when models cross capability thresholds, because current benchmarks assume incremental improvement. His proposed remedy is adaptive evals that detect their own obsolescence. DiffuJudge-AV is one concrete step in that direction. By attaching a calibrated uncertainty to every score the judge emits, the framework widens its own confidence interval before the point estimate misleads you.

For autonomous driving this matters operationally. If a learned evaluator decides which failures get escalated to human review, which scenarios enter a regression suite, or which releases deserve more scrutiny, then the evaluator’s failure modes become part of the safety story.

The intuition: a judge score is a noisy sensor reading

An LLM judge score looks clean because it’s a number: 1, 2, 3, 4, or 5.

But that number can move for reasons that have nothing to do with the actual quality of the answer. Change the order of options. Paraphrase the rubric. Reorder criteria. Swap score labels from Arabic numerals to Roman. Resample exemplars. Change temperature. Shuffle the video frames you sample. The true answer quality didn’t change. The judge changed.

That suggests a useful mental model:

Treat the judge like a noisy sensor.

There is a latent score $s_0$. The judge never observes it directly. Each prompt variant produces a noisy reading

$\tilde{s}_t = s_0 + \epsilon_t, \quad t \in \{1, \ldots, 7\}$

Here $t$ is not a diffusion timestep in the image-generation sense. It is a documented source of judge perturbation, drawn directly from the 2024–2025 LLM-as-a-Judge bias literature. Seven canonical sources, each one a controlled noise level:

Image created by author using figurelabs

Level t	Perturbation	What it tests	Reference
1	option / order swap	position bias	Shi et al., 2025
2	rubric paraphrase	prompt sensitivity	SPUQ, arXiv 2403.02509
3	criterion reorder	rubric-order sensitivity	Chen et al., 2025
4	score-ID format swap (1–5 / I–V / A–E)	scoring-format bias	Chen et al., 2025
5	temperature noise	self-inconsistency	Thakur et al., 2025
6	exemplar resample	few-shot variance	classical
7	frame shuffle (video)	temporal robustness	this work

For each item, the framework runs the judge across all seven perturbation levels with k = 3 samples each, giving roughly 22 score observations per item instead of one. That is a practical measurement of judge instability. It also lets us run a reverse step.
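The forward cascade can be sketched in a few lines. This is a minimal illustration rather than the repository’s code: `toy_judge` is a hypothetical stand-in for a real judge call, the per-level noise scale is invented, and the single unperturbed anchor reading is an assumption made here to arrive at 22 observations (7 levels × 3 samples + 1).

```python
import random

# The seven perturbation sources from the table above (names are illustrative).
PERTURBATIONS = [
    "option_order_swap", "rubric_paraphrase", "criterion_reorder",
    "score_id_format_swap", "temperature_noise", "exemplar_resample",
    "frame_shuffle",
]

def toy_judge(latent_score, level, rng):
    """Hypothetical judge: latent score plus level-dependent Gaussian noise."""
    sigma = 0.1 + 0.05 * level          # invented noise scale, not measured
    noisy = latent_score + rng.gauss(0.0, sigma)
    return min(5.0, max(1.0, noisy))    # keep the reading on the 1-5 scale

def run_cascade(latent_score, k=3, seed=0):
    """Collect k noisy readings per perturbation level, keyed by level."""
    rng = random.Random(seed)
    # Level 0 is an assumed unperturbed anchor reading.
    samples = {0: [toy_judge(latent_score, 0, rng)]}
    for level, _name in enumerate(PERTURBATIONS, start=1):
        samples[level] = [toy_judge(latent_score, level, rng) for _ in range(k)]
    return samples

samples = run_cascade(latent_score=2.0)
n_obs = sum(len(v) for v in samples.values())
print(n_obs)  # 22
```

Keeping the readings tagged by level is what later makes a per-level variance, and a per-source bias audit, possible.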

The denoising step: Tweedie in one equation

The diffusion analogy becomes useful because there is a classical result behind denoising: Tweedie’s formula (Robbins 1956; revived for modern diffusion by Manor & Michaeli, ICLR 2024).

If a noisy observation $\tilde{s}$ is generated by adding Gaussian noise to a latent clean value $s_0$, the posterior mean and variance are:

$\hat{s}_0 = \tilde{s} + \sigma_t^2 \, \nabla_{\tilde{s}} \log p(\tilde{s})$

$\text{Var}[\,s_0 \mid \tilde{s}\,] = \sigma_t^2 + \sigma_t^4 \, \nabla_{\tilde{s}}^2 \log p(\tilde{s})$

The framework estimates $p(\tilde{s})$ with a Gaussian KDE over the per-item sampled scores. Within-perturbation-level variance gives $\sigma_t^2$. Pooling across levels is precision-weighted before the Tweedie correction. Two outputs come out of this single reverse step:

A denoised point estimate $\hat{s}_0$
A per-item posterior uncertainty $\hat{\sigma}_i$

The second output is the part I care about more. In a safety-review workflow, I don’t only want a number. I want to know whether the judge is confident enough for that number to be actionable.
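A minimal sketch of the reverse step, assuming a one-dimensional Gaussian KDE whose score and curvature are computed in closed form. The function names and the bandwidth choice are assumptions for illustration, not the repository’s implementation; the observation passed in would be the pooled per-item score.

```python
import math

def kde_log_density_derivs(x, samples, h):
    """First and second derivative of log p(x) for a Gaussian KDE, bandwidth h."""
    logits = [-((x - xi) ** 2) / (2 * h * h) for xi in samples]
    m = max(logits)                              # subtract max for stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    w = [wi / z for wi in w]                     # normalized kernel weights
    mean_d = sum(wi * (xi - x) for wi, xi in zip(w, samples))
    mean_d2 = sum(wi * (xi - x) ** 2 for wi, xi in zip(w, samples))
    score = mean_d / h ** 2                      # d/dx log p(x)
    hess = (mean_d2 - mean_d ** 2) / h ** 4 - 1.0 / h ** 2   # d2/dx2 log p(x)
    return score, hess

def tweedie_denoise(s_tilde, sigma2, samples, h=None):
    """One-step Tweedie posterior mean and variance for a noisy score."""
    if h is None:
        # Assumed choice: h >= sigma keeps the posterior variance non-negative.
        h = max(math.sqrt(sigma2), 1e-3)
    score, hess = kde_log_density_derivs(s_tilde, samples, h)
    s_hat = s_tilde + sigma2 * score
    var = sigma2 + sigma2 ** 2 * hess
    return s_hat, max(var, 0.0)

samples = [2.0, 3.0, 4.0]
s_hat, var = tweedie_denoise(3.0, 0.25, samples)
print(round(s_hat, 3), round(math.sqrt(var), 3))
```

When the observation sits at the center of the sampled scores the correction vanishes; when it sits off to one side, the estimate is pulled back toward where the score mass is.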

The denoised estimate is then wrapped in an ordinal-boundary-adjusted split-conformal interval (Sheng et al., EMNLP 2025), studentized by the Tweedie posterior σ:

$\alpha_i = \frac{|\,s_i^{\text{true}} - \hat{s}_i\,|}{\max(\hat{\sigma}_i, \epsilon)}$

Because the score is ordinal, the resulting interval is snapped to valid 1–5 boundaries. The judge’s output is no longer “this answer is a 2.1.” It is one of three actions:

“This answer is likely in the failure region, and the calibrated interval is narrow enough to escalate automatically.”

“This answer is likely a clean pass and the interval is narrow. Release.”

“The model is uncertain in this case. Route to a human reviewer.”
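The interval and the three-way routing can be sketched as follows. This is an illustrative reading of studentized split-conformal prediction, not the repository’s code: the outward snapping rule, the action names, and the toy calibration data are all assumptions, while the fail (≤ 2) and pass (≥ 4) thresholds follow the ones used later in this article.

```python
import math

def conformal_quantile(cal_true, cal_pred, cal_sigma, alpha=0.10, eps=1e-3):
    """Finite-sample quantile of the studentized nonconformity scores."""
    scores = sorted(abs(t - p) / max(s, eps)
                    for t, p, s in zip(cal_true, cal_pred, cal_sigma))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return scores[rank - 1] if rank <= n else math.inf

def ordinal_interval(pred, sigma, q, eps=1e-3, lo=1, hi=5):
    """Scale the quantile by sigma, then snap outward to integer boundaries."""
    half = q * max(sigma, eps)
    if math.isinf(half):
        return lo, hi
    return max(lo, math.floor(pred - half)), min(hi, math.ceil(pred + half))

def route(lower, upper, fail_max=2, pass_min=4):
    """Map an interval to one of three hypothetical actions."""
    if upper <= fail_max:
        return "escalate"
    if lower >= pass_min:
        return "release"
    return "human_review"

# Toy calibration split (invented numbers).
cal_true = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
cal_pred = [1.2, 2.1, 2.8, 4.3, 4.9, 1.0, 2.4, 3.1, 3.6, 5.0]
cal_sigma = [0.5] * 10
q = conformal_quantile(cal_true, cal_pred, cal_sigma)

for pred, sigma in [(1.3, 0.2), (4.6, 0.2), (3.0, 1.0)]:
    lower, upper = ordinal_interval(pred, sigma, q)
    print(pred, (lower, upper), route(lower, upper))
```

Snapping outward rather than to the nearest integer is the conservative choice: it can only widen the interval, so it cannot reduce coverage.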

Image created by author using figurelabs

The problem domain and the data

This framework is general-purpose, but the application is intentionally specific: safety-critical autonomous-driving video evaluation. AV systems generate scenario logs and counterfactual rollouts at a volume that no human team can label. Industry now routinely uses LLM/VLM judges to score model answers, prediction quality, planner rationales, and chain-of-thought outputs, and those judges gate release decisions. NHTSA’s pre-crash typology catalogues 37 light-vehicle crash scenarios; ISO 26262 and SOTIF demand calibrated confidence on safety-critical events; CARLA Leaderboard 2.0 generates more validation traffic per day than any annotation budget can absorb.

The benchmark we test on is LingoQA (Marcu et al., ECCV 2024), a visual question-answering dataset for autonomous driving released by Wayve. Each item is a short driving clip (4 seconds, dash-cam, 1 Hz, 5 frames) with a free-form question, reference answer, and a learned-classifier gold score from Lingo-Judge. We use a stratified 200-clip subset of the official evaluation suite and treat the high-confidence Lingo-Judge scores as Tier-1 anchor labels.

A representative item, the same one the judges grade in production:

Question: “Why did the ego vehicle slow down here?” Reference answer: “Because a pedestrian started crossing the road at the marked crossing.” Candidate answer (AV-VLM under test): “Because of traffic ahead.” Gold score: 2.

The qualitative figure above shows three real LingoQA items. In the score-1 clip (red border) the candidate answer hallucinated a motorcycle that is not in the frames. The vision judge’s rationale explicitly calls out that contradiction. The score-5 clip (green border) is one where the candidate correctly verifies a negative claim (“there are no scooters visible”) that a text-only judge cannot check without seeing the scene. The score-3 clip is genuinely ambiguous: the candidate is partially right.

This is the kind of decision the judge has to make, and the kind of decision a text-only judge cannot make well, because the evidence lives in pixels.

Pipeline

The system that produced the numbers in this article fits in one diagram:

From the top:

Inputs: a driving clip with sampled frames, the question, the reference answer, and the AV-VLM’s candidate answer.
Forward perturbation cascade: 7 known judge-bias operators applied programmatically to the prompt, producing 22 prompt variants per item.
Judge ensemble: 5 configurations evaluated (Claude text-only ensemble, open-source text ensemble, Claude with 3 frames, Qwen2.5-VL-7B with 1 frame, InternVL2-8B with 1 frame). Each emits a scalar score plus a one-sentence rationale.
Noisy score samples: per-item distribution of scores tagged with perturbation level.
Tweedie reverse step: single-step denoising with posterior mean and posterior variance.
Ordinal conformal interval: boundary-snapped, studentized by the Tweedie σ.
Eval-of-eval report: Cohen’s κ, Krippendorff’s α, ECE, Brier, MAE, fail-F1, stochastic stability, robustness deltas per perturbation source.

Across all 5 judge configurations this produced 28,400 real judge evaluations on LingoQA.

The full implementation, scripts, run logs, and every figure in this article live in the project repository at github.com/syedhumarahim/diffujudge-av.

Where this slots into NVIDIA’s AV-Eval stack

I built this framework with NVIDIA’s AV-Eval charter in mind: learned evaluation pipelines that replace hand-crafted rules, agentic workflows that chain model inference with retrieval and structured reasoning, and explicit evaluation-of-evaluation methodology. Every primitive in DiffuJudge-AV maps onto that mandate. The 7-level perturbation cascade is the agentic workflow. The Tweedie and conformal layer is the calibration loop. The 12-category behavior taxonomy used internally maps cleanly to NHTSA pre-crash IDs, ASAM OpenSCENARIO 1.x phenomena, and CARLA Leaderboard 2.0 routes, the same scenario vocabulary NVIDIA’s AV training and eval stack already speaks.

The repository also ships a drop-in wrapper for NVILA-8B, NVIDIA’s own efficient VLM (Liu et al., 2024), and a deployment recipe that serves the three-VLM judge ensemble as OpenAI-compatible NVIDIA NIM endpoints. One caveat: NVILA-8B’s architecture is not yet supported by vLLM 0.8.4, so the working numbers in this article use Qwen2.5-VL-7B and InternVL2-8B as the open-VLM substitutes. The integration shape is ready for the day vLLM lands NVILA support.

Result: Pearson correlation hid the failure mode

Here is the full metric table:

Model	Mode	r	ρ	κ	MAE	ECE	Fail-F1
Claude TEXT-only	text ensemble	0.753	0.702	0.057	0.85	0.111	0.041
Open-source TEXT ensemble	Qwen+Llama+DSV3	0.803	0.717	0.701	0.92	0.207	0.526
Claude VISION	3 frames	0.708	0.703	0.632	1.05	0.252	0.612
Qwen2.5-VL-7B VISION ★	1 frame	0.857	0.856	0.837	0.57	0.121	0.712
InternVL2-8B VISION	1 frame	0.766	0.753	0.738	0.60	0.084	0.511

The key column is Cohen’s κ. Text-only Claude had a decent Pearson correlation, but almost zero ordinal agreement. Why? Because its predictions were squeezed into a narrow middle band. It was directionally aware, but not operationally useful.

That is the Pearson trap:

A judge can preserve ranking while destroying the decision boundary you actually care about.

A safety-review system needs to distinguish:

Clear failure → route to human or regression suite.
Partial answer → inspect or keep uncertain.
Clear pass → allow lower-priority review.

A judge that refuses to use the bottom and top of the scale cannot support that workflow. Text-only Claude’s fail-detection F1 is 0.041. It flags 2% of actual failures. The same model with three frames jumps to 0.612. Qwen2.5-VL goes further to 0.712, with κ = 0.837.
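The Pearson trap is easy to reproduce on toy data. The sketch below uses an invented, perfectly rank-preserving judge whose raw scores are compressed around 3; it is an illustration of the mechanism, not a replay of the LingoQA runs, and both metric functions are plain textbook implementations written for this example.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def quadratic_weighted_kappa(gold, pred, lo=1, hi=5):
    """Cohen's kappa with quadratic weights for integer ratings in [lo, hi]."""
    k = hi - lo + 1
    n = len(gold)
    observed = [[0.0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[g - lo][p - lo] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]                      # observed disagreement
            den += w * gold_hist[i] * pred_hist[j] / n     # chance disagreement
    return 1.0 - num / den if den > 0 else 1.0

gold = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
raw = [3 + 0.2 * (g - 3) for g in gold]   # compressed into [2.6, 3.4]
ordinal = [round(s) for s in raw]         # every item becomes a 3

r = pearson(gold, raw)
kappa = quadratic_weighted_kappa(gold, ordinal)
print(round(r, 3), round(kappa, 3))  # 1.0 0.0
```

The raw scores track gold perfectly, yet once they land on the ordinal scale the judge agrees with gold no better than chance.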

Result: vision changed Claude’s scoring behavior

The surprising discovery was not just that text-only judging performed worse. It was that the same Claude model behaved differently when given frames.

Text-only Claude compressed predictions into roughly [1.3, 3.5]. With three driving frames, the range expanded to roughly [1.0, 5.0].

The second panel above is Claude TEXT-only: roughly 80% of all judgments at score 3 with a tiny sprinkle of 1s and 5s. The third panel is the same Claude model with three frames: scores now spread across the full ordinal scale. Same model, same rubric, same items.

The compression was not a model-family property or a generic RLHF effect. It was input-mode-specific. When the judge only saw text, it hedged. When the judge saw the scene, it was willing to use the full ordinal scale. Many evaluation pipelines still use text-only judge prompts even for visual tasks. They ask the judge to compare a candidate answer to a reference answer, but the judge never sees the underlying evidence. For driving scenes that is a severe limitation. A text-only judge can check semantic similarity; a vision judge can check whether the answer is grounded in the scene.

Same finding shown another way, per-item scatter against gold:

Left panel: every text-only Claude prediction sits within [1.3, 3.5] regardless of where the gold actually is. Right panel: the same Claude on the same items with three frames. Predictions now climb the y = x line.

Result: vision unlocks safety-threshold decisions

The bottom-line operational metric for an AV-review pipeline is: can this judge flag a bad answer when the gold is bad? That is fail-detection at the ordinal threshold gold ≤ 2.

Claude TEXT-only: precision 1.00 on fail-detection, but recall 0.02. Catches 2% of actual failures, because it almost never says “≤ 2”.
Claude VISION: 0.45 precision, 0.94 recall. Catches 94% of failures.
Qwen2.5-VL-7B: 0.43 precision, 1.00 recall.

For pass-detection (gold ≥ 4), text-only Claude has F1 = 0.00. It never says “5,” so it can never confirm a clean pass. Claude-vision reaches 0.76; Qwen-VL reaches 1.00.

This is what vision grounding plus a calibrated scale buys you in practice.
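Fail-detection at an ordinal threshold reduces to a binary precision/recall computation. A minimal sketch on invented scores, where a “hedging” judge almost never goes below 3 and an “eager” judge uses the low end of the scale:

```python
def fail_detection(gold, pred, threshold=2):
    """Precision, recall, and F1 for flagging items with score <= threshold."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g <= threshold and p <= threshold)
    fp = sum(1 for g, p in pairs if g > threshold and p <= threshold)
    fn = sum(1 for g, p in pairs if g <= threshold and p > threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold    = [1, 1, 2, 2, 2, 3, 4, 5]
hedging = [2, 3, 3, 3, 3, 3, 3, 3]   # flags one failure, never wrongly
eager   = [1, 2, 2, 1, 2, 2, 4, 5]   # flags every failure plus one false alarm

print(fail_detection(gold, hedging))
print(fail_detection(gold, eager))
```

The hedging judge shows the same shape as the text-only result above: perfect precision is cheap when the judge almost never commits to a failing score.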

Result: a single heatmap of judge bias per noise source

One of the most useful artefacts the SDJ cascade gives you is a single picture that shows which of the seven known judge-bias sources each model family is most sensitive to. For each (model, perturbation level) cell, we average the absolute score shift from the anchor across all items:

A few things stand out:

Text-only Claude is uniformly fragile. Rubric paraphrase, criterion reorder, score-ID swap, and temperature each shift its mean by ~0.4 on a 1–5 scale. That matches the diffusion framing’s prediction that compressed, hedging judges drift the most when you perturb the prompt.
The open-source text ensemble is roughly 3× more robust across every column, maxing out at |Δ| = 0.15 for score-ID swap.
Qwen2.5-VL is dominated by one specific bias: score-ID format swap (Arabic → Roman → A–E) shifts its mean by 0.44. Knowing which bias matters most is itself actionable: lock the score format in production prompts for this judge.

This is exactly the kind of audit-ready, evaluation-of-evaluation artefact a learned-evaluation pipeline needs to ship alongside the headline metrics.
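One row of such a heatmap can be computed as below. This is a sketch under an assumption about the aggregation: each item’s shift is taken as the absolute difference between the mean of its perturbed samples and its anchor score, then averaged over items. The data layout and the numbers are invented.

```python
from statistics import fmean

def mean_abs_shift(anchor, perturbed):
    """One heatmap row: perturbation level -> mean |shift from anchor|.

    anchor:    item id -> unperturbed score
    perturbed: level name -> item id -> list of sampled scores
    """
    return {
        level: fmean(abs(fmean(scores) - anchor[item])
                     for item, scores in by_item.items())
        for level, by_item in perturbed.items()
    }

anchor = {"clip_a": 3.0, "clip_b": 2.0}
perturbed = {
    "rubric_paraphrase":    {"clip_a": [3.0, 3.0, 3.0], "clip_b": [2.0, 2.0, 2.0]},
    "score_id_format_swap": {"clip_a": [3.5, 3.5, 3.5], "clip_b": [2.5, 2.5, 2.5]},
}

row = mean_abs_shift(anchor, perturbed)
print(row)  # {'rubric_paraphrase': 0.0, 'score_id_format_swap': 0.5}
```

Running this once per judge and stacking the rows gives the (model, perturbation level) grid described above.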

Result: does the uncertainty have signal?

The Tweedie reverse step produces a posterior σ at no extra cost. The question is whether that σ carries information. Are the items the cascade marks as uncertain the same items the judge gets wrong? For each vision judge we plot per-item std of perturbation samples (a proxy for posterior σ) against the absolute error against gold:

Qwen2.5-VL’s σ has the cleanest signal (r = 0.26 between predicted-σ and observed-|error|). Items with σ near zero almost never have |error| > 1; items with σ > 0.6 are where the judge’s mean was off by 1–3 score points. That is precisely the regime where the safety-gate diagram says route to human review. We now have empirical evidence the framework’s own uncertainty estimate identifies those items.
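The check itself is a correlation between two per-item quantities. A minimal sketch on invented items, assuming the population standard deviation of the perturbation samples as the σ proxy and the sample mean as the judge’s point estimate:

```python
import math
from statistics import fmean, pstdev

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def uncertainty_signal(samples_by_item, gold_by_item):
    """Correlate per-item sample spread with per-item absolute error."""
    sigmas, errors = [], []
    for item, samples in samples_by_item.items():
        sigmas.append(pstdev(samples))                         # proxy for sigma
        errors.append(abs(fmean(samples) - gold_by_item[item]))
    return pearson(sigmas, errors), sigmas, errors

# Invented data: two stable, correct items and two unstable, wrong ones.
samples_by_item = {
    "a": [3.0, 3.0, 3.0],
    "b": [2.0, 2.0, 2.0],
    "c": [2.0, 3.0, 4.0],
    "d": [1.0, 2.0, 3.0],
}
gold_by_item = {"a": 3.0, "b": 2.0, "c": 5.0, "d": 1.0}

r, sigmas, errors = uncertainty_signal(samples_by_item, gold_by_item)
print(round(r, 2))
```

A positive correlation is the property that makes σ usable as a routing signal; the toy data exaggerates it compared with the measured values above.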

Result: stochastic stability hit the original target

One goal was to test whether SDJ could expose and reduce stochastic instability. I ran Qwen2.5-VL-7B across 5 random seeds at two temperatures on all 100 vision items:

Temperature	Median per-item std	Mean	Frac items with std ≤ 0.15
T = 0.6 (noisy single-judge baseline)	0.40	0.40	31%
T = 0 (deterministic floor)	0.00	0.024	95%

The noisy baseline matched the expected instability almost exactly: about 0.40 per-item standard deviation, consistent with the literature. At T = 0, 95% of items sat at or below the original design target of 0.15.

In the diffusion framing, temperature is one of the forward noise sources. I’m not saying every production judge should run at T = 0 forever. The point is that a judge harness should measure this instability explicitly instead of pretending the score is deterministic, and report a posterior σ alongside the point estimate.
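The three columns of the stability table can be computed from repeated runs as sketched below. The use of the sample standard deviation across seeds is an assumption, and the scores are invented.

```python
from statistics import fmean, median, stdev

def stability_summary(scores_by_item, target=0.15):
    """Summarize run-to-run spread of a judge across repeated seeds."""
    stds = [stdev(scores) for scores in scores_by_item.values()]
    return {
        "median_std": median(stds),
        "mean_std": fmean(stds),
        "frac_within_target": sum(s <= target for s in stds) / len(stds),
    }

# Invented data: each item scored under 5 seeds.
scores_by_item = {
    "a": [3, 3, 3, 3, 3],
    "b": [2, 2, 2, 2, 2],
    "c": [2, 3, 2, 3, 2],   # flips between adjacent scores
}

summary = stability_summary(scores_by_item)
print(summary)
```

Reporting the fraction within target alongside the median matters because a median of zero can hide a minority of items that flip between scores.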

Result: conformal coverage matches the calibration target

The conformal layer aims for empirical coverage ≥ 1 − α with α = 0.10. Across the three runs with enough items for stable split-conformal calibration:

Run	n_test	Empirical coverage	Target	Mean interval width
Claude TEXT-only	80	0.950	0.900	4.51
Open-source TEXT	80	1.000	0.900	4.50
Claude VISION	20	1.000	0.900	3.50

All three are above target. Per-bin coverage (fail / mid / pass tiers) is also above 0.92 in every cell. The intervals are 3.5–4.5 score units wide on a 1–5 scale. That is the price of full coverage on a heterogeneous calibration set.

Limitations

A few caveats worth flagging up front. The gold labels are high-confidence Lingo-Judge classifier outputs, not a Tier-3 human-adjudicated set. The CODA-LM-style corner-case stress split (cut-ins at night, occluded VRUs, ambiguous near-misses) is not included yet. The best ECE measured is 0.084, not the original 0.05 target; a post-hoc isotonic or Platt calibration on the calibration split would almost certainly close that gap. The current VLM ensemble has two committed judges (Qwen2.5-VL and InternVL2-8B) rather than the planned three. The vision runs are smaller than the text-only runs (100 items vs 200) because of available frame maps, so cross-modality numbers are model-level summaries rather than strictly per-item comparable. And the denoising step is a single-step analytical Tweedie correction rather than a multi-step learned sampler. These are honest limitations, but they also map directly onto the roadmap below.

Conclusion

The most important lesson from this project is not that one model beat another. It is that the metric you optimize for during eval-of-eval determines the judge you ship.

If I had optimized for Pearson r alone, I would have shipped a text-only Claude judge that barely used the ordinal scale and caught 2% of safety-critical failures. Using the full eval-of-eval table (ordinal κ, fail-detection F1, calibration, stochastic stability) flipped the decision to an open VLM judge with calibrated uncertainty and a routing rule that sends ambiguous cases to humans. Same data, different metric, different judge in production.

That is the difference between an eval that looks good in a benchmark table and an eval that can support real engineering decisions.

For learned evaluation systems, especially in autonomous driving, robotics, and healthcare, we should stop treating evaluator scores as ground truth. They are measurements. Measurements have noise. Noise has structure. If we can measure that structure, we can build better evaluators.

That is what DiffuJudge-AV tries to do: make the failure modes of the evaluator visible before the evaluator becomes part of the production decision loop. Wang Lun’s essay closes with a line worth quoting: “If you can evaluate correctly, you can train correctly.” This work is one small contribution toward that ambition.

Future work

The roadmap follows directly from the limitations above: a small Tier-3 expert-anchored golden set on the 50 hardest LingoQA items, a CODA-LM corner-case stress split, a third specialized VLM judge (LLaVA-Critic-7B, or NVILA-8B once vLLM lands the architecture), a learned Tweedie MLP that replaces the analytical Gaussian KDE with a small denoiser trained on perturbation-level / judge-family / item-embedding features, a post-hoc isotonic calibration layer to close the remaining ECE gap, and a NIM-served production ensemble with A/B comparison tooling and model versioning.

References

Shi, Y. et al. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. IJCNLP-AACL 2025.
Chen, X. et al. Evaluating Scoring Bias in LLM-as-a-Judge. arXiv 2506.22316, 2025.
Thakur, A. et al. Rating Roulette: Self-Inconsistency in LLM-as-a-Judge. arXiv 2510.27106, 2025.
SPUQ: Semantically-Perturbed Uncertainty Quantification. arXiv 2403.02509, 2024.
Sheng, H. et al. Analyzing Uncertainty of LLM-as-a-Judge via Conformal Prediction. EMNLP 2025 / arXiv 2509.18658.
Ye, S. et al. CALM: A Reasoning-Calibrated Multi-Step Eval-of-Eval Framework. ICLR 2025.
Wang, L. Your Evals Will Break and You Won’t See It Coming. Blog, 2025.
Robbins, H. An empirical Bayes approach to statistics. Berkeley Symp. 1956 (Tweedie’s formula).
Manor, H. & Michaeli, T. On the Posterior Distribution in Denoising: Application to Uncertainty Quantification. ICLR 2024 / arXiv 2309.13598.
Marcu, A. et al. LingoQA: Visual Question Answering for Autonomous Driving. ECCV 2024 (Wayve).
Sima, C. et al. DriveLM: Driving with Graph Visual Question Answering. ECCV 2024.
Liu, Z. et al. NVILA: Efficient Frontier Visual Language Models. NVIDIA, arXiv 2412.04468, 2024.
Lin, J. et al. VILA: On Pre-training for Visual Language Models. NeurIPS 2024 / arXiv 2312.07533 (NVIDIA).
Wang, S. et al. OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning, and Planning. NVIDIA, arXiv 2405.01533, 2024.
Mao, J. et al. A Survey on Multimodal Large Language Models for Autonomous Driving. NVIDIA / Tsinghua, arXiv 2311.12320, 2023.
Najm, W. G. et al. Pre-Crash Scenario Typology for Crash Avoidance Research. NHTSA / Volpe, 2007.
ASAM e.V. OpenSCENARIO 1.x specification. 2022–2024.
