    DiffuJudge-AV: A Diffusion-Inspired Framework for Calibrated AV Video Evaluation

    By Editor Times Featured | May 28, 2026 | 19 Mins Read

    Like a Noisy Sensor. It Changed Which Autonomous-Driving Evaluator I Would Ship.

    There’s a particular kind of result that looks impressive until you ask the wrong second question.

    In this project, that result was a Pearson correlation of 0.753 from a text-only Claude judge grading autonomous-driving visual-QA answers. At first glance, that looks like a usable evaluator. It tracks the gold scores, it produces rationales, and it’s a strong closed model. Good enough to triage model outputs, right?

    Then I looked at quadratic-weighted Cohen’s κ. It was 0.057.

    That’s the moment the project changed. The judge was rank-correlated with the gold labels, but it was not behaving like an ordinal safety evaluator. It had learned the safest-looking failure mode: compress almost everything toward the middle of the 1–5 scale. For ordinary benchmark reporting, that may pass unnoticed. For an autonomous-driving review pipeline that must flag bad answers before they gate a software release, it’s dangerous.
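A toy example makes the gap between the two metrics concrete. The sketch below is not the project's evaluation code: the data are invented, `pearson` and `quadratic_weighted_kappa` are small hand-written helpers, and it assumes continuous judge scores are rounded to the nearest integer before κ is computed, which the article does not specify.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def quadratic_weighted_kappa(gold, pred, k=5):
    """Cohen's kappa with quadratic weights for integer labels 1..k."""
    n = len(gold)
    observed = [[0.0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[g - 1][p - 1] += 1
    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2       # quadratic disagreement weight
            num += w * observed[i][j]             # observed disagreement
            den += w * gold_hist[i] * pred_hist[j] / n  # chance disagreement
    return 1.0 - num / den

# Invented data: a judge that tracks gold perfectly in direction,
# but squeezes every score into the band [2.8, 3.2].
gold = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
raw_scores = [3 + 0.1 * (g - 3) for g in gold]
rounded = [round(s) for s in raw_scores]          # every score becomes 3

print("Pearson r:", pearson(gold, raw_scores))
print("Quadratic-weighted kappa:", quadratic_weighted_kappa(gold, rounded))
```

In this constructed case the correlation is perfect while the ordinal agreement is no better than chance, which is the same pattern, in exaggerated form, as the 0.753 versus 0.057 result.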

    So I built DiffuJudge-AV, a small evaluation-of-evaluation framework for LLM/VLM judges on driving video. The idea is simple: treat a judge’s score as a noisy observation of a latent true rubric score, deliberately expose the judge to known sources of scoring bias, then denoise the resulting score distribution with a one-step Tweedie posterior mean and report calibrated uncertainty.

    Across 28,400 judge evaluations on Wayve’s LingoQA benchmark, the most interesting finding was not that a larger closed model won. It didn’t. The best judge in the experiment was Qwen2.5-VL-7B, an open 7B vision-language model. It reached:

    • Pearson r = 0.857
    • Spearman ρ = 0.856
    • Quadratic-weighted Cohen’s κ = 0.837
    • MAE = 0.57
    • Fail-detection F1 = 0.712

    Note: The LingoQA benchmark is released under a non-commercial license. The dataset creators at Wayve have granted permission for its use in this article.

    For this AV-style evaluation task, an open VLM was not just competitive. It was better on the metrics that actually matter.

    Why “evaluation of evaluation”?

    When a model answers a question about a driving scene, the obvious evaluation question is:

    Did the model answer correctly?

    For example:

    Question: Are there any parked cars on the side of the road?
    Reference: Yes, there are two cars parked on the right.
    Candidate answer (model under test): I don’t know.
    Gold score: 1.13 (low).

    For a human, this is easy. Watch the clip, compare the answer to the scene, assign a score. At scale, though, human evaluation becomes the bottleneck. Modern autonomy stacks generate more perception clips, scenario logs, counterfactual rollouts, and model outputs than any annotation team can score manually. So teams naturally reach for LLM-as-a-Judge or VLM-as-a-Judge: give a model the question, reference, candidate answer, rubric, and sometimes the frames, then ask it to score.

    That creates a second-order problem:

    If the judge is a model, how do we know the judge is reliable?

    This is evaluation of evaluation (eval-of-eval). Instead of only asking whether the AV model is correct, we ask whether the evaluator itself is stable, calibrated, bias-resistant, and useful for downstream decisions. Recent papers (Judging the Judges by Shi et al., IJCNLP-AACL 2025; JETTS by Salesforce, 2025; CALM by Ye et al., ICLR 2025) have catalogued structural failure modes in LLM judges: position bias, verbosity bias, scoring-ID-format bias, self-inconsistency across runs, and extreme score compression.

    There is also a more uncomfortable claim from Wang Lun’s recent essay, Your Evals Will Break and You Won’t See It Coming: evaluation infrastructure fails silently when models cross capability thresholds, because existing benchmarks assume incremental improvement. His proposed remedy is adaptive evals that detect their own obsolescence. DiffuJudge-AV is one concrete step in that direction. By attaching a calibrated uncertainty to every score the judge emits, the framework widens its own confidence interval before the point estimate misleads you.

    For autonomous driving this matters operationally. If a learned evaluator decides which failures get escalated to human review, which scenarios enter a regression suite, or which releases deserve more scrutiny, then the evaluator’s failure modes become part of the safety story.

    The intuition: a judge score is a noisy sensor reading

    An LLM judge score looks clean because it’s a number: 1, 2, 3, 4, or 5.

    But that number can move for reasons that have nothing to do with the actual quality of the answer. Change the order of options. Paraphrase the rubric. Reorder criteria. Swap score labels from Arabic numerals to Roman. Resample exemplars. Change temperature. Shuffle the video frames you sample. The true answer quality didn’t change. The judge changed.

    That suggests a useful mental model:

    Treat the judge like a noisy sensor.

    There is a latent score $s_0$. The judge never observes it directly. Each prompt variant produces a noisy reading

    $$\tilde{s}_t = s_0 + \epsilon_t, \quad t \in \{1, \ldots, 7\}$$

    Here t is not a diffusion timestep in the image-generation sense. It’s a documented source of judge perturbation, drawn directly from the 2024–2025 LLM-as-a-Judge bias literature. Seven canonical sources, each one a controlled noise level:

    Image created by author using figurelabs

    | Level t | Perturbation | What it tests | Reference |
    |---|---|---|---|
    | 1 | option / order swap | position bias | Shi et al., 2025 |
    | 2 | rubric paraphrase | prompt sensitivity | SPUQ, arXiv 2403.02509 |
    | 3 | criterion reorder | rubric-order sensitivity | Chen et al., 2025 |
    | 4 | score-ID format swap (1–5 / I–V / A–E) | scoring-format bias | Chen et al., 2025 |
    | 5 | temperature noise | self-inconsistency | Thakur et al., 2025 |
    | 6 | exemplar resample | few-shot variance | classical |
    | 7 | frame shuffle (video) | temporal robustness | this work |

    For each item, the framework runs the judge across all seven perturbation levels with k = 3 samples each, giving roughly 22 score observations per item instead of one. That is a practical measurement of judge instability. It also lets us run a reverse step.
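The forward cascade can be sketched as a loop over perturbation levels. Everything named here is hypothetical: `toy_judge` stands in for a real LLM/VLM call, its noise model (spread growing with the level index) is invented purely so the example runs, and the single unperturbed anchor reading at level 0 is an assumption that would account for 7 × 3 + 1 = 22 observations.

```python
import random

# The seven perturbation sources listed above. Level 0 (the unperturbed
# anchor prompt) is an assumption of this sketch.
PERTURBATIONS = {
    1: "option / order swap",
    2: "rubric paraphrase",
    3: "criterion reorder",
    4: "score-ID format swap",
    5: "temperature noise",
    6: "exemplar resample",
    7: "frame shuffle",
}

def toy_judge(item, level, rng):
    """Hypothetical stand-in for a judge call; returns a score in [1, 5]."""
    noise = rng.gauss(0.0, 0.1 * level)   # invented noise model
    return min(5.0, max(1.0, item["latent_score"] + noise))

def run_cascade(item, judge, k=3, seed=0):
    """Collect one anchor reading plus k readings per perturbation level."""
    rng = random.Random(seed)
    samples = {0: [judge(item, 0, rng)]}
    for level in PERTURBATIONS:
        samples[level] = [judge(item, level, rng) for _ in range(k)]
    return samples

item = {"question": "Are there any parked cars on the side of the road?",
        "latent_score": 2.0}
samples = run_cascade(item, toy_judge)
n_obs = sum(len(scores) for scores in samples.values())
print(n_obs, "score observations for one item")
```

Keeping the samples keyed by level, rather than flattening them, is what later allows a per-level variance and a per-source bias estimate.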

    The denoising step: Tweedie in one equation

    The diffusion analogy becomes useful because there is a classical result behind denoising: Tweedie’s formula (Robbins 1956; revived for modern diffusion by Manor & Michaeli, ICLR 2024).

    If a noisy observation $\tilde{s}$ is generated by adding Gaussian noise to a latent clean value $s_0$, the posterior mean and variance are:

    $$\hat{s}_0 = \tilde{s} + \sigma_t^2 \, \nabla_{\tilde{s}} \log p(\tilde{s})$$

    $$\mathrm{Var}[\, s_0 \mid \tilde{s} \,] = \sigma_t^2 + \sigma_t^4 \, \nabla_{\tilde{s}}^2 \log p(\tilde{s})$$
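As a sanity check on these two formulas, consider the special case (an illustrative assumption, not the framework's KDE estimate) where the latent score has a Gaussian prior $s_0 \sim \mathcal{N}(\mu, \tau^2)$. The marginal of the noisy score is then $\mathcal{N}(\mu, \tau^2 + \sigma_t^2)$, and Tweedie's formula reduces to the familiar shrinkage estimator:

```latex
\begin{aligned}
\nabla_{\tilde{s}} \log p(\tilde{s}) &= -\frac{\tilde{s} - \mu}{\tau^2 + \sigma_t^2},
\qquad
\nabla_{\tilde{s}}^2 \log p(\tilde{s}) = -\frac{1}{\tau^2 + \sigma_t^2}, \\
\hat{s}_0 &= \tilde{s} - \sigma_t^2 \, \frac{\tilde{s} - \mu}{\tau^2 + \sigma_t^2}
          = \frac{\tau^2 \, \tilde{s} + \sigma_t^2 \, \mu}{\tau^2 + \sigma_t^2}, \\
\mathrm{Var}[\, s_0 \mid \tilde{s} \,] &= \sigma_t^2 - \frac{\sigma_t^4}{\tau^2 + \sigma_t^2}
          = \frac{\sigma_t^2 \, \tau^2}{\tau^2 + \sigma_t^2}.
\end{aligned}
```

The noisier the reading (larger $\sigma_t^2$), the harder the estimate is pulled toward $\mu$, and the posterior variance is always smaller than $\sigma_t^2$.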

    The framework estimates $p(\tilde{s})$ with a Gaussian KDE over the per-item sampled scores. Within-perturbation-level variance gives $\sigma_t^2$. Pooling across levels is precision-weighted before the Tweedie correction. Two outputs come out of this single reverse step:

    1. A denoised point estimate $\hat{s}_0$
    2. A per-item posterior uncertainty $\hat{\sigma}_i$

    The second output is the part I care about more. In a safety-review workflow, I don’t only want a number. I want to know whether the judge is confident enough for that number to be actionable.
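A minimal sketch of that reverse step, under stated assumptions: the helper names are hypothetical, the bandwidth uses a normal-reference rule of thumb, a small variance floor guards against levels with identical samples, and the pooled reading's noise variance is taken to be the inverse of the summed precisions. None of these choices is confirmed by the article; the repository holds the author's actual implementation.

```python
import math
import statistics

def kde_log_derivatives(x, samples, bandwidth):
    """First and second derivative of log p(x) for a Gaussian KDE."""
    h2 = bandwidth ** 2
    exponents = [-0.5 * (x - s) ** 2 / h2 for s in samples]
    top = max(exponents)                       # shift so weights cannot all underflow
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    d1 = sum(w * (s - x) / h2 for w, s in zip(weights, samples)) / total
    d2 = sum(w * ((s - x) ** 2 / h2 ** 2 - 1.0 / h2)
             for w, s in zip(weights, samples)) / total
    return d1, d2 - d1 ** 2                    # (log p)'' = p''/p - (p'/p)^2

def denoise_item(samples_by_level, var_floor=1e-3):
    """One-step Tweedie posterior mean and variance from {level: [scores]}."""
    means, variances = [], []
    for scores in samples_by_level.values():
        means.append(statistics.mean(scores))
        variances.append(max(statistics.variance(scores), var_floor))
    precisions = [1.0 / v for v in variances]
    pooled = sum(m * p for m, p in zip(means, precisions)) / sum(precisions)
    sigma2 = 1.0 / sum(precisions)             # assumed noise variance of the pooled reading

    all_scores = [s for scores in samples_by_level.values() for s in scores]
    spread = statistics.stdev(all_scores)
    bandwidth = max(1.06 * spread * len(all_scores) ** -0.2, 0.05)

    score_fn, curvature = kde_log_derivatives(pooled, all_scores, bandwidth)
    posterior_mean = pooled + sigma2 * score_fn
    posterior_var = max(sigma2 + sigma2 ** 2 * curvature, 0.0)
    return posterior_mean, posterior_var

# Invented readings for one item at three perturbation levels.
toy_samples = {1: [2.0, 2.0, 3.0], 2: [2.0, 3.0, 3.0], 3: [3.0, 3.0, 4.0]}
s_hat, var_hat = denoise_item(toy_samples)
print(round(s_hat, 3), round(math.sqrt(var_hat), 3))
```

On these toy readings the pooled mean sits just below 3, and the correction nudges it toward the densest part of the sample distribution while shrinking the variance, which is the qualitative behaviour the two formulas describe.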

    The denoised estimate is then wrapped in an ordinal-boundary-adjusted split-conformal interval (Sheng et al., EMNLP 2025), studentized by the Tweedie posterior σ:

    $$\alpha_i = \frac{|\, s_i^{\text{true}} - \hat{s}_i \,|}{\max(\hat{\sigma}_i, \epsilon)}$$

    Because the score is ordinal, the resulting interval is snapped to valid 1–5 boundaries. The judge’s output is no longer “this answer is a 2.1.” It’s one of three actions:

    “This answer is likely in the failure region, and the calibrated interval is narrow enough to escalate automatically.”

    “This answer is likely a clean pass and the interval is narrow. Release.”

    “The model is uncertain in this case. Route to a human reviewer.”

    Image created by author using figurelabs
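The three actions can be sketched as a split-conformal interval followed by a routing rule. This is an illustration under assumptions, not the framework's code: intervals are snapped outward to integer scores, "narrow enough" is read as "the whole interval lies inside the fail region (≤ 2) or the pass region (≥ 4)", and the calibration data, the value of ε, and all function names are invented.

```python
import math

def conformal_quantile(cal_gold, cal_pred, cal_sigma, alpha=0.10, eps=0.05):
    """Split-conformal quantile of studentized residuals |gold - pred| / sigma."""
    scores = sorted(abs(g - p) / max(s, eps)
                    for g, p, s in zip(cal_gold, cal_pred, cal_sigma))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))    # finite-sample corrected rank
    return scores[rank - 1] if rank <= n else math.inf

def ordinal_interval(pred, sigma, q, eps=0.05, lo=1, hi=5):
    """Interval pred +/- q * sigma, snapped outward to integer scores in [lo, hi]."""
    if math.isinf(q):
        return lo, hi
    half = q * max(sigma, eps)
    lower = min(hi, max(lo, math.floor(pred - half)))
    upper = max(lo, min(hi, math.ceil(pred + half)))
    return lower, upper

def route(lower, upper, fail_max=2, pass_min=4):
    """Map a calibrated interval to one of the three review actions."""
    if upper <= fail_max:
        return "escalate"
    if lower >= pass_min:
        return "release"
    return "human_review"

# Invented calibration split: gold scores, denoised estimates, posterior sigmas.
cal_gold = [1, 2, 3, 4, 5, 1, 2, 3, 4]
cal_pred = [1.2, 2.1, 2.7, 4.4, 4.9, 1.5, 2.0, 3.3, 3.8]
cal_sigma = [0.2] * 9
q = conformal_quantile(cal_gold, cal_pred, cal_sigma)

for pred, sigma in [(1.4, 0.2), (4.6, 0.2), (3.0, 0.6)]:
    interval = ordinal_interval(pred, sigma, q)
    print(pred, sigma, interval, route(*interval))
```

Studentizing by the posterior σ is what lets confident items keep narrow intervals while uncertain ones widen until they straddle both regions and fall through to human review.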

    The problem domain and the data

    This framework is general-purpose, but the application is intentionally specific: safety-critical autonomous-driving video evaluation. AV systems generate scenario logs and counterfactual rollouts at a volume that no human team can label. Industry now routinely uses LLM/VLM judges to score model answers, prediction quality, planner rationales, and chain-of-thought outputs, and those judges gate release decisions. NHTSA’s pre-crash typology catalogues 37 light-vehicle crash scenarios; ISO 26262 and SOTIF demand calibrated confidence on safety-critical events; CARLA Leaderboard 2.0 generates more validation traffic per day than any annotation budget can absorb.

    The benchmark we test on is LingoQA (Marcu et al., ECCV 2024), a visual question-answering dataset for autonomous driving released by Wayve. Each item is a short driving clip (4 seconds, dash-cam, 1 Hz, 5 frames) with a free-form question, reference answer, and a learned-classifier gold score from Lingo-Judge. We use a stratified 200-clip subset of the official evaluation suite and treat the high-confidence Lingo-Judge scores as Tier-1 anchor labels.

    A representative item, the same one the judges grade in production:

    Question: “Why did the ego vehicle slow down here?”
    Reference answer: “Because a pedestrian started crossing the road at the marked crossing.”
    Candidate answer (AV-VLM under test): “Because of traffic ahead.”
    Gold score: 2.

    Image by author

    The qualitative figure above shows three real LingoQA items. In the score-1 clip (red border) the candidate answer hallucinated a motorcycle that is not in the frames. The vision judge’s rationale explicitly calls out that contradiction. The score-5 clip (green border) is one where the candidate correctly verifies a negative claim (“there are no scooters visible”) that a text-only judge cannot check without seeing the scene. The score-3 clip is genuinely ambiguous: the candidate is partially right.

    This is the kind of decision the judge has to make, and the kind of decision a text-only judge cannot make well, because the evidence lives in pixels.

    Pipeline

    The system that produced the numbers in this article fits in a single diagram:

    Image created by author using figurelabs

    From the top:

    • Inputs: a driving clip with sampled frames, the question, the reference answer, and the AV-VLM’s candidate answer.
    • Forward perturbation cascade: 7 known judge-bias operators applied programmatically to the prompt, producing 22 prompt variants per item.
    • Judge ensemble: 5 configurations evaluated (Claude text-only ensemble, open-source text ensemble, Claude with 3 frames, Qwen2.5-VL-7B with 1 frame, InternVL2-8B with 1 frame). Each emits a scalar score plus a one-sentence rationale.
    • Noisy score samples: per-item distribution of scores tagged with perturbation level.
    • Tweedie reverse step: single-step denoising with posterior mean and posterior variance.
    • Ordinal conformal interval: boundary-snapped, studentized by the Tweedie σ.
    • Eval-of-eval report: Cohen’s κ, Krippendorff’s α, ECE, Brier, MAE, fail-F1, stochastic stability, robustness deltas per perturbation source.

    Across all 5 judge configurations this produced 28,400 real judge evaluations on LingoQA.

    The full implementation, scripts, run logs, and every figure in this article live in the project repository at github.com/syedhumarahim/diffujudge-av.

    Where this slots into NVIDIA’s AV-Eval stack

    I built this framework with NVIDIA’s AV-Eval charter in mind: learned evaluation pipelines that replace hand-crafted rules, agentic workflows that chain model inference with retrieval and structured reasoning, and explicit evaluation-of-evaluation methodology. Every primitive in DiffuJudge-AV maps onto that mandate. The 7-level perturbation cascade is the agentic workflow. The Tweedie and conformal layer is the calibration loop. The 12-category behavior taxonomy used internally maps cleanly to NHTSA pre-crash IDs, ASAM OpenSCENARIO 1.x phenomena, and CARLA Leaderboard 2.0 routes, the same scenario vocabulary NVIDIA’s AV training and eval stack already speaks.

    The repository also ships a drop-in wrapper for NVILA-8B, NVIDIA’s own efficient VLM (Liu et al., 2024), and a deployment recipe that serves the three-VLM judge ensemble as OpenAI-compatible NVIDIA NIM endpoints. One caveat: NVILA-8B’s architecture is not yet supported by vLLM 0.8.4, so the working numbers in this article use Qwen2.5-VL-7B and InternVL2-8B as the open-VLM substitutes. The integration shape is ready for the day vLLM lands NVILA support.

    Result: Pearson correlation hid the failure mode

    Here is the full metric table:

    | Model | Mode | r | ρ | κ | MAE | ECE | Fail-F1 |
    |---|---|---|---|---|---|---|---|
    | Claude TEXT-only | text ensemble | 0.753 | 0.702 | 0.057 | 0.85 | 0.111 | 0.041 |
    | Open-source TEXT ensemble | Qwen+Llama+DSV3 | 0.803 | 0.717 | 0.701 | 0.92 | 0.207 | 0.526 |
    | Claude VISION | 3 frames | 0.708 | 0.703 | 0.632 | 1.05 | 0.252 | 0.612 |
    | Qwen2.5-VL-7B VISION ★ | 1 frame | 0.857 | 0.856 | 0.837 | 0.57 | 0.121 | 0.712 |
    | InternVL2-8B VISION | 1 frame | 0.766 | 0.753 | 0.738 | 0.60 | 0.084 | 0.511 |

    Image by author

    The key column is Cohen’s κ. Text-only Claude had a decent Pearson correlation, but almost zero ordinal agreement. Why? Because its predictions were squeezed into a narrow middle band. It was directionally aware, but not operationally useful.

    That’s the Pearson trap:

    A judge can preserve ranking while destroying the decision boundary you actually care about.

    A safety-review system needs to distinguish:

    • Clear failure → route to human or regression suite.
    • Partial answer → inspect or keep uncertain.
    • Clear pass → allow lower-priority review.

    A judge that refuses to use the bottom and top of the scale cannot support that workflow. Text-only Claude’s fail-detection F1 is 0.041. It flags 2% of actual failures. The same model with three frames jumps to 0.612. Qwen2.5-VL goes further to 0.712, with κ = 0.837.

    Result: vision changed Claude’s scoring behavior

    The surprising discovery was not just that text-only judging performed worse. It was that the same Claude model behaved differently when given frames.

    Text-only Claude compressed predictions into roughly [1.3, 3.5]. With three driving frames, the range expanded to roughly [1.0, 5.0].

    Image by author

    The second panel above is Claude TEXT-only: roughly 80% of all judgments at score 3 with a tiny sprinkle of 1s and 5s. The third panel is the same Claude model with three frames: scores now spread across the full ordinal scale. Same model, same rubric, same items.

    The compression was not a model-family property or a generic RLHF effect. It was input-mode-specific. When the judge only saw text, it hedged. When the judge saw the scene, it was willing to use the full ordinal scale. Many evaluation pipelines still use text-only judge prompts even for visual tasks. They ask the judge to compare a candidate answer to a reference answer, but the judge never sees the underlying evidence. For driving scenes that is a severe limitation. A text-only judge can check semantic similarity; a vision judge can check whether the answer is grounded in the scene.

    Same finding shown another way, per-item scatter against gold:

    Image by author

    Left panel: every text-only Claude prediction sits inside [1.3, 3.5] regardless of where the gold actually is. Right panel: the same Claude on the same items with three frames. Predictions now climb the y = x line.

    Result: vision unlocks safety-threshold decisions

    The bottom-line operational metric for an AV-review pipeline is: can this judge flag a bad answer when the gold is bad? That’s fail-detection at the ordinal threshold gold ≤ 2.

    Image by author

    • Claude TEXT-only: precision 1.00 on fail-detection, but recall 0.02. It catches 2% of actual failures, because it almost never says “≤ 2”.
    • Claude VISION: 0.45 precision, 0.94 recall. It catches 94% of failures.
    • Qwen2.5-VL-7B: 0.43 precision, 1.00 recall.

    For pass-detection (gold ≥ 4), text-only Claude has F1 = 0.00. It never says “5,” so it can never confirm a clean pass. Claude-vision reaches 0.76; Qwen-VL reaches 1.00.

    That is what vision grounding plus a calibrated scale buys you in practice.
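Fail- and pass-detection reduce to ordinary precision, recall, and F1 once both gold and predicted scores are thresholded. The sketch below uses invented scores for a compressed judge; `detection_metrics` is a hypothetical helper, and applying the same threshold to predictions as to gold is an assumption.

```python
def detection_metrics(gold, pred, is_positive):
    """Precision, recall and F1 for a thresholded ordinal decision."""
    gold_pos = [is_positive(g) for g in gold]
    pred_pos = [is_positive(p) for p in pred]
    tp = sum(g and p for g, p in zip(gold_pos, pred_pos))
    fp = sum((not g) and p for g, p in zip(gold_pos, pred_pos))
    fn = sum(g and (not p) for g, p in zip(gold_pos, pred_pos))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented scores: a judge that almost always answers 3.
gold = [1, 2, 2, 1, 3, 4, 5, 3]
compressed = [2, 3, 3, 3, 3, 3, 3, 3]

fail = detection_metrics(gold, compressed, lambda s: s <= 2)
passed = detection_metrics(gold, compressed, lambda s: s >= 4)
print("fail-detection (P, R, F1):", fail)
print("pass-detection (P, R, F1):", passed)
```

The toy judge reproduces the text-only pattern in miniature: perfect precision on the rare occasion it flags a failure, very low recall, and a pass-detection F1 of zero because it never leaves the middle of the scale.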

    Result: a single heatmap of judge bias per noise source

    One of the most useful artifacts the SDJ cascade gives you is a single picture that shows which of the seven known judge-bias sources each model family is most sensitive to. For each (model, perturbation level) cell, we average the absolute score shift from the anchor across all items:

    Image by author

    A few things stand out:

    • Text-only Claude is uniformly fragile. Rubric paraphrase, criterion reorder, score-ID swap, and temperature each shift its mean by ~0.4 on a 1–5 scale. That matches the diffusion framing’s prediction that compressed, hedging judges drift the most when you perturb the prompt.
    • The open-source text ensemble is roughly 3× more robust across every column, maxing out at |Δ| = 0.15 for score-ID swap.
    • Qwen2.5-VL is dominated by one specific bias: score-ID format swap (Arabic → Roman → A–E) shifts its mean by 0.44. Knowing which bias matters most is itself actionable: lock the score format in production prompts for this judge.

    This is exactly the kind of audit-ready, evaluation-of-evaluation artifact a learned-evaluation pipeline needs to ship alongside the headline metrics.

    Result: does the uncertainty have signal?

    The Tweedie reverse step produces a posterior σ at no extra cost. The question is whether that σ carries information. Are the items the cascade marks as uncertain the same items the judge gets wrong? For each vision judge we plot the per-item std of perturbation samples (a proxy for posterior σ) against the absolute error against gold:

    Image by author

    Qwen2.5-VL’s σ has the cleanest signal (r = 0.26 between predicted σ and observed |error|). Items with σ near zero almost never have |error| > 1; items with σ > 0.6 are where the judge’s mean was off by 1–3 score points. That is precisely the regime where the safety-gate diagram says route to human review. We now have empirical evidence that the framework’s own uncertainty estimate identifies these items.
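The check itself is a correlation between two per-item quantities. The sketch below uses the sample standard deviation of each item's readings as the σ proxy and the error of their mean against gold; the data are invented and `uncertainty_signal` is a hypothetical helper, so the resulting number says nothing about any real judge.

```python
import math
from statistics import mean, stdev

def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def uncertainty_signal(samples_per_item, gold):
    """Correlate per-item spread with per-item absolute error against gold."""
    sigmas = [stdev(samples) for samples in samples_per_item]
    errors = [abs(mean(samples) - g) for samples, g in zip(samples_per_item, gold)]
    return sigmas, errors, pearson(sigmas, errors)

# Invented readings: a stable correct item, an unstable wrong item,
# and a mildly unstable, mildly wrong item.
samples_per_item = [[3.0, 3.0, 3.0], [2.0, 3.0, 4.0], [4.0, 4.0, 5.0]]
gold = [3, 1, 4]
sigmas, errors, r = uncertainty_signal(samples_per_item, gold)
print([round(s, 2) for s in sigmas], [round(e, 2) for e in errors], round(r, 2))
```

A positive correlation on real data is what justifies using σ as a routing signal: the items the judge is least stable on are disproportionately the ones it gets wrong.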

    Result: stochastic stability hit the original target

    One goal was to test whether SDJ could expose and reduce stochastic instability. I ran Qwen2.5-VL-7B across 5 random seeds at two temperatures on all 100 vision items:

    | Temperature | Median per-item std | Mean per-item std | Fraction of items with std ≤ 0.15 |
    |---|---|---|---|
    | T = 0.6 (noisy single-judge baseline) | 0.40 | 0.40 | 31% |
    | T = 0 (deterministic floor) | 0.00 | 0.024 | 95% |

    The noisy baseline matched the expected instability almost exactly: about 0.40 per-item standard deviation, consistent with the literature. At T = 0, 95% of items sat at or below the original design target of 0.15.
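The three columns of the stability table can be reproduced from a list of per-item score lists, one score per seed. The scores below are invented, `stability_summary` is a hypothetical helper, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, median, stdev

def stability_summary(scores_by_item, target=0.15):
    """Median and mean per-item std, plus the share of items within the target."""
    stds = [stdev(scores) for scores in scores_by_item]
    within = sum(1 for s in stds if s <= target)
    return {
        "median_std": median(stds),
        "mean_std": mean(stds),
        "frac_within_target": within / len(stds),
    }

# Invented scores for three items, five seeds each.
scores_by_item = [
    [3, 3, 3, 3, 3],   # perfectly stable
    [2, 2, 2, 2, 2],   # perfectly stable
    [2, 3, 3, 3, 4],   # flips by a full point across seeds
]
summary = stability_summary(scores_by_item)
print(summary)
```

Reporting the median next to the mean is useful for exactly this kind of distribution: most items are rock-stable and a few unstable ones carry nearly all of the variance.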

    Image by author

    In the diffusion framing, temperature is one of the forward noise sources. I’m not saying every production judge should run at T = 0 forever. The point is that a judge harness should measure this instability explicitly instead of pretending the score is deterministic, and report a posterior σ alongside the point estimate.

    Result: conformal coverage matches the calibration target

    The conformal layer aims for empirical coverage ≥ 1 − α with α = 0.10. Across the three runs with enough items for stable split-conformal calibration:

    | Run | n_test | Empirical coverage | Target | Mean interval width |
    |---|---|---|---|---|
    | Claude TEXT-only | 80 | 0.950 | 0.900 | 4.51 |
    | Open-source TEXT | 80 | 1.000 | 0.900 | 4.50 |
    | Claude VISION | 20 | 1.000 | 0.900 | 3.50 |

    All three are above target. Per-bin coverage (fail / mid / pass tiers) is also above 0.92 in every cell. The intervals are 3.5–4.5 score units wide on a 1–5 scale. That’s the price of full coverage on a heterogeneous calibration set.
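Empirical coverage and mean width are simple to compute once every test item has an interval. In this sketch the width is taken as upper minus lower on invented intervals; that convention is an assumption, and the author's may differ (for example in how interval endpoints are snapped to the ordinal scale).

```python
from statistics import mean

def coverage_and_width(gold, intervals):
    """Share of gold scores inside their interval, and the mean interval width."""
    covered = sum(1 for g, (lo, hi) in zip(gold, intervals) if lo <= g <= hi)
    return covered / len(gold), mean(hi - lo for lo, hi in intervals)

# Invented test split: four gold scores with their calibrated intervals.
gold = [1, 3, 5, 2]
intervals = [(1, 2), (2, 4), (3, 4), (1, 5)]
coverage, width = coverage_and_width(gold, intervals)
print(coverage, width)
```

The two numbers should always be read together: an interval that spans the whole scale covers every gold score but supports no routing decision.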

    Limitations

    A few caveats worth flagging up front. The gold labels are high-confidence Lingo-Judge classifier outputs, not a Tier-3 human-adjudicated set. The CODA-LM-style corner-case stress split (cut-ins at night, occluded VRUs, ambiguous near-misses) is not included yet. The best ECE measured is 0.084, not the original 0.05 target; a post-hoc isotonic or Platt calibration on the calibration split would almost certainly close that gap. The current VLM ensemble has two committed judges (Qwen2.5-VL and InternVL2-8B) rather than the planned three. The vision runs are smaller than the text-only runs (100 items vs 200) because of available frame maps, so cross-modality numbers are model-level summaries rather than strictly per-item comparable. And the denoising step is a single-step analytical Tweedie correction rather than a multi-step learned sampler. These are honest limitations, but they also map directly onto the roadmap below.

    Conclusion

    The most important lesson from this project is not that one model beat another. It’s that the metric you optimize for during eval-of-eval determines the judge you ship.

    If I had optimized for Pearson r alone, I would have shipped a text-only Claude judge that barely used the ordinal scale and caught 2% of safety-critical failures. Using the full eval-of-eval table (ordinal κ, fail-detection F1, calibration, stochastic stability) flipped the decision to an open VLM judge with calibrated uncertainty and a routing rule that sends ambiguous cases to humans. Same data, different metric, different judge in production.

    That’s the difference between an eval that looks good in a benchmark table and an eval that can support real engineering decisions.

    For learned evaluation systems, especially in autonomous driving, robotics, and healthcare, we should stop treating evaluator scores as ground truth. They are measurements. Measurements have noise. Noise has structure. If we can measure that structure, we can build better evaluators.

    That’s what DiffuJudge-AV tries to do: make the failure modes of the evaluator visible before the evaluator becomes part of the production decision loop. Wang Lun’s essay closes with a line worth quoting: “When you can evaluate correctly, you can train correctly.” This work is one small contribution toward that ambition.

    Future work

    The roadmap follows directly from the limitations above: a small Tier-3 expert-anchored golden set on the 50 hardest LingoQA items, a CODA-LM corner-case stress split, a third specialized VLM judge (LLaVA-Critic-7B, or NVILA-8B once vLLM lands the architecture), a learned Tweedie MLP that replaces the analytical Gaussian KDE with a small denoiser trained on perturbation-level / judge-family / item-embedding features, a post-hoc isotonic calibration layer to close the remaining ECE gap, and a NIM-served production ensemble with A/B comparison tooling and model versioning.

    References

    • Shi, Y. et al. Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge. IJCNLP-AACL 2025.
    • Chen, X. et al. Evaluating Scoring Bias in LLM-as-a-Judge. arXiv 2506.22316, 2025.
    • Thakur, A. et al. Rating Roulette: Self-Inconsistency in LLM-as-a-Judge. arXiv 2510.27106, 2025.
    • SPUQ: Semantically-Perturbed Uncertainty Quantification. arXiv 2403.02509, 2024.
    • Sheng, H. et al. Analyzing Uncertainty of LLM-as-a-Judge via Conformal Prediction. EMNLP 2025 / arXiv 2509.18658.
    • Ye, S. et al. CALM: A Reasoning-Calibrated Multi-Step Eval-of-Eval Framework. ICLR 2025.
    • Wang, L. Your Evals Will Break and You Won’t See It Coming, blog, 2025.
    • Robbins, H. An empirical Bayes approach to statistics. Berkeley Symp. 1956 (Tweedie’s formula).
    • Manor, H. & Michaeli, T. On the Posterior Distribution in Denoising: Application to Uncertainty Quantification. ICLR 2024 / arXiv 2309.13598.
    • Marcu, A. et al. LingoQA: Visual Question Answering for Autonomous Driving. ECCV 2024 (Wayve).
    • Sima, C. et al. DriveLM: Driving with Graph Visual Question Answering. ECCV 2024.
    • Liu, Z. et al. NVILA: Efficient Frontier Visual Language Models. NVIDIA, arXiv 2412.04468, 2024.
    • Lin, J. et al. VILA: On Pre-training for Visual Language Models. NeurIPS 2024 / arXiv 2312.07533 (NVIDIA).
    • Wang, S. et al. OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning, and Planning. NVIDIA, arXiv 2405.01533, 2024.
    • Mao, J. et al. A Survey on Multimodal Large Language Models for Autonomous Driving. NVIDIA / Tsinghua, arXiv 2311.12320, 2023.
    • Najm, W. G. et al. Pre-Crash Scenario Typology for Crash Avoidance Research. NHTSA / Volpe, 2007.
    • ASAM e.V. OpenSCENARIO 1.x specification. 2022–2024.


