    Industry-standard LLM benchmarks in DataRobot

    By Editor Times Featured, May 27, 2026


    Every LLM deployment has a ceiling, a latency curve, and a unit cost. Most teams operate blindly, discovering their deployment limits only when over-provisioning exhausts their GPU budget or peak traffic causes a catastrophic failure.

    Three numbers matter: maximum sustained concurrency before GPU saturation, end-to-end latency at that concurrency, and cost per million tokens at sustained load. These metrics emerge from how the model interacts with your hardware, runtime, tokenizer, and traffic mix.

    DataRobot 11.8 changes that with LLM Profiling Jobs: a native integration of NVIDIA AIPerf, the industry-standard generative AI benchmarking tool. One authenticated POST benchmarks any DataRobot LLM deployment served through an OpenAI-compatible web server, sweeps the concurrency range and use cases you define, and returns the empirical inputs to Quota Reservations (available in DataRobot 11.9).

    Why LLM capacity is hard to predict

    LLM inference doesn’t scale linearly. Compute and memory demands per request depend dynamically on prompt length, response length, sampling parameters, and KV cache utilization. A deployment that serves 50 short chat turns per second can stall at 5 long-context RAG requests per second on the same hardware. Four distinct behaviors make static or speculative capacity estimates unreliable:

    • Latency is non-linear in concurrency. Time to first token and inter-token latency stay roughly flat across a wide concurrency range, then rise sharply once GPU memory bandwidth or compute saturates. TTFT rises when prefill compute saturates; inter-token latency rises when decode memory bandwidth saturates. Which one bites first depends on the workload mix and the deployment’s GPU configuration (single card or a cluster). The saturation knee is the operating point that matters, and it can’t be inferred from a single low-load measurement.
    • Throughput and latency trade off. You can squeeze more total tokens per second out of a deployment by running it at higher concurrency, at the cost of slower per-user response. The right trade-off depends on your SLO, not on a generic recommendation.
    • Use case mix matters. Two deployments running the same model on the same hardware can have very different capacity if one serves short Q&A and the other serves long-context summarization. The mix has to be in the test, or the test is wrong.
    • Caching and routing change the answer. Prefix caching (common in agentic coding with periodic compaction) and KV-aware routing can lift effective throughput dramatically. Profiles run against a cold deployment with random inputs represent the floor, not the ceiling.

    LLM Profiling Jobs make these curves visible.

    How LLM benchmarks help

    • Defend capacity and quota decisions with measured data. When finance questions a four-H100 footprint, or when cross-functional teams negotiate shared capacity, you can justify the architecture with empirical profiling data. Saturation knee, SLO target, and forecast traffic make GPU sizing an evidence-based line item. The same numbers feed Quota Reservations directly.
    • Account for cost per consumer. Total token throughput plus the GPU instance price gives a cost-per-million-tokens figure that supports chargeback or showback. Attribute spend to consumers proportionally to their reservations, not by guesswork.
    • Compare models and hardware on equal terms. Hold the workload profile constant and vary one dimension at a time: the same model on different GPU configurations (a B200 node vs a B300 node, or 4×H100 vs 8×H100), or different models on the same configuration (Qwen3.6 35B-A3B MoE vs Qwen3.6 27B dense). Because AIPerf metrics match NVIDIA’s published NIM benchmarks, the numbers are also directly comparable to public benchmarks for the same model and hardware combinations. This is the right input for procurement and capacity-sizing decisions before a hardware order.
    • Prove a change is safe before you ship it. Before a model upgrade, vLLM bump, driver swap, or GPU migration, rerun the same profile and compare against the prior baseline. Regressions show up in the metrics, not in incident reports.
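The baseline comparison in the last bullet can be sketched in a few lines. This is a minimal illustration, not part of DataRobot or AIPerf: the function name `find_regressions`, the 5% tolerance, and the "after upgrade" numbers are all assumptions made up for the example. The baseline values are the averages from the sample payload shown later in this article.

```python
# Hypothetical helper: flag metrics that got worse between two profiling runs.
# Latency metrics are better when lower; throughput metrics are better when higher.
LOWER_IS_BETTER = {"time_to_first_token", "inter_token_latency"}


def find_regressions(baseline, candidate, tolerance=0.05):
    """Return {metric: relative_change} for metrics that worsened beyond tolerance."""
    regressions = {}
    for name, old in baseline.items():
        new = candidate[name]
        change = (new - old) / old  # signed relative change vs. baseline
        if name in LOWER_IS_BETTER:
            worse = change > tolerance
        else:
            worse = change < -tolerance
        if worse:
            regressions[name] = round(change, 3)
    return regressions


# Averages from the article's sample estimatedCapacity payload.
baseline = {
    "time_to_first_token": 833.06,      # ms
    "inter_token_latency": 23.79,       # ms
    "request_throughput": 8.84,         # requests/sec
    "total_token_throughput": 4524.80,  # tokens/sec
}
# Invented numbers standing in for a rerun after an upgrade.
candidate = {
    "time_to_first_token": 990.0,
    "inter_token_latency": 23.5,
    "request_throughput": 8.9,
    "total_token_throughput": 4100.0,
}

print(find_regressions(baseline, candidate))
```

With these invented numbers, TTFT and total token throughput are flagged while the other two metrics stay within tolerance. A real check would probably compare percentiles as well as averages.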

    What LLM benchmark metrics mean

    The four headline metrics AIPerf returns map directly to user experience and to GPU economics:

    • Time to first token (TTFT, ms). Measures how long a user waits between submitting a prompt and seeing the first character; this metric is dominated by prefill compute.
    • Inter-token latency (ITL, ms). Average time between successive output tokens once generation has started. Sets the perceived “typing speed” of the response.
    • Request throughput (requests/sec). Complete request-and-response cycles per second at the tested concurrency. The basis for the Capacity (RPM) value in Quota Reservations.
    • Total token throughput (tokens/sec). Total tokens (input plus output) processed per second across all concurrent requests. The basis for cost-per-token economics.

    For each metric, AIPerf reports averages and percentiles (p50, p90, p99). When GPU saturation is detected during the sweep, estimatedCapacity reports the iteration immediately before it. When saturation isn’t detected (the common case, since the profiler isn’t co-located with the deployment), estimatedCapacity reports the last iteration tested. Sweep wide enough that the curve clearly bends, or treat the result as a lower bound.

    Submitting a job

    A profiling request takes four parameters: a deploymentId (the ID of the DataRobot LLM deployment you want to profile), a list of concurrency levels to sweep, a request count scalar (how many requests each concurrent worker issues), and one or more use cases. Each use case defines an input sequence length (ISL), an output sequence length (OSL), standard deviations for both, and a weight (prob). Weights across all use cases must sum to 100.

    export DATAROBOT_ENDPOINT="https://app.datarobot.com"
    export DR_API_KEY=""
    export HUGGINGFACE_DR_CRED_ID=""
    export DEPLOYMENT_ID=""
    export CONCURRENCIES="[1,10,50,100]"
    export REQUEST_COUNT_SCALAR=2
    export MODEL_TOKENIZER="openai/gpt-oss-20b"
    export USE_CASES='[{"isl":200,"islStddev":15,"osl":1000,"oslStddev":15,"prob":100}]'
     
    curl -X POST -H "Authorization: Bearer ${DR_API_KEY}" \
         -H "Content-Type: application/json" \
         "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/" \
         -d @- <

    A 202 Accepted response returns the job ID, an execution ID, and a standing ID:

    {
      "id": "69e09f9e25fdfdfab0d27925",
      "jobExecutionId": "69e09f9f25fdfdfab0d27926",
      "statusId": "5633f028-3f68-4f83-bddc-560d266d6bd2"
    }
    

    Monitoring and retrieving LLM benchmark results

    Poll the Status API with the returned statusId. When the job finishes, the API returns 303 See Other and the Location header points to the results endpoint:

    curl -s -L -i \
      -H "Authorization: Bearer ${DR_API_KEY}" \
      "${DATAROBOT_ENDPOINT}/api/v2/status/${STATUS_ID}/"
    

    Fetch the full results with the profiling job ID:

    curl -H "Authorization: Bearer ${DR_API_KEY}" \
         "${DATAROBOT_ENDPOINT}/api/v2/llmProfilingJobs/${LLM_PROFILING_JOB_ID}/profilingResults/"
    

    Example payload (truncated):

    {
      "estimatedCapacity": {
        "metrics": [
          { "name": "request_throughput",     "units": "requests/sec", "measurements": [{ "name": "avg", "value": 8.84    }] },
          { "name": "inter_token_latency",    "units": "ms",           "measurements": [{ "name": "avg", "value": 23.79   }] },
          { "name": "time_to_first_token",    "units": "ms",           "measurements": [{ "name": "avg", "value": 833.06  }] },
          { "name": "total_token_throughput", "units": "tokens/sec",   "measurements": [{ "name": "avg", "value": 4524.80 }] }
        ]
      },
      "results": [ "...per-iteration benchmark data..." ]
    }
    

    estimatedCapacity is the sustained operating point. results contains one entry per concurrency level tested, with the full metric set.
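To work with the headline numbers programmatically, the estimatedCapacity section can be flattened into a simple lookup. The sketch below assumes every metric entry uses the name, units, and measurements keys seen in the first entry of the sample payload; the helper name `average_metrics` is made up for illustration and is not part of any DataRobot client.

```python
import json

# Sample payload from the article, trimmed to the estimatedCapacity section.
PAYLOAD = json.loads("""
{
  "estimatedCapacity": {
    "metrics": [
      {"name": "request_throughput", "units": "requests/sec",
       "measurements": [{"name": "avg", "value": 8.84}]},
      {"name": "inter_token_latency", "units": "ms",
       "measurements": [{"name": "avg", "value": 23.79}]},
      {"name": "time_to_first_token", "units": "ms",
       "measurements": [{"name": "avg", "value": 833.06}]},
      {"name": "total_token_throughput", "units": "tokens/sec",
       "measurements": [{"name": "avg", "value": 4524.80}]}
    ]
  }
}
""")


def average_metrics(payload):
    """Map each metric name to its (average value, units) pair."""
    averages = {}
    for metric in payload["estimatedCapacity"]["metrics"]:
        for measurement in metric["measurements"]:
            if measurement["name"] == "avg":
                averages[metric["name"]] = (measurement["value"], metric["units"])
    return averages


for name, (value, units) in average_metrics(PAYLOAD).items():
    print(f"{name}: {value} {units}")
```

The same loop could collect p50, p90, or p99 by matching a different measurement name, assuming the percentiles appear under measurements in the full payload.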

    Reading the curve

    The estimated-capacity numbers tell you the sustained ceiling. The per-iteration results show you how the deployment behaves as load climbs toward that ceiling. The table below is an illustrative example.

    Concurrent requests | TTFT (ms) | Total throughput (tokens/sec) | Note
    1                   | ~150      | ~600                          | Low load, near-floor latency
    10                  | ~250      | ~2,500                        | Throughput scales nearly linearly
    50                  | ~800      | ~4,500                        | estimatedCapacity returned from this iteration
    100                 | ~1,500    | ~4,600                        | Saturated: TTFT roughly doubles, throughput plateaus

    When AIPerf detects GPU saturation during the sweep, it identifies the iteration before it (concurrency 50 here) and returns those metrics as estimatedCapacity. When saturation isn’t detected, estimatedCapacity is simply the last iteration tested, which is why the sweep needs to extend past the knee. Anything past that point trades user-perceived latency for marginal throughput gains. If the product spec requires TTFT under 1 second, the curve shows the deployment supports up to roughly 50 concurrent requests with margin: provision GPU so peak concurrent demand stays at or below that level.
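The reading above can be reproduced from the per-iteration numbers. This sketch uses the illustrative table values with the "~" dropped. The function names are hypothetical, and this is not how AIPerf itself detects saturation; it only shows how to pick the highest tested concurrency that meets a TTFT target and how small the throughput gain becomes past the knee.

```python
# (concurrency, ttft_ms, total_tokens_per_sec) from the illustrative table.
CURVE = [
    (1, 150, 600),
    (10, 250, 2500),
    (50, 800, 4500),
    (100, 1500, 4600),
]


def max_concurrency_within_slo(curve, ttft_slo_ms):
    """Highest tested concurrency whose TTFT meets the SLO, or None."""
    passing = [conc for conc, ttft, _ in curve if ttft <= ttft_slo_ms]
    return max(passing) if passing else None


def throughput_gains(curve):
    """Relative throughput gain at each step up in concurrency."""
    gains = []
    for (_, _, prev_tps), (conc, _, tps) in zip(curve, curve[1:]):
        gains.append((conc, round((tps - prev_tps) / prev_tps, 3)))
    return gains


print(max_concurrency_within_slo(CURVE, ttft_slo_ms=1000))  # 50
print(throughput_gains(CURVE))  # [(10, 3.167), (50, 0.8), (100, 0.022)]
```

Going from 50 to 100 concurrent requests adds about 2% throughput while TTFT nearly doubles, which is the plateau the table describes. The result only covers tested points, so a narrower follow-up sweep around 50 would pin the limit down more precisely.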

    From profiling result to Quota Reservations config

    The bridge from a profiling run to a Quota Reservations configuration is direct:

    Quota setting | Where it comes from | Example (from sample above)
    Capacity (RPM) | estimatedCapacity.request_throughput × 60 | 8.84 req/sec × 60 ≈ 530 RPM
    Utilization Threshold | Pick 70–80% of Capacity so enforcement engages before the saturation knee | 80% → enforcement at ~424 RPM
    Reserved % per consumer | Sized to the minimum each priority consumer needs during contention | 30% Production Agent A, 20% Agent B, 30% Agent C, 20% unreserved pool
    Refill rate | Capacity / 60 (requests per second) | 530 / 60 ≈ 8.83 req/sec

    For a primer on how Capacity, Utilization Threshold, and Reserved % interact under load, see Rate Limiting vs. Quota Reservations.
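The arithmetic in the table can be written out as a small calculation. This is a sketch only: the function and dictionary keys below are illustrative names, not Quota Reservations API fields, and the rounding choices are assumptions picked to match the table's examples.

```python
def quota_settings(request_throughput_rps, threshold_fraction, reserved_pct):
    """Derive illustrative quota values from a profiled request throughput."""
    reserved_total = sum(reserved_pct.values())
    if reserved_total > 100:
        raise ValueError(f"reservations add up to {reserved_total}%, over 100%")
    capacity_rpm = round(request_throughput_rps * 60)
    return {
        "capacity_rpm": capacity_rpm,
        "enforcement_rpm": round(capacity_rpm * threshold_fraction),
        "refill_per_sec": round(capacity_rpm / 60, 2),
        # Guaranteed requests per minute for each consumer during contention.
        "reserved_rpm": {
            name: capacity_rpm * pct / 100 for name, pct in reserved_pct.items()
        },
        "unreserved_pct": 100 - reserved_total,
    }


settings = quota_settings(
    request_throughput_rps=8.84,  # estimatedCapacity request_throughput (avg)
    threshold_fraction=0.80,      # upper end of the 70-80% guidance
    reserved_pct={"Production Agent A": 30, "Agent B": 20, "Agent C": 30},
)
print(settings["capacity_rpm"])     # 530
print(settings["enforcement_rpm"])  # 424
print(settings["refill_per_sec"])   # 8.83
print(settings["reserved_rpm"])
print(settings["unreserved_pct"])   # 20
```

Rerunning the calculation after each profiling job keeps the quota configuration tied to the latest measured capacity.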

    A worked cost example

    Take the sample result: 4,524 total tokens per second sustained (input plus output). That’s roughly 16.3 million tokens per hour from one deployment.

    If the underlying GPU instance costs $X per hour, the cost per million tokens is $X / 16.3. For an instance at $4 per hour, that’s about $0.25 per million tokens. For $12 per hour, about $0.74. To calculate cost per million output tokens (the standard benchmark for public API comparisons), divide the cost per million total tokens by the workload’s output share. For example, given an ISL of 200 and an OSL of 1000, output accounts for roughly 83% of total tokens. At a $4 hourly instance price, this translates to roughly $0.30 per million output tokens.
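The same calculation as code, for checking the figures or plugging in your own instance price. The function names are illustrative. Without intermediate rounding, the output-token figure comes out nearer $0.29, consistent with the "roughly $0.30" above.

```python
def cost_per_million_tokens(tokens_per_sec, instance_cost_per_hour):
    """Cost per million total tokens (input plus output) at sustained load."""
    million_tokens_per_hour = tokens_per_sec * 3600 / 1_000_000
    return instance_cost_per_hour / million_tokens_per_hour


def cost_per_million_output_tokens(cost_per_million_total, isl, osl):
    """Scale the total-token cost by the share of tokens that are output."""
    output_share = osl / (isl + osl)
    return cost_per_million_total / output_share


TOKENS_PER_SEC = 4524.80  # total_token_throughput from the sample result

at_4 = cost_per_million_tokens(TOKENS_PER_SEC, 4.0)
at_12 = cost_per_million_tokens(TOKENS_PER_SEC, 12.0)
output_at_4 = cost_per_million_output_tokens(at_4, isl=200, osl=1000)

print(f"${at_4:.2f} per million tokens at $4/hour")    # $0.25
print(f"${at_12:.2f} per million tokens at $12/hour")  # $0.74
print(f"${output_at_4:.2f} per million output tokens at $4/hour")  # $0.29
```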

    Every benchmark run gives you a fresh, accurate cost-per-token figure for the exact model, hardware, and quantization combination you’re running. After a vLLM upgrade or a hardware swap, re-run the same profile and confirm your unit economics improved instead of trusting a vendor claim. This is the foundation for per-token and per-agent cost transparency in chargeback.

    Choosing your inputs

    A useful profile starts with two questions: what concurrency range do you expect in production, and what does your traffic actually look like?

    • Concurrencies to sweep. Start wide ([1, 10, 50, 100]) to locate the saturation knee, then narrow (such as [40, 50, 60, 70]) for an SLO-grade reading around that point.
    • Request count scalar. Set it high enough that each iteration runs long enough to smooth out noise. A scalar of 2 is a reasonable starting point. Raise it if variance looks high.
    • Use cases. Match your real traffic mix. If you serve 70% short chat turns (ISL 200, OSL 300) and 30% long-context RAG (ISL 4000, OSL 800), define two use cases with prob: 70 and prob: 30. Testing a blended traffic mix exposes tail-latency behavior (such as p99 spikes) that a single-use-case average obscures.
    • Tokenizer. Set it explicitly. The benchmark depends on accurate token counts, so the matching tokenizer is part of a correct measurement.
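Before submitting, it can help to sanity-check the use case mix locally. The sketch below builds the 70/30 example from the list above using the isl, osl, and prob keys from the earlier USE_CASES variable. The standard-deviation fields are left out because the article gives no values for this mix, and the helper names are hypothetical.

```python
# The 70/30 traffic mix from the bullet above (stddev fields omitted).
USE_CASES = [
    {"isl": 200, "osl": 300, "prob": 70},   # short chat turns
    {"isl": 4000, "osl": 800, "prob": 30},  # long-context RAG
]


def validate_mix(use_cases):
    """Raise if the weights do not sum to 100, as the API requires."""
    total = sum(uc["prob"] for uc in use_cases)
    if total != 100:
        raise ValueError(f"use case weights sum to {total}, expected 100")


def expected_lengths(use_cases):
    """Weighted-average input and output sequence lengths for the mix."""
    isl = sum(uc["isl"] * uc["prob"] for uc in use_cases) / 100
    osl = sum(uc["osl"] * uc["prob"] for uc in use_cases) / 100
    return isl, osl


validate_mix(USE_CASES)
mean_isl, mean_osl = expected_lengths(USE_CASES)
print(mean_isl, mean_osl)  # 1340.0 450.0
```

The blended averages (1,340 input and 450 output tokens per request) look nothing like either individual use case, which is one reason a single averaged use case can hide tail behavior.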

    Operational notes

    • Profiling generates synthetic load. Run jobs against a non-production LLM deployment or during a maintenance window.
    • Because the traffic is synthetic, prefill cache hits won’t appear in token metrics.
    • Profiling treats the deployment as a black box. Whether the deployment runs on one GPU or many, and whatever combination of tensor, pipeline, data, or expert parallelism it uses, the profile measures the externally observable result.
    • Jobs can be canceled with a DELETE to the profiling job ID. Cancellation is best-effort and may not stop a run that’s nearly complete.
    • Before you submit, store your Hugging Face token in DataRobot Credential Management as an “API Token (API Key)” credential. AIPerf uses it to fetch the model tokenizer, and the stored credential prevents rate-limit errors.

    Get access

    LLM Profiling Jobs are in private preview in DataRobot 11.8. To enable the feature for your tenant, contact your DataRobot account team. They will turn on the Enable Dynamic Quota Capacity Profiling feature flag (the internal name for LLM Profiling Jobs) and configure the profiling job image on your cluster.
