Recently, Microsoft released its newest Healthcare AI paper, Sequential Diagnosis with Language Models, and it shows immense promise. They label it "The Path to Medical Superintelligence". Are doctors going to be overtaken by AI? Is this truly a revolutionary advancement in our field? Although the paper has only just been submitted for review and may need further experimentation, this article will go over the main points of the paper and offer some discussion of its limitations.
The headlines are eye-popping: a method that raises AI diagnostic performance to 80% (as measured by Microsoft's new SDBench benchmark). Let's see how that happens.
For a brief summary: the researchers created a new benchmark, SDBench, based on clinical cases. Unlike most benchmarks, performance is measured by both diagnostic accuracy and the total cost incurred to reach the diagnosis. The contribution is not a new AI model but an AI orchestrator called the MAI Diagnostic Orchestrator, or MAI-DxO (which we will discuss in more detail later). This orchestration is model-agnostic, and many experimental variants were run to trace out the cost-accuracy Pareto frontier. The final results cite physicians at 20% accuracy and MAI-DxO at 80%. However, these percentages don't necessarily tell the whole story.
What is Sequential Diagnosis?
To start, the paper is titled Sequential Diagnosis with Language Models. So what exactly is sequential diagnosis? When patients arrive at a doctor's office, they recite their history to give the doctor context. Through iterative questioning and testing, doctors can then narrow down their hypotheses toward a diagnosis. The paper cites several considerations during sequential diagnosis that later come into play in development: asking informative questions, balancing diagnostic yield and cost against patient burden, and knowing when to commit to a confident diagnosis [1].
SDBench
The Sequential Diagnosis Benchmark (SDBench) is a novel benchmark introduced by Microsoft Research. Prior to this paper, most medical benchmarks consisted of multiple-choice questions and answers. Google famously used MedQA, made up of US Medical Licensing Examination (USMLE) style questions, in the development of its medical LLM, Med-PaLM 2 (you may remember the headlines Med-PaLM initially made as the medical LLM that passed the USMLE [2]). Such a Q&A benchmark seems appropriate, since doctors are licensed through the USMLE's multiple-choice questions. However, there is an argument that these questions test some level of memorization rather than deep understanding. In an age when LLMs are notorious for memorization, this is not necessarily the best benchmark.
To counter this, SDBench comprises 304 New England Journal of Medicine (NEJM) clinicopathological conference (CPC) cases published between 2017 and 2025 [1]. It is designed to mimic the iterative process a human physician goes through to diagnose a patient. In these scenarios, an AI model (or human physician) starts with a patient's initial history and must iteratively make decisions to narrow in on a diagnosis. In this setup, the decision-making model is called the diagnostic agent, and the model revealing information is called the gatekeeper agent. We will discuss these agents further in the next sections.
Another novel part of SDBench is its consideration of cost. Any diagnosis could be far more accurate with unlimited money and resources for unlimited tests, but that is unrealistic. Therefore, every question asked and every test ordered incurs a simulated financial cost, mirroring real-world healthcare economics via Current Procedural Terminology (CPT) codes. This means AI performance is evaluated not only on diagnostic accuracy (comparing the final diagnosis to the NEJM's gold standard) but also on the ability to reach that diagnosis cost-effectively.
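As a rough sketch of how this kind of accounting could work, consider the toy tracker below. The flat question fee and per-test prices here are illustrative assumptions, not the paper's actual fee schedule:

```python
# Illustrative sketch of SDBench-style cost accounting.
# The dollar values and test names are made up for demonstration.

QUESTION_COST = 300  # assumed flat simulated cost per question to the patient
TEST_COSTS = {       # assumed simulated prices, stand-ins for CPT-coded fees
    "cbc": 116,
    "chest_xray": 246,
    "mri_brain": 1590,
}

class CostTracker:
    """Accumulates the simulated cost of a diagnostic session."""

    def __init__(self):
        self.total = 0

    def ask_question(self):
        self.total += QUESTION_COST

    def order_test(self, name):
        self.total += TEST_COSTS[name]

tracker = CostTracker()
tracker.ask_question()
tracker.order_test("cbc")
tracker.order_test("chest_xray")
print(tracker.total)  # 300 + 116 + 246 = 662
```

Accuracy alone no longer wins the benchmark; an agent that reaches the right diagnosis after a battery of MRIs sits at a worse point on the cost-accuracy curve than one that gets there with a blood panel.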
Judging the Diagnosis with SDBench
The natural question that arises is, "How exactly are these diagnoses evaluated for correctness within the SDBench framework?" This is not straightforward, as diseases often go by multiple names, making direct string matching unreliable. To address this, the Microsoft researchers created a judge agent.
The full diagram of everything just described for SDBench is shown in Figure 1.
Agents and AI
The most important thing to remember is that MAI-DxO is model-agnostic. It is an AI orchestrator. Perhaps not a well-known term, but Microsoft defines it for us: "In the context of generative AI, an orchestrator is like a digital conductor helping to coordinate multiple steps in achieving a complex task. In healthcare, the role of orchestration is critical given the high stakes of every decision" [3]. Therefore, any model can be plugged in as the agents. This is great because the system does not go out of date every time a new model comes out. A full diagram of MAI-DxO is shown in Figure 3.

Earlier, it was mentioned that there are three agents present: diagnostic, gatekeeper, and judge. It is interesting to think of the diagnostic and gatekeeper agents as functioning like a GAN (Generative Adversarial Network), where the diagnostic agent tries to improve while being constrained by the information the gatekeeper releases. Let's examine the agents further.
Diagnostic Agent
For the diagnostic agent, the language model orchestrates five distinct roles at once. The paper does not state exactly how each role is implemented, but each is likely a specialized prompt or a fine-tuned LLM dedicated to that task. The five roles are as follows:
- Dr. Hypothesis – Maintains the top 3 most likely diagnoses at each step of the sequential diagnosis
- Dr. Test-Chooser – Selects up to 3 diagnostic tests at each time step to discriminate between the candidate diagnoses
- Dr. Challenger – Acts as the devil's advocate, attempting to undermine the current candidate diagnoses
- Dr. Stewardship – Focuses on cost, minimizing spend while maximizing diagnostic yield
- Dr. Checklist – Performs quality control for the entire diagnostic agent, ensuring valid and consistent outputs
The five-role panel must agree on one of three decisions after each time step of the sequential diagnosis: ask a question, order a diagnostic test, or produce a final diagnosis. A final diagnosis is issued once the agent reaches a confidence threshold. There is also an optional budget tracker if the agent is to take cost into account.
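A single decision step of this panel can be sketched as follows. This is a minimal illustration, not the paper's implementation: the inputs stand in for what the role prompts would produce, and the 0.85 threshold is an arbitrary assumed value.

```python
# Minimal sketch of one MAI-DxO-style panel decision step.
# The role outputs (hypotheses, candidate tests, budget verdict) would come
# from prompted LLM calls in the real system; here they are plain inputs.

def diagnostic_step(hypotheses, candidate_tests, within_budget,
                    confidence_threshold=0.85):
    """Return one of the three actions the panel can agree on each step."""
    # Dr. Hypothesis supplies a ranked differential; diagnose once confident.
    if hypotheses and hypotheses[0]["confidence"] >= confidence_threshold:
        return ("diagnose", hypotheses[0]["name"])
    # Dr. Test-Chooser proposes tests; Dr. Stewardship vetoes if over budget.
    if candidate_tests and within_budget:
        return ("order_test", candidate_tests[0])
    # Otherwise keep gathering history with another question.
    return ("ask_question", None)

# Early in a case: confidence is low, so the panel orders a test.
action = diagnostic_step(
    hypotheses=[{"name": "sarcoidosis", "confidence": 0.4}],
    candidate_tests=["chest_xray"],
    within_budget=True,
)
print(action)  # ('order_test', 'chest_xray')
```

The point of the structure is that no single role decides alone: the differential, the test choice, the challenge, and the budget check all feed one action per step.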
Gatekeeper Agent
The gatekeeper agent was implemented using OpenAI's o4-mini. The agent's guidelines and responses were created and guided by physicians, so the gatekeeper would not reveal more information than an actual patient or diagnostic test would. The paper notes that educational case write-ups can carry hindsight bias, and the gatekeeper is designed to prevent this from leaking through.
An initial bias present in the system was that the diagnostic agent would prompt for information not contained in the case. When the gatekeeper returned "Not Responding", the diagnostic agent would gain hints. This is loosely comparable to mode collapse in GANs, where the generator learns to exploit the adversarial network. To correct this, the gatekeeper instead returns synthetic data based on what the query or test would most likely yield.
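The gatekeeper's behavior reduces to a simple rule, sketched below. The `synthesize_finding` helper is a hypothetical stand-in for an LLM call, and the example facts are invented:

```python
# Sketch of gatekeeper behavior: reveal a case fact when it exists; otherwise
# return a plausible synthetic finding instead of a tell-tale refusal.

def synthesize_finding(query):
    """Hypothetical stand-in for an LLM generating a plausible result."""
    return f"Within normal limits for {query}."

def gatekeeper_answer(query, case_facts):
    if query in case_facts:
        return case_facts[query]       # ground-truth detail from the case
    return synthesize_finding(query)   # synthetic data, no "not available" hint

facts = {"temperature": "38.9 C on admission"}
print(gatekeeper_answer("temperature", facts))  # 38.9 C on admission
print(gatekeeper_answer("d-dimer", facts))      # Within normal limits for d-dimer.
```

Because the diagnostic agent can no longer distinguish "this fact was never recorded" from "this test came back unremarkable", the refusal signal stops leaking information.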
Judge Agent
The judge agent is implemented with OpenAI's o3 model and follows the rubric shown in Figure 2, which was based on HealthBench, released by OpenAI in May 2025 [4]. To count as a correct diagnosis, the researchers stated that the judge agent must assign a score of at least 4.

To verify that the judge agent was doing its job correctly, the diagnoses produced by the diagnostic agent were also graded by physicians. There was disagreement in only a few cases, and in the majority of those, the judge was found to be overly strict.
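The pass/fail rule itself is trivial to write down. The scores below are made-up examples; only the threshold of 4 comes from the paper:

```python
# The judge's cutoff: a rubric score of at least 4 counts as a correct
# diagnosis. Example scores are invented for illustration.

PASS_THRESHOLD = 4

def is_correct(rubric_score):
    return rubric_score >= PASS_THRESHOLD

scores = [5, 4, 3, 2]  # e.g. exact match, clinically equivalent, close, wrong
print([is_correct(s) for s in scores])  # [True, True, False, False]
```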
Experimentation
Prior to training, the 56 most recent cases in the dataset were held out for testing, and the rest were used for training. Across the different agents, Microsoft tested many different foundation models: GPT-3.5-turbo, GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-4.1-nano, o3, o4-mini, Claude 4 Sonnet, Claude 4 Opus, Gemini 2.5 Pro, Gemini 2.5 Flash, Grok-3, Grok-3-mini, Llama 4 Maverick, and DeepSeek-R1.
As an aside, the models were prompted using XML formatting, which, along with JSON prompting, has recently emerged as one of the most effective ways to prompt LLMs. XML formatting seems to be most popular for Claude models.
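To make the XML-formatting idea concrete, here is what such a prompt might look like. The tag names and wording are my own assumptions for illustration, not the paper's actual prompts:

```python
# Illustrative XML-structured prompt of the general style described.
# Tag names and instructions are assumed, not taken from the paper.

case_history = "A 29-year-old woman presents with fever and a sore throat."

prompt = f"""<role>
You are one persona on a virtual diagnostic panel.
</role>
<case_history>
{case_history}
</case_history>
<instructions>
Propose the top 3 most likely diagnoses, each with a confidence estimate.
</instructions>"""

print(prompt.startswith("<role>"))  # True
```

The tags give the model unambiguous boundaries between the role, the evidence, and the task, which is the usual argument for XML over free-form prose prompts.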
In testing the accuracy-cost results on SDBench, five main variants were experimented with:
- Instant Answer – The diagnosis must be produced solely from the initial presentation of the patient (no follow-up questions or tests allowed)
- Question Only – The diagnostic agent can ask questions but cannot order tests
- Budgeted – Implements a budgeting system in which tests can be canceled once their cost is revealed
- No Budget – Exactly as it sounds; there is no budget consideration
- Ensemble – Similar to model ensembling, with multiple diagnostic agent panels run in parallel
The performance of each variant will be shown in the results, and the outcomes are similar to what you would expect in traditional machine learning with different data stratification, constraints, and model ensembling.
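For the Ensemble variant in particular, the aggregation can be pictured as a majority vote over parallel panels. The paper may aggregate differently; this and the example diagnoses are illustrative:

```python
# Sketch of the Ensemble variant: run several independent diagnostic panels,
# then aggregate their final diagnoses. Majority vote is an assumed rule.

from collections import Counter

def ensemble_diagnosis(panel_outputs):
    """Return the most common final diagnosis across parallel panels."""
    votes = Counter(panel_outputs)
    return votes.most_common(1)[0][0]

panels = ["giant cell arteritis", "giant cell arteritis", "takayasu arteritis"]
print(ensemble_diagnosis(panels))  # giant cell arteritis
```

As with classic model ensembling, the tradeoff is accuracy for cost: every extra panel multiplies the simulated spend on questions and tests.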
Results
Now that we have covered the theory of the paper and its agentic setup, we can look at the results. MAI-DxO in its final form has the best diagnostic accuracy when ensembling, and it has the best accuracy at any given budget, as shown in Figure 3. All of the individual LLMs referenced are the result of simply feeding the case to the LLM and asking for a diagnosis.

From this figure, the results look excellent. The Pareto frontier is defined entirely by MAI-DxO configurations. MAI-DxO beats both other models and physicians on diagnostic accuracy and cost alike. This is where the major news headlines about doctors becoming unnecessary in the face of AI supremacy come from. At a similar budget, MAI-DxO is 4 times more accurate than the sampled physicians.
The paper shows a few more figures of results, but for the sake of simplicity, this is the main one. Other results include MAI-DxO boosting the performance of off-the-shelf models and Pareto frontier curves suggesting the model does not purely memorize information.
How Good Are These Results?
You might be wondering whether these results are really that good. Despite the impressive numbers, the researchers do a great job of adding nuance to their results and explaining the drawbacks of the system. Let's go over some of these nuances from the paper.
To start, a patient summary is not usually presented in 2-3 concise sentences. Patients may never directly state their chief complaint, the chief complaint may not be the actual issue, and they may talk for many minutes when giving their initial history. If MAI-DxO were to be used in practice, it would need to be trained to handle all of these scenarios. The patient does not always know what is wrong or how to express it accurately.
In addition, the paper mentions that the NEJM cases used are some of the most challenging cases in existence. Many of the top doctors in the world would not be able to solve them. MAI-DxO performed great on these, but how would it perform on the normal, everyday cases that make up the majority of many doctors' careers? AI agents do not think like us; just because they can solve hard cases does not mean they can solve easier ones. There are also further factors, such as wait times for tests and patient comfort, that play into diagnosis. More results are needed to demonstrate this.
The 20% accuracy for physicians is also a bit misleading. The paper does a good job of discussing this issue in its limitations section. The physicians were not allowed to use the internet while working through the cases. How many times have we heard in school that we will always be able to use the internet in real life? Even doctors need to look things up. With search engines, the physicians would likely have scored far higher on these cases.
Earlier, we discussed how the gatekeeper agent generates synthetic data to prevent the diagnostic agent from gaining hints. The quality of this synthetic data needs further examination. There is still potential for hints to leak from these tests, as we do not actually know the real-world results for them. All this is to say, the approach may not generalize: the diagnostic agent could be led astray by confusing results from an inaccurate diagnostic test it ordered.
What’s the Takeaway?
In the world of Healthcare AI, Microsoft's MAI-DxO is extremely promising. Just a few years ago, it seemed crazy that the world would have AI agents at all. Now, a system can perform sequential medical reasoning and solve NEJM cases while balancing cost and accuracy.
However, this is not without limitations. We must find a true gold standard to compare healthcare AI agents against. If every paper benchmarks physician accuracy a different way, it will be difficult to tell how good AI really is. We also need to determine the most important factors in diagnostics. Are cost and accuracy the only two factors, or should there be more? SDBench seems like a step in the right direction, replacing memorization testing with a test of conceptual reasoning, but there is more to consider.
The headlines all over the news should not scare you. We are still a long way from medical superintelligence. Even if a great system were created, years of validation and regulatory approval would follow. We are still in the early stages, but AI does hold the power to revolutionize medicine.
References
[1] Nori, Harsha, et al. "Sequential Diagnosis with Language Models." arXiv:2506.22405 (June 2025).
[2] Singhal, Karan, et al. "Toward expert-level medical question answering with large language models." Nature Medicine (January 2025).
[3] Microsoft AI. "The Path to Medical Superintelligence." https://microsoft.ai/new/the-path-to-medical-superintelligence/
[4] Arora, Rahul, et al. "HealthBench: Evaluating Large Language Models Towards Improved Human Health." arXiv:2505.08775 (May 2025).

