Multi-Agent Arena: Insights from London Great Agent Hack 2025

People are going to use more and more AI. Acceleration is going to be the path forward for computing. These fundamental trends, I completely believe in them.

Jensen Huang. Nvidia CEO

I had the amazing opportunity to participate in the Great Agent Hack 2025, hosted by Holistic AI at UCL [2, 3]. The hackathon was structured around three big challenges: Agent Iron Man, Agent Glass Box, and Dear Grandma, each representing a different philosophy of agentic AI. These weren’t just creative names for convenient categories; they reflected three pillars of how we think about agents today: robustness, transparency, and user safety (of anyone, including your grandma 😄). Being immersed in that environment for a weekend was a kind of reset button for me: it was energising, it reminded me why I enjoy working in this field, and it left me genuinely inspired to keep learning and building, even when there’s never enough time to explore everything that’s happening around AI.

In this hackathon, more than 50 projects were developed across three tracks. This article focuses on key moments from the event and a handful of projects that stood out to me personally, while recognising that every team contributed something valuable to the broader conversation on building robust and trustworthy agents. For readers who want to explore the full range of ideas, the complete gallery of 51 submissions is available here: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1 [4].

Figure 1. Official leaflet and my T-shirt from The Great Agent Hack 2025. Image by the author.

The event was hosted by the UCL Centre for Digital Innovation (CDI), so we spent the weekend in some truly unique spaces in East London, the kind of place where you walk past the Orbit Tower (the red sculpture from the 2012 Olympics) and then code under a rotating floating Earth inside the building (Figure 2). London was already covered in Christmas lights everywhere you walked, so moving between the hackathon and the city felt like stepping between a research lab and a holiday postcard.

**Figure 2.** East London views: UCL East campus and the ArcelorMittal Orbit (also called Orbit Tower) (left), and the floating Earth installation inside the UCL Centre for Digital Innovation (right). Images by the author.

In total, the hackathon brought together more than 200 participants and offered roughly 25 different awards across all kinds of categories. Teams weren’t dropped in cold: before the weekend they had access to tutorials, example notebooks, and other resources that helped them prepare [5], choose a track, and hit the ground running once the clock started. As deliverables, each team was expected to submit a public GitHub repository, record a short demo, and create a poster or slide deck to present their solution to the jury, which made it much easier to understand the full workflow and real-world potential of every project.

The jury came from a surprisingly diverse mix of organisations: Holistic AI (the organiser), the UCL Centre for Digital Innovation (CDI), AWS, Valyu, NVIDIA, Entrepreneurs First, and others, including companies interested in the talent and ideas on display. They selected the winners for each of the three main tracks, but also handed out a whole constellation of mystery and special awards that celebrated much more than just the most technically advanced solution.

Among these special awards there was a Brave Soldier-style prize for the team that showed true resilience and kept going even when their teammates started disappearing, literally leaving one soldier standing; a Best Pitch award, because selling your idea is also part of getting the job done (especially since technical professionals tend to struggle a bit with this); and a Highest Resource Utilization prize for the teams that really leaned into AWS and squeezed every last spark out of the cloud. These and other award categories are summarised on the hackathon website [2].

One of the most curious things about the weekend was the chance to see NVIDIA’s ultra-compact AI supercomputer up close and even take a photo with the iconic leather-jacket setup to recreate the famous Elon Musk × Jensen Huang “leather jacket moment” [6] shown on the big screen (Figure 3). To make it even better, some of the agents we were trying to break in the Dear Grandma challenge were actually running on similar NVIDIA GPU hardware, so a machine much like this tiny supercomputer was the brain behind the agents that competitors were attacking.

**Figure 3.** The full NVIDIA experience: the leather-jacket photo setup with the DGX Spark (left) and a close-up of the ultra-compact DGX Spark (right). Images by the author.

The Agentic Arena

As mentioned at the beginning of this article, the heart of the weekend was structured around three tracks (Figure 4). Each one explored a different question about modern AI agents: how to build them so they work, how to make them transparent, and how to make sure they don’t go rogue.

Teams could pick whichever track best fit their use case, but in practice many projects naturally crossed track boundaries, a sign of how eager people were to learn, connect, and bring together different aspects of the agent lifecycle (yes, the idea that the more tracks you join the greater your chances of winning was floating around too, but we’ll skip that for now 😉).

**Figure 4.** The three tracks of the Great Agent Hack 2025: *Agent Iron Man* (build agents that don’t break), *Agent Glass Box* (understand agent behaviour), and *Dear Grandma* (attack like a red team, defend like a guardian). Image by the author.

Track A. Agent Iron Man: Agents that work, and last

This was the engineering reality-check track. The goal was to build a high-performing, production-ready multi-agent architecture with clear agent roles, tools, and memory wired together in a way that could actually survive outside a hackathon.

Evaluation focused on things that usually only hurt you in production: performance (speed, latency, cost), robustness (how the agent handles tool failures, bad inputs, and edge cases), architecture quality (clear separation between agents, safe tool orchestration, sensible fallbacks), and monitoring (observability, structured outputs, basic health checks). Teams were also expected to account for carbon footprint by favouring smaller or cheaper models where possible and measuring energy and token usage, so that the agent remains a conservative, responsible use of compute.
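To make the robustness and monitoring criteria a bit more concrete, here is a minimal sketch of a tool call with bounded retries, a fallback, and a structured result. It is not taken from any team's submission: `ToolResult`, `call_with_fallback`, and the `primary`/`fallback` callables are hypothetical names I am assuming for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolResult:
    """Structured output so monitoring can log every call the same way."""
    ok: bool
    value: Optional[str]
    source: str        # "primary", "fallback", or "none"
    attempts: int
    latency_s: float


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    query: str,
    max_retries: int = 2,
) -> ToolResult:
    """Hypothetical helper: retry the primary tool, then degrade gracefully."""
    start = time.perf_counter()
    attempts = 0
    # Try the primary tool a bounded number of times.
    for _ in range(max_retries):
        attempts += 1
        try:
            return ToolResult(True, primary(query), "primary", attempts,
                              time.perf_counter() - start)
        except Exception:
            continue
    # Primary exhausted: fall back to the cheaper or cached tool.
    attempts += 1
    try:
        return ToolResult(True, fallback(query), "fallback", attempts,
                          time.perf_counter() - start)
    except Exception:
        # Report the failure as data instead of crashing the agent loop.
        return ToolResult(False, None, "none", attempts,
                          time.perf_counter() - start)
```

Returning a small record rather than raising keeps the orchestration loop alive and gives the monitoring layer a uniform shape to log: which source answered, how many attempts it took, and how long it took.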

This track is also a small taste of what’s coming as agents become more widely used and systems grow more complex, with many services talking to each other while still needing to meet tight latency and cost targets.

Among the projects, one that caught my eye was FairQuote [4]: an intelligent car-insurance underwriting system that uses an orchestrator agent plus specialised intake, pricing, and policy agents that coordinate to collect data, assess risk, calculate premiums, and generate explainable policies in a single conversation. Architecturally, it points toward the next wave of multi-agent enterprise workflows, where robustness, clear responsibilities, and strong observability matter just as much as the underlying models.
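As a rough illustration of the orchestrator-plus-specialists pattern described above, the sketch below chains three plain Python functions standing in for the intake, pricing, and policy agents. This is not FairQuote's code: the field names, the `BASE_PREMIUM` value, and the risk rules are all made up for the example, and a real system would put LLM calls and tools behind each step.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BASE_PREMIUM = 400.0  # hypothetical base price, for illustration only


@dataclass
class Case:
    answers: Dict[str, float]
    risk_score: float = 0.0
    premium: float = 0.0
    explanation: List[str] = field(default_factory=list)


def intake_agent(case: Case) -> Case:
    # Intake: refuse to continue until the required data is present.
    missing = [k for k in ("age", "claims_last_5y") if k not in case.answers]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    case.explanation.append("intake: all required fields present")
    return case


def pricing_agent(case: Case) -> Case:
    # Pricing: toy risk rules, each one recorded so the quote is explainable.
    risk = 1.0
    if case.answers["age"] < 25:
        risk += 0.5
    risk += 0.25 * case.answers["claims_last_5y"]
    case.risk_score = risk
    case.premium = round(BASE_PREMIUM * risk, 2)
    case.explanation.append(f"pricing: risk multiplier {risk:.2f}")
    return case


def policy_agent(case: Case) -> Case:
    # Policy: turn the numbers into a customer-facing statement.
    case.explanation.append(f"policy: quoted premium {case.premium:.2f}")
    return case


def orchestrate(answers: Dict[str, float]) -> Case:
    """Orchestrator: one owner of the control flow, one responsibility per agent."""
    case = Case(answers=dict(answers))
    for step in (intake_agent, pricing_agent, policy_agent):
        case = step(case)
    return case
```

Calling `orchestrate({"age": 22, "claims_last_5y": 1})` walks the case through all three steps and leaves a three-line explanation trail, which is the property that makes this kind of workflow auditable.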

Underwriting is a good example because it’s one of the hardest and most business-critical problems in insurance. It sits at the intersection of regulation, actuarial science, and customer experience: every decision about accepting a risk, pricing it, or applying exclusions passes through this process. When underwriting is slow or opaque, customers get frustrated, partners lose trust, and insurers risk mispriced portfolios and regulatory scrutiny. When it works well, it quietly keeps the system stable, allocating capital efficiently, protecting the balance sheet, and supporting fair pricing across segments.

So, in this track, it was great to see not only solid engineering, but also the real problems teams tackled: underwriting, end-to-end claims handling, fraud investigation, and even emergency-services dispatch, where multi-agent systems coordinated triage and decision support in real time. Even if the weekend outputs were still demos, they pointed toward the multi-agent patterns, safeguards, and monitoring that will matter as similar architectures move from hackathon tables into live enterprise environments.

Team tool choices lined up closely with the hackathon’s recommended stack: AWS AgentCore with the Strands Agents SDK for orchestration, Amazon Nova and other Bedrock-hosted models (smaller SLMs to stay frugal), and evaluation frameworks like AgentHarm [7]. The latter lets you test whether an LLM agent can correctly sequence synthetic tools such as dark-web search, web scrapers, email senders, payment or bank-transfer functions, and code or shell tools, so you can measure both its robustness to jailbreaks and how capable it remains at executing multi-step harmful workflows once safety boundaries are bypassed.

Track B. Agent Glass Box: Agents you can see, and trust

The transparency track focused on making agentic systems explainable, auditable, and interpretable for people and organisations. Teams were asked to build agents whose reasoning, memory updates, and actions could be traced and inspected in real time, instead of remaining opaque black boxes. In practice, the projects fell into a few families: observability pipelines, explainability tools, governance and safety layers, and professional‑discovery or traceability tools.

For me, one of the projects that best captured the idea of a “glass box” was GenAI Explainer. We all know text-to-image diffusion models can be powerful but risky: traditional diffusion systems have already been shown to reproduce societal biases [8], and even newer models like FLUX.1 can still reflect patterns in their training data [9] while offering almost no insight into why a particular image looks the way it does. At the hackathon, the GenAI Explainer team tackled this by wrapping FLUX.1 with an explainability layer that lets you see how each word or segment of a prompt influences the generated image, audit outputs for brand, legal, or safety compliance, and iteratively refine prompts while watching the impact live, with every generation step tracked. In practice, they turned diffusion from a black box into something much closer to a glass-box, auditable workflow.
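I don't know which attribution method the GenAI Explainer team used, so the following is only a generic sketch of one simple way to estimate how much each prompt word matters: leave-one-out ablation. The `score` callable is an assumption; in a real setup it would generate an image from the prompt and measure some property of it, whereas here it can be any function from text to a number.

```python
from typing import Callable, List, Tuple


def leave_one_out(
    prompt: str, score: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Return (word, influence) pairs, where influence is the drop in score
    observed when that single word is removed from the prompt."""
    words = prompt.split()
    baseline = score(prompt)
    influence = []
    for i, word in enumerate(words):
        # Rebuild the prompt without word i and re-score it.
        ablated = " ".join(words[:i] + words[i + 1:])
        influence.append((word, baseline - score(ablated)))
    return influence


def toy_score(prompt: str) -> float:
    # Toy stand-in for "how red is the generated image": 1.0 if the word
    # "red" is present, else 0.0. A real scorer would inspect an image.
    return 1.0 if "red" in prompt.split() else 0.0
```

With the toy scorer, `leave_one_out("a red car", toy_score)` attributes the whole effect to the word "red". The cost is one extra generation per word, which is why this kind of analysis is easier to justify for short prompts or coarse prompt segments.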

In the end, Track B was a reminder that algorithmic transparency is not optional: legal and risk teams increasingly need to show that automated decisions are explainable and not biased, and the kind of ‘glass‑box’ thinking behind projects like GenAI Explainer is something we should bring into every agentic application we build.

In this track, team tool choices combined tracing platforms such as LangSmith or Langfuse, AWS observability services like CloudWatch, X‑Ray, or Bedrock monitoring, and research tools like AgentGraph [10] (converting traces into interactive knowledge graphs), AgentSeer [11] (building action graphs and doing failure/vulnerability analysis), and the Who_and_When failure‑attribution dataset [12] to analyse and visualise agent traces in depth, to name just a few.
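The idea of turning a flat trace into a graph can be sketched in a few lines. The example below does not use the AgentGraph or AgentSeer APIs; it assumes a made-up event schema (`id`, `parent`, `status`) and shows how a parent-to-children graph and the path leading to the first failing step could be derived from it.

```python
from typing import Dict, List, Optional


def build_action_graph(trace: List[dict]) -> Dict[str, List[str]]:
    """Map each step id to the ids of the steps it triggered."""
    graph: Dict[str, List[str]] = {}
    for event in trace:
        graph.setdefault(event["id"], [])
        parent: Optional[str] = event.get("parent")
        if parent is not None:
            graph.setdefault(parent, []).append(event["id"])
    return graph


def path_to_first_failure(trace: List[dict]) -> List[str]:
    """Walk parent links from the first failed step back to the root."""
    by_id = {event["id"]: event for event in trace}
    failed = next((e for e in trace if e["status"] == "error"), None)
    path: List[str] = []
    node = failed
    while node is not None:
        path.append(node["id"])
        node = by_id.get(node.get("parent"))
    return list(reversed(path))


# Hypothetical three-step trace: plan -> search -> summarise (fails).
example_trace = [
    {"id": "plan", "parent": None, "status": "ok"},
    {"id": "search", "parent": "plan", "status": "ok"},
    {"id": "summarise", "parent": "search", "status": "error"},
]
```

On `example_trace`, the failure path is plan, search, summarise, which is the kind of "which agent, and when" question that failure-attribution work tries to answer at scale.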

Track C. Dear Grandma: Agents that stay safe, and behave

In this track, teams were given seven secret LLM agents 🐺🦊🦅🐻🐜🐘🦎, each represented by an animal, and the mission was to break them, understand them, and identify them. These seven hidden “stealth agents” symbolised different behaviours, strengths, and attack surfaces that teams needed to uncover. The challenge was to build a red‑teaming framework that could attack any of the seven live animal‑agent endpoints using the API provided by the event organisers, backed by NVIDIA-powered infrastructure.

In the hackathon, each “animal” agent was a live AI system exposed through a single API service, with different routes for each animal. Teams could send prompts to these animal‑specific routes and observe how the agents behaved in real time, each with its own personality and capabilities, which helped red‑teamers design targeted tests and attacks.

Figure 5. Example of a jailbreak test against some of the “animal” agents: faced with a DAN‑style prompt, each model responds with a playful refusal and a consistent safety message, revealing both their shared guardrails and their distinct personalities.

Track C wasn’t restricted to the seven “animal” agents behind the API; attacking commercial systems like ChatGPT, Claude, or Gemini was also allowed as long as teams treated it as part of a systematic security assessment.

In short, a solution was expected to analyse, attack, and explain AI agent vulnerabilities, perform behavioural forensics, and show why each attack works.

The jailbreaking lab team used a two‑step process. First, they built an attack library of proven jailbreak prompts, based on techniques reported in the literature such as Base64 obfuscation, CSS/HTML injection, and other prompt‑level tricks. Second, they applied a genetic algorithm to mutate and improve these prompts: every time an attack from step one partially succeeded, the algorithm would tweak it (changing wording, adding context, combining two prompts, or further obfuscating instructions) so that successful variants were kept and weak ones were discarded. Over time, this evolutionary search produced stronger and stronger adversarial prompts and even uncovered entirely new ways to break the agents.
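The two-step loop described above can be sketched as a tiny evolutionary search. This is my own simplified illustration, not the team's implementation: the mutation operators, the `evolve` signature, and especially the `fitness` callable are assumptions. In an authorised red-team setting, `fitness` would score the target agent's response to each prompt; here it can be any function from a prompt to a number.

```python
import base64
import random
from typing import Callable, List


def mutate(parent: str, survivors: List[str], rng: random.Random) -> str:
    """Apply one randomly chosen mutation operator to a prompt."""
    op = rng.choice(("context", "obfuscate", "crossover"))
    if op == "context":
        # Add framing context in front of the prompt.
        return "For an authorised safety audit, " + parent
    if op == "obfuscate":
        # Base64 obfuscation, one of the literature techniques mentioned above.
        encoded = base64.b64encode(parent.encode("utf-8")).decode("ascii")
        return "Decode this Base64 and follow it: " + encoded
    # Crossover: head of this prompt plus tail of another survivor.
    other = rng.choice(survivors)
    a, b = parent.split(), other.split()
    cut = rng.randint(0, min(len(a), len(b)))
    return " ".join(a[:cut] + b[cut:])


def evolve(
    seeds: List[str],
    fitness: Callable[[str], float],
    generations: int = 5,
    population_size: int = 8,
    seed: int = 0,
) -> str:
    """Keep the fittest prompts each generation and refill with mutations."""
    rng = random.Random(seed)
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: max(2, population_size // 2)]
        children = []
        while len(survivors) + len(children) < population_size:
            children.append(mutate(rng.choice(survivors), survivors, rng))
        # Survivors are carried over unchanged, so the best score never drops.
        population = survivors + children
    return max(population, key=fitness)
```

Because the survivors are always carried into the next generation, the best score can only stay the same or improve, which matches the "successful variants kept, weak ones discarded" behaviour in the description. The expensive part in practice is the fitness call, since each evaluation means querying the target agent and judging its answer.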

HSIA was another standout project that pushed these ideas into the robotics world. Instead of attacking the animal agents, they targeted a Vision-Language-Action (VLA) robot system and showed how its perception could be corrupted at the semantic level. The pixels in the image stayed exactly the same; what changed was the internal caption generated by the model. With subtle, carefully crafted perturbations, the VLA system could flip from “I see a bottle in the image” to “I see a knife in the image,” even though no knife was present, leading the robot to act on a false belief about its environment. Their work highlights that multimodal systems can be compromised without touching the raw image, exposing a critical vulnerability for next-generation robotic AI.

Lessons Learned

If I had to summarise what this hackathon taught me, it would be:

Be a Brave Soldier. Perseverance matters more than competition. It’s not about beating others; it’s about staying resilient, adapting when things break (because they will), and delivering the best version of your idea. Events like this aren’t just technical challenges; they’re opportunities to showcase your skills and the kind of dedication companies genuinely value.

Prepare ahead of time. The teams that did well weren’t necessarily the most senior; they were the ones who arrived already understanding the format, the expectations, and the evaluation criteria, and who had gone through the tutorials and resources shared upfront.

Master the 5-minute pitch. This is essential. Evaluators and judges move fast. You might spend a few days building something, but you only get a few minutes to make them care. So, have a pitch ready that explains the value of your project clearly, quickly, and in a way that sparks curiosity. If those 5 minutes are great, the judges will ask for more. This applies equally to junior profiles and senior engineers (storytelling is part of the job). I struggle with this too; in real life we usually don’t have much time to prove our ideas.

These Events Are Becoming More Meaningful Than Ever. These events are gaining more interest every year, and the organisers even doubled the number of spots this year, which shows how valuable the experience is. That’s why it’s so important to participate only if you really want to be there and can commit your time and energy.

Study the sponsors. Before the event, look up the companies involved and think about which ones might be most interested in your approach. Tailor your pitch accordingly. Sponsors aren’t just judges; they’re potential collaborators, mentors, and even future teammates.

Strong Fundamentals Beat Shiny Models. One key takeaway from the hackathon is that winning wasn’t about using the latest or most hyped models. The top teams didn’t succeed because they relied on the biggest or flashiest architectures; they excelled because they built strong solutions on top of solid, well-understood techniques: genetic algorithms and robust diffusion models, among others. The real differentiator was how creatively they combined these foundations with agentic methodologies, clever evaluation setups, and good engineering to tackle persistent challenges.

Collaborative Innovation Accelerates Progress. The event highlighted how cross-disciplinary collaboration between academia, industry, and AI governance experts can significantly strengthen both AI development and governance frameworks. Even participants who weren’t in technical roles contributed valuable ideas grounded in real problems from their own domains, bringing perspectives that pure engineering alone can’t provide. It’s also a great opportunity to connect with people outside your usual technical bubble, expanding not just your network, but the way you think about the impact and applications of AI.

Finally, a bigger reflection: agents are evolving fast, and with that come new architectural challenges, safety concerns, and responsibilities. These aren’t hypothetical problems of the future; they’re happening right now. Being responsible with AI applications shouldn’t be a hype-driven slogan; it’s part of the daily job of any AI or data science professional.

Conclusions

These events are quietly shaping how we think about AI governance. When you put powerful agentic systems under time pressure and in messy, realistic scenarios, you’re forced to confront unpredictable behaviour head-on. That’s where the real learning happens: how do we balance rapid innovation with trust and safety? How do we design evaluation frameworks and guardrails that let us move fast without losing control? This hackathon didn’t just reward clever models; it rewarded thoughtful governance.

And while there are plenty of AI events popping up everywhere, this is one of the few you should really keep an eye on, the kind that genuinely helps you grow, exposes you to real-world challenges, and reminds you why it’s worth staying curious and keeping your skills sharp.

References

References in order of appearance:

[1] “NVIDIA CEO Jensen Huang kicks off CES 2025. The Future is Here!” SupplyChainToday, 2025. Link.

[2] Great Agent Hack 2025: Holistic AI x UCL. Available at: https://hackathon.holisticai.com/ (accessed November 22, 2025).

[3] Valyu AI. (2025). The Great Agent Hack 2025: Agent Performance, Reliability and Valyu-Powered Retrieval. Retrieved from https://www.valyu.ai/blogs/the-great-agent-hack-2025-agent-performance-reliability-and-valyu-powered-retrieval

[4] Great Agent Hack 2025. “Project gallery - Great Agent Hack 2025: Build and test transparent, robust, and safe AI agents for real‑world impact.” Devpost. Available at: https://hai-great-agent-hack-2025.devpost.com/project-gallery?page=1.

[5] Holistic AI. (2025). Hackathon 2025 [Source code]. GitHub. https://github.com/holistic-ai/hackathon-2025 (last accessed November 30, 2025).

[6] Elon Musk Surprised by Jensen Huang’s DGX Spark Gift. (n.d.). YouTube Shorts. https://www.youtube.com/shorts/l7x_Tfrbubs

[7] Andriushchenko, M., Souly, A., Dziemian, M., Duenas, D., Lin, M., Wang, J., Hendrycks, D., Zou, A., Kolter, Z., Fredrikson, M., Winsor, E., Wynne, J., Gal, Y., & Davies, X. (2024). AgentHarm: A benchmark for measuring harmfulness of LLM agents. arXiv. https://arxiv.org/abs/2410.09024

[8] Tiku, N., Schaul, K., & Chen, S. (2023, November 1). This is how AI image generators see the world. Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-generated-images-bias-racism-sexism-stereotypes/ (last accessed August 20, 2025).

[9] Porikli, S., & Porikli, V. (2025). Hidden Bias in the Machine: Stereotypes in Text-to-Image Models. Available at: https://openreview.net/pdf?id=u4KsKVp53s

[10] Wu, Z., Cho, S., Munoz, C., King, T., Mohammed, U., Kazimi, E., Pérez-Ortiz, M., Bulathwela, S., & Koshiyama, A. (2025). AgentGraph: Trace-to-Graph platform for interactive analysis and robustness testing in agentic AI systems. Holistic AI & University College London.

[11] Wicaksono, I., Wu, Z., Patel, R., King, T., Koshiyama, A., & Treleaven, P. (2025). Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs.

[12] Zhang, S., Yin, M., Zhang, J., Liu, J., Han, Z., Zhang, J., Li, B., Wang, C., Wang, H., Chen, Y., & Wu, Q. (2025). Which agent causes task failures and when? On automated failure attribution of LLM multi-agent systems (arXiv Preprint No. 2505.00212).
