
Measured Leadership with Agentic AI on Open Models - Signal65

JV Roig, AI Platform Engineer at Kamiwaza, joins Ryan Shrout and Mitch Lewis to discuss why real-world enterprise tasks need new agentic AI benchmarks, sharing key insights from recent Signal65 and Kamiwaza research on the KAMI index.

Agentic AI is exposing the limits of traditional LLM benchmarks and forcing researchers to rethink how enterprise AI value is defined and measured.


Analysts Ryan Shrout and Mitch Lewis are joined by JV Roig, AI Platform Engineer at Kamiwaza, for a conversation on Measured Leadership with Agentic AI on Open Models, a joint whitepaper from Signal65 and Kamiwaza. They dig into the shortcomings of conventional LLM benchmarks and how the Kamiwaza Agentic Merit Index (KAMI) provides more relevant evaluation of real-world, enterprise-focused agentic tasks.

Key Takeaways Include:

🔹 Traditional AI Benchmarks are Limited: Existing measures face issues with data contamination and construct validity, particularly when assessing a model’s real-world enterprise capabilities.

🔹 The Kamiwaza Agentic Merit Index (KAMI): KAMI models end-to-end enterprise tasks and uses the Picard framework to randomize tests and generate unbiased, relevant results.

🔹 Findings on Model Performance: Performance varies significantly by model architecture and size, with larger models and explicit “thinking mode” strategies generally performing better, while the impact of quantization is minimal.

🔹 Model-specific Quirks & Schema Handling: Certain models misinterpret non-rounded SQL thresholds and overlook schema inspection unless prompted, affecting real-world usability.

🔹 Evolving Benchmark Scenarios: Future iterations of KAMI will add knowledge retrieval and hallucination tests, with early results indicating the performance gap between proprietary and open models is rapidly shrinking.

Download the whitepaper here, and learn more at Kamiwaza.




Disclaimer: Signal65’s Insights from the AI Lab is for information and entertainment purposes only. Over the course of this webcast, we may talk about companies that are publicly traded, and we may even reference that fact and their equity share price, but please do not take anything that we say as a recommendation about what you should do with your investment dollars. We are not investment advisors, and we ask that you do not treat us as such.

Transcript

Ryan Shrout:

Hey everybody, welcome to Signal65 Video Insights. I'm your host, Ryan Shrout, president at Signal65. I'm joined by Mitch, one of our esteemed AI evaluators and engineers from Signal65, and JV from our good partners at Kamiwaza. Both of you, thanks for joining me.

JV Roig:

Hey, glad to be here.


Ryan Shrout:

Wanted to have an interesting conversation tonight about agentic AI evaluations. We published a report earlier this year called Measured Leadership with Agentic AI on Open Models. This is really what I'll frame as the first of what will likely be many reports and analyses of performance and quality evaluations of different AI models around agentic use cases. It had some really, really interesting results about the accuracy of these agentic models across large sets of tasks, about how quantization affects or does not affect the quality and output of these models, and even the impact of these thinking models, these reasoning models that spend a lot of time, and a lot of tokens, in the background trying to decide what their next step is going to be. I thought I'd bring Mitch and JV on to talk us through some of the structure behind the benchmark, why it was unique, and then some of the results, and to tease what other stuff is going to come out. So let me set the stage this way. There are lots of AI benchmarks out there. There are a lot of leaderboards, a lot of measurement systems, and a lot of people talking about those measurement systems. But JV, I'm curious from your point of view, what was the problem statement that you were trying to address when you started developing the idea for KAMI, the Kamiwaza Agentic Merit Index, which is the name we came up with for that test?

JV Roig:

Right, yeah. Thanks, Ryan. So the discussion here is a bit wonky. This was largely left out of our Signal65 report, but we do link to a paper in that report where the nitty-gritty details are. To keep it short, the problem is that most of these benchmarks we've become accustomed to seeing at every new model launch suffer from a few problems, and the two that are most important for us right now are, one, benchmark data contamination and, two, construct validity. Essentially, benchmark data contamination means a lot of the benchmark information itself, the tests, has become part of the AI's training data. Which means it's now impossible for us to see: are they getting this right because they have essentially overfit to the test, or are they really smart enough to understand and problem-solve? So that's the first problem. And we've seen this not just in our own internal experiments, but in the research literature, for example, from Apple's machine learning research. They have this paper called GSM-Symbolic, where GSM is grade school math. Grade school math, like GSM8K, is one of these standard benchmarks that you've probably seen in model launches. And what they saw was that when they just changed entity names and variable values, but kept the exact same GSM test, performance just dropped. So that tells us the models can't actually do grade school math, right? They're mostly memorizing the test. The other problem is construct validity, which is: are we actually measuring what we purport to measure? When you look at it from the enterprise perspective, what the enterprise actually needs isn't that. You don't grow up to be a lawyer so that you can answer bar exam questions all day. That's not what a lawyer does. So asking these Q&A questions is not what enterprise agentic AI deployments look like. Our benchmark instead models what actual jobs in an enterprise deployment would be like. Here's a CSV file, process it this way, get back to me with a JSON, instead of Q&A-type questions. So, those two problems.

Ryan Shrout:

It reminds me a little bit, and maybe I'll show my age, but I'll call it experience instead. Going back to graphics and 3D game benchmarking, we had this Quake/Quack debate, right, where hardware providers were kind of gaming the system by detecting the EXE that was being run for a very common benchmark, Quake 3, and they would adjust performance accordingly. But if you renamed the executable file from Quake to Quack, the performance dropped because of it. Now, that was more of a malicious reason for it. And maybe some of this in the AI world is or is not really malicious; it's just kind of the nature of the beast when you're training new language models. But it's interesting to frame it that way. And I'm curious, Mitch, from your view, as one of the primary people on our team who has been looking at different AI models and performance, both in terms of GPU-level, platform-level performance and the quality of these different language models: is this something that had stood out to you as a potential problem we were going to see eventually, and how are we addressing it?

Mitch Lewis:

Yeah, I mean, I think JV kind of hit the nail on the head about testing the right thing. As AI models have advanced, we've seen kind of basic question-answer get better, and there are lots of different benchmarks that test reasoning, or maybe they test math, but a lot of it is: here's a question, and here's a one-shot response. That's a little bit different than actually moving into the big enterprise problems, the agentic workloads, which are really: here's a model with a set of tools, and it's going to take those tools, inference, and loop over them, calling different tools to try and complete some task, going and getting data out of a database and answering a real business question. It's a different use case, but it also has different kinds of hardware requirements, like you're talking about, in terms of how many tokens it's spitting out and things like that.
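To make the loop Mitch describes concrete, here is a generic sketch of an agentic tool-calling harness. This is not the KAMI/Picard code; the function and message shapes are hypothetical, and it only illustrates the inference-then-tool-call cycle he contrasts with one-shot Q&A.

```python
# A generic sketch (not the KAMI harness itself) of the agentic loop Mitch describes:
# the model is given tools, and the harness keeps calling it, executing whatever tool
# it picks, until it returns a final answer or hits a step limit.
from typing import Callable

def run_agent(model: Callable[[list], dict], tools: dict[str, Callable], task: str, max_steps: int = 10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)                       # one inference pass
        if reply.get("tool") is None:
            return reply["content"]                   # final answer, stop looping
        tool_name, args = reply["tool"], reply.get("args", {})
        result = tools[tool_name](**args)             # execute the chosen tool
        messages.append({"role": "tool", "name": tool_name, "content": str(result)})
    return None  # ran out of steps without producing an answer
```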

Ryan Shrout:

The paper you referenced before, JV, that was not our paper but your own, more of an academic look at things, is about the Picard framework, as you've named it. Give me a quick description of what it is. It's the idea of these randomized test suites that are still gradable, which removes some of those limitations that you talked about before.

JV Roig:

Right. This dovetails perfectly with what Mitch was saying, that it's not just Q&A we should be measuring. And some benchmarks do, in fact, have tool-calling benchmarks and some sort of agentic benchmarks. But what's unique with what we're doing, or at least what I believe is very unique about it, is that the underlying Picard framework that powers the Agentic Merit Index deals head-on with this problem of benchmark contamination. And the way it does that is multi-level randomization of the entire sandbox that the agent sees when it is dealt a task. So essentially, the task the AI agent gets might say, “Oh, please process the CSV.” That's the general task. But the exact instructions that it sees are going to be different every time that particular test is instantiated. Sometimes the path to the CSV is different, and the name of the CSV is different. Sometimes the threshold in the question that's asked is going to be different. Like, oh, how many orders are above 50,000? But it's not always 50,000. Sometimes it's randomly something like 49,753. It's not even a round amount. And when they look at the CSV file, maybe the structure is the same, but the data inside the CSV file is also different at each instantiation of the test. It is always randomized that way. With the technology we understand right now, it is as unmemorizable as it could possibly be, because we are explicitly trying to deal with the memorization problem. If you do the math and try to work out how many possible combinations you could get from a typical Picard test, it would vastly outnumber the number of atoms in the entire universe. So that framework allows us and the teams to set up these different categories of testing, the categories that we have.

Ryan Shrout:

In this first paper, for example, the categories go from basic reasoning, which is something simple like respond with a specific word, right? It seems like the simplest of tasks, but as it turns out, not everything got 100% even on some of the simple work. But you also have file system operations, CSV processing, database processing, response format, instruction following. So, a lot of categories that can be quantified as business-critical tasks in some categorization method. As I think through those categories, and this could be for JV and for Mitch, are there any interesting ones that stood out in their complexity, ones that a lot of models got surprisingly good at, or some that were super simple where maybe you were surprised how many models did not do very well in our testing? We'll get to the results specifically in a second.

Mitch Lewis:

Well, maybe before jumping into that, one other thing I want to highlight, going back to how the testing actually works, that I think is kind of important: there is actually a deterministic answer key. So not only are the questions randomized, but the expected results are generated at runtime, so there is a deterministic answer key. When you are getting those results for all those different categories, it's not an LLM-as-a-judge or something like that. But to get back to your point, Ryan, I think there are a bunch of kind of interesting results in each of those seven categories. You talked about the one that was as simple as give a certain word, right? That was really a test of: are these models even suitable for an agentic scenario at all? Or if you give them tools, is that going to blow things up entirely? So it was really just a basic reasoning test. And most of them did really well. A couple of them didn't do well at all, because instead of just answering with the word, they would go off to the tools and try to call a tool to go do something, which is strange behavior. And that's a quick way to say, hey, that model's not the right one to be using for our agentic workloads and things like that.
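Putting JV's description of the randomized sandbox together with Mitch's point about the deterministic answer key, here is a minimal Python sketch of the general idea. The function and field names are hypothetical, not the actual Picard API; it only shows how a test instance can randomize paths, data, and thresholds while still computing an exact expected answer at runtime, so no LLM-as-judge is needed.

```python
# Minimal sketch (not the actual Picard code): each instantiation randomizes the
# sandbox -- file path, file name, thresholds, and data -- and the expected answer
# is computed deterministically from that generated data.
import csv
import random
import tempfile
from pathlib import Path

def instantiate_order_count_test(seed: int):
    rng = random.Random(seed)

    # Randomize the sandbox: directory, file name, and data all differ per instantiation.
    workdir = Path(tempfile.mkdtemp(prefix=f"picard_like_{seed}_"))
    csv_path = workdir / f"orders_{rng.randint(1000, 9999)}.csv"
    orders = [
        {"order_id": i, "amount": rng.randint(1_000, 100_000)}
        for i in range(1, rng.randint(200, 500))
    ]
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount"])
        writer.writeheader()
        writer.writerows(orders)

    # Randomize the question down to the first digit, so the threshold is rarely round.
    threshold = rng.randint(30_000, 50_000)
    task = f"How many orders in {csv_path} have an amount above {threshold}? Reply with JSON."

    # Deterministic answer key, computed at runtime from the generated data.
    expected = sum(1 for o in orders if o["amount"] > threshold)
    return task, expected

task, expected = instantiate_order_count_test(seed=42)
print(task)
print("expected answer:", expected)
```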

JV Roig:

I'm glad Mitch pointed out the deterministic answer key. That was one of the core design principles, and I just completely glossed over it. So for me, one of the more interesting results was in both the CSV and SQL tests, but primarily, let's talk about the SQL tests. There was this question I referenced earlier, where sometimes the threshold would be different. The model would be asked, “How many orders do we have that are above threshold X?” For example, how many orders are above 50,000? But that variable was just randomized by Picard, and I set it to something around the range of 30,000 to 50,000, something like that. But that range was completely random down to the first digit, meaning the threshold could sometimes be 37,123, right? Which, for a human, is literally no problem. Why would you care, right? If they ask you for 37,123, then okay, you'll figure out every order that's above that threshold. But there was this very curious quirk that some models showed. And these aren't just super small models. Small models, like 32B models, sure. But even giant MoEs with 600 billion parameters also fell to this sort of quirk at times, of course to varying degrees. Sometimes, when they see that the threshold is not a round number, instead of filtering the request using the order amount, like where order amount is greater than 50,000, for example, they would sometimes end up choosing order ID for those non-round figures. So they would filter the database by order ID greater than 37,123.

Ryan Shrout:

So they were just changing whatever parameter they were looking up, whatever column they were referencing. Right.

JV Roig:

And it could possibly be because, when they were trained, they probably didn't see a lot of examples of filtering an SQL database on an amount where the amount in the example is a non-round figure, but they saw lots of examples where IDs are just random, non-round figures. Interesting. And that kind of tripped them up, right? Now, we can't know, right? They're black boxes. This is my best guess based on what they probably saw in training. It's interesting, right? And you wouldn't see those quirks unless you had a framework like Picard running behind KAMI that's actually designed to expose them to this real-world, messy enterprise data. That's one of the things we do very well. In the real world, enterprise data is not going to be clean and perfect. And, oh, is this fair to the LLM? I'm sorry. In the enterprise world, things are not going to be fair to the LLM. No, no.
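As an illustration of the quirk JV describes, using a hypothetical schema and arbitrary data rather than the benchmark's actual tables, the snippet below contrasts the intended query with the failure mode where a non-round threshold gets applied to the ID column instead of the amount column:

```python
# Illustration only (hypothetical schema): with a non-round threshold, some models
# filtered on order_id instead of the amount column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
    [(i, (i * 2_654_435_761) % 100_000) for i in range(1, 501)],  # arbitrary amounts
)

threshold = 37_123

# What the task actually asks for: count orders whose AMOUNT exceeds the threshold.
correct = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE amount > ?", (threshold,)
).fetchone()[0]

# The observed failure mode: the non-round number gets treated like an ID.
quirky = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id > ?", (threshold,)
).fetchone()[0]

print(f"filter on amount   -> {correct}")  # the graded, correct answer
print(f"filter on order_id -> {quirky}")   # plausible-looking SQL, wrong result
```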

Mitch Lewis:

Well, I think a similar but even simpler issue that we saw with some of the database tasks was that a lot of the models would just assume the schema and jump right in and try to do something. And they would do really badly, so we actually added some tests that gave them a hint in the prompt: hey, first go look at the schema, then go find the data. So, a kind of similar thing.

JV Roig:

Right, and we saw that for most of the models, that actually improved performance. And to be super clear, we gave them the tools to inspect the schema. When we sent them off on the task, they had tools to query the database, and they also had tools to actually inspect all of the schemas. But for a lot of those models, we found most of them were just, oh, it's probably this kind of schema, I'm just going to query immediately.
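Here is a small sketch of the behavior the hint nudges models toward, using a hypothetical SQLite table rather than the benchmark's real one: inspect the schema with the available tools first, then write the query against column names that actually exist.

```python
# Sketch of "inspect the schema before querying" (hypothetical table and column names).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_total INTEGER)")  # not 'amount'!

# Step 1: what a well-behaved agent does with its schema tool -- look before querying.
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
for t in tables:
    cols = [r[1] for r in conn.execute(f"PRAGMA table_info({t})")]
    print(t, cols)  # orders ['order_id', 'order_total']

# Step 2: only now write the query, using the column names that actually exist.
count = conn.execute("SELECT COUNT(*) FROM orders WHERE order_total > 37123").fetchone()[0]

# The kind of hint Mitch mentions adding to the prompt for models that skip step 1:
HINT = "First inspect the database schema with your tools, then write your query."
```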

Ryan Shrout:

So, there are a couple of other things worth pointing out. One of them is, we're not going to go through every page and every result in the paper; it would take forever. There's a lot of detail and a lot of statistical analysis that JV and team have put in to make sure these are run properly. How many times was each of these scenarios run in these tests? Was it 10? 30? What was the number, JV?

JV Roig:

We had eight independent runs, but each run would have 30 instantiations of a question. So we had 19 questions, times 30 samples for each, times eight runs. So that's a lot for every model.
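For a rough sense of scale, multiplying out the figures JV quotes gives the number of graded task instances per model:

```python
# Back-of-the-envelope tally using the numbers quoted above.
questions = 19
instantiations_per_run = 30
runs = 8
print(questions * instantiations_per_run * runs)  # 4560 graded task instances per model
```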

Ryan Shrout:

It's a lot of tokens. It's a lot of tokens to process, right? So this isn't just a one-off answer that we then generate some percentages from. It's also worth noting that this is an early revision of the benchmark, an early version of how we analyze things using it. We were looking mostly at Qwen models and Llama models. There's a Phi-4 in there, there's a Mistral in there. Not a broad set of models; that's something we're working on, and we'll mention it here at the end. But the top performer, the Qwen 3 235B-A22B Instruct 2507 at FP8, hit 88.8% as the number one overall leader in this first pass at things. Did any model stand out? Were you surprised that the two Qwen models took the leadership position in this early step? I think it's easy to say the Llama 3.1 8B Instruct being way down the list at like 10% accuracy across this test is a surprise. Was there anything else inside those bookends that stood out to you, JV?

JV Roig:

Personally, I was very biased towards loving the Qwen family. That's just a personal bias I have. Everybody knows it, so I'm owning up to it. What surprised me, actually, except for the 235B and the 2507 variants, was that the Qwen 3 models did not originally perform as well as I expected. In fact, if you look at the chart, you'll see all of the original Qwen 3 models, meaning the ones that do not have the 2507 refresh designation, were basically underperforming the Llama 3.1 and Llama 3.3 70B variants, and even Qwen 2.5 72B. And even more interesting, those original Qwen 3 models, when you test them for agenticness specifically, were not just lagging behind Llama 3, they were lagging behind Qwen 2.5, their older generation. That was kind of crazy. And you'd see something like Qwen 2.5 14B, which, personally, I had also overlooked. I was always looking at the 72B or the 7B, not the 14B. Qwen 2.5 14B was a super strong performer.

Mitch Lewis:

I think that's a really important point, JV. When you look at the Qwen 3 models and the Qwen 2.5 models that did as good or better on our tests, it shows that discrepancy with some of these other benchmarks that are out there, and how maybe they're not the best fit for measuring this kind of agentic capability. In our paper, there's a table of previous results from other benchmarks that measure reasoning and other things. And those results had shown that even the small Qwen 3, the 4 billion parameter model, was doing better than, or almost as good as, the 2.5 models. And then when you give it these kinds of real-world enterprise agentic tasks, it's like, hold on a minute, maybe there is a difference here.

JV Roig:

Yeah, I totally love that, right? Like Qwen 3 4B, when it was originally released, the benchmarks were basically saying it was so much better even than our older 72B model. And it's like, what? And then when we put it to the test, no, of course that's not true. Of course it's not.

Ryan Shrout:

We do have a chart in the paper worth looking at, where we try to sort the results by model size across each of the categories of workloads. And generally speaking, the larger the model, the better it performs. There's one particular example that stands out to me where that's not the case: text extraction. In that example, the very large models perform 10 points lower than the large models, which we've categorized as 50 to 100 billion parameters. So there are some interesting things there. We also call out quantization; going from full FP16 down to FP8 didn't seem to really impact performance, and there were a couple of places where the quantized model was actually a point or two higher. And then some thinking versus non-thinking stuff. I guess we just assumed that would be the case, but it was interesting to see it play out in the actual testing, right? The thinking models did perform better across the board in our particular examples. Right, JV?

JV Roig:

Right. Yeah. So, especially for the smaller models, thinking helps them achieve all of the tasks better, of course at the cost of a lot more latency and tokens. At a certain point, enterprises will have to make the calculation of their effective GPU utilization: a small model that needs so many more tokens versus a slightly bigger, more capable model that can do those tasks without needing thinking.
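The numbers below are purely illustrative, not from the paper; they only sketch the kind of back-of-the-envelope calculation JV is pointing at, where a small model's extra thinking tokens can end up costing more overall than a larger model answering a task directly:

```python
# Purely illustrative figures (hypothetical, not measured results) for the trade-off
# JV mentions: cost per task proxied by tokens generated x relative GPU cost per token.
def relative_cost(tokens_per_task: int, relative_gpu_cost: float) -> float:
    return tokens_per_task * relative_gpu_cost

small_thinking = relative_cost(tokens_per_task=4_000, relative_gpu_cost=1.0)  # lots of reasoning tokens
larger_direct  = relative_cost(tokens_per_task=800,   relative_gpu_cost=3.0)  # ~3x compute per token

print(small_thinking, larger_direct)  # 4000.0 vs 2400.0: the bigger model can win on efficiency
```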

Ryan Shrout:

And I think where this becomes particularly interesting is where we go next with this testing, right? Because obviously these were all open models that we looked at. Some of them we were running on hardware that we had access to, either through the Signal65 AI lab or other places. But when it comes to, hey, maybe you're accessing these models through an API where you're now paying per token, suddenly the impact of thinking modes and the impact of accuracy need to be weighed against the cost of some of these actions. That'll be interesting. So I'm curious, from both of you, and I'll ask Mitch first: as a little tease of what is coming next from this, new models, slightly different methodologies, or improvements to the benchmark itself, what's something that's interesting to you?

Mitch Lewis:

Yeah. A little tease: we are looking at a lot more models. In this paper, I think there were 30, 31 models, something like that. I think we've tested probably another 30. A lot of them are those proprietary models, things where you are paying for an API endpoint, so those questions do become really interesting. And going back to what we saw in these results around the thinking models, one of the things we did see, like I was saying, is that giving a hint in the prompt could make a big difference. On the database tasks, we saw the thinking models do way better without the hint. Once we gave the hint to the non-thinking Qwen models, they were suddenly about equal. So it's those kinds of things we can look at: how do we drive, or how can an enterprise drive, efficiency if they're paying per token, or if they're really concerned about which models they can fit on their infrastructure and how to be more efficient, things like that.

Ryan Shrout:

And JV, what about from you? Any early results that seem somewhat surprising? And again, we're going to come back and do another video on this once that report comes out, to talk about what it means for open versus closed models, pricing, and all of that discussion. But I'm just curious, for the people watching, what's something to look for?

JV Roig:

Yeah. With what Mitch referenced, the batch of another 30 models, including proprietary ones from the most popular providers right now, what we see from early results is that, for realistic agentic enterprise tasks, the gap between proprietary and powerful open models is shrinking rapidly. Realistically, if you use something like the Kamiwaza Agentic Merit Index, run it on your use cases specifically, for example with the Picard framework, and then actually analyze the failures, you'll probably get much, much farther than if you were just vibe-testing them. And as an addendum, on how we can make this test better: we're also experimenting with adding a new kind of scenario, specifically about knowledge retrieval. Because right now we were just examining, okay, can they do all of these different business tasks? A big blind spot in the current test is, what about hallucinations? We can't directly measure hallucinations right now. If I give them our knowledge base, like, hey, if I stuff our HR documents, worth 20,000 tokens, into a specific chatbot, how would all of these different models perform? Which of them would be faithful to our knowledge base? Which would be able to say, no, it doesn't say that? And which would completely make up new facts depending on how you prompt it? So that's a new test design we are adding to the next version of our benchmark.

Ryan Shrout:

Very cool. Well, I'm looking forward to it. JV and Mitch, I want to thank you for joining and having this discussion with me. I mean, we talk about this for basically an hour every week, so trying to squeeze the full report into 30 minutes is pretty difficult to do. But I'm looking forward to future iterations of the test and your improvements there, JV, and then, Mitch, on the analysis side, how we paint this picture of model proficiency. Do we eventually even get into actual compute performance? That's another area that will be interesting to look at as we compare closed models that you can run through APIs and endpoints versus open models that you can run on-prem. But I definitely encourage everybody watching or listening to go to Signal65.com and look up this paper. There's a ton of great detail in here, and it's a good kickoff point for all the other agentic AI performance and leadership discussions we'll have going into 2026. Signal65.com, again, Measured Leadership with Agentic AI on Open Models. Mitch, JV, thanks for joining me, and we'll see everybody soon on the next Video Insights from Signal65.
