This year, I published an industry report called Remediation at Scale analyzing how application security (AppSec) teams fix vulnerabilities in their code. The dataset: tens of thousands of repositories, a full year of scan data, and organizations ranging from startups to enterprises. In total, north of 127 million data points spanning individual findings, scan events, and remediation actions across two types of security scanning (SAST and SCA).
I'm a Senior Technical PMM at Semgrep with a background in computer science, data science, and solutions engineering. I love building things. This project let me combine all of that in one motion: writing the SQL, building scripts to manage the analysis, parsing and cleaning the data, finding the story the data was telling, and shipping the final polished asset.
This post walks through five lessons I picked up along the way. If you've ever needed to take a huge dataset, find the narrative inside it, and turn it into something both technical and non-technical audiences can act on, some of this might be useful.
1. Start with the data, not the story
The temptation with any data project is to decide on your narrative first, then go looking for numbers to back it up. I did the opposite.
I spent weeks in pure exploration mode. Querying Snowflake, looking at distributions, running aggregations across different dimensions. No hypothesis, no angle. Just trying to understand what the data actually showed.
This was uncomfortable. Stakeholders wanted to know what the report would say. I didn't have an answer yet.
But it turned out to be the most important phase of the entire project. The data told a story I wouldn't have guessed: the gap between top-performing security teams and everyone else wasn't about tooling. It was about systematic follow-through on remediation. I never would have landed on that framing if I'd started with a thesis.
You also have to be willing to kill your darlings. There were several findings I wanted to be true that the data didn't support. On the flip side, some of the most interesting insights came from places I wasn't looking. I used local LLMs via Ollama to classify 10,000+ text-based triage records into 20 thematic categories. What emerged was a clear pattern: the most common themes were about test files, framework protections, and trusted services. That told a story about how teams actually use triage tooling that I never would have found by looking at aggregate metrics.
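As a rough illustration of that classification step, here is a minimal sketch of calling a local Ollama server to bucket one triage note. The model name, category list, and prompt wording are my own illustrative assumptions, not the report's actual setup; only the `/api/generate` endpoint shape comes from Ollama's API.

```python
import json
import urllib.request

# Illustrative category list -- the real analysis used 20 themes.
CATEGORIES = ["test file", "framework protection", "trusted service", "other"]

def build_prompt(note: str) -> str:
    return (
        "Classify this vulnerability triage note into exactly one of: "
        + ", ".join(CATEGORIES)
        + f".\nNote: {note}\nReply with the category name only."
    )

def normalize_answer(raw: str) -> str:
    # Models often add casing or punctuation; snap the reply to a known category.
    cleaned = raw.strip().strip(".").lower()
    return cleaned if cleaned in CATEGORIES else "other"

def classify(note: str, model: str = "llama3",
             host: str = "http://localhost:11434") -> str:
    # Ollama's generate endpoint: one prompt in, one completion out.
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": build_prompt(note),
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return normalize_answer(json.loads(resp.read())["response"])
```

Keeping prompt construction and answer normalization as separate pure functions makes the flaky part (the model call) easy to isolate when you run this over thousands of records.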
A few things that helped during exploration:
- Run diagnostic queries first. I built a set of 12+ data quality checks before touching the analysis. One of them caught that a key metric (parse_rate) only had coverage for a fraction of repos. I switched to an alternative field (NUM_BYTES_SCANNED) with 90%+ coverage. Without that diagnostic, the entire findings-per-lines-of-code analysis would have been mis-computed.
- Build checkpoint/resume into your pipeline. I had 108+ SQL queries across multiple report sections. I wrote a shell script that auto-discovered .sql files, tracked which ones had already produced output CSVs, and skipped them on re-runs. When queries failed midway through (and they did), I could pick up right where I left off instead of re-running everything.
- Document as you go. Every interesting result, every dead end, every assumption. That running log became the backbone of the report's methodology section and saved me weeks when I needed to retrace my steps.
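The coverage diagnostic above boils down to one question: what fraction of repos actually have a value for each candidate column? A toy sketch, with made-up rows and column names taken from the text:

```python
from collections import defaultdict

def column_coverage(rows, columns):
    """Fraction of rows with a non-empty value for each column."""
    present = defaultdict(int)
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            if row.get(col) not in (None, "", "NULL"):
                present[col] += 1
    return {col: (present[col] / total if total else 0.0) for col in columns}

# Fabricated example rows standing in for a repo-level export.
rows = [
    {"repo": "a", "parse_rate": "0.98", "num_bytes_scanned": "1024"},
    {"repo": "b", "parse_rate": "",     "num_bytes_scanned": "2048"},
    {"repo": "c", "parse_rate": "",     "num_bytes_scanned": "4096"},
]
coverage = column_coverage(rows, ["parse_rate", "num_bytes_scanned"])
# Here parse_rate covers only a third of repos, while num_bytes_scanned
# covers all of them -- so the latter is the safer denominator.
```

Running a handful of checks like this before any real analysis is cheap insurance against building a whole section on a sparsely populated field.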
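The checkpoint/resume script works on a simple convention: a query is "done" when its output CSV exists. The original was a shell script; here is a Python sketch of the same resume logic, with the query runner injected so the skeleton stays testable (the real runner would shell out to something like snowsql).

```python
from pathlib import Path

def run_pending(query_dir: Path, out_dir: Path, run_query) -> list[str]:
    """Run every .sql file in query_dir that has no output CSV yet.

    run_query(sql_path, out_path) executes one query and writes its CSV;
    it may raise on failure, in which case only a .partial file is left
    behind and the query re-runs next time.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    ran = []
    for sql_file in sorted(query_dir.glob("*.sql")):
        out_csv = out_dir / (sql_file.stem + ".csv")
        if out_csv.exists():          # checkpoint: output already produced
            continue
        tmp = out_csv.with_suffix(".partial")
        run_query(sql_file, tmp)      # a failure here never marks it done
        tmp.rename(out_csv)           # rename last: "done" means "complete"
        ran.append(sql_file.name)
    return ran
```

Writing to a temp file and renaming at the end is the important detail: a query that dies midway never leaves behind a CSV that a re-run would wrongly skip.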
2. Become the domain expert
You can't tell a story about data you don't understand. Before I could write a single section, I needed to know how static analysis scanners work, how remediation flows operate in practice, and what metrics actually matter to security teams.
Several companies in the space publish annual reports on similar topics. I collected and read as many as I could find. Not to copy, but to understand the format, the depth, and the expectations. Reading them gave me a sense of:
- What the industry expects from this kind of resource
- What's already well-covered
- Where there's room to say something new
This also helped me spot gaps. Most reports focus on detection volume. Very few dig into what happens after detection. That became our angle.
Skipping this phase would have meant writing a report full of surface-level observations that didn't differentiate us from the other great content already out there.
3. Talk to your audience early and often
Early versions of the analysis just showed averages. Average fix rate, average time to remediate, average findings per repo. The numbers were fine. The story was boring.
The breakthrough came after talking to actual practitioners: the security engineers, AppSec leads, and CISOs who would be reading the final product. Everyone wanted to answer one question: how do I compare to teams that are doing this well?
That feedback directly shaped two of the biggest decisions in the report.
First, it led to a cohort-based segmentation. I split organizations into two groups: the top 15% by fix rate ("leaders") and everyone else ("the field"). This is similar to how survey-based reports segment by maturity level, except I was using behavioral data rather than self-reported responses. Suddenly the data had contrast:
- Leaders fix 2–3x more vulnerabilities
- They resolve findings caught during code review 9x faster than findings from full repository scans
- They adopt workflow automation features at higher rates and extract more value from them
The segmentation was the difference between "here are some numbers" and "here is something you can act on."
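Mechanically, the cohort split is just a quantile cut on one behavioral metric. A toy sketch with fabricated fix rates (the real analysis computed these from the scan dataset):

```python
def split_cohorts(fix_rates: dict[str, float], top_frac: float = 0.15) -> dict[str, str]:
    """Label the top `top_frac` of orgs by fix rate as leaders, rest as field."""
    ranked = sorted(fix_rates, key=fix_rates.get, reverse=True)
    n_leaders = max(1, round(len(ranked) * top_frac))
    leaders = set(ranked[:n_leaders])
    return {org: ("leader" if org in leaders else "field") for org in fix_rates}

# Made-up fix rates for ten hypothetical organizations.
rates = {"org_a": 0.81, "org_b": 0.34, "org_c": 0.22, "org_d": 0.55,
         "org_e": 0.12, "org_f": 0.47, "org_g": 0.29, "org_h": 0.63,
         "org_i": 0.40, "org_j": 0.18}
cohorts = split_cohorts(rates)
```

Once every organization carries a cohort label, every other metric (time to remediate, feature adoption, review-time fix speed) can be reported as a leaders-vs-field comparison instead of a flat average.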

Second, it reshaped the report's structure. People didn't just want benchmarks. They wanted to know what to do about them. "Great, the leader cohort fixes more code security vulnerabilities. How do I become a leader?" That feedback led me to add an evidence-based recommendations section organized by implementation speed:
- Quick wins for this week
- Process changes for this quarter
- Strategic investments for the half
The final report reads as much like a playbook as it does a benchmark. None of that would have happened without putting early drafts in front of actual readers.
4. Get design involved early
This one I almost learned too late. Data reports live or die on how they look. A wall of charts with no visual hierarchy is just as bad as no data at all.
I brought in our design team sooner than I normally would and spent time walking them through the domain. What does "reachability analysis" mean? Why does the cohort split matter? Once the designers understood the story, they made choices (color coding for cohorts, callout boxes for key insights, before/after code examples) that reinforced it without me having to explain it in text.

5. Give yourself time
This project took months. The data exploration alone took weeks. Then there were iterations on the analysis as I found new angles, design cycles, legal reviews, and rounds of feedback from stakeholders across the company.
If I had tried to ship this in a quarter, the result would have been forgettable.
Where it landed
Looking back, the two things I'd change are both about speed. I'd write down every definition and assumption on day one. Things like "what counts as an active repository" or "how do we calculate fix rate" seem obvious at the start. They become contested fast. I eventually created a formal definitions document covering 40+ metrics, but doing it earlier would have saved several rounds of rework. And I'd bring in a second set of eyes during exploration. Working solo meant there was no one to gut-check whether a finding was interesting or just noise.
The report itself, Remediation at Scale, covers six evidence-backed patterns that separate high-performing security teams from the rest. If you've tackled a similar data-heavy reporting project, I'd be curious to hear what you learned along the way.

