Worked-example report v1 / Delimit team / 2026-05-08 / self-attestation

Same merge gate, third artifact class: turning the classifier on our own MCP server

A real merge gate, turned inward.

Glama added the Tool Definition Quality Score in April 2026; on the rollout, delimit-mcp-server scored C-quality. A deterministic TDQS linter (LED-2108) drove a truth-preserving retrofit across 197 of 202 docstrings. Aggregate grade C (mean 2.66) became A (mean 4.52); no tool sits below B.

What we scanned: delimit-mcp-server (202 tool docstrings)
What we found: C → A (197 lifted); no tool below B
Commit pair: PR #134, db7c46d → 2993c6d
Test delta: +35 tests (4539 → 4574 pass)

The pattern-accumulation point. The previous seven reports in this series ran the merge gate against external OpenAPI surfaces from named vendors. This report points the same primitive inward. The merge gate on AI-written code is one application of the kernel: classify deterministically against a published taxonomy, fail closed on violations, produce a record. The LED-193 autonomous-action daemon is the second application: classify each proposed autonomous action against a precondition contract, fail closed on mismatch, never auto-merge. The TDQS linter on our own MCP tool docstrings is the third application: classify each docstring against the six TDQS dimensions, fail closed below the threshold band, produce a graded record. Same primitive, third artifact class. The gate is general.
Why a self-attestation. Vendor reports answer whether the gate works on an external surface; this one answers whether the same kernel runs when the artifact class shifts.

What we did

We added delimit_tdqs_lint (LED-2108) as a new MCP tool on the gateway. The implementation lives at ai/tdqs_lint.py in the delimit-gateway repo. The linter walks the registered tool surface, extracts each docstring, and classifies it against the six TDQS dimensions: side_effects (does the docstring name the on-disk, network, or process side effects), conciseness (is the docstring tight enough to fit a tool-picker prompt without crowding), coverage (does it explain every parameter and the return shape), parameter_semantics (does it explain what each parameter means in context, not just its type), disambiguation (does it tell the model when to pick this tool over a neighboring tool), and when_to_use (does it tell the model when in a session the tool should fire). Each dimension scores from 1 to 5; the per-tool grade is the rounded mean.
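The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the code in ai/tdqs_lint.py: the six dimension names and the 1-to-5 scale come from the text, while the function names and the letter cutoffs (a rounded mean of 5 is A, 4 is B, 3 is C, anything lower is D) are assumptions chosen to agree with the grades reported here.

```python
from statistics import mean

# Dimension names as listed in the text.
DIMENSIONS = (
    "side_effects", "conciseness", "coverage",
    "parameter_semantics", "disambiguation", "when_to_use",
)

# Assumed letter cutoffs; the report only states that the grade is the rounded mean.
LETTERS = {5: "A", 4: "B", 3: "C"}


def tool_mean(scores: dict[str, int]) -> float:
    """Mean of the six per-dimension scores, each expected to lie in 1..5."""
    for name in DIMENSIONS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} score out of range: {scores[name]}")
    return mean(scores[name] for name in DIMENSIONS)


def letter_grade(value: float) -> str:
    """Round half up to the nearest integer, then map to a letter (D is the floor)."""
    rounded = int(value + 0.5)
    return LETTERS.get(rounded, "D")


# A terse docstring: four dimensions at 1, two at 2.
terse = dict.fromkeys(DIMENSIONS, 1) | {"coverage": 2, "conciseness": 2}
print(round(tool_mean(terse), 2), letter_grade(tool_mean(terse)))
print(letter_grade(2.66), letter_grade(4.52))
```

Under these assumed cutoffs, the two aggregate means in this report (2.66 and 4.52) map to C and A respectively, which matches the published grades.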

We ran the linter against the full delimit-mcp-server surface and got the baseline: 1 A / 16 B / 145 C / 40 D, aggregate mean 2.66, aggregate grade C. We then walked the worst-graded namespaces in 12 batches (LED-2109), one batch per namespace cluster. Each batch read the linter output for its namespace, rewrote the docstrings truth-preservingly to add the missing dimensions named by the linter, and re-ran the linter to confirm the lift. No batch was committed without a clean re-classification. Fourteen commits landed across the 12 batches; two batches required a follow-up commit to close residual gaps the first pass missed.
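The batch rule (no commit without a clean re-classification) can be pictured as a small check over the linter output before and after a rewrite. What follows is a sketch under stated assumptions: the report does not spell out the exact criterion, so "clean" is taken here to mean that no tool's mean drops and every tool reaches a mean that rounds to B or better. The function name, the data shape, the 3.5 cutoff, and the tool names are all hypothetical.

```python
def batch_is_clean(before: dict[str, float], after: dict[str, float],
                   floor: float = 3.5) -> tuple[bool, list[str]]:
    """Return (ok, problems) for one namespace batch.

    `before` and `after` map tool name -> per-tool mean score (1.0 to 5.0).
    `floor` is an assumed cutoff for the B band (a mean that rounds to 4).
    """
    problems = []
    for tool, old in sorted(before.items()):
        new = after.get(tool)
        if new is None:
            problems.append(f"{tool}: missing from re-run")
        elif new < old:
            problems.append(f"{tool}: regressed {old:.2f} -> {new:.2f}")
        elif new < floor:
            problems.append(f"{tool}: still below band ({new:.2f} < {floor})")
    return (not problems, problems)


# Hypothetical scores for a two-tool namespace.
before = {"ns_tool_a": 1.33, "ns_tool_b": 1.33}
first_pass = {"ns_tool_a": 4.50, "ns_tool_b": 3.17}
follow_up = {"ns_tool_a": 4.50, "ns_tool_b": 4.00}

print(batch_is_clean(before, first_pass))  # one tool still below the band
print(batch_is_clean(before, follow_up))   # no problems: the batch may be committed
```

The second call models the follow-up commit case: the first pass leaves a residual gap, and the batch only counts as done once the re-run comes back with no problems.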

Truth-preserving means: the docstring may not claim behavior the implementation does not deliver. If the linter said a tool was missing a side-effect contract and the implementation has no side effects to declare, the new docstring says "no side effects (read-only)" rather than fabricating a write surface. The Prior-View versus Evidence rule binds here: the linter output is a deterministic classifier verdict, not a ground-truth specification. The rewrite is constrained to the actual implementation.

Headline numbers

Baseline aggregate: C (2.66)
Post-retrofit aggregate: A (4.52)
Distribution (old), A / B / C / D: 1 / 16 / 145 / 40
Distribution (new), A / B / C / D: 139 / 63 / 0 / 0

The linter classified 202 tools on both runs. Of those, 197 docstrings were retrofitted; the remaining 5 were already at B or better on the baseline run and were not touched. Per-dimension means on the post-retrofit run: side_effects 4.94, conciseness 3.88, coverage 4.11, parameter_semantics 4.47, disambiguation 4.82, when_to_use 4.93. The two lowest-scoring dimensions (conciseness at 3.88 and coverage at 4.11) pull against each other: a docstring tight enough to score 5 on conciseness tends to leave out parameter or return-shape detail that coverage asks for, and the linter scores both. Across the B-graded tail (63 tools) those two dimensions land in the B band, and the retrofit reached A everywhere else; the aggregate A is the correct verdict on a population where the modal grade is A and no tool is below B.
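The headline figures can be cross-checked with a few lines of arithmetic. The snippet below is a reader-side consistency check, not part of the linter; it uses only numbers quoted in this report.

```python
# Figures copied from this report.
baseline = {"A": 1, "B": 16, "C": 145, "D": 40}
post = {"A": 139, "B": 63, "C": 0, "D": 0}
dimension_means = {
    "side_effects": 4.94, "conciseness": 3.88, "coverage": 4.11,
    "parameter_semantics": 4.47, "disambiguation": 4.82, "when_to_use": 4.93,
}

# Both runs classified the same 202 tools.
assert sum(baseline.values()) == sum(post.values()) == 202

# Every tool is scored on all six dimensions, so the mean of the per-dimension
# means equals the aggregate mean over tools, up to rounding of the published figures.
aggregate = sum(dimension_means.values()) / len(dimension_means)
assert abs(aggregate - 4.52) < 0.01
print(f"aggregate from dimension means: {aggregate:.3f}")
```

The per-dimension means are published to two decimals, so agreement with the 4.52 aggregate is only expected to within that rounding.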

The gateway test count moved from 4539 to 4574 across the work, a delta of +35 tests. Every test in the new total passes. The new tests cover the linter module, the per-dimension scorers, the (Pro) test contract preservation that fell out of one of the retrofit batches, and the regression coverage that pins the post-retrofit grade at A so a future docstring edit cannot silently regress the band.

Findings

2 baseline observations, 2 post-retrofit shifts, 1 pattern-accumulation finding. Each finding cites the exact data from the linter run, the surface affected, and the consumer impact. There is no "breaking" partition on this report because no tool wire signature was modified; the work is a docstring improvement, classified deterministically.

  1. baseline state / finding F1
    change type: aggregate_grade C (mean 2.66 across 202 tools)
    surface: all delimit-mcp-server tools (server.py + delegated modules, 202 docstrings classified)

    The pre-retrofit linter run scored 1 tool at A, 16 at B, 145 at C, and 40 at D. The aggregate mean across the six TDQS dimensions was 2.66, which rounds to a C. The Glama public badge state immediately before the retrofit shipped was A-license / C-quality / A-maintenance. The C is the data point this report exists to explain. A C-quality grade does not mean the tools were broken; the wire-shape behavior of every tool was unchanged. It means a Glama scorer reading the docstring alone, without running the tool, could not always tell which tool to pick, what the parameters meant in context, what the side-effect surface was, or when in a session the tool should fire. That gap is what TDQS measures, and closing it is what shipped.

  2. baseline state / finding F2
    change type: worst_5_baseline (mean 1.33, all governance namespace)
    surface: delimit_gov_health, delimit_gov_status, delimit_gov_policy, delimit_gov_run, delimit_gov_verify

    The five lowest-scoring tools at baseline all had a per-tool mean of 1.33, close to the floor of the 1-to-5 scoring scale. Every one of them lived in the governance namespace, the namespace that owns the merge gate itself. The pattern is not a coincidence. The gov_* tools were among the earliest tools added to the server, and their docstrings were written close to the implementation, when the author had full mental context and did not need the docstring to recover it. By the time TDQS classified them, the docstrings were terse, missing the side-effect contract, missing the disambiguation against neighboring tools, and missing the when-to-use guidance. The retrofit lifted all five into the A or B band by adding the missing dimensions truth-preservingly: nothing was claimed that the implementation did not deliver. The lift was a writing change, not a behavior change.

  3. post-retrofit shift / finding F3
    change type: aggregate_grade A (mean 4.52 across 202 tools)
    surface: all delimit-mcp-server tools after the LED-2109 retrofit

    The post-retrofit linter run scored 139 tools at A, 63 at B, 0 at C, and 0 at D. The aggregate mean rose from 2.66 to 4.52, which rounds to an A. The full per-dimension mean breakdown: side_effects 4.94, conciseness 3.88, coverage 4.11, parameter_semantics 4.47, disambiguation 4.82, when_to_use 4.93. Every C-grade and D-grade tool was lifted at least one full band; the lowest grade in the new distribution is B. The retrofit touched 197 of 202 docstrings; the remaining 5 were already at B or better on the baseline run and were left untouched. No tool wire signature was modified, no parameter renamed, no return shape changed. The shift is entirely in the docstring layer. A Pro user with the prior server installed locally at ~/.delimit/server/ keeps working exactly as before: the same MCP tool calls, the same return shapes, the same behavior. The retrofit is an additive doc improvement, classified deterministically by the same kind of gate that classifies OpenAPI diffs.

  4. pattern accumulation / finding F4
    change type: kernel_application_3 (TDQS linter as inward-facing classifier)
    surface: ai/tdqs_lint.py (the new MCP tool delimit_tdqs_lint, LED-2108) plus 14 commits across 12 retrofit batches (LED-2109)

    The retrofit was driven by a new MCP tool, delimit_tdqs_lint, that classifies each docstring against the six TDQS dimensions deterministically and returns a per-tool grade plus the missing-dimension list. The tool lives at ai/tdqs_lint.py in the delimit-gateway repo, committed as part of PR #134. Once the linter existed, the retrofit ran as 12 batches of docstring edits, each batch ending with a re-run of the linter against the touched namespace and a commit when the namespace lifted. This is the same primitive Delimit ships externally as the merge gate on AI-written code: classify deterministically against a published taxonomy, fail closed on violations, produce a record. Pointed at OpenAPI specs, the primitive is a diff plus semver classifier. Pointed at autonomous-action proposals (the LED-193 daemon), the primitive is a precondition-and-PR gate that never auto-merges. Pointed inward at our own MCP tool definitions, the primitive is the TDQS linter. Same kernel, third artifact class. The classifier is not new; the artifact class is.

  5. post-retrofit shift / finding F5
    change type: test_delta (4539 to 4574, +35, all pass)
    surface: delimit-gateway test suite, pre and post the LED-2108 + LED-2109 work

    The gateway test count moved from 4539 to 4574 across the work, a delta of plus 35. Every test in the new total passes. The 35 added tests cover the new linter module: the per-dimension scorers, the aggregator, the (Pro) test contract preservation that fell out of one of the retrofit batches (commit f663e27), and the regression coverage that pins the post-retrofit grade at A so a future docstring edit cannot silently regress the band. The merge gate that runs on the delimit-gateway repo on every push exercised these tests on PR #134 before merge. A post-merge regression that drops a tool below B will fail the test suite and block the next push. The fail-closed property the merge gate ships externally also runs against the docstring layer of our own MCP tools.
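A regression check of the kind described in F5 might look like the following. This is a sketch only: the real tests live in the delimit-gateway suite and are not reproduced here, so the function names, the grade-table shape, and the sample data are hypothetical.

```python
GRADE_ORDER = "DCBA"  # worst to best, matching the bands used in this report


def tools_below_band(grades: dict[str, str], floor: str = "B") -> list[str]:
    """Return the sorted names of tools graded below `floor`."""
    minimum = GRADE_ORDER.index(floor)
    return sorted(t for t, g in grades.items() if GRADE_ORDER.index(g) < minimum)


def test_no_tool_below_b():
    # In a real suite this table would come from running the linter;
    # here it is hard-coded, hypothetical sample data.
    grades = {"tool_a": "A", "tool_b": "B", "tool_c": "A"}
    offenders = tools_below_band(grades)
    assert offenders == [], f"docstring grade regressed: {offenders}"


test_no_tool_below_b()
print(tools_below_band({"tool_a": "A", "tool_d": "C"}))
```

Listing the offending tools in the assertion message, rather than only failing, keeps the failure actionable: the author sees which docstring to fix.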

Same kernel, three artifact classes

The Delimit governance kernel is small. It classifies an input against a published taxonomy, applies a fail-closed verdict on violations, and emits a record that any reader can replay. Three artifact classes have now run through it.

  1. Merge gate on AI-written code. The product. The classifier is the 27-type OpenAPI diff taxonomy plus a deterministic semver bump rule. The fail-closed point is the breaking-change partition. The record is a signed, replayable attestation per pull request. Reports 1 through 7 in this series exercise this surface against external vendor APIs.
  2. LED-193 daemon on autonomous actions. An internal application. The classifier is the per-action precondition contract. The fail-closed point is a precondition mismatch (commit 4a391aa tightened the skipped-versus-failed semantics on this surface). The record is a per-tick PR-only output that never auto-merges. Same primitive, applied to the autonomous-action artifact class.
  3. TDQS linter on tool definitions. This report. The classifier is the six-dimension TDQS taxonomy. The fail-closed point is any tool below the threshold band. The record is the per-tool grade plus the aggregate distribution. Same primitive, applied to the MCP tool definition artifact class.

The reading is straight: classify deterministically, fail closed on violations, produce signed evidence, and the same kernel runs against whichever artifact class you point it at. The merge gate is not a feature limited to OpenAPI; the OpenAPI diff is the shipped surface. The kernel underneath is general.
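The three-step shape (classify, fail closed, produce a record) is small enough to sketch generically. The code below is an illustration of that shape, not Delimit's implementation: every name in it is hypothetical, the toy classifier grades by docstring length purely for demonstration, and signing of the record is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Record:
    """Outcome of one gate run: every verdict plus the list of violations."""
    verdicts: dict[str, str] = field(default_factory=dict)
    violations: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return not self.violations


def run_gate(items: Iterable[tuple[str, object]],
             classify: Callable[[object], str],
             allowed: set[str]) -> Record:
    """Classify each item; anything outside `allowed` is a violation.

    Fail closed: a classifier error counts as a violation, never a pass.
    """
    record = Record()
    for name, item in items:
        try:
            verdict = classify(item)
        except Exception as exc:  # unclassifiable input must not slip through
            verdict = f"error: {exc}"
        record.verdicts[name] = verdict
        if verdict not in allowed:
            record.violations.append(name)
    return record


# Toy classifier for illustration only: grade a docstring by length alone.
def toy_grade(doc: object) -> str:
    if not isinstance(doc, str):
        raise TypeError("docstring must be a string")
    return "A" if len(doc) >= 40 else "C"


result = run_gate(
    [("good", "Reads the policy file and returns it; no side effects."),
     ("terse", "Runs it."),
     ("broken", None)],
    classify=toy_grade,
    allowed={"A", "B"},
)
print(result.passed, result.violations)
```

Swapping `toy_grade` for a different classifier and `allowed` for a different pass set is the only change needed to point the same loop at another artifact class, which is the generality the section describes.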

What this report is not

Not a marketing reframe. The product Delimit sells is the merge gate for AI-written code, with a signed, replayable attestation as the productized output. TDQS on our own MCP tool definitions is depth signal, not a new product wedge. The point of writing it up under the same worked-example taxonomy is to make the kernel visible: the same primitive that classifies an OpenAPI diff classifies a docstring, and the same fail-closed verdict that blocks a breaking change blocks a C-graded docstring shipping in the public catalog.

Not a defect claim against Glama, who shipped TDQS publicly and gave the directory a useful new dimension. The C-quality baseline was on us; the docstrings were terse because they were written close to the implementation. The retrofit is the disciplined response: read the public classifier, build our own deterministic implementation of the same taxonomy, run it against our surface, and lift the grade truth-preservingly. Glama's next public scrape of the badge will see whatever its scorer sees on the retrofitted docstrings; that result is independent of this report.

The attestation artifact

A Delimit attestation is a bounded evidence record at a single commit pair. The pre-merge gateway commit on this run is db7c46d (the parent of the merge); the post-merge commit is 2993c6d (the squash-merge of PR #134 onto main). The same Delimit version run against the same two commits produces the same per-tool grade table; that is the replayable property. The attestation does not opine on whether a docstring rewrite is "good writing"; it opines on whether the docstring satisfies each of the six TDQS dimensions deterministically, and it emits the per-tool grade plus the per-dimension breakdown. A clean A is as much an artifact as a fail; a fail with a per-dimension miss inventory is an even denser artifact, because every line is something the docstring author knows to fix.
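One way to make the replayable property checkable is to reduce a grade table to a digest over a canonical encoding, so two runs can be compared with a single string. The sketch below assumes a simple tool-to-grade mapping; it is not the Delimit attestation format, it does no signing, and the table shape and function name are hypothetical.

```python
import hashlib
import json


def grade_table_digest(table: dict) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(table, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical runs that produce the same grades in a different key order.
run_1 = {"tool_a": {"grade": "A", "mean": 4.67}, "tool_b": {"grade": "B", "mean": 3.83}}
run_2 = {"tool_b": {"mean": 3.83, "grade": "B"}, "tool_a": {"mean": 4.67, "grade": "A"}}

# Same content, same digest, regardless of key order.
assert grade_table_digest(run_1) == grade_table_digest(run_2)
print(grade_table_digest(run_1)[:12])
```

Sorting keys before hashing matters because JSON object order is not significant; without it, two identical grade tables could hash differently.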

For the precise list of checks, the explicit out-of-scope list, and the reproducibility guarantee, see the attestation methodology v1. This report is the MCP-tool-definition surface of the same primitive that powers the merge gate for AI-written code.

Reproduce locally

Anyone can re-run the linter against the same two commits and verify the same per-tool grade table comes out. The full command sequence:

# Clone the gateway repository
git clone https://github.com/delimit-ai/delimit-gateway.git
cd delimit-gateway

# Baseline run: parent-commit docstrings, linter copied in from the merge commit
# (ai/tdqs_lint.py does not exist on db7c46d, so it is checked out by path)
git checkout db7c46d
git checkout 2993c6d -- ai/tdqs_lint.py
python -m ai.tdqs_lint --all > /tmp/tdqs-baseline.json

# Post-retrofit run (merge commit)
git checkout 2993c6d
python -m ai.tdqs_lint --all > /tmp/tdqs-postretrofit.json

# Compare aggregate grades
python -c "
import json
# Load both linter outputs and print their aggregate entries side by side
b = json.load(open('/tmp/tdqs-baseline.json'))
n = json.load(open('/tmp/tdqs-postretrofit.json'))
print('baseline:', b['aggregate'])
print('postretrofit:', n['aggregate'])
"

If the bytes you get differ from the bytes in this report, that is itself a finding worth reporting; raise it on the Delimit repo and we will look. The two commits above are content-addressed by their SHAs; db7c46d is the parent of the squash-merge of PR #134 (so the linter sees the pre-retrofit docstring state), and 2993c6d is the squash-merge itself (so the linter sees the post-retrofit state). The linter module is at ai/tdqs_lint.py on the merge commit; it is not present on the parent. The baseline-run command above runs the merge-commit copy of the linter against the parent-commit docstrings, which isolates the docstring change from the linter change.

For your own MCP server

If you publish an MCP server through Glama or any other directory that scores tool-definition quality, the same kind of deterministic classifier is straightforward to run against your own surface. The TDQS taxonomy is public at glama.ai/blog/2026-04-03-tool-definition-quality-score-tdqs; reading the methodology and writing a six-dimension scorer is a single-day exercise. Once the scorer exists, the retrofit shape we used (one batch per namespace cluster, truth-preserving rewrites, re-run after each batch) is the cheapest path from a C-band baseline to an A-band steady state. The merge gate on AI-written code is on the GitHub Marketplace at delimit-ai/delimit-action. Free for individual maintainers. Pro tier $10/month for teams.

The signed, replayable attestation is the artifact your reviewers, auditors, or directory scorers can read without rerunning the gate. A C-to-A grade lift with a full per-dimension breakdown is worth publishing too; it is what disciplined tool-definition writing looks like under the gate.