What a year of AI frontier evolution looks like under a cross-vendor merge gate
A real cross-vendor merge gate, end to end.
Delimit ran against the public OpenAI OpenAPI spec between commit 498c71d (2025-04-29, the last in-repo manual update) and the live Stainless-hosted spec retrieved 2026-05-07: 553 surface deltas, a dialect move from OAS 3.0 to 3.1, and info.version held at 2.3.0.
- OpenAI publishes its OpenAPI in the open: the manual branch at github.com/openai/openai-openapi/tree/manual_spec and the live spec at app.stainless.com/api/spec/documented/openai/openapi.documented.yml.
- The analysis is reproducible byte-for-byte from the same commit and the same ETag.
- No coordination with OpenAI; the public spec is the only input.
What we did
We cloned github.com/openai/openai-openapi and checked out the manual_spec branch, which holds the last in-repo authoritative version of the spec. The most recent commit on that branch is 498c71d (2025-04-29). The master branch was reduced to a README at e1cb7a8 on 2025-06-17, with spec authority moved to a Stainless-hosted live URL referenced from the README.
We pulled that live URL (HTTP 200, content-type application/yaml, 2.79 MB, 80,000 lines) on 2026-05-07 and treated it as the new spec for the diff. The old and new specs were passed to delimit lint in its standard configuration. The diff engine classified each change against its 27-type taxonomy, and the semver classifier produced a bump recommendation.
We then walked the same two specs manually and reconciled the engine-flagged surface deltas that were artifacts of the OAS 3.0 to OAS 3.1 dialect shift rather than changes on the wire. We report both numbers below so the reader sees the engine verdict and the wire-shape verdict side by side.
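The reconciliation step can be illustrated with a minimal sketch. It is not Delimit's implementation; the function name `nullable_signature` is hypothetical, and it only handles simple property schemas. It reduces the OAS 3.0 form (`type` plus `nullable: true`) and the OAS 3.1 forms (`anyOf` with a null branch, or a `type` list containing `"null"`) to one comparable signature, so two encodings that accept the same values compare equal.

```python
from typing import Any


def nullable_signature(schema: dict[str, Any]) -> tuple[frozenset[str], bool]:
    """Return (non-null types, accepts_null) for a simple property schema."""
    types: set[str] = set()
    accepts_null = bool(schema.get("nullable", False))  # OAS 3.0 keyword

    declared = schema.get("type")
    if isinstance(declared, str):
        declared = [declared]
    for t in declared or []:
        if t == "null":  # OAS 3.1 allows "null" as a type name
            accepts_null = True
        else:
            types.add(t)

    # OAS 3.1 union form: anyOf: [{type: X}, {type: "null"}]
    for branch in schema.get("anyOf", []):
        branch_types, branch_null = nullable_signature(branch)
        types |= branch_types
        accepts_null = accepts_null or branch_null

    return frozenset(types), accepts_null


# Illustrative property schemas, not copied from the OpenAI spec.
old_prop = {"type": "string", "nullable": True}
new_prop = {"anyOf": [{"type": "string"}, {"type": "null"}]}

same_wire_shape = nullable_signature(old_prop) == nullable_signature(new_prop)
print(same_wire_shape)  # True
```

A property that was not nullable before and is nullable now would still produce a different signature, so a comparison of this kind only collapses the pure representation shift.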
Headline numbers
The path count went from 99 to 161 (+62 paths), the operation count from 148 to 241 (+93 operations), the schema count from 477 to 939 (+462 schemas). The diff engine classified 553 changes: 318 breaking, 132 additive (with patch=0). The largest single category is 196 type_changed events, and on manual reconciliation every one of those is the OAS 3.0 to OAS 3.1 nullable-to-anyOf representation shift, not a wire change (see finding F6). With those 196 reclassified, the wire-breaking partition is roughly 122, still squarely in the major bracket. Sixty-two new endpoints landed (Conversations, Responses utilities, Realtime calls, ChatKit, Videos, Containers, Skills, Audio voices, RBAC), the Assistants surface was formally deprecated, and AdminApiKeyAuth was added to 50 admin operations. The advertised info.version stayed 2.3.0 on both ends because OpenAI carries its real evolution in path expansion plus the Stainless-managed change-log, not in the OpenAPI version field. The gate's independent semver classification is major.
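The deltas quoted above are plain subtraction, and the surface counts themselves can be taken from a parsed spec. The sketch below assumes the spec has already been loaded into a Python dict (for example with a YAML parser); `surface_counts` is a hypothetical helper, and counting one operation per HTTP method key under each path is an assumption about how such counts are usually taken, not a statement of how Delimit counts.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def surface_counts(spec: dict) -> dict[str, int]:
    """Count paths, operations, and component schemas in a parsed OpenAPI document."""
    paths = spec.get("paths", {})
    operations = sum(
        1 for item in paths.values() for key in item if key.lower() in HTTP_METHODS
    )
    schemas = len(spec.get("components", {}).get("schemas", {}))
    return {"paths": len(paths), "operations": operations, "schemas": schemas}


# Toy document for illustration only.
toy_spec = {
    "paths": {
        "/responses": {"get": {}, "post": {}, "parameters": []},
        "/conversations": {"post": {}},
    },
    "components": {"schemas": {"Response": {}, "Conversation": {}}},
}
print(surface_counts(toy_spec))  # {'paths': 2, 'operations': 3, 'schemas': 2}

# Counts reported in this section.
old = {"paths": 99, "operations": 148, "schemas": 477}
new = {"paths": 161, "operations": 241, "schemas": 939}
deltas = {key: new[key] - old[key] for key in old}
print(deltas)  # {'paths': 62, 'operations': 93, 'schemas': 462}

# Wire-breaking partition after setting aside the dialect-shift events.
engine_breaking = 318
dialect_artifacts = 196
wire_breaking = engine_breaking - dialect_artifacts
print(wire_breaking)  # 122
```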
Findings
3 breaking, 3 additive, 1 flagged as spec hygiene. Each finding cites the exact change-type from the 27-type taxonomy, the surface affected, and the consumer impact. The hygiene finding here is the OAS 3.0 to OAS 3.1 representation reconciliation that accounts for 196 of the 318 engine-flagged breaking changes; we surface it so the reader sees both the engine verdict and the wire-shape verdict.
- additive finding F1
  change type: endpoint_added (62 new paths, 63 new operations)
  surface: /conversations, /responses, /realtime/calls, /chatkit, /videos, /containers, /skills, /audio/voices, /audio/voice_consents, /fine_tuning/alpha/graders, /organization/groups, /organization/roles
  Sixty-two new paths and sixty-three new operations landed across the window. The shape of the additions is the AI frontier between 2025 and 2026 in plain text: a Conversations surface for managing dialog state outside of Assistants (4 paths), a Responses surface with cancel and compact and input_tokens utilities (3 paths plus the pre-existing /responses), a Realtime calls surface for accepting, hanging up, referring, and rejecting in-flight realtime calls (6 paths), a ChatKit session and thread surface for client-managed chat state (5 paths), a Videos surface for generation, edits, extensions, and remixes that maps to the Sora line (8 paths), a Containers surface for sandboxed file workspaces (5 paths), a Skills surface for versioned reusable model behaviors (6 paths), an audio Voices and Voice Consents surface for the BYO-voice flow (3 paths), Fine-Tuning grader run and validate plus pause and resume (4 paths), and a fully-rebuilt org admin RBAC under /organization/groups and /organization/roles plus per-project group and role assignments (19 paths). Every one of these is purely additive: no existing path, operation, parameter, or response schema was altered to make room. A consumer pinned to chat completions plus the previous Assistants surface keeps working without code changes; a consumer that wants to use the new surfaces opts in.
- additive finding F2
  change type: deprecated_added (Assistants API formally retired)
  surface: /assistants, /assistants/{assistant_id} (POST, GET, DELETE on each), plus ModelResponseProperties.user and CreateChatCompletionResponse.system_fingerprint and CreateChatCompletionStreamResponse.system_fingerprint
  Five Assistants operations and three model-response fields were marked deprecated in the window. The Assistants surface (create, list, retrieve, update, delete) is the long-standing path for stateful agent behavior; the new spec keeps the operations live but flags every one of them with deprecated=true. The successor surface is the combination of /responses (the Responses API, already present at the start of the window and now expanded with cancel, compact, and input_tokens) and /conversations (new, for managing conversation state explicitly). The deprecation is the textbook two-step migration shape: the new path lands additive, the old path keeps working but is flagged so any code generator or human reader sees the signal. The non-Assistants deprecations are the user identifier on chat completions (replaced by safety_identifier and prompt_cache_key) and the system_fingerprint field on chat completion responses (no longer reported). All three are non-breaking: the field is still present in the schema, just flagged.
- additive finding F3
  change type: security_added (AdminApiKeyAuth required on 50 admin operations)
  surface: /organization/admin_api_keys, /organization/audit_logs, /organization/certificates, /organization/costs, /organization/invites, /organization/projects, /organization/usage/*, /organization/users (50 operations total)
  A new AdminApiKeyAuth security scheme was added at the spec root and applied to fifty admin-tier operations across the org-management surface. This is org-scope access hardening: the surfaces that read audit logs, mutate billing certificates, list per-project costs, manage invites and users, and read per-resource usage now declare an admin-class credential as a security requirement. The diff engine flags this as non-breaking because OpenAPI security additions are non-breaking by spec convention (a client without the admin key would already have been failing at the wire); we surface it as an additive finding because the visibility of the requirement in the spec is itself the artifact, and a downstream code generator that reads this attestation will know to thread an admin credential through any of the fifty operations.
- breaking finding F4
  change type: enum_value_removed (24 removals across realtime events and project-user roles)
  surface: #/components/schemas/RealtimeServerEvent* (response.text.delta, response.audio_transcript.done, semantic_vad turn-detection), #/components/schemas/ProjectUserCreateRequest.role (owner, member)
  Twenty-four enum values were removed across the window. The largest concentration is on the Realtime API server-event types: response.text.delta and response.audio_transcript.done were retired in favor of the new response.output_text.delta and response.output_audio_transcript.done shapes that align Realtime with the Responses API event taxonomy, and the semantic_vad turn-detection mode was retired from RealtimeTranscriptionSessionCreateRequest. The second concentration is on ProjectUserCreateRequest.role: owner and member were removed from the role enum because the project-user role is now derived from the new role-and-group RBAC (roles assigned via /projects/{project_id}/roles, groups via /projects/{project_id}/groups). A consumer that hardcoded a removed string will fail validation against the new spec; the migration is to read response.output_text.delta in the realtime stream and to assign roles via the new RBAC surface. These are intentional model-shape consolidations as the API matures, not silent regressions.
- breaking finding F5
  change type: required_field_added (55 new required fields across response and tool-call schemas)
  surface: #/components/schemas/FileCitationBody.filename, #/components/schemas/Response*Event.sequence_number, #/components/schemas/Invite.created_at, #/components/schemas/UsageTimeBucket.results, plus 51 more
  Fifty-five fields became required in the window. The largest cluster is sequence_number on the Responses API event stream (response.output_text.delta, response.refusal.delta, response.error, and roughly twenty other event types), reflecting the addition of explicit ordering to the streamed event contract that lets a client reorder out-of-order deliveries and detect dropped events. The next cluster is on FileCitationBody, where filename joined the required list so a citation always names its source. Smaller clusters tightened the Invite, UsageTimeBucket, and ApiKey response shapes. A consumer that does not surface these fields on serialization will fail validation against the new spec. The visibility of the required-list change in the attestation is the artifact: a downstream client knows exactly which response shapes need an updated parser without diffing 80,000 lines of YAML.
- spec hygiene finding F6
  change type: engine_under_classification (OAS 3.0 nullable to OAS 3.1 anyOf shift)
  surface: spec-wide (196 properties on shared response schemas)
  The diff engine flagged 196 property type_changed events with the pattern type changed from X to None, where X is one of the six primitive types (string, object, integer, number, boolean, array). On inspection, every one of these is the OAS 3.0-to-OAS 3.1 nullable representation shift: the old spec declared openapi: 3.0.0 and used the property pattern type: X plus nullable: true, and the new spec declares openapi: 3.1.0 and uses the property pattern anyOf: [{type: X}, {type: null}]. The wire shape is identical (both forms accept either X or null); the schema document representation differs because OAS 3.1 retired the nullable keyword and adopted the JSON Schema 2020-12 union form. The engine is correct that the schema property changed; the engine is currently narrow on this particular axis because it does not yet collapse the two equivalent nullable encodings into a single non-breaking change-type. We surface this as a hygiene observation so the reader sees both the engine verdict (196 type_changed) and the underlying truth (one cross-version representation shift, applied at scale). The engine roadmap closes this gap; the visibility of the gap is the artifact this report was written to make legible. With those 196 reclassified, the wire-breaking partition of the diff drops from 318 to roughly 122, still squarely in the major bracket.
- breaking finding F7
  surface: /organization/projects/{project_id}/api_keys/{key_id} (path renamed to {api_key_id}), /responses/{response_id}/input_items?before, /organization/certificates/{certificate_id}?cert_id
  Three explicit removals landed in the window. The largest is the rename of the project api-keys path parameter from {key_id} to {api_key_id}, which the diff engine sees as endpoint_removed plus endpoint_added of the new path. The second is the removal of the before query parameter on /responses/{response_id}/input_items, replaced by a different cursor convention. The third is the removal of the cert_id query parameter on /organization/certificates/{certificate_id} GET, which was redundant with the path parameter. All three are small, surgical, and intentional: the renamed path parameter aligns the project api-keys path with the rest of the org admin surface where api_key_id is the canonical name; the dropped before parameter cleans up a duplicate cursor; the dropped cert_id parameter cleans up a duplicate identifier. A consumer that pinned the old path or sent the old query parameter has a small fix to make. The migration is mechanical.
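The three breaking change-types in the findings reduce to small set comparisons, sketched below. The helper names are hypothetical and this is not the Delimit engine. The schema fragments are illustrative stand-ins: the `file_id` field and the shape of the new `role` schema are assumptions, not copied from the OpenAI spec.

```python
import re


def removed_enum_values(old_schema: dict, new_schema: dict) -> set:
    """Enum values the old schema allowed that the new one no longer lists."""
    return set(old_schema.get("enum", [])) - set(new_schema.get("enum", []))


def added_required_fields(old_schema: dict, new_schema: dict) -> set:
    """Field names that appear in the new required list but not the old one."""
    return set(new_schema.get("required", [])) - set(old_schema.get("required", []))


def path_shape(template: str) -> str:
    """Blank out parameter names so a renamed path parameter compares equal."""
    return re.sub(r"\{[^}]*\}", "{}", template)


# F4-style check: role values dropped from an enum (new schema is hypothetical).
old_role = {"type": "string", "enum": ["owner", "member"]}
new_role = {"type": "string"}
print(sorted(removed_enum_values(old_role, new_role)))  # ['member', 'owner']

# F5-style check: filename joins the required list ("file_id" is hypothetical).
old_citation = {"required": ["file_id"]}
new_citation = {"required": ["file_id", "filename"]}
print(sorted(added_required_fields(old_citation, new_citation)))  # ['filename']

# F7-style check: the two templates differ only in the parameter name.
old_path = "/organization/projects/{project_id}/api_keys/{key_id}"
new_path = "/organization/projects/{project_id}/api_keys/{api_key_id}"
is_rename = old_path != new_path and path_shape(old_path) == path_shape(new_path)
print(is_rename)  # True
```

Comparing path shapes is one way a reader could tell a parameter rename apart from a genuinely removed endpoint when an engine reports the pair as endpoint_removed plus endpoint_added.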
What this report is not
Not a defect claim. Not a security advisory. Not a judgment of the OpenAI team's release process. OpenAI ships one of the largest public APIs on the web, against an installed base measured in millions of integrations, under a year of unprecedented product expansion (Realtime calls, Conversations, ChatKit, Sora, Containers, Skills, RBAC for orgs). The changes flagged above are the textbook shape of an AI-lab API surface evolving fast: 62 new paths, 50 new admin-credential gates, an Assistants surface deprecated in favor of Responses plus Conversations, and an OpenAPI dialect upgrade from 3.0 to 3.1. The merge gate flagged a major-class semver bump and 122 wire-breaking changes (after the OAS-dialect reconciliation in F6); both reflect the actual shape of the diff.
The findings above do not say OpenAI did anything wrong, and they do not say OpenAI did anything special either. They say: here is exactly which paths landed, which fields became required, which enums shrank, which admin operations now require an admin credential, and which surfaces were deprecated in 373 days. That visibility is the artifact. A downstream consumer who reads this attestation knows exactly which client code paths to update, and an auditor knows exactly what shipped without taking anyone's word for it.
The attestation artifact
A Delimit attestation is a bounded evidence record at a single commit pair (or a commit plus a content-addressed live snapshot, as in this report). The same Delimit version run against the same two inputs produces the same bytes; that is the replayable property. The attestation does not opine on whether a change should have shipped, only on what shipped and how the change-type taxonomy classifies it. A clean pass is as much an artifact as a fail; a major-class fail with a 318-line breaking-change inventory is an even denser artifact, because every one of those lines is something a downstream consumer needs to know.
For the precise list of checks, the explicit out-of-scope list, and the reproducibility guarantee, see the attestation methodology v1. This report is the OpenAPI-diff surface of the same primitive that powers the merge gate for AI-written code.
Reproduce locally
Anyone can re-run the analysis above against the same commit and the same live URL and verify the same diff comes out. The full command sequence:
    # Install the CLI
    npm install -g delimit-cli

    # Clone the repo and check out the manual_spec branch
    git clone https://github.com/openai/openai-openapi
    cd openai-openapi
    git checkout manual_spec

    # Extract the old spec at the cited commit (last in-repo manual update)
    git show 498c71d:openapi.yaml > /tmp/openai-old.yaml

    # Pull the live spec that the master-branch README links to
    curl -o /tmp/openai-new.yml \
      https://app.stainless.com/api/spec/documented/openai/openapi.documented.yml

    # Run the merge gate
    delimit lint /tmp/openai-old.yaml /tmp/openai-new.yml
If the bytes you get differ from the bytes in this report, that is itself a finding worth reporting; raise it on the Delimit repo and we will look. The Stainless-hosted live spec evolves continuously (this report fixes the snapshot at ETag 1f85369d, retrieved 2026-05-07); a later pull will produce a different new-side input and therefore a different diff, which is the expected shape of a live-URL attestation.
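Before comparing diffs, it can help to confirm that two people are holding the same input bytes. The sketch below hashes the two local files produced by the commands above with SHA-256. This is a stand-in check chosen for illustration: the report identifies its snapshot by ETag, and nothing here assumes the ETag is itself a SHA-256 digest. The helper name `sha256_of` is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so a multi-megabyte spec is not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    for name in ("/tmp/openai-old.yaml", "/tmp/openai-new.yml"):
        spec_path = Path(name)
        status = sha256_of(spec_path) if spec_path.exists() else "(not downloaded)"
        print(name, status)
```

Two runs that print the same pair of digests started from the same inputs; a different digest on the new side points to a later pull of the live spec.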
For your own API surface
If you ship a public API and want this kind of pre-merge attestation in your CI pipeline, install delimit-cli and run delimit lint <old> <new> against your own specs. The GitHub Action is on the Marketplace at delimit-ai/delimit-action. Free for individual maintainers. Pro tier $10/month for teams.
The signed, replayable attestation is the artifact your reviewers, auditors, or downstream consumers can read without rerunning the gate. A major-class fail with a full breaking-change inventory is worth publishing too; it is what disciplined evolution looks like on a deeply-watched API surface.