fix(llm): route non-OpenAI Azure deployments via Chat Completions #29775
Closed
akirykouski wants to merge 1 commit into
Azure AI Foundry hosts two distinct families: OpenAI-native deployments (gpt-*, o-series), which speak the Responses API, and partner deployments (DeepSeek, Kimi, Llama, etc.), which only speak Chat Completions. The Azure provider routed every model through Responses by default. For partner deployments this works at the network layer, but Azure silently drops max_output_tokens during the Responses-to-Chat translation, capping the underlying call at the chat default (4096 tokens) and producing premature "finish_reason: length" truncations regardless of the user's configured limit.output. Auto-detect by model id so the common case Just Works, while keeping useCompletionUrls as an explicit override in either direction.
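The routing rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in `packages/llm/src/providers/azure.ts`: the names `pickAzureApi`, `isOpenAINative`, `AzureRoutingOptions`, and `OPENAI_NATIVE_PREFIXES` are hypothetical. The prefix list follows the PR description (`gpt-*`, `o1-*`, `o3-*`, `o4-*`). Case-insensitive matching and the reading that `useCompletionUrls: true` means "use Chat Completions" are assumptions.

```typescript
// Hypothetical sketch of model-id based routing; not the provider's real code.
type AzureApi = "responses" | "chat";

interface AzureRoutingOptions {
  // Explicit override. Assumed meaning: true forces Chat Completions,
  // false forces Responses, undefined falls back to auto-detection.
  useCompletionUrls?: boolean;
}

// Prefixes the PR description lists as OpenAI-native deployments.
const OPENAI_NATIVE_PREFIXES = ["gpt-", "o1-", "o3-", "o4-"];

function isOpenAINative(modelId: string): boolean {
  // Case-insensitive comparison is an assumption of this sketch.
  const id = modelId.toLowerCase();
  return OPENAI_NATIVE_PREFIXES.some((prefix) => id.startsWith(prefix));
}

function pickAzureApi(modelId: string, options: AzureRoutingOptions = {}): AzureApi {
  // An explicit override wins in either direction.
  if (options.useCompletionUrls !== undefined) {
    return options.useCompletionUrls ? "chat" : "responses";
  }
  // Otherwise auto-detect: OpenAI-native ids use Responses, the rest use Chat.
  return isOpenAINative(modelId) ? "responses" : "chat";
}

console.log(pickAzureApi("gpt-4o")); // responses
console.log(pickAzureApi("DeepSeek-V4-Pro")); // chat
console.log(pickAzureApi("DeepSeek-V4-Pro", { useCompletionUrls: false })); // responses
```

Checking the override before the prefix match is what keeps existing configs unaffected: a deployment that already sets `useCompletionUrls` never reaches the auto-detection branch.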
Contributor
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please:
See CONTRIBUTING.md for details.
Author
Filed #29776 with the full bug repro and linked it via
Collaborator
This is an experimental package that is opt-in with OPENCODE_EXPERIMENTAL_NATIVE_LLM. It shouldn't be hit by users currently, and it is under active development, so it is not ready for PRs yet.
Issue for this PR
Closes #29776
Also likely fixes #12879 (Kimi K2.5 on Azure Foundry rejected with `role: "developer"`). Same root cause: Azure partner deployments routed to the Responses API instead of Chat Completions. Different visible symptom (the Responses API emits `role: "developer"` for system instructions, which Azure Foundry's chat-completions endpoint rejects). After this change those models route through Chat Completions and emit `role: "system"`. Marking as "likely fixes" rather than "closes" because I reproduced only the truncation symptom; the role-validation symptom would need confirmation from someone with a Kimi K2.5 deployment.

Related: #20078 (the LM Studio case has the same shape, `limit.output` ignored, but a different code path; this PR is the Azure-specific half).

Type of change
What does this PR do?
Azure AI Foundry hosts two distinct families of model deployments:

- OpenAI-native deployments (`gpt-*`, o-series), which support the Responses API
- Partner deployments (DeepSeek, Kimi, Llama, etc.), which only speak Chat Completions

`packages/llm/src/providers/azure.ts` currently routes everything through Responses by default. For partner deployments the request gets accepted at the network layer but two things break:

1. `max_output_tokens` is silently dropped during Azure's internal Responses→Chat translation. The underlying chat call uses its default of 4096 and the response comes back with `finish_reason: "length"` regardless of what `limit.output` the user set (this is the truncation symptom; see #29776, "Azure AI Foundry partner deployments (DeepSeek/Kimi/Llama) capped at 4096 output tokens").
2. System instructions are sent with `role: "developer"` (a Responses-API convention). Azure Foundry's chat-completions endpoint only accepts `system | user | assistant | tool`, so it 422s the whole request (see #12879, "Bad request when using Kimi K2.5 on Azure Foundry").

Concrete repro of (1) on `DeepSeek-V4-Pro` (Azure AI Foundry, `limit.output: 16384`):

(Results table lost in extraction. Surviving fragments: columns `tokens.output` and `finish`; a row for a direct `/chat/completions` call with `max_tokens: 32000`; `finish` values `stop`, `length`, `stop`.)

The fix auto-detects by model id: `gpt-*` / `o1-*` / `o3-*` / `o4-*` go through Responses; everything else uses Chat. `useCompletionUrls: true | false` remains as an explicit override in either direction, so existing configs aren't affected.

How did you verify your code works?
- `bun test packages/llm/test/provider/`: 150 pass, 0 fail (new test file `packages/llm/test/provider/azure.test.ts` covers default routing for OpenAI-native ids, default routing for partner ids, o-series, and both `useCompletionUrls` overrides)
- `bun --cwd packages/llm run typecheck`: clean
- … `azure/DeepSeek-V4-Pro` and confirmed the fix lifts the 4096 cap: model now emits 14k tokens with `finish: stop`

Screenshots / recordings
N/A (no UI changes).
Checklist