fix(llm): route non-OpenAI Azure deployments via Chat Completions#29775

Closed
akirykouski wants to merge 1 commit into anomalyco:dev from akirykouski:fix/azure-chat-completions-default

@akirykouski akirykouski commented May 28, 2026

Issue for this PR

Closes #29776

Also likely fixes #12879 (Kimi K2.5 on Azure Foundry rejected with role: "developer") — same root cause: Azure partner deployments routed to Responses API instead of Chat Completions. Different visible symptom (the Responses API emits role: "developer" for system instructions, which Azure Foundry's chat-completions endpoint rejects). After this change those models route through Chat Completions and emit role: "system". Marking as "likely fixes" rather than "closes" because I reproduced only the truncation symptom; the role-validation symptom would need confirmation from someone with a Kimi K2.5 deployment.

Related: #20078 (LM Studio case has the same shape — limit.output ignored — but a different code path; this PR is the Azure-specific half).

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

Azure AI Foundry hosts two distinct families of model deployments:

  • OpenAI-native (gpt-*, o-series) which support the Responses API
  • Partner deployments (DeepSeek, Kimi, Llama, etc.) which only support Chat Completions

packages/llm/src/providers/azure.ts currently routes everything through Responses by default. For partner deployments the request gets accepted at the network layer but two things break:

  1. max_output_tokens is silently dropped during Azure's internal Responses→Chat translation. The underlying chat call falls back to its default of 4096, and the response comes back with finish_reason: "length" regardless of the limit.output the user set. This is the truncation symptom (see #29776, "Azure AI Foundry partner deployments (DeepSeek/Kimi/Llama) capped at 4096 output tokens").
  2. System instructions get emitted with role: "developer" (a Responses-API convention). Azure Foundry's chat-completions endpoint only accepts system | user | assistant | tool, so it rejects the whole request with a 422 (see #12879, "Bad request when using Kimi K2.5 on Azure Foundry").
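
To make the two breakages concrete, here is a minimal sketch of how the same prompt might be shaped for each API. The `buildBody` helper and its types are hypothetical and are not taken from azure.ts; only the field names (`max_output_tokens` and `role: "developer"` for Responses, `max_tokens` and `role: "system"` for Chat Completions) follow the behavior described above.

```typescript
// Hypothetical helper for illustration only; not the code in azure.ts.
type Api = "responses" | "chat";

interface Prompt {
  system: string;
  user: string;
  maxOutput: number; // stands in for the user's configured limit.output
}

function buildBody(api: Api, model: string, p: Prompt): Record<string, unknown> {
  if (api === "responses") {
    // Responses-style body: output cap is max_output_tokens,
    // system instructions travel with role "developer".
    return {
      model,
      input: [
        { role: "developer", content: p.system },
        { role: "user", content: p.user },
      ],
      max_output_tokens: p.maxOutput,
    };
  }
  // Chat Completions-style body: output cap is max_tokens,
  // system instructions travel with role "system".
  return {
    model,
    messages: [
      { role: "system", content: p.system },
      { role: "user", content: p.user },
    ],
    max_tokens: p.maxOutput,
  };
}
```

If a partner deployment only understands the second shape, a body built for the first one carries neither a `max_tokens` field nor a `system` role, which is consistent with both symptoms listed above.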

Concrete repro of (1) on DeepSeek-V4-Pro (Azure AI Foundry, limit.output: 16384):

| max_tokens path | actual tokens.output | finish |
| --- | --- | --- |
| direct curl /chat/completions with max_tokens: 32000 | 14001 | stop |
| opencode (default Responses routing) | 4096 | length |
| opencode after this change | 14001 | stop |

The fix auto-detects by model id: gpt-* / o1-* / o3-* / o4-* go through Responses; everything else uses Chat. useCompletionUrls: true | false remains as an explicit override in either direction, so existing configs aren't affected.
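
The routing rule can be sketched roughly as follows. The names `pickAzureApi`, `AzureOptions`, and `OPENAI_NATIVE` are hypothetical, and the mapping of `useCompletionUrls: true` to Chat Completions is an assumption inferred from the option's name; the actual implementation in azure.ts may differ.

```typescript
// Hypothetical sketch of the auto-detection; not the PR's actual code.
type AzureApi = "responses" | "chat";

interface AzureOptions {
  // Assumed semantics: true forces Chat Completions, false forces Responses,
  // undefined falls back to detection by model id.
  useCompletionUrls?: boolean;
}

// Prefixes named in the PR description: gpt-*, o1-*, o3-*, o4-*.
const OPENAI_NATIVE = /^(gpt-|o[134]-)/;

function pickAzureApi(modelId: string, options: AzureOptions = {}): AzureApi {
  // An explicit override always wins, in either direction.
  if (options.useCompletionUrls === true) return "chat";
  if (options.useCompletionUrls === false) return "responses";
  // Otherwise route OpenAI-native ids to Responses and everything else to Chat.
  return OPENAI_NATIVE.test(modelId) ? "responses" : "chat";
}

console.log(pickAzureApi("gpt-4o"));          // "responses"
console.log(pickAzureApi("o3-mini"));         // "responses"
console.log(pickAzureApi("DeepSeek-V4-Pro")); // "chat"
console.log(pickAzureApi("DeepSeek-V4-Pro", { useCompletionUrls: false })); // "responses"
```

Checking the override before the pattern keeps the detection a pure default: any config that already sets the flag behaves the same regardless of the model id.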

How did you verify your code works?

  • bun test packages/llm/test/provider/ — 150 pass, 0 fail (new test file packages/llm/test/provider/azure.test.ts covers default routing for OpenAI-native ids, default routing for partner ids, o-series, and both useCompletionUrls overrides)
  • bun --cwd packages/llm run typecheck — clean
  • Reproduced the truncation symptom (#29776) on azure/DeepSeek-V4-Pro and confirmed the fix lifts the 4096 cap: the model now emits 14k tokens with finish: stop

Screenshots / recordings

N/A (no UI changes).

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

Azure AI Foundry hosts two distinct families: OpenAI-native deployments
(gpt-*, o-series) which speak the Responses API, and partner deployments
(DeepSeek, Kimi, Llama, etc.) which only speak Chat Completions. The
Azure provider routed every model through Responses by default. For
partner deployments this works at the network layer but Azure silently
drops max_output_tokens during the Responses-to-Chat translation,
capping the underlying call at the chat default (4096 tokens) and
producing premature "finish_reason: length" truncations regardless of
the user's configured limit.output.

Auto-detect by model id so the common case Just Works while keeping
useCompletionUrls as an explicit override either direction.

@github-actions
Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

@akirykouski
Author

Filed #29776 with the full bug repro and linked it via Closes in the PR body. CI is green now.

@rekram1-node
Collaborator

This is an experimental package that is opt-in via OPENCODE_EXPERIMENTAL_NATIVE_LLM. It shouldn't be hit by users currently, and it is under active development, so it isn't ready for PRs yet.
