- Released: February 16, 2026
- Downloads (Last Month): 390,092
- Parameters: 397B total / 17B activated (MoE)
- Architecture: Mixture-of-Experts (512 experts, 11 active)
- Context Length: Up to 1,010,000 tokens
Overview
Qwen3.5-397B-A17B is Alibaba Cloud's next-generation foundation model, representing a significant leap forward in AI capabilities. Released on February 16, 2026, this model features a hybrid architecture combining Gated Delta Networks with sparse Mixture-of-Experts, enabling efficient inference with minimal latency.
What sets Qwen3.5 apart is its focus on native multimodal agents — a model designed from the ground up to handle text, images, and video while excelling at agentic tasks like tool calling, web browsing, and complex multi-step workflows.
Key Features
1. Unified Vision-Language Foundation
Qwen3.5 reaches cross-generational parity with Qwen3 on text tasks while outperforming the Qwen3-VL models across reasoning, coding, agent, and visual-understanding benchmarks. The model was trained with early fusion on multimodal tokens, enabling seamless understanding across text, images, and video.
2. Efficient Hybrid Architecture
- Gated Delta Networks + Mixture-of-Experts (MoE)
- 512 total experts, 11 activated per inference
- Only 17B parameters active at any time (vs 397B total)
- High-throughput inference with minimal latency
This design makes Qwen3.5 significantly more efficient than dense models of comparable capability, reducing compute costs for production deployments.
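The routing step at the heart of this design can be sketched in a few lines: a router scores all 512 experts for each token, keeps the top 11, and softmax-normalizes their gate weights. This is an illustrative sketch of generic top-k MoE routing, not Qwen3.5's actual router implementation:

```python
import math

def route_tokens(logits, k=11, num_experts=512):
    """Pick the top-k experts for one token and softmax-normalize
    their gate weights, as in a typical sparse-MoE router.
    `logits` holds one router score per expert for this token."""
    assert len(logits) == num_experts
    # Select the k highest-scoring experts.
    top = sorted(range(num_experts), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the selected experts' scores.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Example: a token whose router strongly prefers experts 3 and 7.
scores = [0.0] * 512
scores[3], scores[7] = 2.0, 1.5
selected = route_tokens(scores)
print(len(selected))   # 11 experts active out of 512
print(selected[0][0])  # 3 — the top-scoring expert comes first
```

Only the 11 selected experts' feed-forward weights participate in the token's forward pass, which is where the 17B-active figure comes from.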
3. Ultra-Long Context
- Native context length: 262,144 tokens
- Extended context: Up to 1,010,000 tokens via RoPE scaling
- Perfect for long-document analysis, codebase understanding, and extended conversations
4. Global Linguistic Coverage
Support for 201 languages and dialects, enabling worldwide deployment with nuanced cultural and regional understanding.
5. Agentic Capabilities
Qwen3.5 excels at tool calling and agentic workflows:
- Built-in tool support via the Qwen-Agent framework
- Strong performance on BFCL-V4, TAU2-Bench, and other agent benchmarks
- MCP-Mark compatibility for Model Context Protocol servers
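In practice, tool calling against an OpenAI-compatible endpoint starts with declaring each tool as a JSON schema. The sketch below builds one such declaration; the `get_weather` tool is purely hypothetical:

```python
# Hypothetical weather tool, declared in the OpenAI-compatible JSON schema
# that servers such as vLLM and SGLang accept for function calling.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# Passed to the chat endpoint as `tools=[get_weather]`; the model replies
# with a `tool_calls` entry naming the function and its JSON arguments.
print(get_weather["function"]["name"])  # get_weather
```

The Qwen-Agent framework wraps this declare-call-observe loop for you, but the underlying wire format is the schema shown here.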
Benchmark Performance
Qwen3.5 achieves competitive performance across multiple benchmarks:
Language & Reasoning
| Benchmark | Qwen3.5-397B | GPT-5.2 | Claude 4.5 |
|---|---|---|---|
| MMLU-Pro | 87.8 | 87.4 | 89.5 |
| MMLU-Redux | 94.9 | 95.0 | 95.6 |
| IFEval (Instruction Following) | 92.6 | 94.8 | 90.9 |
| HMMT Nov 25 | 92.7 | 100 | 93.3 |
Coding
| Benchmark | Qwen3.5-397B |
|---|---|
| LiveCodeBench v6 | 83.6 |
| SWE-bench Verified | 76.4 |
| SecCodeBench | 68.3 |
Vision-Language
| Benchmark | Qwen3.5-397B | Claude 4.5 |
|---|---|---|
| MMMU | 85.0 | 80.7 |
| MMMU-Pro | 79.0 | 70.6 |
| RealWorldQA | 83.9 | 77.0 |
| VideoMME (with subs) | 87.5 | 77.6 |
Hardware Requirements
Minimum (for testing)
- VRAM: ~32GB (with quantization)
- RAM: 64GB system RAM
- GPU: 1x A100 (40GB) or equivalent
- Context: Reduced (~32K tokens)
Recommended (for production)
- VRAM: 256-512GB (8 GPUs with tensor parallelism)
- RAM: 128GB+ system RAM
- GPU: 8x A100/H100 with NVLink
- Context: Full 262K+ tokens
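A back-of-the-envelope check explains these numbers: weight memory is simply parameter count times bytes per parameter, ignoring KV cache, activations, and framework overhead. A minimal sketch:

```python
def weight_memory_gb(params_b, bytes_per_param):
    """Weight-only footprint in GB: parameters (in billions) times
    bytes per parameter. Ignores KV cache, activations, and overhead."""
    return params_b * bytes_per_param

# 397B total parameters at common precisions:
for precision, bytes_pp in [("BF16", 2), ("FP8", 1), ("INT4", 0.5)]:
    print(f"{precision}: ~{weight_memory_gb(397, bytes_pp):g} GB of weights")
```

At BF16, weights alone approach 800GB before any KV cache is allocated, which is why full-precision serving needs a multi-GPU node and why quantized variants matter for smaller clusters.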
For Local Use
Qwen3.5 is supported by multiple local inference frameworks:
- Ollama: Native support
- LM Studio: UI-based deployment
- MLX-LM: Apple Silicon optimized
- llama.cpp: CPU inference
- KTransformers: CPU-GPU heterogeneous computing
Inference Frameworks
Qwen3.5 supports multiple serving frameworks for different use cases:
1. SGLang (Recommended for Throughput)
```shell
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --port 8000 \
  --tp-size 8 \
  --context-length 262144 \
  --reasoning-parser qwen3
```
2. vLLM (Production-Grade)
```shell
vllm serve Qwen/Qwen3.5-397B-A17B \
  --port 8000 \
  --tensor-parallel-size 8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3
```
3. Hugging Face Transformers (Quick Testing)
```shell
pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"
transformers serve --force-model Qwen/Qwen3.5-397B-A17B --port 8000
```
Thinking Mode vs. Instruct Mode
Qwen3.5 operates in thinking mode by default, generating reasoning content wrapped in `<think>` tags before producing its final response, similar to reasoning models like QwQ-32B.
To disable thinking mode and get direct responses:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Your question here"}],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    },
)
```
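When thinking mode stays on, the reasoning arrives wrapped in `<think>` tags. Serving frameworks can separate it server-side (that is what the `--reasoning-parser qwen3` flag above does); as a client-side fallback, a small helper can split the two parts. The regex approach here is an illustrative assumption, not an official parser:

```python
import re

def split_thinking(text):
    """Separate <think>...</think> reasoning from the final answer.
    Returns (reasoning, answer); reasoning is "" if thinking was off."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

raw = "<think>The user wants a greeting.</think>Hello! How can I help?"
reasoning, answer = split_thinking(raw)
print(answer)  # Hello! How can I help?
```

This lets an application log or display the reasoning trace separately while showing users only the final answer.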
Use Cases
1. Long-Document Analysis
- Legal contract review
- Scientific paper analysis
- Technical documentation summarization
- Multi-part document comprehension
2. Coding & Software Development
- Large codebase understanding
- Automated code review
- Bug detection and fixing
- Terminal automation (via Qwen Code)
3. Multimodal Agents
- Web browsing with visual understanding
- Document processing (OCR + understanding)
- Image analysis with detailed reasoning
- Video understanding and QA
4. Global Applications
- Multi-language customer support
- Localization and translation
- Cultural nuance-aware content generation
- Regional compliance checking
Comparison with Other Models
| Model | Parameters | Downloads (Last Mo) | Context | Multimodal |
|---|---|---|---|---|
| Qwen3.5-397B | 397B/17B active | 390K | 1M+ | ✓ Text/Vision/Video |
| Qwen3-VL-235B | 235B/22B active | - | - | ✓ Text/Vision |
| Mixtral-8x7B | 46.7B/12.9B active | High | 32K | ✗ Text-only |
| DeepSeek-V3 | 671B/37B active | Very High | 128K | ✗ Text-only |
Getting Started
Installation
```shell
# Using transformers
pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"
pip install torch torchvision pillow

# Using vLLM
pip install vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly

# Using SGLang
uv pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python&egg=sglang[all]'
```
Python Example
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen3.5-397B-A17B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# Text generation
input_text = "Explain quantum computing in simple terms."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Quantized Variants
Qwen3.5 is available in multiple quantized formats for efficient inference:
- Qwen/Qwen3.5-397B-A17B-FP8: FP8 quantization (68.8K downloads)
- Additional quantizations (INT4, INT8) may be available through third-party repos
Quantized models reduce memory requirements significantly, enabling deployment on smaller GPU clusters.
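To see why, divide the quantized weights across a tensor-parallel group: an FP8 checkpoint (roughly 1 byte per parameter) sharded eight ways leaves headroom for KV cache on 80GB cards. A rough, weights-only sketch:

```python
def per_gpu_weights_gb(params_b, bytes_per_param, tp_size):
    """Per-GPU weight shard under tensor parallelism (weights only;
    KV cache and activations come on top of this)."""
    return params_b * bytes_per_param / tp_size

print(per_gpu_weights_gb(397, 1, 8))  # ~49.6 GB per GPU at FP8, TP=8
print(per_gpu_weights_gb(397, 2, 8))  # ~99.3 GB per GPU at BF16, TP=8
```

The BF16 shard alone exceeds an 80GB card, while the FP8 shard fits with room for KV cache, which is why the FP8 variant is the practical choice for a single 8-GPU node.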
Model Variants
As of February 24, 2026, the Qwen3.5 series includes:
- Qwen3.5-397B-A17B (Main release)
  - 390,092 downloads last month
  - Full multimodal capabilities
  - Best for production workloads
- Qwen3.5-397B-A17B-FP8
  - 68,800 downloads
  - FP8 quantized for efficiency
  - Lower memory footprint
Note: Alibaba has indicated "more sizes are coming" — expect additional variants (smaller MoE models, dense versions) in future releases.
Strengths
✅ Massive context window (1M+ tokens for ultra-long documents)
✅ Strong multimodal performance (text, vision, video)
✅ Efficient MoE architecture (only 17B active)
✅ Excellent agentic capabilities (tool calling, web browsing)
✅ Broad language support (201 languages)
✅ Open-source (Apache 2.0 license likely)
✅ Multiple framework support (SGLang, vLLM, Transformers, etc.)
Weaknesses
⚠️ High hardware requirements for full context (512GB+ VRAM recommended)
⚠️ Large model size (397B parameters) requires significant storage
⚠️ New model; community ecosystem still developing
⚠️ Thinking mode by default may add latency for simple queries
⚠️ Limited deployment examples compared to more established models
Best Practices
Sampling Parameters
Thinking Mode (Default):
- Temperature: 0.6
- Top P: 0.95
- Top K: 20

Instruct (Non-Thinking) Mode:
- Temperature: 0.7
- Top P: 0.8
- Top K: 20
- Presence Penalty: 1.5
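These presets can be bundled once and reused. The sketch assumes an OpenAI-compatible server such as vLLM or SGLang that accepts `top_k` and `chat_template_kwargs` through `extra_body` (the stock OpenAI client does not expose them as named arguments):

```python
# Sampling presets from the recommendations above, shaped as kwargs for
# an OpenAI-compatible chat endpoint.
THINKING = {
    "temperature": 0.6,
    "top_p": 0.95,
    "extra_body": {"top_k": 20},
}
INSTRUCT = {
    "temperature": 0.7,
    "top_p": 0.8,
    "presence_penalty": 1.5,
    "extra_body": {"top_k": 20, "chat_template_kwargs": {"enable_thinking": False}},
}

# Usage (server assumed running):
# client.chat.completions.create(model=..., messages=..., **INSTRUCT)
print(INSTRUCT["temperature"])  # 0.7
```

Keeping the two presets side by side avoids accidentally running instruct-style sampling against thinking-mode outputs.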
Output Length
- Standard queries: 32,768 tokens recommended
- Complex reasoning (math/coding): 81,920 tokens for benchmarks
Context Management
- Maintain at least a 128K context window to preserve thinking capabilities
- Use RoPE scaling for ultra-long contexts (>262K tokens)
- Apply context-folding strategies for long-running search agents
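A minimal eviction loop illustrates the last point: drop the oldest non-system turns until the conversation fits a token budget. This is a generic sketch with a stand-in word counter, not Qwen3.5's context-folding mechanism itself:

```python
def trim_messages(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the conversation fits
    the token budget. `count_tokens` is any tokenizer-backed counter."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(count_tokens(m["content"]) for m in system + rest) > max_tokens:
        rest.pop(0)  # evict the oldest turn first
    return system + rest

# Crude whitespace counter stands in for a real tokenizer here.
count = lambda text: len(text.split())
history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "first question about topic A"},
    {"role": "assistant", "content": "a long answer " * 10},
    {"role": "user", "content": "follow-up question"},
]
trimmed = trim_messages(history, max_tokens=30, count_tokens=count)
print(len(trimmed))  # 2 — oldest turns evicted, system prompt kept
```

Production agents usually summarize ("fold") evicted turns instead of discarding them outright, but the budget-check loop is the same.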
Licensing
Qwen3.5 follows Alibaba's open-source licensing model (check the specific license on the Hugging Face repository for exact terms). Previous Qwen models used Apache 2.0 or similar permissive licenses.
Resources
- Hugging Face: Qwen/Qwen3.5-397B-A17B
- GitHub: QwenLM/Qwen3.5
- Blog: Qwen3.5 Announcement
- Collection: Qwen3.5 Models
- Qwen-Agent: Agentic Framework
- Qwen Code: Terminal Agent
Conclusion
Qwen3.5-397B-A17B represents a significant advancement in open-source AI, combining massive scale with efficient Mixture-of-Experts architecture and strong multimodal capabilities. With its focus on native agentic workflows, ultra-long context, and broad language support, it's well-suited for:
- Enterprise applications requiring tool integration
- Long-document analysis and research
- Multilingual global deployments
- Complex agentic workflows
For organizations with sufficient GPU resources, Qwen3.5 offers a compelling alternative to proprietary models like GPT-5.2 and Claude 4.5, especially for use cases requiring multimodal understanding, ultra-long context, or agentic capabilities.
Last Updated: February 24, 2026