vLLM v1 System-Level Architecture Analysis (Overview)

Analysis date: 2026-04-20
Code directory: vllm/vllm/v1

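1. Overall architecture overview
2. Architectural patterns and design approach
3. End-to-end execution flow
4. Submodule analysis
- 4.1 engine: engine layer
- 4.2 core/sched: core scheduling layer
- 4.3 worker: worker execution layer
- 4.4 attention: attention computation layer
- 4.5 sample: sampling layer
- 4.6 spec_decode: speculative decoding layer
- 4.7 executor: executor layer
- 4.8 pool: pooling module
- 4.9 structured_output: structured output
- 4.10 kv_offload: KV cache offloading
- 4.11 metrics: metrics and monitoring
5. Module call relationships and data flow
6. Architecture diagram index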

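1. Overall architecture overview

vLLM v1 is the second-generation architecture of the vLLM inference engine. It combines a six-layer design with pluggable backends. Compared with v0, its main improvements are:

- Process separation: EngineCore runs in its own background process and talks to the frontend over ZMQ/TensorIPC, so computation overlaps with I/O.
- Unified scheduling: one Scheduler manages both generation and pooling requests and supports chunked prefill.
- Pluggable backends: Attention, SpecDecode, StructuredOutput, and KVOffload are each implemented through a registration mechanism plus runtime selection.
- Multi-device abstraction: WorkerBase → GPUWorker/CPUWorker/XPUWorker device backends.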

Core data flow

User request → AsyncLLM → InputProcessor → CoreClient → [ZMQ] → EngineCore → Scheduler → SchedulerOutput → Executor → Worker → GPUModelRunner → [Model Forward + Attention + Sample] → ModelRunnerOutput → [return path] → EngineCore → CoreClient → OutputProcessor → RequestOutput

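Module statistics

Module directory	Files	Core responsibility
engine/	14	API entry point, input processing, output assembly, inter-process communication
core/	8+	Request scheduling, KV cache management, prefix caching
worker/	30+	GPU/CPU/XPU execution, batch management, CUDA Graph
attention/	30+	Attention backends (FlashAttn/FlashInfer/MLA, etc.)
sample/	10+	Sampling, logits processing, TopK/TopP
spec_decode/	10+	Speculative decoding (Eagle/Medusa/Ngram)
pool/	3	Pooling metadata, late-interaction scoring
structured_output/	7	Structured output (xgrammar/outlines)
kv_offload/	12	KV cache offloading to CPU, LRU/ARC eviction
executor/	8	Executor abstraction, multi-process/Ray distribution
metrics/	8	Prometheus metrics, performance statistics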

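2. Architectural patterns and design approach

2.1 Layered architecture

v1 uses a strict six-layer structure:

Layer	Name	Core components	Responsibility
L1	API/Frontend	AsyncLLM, InputProcessor, OutputProcessor	Request intake, parameter validation, output formatting
L2	Engine Core	EngineCore, CoreClient, Coordinator	Scheduling loop, inter-process communication, data-parallel coordination
L3	Scheduling	Scheduler, KVCacheManager, RequestQueue	Request scheduling, KV cache block management, prefix caching
L4	Worker/Execution	GPUWorker, GPUModelRunner, InputBatch	Model execution, batch state management, CUDA Graph
L5	Functional Subsystems	Attention, Sample, SpecDecode, Pool, StructOutput, KVOffload	Implementations of the individual compute functions
L6	Executor/Distributed	MultiprocExecutor, RayExecutor	Multi-GPU orchestration, distributed communication

How the layers communicate:

- L1 ↔ L2: ZMQ (multi-process) or direct calls (single-process InprocClient)
- L2 ↔ L3: EngineCore calls the Scheduler directly
- L3 ↔ L4: SchedulerOutput is passed to the Worker through the Executor
- L4 ↔ L5: GPUModelRunner calls each subsystem through its interface
- L4 ↔ L6: the Executor creates and manages the Worker processes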

2.2 Pluggable backends

Several subsystems follow the same pattern: abstract base class + registry + runtime selection.

AttentionBackend (ABC) → FlashAttn / FlashInfer / MLA / TritonAttn / ...
    ↕ Registry + AttentionSelector (chosen automatically from hardware/model)
Executor (ABC) → UniprocExecutor / MultiprocExecutor / RayExecutor
    ↕ Executor.get_class() (chosen from configuration)
KVOffloadManager (ABC) → CPUOffloadManager
    ↕ factory.py (created from configuration)
StructuredOutputBackend → xgrammar / outlines / lm_format_enforcer
    ↕ StructuredOutputManager (chosen per request)
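The pattern above can be sketched in a few lines of Python. This is a minimal illustration only: `Backend`, `register_backend`, `select_backend`, and the two toy backends are hypothetical names invented for this example and do not mirror vLLM's actual registry or selector code.

```python
from abc import ABC, abstractmethod
from typing import Optional

# Hypothetical registry: maps a backend name to its class.
_BACKENDS = {}


def register_backend(name: str):
    """Class decorator that records a backend class under `name`."""
    def wrap(cls):
        _BACKENDS[name] = cls
        return cls
    return wrap


class Backend(ABC):
    @classmethod
    @abstractmethod
    def supports(cls, device: str) -> bool:
        """Return True if this backend can run on `device`."""


@register_backend("flash")
class FlashBackend(Backend):
    @classmethod
    def supports(cls, device: str) -> bool:
        return device == "cuda"


@register_backend("triton")
class TritonBackend(Backend):
    @classmethod
    def supports(cls, device: str) -> bool:
        return device in ("cuda", "rocm")


def select_backend(device: str, preferred: Optional[str] = None):
    """Use the preferred backend if it supports the device,
    otherwise fall back to the first registered backend that does."""
    if preferred is not None and _BACKENDS[preferred].supports(device):
        return _BACKENDS[preferred]
    for cls in _BACKENDS.values():
        if cls.supports(device):
            return cls
    raise ValueError(f"no backend supports {device!r}")
```

New backends only need the decorator; the selection function never changes, which is the main appeal of the registry approach.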

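2.3 Data-parallel architecture

v1 supports Data Parallel (DP) inference:

- DPCoordinator: coordinates data parallelism across multiple engines
- CoreClient: routes late-interaction requests to a specific engine using a CRC32 hash
- Wave mechanism: in DP mode, requests are scheduled in "waves" so that the engines finish in sync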
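To make the CRC32 routing idea concrete, here is a minimal sketch. The function name `engine_index_for` and the choice of routing key are assumptions for illustration; the article does not say which field vLLM hashes.

```python
import zlib


def engine_index_for(key: str, num_engines: int) -> int:
    """Map a routing key to an engine index with CRC32.

    `key` is a hypothetical routing key (for example a query identifier).
    """
    if num_engines <= 0:
        raise ValueError("num_engines must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_engines
```

CRC32 is deterministic across processes, whereas Python's built-in `hash()` for strings is randomized per process by default, so a checksum like this keeps related requests on the same engine regardless of which frontend process computes the route.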

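2.4 Key design decisions

- EngineCore in its own process: sidesteps the GIL so the scheduling loop does not block the API layer
- TensorIPC: zero-copy transfer of GPU tensors (through shared memory), avoiding serialization overhead
- Chunked prefill: a long prompt can be scheduled across several steps and mixed with decode requests
- Block-based KV cache: the KV cache is managed in fixed-size blocks, which enables prefix sharing
- CUDA Graph replay: decode steps are captured as CUDA Graphs and replayed to avoid CPU overhead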
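The block-based KV cache decision is easier to picture with a toy allocator. The sketch below is an assumption-laden illustration (the class `ToyBlockPool` and its methods are made up here, not vLLM's `BlockPool`): blocks come from a free list, and reference counts let two requests share the same prefix blocks.

```python
class ToyBlockPool:
    """Toy fixed-size block allocator with reference counts (hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.ref = [0] * num_blocks

    def blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partially filled block still occupies a block.
        return -(-num_tokens // self.block_size)

    def allocate(self, num_tokens: int) -> list:
        n = self.blocks_needed(num_tokens)
        if n > len(self.free):
            raise MemoryError("not enough free blocks")
        ids = [self.free.pop(0) for _ in range(n)]
        for b in ids:
            self.ref[b] = 1
        return ids

    def share(self, ids) -> None:
        """Another request reuses the same (prefix) blocks."""
        for b in ids:
            self.ref[b] += 1

    def release(self, ids) -> None:
        """Drop one reference; a block returns to the free list at zero."""
        for b in ids:
            self.ref[b] -= 1
            if self.ref[b] == 0:
                self.free.append(b)
```

With 16-token blocks, a 40-token request takes three blocks; if a second request shares the first two, those two stay allocated until both requests release them.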

3. End-to-end execution flow

3.1 Full lifecycle of a generation request

Phase 1: Request intake
  AsyncLLM.generate(prompt, SamplingParams)
    → InputProcessor.process()
        → Renderer: prompt → token_ids + mm_features
        → SamplingParams.verify() → sets task, temperature, etc.
    → CoreClient.add_request(EngineCoreRequest)
        → ZMQ socket sends the msgpack-encoded request

Phase 2: Scheduling
  EngineCore scheduling loop:
    → Scheduler.schedule()
        → RequestQueue: take the pending requests
        → KVCacheManager: allocate KV cache blocks
        → Prefix cache matching (hash-based block reuse)
        → Build SchedulerOutput (NewRequestData + CachedRequestData)
    → Executor.execute_model(SchedulerOutput)

Phase 3: Execution
  Executor → GPUWorker → GPUModelRunner
    → InputBatch: assemble the batch data
    → Model.forward() → hidden_states
    → Attention: run attention with the selected backend
    → Sampler: logits → penalties → topk_topp → sample tokens
    → (if spec_decode: draft tokens → rejection sampling)
    → (if pool: PoolingRunner.pool() → embedding)
    → (if structured_output: grammar-guided sampling)
    → ModelRunnerOutput (token_ids, logprobs, pooler_output, ...)

Phase 4: Output processing
  EngineCore.update_from_output()
    → Scheduler handles finished requests and frees their KV blocks
    → EngineCoreOutput → CoreClient → OutputProcessor
        → Detokenizer: token_ids → text
        → LogprobsProcessor: format logprobs
    → RequestOutput is returned to the user

3.2 Chunked prefill

Long prompt (e.g. 4096 tokens, chunk_size=1024)
│
├─ Step 1: schedule 1024 tokens (prefill chunk 1)
│    → KVCacheManager allocates blocks 0-15 (16 blocks, i.e. 64 tokens per block in this example)
│    → Model forward (prefill attention)
│    → Save hidden_states to PoolingStates (for pool requests)
│    → No token is emitted (prefill is not complete yet)
│
├─ Step 2: schedule 1024 tokens (prefill chunk 2)
│    → Reuse blocks 0-15, add blocks 16-31
│    → Model forward
│    → Still no token emitted
│
├─ ... (chunk 3)
│
└─ Step 4: the final chunk completes
     → PoolingCursor.is_finished() == True
     → Run pooling aggregation
     → Output embedding / classification
     → (for a generation request) → start the decode steps
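The arithmetic in the example above can be reproduced with a short generator. This is a sketch under stated assumptions: `plan_prefill_chunks` is a hypothetical helper, and the 64-token block size is inferred from the example (1024 tokens filling blocks 0-15), not taken from vLLM's defaults.

```python
def plan_prefill_chunks(prompt_len: int, chunk_size: int, block_size: int):
    """Yield (step, tokens_this_step, total_blocks_so_far, is_last)
    for each scheduling step of a chunked prefill."""
    done = 0
    step = 0
    while done < prompt_len:
        step += 1
        n = min(chunk_size, prompt_len - done)
        done += n
        # Blocks are cumulative: earlier blocks are reused, new ones appended.
        total_blocks = -(-done // block_size)  # ceiling division
        yield step, n, total_blocks, done == prompt_len


for step, n, blocks, last in plan_prefill_chunks(4096, 1024, 64):
    print(f"step {step}: {n} tokens, {blocks} blocks allocated, last={last}")
```

Only the step where `is_last` is true can produce output (a pooled result or the first generated token); the earlier steps just fill the KV cache.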

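3.3 Speculative decoding flow

Step 1: Draft
  SpecDecodeProposer (Eagle/Medusa/Ngram)
    → generates k draft tokens (with their probability distributions)

Step 2: Verify
  GPUModelRunner.execute_model()
    → feeds the draft tokens into the model together
    → the model outputs logits for every position

Step 3: Reject (rejection sampling)
  RejectionSampler
    → compares the model logits with the draft probabilities
    → accepts the draft tokens that match
    → rejects the ones that do not and resamples from the model distribution
    → outputs the final token sequence + accepted length

Metric: acceptance_rate = accepted / proposed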
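For readers who want to see the reject step in code, here is a sketch of the standard speculative-sampling acceptance rule from the literature: accept a draft token with probability min(1, p/q), otherwise resample from the residual max(p - q, 0). It is not claimed to be how vLLM's RejectionSampler is written; `verify_drafts` and its arguments are hypothetical, and the bonus token after a fully accepted run is omitted for brevity.

```python
import random


def verify_drafts(draft_tokens, draft_probs, target_probs, rng):
    """Return accepted tokens, plus one corrected token at the first rejection.

    draft_probs[i] and target_probs[i] are full distributions (lists of
    floats over the vocabulary) at draft position i.
    """
    out = []
    for tok, q, p in zip(draft_tokens, draft_probs, target_probs):
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual distribution max(p - q, 0).
        residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        break
    return out


rng = random.Random(0)
same = [[0.5, 0.5], [0.5, 0.5]]
accepted = verify_drafts([1, 0], same, same, rng)
acceptance_rate = len(accepted) / 2  # accepted / proposed
```

When the draft and target distributions are identical every draft token is accepted, which is why a well-matched draft model pushes the acceptance rate toward 1.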

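4. Submodule analysis

4.1 engine: engine layer

Core role

The engine module is the entry point and orchestration layer of v1. It manages the full lifecycle of a request, from API intake to the returned output, and bridges the frontend (API process) and the backend (EngineCore process).

Key classes/methods

Class	File	Key methods	Description
`AsyncLLM`	async_llm.py	`generate()`,`encode()`,`abort()`	Main entry point of the async API
`InputProcessor`	input_processor.py	`process()`,`process_pooling()`	Converts the raw prompt into an EngineCoreRequest
`OutputProcessor`	output_processor.py	`process_outputs()`	Converts EngineCoreOutput into RequestOutput
`EngineCore`	core.py	`run()`,`step()`	Scheduling loop in the background process
`CoreClient`	core_client.py	`add_request()`,`get_outputs()`	Communication abstraction between the frontend and EngineCore
`InprocClient`	core_client.py	Direct method calls	Single-process mode
`AsyncMPClient`	core_client.py	ZMQ async socket	Multi-process async mode
`SyncMPClient`	core_client.py	ZMQ sync socket	Multi-process sync mode
`Detokenizer`	detokenizer.py	`decode()`	Incremental detokenization
`Coordinator`	coordinator.py	DP coordination	Coordination of data-parallel engines
`TensorIPC`	tensor_ipc.py	`send()`,`recv()`	Zero-copy transfer of GPU tensors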

Data structures

Structure	Description
`EngineCoreRequest`	msgspec.Struct containing request_id, prompt_token_ids, sampling_params, pooling_params, and more
`EngineCoreOutput`	msgspec.Struct containing request_id, new_token_ids, finish_reason, pooling_output, and more
`EngineCoreOutputs`	msgspec.Struct containing the list of outputs + scheduler_stats

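4.2 core/sched: core scheduling layer

Core role

The core/sched module is the scheduling hub of v1. At every step it decides which requests to process, how many KV cache blocks to allocate, and how to mix prefill with decode.

Key classes/methods

Class	File	Key methods	Description
`Scheduler`	scheduler.py	`schedule()`,`update_from_outputs()`	Main scheduler; decides which requests run
`SchedulerInterface`	interface.py	Abstract interface	Abstract base class for schedulers
`AsyncScheduler`	async_scheduler.py	Async scheduling	Supports the asynchronous scheduling mode
`KVCacheManager`	kv_cache_manager.py	`allocate()`,`free()`,`get_prefix_cache_blocks()`	KV cache block management
`KVCacheUtils`	kv_cache_utils.py	`generate_scheduler_kv_cache_config()`	Block hashing, scheduler configuration
`BlockPool`	block_pool.py	`get_free_block()`,`free_block()`	Free block pool management
`RequestQueue`	request_queue.py	`push()`,`pop()`	Priority request queue
`EncoderCacheManager`	encoder_cache_manager.py	Caches MM encoder outputs	Multimodal encoder cache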
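As an illustration of the priority queue with `push()`/`pop()` listed for `RequestQueue`, here is a minimal heap-based sketch. The class name `PriorityRequestQueue`, the convention that a lower number means higher priority, and the first-come-first-served tie-break are assumptions made for this example.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Min-heap keyed by (priority, arrival order).

    Assumption: a lower priority value is served first, and requests with
    equal priority are served in arrival order.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # monotonically increasing tie-breaker

    def push(self, request_id: str, priority: int = 0) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The arrival counter matters: without it, two requests with the same priority would be compared by request ID, which gives an arbitrary rather than a fair order.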

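Scheduling strategies

- Priority scheduling: requests are ordered by priority
- Chunked prefill: a long prompt is scheduled in several chunks, mixed with decode requests
- Prefix caching: existing KV cache is matched by block hash to avoid recomputation
- Preemption: when GPU memory runs short, lower-priority requests are preempted to free blocks
- All-pooling detection: if every request in the current batch is a pool request, the decode step is skipped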
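The prefix-caching strategy relies on hashing blocks so that an equal hash implies an equal prefix. A common way to get that property is to chain each block's hash with its parent's hash, sketched below. The helper names and the use of SHA-256 over `repr()` are assumptions for illustration, not vLLM's hashing scheme.

```python
import hashlib


def block_hashes(token_ids, block_size):
    """Hash every full block together with its parent's hash."""
    hashes, parent = [], b""
    # Only full blocks are hashed; a trailing partial block is skipped.
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def match_prefix(cache, token_ids, block_size):
    """Return cached block IDs for the longest run of leading blocks in `cache`.

    `cache` maps a block hash to a block ID.
    """
    hit = []
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hit.append(cache[h])
    return hit
```

Because of the chaining, a block with identical tokens but a different preceding context gets a different hash, so it is never reused by mistake.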

SchedulerOutput data structure

@dataclass
class SchedulerOutput:
    scheduled_new_reqs: list[NewRequestData]        # new requests
    scheduled_cached_reqs: list[CachedRequestData]  # cached requests (with a prefix hit)
    num_scheduled_tokens: int                       # total tokens in this step
    total_num_scheduled_tokens: list[int]           # tokens per group
    grammar_outputs: list[GrammarOutput]            # structured output
    ...

4.3 worker: worker execution layer

Core role

The worker module is the execution engine of v1. It handles model loading, the forward pass, CUDA Graph capture/replay, and batch state management.

Key classes/methods

Class	File	Key methods	Description
`WorkerBase`	worker_base.py	`init_device()`,`load_model()`,`execute_model()`	Abstract base class for workers
`GPUWorker`	gpu_worker.py	`execute_model()`,`determine_available_memory()`	GPU worker
`GPUModelRunner`	gpu_model_runner.py	`execute_model()`,`_execute_pooling()`	GPU model runner (~5,700 lines; the most central file)
`InputBatch`	gpu_input_batch.py	`add_request()`,`get_sampling_metadata()`,`get_pooling_metadata()`	Batch state management
`CPUWorker`	cpu_worker.py	-	CPU backend execution
`XPUWorker`	xpu_worker.py	-	XPU (Intel GPU) backend
`CachedRequestState`	gpu_input_batch.py	-	Per-request cached state

GPUModelRunner core flow

execute_model(scheduler_output)
  → parse new/cached requests
  → InputBatch.add_request() / update()
  → _execute_forward()
      → Model.forward(hidden_states)
      → Attention (per backend)
      → Sampler / PoolingRunner
      → LateInteractionRunner.postprocess()
  → _execute_decode() (CUDA Graph replay mode)
  → build ModelRunnerOutput

GPU subsystems

Subdirectory	Responsibility
gpu/sample/	Samplers (Gumbel, LogitBias, MinP, Penalties, Logprob)
gpu/pool/	Pooling runners (PoolingRunner, LateInteractionRunner)
gpu/mm/	Multimodal (EncoderRunner, EncoderCache, RoPE)
gpu/spec_decode/	Speculative decoding (Eagle Speculator, RejectionSampler)
gpu/model_states/	Model state management (Default, Whisper)
gpu/metrics/	Logits metrics

4.4 attention: attention computation layer

Core role

The attention module implements the attention abstraction layer of v1. Through backend registration plus runtime selection it supports multiple GPU attention implementations.

Key classes/methods

Class/function	File	Description
`AttentionBackend`	backend.py	Abstract base class that defines the interface
`AttentionMetadata`	backend.py	Per-batch attention metadata
`FlashAttnBackend`	backends/flash_attn.py	FlashAttention-2 backend
`FlashInferBackend`	backends/flashinfer.py	FlashInfer backend
`TritonAttnBackend`	backends/triton_attn.py	Custom Triton backend
`FlexAttentionBackend`	backends/flex_attention.py	PyTorch FlexAttention
`MLABackend`	backends/mla/	Multi-head Latent Attention (DeepSeek family)
`MambaAttnBackend`	backends/mamba_attn.py	Mamba/SSM backend
`AttentionSelector`	selector.py	Picks a backend automatically from hardware/model
`BackendRegistry`	backends/registry.py	Backend registry

Attention operations (ops/)

File	Description
`paged_attn.py`	Paged Attention kernel
`chunked_prefill_paged_decode.py`	Mixed prefill + decode
`prefix_prefill.py`	Prefix-cache prefill
`merge_attn_states.py`	Merging of attention states
`triton_decode_attention.py`	Triton kernel dedicated to decode
`triton_prefill_attention.py`	Triton kernel dedicated to prefill
`flashmla.py`	FlashMLA kernel
`dcp_alltoall.py`	Decode context parallel (DCP) all-to-all communication

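MLA subsystem

MLA (Multi-head Latent Attention) is the attention mechanism specific to the DeepSeek family of models. It comes in several implementations:

- flashmla.py: official FlashMLA kernel
- cutlass_mla.py: CUTLASS implementation
- flashinfer_mla.py: FlashInfer MLA
- triton_mla.py: custom Triton implementation
- aiter_triton_mla.py: AMD ROCm implementation
- indexer.py: MLA index construction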

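4.5 sample: sampling layer

Core role

The sample module samples the next token from the model logits, applying the various penalties, constraints, and sampling strategies along the way.

Key classes/methods

Class	File	Key methods	Description
`Sampler`	sampler.py	`forward()`	Main sampler with a 9-step sampling pipeline
`SamplingMetadata`	metadata.py	-	Per-batch sampling parameters
`LogitsProcessor`	logits_processor/interface.py	`apply()`	Logits processor interface
`BuiltinLogitsProc`	logits_processor/builtin.py	-	Built-in processors (temperature, top_k, min_p, etc.)
`TopKTopPSampler`	ops/topk_topp_sampler.py	`forward()`	GPU TopK/TopP sampling
`RejectionSampler`	rejection_sampler.py	`forward()`	Rejection sampling for speculative decoding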

The 9-step Sampler pipeline

1. (If requested) compute/clone logprobs
2. Logits → float32
3. Apply the allowed_token_ids allow-list
4. Apply the bad_words exclusion
5. Apply the non-argmax-invariant processors (min_tokens, logit_bias)
6. Apply penalties (repetition/frequency/presence)
7. Sample:
   a. If all_greedy → argmax
   b. Apply temperature
   c. Apply min_p
   d. Apply top_k / top_p
   e. Random sampling
8. Collect the top logprobs
9. Return SamplerOutput
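Step 7 of the pipeline can be illustrated for a single request in plain Python. This is a simplified sketch, not vLLM's batched GPU implementation: `sample_token` is a hypothetical function, treating `temperature == 0.0` as greedy is an assumption, and top_p is applied to the unrenormalized probabilities for brevity.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, min_p=0.0, top_k=0, top_p=1.0, rng=random):
    """Greedy shortcut, then temperature, min_p, top_k / top_p, and sampling."""
    if temperature == 0.0:  # assumption: zero temperature means greedy
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax([x / temperature for x in logits])
    # min_p: drop tokens whose probability is below min_p * max probability.
    cutoff = min_p * max(probs)
    order = sorted((i for i, p in enumerate(probs) if p >= cutoff),
                   key=lambda i: -probs[i])
    # top_k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        order = order[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept])[0]
```

The greedy shortcut comes first for the same reason it does in the pipeline: argmax does not depend on temperature, top_k, or top_p, so the filters can be skipped entirely.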

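4.6 spec_decode: speculative decoding layer

Core role

The spec_decode module implements speculative decoding: a lightweight draft model first generates candidate tokens, and the target model then verifies them in parallel, which speeds up inference without loss of quality.

Key classes/methods

Class	File	Description
`EagleProposer`	eagle.py	Eagle/Eagle3 speculative decoding
`MedusaProposer`	medusa.py	Medusa multi-head speculation
`NgramProposer`	ngram_proposer.py	CPU n-gram speculation
`NgramProposerGPU`	ngram_proposer_gpu.py	GPU n-gram speculation
`SuffixDecoding`	suffix_decoding.py	Suffix-array speculation
`DFlashProposer`	dflash.py	DFlash speculation
`DraftModelProposer`	draft_model.py	Based on a separate draft model
`SpecDecodeMetadata`	metadata.py	Per-batch speculative decoding metadata
`SpecDecodeMetrics`	metrics.py	Acceptance rate and related metrics
`extract_hidden_states`	extract_hidden_states.py	Extracts the hidden states used for drafting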

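Speculative decoding flow

1. The proposer generates k draft tokens
2. The draft tokens are appended to the current sequence
3. A single forward pass verifies all positions in parallel
4. RejectionSampler:
   - verifies the tokens one by one, left to right
   - accept: the draft token equals the sample from the target distribution
   - reject: resample from the target distribution
5. Output: the accepted tokens + the additionally sampled token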
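The left-to-right verification in step 4 is easiest to follow in its greedy form, where the "sample from the target distribution" is simply the target model's argmax at each position. The function below is a hypothetical sketch; in particular, appending a bonus token when every draft is accepted is an assumption about what the "additionally sampled token" covers in that case.

```python
def greedy_verify(draft_tokens, target_tokens):
    """Verify drafts against the target model's picks.

    `target_tokens` has len(draft_tokens) + 1 entries: the target model's
    choice at each draft position, plus one more after the last draft.
    Returns (output_tokens, num_accepted_drafts).
    """
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            # First mismatch: emit the target's token instead and stop.
            out.append(t)
            return out, len(out) - 1
        out.append(d)
    # Every draft matched: the extra target token comes for free.
    out.append(target_tokens[len(draft_tokens)])
    return out, len(draft_tokens)
```

Each call emits at least one token, so even a draft that is always wrong costs no more target forward passes than ordinary decoding.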

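4.7 executor: executor layer

Core role

The executor module creates and manages the Worker processes. It is the central orchestration layer for distributed inference.

Key classes/methods

Class	File	Description
`Executor`	abstract.py	Abstract base class
`UniprocExecutor`	uniproc_executor.py	Single GPU process
`MultiprocExecutor`	multiproc_executor.py	Multiple GPU processes (spawn)
`RayExecutor`	ray_executor.py	Distributed execution with Ray
`RayExecutorV2`	ray_executor_v2.py	Ray V2 API
`CUDAGraphDispatcher`	cudagraph_dispatcher.py	CUDA Graph capture/replay dispatch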

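Executor abstract interface

from abc import ABC
from typing import Any

class Executor(ABC):
    # Signatures only; bodies are elided.
    def determine_available_memory(self) -> int: ...
    def initialize_cache(self, num_gpu_blocks) -> None: ...
    def execute_model(self, scheduler_output) -> "ModelRunnerOutput": ...
    def collective_rpc(self, method, timeout, args) -> list[Any]: ...
    def check_health(self) -> None: ...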

Executor selection logic

Executor.get_class(vllm_config):
    if isinstance(backend, type(Executor)): → use it directly
    elif backend == "ray": → RayExecutor
    elif backend == "mp": → MultiprocExecutor
    elif TP == 1: → UniprocExecutor
    else: → MultiprocExecutor (default)
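The branch order above can be written as a small runnable function. This is a sketch that returns class names as strings so it has no dependencies; `choose_executor` and its parameters are hypothetical stand-ins for the configuration fields the real method reads.

```python
def choose_executor(backend, tensor_parallel_size: int) -> str:
    """Return the executor name for a configuration, in the order listed above."""
    if isinstance(backend, type):  # a user-supplied executor class
        return backend.__name__
    if backend == "ray":
        return "RayExecutor"
    if backend == "mp":
        return "MultiprocExecutor"
    if tensor_parallel_size == 1:
        return "UniprocExecutor"
    return "MultiprocExecutor"  # default for multi-GPU setups
```

The order matters: an explicit backend setting always wins, and the tensor-parallel size is consulted only when no backend was requested.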

4.8 pool: pooling module

Core role

The pool module provides metadata construction and late-interaction scoring for pooling tasks (embedding, classification).

Key classes/methods

Class/function	File	Description
`PoolingMetadata`	metadata.py	Per-batch pooling metadata (prompt_lens, token_ids, cursor)
`PoolingCursor`	metadata.py	GPU index tracker (first/last token positions)
`PoolingStates`	metadata.py	Hidden-state cache for chunked prefill
`get_late_interaction_engine_index()`	late_interaction.py	CRC32 engine routing
`compute_maxsim_score_batched()`	late_interaction.py	Batched MaxSim scoring
`build_late_interaction_query_params()`	late_interaction.py	Builds the cache_query parameters
`build_late_interaction_doc_params()`	late_interaction.py	Builds the score_doc parameters

(For a detailed analysis, see the earlier G-77-pool report.)
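For readers unfamiliar with MaxSim, the late-interaction score used by ColBERT-style models, here is an unbatched sketch: each query token embedding is matched with its best-scoring document token embedding, and those maxima are summed. The function `maxsim_score` is a hypothetical plain-Python version and says nothing about how `compute_maxsim_score_batched()` is implemented.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best dot product among document tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, doc)  # 1.0 (first query token) + 0.5 (second) = 1.5
```

Because the score depends on per-token embeddings rather than one pooled vector, the document-side embeddings have to stay available until the query arrives, which is consistent with the module routing related requests to the same engine.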

4.9 structured_output: structured output

Core role

The structured_output module ensures that model output conforms to a predefined format constraint (JSON Schema, regex, grammar), which prevents invalid output.

Key classes/methods

Class/function	File	Description
`StructuredOutputManager`	__init__.py	Manages structured output requests
`StructuredOutputRequest`	request.py	Grammar constraint of a single request
`BackendXGrammar`	backend_xgrammar.py	xgrammar backend
`BackendOutlines`	backend_outlines.py	outlines backend
`BackendGuidance`	backend_guidance.py	guidance backend
`BackendLMFormatEnforcer`	backend_lm_format_enforcer.py	lm-format-enforcer backend
`StructuredOutputGrammar`	backend_types.py	Grammar object abstraction

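How it works

1. The user request carries a json_schema / regex / grammar constraint
2. StructuredOutputManager creates a Grammar for the matching backend
3. At every decode step: Grammar → mask of allowed tokens
4. The Sampler applies the mask and samples only from the allowed tokens
5. The output is therefore guaranteed to satisfy the constraint at all times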
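Steps 3 and 4 boil down to masking logits before sampling. The sketch below shows that one idea with a set of allowed token IDs; `apply_token_mask` is a hypothetical helper, and real backends typically represent the mask as a packed bitmask over the vocabulary rather than a Python set.

```python
import math


def apply_token_mask(logits, allowed):
    """Set the logits of disallowed tokens to -inf so they cannot be chosen."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


logits = [0.1, 2.0, 0.5]
masked = apply_token_mask(logits, allowed={0, 2})
best = max(range(len(masked)), key=masked.__getitem__)  # token 1 is ruled out
```

After softmax a logit of -inf becomes probability zero, so the constraint holds for random sampling as well as for greedy decoding.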

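4.10 kv_offload: KV cache offloading

Core role

The kv_offload module implements CPU offloading of the KV cache: inactive KV blocks are moved from the GPU to the CPU, freeing GPU memory to serve more requests.

Key classes/methods

Class	File	Description
`OffloadingManager`(ABC)	abstract.py	Abstract base class for offloading managers
`CPUOffloadManager`	cpu/manager.py	CPU offloading implementation
`LRUPolicy`	cpu/policies/lru.py	LRU eviction policy
`ARCPolicy`	cpu/policies/arc.py	ARC adaptive eviction policy
`SharedOffloadRegion`	cpu/shared_offload_region.py	Shared offload region
`OffloadMediums`	mediums.py	Offload medium abstraction
`ReuseManager`	reuse_manager.py	Block reuse management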

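Offloading operations

lookup() → find the length of the chain of already-offloaded blocks
prepare_load() → prepare a load (protect the blocks from eviction)
touch() → mark blocks as recently used
complete_load() → finish the load
prepare_store() → prepare a store (may trigger eviction)
complete_store() → finish the store (the blocks become loadable)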
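A toy LRU store shows how `lookup()`, `touch()`, and eviction on store fit together. This is a deliberately simplified sketch: `LRUOffloadStore` is a hypothetical class, it collapses the prepare/complete pairs into a single `store()` call, and it ignores the protection of blocks that are mid-load.

```python
from collections import OrderedDict


class LRUOffloadStore:
    """Toy CPU-side store keyed by block hash with LRU eviction (hypothetical)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_hash -> payload, oldest first

    def lookup(self, hashes) -> int:
        """Length of the leading run of `hashes` that is already stored."""
        n = 0
        for h in hashes:
            if h not in self.blocks:
                break
            n += 1
        return n

    def touch(self, hashes) -> None:
        """Mark blocks as recently used."""
        for h in hashes:
            if h in self.blocks:
                self.blocks.move_to_end(h)

    def store(self, h, payload) -> list:
        """Insert a block and return the hashes evicted to make room."""
        evicted = []
        self.blocks[h] = payload
        self.blocks.move_to_end(h)
        while len(self.blocks) > self.capacity:
            evicted.append(self.blocks.popitem(last=False)[0])
        return evicted
```

`lookup()` stops at the first miss because a block is only useful if every block before it in the chain is also available.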

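4.11 metrics: metrics and monitoring

Core role

The metrics module collects and exports runtime metrics for performance analysis and monitoring.

Key classes/methods

Class/function	File	Description
`PrometheusMetrics`	prometheus.py	Prometheus metrics export
`StatLoggerManager`	loggers.py	Management of stat loggers
`PerfStats`	perf.py	Performance statistics
`IterationStats`	stats.py	Per-iteration statistics
`SchedulerStats`	stats.py	Scheduler statistics
`MetricsReader`	reader.py	Metrics reading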

5. Module call relationships and data flow

5.1 Core call chain

AsyncLLM.generate()
  → InputProcessor.process()
      → Renderer.encode() → token_ids
      → SamplingParams.verify()
  → CoreClient.add_request(EngineCoreRequest)
      → [ZMQ msgpack] → EngineCore

EngineCore.step()
  → Scheduler.schedule()
      → RequestQueue.pop()
      → KVCacheManager.allocate()
          → BlockPool.get_free_block()
          → Prefix cache match (hash lookup)
      → return SchedulerOutput
  → Executor.execute_model(SchedulerOutput)
      → GPUWorker.execute_model()
          → GPUModelRunner.execute_model()
              → InputBatch.add_request()
              → Model.forward(hidden_states, kv_caches)
                  → AttentionBackend.forward()
              → Sampler.forward(logits)
                  → LogitsProcessor → TopKTopP → sample
              → [optional] LateInteractionRunner.postprocess()
              → return ModelRunnerOutput
  → Scheduler.update_from_outputs()
      → free the blocks of finished requests
      → build the list of finished requests
  → [TensorIPC] → CoreClient → OutputProcessor
      → Detokenizer.decode()
      → LogprobsProcessor.format()
      → return RequestOutput → user

5.2 Inter-module data transfer matrix

Source module	Target module	Data structure	Transfer method
AsyncLLM	InputProcessor	Prompt + Params	Method call
InputProcessor	CoreClient	EngineCoreRequest	ZMQ msgpack
CoreClient	EngineCore	EngineCoreRequest	ZMQ socket
EngineCore	Scheduler	-	Direct call
Scheduler	Executor	SchedulerOutput	Method call
Executor	Worker	SchedulerOutput	Inter-process communication
Worker	GPUModelRunner	scheduler_output	Method call
GPUModelRunner	Attention	hidden_states + kv_caches	Method call
GPUModelRunner	Sampler	logits + SamplingMetadata	Method call
GPUModelRunner	Pool	hidden_states + PoolingMetadata	Method call
Sampler	GPU Sample ops	logits tensor	GPU kernel
EngineCore	CoreClient	EngineCoreOutputs + GPU tensors	ZMQ + TensorIPC
CoreClient	OutputProcessor	EngineCoreOutputs	Method call
OutputProcessor	AsyncLLM	RequestOutput	asyncio Event

5.3 GPU data flow (single step)

SchedulerOutput
│
├─ new_reqs ──── InputBatch.add_request()
│     → token_ids_cpu, seq_lens, block_ids
│     → build SamplingMetadata / PoolingMetadata
│
├─ cached_reqs ── InputBatch.update()
│     → reuse the existing block_ids, update num_computed_tokens
│
└─ num_scheduled_tokens → determines the forward batch size

Model.forward(input_ids, positions, kv_caches)
│
├─ hidden_states ──→ Attention.forward()
│     │
│     ├─ Prefill: chunked_prefill_paged_decode
│     └─ Decode: paged_attention
│
└─ logits ──→ Sampler.forward()
      │
      ├─ LogitsProcessor.apply()
      ├─ apply_all_penalties()
      ├─ TopKTopPSampler.forward()
      └─ returns SamplerOutput(token_ids, logprobs)

[For pool requests]:
hidden_states ──→ PoolingRunner.pool()
  → F.normalize(last_hidden_states)
  → LateInteractionRunner.postprocess()
  → compute_maxsim_score_batched() (for late interaction)

ModelRunnerOutput:
├── sampled_token_ids (GPU tensor)
├── logprobs (CPU numpy)
├── pooler_output (list[Tensor])
└── spec_decode_draft_ids (optional)
