Serving Qwen2.5-1.5B: Wrapping Qwen2.5-1.5B in a REST API and Generating Swagger Docs

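1. Why turn a local chat assistant into a REST API?

You already have a local Qwen2.5-1.5B chat assistant that runs smoothly: the Streamlit interface is clean, responses are fast, and your data stays private. But you will soon notice that it serves only "one person" and "one browser". When you want a mobile app to call it, an internal company system to integrate it, an automation script to batch-test it, or fellow developers to try it without installing a Python environment, Streamlit's single web interface falls short.

This is where the real value of turning it into a service shows up: the goal is not a tool that works, but a capability that any system can call.
A REST API is the common language for that capability. It does not care whether the calling code is written in Python, JavaScript, Java, or shell; it does not depend on a graphical interface or a particular device. It reduces "ask → think → answer" to a standard HTTP request: POST /v1/chat/completions with a JSON body in, structured JSON out.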
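As a minimal sketch of that request shape, the snippet below builds (but does not send) the POST using only Python's standard library. The helper name `build_chat_request` and the default `base_url` are illustrative assumptions; only the path `/v1/chat/completions` and the JSON body layout come from this article.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions (hypothetical helper)."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("Hello")
# Sending it is one more call once a server is listening:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

The same request can be produced from any language with an HTTP client, which is the point of the abstraction.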

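More importantly, an API gives you more than convenient calls. It brings three upgrades:

Composability: Qwen2.5-1.5B can be embedded in larger workflows, such as "user submits a form → call the API to generate a draft → send an email automatically → write a log entry";
Observability: you can record latency, failure rate, and frequently asked questions for every request, and see how the model really behaves in production use;
Collaboration: front-end engineers do not need to understand model loading, back-end engineers do not need to touch Streamlit, and testers can load-test with plain curl. Everyone works against the same interface contract.

This article is not about deploying yet another Streamlit app. It walks you through turning that familiar local chat assistant into a REST service that is stable, self-documenting, and ready to use. The process requires no model changes, no new inference framework, and no loss of privacy: all inference still runs 100% on your own machine.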

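2. From Streamlit to FastAPI: a lightweight service conversion in practice

2.1 Architecture evolution: keep the core, replace the entry point

We are not starting over. The original project has two parts that are really worth keeping:
The model loading and inference logic (transformers.AutoModelForCausalLM + apply_chat_template)
The generation parameters and VRAM management strategy (torch.no_grad(), device_map="auto", st.cache_resource)

Streamlit is only the outermost presentation shell. Converting to a service means swapping that shell for FastAPI, a high-performance Python framework built for APIs. It is lightweight (a single file is enough to start), mature (widely used in production), has a complete ecosystem (native support for async, OpenAPI, and middleware), and works with transformers without any glue code.

The idea is not "rewrite everything in FastAPI" but "lift the inference function already proven in Streamlit, unchanged, and mount it on a FastAPI route".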

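2.2 Core refactoring: migrate in three steps

Step 1: extract a reusable inference engine

Create inference_engine.py and decouple the model loading and generation logic from the original Streamlit app: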

```
# inference_engine.py
from threading import Thread
from typing import Dict, List, Union

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_PATH = "/root/qwen1.5b"


class QwenInferenceEngine:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            MODEL_PATH,
            device_map="auto",
            torch_dtype="auto",
            trust_remote_code=True,
        )
        self.model.eval()  # make sure the model is in inference mode

    def generate_response(
        self,
        messages: List[Dict[str, str]],
        max_new_tokens: int = 1024,
        temperature: float = 0.7,
        top_p: float = 0.9,
        stream: bool = False,
    ) -> Union[str, TextIteratorStreamer]:
        """Return the full reply as a string, or a streamer of text chunks when stream=True."""
        # Build the prompt strictly with the official chat template
        text = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        model_inputs = self.tokenizer([text], return_tensors="pt").to(self.model.device)
        generation_kwargs = dict(
            **model_inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
        )

        if stream:
            streamer = TextIteratorStreamer(
                self.tokenizer, skip_prompt=True, skip_special_tokens=True
            )
            # generate() runs in a background thread and feeds the streamer.
            # torch.no_grad() is thread-local, so wrapping thread.start() would not
            # reach the worker; generate() already disables gradients itself.
            thread = Thread(
                target=self.model.generate,
                kwargs={**generation_kwargs, "streamer": streamer},
            )
            thread.start()
            return streamer

        with torch.no_grad():
            outputs = self.model.generate(**generation_kwargs)
        # Drop the prompt tokens and decode only the newly generated part
        new_tokens = outputs[0][model_inputs.input_ids.shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Module-level singleton so the model is loaded only once
engine = QwenInferenceEngine()
```

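This code reuses the original project's key design choices as they are: the model path, automatic device mapping, gradient-free inference, and the official chat template. It only drops the Streamlit-specific cache (replaced by a module-level Python variable) and adds streaming support, which is an important capability for an API service.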
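The streaming branch relies on a producer/consumer hand-off: generation runs in a background thread while the caller iterates over chunks as they arrive. The sketch below reproduces that pattern with only the standard library, so it runs without a model. `ChunkStreamer` and `fake_generate` are hypothetical stand-ins for illustration, not transformers classes.

```python
import queue
import threading
from typing import Iterator, List


class ChunkStreamer:
    """Iterator fed by a producer thread (stand-in for a text streamer)."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self) -> None:
        self._queue: "queue.Queue[object]" = queue.Queue()

    def put(self, text: str) -> None:
        self._queue.put(text)

    def end(self) -> None:
        self._queue.put(self._END)

    def __iter__(self) -> Iterator[str]:
        return self

    def __next__(self) -> str:
        item = self._queue.get()  # blocks until the producer delivers a chunk
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(words: List[str], streamer: ChunkStreamer) -> None:
    """Pretend to be model.generate: push chunks one at a time, then signal the end."""
    for word in words:
        streamer.put(word)
    streamer.end()


streamer = ChunkStreamer()
worker = threading.Thread(target=fake_generate, args=(["Hello", ", ", "world"], streamer))
worker.start()
chunks = list(streamer)  # consumes chunks as the worker produces them
worker.join()
```

Because the consumer blocks on the queue rather than on the whole generation, the first chunk can be forwarded to the client while later ones are still being produced.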

Step 2: define an OpenAI-compatible interface

Create main.py and implement the /v1/chat/completions endpoint with FastAPI:

```
# main.py
import json
import time
import uuid
from typing import Any, Dict, List, Optional

import uvicorn
from fastapi import FastAPI, HTTPException, status
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field

from inference_engine import engine

app = FastAPI(
    title="Qwen2.5-1.5B Local API",
    description="REST service for a locally deployed Qwen2.5-1.5B-Instruct model; fully private, no data uploaded",
    version="1.0.0",
)

# Allow cross-origin calls from front ends (e.g. Vue/React projects).
# Wildcard origins cannot be combined with credentials, so credentials stay off.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=False,
    allow_methods=["*"],
    allow_headers=["*"],
)


class ChatMessage(BaseModel):
    role: str = Field(..., description="Message role; must be 'system', 'user', or 'assistant'")
    content: str = Field(..., description="Message text")


class ChatCompletionRequest(BaseModel):
    model: str = Field(default="qwen2.5-1.5b-instruct", description="Model identifier")
    messages: List[ChatMessage] = Field(..., description="Conversation history in chronological order")
    max_tokens: Optional[int] = Field(default=1024, description="Maximum number of tokens to generate")
    temperature: Optional[float] = Field(default=0.7, description="Sampling temperature")
    top_p: Optional[float] = Field(default=0.9, description="Nucleus sampling probability threshold")
    stream: Optional[bool] = Field(default=False, description="Whether to stream the response")


class ChatCompletionResponse(BaseModel):
    id: str
    object: str = "chat.completion"
    created: int
    model: str
    choices: List[Dict[str, Any]]
    usage: Dict[str, int]


# Declared with plain `def` so FastAPI runs the blocking generate() call in its
# thread pool instead of stalling the event loop.
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
def chat_completions(request: ChatCompletionRequest):
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
    created = int(time.time())
    try:
        # Convert OpenAI-style messages into the plain dicts Qwen expects
        qwen_messages = [
            {"role": msg.role, "content": msg.content} for msg in request.messages
        ]

        if request.stream:
            # Streaming: return Server-Sent Events (SSE)
            def stream_generator():
                streamer = engine.generate_response(
                    messages=qwen_messages,
                    max_new_tokens=request.max_tokens,
                    temperature=request.temperature,
                    top_p=request.top_p,
                    stream=True,
                )
                for new_text in streamer:
                    if new_text:
                        chunk = {
                            "id": completion_id,
                            "object": "chat.completion.chunk",
                            "created": created,
                            "model": request.model,
                            "choices": [
                                {"delta": {"content": new_text}, "index": 0, "finish_reason": None}
                            ],
                        }
                        yield f"data: {json.dumps(chunk)}\n\n"
                # Send the end-of-stream marker
                yield "data: [DONE]\n\n"

            return StreamingResponse(stream_generator(), media_type="text/event-stream")

        # Non-streaming: return the complete response at once
        response_text = engine.generate_response(
            messages=qwen_messages,
            max_new_tokens=request.max_tokens,
            temperature=request.temperature,
            top_p=request.top_p,
            stream=False,
        )
        # Count tokens on the same prompt text that was sent to the model
        prompt_text = engine.tokenizer.apply_chat_template(
            qwen_messages, tokenize=False, add_generation_prompt=True
        )
        prompt_tokens = len(engine.tokenizer.encode(prompt_text))
        completion_tokens = len(engine.tokenizer.encode(response_text))
        return {
            "id": completion_id,
            "object": "chat.completion",
            "created": created,
            "model": request.model,
            "choices": [{
                "message": {"role": "assistant", "content": response_text},
                "index": 0,
                "finish_reason": "stop",
            }],
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
    except Exception as e:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail=f"Inference failed: {e}",
        )


if __name__ == "__main__":
    # host takes only the address; the port is passed separately
    uvicorn.run(app, host="0.0.0.0", port=8000, reload=False)
```

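This endpoint follows the request and response shape of OpenAI's Chat Completions API, which means:

Front-end code written against OpenAI usually needs only a URL change, from https://api.openai.com/v1/chat/completions to http://localhost:8000/v1/chat/completions
Postman, curl, and JavaScript fetch can all call it directly
It can later be connected to ecosystem tools such as LangChain and LlamaIndex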
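For clients that set stream to true, the server replies with lines of the form `data: {...}` and ends with `data: [DONE]`. The following is a minimal client-side sketch of how those lines could be reassembled into text. The function name `iter_sse_content` and the sample lines are illustrative assumptions; in real use the lines would come from the HTTP response body.

```python
import json
from typing import Iterable, Iterator


def iter_sse_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text deltas from an SSE stream of chat.completion.chunk events."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank lines that separate events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]


# Hand-written sample events in the same shape the endpoint emits
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}, "index": 0}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}, "index": 0}]}',
    "",
    "data: [DONE]",
    "",
]
text = "".join(iter_sse_content(sample))
```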

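Step 3: one-command startup and health check

Add a simple health-check endpoint to main.py to make operational monitoring easier:

```
@app.get("/health")
def health_check():
    return {
        "status": "healthy",
        "model": "qwen2.5-1.5b-instruct",
        "device": str(engine.model.device),
    }
```

Starting the service is just as simple (accelerate is needed because the engine loads the model with device_map="auto"):

```
pip install fastapi uvicorn transformers torch accelerate
python main.py
```

Once the service is up, http://localhost:8000/health returning {"status":"healthy",...} means the model is ready, and http://localhost:8000/docs opens the automatically generated interactive documentation page.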

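3. Auto-generating Swagger docs: let the API speak for itself

3.1 Why Swagger is a requirement, not an option

You might think: "Can't I just describe the interface in a README?"
In practice:
❌ Hand-written docs go stale (the code changes and nobody updates the docs)
❌ Hand-written docs rarely cover every field detail (for example, the valid range of top_p or the length limit of the messages array)
❌ Hand-written docs cannot be tested directly (after reading them you still have to open Postman and fill in parameters)

The Swagger UI built into FastAPI (available at /docs) is living documentation:
It is generated entirely from the code's type annotations (the Pydantic models), so when the code changes, the docs change with it
Every field shows its description, default value, whether it is required, and its data type
You can click "Try it out" on the page, fill in parameters, send the request, and inspect the response without leaving the browser
The OpenAPI JSON specification can be exported for other tools (such as API testing platforms and SDK generators) to consume

This is what a professional service should look like: not "let me tell you how to use it" but "click a few times and it runs".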
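FastAPI serves the exported specification at /openapi.json by default. As a small sketch of what "other tools consuming it" can look like, the function below lists the operations declared in such a document. The `list_operations` helper is hypothetical, and `sample_spec` is a hand-written, heavily abbreviated stand-in rather than real output from the service.

```python
from typing import Dict, List, Tuple

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_operations(spec: Dict) -> List[Tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found under the spec's "paths" object."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        for method in item:
            if method in HTTP_METHODS:
                operations.append((method.upper(), path))
    return sorted(operations)


# Hand-written, heavily abbreviated stand-in for the exported document
sample_spec = {
    "openapi": "3.1.0",
    "info": {"title": "Qwen2.5-1.5B Local API", "version": "1.0.0"},
    "paths": {
        "/v1/chat/completions": {"post": {"summary": "Chat Completions"}},
        "/health": {"get": {"summary": "Health Check"}},
    },
}
operations = list_operations(sample_spec)
```

With the service running, the same function could be applied to the JSON fetched from http://localhost:8000/openapi.json.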

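3.2 Hands-on: three key annotations that unlock the full documentation

Looking back at the definitions in main.py, three kinds of annotation account for most of the richness in Swagger:

Field-level descriptions and constraints
```
role: str = Field(..., description="Message role; must be 'system', 'user', or 'assistant'")
```
→ Swagger lists the field with its description and marks it "Required". Note that the description alone does not enforce the allowed values; the type here is still a plain str.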

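Application-level description and version info

```
app = FastAPI(
    title="Qwen2.5-1.5B Local API",
    description="REST service for a locally deployed Qwen2.5-1.5B-Instruct model...",
    version="1.0.0",
)
```

→ The Swagger home page shows the project name, summary, and version number at a glance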

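Route-level response model declaration
```
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
```
→ Swagger shows not only the request body structure but also the JSON Schema of the successful response. Because choices is typed as List[Dict[str, Any]], nested objects such as message appear only as generic objects unless you model them with dedicated Pydantic classes.

After starting the service, open http://localhost:8000/docs and you will see:

A list of all API endpoints (/v1/chat/completions, /health)
A detail panel for each endpoint when you click it: request method, URL, request body example, response example, and error codes
A "Schema" tab showing the full field tree of ChatCompletionRequest and ChatCompletionResponse
An "Example Value" with JSON you can copy directly, including a sample entry for the messages array

This is no longer just documentation; it is an interactive teaching sandbox.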

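4. Production-readiness enhancements: making the local service truly reliable

4.1 VRAM safety valve: automatic cleanup and concurrency limits

Local GPU resources are limited, and a long-running service can run out of memory as cached GPU memory accumulates. We add two safeguards at the API layer:

Per-request VRAM cleanup: after each inference completes, explicitly call torch.cuda.empty_cache() (only when CUDA is in use):
```
# Add after generate_response returns
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
Global concurrency limit: keep a burst of requests from overwhelming the GPU. Set it with Uvicorn's start options:
```
uvicorn main:app --host 0.0.0.0 --port 8000 --timeout-keep-alive 5 --limit-concurrency 2
```
Here --limit-concurrency 2 caps the server at 2 concurrent connections or tasks. Uvicorn answers anything beyond that with HTTP 503 instead of queueing it, so clients should be prepared to retry. --timeout-keep-alive 5 closes idle keep-alive connections after 5 seconds; it is not a per-request timeout, so the length of a single generation is still bounded by max_tokens.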
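One way to enforce a limit on simultaneous generations inside the application itself, independent of server flags, is a semaphore-based gate. The sketch below assumes a single-process deployment; `InferenceGate` and `GpuBusyError` are hypothetical names introduced for illustration, and the demo uses no model at all.

```python
import threading
from contextlib import contextmanager


class GpuBusyError(RuntimeError):
    """Raised when no inference slot frees up in time."""


class InferenceGate:
    """Allow at most `max_parallel` generations at once; give up after `wait_seconds`."""

    def __init__(self, max_parallel: int = 1, wait_seconds: float = 0.0) -> None:
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._wait_seconds = wait_seconds

    @contextmanager
    def slot(self):
        acquired = self._slots.acquire(timeout=self._wait_seconds)
        if not acquired:
            raise GpuBusyError("all inference slots are busy")
        try:
            yield
        finally:
            self._slots.release()


gate = InferenceGate(max_parallel=1, wait_seconds=0.05)

with gate.slot():
    # While one request holds the slot, a second one is turned away
    try:
        with gate.slot():
            second_allowed = True
    except GpuBusyError:
        second_allowed = False

# After the first request finishes, the slot is free again
with gate.slot():
    third_allowed = True
```

In an endpoint, the call to the engine would sit inside `with gate.slot():`, and a caught `GpuBusyError` could be mapped to an HTTP 503 response so that clients know to retry.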

4.2 Observable logging: record every real conversation

Add structured logging to make troubleshooting and usage analysis easier:

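```
import logging

from fastapi import Request

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("qwen_api.log"),
        logging.StreamHandler(),
    ],
)
logger = logging.getLogger(__name__)

# The Pydantic request body has no client attribute, so add a fastapi Request
# parameter to the endpoint to read the caller's address:
#   chat_completions(request: ChatCompletionRequest, http_request: Request)
# Then log at the top of the function body:
client_host = http_request.client.host if http_request.client else "unknown"
logger.info(
    "Received request from %s with %d messages, max_tokens=%s",
    client_host, len(request.messages), request.max_tokens,
)
```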

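The log file records the client IP, message count, and parameters of every request, so when a response goes wrong you can immediately locate the context at that point in time.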
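If the log is meant to be analyzed by tools rather than read by eye, one option is to emit one JSON object per line and include the request latency. The sketch below uses only the standard library; `JsonFormatter`, the logger name `qwen_api_demo`, and the field names are illustrative assumptions, and an in-memory buffer stands in for the log file.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed through `extra=` become attributes on the record
        for key in ("client", "n_messages", "max_tokens", "latency_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, ensure_ascii=False)


buffer = io.StringIO()  # stands in for the log file in this demo
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("qwen_api_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False

start = time.perf_counter()
# ... the call to the model would go here ...
latency_ms = round((time.perf_counter() - start) * 1000, 2)
demo_logger.info(
    "chat completion served",
    extra={"client": "127.0.0.1", "n_messages": 3, "max_tokens": 1024, "latency_ms": latency_ms},
)
logged = json.loads(buffer.getvalue())
```

Each line can then be loaded with `json.loads` to compute figures such as average latency or failure rate.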

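4.3 One-step Docker packaging: eliminating environment dependencies

Finally, package the whole service as a Docker image to get "build once, run anywhere":

```
# Dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, so install it first
RUN apt-get update \
    && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 8000
# --host takes only the address; auto-reload stays off unless --reload is passed
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```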

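The matching requirements.txt (the extra index is where the +cu121 build of torch is published):

```
--extra-index-url https://download.pytorch.org/whl/cu121
fastapi==0.110.0
uvicorn==0.29.0
transformers==4.40.0
torch==2.2.0+cu121
accelerate==0.29.0
```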

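Build and run:

```
docker build -t qwen2.5-1.5b-api .
docker run --gpus all -p 8000:8000 -v /root/qwen1.5b:/root/qwen1.5b:ro qwen2.5-1.5b-api
```

Note: the model directory is mounted with -v as a read-only volume at the same path that MODEL_PATH points to, so the container can read it while accidental changes to the model files are prevented. The --gpus all flag exposes the host GPU to the container and requires the NVIDIA Container Toolkit on the host.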
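If you would rather mount the model at a different path inside the container, such as /app/model, the engine has to be told where to look. One possible approach is to read the path from an environment variable and fall back to the original default. The variable name `QWEN_MODEL_PATH` and the helper `resolve_model_path` are assumptions made for this sketch; they are not part of the original project.

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL_PATH = "/root/qwen1.5b"


def resolve_model_path(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the model directory, preferring the QWEN_MODEL_PATH override (hypothetical name)."""
    env = os.environ if env is None else env
    return env.get("QWEN_MODEL_PATH") or DEFAULT_MODEL_PATH


# Explicit mappings are passed here only to make the demo deterministic
inside_container = resolve_model_path({"QWEN_MODEL_PATH": "/app/model"})
on_host = resolve_model_path({})
```

With this in place, inference_engine.py could set `MODEL_PATH = resolve_model_path()`, and the container could be started with `-e QWEN_MODEL_PATH=/app/model` alongside the matching `-v` mount.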

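5. Summary: from toy to tool, one API away

Looking back at the whole process, we made no flashy technical breakthrough:
🔹 We did not change the model; it is still the same 1.5B Qwen2.5-1.5B-Instruct;
🔹 We did not rewrite inference; the core logic is reused from the original project;
🔹 We did not give up privacy; all computation still happens on your local GPU.

All we did was free the capability from the "interface" and hand it to a "protocol".
While the Streamlit interface is still waiting for you to click Send, the REST API can already be quietly powering an automated report generation system;
while others are still struggling with environment setup, your colleague has finished a first integration with three lines of curl;
while hand-written docs sit unread in Markdown, Swagger UI lets every newcomer run a first request within minutes.

This is the realistic path for putting a local large model to work:
first get it running locally (you have already done that),
then make it callable by anyone, from any system, in any language (now you know how to do that too).

As next steps, you can: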