GLM-4-9B-Chat-1M in Production: A Local Service Architecture for High Concurrency


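1. Background and Challenges

Imagine you need to analyze a 500-page PDF report, or make sense of a project with tens of thousands of lines of code. Conventional models with short context windows cannot take in that much text at once. GLM-4-9B-Chat-1M is built for exactly this problem.

Its standout feature is a very long context window: it can process up to 1 million tokens in a single pass, which the GLM-4 documentation describes as roughly 2 million Chinese characters. That is the length of several long novels, far more than a single book such as Harry Potter and the Philosopher's Stone. You can hand it an entire book, a whole code repository, or a complex legal contract and ask it to analyze, summarize, or answer questions.

The more capable the model, the harder the deployment. A 9-billion-parameter model loaded the conventional way in FP16 needs about 18 GB for the weights alone, which is more than most consumer GPUs offer. A production setup adds further requirements: several users must be able to use the service at once while it stays stable and responsive.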
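The memory pressure can be made concrete with a little arithmetic. The sketch below estimates weight memory only, for a round 9 billion parameters (an approximation of the "9B" in the model name), and ignores activations and the KV cache. The helper name weight_memory_gb is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights, in GB (10**9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


NUM_PARAMS = 9e9  # round figure taken from the model name; the exact count differs slightly

for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink from about 18 GB in FP16 to about 4.5 GB at 4 bits, which is why quantization is the first design decision in the next section.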

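This article walks through building a local service architecture that makes full use of GLM-4-9B-Chat-1M while handling concurrent access. The goal is not only to get the model running, but to keep it stable and fast.

2. Core Architecture Design

Before writing any code, let's settle the design. A good production deployment is more than a running model. Like a solid house, it needs a foundation, load-bearing walls, and doors and windows.

The design is built around three goals:

Goal 1: Use resources efficiently. The model is large and GPU memory is limited, so we quantize the model to shrink it until it fits on a single GPU while preserving as much of its quality as possible.

Goal 2: Serve multiple users reliably. A model that only one person can use has limited value. The service must accept requests from many users at once: when many questions arrive together, it must not crash, and response times must stay acceptable.

Goal 3: Stay easy to maintain and extend. A service deployed today may need a model update or a new feature tomorrow. The architecture should make such changes simple instead of forcing changes everywhere.

Based on these goals, we chose the following stack:

Model loading and inference: the transformers library, one of the most mature and widely supported libraries for loading and running pretrained models.
Quantization: bitsandbytes with 4-bit quantization, a common way to trade a small amount of accuracy for a large reduction in memory.
Web framework: FastAPI, which is lightweight, has good async support, and suits high-performance API services.
Concurrency and queueing: asyncio plus a custom queue, so requests from multiple users are handled fairly and in order.
Front end (optional): Streamlit, for quickly building an easy-to-use chat UI for non-technical users.

The overall data flow is: the user sends a request from the front end (or calls the API directly) → the FastAPI service accepts it → the request enters a waiting queue → a scheduler hands it to an idle model instance → the model generates a result → the result is returned to the user. The following sections implement this step by step.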
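The data flow described above can be sketched in a few lines of plain asyncio before any real model is involved. This is a minimal illustration, not the service code: fake_model stands in for the model call, and the names worker, submit, and demo are hypothetical.

```python
import asyncio


def fake_model(prompt: str) -> str:
    """Stand-in for the real, blocking model call (hypothetical)."""
    return f"echo: {prompt}"


async def worker(queue: asyncio.Queue) -> None:
    """Scheduler: take one request at a time and resolve its future with the result."""
    while True:
        prompt, future = await queue.get()
        try:
            # Run the blocking call in a thread so the event loop stays responsive.
            future.set_result(await asyncio.to_thread(fake_model, prompt))
        except Exception as exc:
            future.set_exception(exc)
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, prompt: str) -> str:
    """API layer: enqueue a request and wait for the worker to answer it."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def demo() -> list:
    queue = asyncio.Queue(maxsize=50)          # bounded waiting queue
    worker_task = asyncio.create_task(worker(queue))
    results = await asyncio.gather(*(submit(queue, f"q{i}") for i in range(3)))
    worker_task.cancel()
    return results


print(asyncio.run(demo()))
```

A single worker serializes access to the model, which matches a single-GPU setup. The service built below uses polling with request IDs instead of futures, so that an HTTP client does not have to hold a connection open while it waits.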

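3. Environment Setup and Model Preparation

Good work needs good tools, so we start with the foundation.

3.1 Create an isolated Python environment

To avoid package version conflicts, use a virtual environment. Open a terminal and run: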

# Create and activate a virtual environment named glm4-env
python -m venv glm4-env
source glm4-env/bin/activate   # Linux/macOS
# On Windows, use: glm4-env\Scripts\activate

# Upgrade pip to the latest version
pip install --upgrade pip

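3.2 Install the core dependencies

Next, install the Python packages the architecture needs. Create a requirements.txt file with the following content:

torch>=2.0.0
transformers>=4.35.0   # check the model card for the minimum version GLM-4's remote code needs
accelerate>=0.24.0
bitsandbytes>=0.41.0
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.0.0
sentencepiece>=0.1.99  # used by some tokenizers
tiktoken               # the GLM-4 tokenizer loaded via trust_remote_code imports tiktoken
requests               # used by the Streamlit front end in section 6
aiohttp                # used by the load-test script in section 5.3
streamlit>=1.28.0      # optional, for the front end

Then install everything in one step: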

pip install -r requirements.txt

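Key points:

The torch build must match your CUDA version. If you have an NVIDIA GPU, get the install command for your system from the PyTorch website.
bitsandbytes sometimes fails to install. If it fails as part of the requirements file, try installing it on its own with pip install bitsandbytes. If that also fails, you may need to build it from source.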
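After installation, a quick check can confirm that the core packages are visible in the active environment before any large download starts. This is a small sketch using only the standard library; the module list mirrors the requirements file, and missing_modules is an illustrative helper name.

```python
import importlib.util

# Import names of the core packages from requirements.txt
REQUIRED_MODULES = [
    "torch", "transformers", "accelerate", "bitsandbytes",
    "fastapi", "uvicorn", "pydantic",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("All core packages found." if not missing else f"Missing: {', '.join(missing)}")
```

The check only looks the modules up without importing them, so it is fast and does not touch the GPU. It does not verify that the torch build matches the installed CUDA version.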

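3.3 Download the GLM-4-9B-Chat-1M model

The model files are large. The published half-precision weights come to roughly 18 GB, and because 4-bit quantization is applied at load time, the full download is still required. The files come from the Hugging Face model hub. There are two ways to get them:

Option 1: automatic download through transformers (recommended). Specify the model name in code and the files are downloaded and cached on first run. The first download takes a while and needs a stable connection.

Option 2: manual download, then load from disk. If your network is unreliable, or the production server has no internet access, download the model files on a machine that is online and copy them to the production server.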
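For the manual-download route, the loading code needs to point at the copied directory instead of the hub name. One way to support both routes is a small helper that prefers a local copy when it exists. The directory ./models/glm-4-9b-chat-1m and the function name resolve_model_source are assumptions for illustration.

```python
from pathlib import Path

HUB_ID = "THUDM/glm-4-9b-chat-1m"               # model name used throughout this article
LOCAL_DIR = Path("./models/glm-4-9b-chat-1m")   # hypothetical location of the copied files


def resolve_model_source(local_dir: Path, hub_id: str) -> str:
    """Return the local directory if it holds a config.json, otherwise the hub id."""
    if (local_dir / "config.json").is_file():
        return str(local_dir)
    return hub_id


model_name = resolve_model_source(LOCAL_DIR, HUB_ID)
print(f"Loading model from: {model_name}")
```

The returned string can be passed to from_pretrained in place of the hub name, since that method also accepts a local directory path.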

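# Test script: check that the model loads correctly
# File name: test_model_load.py
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "THUDM/glm-4-9b-chat-1m"
print(f"Loading model: {model_name}...")

# Load the tokenizer first
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
print("Tokenizer loaded.")

# Key setting: load the model with 4-bit quantization to save GPU memory.
# Recent transformers releases expect the quantization options in a
# BitsAndBytesConfig instead of a bare load_in_4bit=True argument.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,        # half precision for non-quantized layers
    low_cpu_mem_usage=True,           # reduce CPU RAM use while loading
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",                # place layers on the available GPU(s) automatically
)
print("Model loaded.")
print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")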

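Run the script with python test_model_load.py. If everything works, you will see the success messages and the model will be on the GPU. Loading can take a few minutes and needs at least 8 GB of GPU memory. That figure covers the quantized weights and short prompts only: long inputs need additional memory for the KV cache, which grows with every token of context.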
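How much extra memory long inputs need can be estimated from the model's attention layout. The sketch below assumes 40 layers, 2 key/value groups, a head dimension of 128, and an FP16 cache. These values are assumptions to be verified against the num_layers, multi_query_group_num, and kv_channels fields of the model's config.json.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache stored per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed architecture values; check them against config.json before relying on them.
PER_TOKEN = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=2, head_dim=128)

for tokens in (8_000, 128_000, 1_000_000):
    print(f"{tokens:>9,} tokens -> ~{PER_TOKEN * tokens / 1e9:.1f} GB of KV cache")
```

Under these assumptions each token costs about 40 KB, so a full 1M-token context would need on the order of 40 GB for the cache alone. On a 12 GB card, plan for inputs well below the model's maximum length.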

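4. Building a High-Performance FastAPI Back End

With the model ready, we can build the load-bearing walls: a back-end service that can handle concurrent requests.

4.1 Design the API

We define two core endpoints:

/chat: handles ordinary chat and short Q&A.
/chat/long: handles analysis tasks on very long text, such as summarization and Q&A.

Both endpoints return a request ID immediately. Clients then fetch the result from /chat/result/{request_id}.

First create the project directory structure:

glm4-production/
├── app/
│   ├── __init__.py
│   ├── main.py            # FastAPI application entry point
│   ├── models.py          # data models (Pydantic)
│   ├── inference.py       # core model inference logic
│   └── queue_manager.py   # request queue management
├── requirements.txt
└── README.md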

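4.2 Implement request queue management

Good queue management is the core of handling concurrency. User requests must not hit the model directly. Instead they wait in a queue and are processed in order.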

# app/queue_manager.py
import asyncio
import time
from collections import deque
from typing import Any, Dict, Optional


class RequestQueue:
    """A simple in-memory request queue manager."""

    def __init__(self, max_queue_size: int = 50):
        self.queue = deque()
        self.max_queue_size = max_queue_size
        self.current_processing = 0
        self.max_concurrent = 1  # max requests processed at once; adjust to the number of GPUs
        self.lock = asyncio.Lock()

    async def add_request(self, request_id: str, request_data: Dict[str, Any]) -> bool:
        """Add a request to the queue. Returns False if the queue is full."""
        async with self.lock:
            if len(self.queue) >= self.max_queue_size:
                return False  # queue is full
            self.queue.append({
                'id': request_id,
                'data': request_data,
                'timestamp': time.time(),
            })
            return True

    async def get_next_request(self) -> Optional[Dict[str, Any]]:
        """Pop the next pending request, or return None if none can start now."""
        async with self.lock:
            if self.current_processing >= self.max_concurrent:
                return None  # already at max concurrency
            if not self.queue:
                return None  # queue is empty
            request = self.queue.popleft()
            self.current_processing += 1
            return request

    async def request_completed(self):
        """Mark one request as finished and free its processing slot."""
        async with self.lock:
            # Never drop below zero, even if completion is reported twice
            self.current_processing = max(0, self.current_processing - 1)

    def get_queue_status(self) -> Dict[str, int]:
        """Return a snapshot of the queue state."""
        return {
            'queue_size': len(self.queue),
            'processing': self.current_processing,
            'max_concurrent': self.max_concurrent,
            'max_queue_size': self.max_queue_size,
        }

4.3 Implement the model inference engine

This is the core component. It handles all interaction with the GLM model.

# app/inference.py
import asyncio
from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


class GLMInferenceEngine:
    """Inference engine for the GLM model."""

    def __init__(self, model_name: str = "THUDM/glm-4-9b-chat-1m"):
        self.model_name = model_name
        self.tokenizer = None
        self.model = None
        self.is_initialized = False
        self._init_future = None  # shared by all callers so the model loads only once

    async def initialize(self):
        """Load the model once, without blocking the event loop."""
        if self.is_initialized:
            return
        if self._init_future is None:
            # Load in a worker thread so the event loop stays responsive
            loop = asyncio.get_running_loop()
            self._init_future = loop.run_in_executor(None, self._load_model)
        await self._init_future
        self.is_initialized = True
        print("Model initialization complete.")

    def _load_model(self):
        """Actually load the model (runs in a background thread)."""
        print("Loading model, this may take a few minutes...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name, trust_remote_code=True
        )
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            quantization_config=BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
            ),
            device_map="auto",
        )
        self.model.eval()  # evaluation mode
        print(f"Model loaded, device: {self.model.device}")

    async def generate_response(self, messages: List[dict], max_tokens: int = 2048) -> str:
        """Generate a reply without blocking the event loop."""
        if not self.is_initialized:
            await self.initialize()
        # model.generate is a blocking call, so run it in a worker thread
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._generate_sync, messages, max_tokens)

    def _generate_sync(self, messages: List[dict], max_tokens: int) -> str:
        """Blocking generation call (runs in a background thread)."""
        # Build the prompt with the tokenizer's own chat template. GLM-4 does not
        # use the older "[Round N]" question/answer format of earlier ChatGLM models.
        inputs = self.tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt",
            return_dict=True,
        ).to(self.model.device)

        generate_kwargs = {
            "max_new_tokens": max_tokens,
            "temperature": 0.7,         # randomness; 0.7 is a balanced value
            "top_p": 0.9,               # nucleus sampling: keep the top 90% of probability mass
            "do_sample": True,
            "repetition_penalty": 1.1,  # discourage repetition
        }

        with torch.no_grad():  # no gradients needed for inference; saves memory
            outputs = self.model.generate(**inputs, **generate_kwargs)

        # Decode only the newly generated tokens
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    async def process_long_text(self, text: str, task: str = "summarize") -> str:
        """Dedicated entry point for long-text tasks."""
        # Pick an instruction based on the task type
        instructions = {
            "summarize": "Summarize the core content of the following text:\n\n",
            "qa": "Read the following text carefully, then answer my question:\n\n",
            "analyze": "Analyze the following text and point out its key information and logical structure:\n\n",
        }
        instruction = instructions.get(task, "Process the following text:\n\n")

        # Cap the input length. This counts characters, not tokens.
        full_prompt = instruction + text[:500000]

        messages = [{"role": "user", "content": full_prompt}]
        return await self.generate_response(
            messages,
            max_tokens=1024,  # summaries do not need very long replies
        )

4.4 Implement the FastAPI application

Now we put all the parts together.

# app/main.py
import asyncio
import uuid

from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware

# The data models are defined once, in app/models.py
from .models import ChatRequest, LongTextRequest, ChatResponse, QueueStatus
from .inference import GLMInferenceEngine
from .queue_manager import RequestQueue

# Create the FastAPI application
app = FastAPI(
    title="GLM-4-9B-Chat-1M API service",
    description="High-performance local LLM API with support for million-token inputs",
    version="1.0.0",
)

# CORS middleware so a browser front end can call the API
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # in production, list the actual front-end origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Global instances
inference_engine = GLMInferenceEngine()
request_queue = RequestQueue(max_queue_size=100)
request_cache = {}       # request_id -> latest status and result
running_tasks = set()    # keeps references to in-flight tasks


@app.on_event("startup")
async def startup_event():
    """Start model loading and the queue worker when the app starts."""
    print("Initializing model, please wait...")
    # Load the model in the background so startup is not blocked
    app.state.init_task = asyncio.create_task(inference_engine.initialize())
    # One long-lived worker drains the queue in FIFO order
    app.state.worker_task = asyncio.create_task(queue_worker())
    print("API service started.")


@app.get("/")
async def root():
    """Root endpoint: basic service information."""
    return {
        "service": "GLM-4-9B-Chat-1M API",
        "status": "running",
        "model": "THUDM/glm-4-9b-chat-1m",
        "context_length": "1M tokens",
        "quantization": "4-bit",
    }


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model_loaded": inference_engine.is_initialized}


@app.get("/queue/status")
async def get_queue_status():
    """Return the current queue state."""
    return QueueStatus(**request_queue.get_queue_status())


async def enqueue(queue_data: dict, seconds_per_request: float) -> ChatResponse:
    """Add a request to the queue, or reject it with 503 if the queue is full."""
    request_id = str(uuid.uuid4())
    added = await request_queue.add_request(request_id, queue_data)
    if not added:
        raise HTTPException(status_code=503, detail="Service busy, please try again later")
    # Record the request right away so result polling never sees an unknown ID
    request_cache[request_id] = {"request_id": request_id, "status": "queued", "response": None}
    queue_pos = request_queue.get_queue_status()['queue_size']  # includes this request
    return ChatResponse(
        request_id=request_id,
        status="queued",
        queue_position=queue_pos,
        estimated_wait_time=queue_pos * seconds_per_request,
    )


@app.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Queue a chat request and return its ID."""
    queue_data = {
        "type": "chat",
        "messages": request.messages,
        "max_tokens": request.max_tokens,
    }
    # Rough estimate: about 5 seconds per queued request
    return await enqueue(queue_data, seconds_per_request=5)


@app.post("/chat/long", response_model=ChatResponse)
async def chat_long(request: LongTextRequest):
    """Queue a long-text request and return its ID."""
    # Simple check on characters, not tokens. Note that inference.py
    # additionally truncates the text to 500,000 characters.
    if len(request.text) > 2000000:
        raise HTTPException(status_code=400, detail="Text too long; keep it under 2 million characters")
    queue_data = {"type": "long_text", "text": request.text, "task": request.task}
    # Long-text processing takes longer
    return await enqueue(queue_data, seconds_per_request=15)


@app.get("/chat/result/{request_id}")
async def get_chat_result(request_id: str):
    """Return the current status or result of a request."""
    if request_id not in request_cache:
        raise HTTPException(status_code=404, detail="Request ID not found or expired")
    # Finished results stay in the cache; add an expiry policy for long-running services
    return request_cache[request_id]


async def queue_worker():
    """Hand queued requests to process_request, respecting max_concurrent."""
    while True:
        request_data = await request_queue.get_next_request()
        if request_data is None:
            # Queue empty or all slots busy: wait briefly and check again
            await asyncio.sleep(0.05)
            continue
        task = asyncio.create_task(process_request(request_data))
        running_tasks.add(task)
        task.add_done_callback(running_tasks.discard)


async def process_request(request_data: dict):
    """Process one dequeued request and store its result."""
    request_id = request_data['id']
    data = request_data['data']
    try:
        request_cache[request_id] = {"request_id": request_id, "status": "processing", "response": None}
        if data['type'] == 'chat':
            response = await inference_engine.generate_response(
                messages=data['messages'],
                max_tokens=data.get('max_tokens') or 2048,
            )
        else:  # long_text
            response = await inference_engine.process_long_text(
                text=data['text'],
                task=data['task'],
            )
        request_cache[request_id] = {"request_id": request_id, "status": "completed", "response": response}
    except Exception as e:
        request_cache[request_id] = {
            "request_id": request_id,
            "status": "error",
            "response": f"Error while processing request: {e}",
        }
    finally:
        # Free the processing slot whether the request succeeded or failed
        await request_queue.request_completed()


if __name__ == "__main__":
    # Run with "python -m app.main" from the project root
    import uvicorn
    uvicorn.run(
        "app.main:app",
        host="0.0.0.0",
        port=8080,
        reload=False,  # keep False in production
        workers=1,     # one worker, because a single GPU holds a single model copy
    )
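The result endpoint above leaves finished entries in request_cache indefinitely, so memory grows with every request. One way to bound it is a small store that forgets finished results after a time-to-live. The class below is a sketch under that assumption. TTLResultStore is a hypothetical name, and the 600-second default is arbitrary.

```python
import time


class TTLResultStore:
    """Result store that forgets finished entries after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._items = {}      # request_id -> (stored_at, result)

    def set(self, request_id: str, result: dict) -> None:
        """Store or overwrite the latest state of a request."""
        self._items[request_id] = (self._clock(), result)

    def get(self, request_id: str):
        """Return the stored result, or None if it is unknown or has expired."""
        self.purge()
        entry = self._items.get(request_id)
        return entry[1] if entry else None

    def purge(self) -> None:
        """Drop completed or failed results older than the TTL; keep pending ones."""
        now = self._clock()
        expired = [
            rid for rid, (stored_at, result) in self._items.items()
            if result.get("status") in ("completed", "error")
            and now - stored_at > self.ttl_seconds
        ]
        for rid in expired:
            del self._items[rid]
```

Replacing the plain dict with such a store would mean calling set where the handlers assign to request_cache and get in the result endpoint. Queued and in-progress entries are never purged, so a slow request cannot disappear while a client is still polling for it.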

4.5 Create the data models file

# app/models.py
from typing import List, Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    messages: List[dict]
    max_tokens: Optional[int] = 2048


class LongTextRequest(BaseModel):
    text: str
    task: str = "summarize"  # summarize, qa, analyze


class ChatResponse(BaseModel):
    request_id: str
    status: str  # queued, processing, completed, error
    response: Optional[str] = None
    queue_position: Optional[int] = None
    estimated_wait_time: Optional[float] = None


class QueueStatus(BaseModel):
    queue_size: int
    processing: int
    max_concurrent: int
    max_queue_size: int
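ChatRequest accepts any list of dicts as messages, so malformed input only fails later, inside the model call. A stricter check before queueing might look like the sketch below. The allowed role set is an assumption (the model's chat template may accept other roles), and validate_messages is an illustrative helper, not part of the service code above.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set; adjust to the chat template


def validate_messages(messages: list) -> list:
    """Return a list of problems found in a chat message list (empty if valid)."""
    problems = []
    if not messages:
        problems.append("messages must not be empty")
    for index, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {index}: expected an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            problems.append(f"message {index}: content must be a non-empty string")
    if messages and isinstance(messages[-1], dict) and messages[-1].get("role") != "user":
        problems.append("the last message should come from the user")
    return problems


print(validate_messages([{"role": "user", "content": "Hello"}]))
```

An endpoint could call this before enqueueing and answer with HTTP 400 when the returned list is not empty, which keeps obviously bad requests out of the GPU queue.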

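5. Deployment and Performance Tuning

The service code is done. Now we deploy it and tune it so it runs reliably in production.

5.1 Start the service

In the project root, create a start script, run.sh (Linux/macOS) or run.bat (Windows):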

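#!/bin/bash
# run.sh

# Activate the virtual environment
source glm4-env/bin/activate

# Environment variables
export PYTHONPATH=$PYTHONPATH:$(pwd)
export CUDA_VISIBLE_DEVICES=0   # which GPU to use

# Start the service from the project root so that the relative imports
# inside the app package (from .models import ...) resolve correctly
python -m uvicorn app.main:app --host 0.0.0.0 --port 8080 --workers 1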

Make the script executable and run it:

chmod +x run.sh
./run.sh

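Once the service is up, you can test it as follows:

API docs: open http://localhost:8080/docs in a browser to see the auto-generated Swagger UI, where you can try the API directly.
Health check: visit http://localhost:8080/health
Queue status: visit http://localhost:8080/queue/status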

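5.2 Monitoring and tuning

After deployment, monitoring and tuning are ongoing work. Some key points:

Monitor GPU usage:

# Watch GPU status with nvidia-smi
watch -n 1 nvidia-smi

# Or use gpustat for a more compact view
pip install gpustat
gpustat -i 1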
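For automated monitoring, the same numbers can be read programmatically through nvidia-smi's CSV query mode. The sketch below is one possible approach. The function names are illustrative, and the parser assumes every queried field is numeric, which may not hold on GPUs that report "[N/A]" for some values.

```python
import subprocess

QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse nvidia-smi CSV output into one dict per GPU (memory in MiB, utilization in %)."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, used, total, util = [field.strip() for field in line.split(",")]
        stats.append({
            "gpu": int(index),
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi and return parsed stats; requires an NVIDIA driver on the host."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

A health endpoint or a periodic background task could call read_gpu_stats and log a warning when used memory approaches the total, which is often the first sign that an input is too long for the card.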

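Tuning suggestions:

Batch requests: merging requests from several users into one batch can raise throughput considerably. Because GLM-4-9B-Chat-1M is a chat model and prompt lengths vary widely, batching needs careful design.
Adjust generation parameters:
- Lower max_new_tokens: do not generate longer replies than you need.
- Adjust temperature: for factual Q&A, lower it to 0.3-0.5; for creative writing, raise it to 0.8-1.0.
Use a more efficient attention implementation: if your model code and GPU support FlashAttention-2 (Ampere or newer, for example A100 or H100), enabling it can speed up inference.
Add timeouts and retries: in client code, set a timeout for requests that take too long to respond, and retry.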
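The last suggestion can be sketched as a polling helper with an overall timeout and exponential backoff. The fetch argument is any callable that returns the JSON body of /chat/result/{request_id} as a dict. The helper name poll_result and its default delays are assumptions for illustration.

```python
import time


def poll_result(fetch, request_id: str, timeout_s: float = 60.0,
                initial_delay_s: float = 0.5, max_delay_s: float = 5.0,
                sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll fetch(request_id) until it reports completed/error or timeout_s elapses."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        result = fetch(request_id)
        if result.get("status") in ("completed", "error"):
            return result
        if clock() + delay > deadline:
            raise TimeoutError(f"request {request_id} not finished after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

With the requests library, fetch could be lambda rid: requests.get(f"{API_BASE}/chat/result/{rid}", timeout=5).json(), where API_BASE is the service URL. Backing off keeps a long queue from being hammered by once-per-second polls from every waiting client.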

5.3 Load testing

Before opening the service to users, run a load test. Tools such as locust or wrk can simulate many concurrent users.

A simple test script:

# test_stress.py
import asyncio

import aiohttp


async def send_request(session, url, data):
    """POST one chat request and return the decoded JSON body."""
    async with session.post(url, json=data) as response:
        return await response.json()


async def main():
    url = "http://localhost:8080/chat"

    # Simulate 10 concurrent users
    tasks = []
    async with aiohttp.ClientSession() as session:
        for i in range(10):
            data = {
                "messages": [{"role": "user", "content": f"Test message {i}"}],
                "max_tokens": 100,
            }
            tasks.append(asyncio.create_task(send_request(session, url, data)))
        results = await asyncio.gather(*tasks)

    # Summarize: a response with a request_id was accepted into the queue
    completed = sum(1 for result in results if "request_id" in result)
    errors = len(results) - completed

    print(f"Total requests: {len(tasks)}")
    print(f"Queued successfully: {completed}")
    print(f"Failed: {errors}")


if __name__ == "__main__":
    asyncio.run(main())

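Run the test and watch how the service behaves. Note that the script only measures how many requests were accepted into the queue, not how long generation takes. If the queue backs up badly or the error rate is high, adjust max_queue_size or speed up model inference.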
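When choosing max_queue_size, it helps to translate queue length into waiting time. The arithmetic below uses the 5-seconds-per-request estimate from the /chat handler. Both helper names are illustrative, and real service times vary with prompt and reply length.

```python
import math


def worst_case_wait_s(queue_position: int, avg_service_s: float, max_concurrent: int = 1) -> float:
    """Estimated seconds until the request at queue_position finishes (1 = next in line)."""
    rounds = math.ceil(queue_position / max_concurrent)
    return rounds * avg_service_s


def max_queue_size_for(target_wait_s: float, avg_service_s: float, max_concurrent: int = 1) -> int:
    """Largest queue size whose last request still finishes within target_wait_s."""
    return int(target_wait_s // avg_service_s) * max_concurrent


# Queue size of 100 (as configured in main.py) at the assumed 5 s per request
print(worst_case_wait_s(100, 5.0))
# Queue size that keeps the worst-case wait within one minute
print(max_queue_size_for(60.0, 5.0))
```

With one request processed at a time, a full queue of 100 means the last request waits about 500 seconds, far longer than the 30 to 60 seconds a polling client typically waits. Capping the queue near 12 and rejecting the rest with 503 gives users a faster, clearer signal.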

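6. Building a Streamlit Front End (Optional)

If non-technical users should be able to use the service, a Streamlit front end is quick to build. It works well for internal tools and demos.

Create a new file, streamlit_app.py: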

# streamlit_app.py
import time

import requests
import streamlit as st

st.set_page_config(
    page_title="GLM-4-9B-Chat-1M Long-Text Assistant",
    layout="wide",
)

# API base URL
API_BASE = "http://localhost:8080"

st.title("GLM-4-9B-Chat-1M Long-Text Assistant")
st.markdown("""
This is a locally deployed LLM with a million-token context window for very long text.
All data is processed locally and **stays on your own infrastructure**.
""")

# Sidebar
with st.sidebar:
    st.header("Service status")
    try:
        # Health check
        health_resp = requests.get(f"{API_BASE}/health", timeout=5)
        if health_resp.status_code == 200:
            st.success("Service is running")
        else:
            st.error("Service error")
    except requests.RequestException:
        st.error("Cannot connect to the service")

    # Queue status
    try:
        queue_resp = requests.get(f"{API_BASE}/queue/status", timeout=5)
        if queue_resp.status_code == 200:
            queue_data = queue_resp.json()
            st.subheader("Queue status")
            st.metric("Waiting", queue_data['queue_size'])
            st.metric("Processing", queue_data['processing'])
            st.metric("Max concurrent", queue_data['max_concurrent'])
    except requests.RequestException:
        st.warning("Cannot fetch queue status")

    st.divider()
    st.markdown("### Tips")
    st.info("""
    1. For long text, use the "Long-text analysis" tab
    2. For ordinary chat, use the "Chat" tab
    3. Response time depends on queue length and text length
    """)

# Tabs
tab1, tab2 = st.tabs(["Long-text analysis", "Chat"])

with tab1:
    st.header("Long-text analysis")

    task_type = st.selectbox(
        "Choose a task",
        ["summarize", "qa", "analyze"],
        format_func=lambda x: {
            "summarize": "Summarize the core content",
            "qa": "Q&A (enter your question below the text)",
            "analyze": "In-depth analysis",
        }[x],
    )

    long_text = st.text_area(
        "Input text",
        height=300,
        placeholder="Paste your long text here...",
    )

    question = st.text_input("Your question") if task_type == "qa" else ""

    if st.button("Start analysis", type="primary", use_container_width=True):
        if not long_text.strip():
            st.warning("Please enter some text")
        else:
            # For Q&A, append the question after the text
            text_to_send = f"{long_text}\n\nQuestion: {question}" if task_type == "qa" else long_text
            with st.spinner("Submitting request..."):
                try:
                    response = requests.post(
                        f"{API_BASE}/chat/long",
                        json={"text": text_to_send, "task": task_type},
                        timeout=10,
                    )
                    if response.status_code == 200:
                        data = response.json()
                        request_id = data['request_id']

                        # Poll for the result
                        result_placeholder = st.empty()
                        result_placeholder.info(
                            f"Request queued, position: {data['queue_position']}, "
                            f"estimated wait: {data['estimated_wait_time']} s"
                        )

                        max_retries = 60  # wait at most about 60 seconds
                        for i in range(max_retries):
                            time.sleep(1)
                            result_resp = requests.get(
                                f"{API_BASE}/chat/result/{request_id}",
                                timeout=5,
                            )
                            if result_resp.status_code == 200:
                                result_data = result_resp.json()
                                if result_data['status'] == 'completed':
                                    result_placeholder.success("Analysis complete!")
                                    st.markdown("### Result")
                                    st.write(result_data['response'])
                                    break
                                elif result_data['status'] == 'error':
                                    result_placeholder.error(f"Processing failed: {result_data['response']}")
                                    break
                                else:
                                    # Still queued or processing
                                    result_placeholder.info(f"Processing... ({i+1}/{max_retries} s)")
                            else:
                                result_placeholder.warning("Failed to fetch the result")
                                break
                        else:
                            result_placeholder.warning("Request timed out")
                    else:
                        st.error(f"Submit failed: {response.text}")
                except Exception as e:
                    st.error(f"Request error: {e}")

with tab2:
    st.header("Chat")

    # Initialize the conversation history
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Show previous messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # User input
    if prompt := st.chat_input("Type your question..."):
        # Add the user message
        st.session_state.messages.append({"role": "user", "content": prompt})
        with st.chat_message("user"):
            st.markdown(prompt)

        # Call the API
        with st.chat_message("assistant"):
            message_placeholder = st.empty()
            message_placeholder.markdown("Thinking...")

            try:
                response = requests.post(
                    f"{API_BASE}/chat",
                    json={
                        "messages": st.session_state.messages,
                        "max_tokens": 1024,
                    },
                    timeout=30,
                )
                if response.status_code == 200:
                    data = response.json()
                    request_id = data['request_id']

                    # Poll for the result
                    max_retries = 30
                    for i in range(max_retries):
                        time.sleep(1)
                        result_resp = requests.get(
                            f"{API_BASE}/chat/result/{request_id}",
                            timeout=5,
                        )
                        if result_resp.status_code == 200:
                            result_data = result_resp.json()
                            if result_data['status'] == 'completed':
                                assistant_response = result_data['response']
                                message_placeholder.markdown(assistant_response)
                                st.session_state.messages.append({
                                    "role": "assistant",
                                    "content": assistant_response,
                                })
                                break
                            elif result_data['status'] == 'error':
                                message_placeholder.error(f"Error: {result_data['response']}")
                                break
                        else:
                            message_placeholder.warning("Failed to fetch the result")
                            break
                    else:
                        message_placeholder.warning("Response timed out")
                else:
                    message_placeholder.error(f"Request failed: {response.text}")
            except Exception as e:
                message_placeholder.error(f"Connection error: {e}")

    # Clear-history button
    if st.session_state.messages:
        if st.button("Clear conversation history", use_container_width=True):
            st.session_state.messages = []
            st.rerun()

# Footer
st.divider()
st.caption("""
**Technical note**: this service is based on the GLM-4-9B-Chat-1M model with 4-bit quantization, running on a local GPU.
All processing happens locally; no data is sent to external services.
""")

Run the Streamlit app:

streamlit run streamlit_app.py

You can now open http://localhost:8501 in a browser to use the chat interface.

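7. Summary

Following the steps in this article, we built a local service architecture for GLM-4-9B-Chat-1M that handles concurrent access. To recap the main results:

Strengths of the architecture:

Resource-efficient: 4-bit quantization lets a 9-billion-parameter model run on a consumer GPU such as an RTX 4070 with 12 GB, at least for short and medium-length inputs. Very long inputs need additional memory for the KV cache.
Concurrency support: a request queue accepts requests from many users at once and processes them in order, so a burst of traffic queues up instead of overloading the GPU.
Production-oriented: the service exposes a complete API, a health check, and queue status monitoring, which makes it straightforward to integrate with existing systems.
Extensible: the structure is clear, so adding features or swapping the model component is easy.

Practical use cases:

Internal knowledge-base Q&A: upload company documents and manuals so employees can ask questions in natural language.
Long-document analysis: lawyers reviewing contracts, researchers reading long papers.
Code review assistant: upload a whole project and let the model help find potential problems.
Personalized writing assistant: support writing based on your style and a large body of past documents.

Directions for further optimization:

Prefix caching: for frequently used prompt templates, cache the computation for the shared prefix.
Multi-GPU support: with several GPUs, run one model replica per GPU to raise throughput, or split the model across GPUs to fit longer contexts.
Persistent queue: replace the in-memory queue with a persistent store such as Redis so requests survive a service restart.
Finer-grained monitoring: integrate Prometheus and Grafana to track request latency, GPU utilization, and other key metrics.

This architecture gives you a solid starting point that you can keep tuning for your own workload. Most importantly, all data is processed locally, which suits enterprises with strict data security and privacy requirements.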

