CLIP Image-Text Retrieval System: Building a Cross-Modal Semantic Search Engine

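1. Introduction

CLIP (Contrastive Language-Image Pre-training) is a cross-modal model introduced by OpenAI in 2021. It maps images and text into a shared semantic space, which makes it possible to search images with text and to search text with images.

Use cases:

E-commerce product image search ("red dress" → returns all images of red dresses)
Asset management (enter a description → find matching design assets)
Content moderation (image → retrieve similar violating content)
Medical image retrieval (symptom description → find images of similar cases)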

2. How CLIP Works

2.1 Dual-Tower Architecture

Image → Image Encoder (ViT/ResNet) → image embedding ─┐
                                                       ├→ contrastive learning → cosine similarity
Text  → Text Encoder (Transformer) → text embedding  ─┘

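2.2 Contrastive Learning Objective

Given a batch of N (image, text) pairs:
- Positive pairs: the matching (image_i, text_i)
- Negative pairs: the non-matching (image_i, text_j), j ≠ i

Loss function: symmetric cross-entropy
L = 0.5 * (L_image_to_text + L_text_to_image)

Objective: maximize the cosine similarity of positive pairs and minimize that of negative pairs.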
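The objective above can be sketched in a few lines of NumPy. This is an illustration, not CLIP's training code: the function names are made up for this example, and the temperature is held fixed at 0.07 as an assumption, whereas CLIP learns it during training.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])

    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text -> image: each column should put its probability mass on the diagonal.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
text_emb = image_emb + 0.01 * rng.normal(size=(8, 16))  # nearly aligned pairs

aligned = symmetric_contrastive_loss(image_emb, text_emb)
# Shifting the text rows by one breaks every pairing, so the loss should rise.
mismatched = symmetric_contrastive_loss(image_emb, np.roll(text_emb, 1, axis=0))
print(f"aligned={aligned:.4f}  mismatched={mismatched:.4f}")
```

The loss for the aligned batch comes out far smaller than for the shifted one, which is the behavior the objective is meant to reward.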

3. Environment Setup

pip install torch torchvision
pip install transformers
pip install faiss-gpu  # GPU build of FAISS (use faiss-cpu on machines without a GPU)
pip install pillow numpy

4. Image Feature Extraction

import os

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor


class CLIPFeatureExtractor:
    """CLIP feature extractor."""

    def __init__(self, model_name="openai/clip-vit-large-patch14"):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = CLIPModel.from_pretrained(model_name).to(self.device)
        self.processor = CLIPProcessor.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode_image(self, image_path: str) -> np.ndarray:
        """Extract features for a single image."""
        image = Image.open(image_path).convert("RGB")
        inputs = self.processor(images=image, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        features = self.model.get_image_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)  # L2 normalization
        return features.cpu().numpy().flatten()

    @torch.no_grad()
    def encode_text(self, text: str) -> np.ndarray:
        """Extract features for a text query."""
        # truncation=True keeps long queries within CLIP's 77-token context
        inputs = self.processor(
            text=[text], return_tensors="pt", padding=True, truncation=True
        )
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        features = self.model.get_text_features(**inputs)
        features = features / features.norm(dim=-1, keepdim=True)
        return features.cpu().numpy().flatten()

    @torch.no_grad()  # required: .numpy() fails on tensors that track gradients
    def encode_batch_images(self, image_paths: list, batch_size=32) -> np.ndarray:
        """Extract image features in batches."""
        all_features = []
        for i in range(0, len(image_paths), batch_size):
            batch_paths = image_paths[i:i + batch_size]
            images = [Image.open(p).convert("RGB") for p in batch_paths]
            inputs = self.processor(images=images, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            features = self.model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features.cpu().numpy())
        return np.vstack(all_features)
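The extractor L2-normalizes every feature vector before returning it. The sketch below shows why that matters: once two vectors have unit length, their plain dot product equals their cosine similarity. Random vectors stand in for real CLIP embeddings, and the helper names are illustrative only.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Reference cosine similarity computed from the raw vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
image_vec = rng.normal(size=768)  # stand-in for an image embedding
text_vec = rng.normal(size=768)   # stand-in for a text embedding

unit_image = l2_normalize(image_vec)
unit_text = l2_normalize(text_vec)

# After normalization the dot product and the cosine similarity agree.
dot = float(unit_image @ unit_text)
cos = cosine_similarity(image_vec, text_vec)
print(np.isclose(dot, cos))
```

This is the property that lets an inner-product index rank results by cosine similarity without any extra work at query time.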

5. Vector Indexing and Retrieval

5.1 Building the Index with FAISS

import os

import faiss
import numpy as np


class CLIPSearchEngine:
    """Image-text retrieval engine built on CLIP + FAISS."""

    def __init__(self, model_name="openai/clip-vit-large-patch14"):
        self.extractor = CLIPFeatureExtractor(model_name)
        self.index = None
        self.image_paths = []
        self.dimension = 768  # ViT-L/14 output dimension

    def build_index(self, image_dir: str):
        """Build the image index."""
        # Collect all image files (sorted so the order is reproducible)
        extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.webp'}
        self.image_paths = [
            os.path.join(image_dir, f)
            for f in sorted(os.listdir(image_dir))
            if os.path.splitext(f)[1].lower() in extensions
        ]
        if not self.image_paths:
            raise ValueError(f"No images found in {image_dir}")
        print(f"Indexing {len(self.image_paths)} images...")

        # Extract features
        features = self.extractor.encode_batch_images(self.image_paths)
        features = features.astype('float32')
        self.dimension = features.shape[1]  # follow the model actually loaded

        # Build the FAISS index
        if len(self.image_paths) < 10000:
            # Small scale: exact search.
            # Inner product = cosine similarity (features are already normalized).
            self.index = faiss.IndexFlatIP(self.dimension)
        else:
            # Large scale: approximate IVF search
            nlist = min(int(len(self.image_paths) ** 0.5), 1000)
            quantizer = faiss.IndexFlatIP(self.dimension)
            # IndexIVFFlat defaults to L2 distance, so request inner product explicitly
            self.index = faiss.IndexIVFFlat(
                quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT
            )
            self.index.train(features)
            self.index.nprobe = 20  # number of clusters probed per query

        self.index.add(features)
        print(f"Index built with {self.index.ntotal} vectors")

    def save_index(self, path: str):
        """Save the index to disk."""
        faiss.write_index(self.index, f"{path}.index")
        np.save(f"{path}.paths.npy", np.array(self.image_paths))

    def load_index(self, path: str):
        """Load the index from disk."""
        self.index = faiss.read_index(f"{path}.index")
        self.image_paths = np.load(f"{path}.paths.npy").tolist()

    def _search(self, features: np.ndarray, top_k: int) -> list:
        """Run a FAISS query and map the hits back to image paths."""
        query = features.astype('float32').reshape(1, -1)
        scores, indices = self.index.search(query, top_k)
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than top_k hits exist
                continue
            results.append({
                "path": self.image_paths[idx],
                "score": float(score),
            })
        return results

    def search_by_text(self, query: str, top_k=10) -> list:
        """Search images with a text query."""
        return self._search(self.extractor.encode_text(query), top_k)

    def search_by_image(self, image_path: str, top_k=10) -> list:
        """Search images with a query image."""
        return self._search(self.extractor.encode_image(image_path), top_k)
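For intuition about what the exact (flat) inner-product index computes, here is a brute-force NumPy equivalent. It is a reference sketch and not how FAISS is implemented; `top_k_inner_product` is a name invented for this example, and unlike FAISS it returns fewer than k columns instead of padding with -1 when the database is small.

```python
import numpy as np


def top_k_inner_product(queries: np.ndarray, database: np.ndarray, k: int):
    """Exact top-k by inner product, highest score first.

    queries: (Q, D) array, database: (N, D) array.
    Returns (scores, indices), each of shape (Q, min(k, N)).
    """
    k = min(k, database.shape[0])
    scores = queries @ database.T  # (Q, N) similarity matrix
    # argpartition finds the k largest entries per row without a full sort.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    part_scores = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-part_scores, axis=1)  # sort only the k candidates
    indices = np.take_along_axis(part, order, axis=1)
    return np.take_along_axis(scores, indices, axis=1), indices


rng = np.random.default_rng(0)
database = rng.normal(size=(100, 8)).astype("float32")
database /= np.linalg.norm(database, axis=1, keepdims=True)  # unit vectors

# Querying with row 3 itself should return row 3 first, with a score near 1.
scores, indices = top_k_inner_product(database[3:4], database, k=5)
print(indices[0], scores[0])
```

The cost grows linearly with the number of indexed vectors, which is why the engine switches to an IVF index once the collection gets large.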

5.2 Usage Example

# Build the index
engine = CLIPSearchEngine()
engine.build_index("/data/product_images")
engine.save_index("product_index")

# Text-to-image search
results = engine.search_by_text("a red dress, fashionable style", top_k=5)
for r in results:
    print(f"{r['score']:.3f} | {r['path']}")

# Image-to-image search
results = engine.search_by_image("query.jpg", top_k=5)
for r in results:
    print(f"{r['score']:.3f} | {r['path']}")

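6. Web API Service

import os
import shutil
import tempfile

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse

# Assumes CLIPSearchEngine from section 5 is defined or importable in this module.
app = FastAPI()
engine = CLIPSearchEngine()
engine.load_index("product_index")


# Plain `def` endpoints: FastAPI runs them in a thread pool, so the blocking
# model inference does not stall the event loop.
@app.get("/search/text")
def search_text(query: str, top_k: int = 10):
    results = engine.search_by_text(query, top_k)
    return JSONResponse(content={"results": results})


@app.post("/search/image")
def search_image(file: UploadFile = File(...), top_k: int = 10):
    # File uploads require the python-multipart package.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
        shutil.copyfileobj(file.file, tmp)
        tmp_path = tmp.name
    try:
        results = engine.search_by_image(tmp_path, top_k)
    finally:
        os.unlink(tmp_path)  # remove the temp file even if the search fails
    return JSONResponse(content={"results": results})

# Install the web dependencies (not covered in section 3)
pip install fastapi uvicorn python-multipart

# Start the service (assumes the code above is saved as app.py)
uvicorn app:app --host 0.0.0.0 --port 8000

# Test (the query means "blue sneakers")
curl "http://localhost:8000/search/text?query=蓝色运动鞋&top_k=5"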
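The curl test above passes a non-ASCII query straight into the URL, which not every client or proxy handles the same way. A minimal sketch of building the same request URL from Python with the query percent-encoded follows; `build_text_search_url` is a hypothetical helper, and the base URL assumes the service is running locally on port 8000 as in the command above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_text_search_url(base_url: str, query: str, top_k: int = 10) -> str:
    """Return the /search/text URL with the query string percent-encoded."""
    params = urlencode({"query": query, "top_k": top_k})
    return f"{base_url.rstrip('/')}/search/text?{params}"


url = build_text_search_url("http://localhost:8000", "蓝色运动鞋", top_k=5)
print(url)

# Decoding the query string recovers the original text unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0], decoded["top_k"][0])
```

Once the server is up, the resulting URL can be passed to `urllib.request.urlopen` or any other HTTP client.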

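7. Performance Optimization

Optimization	When to use	Effect
FAISS IVF	>10K images	Roughly 10x faster search
FAISS PQ	>1M images	8-16x less memory
GPU index	Real-time search	Millisecond-level responses
Feature caching	Repeated queries	Avoids re-encoding
Batch encoding	Index building	Roughly 5x higher throughput

These figures are rough orders of magnitude; actual gains depend on dataset size, hardware, and index parameters.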
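As one concrete illustration of the feature-caching row, the sketch below memoizes text embeddings with `functools.lru_cache`. The encoder here is a toy stand-in rather than CLIP, all function names are hypothetical, and the query normalization assumes that case and extra whitespace do not matter for your queries.

```python
from functools import lru_cache

encode_calls = 0  # counts how often the (expensive) encoder actually runs


def encode_text_slow(text: str) -> tuple:
    """Hypothetical stand-in for a CLIP text encoder call."""
    global encode_calls
    encode_calls += 1
    # Toy "embedding"; a real system would return the CLIP text features here.
    return tuple(float(ord(ch) % 7) for ch in text)


@lru_cache(maxsize=1024)
def encode_text_cached(text: str) -> tuple:
    """Memoize embeddings by query string; tuples keep cached values immutable."""
    return encode_text_slow(text)


def normalize_query(text: str) -> str:
    """Collapse case and whitespace so trivially different queries share an entry."""
    return " ".join(text.lower().split())


for q in ["red dress", "Red  Dress", "blue sneakers", "red dress"]:
    encode_text_cached(normalize_query(q))

# Four lookups, but only two distinct queries reach the encoder.
print(encode_calls)
print(encode_text_cached.cache_info())
```

In the search engine, the same idea would wrap the text-encoding call, with the cached tuple converted back to a float32 array before it is passed to the index.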