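CLIP Image-Text Retrieval System: Building a Cross-Modal Semantic Search Engine
1. Introduction
CLIP (Contrastive Language-Image Pre-training) is a cross-modal model that OpenAI introduced in 2021. It maps images and text into a shared semantic space, which makes it possible to search images with text and to search text with images.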
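Use cases:
- E-commerce product image search ("red dress" → all images of red dresses)
- Asset management (enter a description → find matching design assets)
- Content moderation (image → retrieve similar violating content)
- Medical image retrieval (symptom description → find images of similar cases)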
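2. How CLIP Works
2.1 Dual-Tower Architecture

    Image → Image Encoder (ViT/ResNet) → image embedding ─┐
                                                          ├→ contrastive learning → cosine similarity
    Text  → Text Encoder (Transformer) → text embedding  ─┘

2.2 Contrastive Learning Objective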
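As a concrete companion to the objective defined in this section, here is a minimal, dependency-free sketch of a symmetric cross-entropy loss over a similarity matrix. The similarity values are made up for illustration, and the temperature of 0.07 is an assumed constant (CLIP learns this scale during training); this is a toy calculation, not the training code.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive loss for an N x N cosine-similarity matrix.

    sim[i][j] is the similarity between image i and text j, so the
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # image -> text: each row should pick out its own column
    loss_i2t = sum(softmax_cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: each column should pick out its own row
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = sum(softmax_cross_entropy(columns[j], j) for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


# Hypothetical similarity matrices for a batch of 3 pairs.
aligned = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.2], [0.0, 0.2, 0.9]]

print(f"aligned:  {clip_loss(aligned):.4f}")
print(f"shuffled: {clip_loss(shuffled):.4f}")
```

The batch whose matching pairs dominate the diagonal gets the lower loss, which is the behavior the objective is meant to reward.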
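Given a batch of N (image, text) pairs:
- Positive pairs: the matching (image_i, text_i)
- Negative pairs: the non-matching (image_i, text_j), j ≠ i

Loss function: symmetric cross-entropy loss

    L = 0.5 * (L_image_to_text + L_text_to_image)

Objective: maximize the cosine similarity of positive pairs and minimize that of negative pairs.
3. Environment Setup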

    pip install torch torchvision
    pip install transformers
    pip install faiss-gpu  # GPU build of FAISS; use faiss-cpu on machines without CUDA
    pip install pillow numpy

4. Image Feature Extraction

    import os

    import numpy as np
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor


    class CLIPFeatureExtractor:
        """CLIP feature extractor."""

        def __init__(self, model_name="openai/clip-vit-large-patch14"):
            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            self.model = CLIPModel.from_pretrained(model_name).to(self.device)
            self.processor = CLIPProcessor.from_pretrained(model_name)
            self.model.eval()

        @torch.no_grad()
        def encode_image(self, image_path: str) -> np.ndarray:
            """Extract the feature vector of a single image."""
            image = Image.open(image_path).convert("RGB")
            inputs = self.processor(images=image, return_tensors="pt")
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            features = self.model.get_image_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)  # L2 normalization
            return features.cpu().numpy().flatten()

        @torch.no_grad()
        def encode_text(self, text: str) -> np.ndarray:
            """Extract the feature vector of a text query."""
            # truncation=True keeps long queries within CLIP's 77-token context
            inputs = self.processor(
                text=[text], return_tensors="pt", padding=True, truncation=True
            )
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            features = self.model.get_text_features(**inputs)
            features = features / features.norm(dim=-1, keepdim=True)
            return features.cpu().numpy().flatten()

        @torch.no_grad()  # needed here too: .numpy() fails on tensors that track gradients
        def encode_batch_images(self, image_paths: list, batch_size=32) -> np.ndarray:
            """Extract image features in batches."""
            all_features = []
            for i in range(0, len(image_paths), batch_size):
                batch_paths = image_paths[i:i + batch_size]
                images = [Image.open(p).convert("RGB") for p in batch_paths]
                inputs = self.processor(images=images, return_tensors="pt")
                inputs = {k: v.to(self.device) for k, v in inputs.items()}
                features = self.model.get_image_features(**inputs)
                features = features / features.norm(dim=-1, keepdim=True)
                all_features.append(features.cpu().numpy())
            return np.vstack(all_features)

5. Vector Indexing and Retrieval
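Because the extractor L2-normalizes every feature, a plain inner product between two features equals their cosine similarity, which is why an inner-product index can rank by cosine. The sketch below checks that on a handful of made-up 3-dimensional vectors and runs an exhaustive top-k search in plain Python. The vectors and helper names are hypothetical; an exact FAISS index such as `IndexFlatIP` performs the same kind of exhaustive scoring at scale.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def top_k(query, vectors, k):
    """Exhaustive inner-product search: score every vector, keep the k best."""
    scored = [(dot(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings"; real CLIP features are much larger.
raw_gallery = [[3.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 2.0, 0.0], [-1.0, 0.0, 1.0]]
raw_query = [2.0, 0.5, 0.0]

gallery = [l2_normalize(v) for v in raw_gallery]
query = l2_normalize(raw_query)

for score, idx in top_k(query, gallery, k=2):
    reference = cosine(raw_query, raw_gallery[idx])
    print(f"index={idx} inner_product={score:.4f} cosine={reference:.4f}")
```

The two printed columns agree for every hit, so normalizing once at indexing time is enough to get cosine ranking from inner-product search.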
5.1 Building the Index with FAISS

    import faiss


    class CLIPSearchEngine:
        """Image-text retrieval engine built on CLIP + FAISS."""

        def __init__(self, model_name="openai/clip-vit-large-patch14"):
            self.extractor = CLIPFeatureExtractor(model_name)
            self.index = None
            self.image_paths = []
            self.dimension = 768  # ViT-L/14 output dimension

        def build_index(self, image_dir: str):
            """Build the image index."""
            # Collect all image files
            extensions = {'.jpg', '.jpeg', '.png', '.bmp', '.webp'}
            self.image_paths = [
                os.path.join(image_dir, f)
                for f in os.listdir(image_dir)
                if os.path.splitext(f)[1].lower() in extensions
            ]
            print(f"Indexing {len(self.image_paths)} images...")

            # Extract features
            features = self.extractor.encode_batch_images(self.image_paths)
            features = features.astype('float32')

            # Build the FAISS index
            if len(self.image_paths) < 10000:
                # Small scale: exact search.
                # Inner product equals cosine similarity because vectors are normalized.
                self.index = faiss.IndexFlatIP(self.dimension)
            else:
                # Large scale: approximate IVF search
                nlist = min(int(len(self.image_paths) ** 0.5), 1000)
                quantizer = faiss.IndexFlatIP(self.dimension)
                # IndexIVFFlat defaults to L2 distance; request inner product explicitly
                self.index = faiss.IndexIVFFlat(
                    quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT
                )
                self.index.train(features)
                self.index.nprobe = 20  # number of clusters probed per query

            self.index.add(features)
            print(f"Index built with {self.index.ntotal} vectors")

        def save_index(self, path: str):
            """Save the index to disk."""
            faiss.write_index(self.index, f"{path}.index")
            np.save(f"{path}.paths.npy", np.array(self.image_paths))

        def load_index(self, path: str):
            """Load the index from disk."""
            self.index = faiss.read_index(f"{path}.index")
            self.image_paths = np.load(f"{path}.paths.npy").tolist()

        def _search(self, features: np.ndarray, top_k: int) -> list:
            """Run a FAISS query and map the hits back to image paths."""
            query = features.astype('float32').reshape(1, -1)
            scores, indices = self.index.search(query, top_k)
            results = []
            for score, idx in zip(scores[0], indices[0]):
                if idx < 0:  # FAISS pads with -1 when fewer than top_k hits exist
                    continue
                results.append({
                    "path": self.image_paths[int(idx)],
                    "score": float(score),
                })
            return results

        def search_by_text(self, query: str, top_k=10) -> list:
            """Search images with a text query."""
            return self._search(self.extractor.encode_text(query), top_k)

        def search_by_image(self, image_path: str, top_k=10) -> list:
            """Search images with a query image."""
            return self._search(self.extractor.encode_image(image_path), top_k)

5.2 Usage Example

    # Build the index
    engine = CLIPSearchEngine()
    engine.build_index("/data/product_images")
    engine.save_index("product_index")

    # Text-to-image search
    results = engine.search_by_text("red dress, fashionable style", top_k=5)
    for r in results:
        print(f"{r['score']:.3f} | {r['path']}")

    # Image-to-image search
    results = engine.search_by_image("query.jpg", top_k=5)
    for r in results:
        print(f"{r['score']:.3f} | {r['path']}")

6. Web API Service
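
    import os
    import shutil
    import tempfile

    from fastapi import FastAPI, UploadFile, File
    from fastapi.responses import JSONResponse

    # CLIPSearchEngine is the class from section 5.1; import it here if it lives in another module.
    # File uploads in FastAPI also need the python-multipart package installed.
    app = FastAPI()
    engine = CLIPSearchEngine()
    engine.load_index("product_index")


    @app.get("/search/text")
    async def search_text(query: str, top_k: int = 10):
        results = engine.search_by_text(query, top_k)
        return JSONResponse(content={"results": results})


    @app.post("/search/image")
    async def search_image(file: UploadFile = File(...), top_k: int = 10):
        # Write the upload to a temporary file so the engine can open it by path
        with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
            shutil.copyfileobj(file.file, tmp)
            tmp_path = tmp.name
        try:
            results = engine.search_by_image(tmp_path, top_k)
        finally:
            os.unlink(tmp_path)  # always remove the temporary upload
        return JSONResponse(content={"results": results})

Start the service and test it from the shell:

    # Start the service
    uvicorn app:app --host 0.0.0.0 --port 8000

    # Test
    curl "http://localhost:8000/search/text?query=蓝色运动鞋&top_k=5"

7. Performance Optimization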
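To put the memory side of this section in perspective, the following back-of-the-envelope sketch compares a flat float32 index with product-quantized (PQ) codes for one million 768-dimensional vectors. The sub-quantizer counts (384 and 192) and the 8-bit code size are assumptions chosen only to illustrate an 8-16x reduction; codebooks and IVF bookkeeping are ignored, so real indexes will be somewhat larger.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_float=4):
    """Raw storage of a flat float32 index: one full vector per image."""
    return n_vectors * dim * bytes_per_float


def pq_index_bytes(n_vectors, n_subquantizers, bits_per_code=8):
    """Code storage of a product-quantized index (codebooks ignored)."""
    return n_vectors * n_subquantizers * bits_per_code // 8


n, dim = 1_000_000, 768
flat = flat_index_bytes(n, dim)
print(f"flat float32: {flat / 2**30:.2f} GiB")

# Hypothetical sub-quantizer counts; dim must be divisible by each of them.
for m in (384, 192):
    pq = pq_index_bytes(n, m)
    print(f"PQ m={m}: {pq / 2**30:.2f} GiB, {flat / pq:.0f}x smaller")
```

Fewer sub-quantizers shrink the index further but quantize each vector more coarsely, so the compression ratio trades off against recall.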
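| Optimization | When to use | Typical effect |
|---|---|---|
| FAISS IVF | >10K images | About 10x faster search |
| FAISS PQ | >1M images | 8-16x less memory |
| GPU index | Real-time search | Millisecond-level latency |
| Feature caching | Repeated queries | Avoids re-encoding the same input |
| Batch encoding | Index building | About 5x higher throughput |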
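The feature-caching row can be sketched with the standard library alone. The example below wraps an arbitrary text-to-vector function in an LRU cache so that a repeated query skips the encoder. `fake_encode` is a hypothetical stand-in for `CLIPFeatureExtractor.encode_text`, the class name and cache size are illustrative, and lowercasing queries before lookup is an assumption that may not suit every deployment.

```python
from functools import lru_cache


class CachedTextEncoder:
    """Wraps any text -> vector function with an in-memory LRU cache."""

    def __init__(self, encode_fn, max_entries=10_000):
        self.calls = 0  # number of times the underlying encoder actually ran

        @lru_cache(maxsize=max_entries)
        def _cached(text):
            self.calls += 1
            # Store a tuple so callers cannot mutate the cached value.
            return tuple(encode_fn(text))

        self._cached = _cached

    def encode(self, text):
        # Treat trivially different spellings of a query as the same key.
        return self._cached(text.strip().lower())


def fake_encode(text):
    """Hypothetical stand-in for CLIPFeatureExtractor.encode_text."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


encoder = CachedTextEncoder(fake_encode)
for q in ["red dress", "Red Dress ", "blue sneakers", "red dress"]:
    encoder.encode(q)
print(f"queries: 4, encoder calls: {encoder.calls}")
```

With a real encoder the cached value would be a NumPy array, so it is worth returning a copy or a read-only view to keep callers from altering the cached features.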
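8. Summary
The core strength of CLIP-based image-text retrieval is semantic understanding: it needs no exact keyword match, and content is retrieved as long as it is semantically relevant. Combined with the FAISS vector search library, it can deliver millisecond-level retrieval over millions of images.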