news 2026/4/19 15:54:07

Elasticsearch: How to use an LLM to extract the information you need at ingest time

张小明

Front-end developer


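In many scenarios, we can use an LLM to extract the structured data we need. That data might be a classification, a set of synonyms, and so on. In my earlier article "How to automate synonyms and upload them with our Synonyms API", we showed how to use an LLM to generate synonyms and upload them to Elasticsearch. In today's example, we move the LLM extraction step into an ingest pipeline, so the required information is extracted automatically as documents are ingested.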

Create an LLM chat completion endpoint

We can follow the earlier article "Elasticsearch: Inference endpoints and a semantic search demo" and create a chat completion endpoint like this:

PUT _inference/completion/azure_openai_completion
{
  "service": "azureopenai",
  "service_settings": {
    "api_key": "${AZURE_API_KEY}",
    "resource_name": "${AZURE_RESOURCE_NAME}",
    "deployment_id": "${AZURE_DEPLOYMENT_ID}",
    "api_version": "${AZURE_API_VERSION}"
  }
}
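Before wiring the endpoint into a pipeline, it can help to send it a single prompt. The following is a minimal sketch in Python, not part of the original walkthrough. It assumes a cluster at a hypothetical `ES_URL`, a hypothetical `ES_API_KEY`, and a completion response shaped like `{"completion": [{"result": "..."}]}`. The helper names are made up for illustration.

```python
import json
import urllib.request

# Hypothetical values: replace with your own cluster URL and API key.
ES_URL = "http://localhost:9200"
ES_API_KEY = "<your-api-key>"


def build_completion_request(inference_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST to _inference/completion/<inference_id>."""
    body = json.dumps({"input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ES_URL}/_inference/completion/{inference_id}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {ES_API_KEY}",
        },
    )


def first_completion(response: dict) -> str:
    """Return the text of the first completion.

    Assumes the response shape {"completion": [{"result": "..."}]}.
    """
    return response["completion"][0]["result"]
```

To actually send the request, pass the result of `build_completion_request("azure_openai_completion", "Say hello")` to `urllib.request.urlopen`, decode the reply with `json.load`, and hand it to `first_completion`.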

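Create an ingest pipeline

Before creating the pipeline, we can test it with the simulate API. The simulate request further down uses an EXTRACTION_PROMPT variable, defined as follows: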

Extract audio product information from this description. Return raw JSON only. Do NOT use markdown, backticks, or code blocks. Fields: category (string, one of: Headphones/Earbuds/Speakers/Microphones/Accessories), features (array of strings from: wireless/noise_cancellation/long_battery/waterproof/voice_assistant/fast_charging/portable/surround_sound), use_case (string, one of: Travel/Office/Home/Fitness/Gaming/Studio). Description:

If you are not yet familiar with how to define this variable, see my earlier article "Kibana: How to set variables and apply them".

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Use an LLM to extract category, features, and use case from product descriptions",
    "processors": [
      {
        "script": {
          "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
          "params": {
            "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
          }
        }
      },
      {
        "inference": {
          "model_id": "azure_openai_completion",
          "input_output": {
            "input_field": "prompt",
            "output_field": "ai_response"
          }
        }
      },
      {
        "json": {
          "field": "ai_response",
          "add_to_root": true
        }
      },
      {
        "remove": {
          "field": [ "prompt", "ai_response" ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "name": "Wireless Noise-Canceling Headphones",
        "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
        "price": 299.99
      }
    }
  ]
}

Tip: you can use any LLM you like to create the endpoint above.

The command above returns:

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_version": "-3",
        "_id": "_id",
        "_source": {
          "use_case": "Travel",
          "features": [ "wireless", "noise_cancellation", "long_battery" ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        },
        "_ingest": {
          "timestamp": "2026-01-22T13:56:11.926494Z"
        }
      }
    }
  ]
}
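To make the data flow through the processors easier to follow, here is a rough local imitation in Python. It is only a sketch of the script, inference, json, and remove steps, not how Elasticsearch implements them. `fake_llm` is a hypothetical stub standing in for the completion endpoint, and the prompt is shortened. Unlike the real inference processor, the sketch does not add a `model_id` field.

```python
import json
from typing import Callable

# Shortened stand-in for the EXTRACTION_PROMPT variable used in the pipeline.
EXTRACTION_PROMPT = "Extract audio product information from this description. Description: "


def run_pipeline(doc: dict, complete: Callable[[str], str]) -> dict:
    """Imitate the script, inference, json, and remove processors on one document."""
    ctx = dict(doc)  # work on a copy so the input document is left untouched
    # script: concatenate the prompt and the product description
    ctx["prompt"] = EXTRACTION_PROMPT + ctx["description"]
    # inference: send the prompt to the model and keep the raw reply
    ctx["ai_response"] = complete(ctx["prompt"])
    # json with add_to_root: parse the reply and merge its keys into the root
    ctx.update(json.loads(ctx["ai_response"]))
    # remove: drop the temporary fields
    for field in ("prompt", "ai_response"):
        ctx.pop(field, None)
    return ctx


def fake_llm(prompt: str) -> str:
    """Hypothetical stub: returns a canned reply instead of calling a model."""
    return json.dumps({
        "category": "Headphones",
        "features": ["wireless", "noise_cancellation", "long_battery"],
        "use_case": "Travel",
    })
```

Calling `run_pipeline(doc, fake_llm)` on a document with `name`, `description`, and `price` returns those fields plus `category`, `features`, and `use_case`, with `prompt` and `ai_response` removed.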

The test succeeded. We can now create the pipeline:

PUT _ingest/pipeline/product-enrichment-pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = params.EXTRACTION_PROMPT + ctx.description",
        "params": {
          "EXTRACTION_PROMPT": "${EXTRACTION_PROMPT}"
        }
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "ai_response"
        }
      }
    },
    {
      "json": {
        "field": "ai_response",
        "add_to_root": true
      }
    },
    {
      "remove": {
        "field": [ "prompt", "ai_response" ]
      }
    }
  ]
}
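The pipeline relies on the model returning raw JSON that sticks to the vocabulary listed in the prompt. One way to check this outside Elasticsearch is sketched below in Python. The allowed values are copied from the prompt; the function name and the wording of the messages are assumptions made for illustration.

```python
import json

# Allowed values, copied from the extraction prompt.
ALLOWED_CATEGORIES = ("Headphones", "Earbuds", "Speakers", "Microphones", "Accessories")
ALLOWED_FEATURES = (
    "wireless", "noise_cancellation", "long_battery", "waterproof",
    "voice_assistant", "fast_charging", "portable", "surround_sound",
)
ALLOWED_USE_CASES = ("Travel", "Office", "Home", "Fitness", "Gaming", "Studio")


def validate_extraction(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the reply matches the prompt."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    if data.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {data.get('category')!r}")
    features = data.get("features")
    if not isinstance(features, list):
        problems.append("features is not an array")
    else:
        unknown = [f for f in features if f not in ALLOWED_FEATURES]
        if unknown:
            problems.append(f"unexpected features: {unknown!r}")
    if data.get("use_case") not in ALLOWED_USE_CASES:
        problems.append(f"unexpected use_case: {data.get('use_case')!r}")
    return problems
```

A reply wrapped in markdown backticks, which the prompt explicitly forbids, shows up here as a "not valid JSON" problem. Inside the pipeline, the same reply would make the json processor fail.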

Create an index and write data

Next, we create an index named products:

PUT products
{
  "settings": {
    "default_pipeline": "product-enrichment-pipeline"
  }
}

As shown above, we set default_pipeline, the index's default pipeline, to product-enrichment-pipeline. The pipeline is then invoked automatically whenever we index documents in the usual way:

POST _bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Wireless Noise-Canceling Headphones", "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.", "price": 299.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Portable Bluetooth Speaker", "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.", "price": 149.99 }
{ "index": { "_index": "products", "_id": "3" } }
{ "name": "Studio Condenser Microphone", "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.", "price": 199.99 }

Note: depending on the speed of the LLM, the call above may take a while to complete!

The data is now written. We can use the following command to inspect it:
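When the documents come from a program rather than the Kibana console, the bulk body has to be assembled as newline-delimited JSON. The Python sketch below shows one way to do that. `build_bulk_body` is a hypothetical helper, and numbering the `_id` values from 1 simply mirrors the example above.

```python
import json


def build_bulk_body(index: str, docs: list[dict]) -> str:
    """Serialize docs for _bulk: one action line, then one source line, per document."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        # action line: which index and _id the next line belongs to
        lines.append(json.dumps({"index": {"_index": index, "_id": str(i)}}))
        # source line: the document itself, on a single line
        lines.append(json.dumps(doc))
    # the bulk body must end with a newline
    return "\n".join(lines) + "\n"
```

The resulting string can be sent as the body of a `POST _bulk` request with the `Content-Type: application/x-ndjson` header. Because the index has a default pipeline, no pipeline parameter is needed.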

GET products/_search?filter_path=**.hits

{
  "hits": {
    "hits": [
      {
        "_index": "products",
        "_id": "1",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [ "wireless", "noise_cancellation", "long_battery" ],
          "price": 299.99,
          "name": "Wireless Noise-Canceling Headphones",
          "description": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium leather ear cushions. Perfect for travel and office use.",
          "model_id": "azure_openai_completion",
          "category": "Headphones"
        }
      },
      {
        "_index": "products",
        "_id": "2",
        "_score": 1,
        "_source": {
          "use_case": "Travel",
          "features": [ "waterproof", "surround_sound" ],
          "price": 149.99,
          "name": "Portable Bluetooth Speaker",
          "description": "Compact waterproof speaker with 360-degree surround sound. 20-hour battery life, perfect for outdoor adventures and pool parties.",
          "model_id": "azure_openai_completion",
          "category": "Speakers"
        }
      },
      {
        "_index": "products",
        "_id": "3",
        "_score": 1,
        "_source": {
          "use_case": "Studio",
          "features": [ "noise_cancellation", "voice_assistant" ],
          "price": 199.99,
          "name": "Studio Condenser Microphone",
          "description": "Professional USB microphone with noise cancellation and voice assistant compatibility. Ideal for podcasting, streaming, and home studio recording.",
          "model_id": "azure_openai_completion",
          "category": "Microphones"
        }
      }
    ]
  }
}

With structured data like the above, we can now search our data or run aggregations over it.

Happy learning!
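As an illustration of what such a search could look like, the Python sketch below builds a request body that filters by use case and counts products per category and per feature. It assumes the index uses the default dynamic mapping, so the extracted strings have `.keyword` subfields. The helper names are hypothetical. `count_features` does the same feature count locally over `_source` objects, which is handy for checking results on a small sample.

```python
from collections import Counter


def build_facet_query(use_case: str) -> dict:
    """Build a _search body: filter on use_case, then bucket by category and feature.

    Assumes default dynamic mapping, i.e. that the .keyword subfields exist.
    """
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "query": {"term": {"use_case.keyword": use_case}},
        "aggs": {
            "by_category": {"terms": {"field": "category.keyword"}},
            "by_feature": {"terms": {"field": "features.keyword"}},
        },
    }


def count_features(sources: list[dict]) -> Counter:
    """Count how often each extracted feature appears across documents."""
    counts = Counter()
    for source in sources:
        counts.update(source.get("features", []))
    return counts
```

The dictionary returned by `build_facet_query("Travel")` can be serialized with `json.dumps` and sent as the body of `GET products/_search`.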
