1. Introduction
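The previous article surveyed the overall architecture of a content-globalization toolchain. This one is hands-on: it uses Python, Docker, and GitHub Actions to build an automated translation-and-dubbing pipeline. The goal is to upload one Chinese video and automatically produce English, Japanese, and Spanish versions published to YouTube.
The code for the core modules is included below and can be copied and run. The architecture is: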
Git push → GitHub Actions trigger → Docker container start → Whisper speech recognition → DeepL/GPT translation → ElevenLabs/Cutrix dubbing → FFmpeg composition → YouTube API upload → DingTalk notification
2. Environment Setup
2.1 Dependencies
```bash
# System dependencies
apt-get install -y ffmpeg python3.11 python3-pip

# Python dependencies
pip install faster-whisper openai deepl elevenlabs pyyaml google-api-python-client
pip install google-auth-oauthlib boto3 requests
```
2.2 API Keys
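| Service | Purpose | Where to get it | Free quota |
|---|---|---|---|
| DeepL API | Text translation | deepl.com/pro-api | 500,000 characters/month |
| ElevenLabs | TTS dubbing | elevenlabs.io/api | 10,000 characters/month |
| OpenAI API (GPT-4o) | Translation correction and terminology checks | platform.openai.com | - |
| YouTube Data API v3 | Video upload | console.cloud.google.com | 10,000 units/day |
| Cutrix API | Translation and dubbing (alternative) | cutrix.cc | - |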
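For production, consider replacing the DeepL + ElevenLabs combination with the Cutrix API, which handles translation, dubbing, and subtitles through a single interface and removes the work of integrating and maintaining separate APIs.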
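Because the pipeline depends on several keys, a small preflight check can fail fast before any GPU time is spent. The sketch below is an illustration, not part of the original code: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the environment variable names are assumed to match those passed to the container later in the article.

```python
import os

# Hypothetical mapping: engine name -> environment variables it needs.
# The variable names are assumed to match the container environment.
REQUIRED_KEYS = {
    "deepl": ["DEEPL_API_KEY"],
    "gpt4o": ["OPENAI_API_KEY"],
    "elevenlabs": ["ELEVENLABS_API_KEY"],
}


def missing_keys(engines, env=None):
    """Return the sorted env var names that the given engines need but that are unset or empty."""
    env = os.environ if env is None else env
    missing = set()
    for engine in engines:
        # Engines without an entry (for example "cutrix") are not checked here
        for name in REQUIRED_KEYS.get(engine, []):
            if not env.get(name):
                missing.add(name)
    return sorted(missing)


# Demo with an explicit dict instead of the real environment
demo_env = {"DEEPL_API_KEY": "dummy-value"}
print(missing_keys(["deepl", "elevenlabs"], env=demo_env))
```

Calling `missing_keys` with the engines selected in the configuration at the start of a run gives one clear error message instead of a failure halfway through a long job.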
2.3 Directory Layout
```
content-globalization-pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── config.yaml              # languages / API keys / platform settings
├── src/
│   ├── main.py              # pipeline entry point
│   ├── transcribe.py        # Whisper speech recognition
│   ├── translate.py         # translation engine
│   ├── dubbing.py           # TTS dubbing
│   ├── compose.py           # FFmpeg composition
│   ├── distribute.py        # multi-platform distribution
│   └── notify.py            # notifications
├── .github/workflows/
│   └── pipeline.yml         # CI/CD configuration
└── tests/
    └── test_pipeline.py
```
3. Core Implementation
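3.1 Configuration (config.yaml)
```yaml
pipeline:
  source_lang: "zh"
  target_langs: ["en", "ja", "es"]
  video_dir: "/data/videos"
  output_dir: "/data/output"

asr:
  engine: "faster-whisper"
  model: "large-v3"
  compute_type: "float16"  # float16 on GPU, int8 on CPU

translation:
  engine: "deepl"  # deepl | gpt4o | cutrix
  glossary_path: "/data/glossary.yaml"  # glossary file

dubbing:
  engine: "elevenlabs"  # elevenlabs | cutrix
  voice_map:  # per-speaker voice configuration
    # NOTE: this is an Azure Neural TTS voice name; use an ElevenLabs
    # voice ID here when the engine is elevenlabs
    default: "zh-CN-YunxiNeural"

distribution:
  youtube:
    enabled: true
    category_id: "28"  # Science & Technology
    privacy_status: "private"  # upload as private, make public after review

notify:
  # yaml.safe_load does not expand ${...}; resolve this from the environment in code
  dingtalk_webhook: "${DINGTALK_WEBHOOK}"
```
3.2 Main Pipeline (src/main.py)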
```python
import logging
from pathlib import Path

import yaml

from transcribe import transcribe_video
from translate import translate_segments
from dubbing import generate_dubbing
from compose import compose_video
from distribute import upload_to_youtube
from notify import send_dingtalk

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def run_pipeline(video_path: str, config: dict):
    """Full pipeline: one video in, one version per target language out."""
    video_name = Path(video_path).stem
    results = {}

    # Step 1: Whisper speech recognition. The transcript does not depend on the
    # target language, so ASR runs once per video instead of once per language.
    segments = transcribe_video(
        video_path,
        model=config["asr"]["model"],
        compute_type=config["asr"]["compute_type"],
    )
    logger.info(f"ASR done for {video_name}: {len(segments)} segments")

    for target_lang in config["pipeline"]["target_langs"]:
        logger.info(f"[{target_lang}] Processing: {video_name}")

        # Step 2: sentence-by-sentence translation
        translated = translate_segments(
            segments,
            source_lang=config["pipeline"]["source_lang"],
            target_lang=target_lang,
            engine=config["translation"]["engine"],
            glossary=config["translation"].get("glossary_path"),
        )
        logger.info(f"[{target_lang}] Translation done")

        # Step 3: TTS dubbing
        audio_path = generate_dubbing(
            translated,
            target_lang=target_lang,
            engine=config["dubbing"]["engine"],
            voice=config["dubbing"]["voice_map"].get("default"),
        )
        logger.info(f"[{target_lang}] Dubbing done: {audio_path}")

        # Step 4: FFmpeg composition
        output_path = compose_video(
            video_path=video_path,
            audio_path=audio_path,
            translated_segments=translated,
            target_lang=target_lang,
            output_dir=config["pipeline"]["output_dir"],
        )
        logger.info(f"[{target_lang}] Composition done: {output_path}")

        # Step 5: YouTube upload
        if config["distribution"]["youtube"]["enabled"]:
            video_id = upload_to_youtube(
                video_path=output_path,
                title=f"{video_name} [{target_lang.upper()}]",
                target_lang=target_lang,
                config=config["distribution"]["youtube"],
            )
            results[target_lang] = video_id
            logger.info(f"[{target_lang}] Upload done: {video_id}")

    return results


if __name__ == "__main__":
    with open("config.yaml", encoding="utf-8") as f:
        config = yaml.safe_load(f)

    for video in Path(config["pipeline"]["video_dir"]).glob("*.mp4"):
        try:
            results = run_pipeline(str(video), config)
            links = "\n".join(
                f"{lang}: https://youtu.be/{vid}" for lang, vid in results.items()
            )
            send_dingtalk(f"✅ {video.name} processed\n{links}")
        except Exception as e:
            logger.error(f"❌ {video.name} failed: {e}")
            send_dingtalk(f"❌ {video.name} failed: {str(e)[:200]}")
            raise  # re-raise so the CI job is marked as failed
```
3.3 Whisper Speech Recognition (src/transcribe.py)
```python
from dataclasses import dataclass

from faster_whisper import WhisperModel


@dataclass
class Segment:
    start: float
    end: float
    text: str
    translated: str = ""  # filled in by the translation step


def transcribe_video(video_path: str, model: str = "large-v3",
                     compute_type: str = "float16") -> list[Segment]:
    """Transcribe with Whisper and return a list of timestamped segments."""
    # device is fixed to "cuda"; on a CPU-only host use device="cpu" with compute_type="int8"
    whisper = WhisperModel(model, device="cuda", compute_type=compute_type)
    segments, info = whisper.transcribe(
        video_path,
        beam_size=5,
        vad_filter=True,  # filter out silence
        vad_parameters=dict(min_silence_duration_ms=500),  # minimum silence gap: 500 ms
    )

    results = []
    for seg in segments:
        # Merge very short segments (< 1 second) into the previous one
        if results and seg.end - seg.start < 1.0:
            results[-1].text += " " + seg.text.strip()
            results[-1].end = seg.end
        else:
            results.append(Segment(start=seg.start, end=seg.end, text=seg.text.strip()))
    return results
```
3.4 Translation Engine (src/translate.py)
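```python
import os
from pathlib import Path

import deepl
import yaml


def translate_segments(segments, source_lang, target_lang, engine="deepl", glossary=None):
    """Translate sentence by sentence, with optional glossary binding.

    Only the DeepL path is implemented here; `engine` is accepted for other back ends.
    """
    # Load the glossary so proper nouns stay consistent
    glossary_map = {}
    if glossary and Path(glossary).exists():
        with open(glossary, encoding="utf-8") as f:
            glossary_map = (yaml.safe_load(f) or {}).get(target_lang, {})

    translator = deepl.Translator(os.environ["DEEPL_API_KEY"])
    lang_map = {"zh": "ZH", "en": "EN-US", "ja": "JA", "es": "ES"}

    results = []
    for seg in segments:
        text = seg.text
        # Wrap glossary terms in tags before translating so they can be swapped afterwards
        for cn_term in glossary_map:
            if cn_term in text:
                text = text.replace(cn_term, f"<glossary>{cn_term}</glossary>")

        result = translator.translate_text(
            text,
            source_lang=lang_map.get(source_lang, "ZH"),
            target_lang=lang_map.get(target_lang, "EN-US"),
            formality="prefer_less",   # colloquial tone for short dramas and short videos
            tag_handling="xml",        # treat <glossary> as markup, not text
            ignore_tags=["glossary"],  # leave the tagged terms untranslated
        )
        translated = result.text

        # Replace the tagged source terms with their glossary translations
        for cn_term, target_term in glossary_map.items():
            translated = translated.replace(f"<glossary>{cn_term}</glossary>", target_term)

        seg.translated = translated
        results.append(seg)
    return results
```
3.5 FFmpeg Composition (src/compose.py)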
```python
import subprocess
from pathlib import Path


def compose_video(video_path, audio_path, translated_segments, target_lang, output_dir):
    """Build the localized video: replace the audio track and burn in subtitles."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    video_name = Path(video_path).stem
    output_path = output_dir / f"{video_name}_{target_lang}.mp4"

    # Write the SRT subtitle file (UTF-8, with the " --> " separator SRT requires)
    srt_path = output_dir / f"{video_name}_{target_lang}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(translated_segments, 1):
            f.write(f"{i}\n")
            f.write(f"{_format_time(seg.start)} --> {_format_time(seg.end)}\n")
            f.write(f"{seg.translated}\n\n")

    # FFmpeg: take video from input 0, audio from input 1, and burn in the subtitles
    cmd = [
        "ffmpeg", "-y",
        "-i", str(video_path),
        "-i", str(audio_path),
        "-vf", (
            f"subtitles={srt_path}:force_style="
            "'FontSize=18,PrimaryColour=&H00FFFFFF,"
            "OutlineColour=&H00000000,BackColour=&H80000000'"
        ),
        "-c:v", "libx264", "-preset", "medium", "-crf", "23",
        "-c:a", "aac", "-b:a", "128k",
        "-map", "0:v:0", "-map", "1:a:0",
        str(output_path),
    ]
    subprocess.run(cmd, check=True, capture_output=True)
    return output_path


def _format_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    ms = int((seconds * 1000) % 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```
4. Docker Deployment
4.1 Dockerfile
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    ffmpeg python3.11 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY config.yaml .

ENTRYPOINT ["python3", "src/main.py"]
```
4.2 docker-compose.yml
```yaml
version: "3.8"
services:
  pipeline:
    build: .
    volumes:
      - ./data/videos:/data/videos
      - ./data/output:/data/output
      - ./config.yaml:/app/config.yaml
      - ./data/glossary.yaml:/data/glossary.yaml
    environment:
      - DEEPL_API_KEY=${DEEPL_API_KEY}
      - ELEVENLABS_API_KEY=${ELEVENLABS_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - DINGTALK_WEBHOOK=${DINGTALK_WEBHOOK}
      # Forwarded so the secret set by the CI workflow reaches the container
      - YOUTUBE_CLIENT_SECRET=${YOUTUBE_CLIENT_SECRET}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: "no"  # run once and exit; not a long-running service
```
5. CI/CD Integration (GitHub Actions)
5.1 .github/workflows/pipeline.yml
```yaml
name: Content Globalization Pipeline

on:
  push:
    paths:
      - "data/videos/**.mp4"  # trigger when video files are pushed
  workflow_dispatch:  # manual trigger
    inputs:
      video_file:
        description: "Video file name (leave empty to process all)"
        required: false

jobs:
  translate-and-distribute:
    runs-on: [self-hosted, gpu]  # Whisper transcription needs a GPU
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build pipeline image
        run: docker compose build

      - name: Run pipeline
        env:
          DEEPL_API_KEY: ${{ secrets.DEEPL_API_KEY }}
          ELEVENLABS_API_KEY: ${{ secrets.ELEVENLABS_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          DINGTALK_WEBHOOK: ${{ secrets.DINGTALK_WEBHOOK }}
          YOUTUBE_CLIENT_SECRET: ${{ secrets.YOUTUBE_CLIENT_SECRET }}
        run: docker compose run --rm pipeline

      - name: Commit output files
        if: success()
        run: |
          git config user.name "pipeline-bot"
          git config user.email "bot@cutrix.cc"
          git add data/output/
          git commit -m "auto: content globalization run complete $(date +%Y-%m-%d_%H:%M)" || true
          git push

      - name: Notify failure
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const runId = context.runId;
            const msg = `❌ Pipeline failed: ${context.repo.owner}/${context.repo.repo}\nRun: https://github.com/${context.repo.owner}/${context.repo.repo}/actions/runs/${runId}`;
            // DingTalk notification is built into main.py; this step is only a fallback
            core.error(msg);
```
5.2 Glossary Change Check (Git pre-commit hook)
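```bash
#!/bin/bash
# .git/hooks/pre-commit: prompt for a translation-consistency check when the glossary changes
if git diff --cached --name-only | grep -q "glossary.yaml"; then
  echo "⚠️ The glossary has changed. Please confirm:"
  echo "  1. Do previously translated videos need to be reprocessed?"
  echo "  2. Are the new terms consistent across all language versions?"
  echo ""
  echo "To skip this check: git commit --no-verify"
  exit 1
fi
```
6. Monitoring and Alerting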
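6.1 Pipeline Health Metrics
```python
# src/monitor.py: record key metrics for each pipeline run
import json
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass
class PipelineMetrics:
    video_name: str
    duration_seconds: float         # wall-clock processing time
    source_duration_minutes: float  # length of the source video
    target_langs: list[str]
    asr_wer: float                  # word error rate
    translation_time: float
    dubbing_time: float
    upload_success: bool
    timestamp: str = ""

    def __post_init__(self):
        # Fill in the timestamp only if the caller did not supply one
        if not self.timestamp:
            self.timestamp = datetime.now().isoformat()

    def log(self, path="/data/metrics.jsonl"):
        """Append the metrics to the log as one JSON line."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(self), ensure_ascii=False) + "\n")

    @property
    def realtime_factor(self) -> float:
        """Real-time factor: processing time / video length; < 1 means faster than real time."""
        return self.duration_seconds / (self.source_duration_minutes * 60)
```
6.2 Alert Rules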
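| Metric | Alert threshold | Action |
|---|---|---|
| Per-video processing time | > 3 × video length | Check the Whisper configuration and GPU status |
| ASR word error rate | > 15% | Check audio quality; denoising preprocessing may be needed |
| API call timeouts | 3 in a row | Switch to the fallback translation/TTS engine |
| Upload failure rate | > 10% | Check the YouTube API quota |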
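The thresholds in this table can be checked mechanically after each run. The following is a minimal sketch under stated assumptions: `RunStats`, its field names, and the alert labels are hypothetical and do not come from the original monitor.py; only the threshold values are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Hypothetical per-run summary; the field names are illustrative only."""
    processing_seconds: float
    video_seconds: float
    asr_wer: float                 # word error rate as a fraction (0.15 == 15%)
    consecutive_api_timeouts: int
    uploads_attempted: int
    uploads_failed: int


def evaluate_alerts(stats: RunStats) -> list[str]:
    """Return the labels of all alert rules that the given run violates."""
    alerts = []
    # Processing time more than 3x the video length
    if stats.processing_seconds > stats.video_seconds * 3:
        alerts.append("slow_processing")
    # ASR word error rate above 15%
    if stats.asr_wer > 0.15:
        alerts.append("high_wer")
    # Three or more API timeouts in a row
    if stats.consecutive_api_timeouts >= 3:
        alerts.append("api_timeouts")
    # Upload failure rate above 10% (skipped when nothing was uploaded)
    if stats.uploads_attempted:
        if stats.uploads_failed / stats.uploads_attempted > 0.10:
            alerts.append("upload_failures")
    return alerts


bad_run = RunStats(400, 100, 0.20, 3, 10, 2)
print(evaluate_alerts(bad_run))
```

Returning labels rather than sending notifications directly keeps the rule logic easy to test; the caller can then map each label to the action listed in the table.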
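7. Common Problems and Fixes
7.1 Low Whisper accuracy on short dramas and livestreams
Short dramas often have background music, overlapping speakers, and dialects or accents. Fix: add a vocal-separation step (UVR5 or Demucs) before ASR and feed the isolated vocals to Whisper.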
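```python
# Vocal-separation preprocessing
import subprocess
from pathlib import Path


def separate_vocals(video_path: str, out_dir: str = "/tmp/demucs_output") -> str:
    """Separate vocals with Demucs and return the path to the vocals track."""
    model = "htdemucs"  # named explicitly so the output directory is predictable
    subprocess.run(
        ["demucs", "--two-stems=vocals", "-n", model, "-o", out_dir, video_path],
        check=True,
    )
    # Demucs writes its stems to <out_dir>/<model>/<track name>/vocals.wav
    return str(Path(out_dir) / model / Path(video_path).stem / "vocals.wav")
```
7.2 Translation quality drops on long videos (>30 minutes)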
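When an LLM translates long text it tends to "forget" earlier context, which leads to inconsistent terminology. Fix: when translating sentence by sentence, maintain a context window (the previous 3 sentences plus the next 2) and pass this reduced context to the translation engine so it can interpret the current sentence correctly.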
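A sliding context window of this kind might be built as follows. This is a sketch rather than the article's implementation: `build_context` and `build_prompt` are hypothetical helper names, and the prompt wording is only an example of how the window could be passed to an LLM-based engine.

```python
def build_context(sentences, index, before=3, after=2):
    """Return the sentence at `index` plus up to `before` previous and `after` following ones."""
    start = max(0, index - before)
    end = min(len(sentences), index + after + 1)
    return {
        "previous": sentences[start:index],
        "current": sentences[index],
        "following": sentences[index + 1:end],
    }


def build_prompt(sentences, index, target_lang):
    """Assemble an example prompt that asks for a translation of the current sentence only."""
    ctx = build_context(sentences, index)
    lines = [
        f"Translate only the CURRENT sentence into {target_lang}.",
        "Use the surrounding sentences for context; do not translate them.",
        "PREVIOUS: " + " ".join(ctx["previous"]),
        "CURRENT: " + ctx["current"],
        "FOLLOWING: " + " ".join(ctx["following"]),
    ]
    return "\n".join(lines)


demo = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
print(build_context(demo, 4))
```

The window shrinks automatically at the start and end of the transcript, so the first sentence simply has no previous context and the last has no following context.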
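7.3 Running out of YouTube API quota
The YouTube Data API v3 has a default quota of 10,000 units per day, and uploading one video costs about 1,600 units (including the metadata update). Options: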
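- Apply for a quota increase for the Google Cloud project through the YouTube API quota extension request (reportedly up to 1,000,000 units)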
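- For non-urgent content, upload with `privacy_status: private` first, then switch videos to public in batches once the quota has recovered
- For high-volume scenarios, integrate with YouTube Studio's Content ID bulk upload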
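To see why the default quota becomes a bottleneck, it helps to work through the arithmetic with the figures quoted above (10,000 units per day, about 1,600 units per upload). The helper names below are hypothetical, and the calculation ignores any other API calls that also consume quota.

```python
import math

DAILY_QUOTA_UNITS = 10_000  # default daily quota quoted above
UPLOAD_COST_UNITS = 1_600   # approximate cost of one upload quoted above


def uploads_per_day(quota=DAILY_QUOTA_UNITS, cost=UPLOAD_COST_UNITS):
    """Number of whole uploads that fit in one day's quota."""
    return quota // cost


def days_to_publish(num_videos, num_langs, quota=DAILY_QUOTA_UNITS, cost=UPLOAD_COST_UNITS):
    """Days needed to upload every language version of every source video."""
    per_day = uploads_per_day(quota, cost)
    if per_day == 0:
        raise ValueError("quota is too small for even one upload per day")
    return math.ceil(num_videos * num_langs / per_day)


print(uploads_per_day())       # 6 uploads per day on the default quota
print(days_to_publish(10, 3))  # 30 uploads at 6 per day -> 5 days
```

With three target languages, the default quota therefore covers only two source videos per day, which is why the options above matter once volume grows.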
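8. Summary
The pipeline built in this article automates the full flow: push a video file to the Git repository → multilingual translation and dubbing are triggered automatically → distribution to YouTube → DingTalk notification.
Key design decisions: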
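| Decision | Choice | Reason |
|---|---|---|
| ASR engine | faster-whisper large-v3 | CTranslate2 inference, up to 4x faster than the original Whisper |
| Translation engine (primary) | DeepL API | Best quality for Chinese → Japanese/Spanish |
| Translation engine (fallback) | GPT-4o | Handles colloquial language and internet slang that DeepL is weaker at |
| TTS engine | ElevenLabs | Best emotional fidelity |
| Distribution | Direct upload via the YouTube API | Fewer intermediate steps, 95%+ success rate |
| Runtime environment | Docker + GPU self-hosted runner | Lower cloud GPU costs |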
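If your team has no GPU server, or does not want to maintain this whole pipeline, the Cutrix API can replace the four-stage Whisper → DeepL → ElevenLabs → FFmpeg chain: a single API call produces translation, dubbing, and subtitles, cutting the pipeline code by more than 70%.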
References
- faster-whisper GitHub
- DeepL API Documentation
- ElevenLabs API
- YouTube Data API v3
- FFmpeg Documentation
- GitHub Actions Documentation