Don't Just Be a Script Kiddie! Automating CTFshow Web Information-Gathering Challenges with Python + Requests

From Script Kiddie to Security Engineer: Hands-On Python Automation for CTFshow Information-Gathering Challenges

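In CTF competitions, web information gathering is often the first step toward solving a challenge, and it is also where differences in engineering mindset show most clearly. Most contestants manually inspect page source, response headers, or sensitive files, but real efficiency comes from automating the repetitive work. This article walks through building a set of automated information-gathering tools with Python's Requests library. The tools can quickly solve the CTFshow series, and the same skills carry over to real-world security assessments.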

1. A Core Framework for Automated Information Gathering

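Automated information gathering needs a systematic detection workflow. We start with a basic framework containing the following core components: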

    import requests


    class WebScanner:
        def __init__(self, target_url):
            self.target = target_url
            # Reuse one session so cookies and default headers persist across requests
            self.session = requests.Session()
            self.session.headers.update({
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
                'Accept-Language': 'en-US,en;q=0.9'
            })

        def check_html_comments(self):
            """Look for flags in HTML comments."""
            pass

        def check_response_headers(self):
            """Check response headers for sensitive information."""
            pass

        def scan_common_files(self):
            """Scan common sensitive file paths."""
            pass

This skeleton is easy to extend: each method can be filled in one at a time. We start with the most basic check, HTML comment detection:

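    from bs4 import BeautifulSoup, Comment

    # Method of WebScanner
    def check_html_comments(self):
        try:
            resp = self.session.get(self.target)
            soup = BeautifulSoup(resp.text, 'html.parser')
            # Comment nodes are a str subclass, so they can be filtered by type
            comments = soup.find_all(string=lambda text: isinstance(text, Comment))
            # Keep only the comments that mention the flag prefix
            return [str(c).strip() for c in comments if 'ctfshow' in c]
        except Exception as e:
            print(f"Comment check failed: {e}")
            return []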

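This method parses the HTML with BeautifulSoup and extracts every comment containing the keyword ctfshow. For challenges such as CTFshow Web1 and Web2, this automated check finishes in an instant what would take several minutes by hand.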
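When BeautifulSoup is not available, the same idea can be approximated with the standard library alone. The following is a minimal sketch, not the article's method: the function name extract_flags_from_comments and both regular expressions are assumptions made for illustration, and the flag format ctfshow{...} is assumed from the keyword used above.

```python
import re

# Assumed flag format: ctfshow{...}; adjust the pattern for other competitions.
FLAG_PATTERN = re.compile(r"ctfshow\{[^}\s]+\}")
# Non-greedy match of everything between <!-- and -->, across line breaks.
COMMENT_PATTERN = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_flags_from_comments(html: str) -> list[str]:
    """Return every flag-shaped string found inside HTML comments."""
    flags = []
    for comment_body in COMMENT_PATTERN.findall(html):
        flags.extend(FLAG_PATTERN.findall(comment_body))
    return flags


# Made-up sample page, not taken from a real challenge.
sample = "<html><!-- ctfshow{demo-flag-1234} --><body>hi</body></html>"
print(extract_flags_from_comments(sample))
```

A regex is less robust than a real HTML parser (for example, it will also match comment-like text inside script blocks), so this is best treated as a quick fallback.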

2. Response Headers and Sensitive File Scanning

Many CTF challenges hide the flag in unconventional places, so the scan needs to cover more ground. First, the response header check:

    # Method of WebScanner
    def check_response_headers(self):
        sensitive_headers = ['flag', 'ctfshow', 'secret']
        try:
            resp = self.session.get(self.target)
            # Keep headers whose name contains one of the keywords
            return {
                h: resp.headers[h]
                for h in resp.headers
                if any(key in h.lower() for key in sensitive_headers)
            }
        except Exception as e:
            print(f"Response header check failed: {e}")
            return {}
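check_response_headers matches on header names only, so a flag stored as the value of an innocently named header would be missed. The sketch below shows one way to match on names and values together. It works on any plain mapping, so it can be tried without a network connection; the function name find_sensitive_headers and the sample header values are made up for illustration.

```python
def find_sensitive_headers(headers, keywords=("flag", "ctfshow", "secret")):
    """Return headers whose name or value contains any keyword (case-insensitive)."""
    hits = {}
    for name, value in headers.items():
        # Search the name and the value together, lower-cased once
        haystack = f"{name}\n{value}".lower()
        if any(keyword in haystack for keyword in keywords):
            hits[name] = value
    return hits


# Made-up headers: one match by value only, one match by name.
sample_headers = {
    "Server": "nginx",
    "X-Token": "ctfshow{value_only_example}",
    "Flag": "ctfshow{name_match_example}",
}
print(find_sensitive_headers(sample_headers))
```

With Requests, the mapping passed in would be resp.headers.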

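For challenges like CTFshow Web3, check_response_headers captures the flag directly from the response headers. Next, we implement the sensitive file scan: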

    COMMON_FILES = [
        'robots.txt', '.git/', '.svn/', 'index.php.swp',
        'www.zip', 'db/db.mdb', 'admin.php'
    ]

    # Method of WebScanner
    def scan_common_files(self):
        results = {}
        for path in COMMON_FILES:
            url = f"{self.target.rstrip('/')}/{path}"
            try:
                resp = self.session.get(url, timeout=3)
                if resp.status_code == 200:
                    results[path] = resp.text[:100] + '...'  # keep only a preview
            except requests.RequestException:
                # Skip paths that time out or fail to connect
                continue
        return results

This scanner covers several of the checks in CTFshow Web4 through Web20, for example the robots.txt leak in Web4, the .git leak in Web7, and the vim swap file in Web9.
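Editor leftovers rarely have a single fixed name. Vim's default swap file for index.php is .index.php.swp (with a leading dot), while the COMMON_FILES list above probes index.php.swp. A small helper can generate the usual variants for any filename. This is a sketch: the helper name backup_candidates and the particular set of suffixes are assumptions, not an exhaustive list.

```python
def backup_candidates(filename):
    """Return common editor swap and backup names for a given file."""
    swap_files = [f".{filename}.swp", f"{filename}.swp"]  # vim swap variants
    backups = [f"{filename}~", f"{filename}.bak"]         # editor and manual backups
    return swap_files + backups


print(backup_candidates("index.php"))
```

The resulting names can be appended to the path list before scanning.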

3. Advanced Detection Techniques in Practice

With the basic scan in place, we need more advanced detection logic. First, source code leak detection:

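    # Method of WebScanner
    def check_source_leaks(self):
        leaks = {}
        base = self.target.rstrip('/')

        # .phps source disclosure (assumes the entry script is index.php)
        resp = self.session.get(f"{base}/index.phps")
        if resp.status_code == 200 and '<?php' in resp.text:
            leaks['phps'] = resp.text

        # ZIP source archive; .get() avoids a KeyError when Content-Type is absent,
        # and the PK magic bytes catch servers that send a generic content type
        resp = self.session.get(f"{base}/www.zip")
        content_type = resp.headers.get('Content-Type', '')
        is_zip = content_type.startswith('application/zip') or resp.content.startswith(b'PK\x03\x04')
        if resp.status_code == 200 and is_zip:
            leaks['www.zip'] = "ZIP archive retrieved"
        return leaks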
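Once a www.zip archive has been downloaded, the next step is usually to see what is inside without writing it to disk. The sketch below lists the entries of an archive held in memory using only the standard library. The function name list_zip_entries is hypothetical, and the demo archive is built locally with made-up file names rather than fetched from a challenge.

```python
import io
import zipfile


def list_zip_entries(data: bytes) -> list[str]:
    """Return the file names inside a ZIP archive held in memory."""
    # A non-empty ZIP archive starts with the local file header signature
    if not data.startswith(b"PK\x03\x04"):
        return []
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        return []


# Build a small demo archive in memory (file names are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("index.php", "<?php // demo ?>")
    archive.writestr("notes.txt", "placeholder")
demo_bytes = buffer.getvalue()

print(list_zip_entries(demo_bytes))
```

With Requests, the bytes passed in would be resp.content from the www.zip request.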

check_source_leaks targets scenarios such as the phps leak in CTFshow Web5 and the www.zip source leak in Web6. For version-control leaks, we can probe more deeply:

    # Method of WebScanner
    def check_vcs_leaks(self):
        vcs_files = {
            'git': ['.git/HEAD', '.git/config'],
            'svn': ['.svn/entries'],
            'hg': ['.hg/store']
        }
        leaks = {}
        for vcs_type, paths in vcs_files.items():
            for path in paths:
                url = f"{self.target.rstrip('/')}/{path}"
                resp = self.session.get(url)
                if resp.status_code == 200:
                    leaks[vcs_type] = url
                    break  # one hit per VCS type is enough
        return leaks

This detection logic is aimed at the Git leak in CTFshow Web7 and the SVN leak in Web8.
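A 200 status alone does not prove that a repository is exposed, because some servers answer every path with a custom page. One way to reduce false positives is to check that the body of .git/HEAD actually looks like a HEAD file: either a symbolic ref of the form "ref: refs/heads/..." or a bare commit hash. The sketch below assumes SHA-1 (40 hex character) object names; the function name parse_git_head is hypothetical.

```python
def parse_git_head(text):
    """Classify the content of a .git/HEAD file.

    Returns ("ref", name) for a symbolic ref, ("commit", sha) for a
    detached HEAD, or None when the text does not look like a HEAD file.
    """
    line = text.strip()
    if line.startswith("ref: "):
        return ("ref", line[len("ref: "):])
    # Detached HEAD: a bare 40-character lowercase hex SHA-1
    if len(line) == 40 and all(c in "0123456789abcdef" for c in line):
        return ("commit", line)
    return None


print(parse_git_head("ref: refs/heads/master\n"))
print(parse_git_head("<html>404</html>"))
```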

4. Tool Integration and Hands-On Use

Combine the modules above into a complete automated scanner:

    # Method of WebScanner
    def full_scan(self):
        report = {
            'url': self.target,
            'html_comments': self.check_html_comments(),
            'headers': self.check_response_headers(),
            'common_files': self.scan_common_files(),
            'source_leaks': self.check_source_leaks(),
            'vcs_leaks': self.check_vcs_leaks(),
            # Evaluated last, so it reflects cookies set during the checks above
            'cookies': self.session.cookies.get_dict()
        }
        # Keep only the categories that produced a finding
        return {category: result for category, result in report.items() if result}

Usage example:

    import json

    scanner = WebScanner('http://challenge.ctf.show/web1')
    results = scanner.full_scan()
    print(json.dumps(results, indent=2))

Typical output structure (values are illustrative):

    {
      "url": "http://challenge.ctf.show/web1",
      "html_comments": [
        "ctfshow{a6b0c2d1-30cc-4372-a0e7-f874c2c73ea1}"
      ],
      "headers": {
        "X-Flag": "ctfshow{header_flag_example}"
      }
    }

5. Countering Defenses and Advanced Techniques

In real CTFs and penetration tests you will often run into protective measures, so the tool needs to be able to cope with them:

Countering WAF detection:

    # Method of WebScanner
    def stealth_request(self, url):
        # Per-request headers are merged over the session defaults
        headers = {
            'X-Forwarded-For': '127.0.0.1',
            'Referer': 'https://www.google.com/',
            'Accept': 'text/html,application/xhtml+xml',
            'Accept-Encoding': 'gzip, deflate'
        }
        return self.session.get(url, headers=headers)

Dynamic User-Agent rotation:

    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
        'Googlebot/2.1 (+http://www.google.com/bot.html)'
    ]

    # Method of WebScanner
    def random_agent_request(self, url):
        # Pick a different User-Agent for each request
        agent = random.choice(USER_AGENTS)
        return self.session.get(url, headers={'User-Agent': agent})

Smart path brute-forcing:

    # Method of WebScanner
    def smart_bruteforce(self, wordlist=None):
        if not wordlist:
            wordlist = [
                'admin', 'backup', 'config',
                'flag.txt', 'secret', 'login'
            ]
        found = {}
        for path in wordlist:
            resp = self.session.get(f"{self.target.rstrip('/')}/{path}")
            if resp.status_code == 200:
                found[path] = resp.text[:200]  # keep only a preview
        return found

These enhancements let the tool handle more complex CTF challenges and real environments.
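Path brute-forcing that trusts status code 200 breaks down on servers that return a custom "not found" page with a 200 status. A common countermeasure is to request a path that almost certainly does not exist, keep that body as a baseline, and discard any hit whose body is nearly identical to it. The sketch below uses difflib from the standard library; the function names, the 0.9 threshold, and the sample error page are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def looks_like_soft_404(body, baseline_body, threshold=0.9):
    """Return True when body is nearly identical to a known not-found page."""
    similarity = SequenceMatcher(None, body, baseline_body).ratio()
    return similarity >= threshold


def not_found_page(path):
    # Hypothetical custom error page that is served with status 200
    # and echoes the requested path back.
    return (
        "<html><head><title>Not Found</title></head><body>"
        "<h1>Page not found</h1><p>No such page: " + path + "</p></body></html>"
    )


baseline = not_found_page("/zz-probe")  # body of a path that should not exist
real_page = "User-agent: *\nDisallow: /hidden.txt\n"

print(looks_like_soft_404(not_found_page("/admin"), baseline))
print(looks_like_soft_404(real_page, baseline))
```

The first call reports a soft 404 and the second does not, so only the second response would be kept as a finding.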

6. Performance Optimization and Engineering Practice

When scanning large targets, performance becomes a key factor. The following optimizations help:

Multithreaded scanning:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    # Method of WebScanner
    def concurrent_scan(self, paths):
        results = {}
        with ThreadPoolExecutor(max_workers=5) as executor:
            # Map each future back to the path it is fetching
            futures = {
                executor.submit(self.session.get, f"{self.target}/{p}"): p
                for p in paths
            }
            for future in as_completed(futures):
                path = futures[future]
                try:
                    resp = future.result()
                    if resp.status_code == 200:
                        results[path] = resp.text[:100]
                except Exception as e:
                    print(f"Scan of {path} failed: {e}")
        return results

Caching:

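    from functools import lru_cache

    # Method of WebScanner. Note: lru_cache on a method makes self part of the
    # cache key and keeps the instance alive while its entries stay cached.
    @lru_cache(maxsize=100)
    def cached_request(self, url):
        return self.session.get(url)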
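Because lru_cache on a method stores the cache on the function rather than on the object, an alternative is to keep a plain dictionary on each instance. The sketch below is one such design under stated assumptions: CachedFetcher is a hypothetical class, and the fetch callable is injected so the example runs without a network (in the scanner it could be self.session.get).

```python
class CachedFetcher:
    """Cache responses per URL for the lifetime of one fetcher object."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: url -> response
        self._cache = {}     # url -> cached response

    def get(self, url):
        # Only call the underlying fetch on a cache miss
        if url not in self._cache:
            self._cache[url] = self._fetch(url)
        return self._cache[url]


# Stand-in for a real HTTP call, recording how often it is invoked.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"body of {url}"


fetcher = CachedFetcher(fake_fetch)
fetcher.get("http://target.example/robots.txt")
fetcher.get("http://target.example/robots.txt")
print(len(calls))
```

The cache disappears together with the fetcher object, which avoids holding scanner instances in memory longer than needed.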

Persisting results:

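    import json
    from datetime import datetime

    # Method of WebScanner
    def save_report(self, report, fmt='json'):
        # Timestamped file name so repeated scans do not overwrite each other
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"scan_report_{timestamp}.{fmt}"
        with open(filename, 'w', encoding='utf-8') as f:
            if fmt == 'json':
                json.dump(report, f, indent=2)
            else:
                # Plain-text fallback: one section per category
                for k, v in report.items():
                    f.write(f"== {k} ==\n{v}\n\n")
        return filename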

These engineering improvements make the tool better suited to real security assessments.

7. The Mindset Shift from CTF to Real-World Work

CTF challenges tend to simplify real scenarios, so the automation strategy needs adjusting:

Considerations for real environments:

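Scan rate control, to avoid triggering protection mechanisms
Legality checks, to make sure the test is authorized
Result verification, to prevent false positives
Logging, to support audit trails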
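The legality check in the list above can be partly enforced in code by refusing to scan any host that is not on an explicit allowlist. The sketch below is a minimal illustration: the function name is_in_scope and the allowlist contents are assumptions, and a real engagement would take the allowlist from the written authorization.

```python
from urllib.parse import urlsplit


def is_in_scope(url, allowed_hosts):
    """Return True only for http(s) URLs whose host is explicitly allowed."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # urlsplit lower-cases the hostname; it is None when the URL has no host
    host = parts.hostname or ""
    return host in {h.lower() for h in allowed_hosts}


allowed = {"challenge.ctf.show"}
print(is_in_scope("http://challenge.ctf.show/web1", allowed))
print(is_in_scope("http://other.example/web1", allowed))
```

Calling such a check at the start of every scan entry point turns an out-of-scope target into an immediate, visible failure.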

An enhanced scanning strategy:

    import time

    # Method of WebScanner
    def responsible_scan(self, rate_limit=1):
        """Run the scan modules responsibly, pausing between them."""
        start_time = time.time()
        report = {}
        for module in [
            self.check_html_comments,
            self.check_response_headers,
            self.scan_common_files
        ]:
            try:
                result = module()
                if result:
                    report[module.__name__] = result
                time.sleep(rate_limit)  # pause between modules
            except Exception as e:
                print(f"Module {module.__name__} failed: {e}")
        duration = time.time() - start_time
        print(f"Scan finished in {duration:.2f} seconds")
        return report

This approach is better suited to real environments: it preserves detection coverage while avoiding excessive load on the network.
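Note that responsible_scan pauses between modules, while a single module such as scan_common_files still sends its requests back to back. To space out individual requests, a small throttle can be called before each one. The sketch below is an assumption-laden illustration: the Throttle class is hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous call: sleep off the difference
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage sketch: call throttle.wait() right before each HTTP request.
throttle = Throttle(min_interval=0.01)
for path in ["robots.txt", "www.zip"]:
    throttle.wait()
    print(f"would request {path}")
```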

Challenges like CTFshow Web19 also require handling front-end encryption. The tool can be extended with decryption support:

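    from Crypto.Cipher import AES  # provided by the pycryptodome package

    def decrypt_aes(ciphertext, key, iv):
        """Decrypt hex-encoded AES-CBC ciphertext; key and iv are strings."""
        cipher = AES.new(key.encode(), AES.MODE_CBC, iv.encode())
        plaintext = cipher.decrypt(bytes.fromhex(ciphertext))
        # Assumes zero padding; PKCS#7-padded data needs a different unpadding step
        return plaintext.decode().rstrip('\0')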
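The rstrip('\0') call in decrypt_aes only removes zero padding. Front-end code may instead use PKCS#7, where every padding byte holds the padding length. The sketch below shows both unpadding steps side by side using only built-in types, so it runs without a crypto library; the function names are hypothetical and the 16-byte block size is the AES block size.

```python
def strip_zero_padding(data: bytes) -> bytes:
    """Remove trailing zero bytes (ambiguous if the plaintext ends in zeros)."""
    return data.rstrip(b"\x00")


def strip_pkcs7_padding(data: bytes, block_size: int = 16) -> bytes:
    """Validate and remove PKCS#7 padding from decrypted data."""
    if not data or len(data) % block_size != 0:
        raise ValueError("length is not a positive multiple of the block size")
    pad_len = data[-1]
    # The last pad_len bytes must all equal pad_len
    if pad_len < 1 or pad_len > block_size:
        raise ValueError("invalid PKCS#7 padding length")
    if data[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid PKCS#7 padding bytes")
    return data[:-pad_len]


zero_padded = b"hello" + b"\x00" * 11        # 16 bytes, zero padded
pkcs7_padded = b"hello" + bytes([11]) * 11   # 16 bytes, PKCS#7 padded

print(strip_zero_padding(zero_padded))
print(strip_pkcs7_padding(pkcs7_padded))
```

Which scheme applies depends on how the page's JavaScript configures its cipher, so it is worth reading that code before choosing.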

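Combining these techniques yields a genuinely capable automated information-gathering system rather than a throwaway tool for solving CTF challenges.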