
"写个爬虫要多久?"
以前你要:
分析网页结构(30 分钟) 写解析代码(1 小时) 处理反爬(2 小时) 调试修复(1 小时)
现在,用 OpenClaw + AI 代理,10 分钟搞定。
🎯 Hands-on goal
Task: automatically scrape GitHub Trending projects
Expected output:

```json
{
  "name": "openclaw/openclaw",
  "stars": "15.2k",
  "description": "AI automation framework",
  "language": "Python"
}
```

🔧 Setup
Install:
```shell
npm install -g openclaw
pip install playwright
playwright install
```

📝 Step 1: Configure OpenClaw
Create a configuration file, config.json:
{"name":"github-trending-crawler","description":"自动抓取 GitHub Trending","triggers":["每天 9:00"],"actions":[{"type":"browser","url":"https://github.com/trending","screenshot":true},{"type":"extract","selector":"article.Box-row","fields":["name","stars","description"]},{"type":"save","format":"json","path":"./output/trending.json"}]}🤖 步骤 2:让 AI 生成代码
Prompt:
```
Write a scraper with OpenClaw and Playwright:
- Target: https://github.com/trending
- Extract: project name, star count, description, language
- Output: JSON format
- Requirements: handle anti-bot measures, add delays, retry on errors
```

The AI-generated code:
```python
from playwright.async_api import async_playwright
import json
import asyncio

async def crawl_github_trending():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            args=['--no-sandbox']
        )
        page = await browser.new_page()

        # Visit the page
        await page.goto('https://github.com/trending', wait_until='networkidle')

        # Extract the data
        projects = await page.evaluate('''() => {
            const rows = document.querySelectorAll('article.Box-row');
            return Array.from(rows).map(row => ({
                name: row.querySelector('h2 a')?.textContent.trim(),
                stars: row.querySelector('[aria-label="Stars"]')?.textContent.trim(),
                description: row.querySelector('p')?.textContent.trim(),
                language: row.querySelector('[itemprop="programmingLanguage"]')?.textContent.trim()
            }));
        }''')

        await browser.close()
        return projects

# Run
projects = asyncio.run(crawl_github_trending())
with open('trending.json', 'w') as f:
    json.dump(projects, f, indent=2)
```

🪝 Step 3: Handle anti-bot measures
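One thing the prompt asked for that the generated code above skips is retrying on errors. A minimal sketch of a backoff-with-jitter wrapper; the retry count and delays are arbitrary choices:

```python
import asyncio
import random

async def with_retry(coro_fn, retries=3, base_delay=2.0):
    """Call an async function, retrying failures with exponential backoff plus jitter."""
    for attempt in range(1, retries + 1):
        try:
            return await coro_fn()
        except Exception as exc:
            if attempt == retries:
                raise  # out of attempts: let the caller see the real error
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"attempt {attempt} failed ({exc!r}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# usage: projects = asyncio.run(with_retry(crawl_github_trending))
```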
Common problems and their fixes:
```python
# 1. Set a User-Agent
await page.set_extra_http_headers({'User-Agent': 'Mozilla/5.0 ...'})

# 2. Add a random delay
import random
await asyncio.sleep(random.uniform(2, 5))

# 3. Use a proxy pool
proxy = {
    'server': 'http://proxy.example.com:8080',
    'username': 'user',
    'password': 'pass'
}
browser = await p.chromium.launch(proxy=proxy)

# 4. Handle CAPTCHAs (needs human intervention)
if await page.query_selector('#captcha'):
    print("⚠️ CAPTCHA detected, manual handling needed")
    await browser.close()
```

📊 Step 4: Clean the data
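A note before cleaning: GitHub renders star counts as display strings like "15.2k", so a bare int() cast would crash. A small standalone parser for the formats the page actually shows (a sketch; rounding through float is accurate enough at this scale):

```python
def parse_stars(raw):
    """'15.2k' -> 15200, '1,234' -> 1234, None/'' -> 0."""
    s = (raw or '').strip().replace(',', '')
    if s.endswith('k'):
        return int(round(float(s[:-1]) * 1000))
    return int(s) if s else 0

print(parse_stars('15.2k '))  # -> 15200
```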
Raw data:
{"name":" openclaw/openclaw ","stars":"15.2k ","description":null}清洗代码:
```python
def clean_data(projects):
    cleaned = []
    for p in projects:
        # "15.2k" would crash a plain int() cast, so handle the suffix explicitly
        stars = (p['stars'] or '').strip().replace(',', '')
        if stars.endswith('k'):
            stars = int(round(float(stars[:-1]) * 1000))
        else:
            stars = int(stars) if stars else 0
        cleaned.append({
            'name': (p['name'] or '').strip(),
            'stars': stars,
            'description': (p['description'] or '').strip(),
            'language': (p['language'] or 'Unknown').strip(),
        })
    return cleaned
```

📤 Step 5: Push the results
Push to Feishu/WeChat:
```python
import requests

def send_to_feishu(data, webhook_url):
    content = "🔥 GitHub Trending Top 5\n\n"
    for i, p in enumerate(data[:5], 1):
        content += f"{i}. {p['name']} ⭐{p['stars']}\n"
    requests.post(webhook_url, json={
        "msg_type": "text",
        "content": {"text": content}
    })
```

🎁 The full code
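The script below can also be scheduled without OpenClaw: the config's daily-9:00 trigger maps to an ordinary cron entry. Paths here are placeholders:

```shell
# run the crawler every day at 09:00; adjust paths to your setup
0 9 * * * cd /path/to/project && python3 crawler.py >> crawl.log 2>&1
```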
crawler.py:
```python
from playwright.async_api import async_playwright
import json
import asyncio
import random

class GithubTrendingCrawler:
    def __init__(self):
        self.url = "https://github.com/trending"

    async def crawl(self):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(self.url, wait_until='networkidle')
            await asyncio.sleep(random.uniform(2, 5))
            projects = await self.extract_data(page)
            await browser.close()
            return self.clean_data(projects)

    async def extract_data(self, page):
        return await page.evaluate('''() => {
            const rows = document.querySelectorAll('article.Box-row');
            return Array.from(rows).map(row => ({
                name: row.querySelector('h2 a')?.textContent,
                stars: row.querySelector('[aria-label="Stars"]')?.textContent,
                description: row.querySelector('p')?.textContent,
                language: row.querySelector('[itemprop="programmingLanguage"]')?.textContent
            }));
        }''')

    def clean_data(self, projects):
        # cleaning logic from Step 4...
        pass

# Run
if __name__ == "__main__":
    crawler = GithubTrendingCrawler()
    data = asyncio.run(crawler.crawl())
    print(json.dumps(data, indent=2))
```

📌 Summary
Traditional scraping:
Analyze → code → debug → fix: 3-4 hours
AI-agent scraping:
Describe the need → AI generates → tweak → run: 10 minutes
The keys:
✅ Let the AI write the baseline code ✅ Handle anti-bot measures and edge cases yourself ✅ Add monitoring and alerting
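That last point can start as a simple guard on the crawl result; the threshold and wording here are placeholders, and the alert would go out through the same webhook as in Step 5:

```python
def check_crawl_health(projects, min_expected=5):
    """Return an alert string if the crawl looks broken, else None."""
    if not projects:
        return "ALERT: crawl returned 0 projects (selector change or block?)"
    if len(projects) < min_expected:
        return f"ALERT: only {len(projects)} projects (expected >= {min_expected})"
    return None

print(check_crawl_health([]))  # an ALERT string: empty result usually means the selector broke
```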
Have you used AI to write a scraper? Let's chat in the comments~

夜雨聆风