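PDF Processing Guide
Overview
This skill covers basic PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see reference.md. If you need to fill out a PDF form, read forms.md and follow its instructions.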

Quick Start
    from pypdf import PdfReader, PdfWriter

    # Read a PDF
    reader = PdfReader("document.pdf")
    print(f"Pages: {len(reader.pages)}")

    # Extract text from every page
    text = ""
    for page in reader.pages:
        text += page.extract_text()
Python Libraries
pypdf - Basic Operations
Merge PDFs
    from pypdf import PdfWriter, PdfReader

    writer = PdfWriter()
    for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
        reader = PdfReader(pdf_file)
        for page in reader.pages:
            writer.add_page(page)

    with open("merged.pdf", "wb") as output:
        writer.write(output)
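Split a PDF
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("input.pdf")
    # Write each page to its own file: page_1.pdf, page_2.pdf, ...
    for i, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        with open(f"page_{i+1}.pdf", "wb") as output:
            writer.write(output)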
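
Writing one file per page is not always what is wanted. The following is a minimal sketch of copying a chosen set of pages into a single new file. The helper names `parse_page_ranges` and `extract_pages` and the "1-3,5" range syntax are assumptions made for illustration, not part of pypdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        indices.extend(range(start - 1, end))
    return indices


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Copy the pages named by spec from src into a new file dst."""
    # Imported here so the range parser can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

Under these assumptions, `extract_pages("input.pdf", "subset.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to subset.pdf. Validating the range before touching the file keeps an out-of-range request from producing a partial output.
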
Extract Metadata
    from pypdf import PdfReader

    reader = PdfReader("document.pdf")
    meta = reader.metadata
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")
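
In pypdf, `reader.metadata` is `None` when a file has no document information dictionary, and individual fields such as `title` can also be `None`. The sketch below shows one way to guard against both cases. The helper `describe_metadata` and the "(not set)" placeholder are hypothetical, and a `SimpleNamespace` stands in for the real metadata object so the example runs without a PDF.

```python
from types import SimpleNamespace

FIELDS = ("title", "author", "subject", "creator")


def describe_metadata(meta) -> dict:
    """Return the four common fields, substituting a placeholder for missing ones."""
    if meta is None:
        # The file carries no metadata at all.
        return {name: "(not set)" for name in FIELDS}
    return {name: getattr(meta, name, None) or "(not set)" for name in FIELDS}


# Stand-in for reader.metadata; with pypdf you would pass reader.metadata directly.
sample = SimpleNamespace(title="Annual Report", author=None, subject="", creator="Writer")
for name, value in describe_metadata(sample).items():
    print(f"{name.capitalize()}: {value}")
```
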
Rotate Pages
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("input.pdf")
    writer = PdfWriter()

    page = reader.pages[0]
    page.rotate(90)  # Rotate 90 degrees clockwise
    writer.add_page(page)

    with open("rotated.pdf", "wb") as output:
        writer.write(output)
pdfplumber - Text and Table Extraction
Extract Text with Layout
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            print(text)
Extract Tables
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for j, table in enumerate(tables):
                print(f"Table {j+1} on page {i+1}:")
                for row in table:
                    print(row)
Advanced Table Extraction
    import pandas as pd
    import pdfplumber

    with pdfplumber.open("document.pdf") as pdf:
        all_tables = []
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                if table:  # Skip empty tables
                    # The first row is treated as the header
                    df = pd.DataFrame(table[1:], columns=table[0])
                    all_tables.append(df)

    # Combine all tables (writing .xlsx needs an Excel engine such as openpyxl)
    if all_tables:
        combined_df = pd.concat(all_tables, ignore_index=True)
        combined_df.to_excel("extracted_tables.xlsx", index=False)
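
Tables returned by `page.extract_tables()` are lists of rows, and cells can be `None` where pdfplumber found no text. Passing such a header row straight to pandas can yield `None` column names. The following sketch cleans one table before it is converted. The helper `table_to_records` and the `column_N` fallback names are assumptions for illustration.

```python
def table_to_records(table):
    """Convert a list-of-rows table into dicts keyed by a cleaned header row."""
    if not table:
        return []
    # Replace missing or blank header cells with a positional name.
    header = [
        (cell or "").strip() or f"column_{position}"
        for position, cell in enumerate(table[0], start=1)
    ]
    records = []
    for row in table[1:]:
        if not any(row):
            continue  # Skip rows whose cells are all None or empty
        records.append(
            {name: (cell or "").strip() for name, cell in zip(header, row)}
        )
    return records


# A small hand-written table in the shape extract_tables() returns.
raw = [["Name", None], ["Widget", " 12 "], [None, None]]
print(table_to_records(raw))
```

The resulting list of dicts can be passed to `pd.DataFrame(...)` in place of the `table[1:]` and `columns=table[0]` pair.
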
reportlab - Create PDFs
Basic PDF Creation
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    c = canvas.Canvas("hello.pdf", pagesize=letter)
    width, height = letter

    # Add text (the origin is the bottom-left corner, so subtract from height)
    c.drawString(100, height - 100, "Hello World!")
    c.drawString(100, height - 120, "This is a PDF created with reportlab")

    # Add a line
    c.line(100, height - 140, 400, height - 140)

    # Save
    c.save()
Create a Multi-Page PDF
    from reportlab.lib.pagesizes import letter
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet

    doc = SimpleDocTemplate("report.pdf", pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Add content
    title = Paragraph("Report Title", styles['Title'])
    story.append(title)
    story.append(Spacer(1, 12))

    body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
    story.append(body)
    story.append(PageBreak())

    # Page 2
    story.append(Paragraph("Page 2", styles['Heading1']))
    story.append(Paragraph("Content for page 2", styles['Normal']))

    # Build the PDF
    doc.build(story)
Command-Line Tools
pdftotext (poppler-utils)
    # Extract text
    pdftotext input.pdf output.txt

    # Extract text and preserve layout
    pdftotext -layout input.pdf output.txt

    # Extract specific pages (pages 1-5)
    pdftotext -f 1 -l 5 input.pdf output.txt
qpdf
    # Merge PDFs
    qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

    # Split pages
    qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
    qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

    # Rotate pages (rotate page 1 by 90 degrees)
    qpdf input.pdf output.pdf --rotate=+90:1

    # Remove a password
    qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
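
When qpdf is driven from a Python script, building the argument list explicitly avoids shell quoting problems with file names that contain spaces. This is a minimal sketch that reuses only the `--empty --pages ... --` form shown above. The helper names `qpdf_merge_command` and `merge_with_qpdf` are hypothetical.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argv list for merging the given files into output."""
    if not inputs:
        raise ValueError("at least one input file is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def merge_with_qpdf(inputs, output):
    """Run the merge, failing early if qpdf is not on PATH."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # check=True raises CalledProcessError if qpdf exits with an error status.
    subprocess.run(qpdf_merge_command(inputs, output), check=True)


print(qpdf_merge_command(["file1.pdf", "my file2.pdf"], "merged.pdf"))
```
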
pdftk (if available)
    # Merge
    pdftk file1.pdf file2.pdf cat output merged.pdf

    # Split into one file per page
    pdftk input.pdf burst

    # Rotate page 1 by 90 degrees clockwise
    pdftk input.pdf rotate 1east output rotated.pdf
Common Tasks
Extract Text from Scanned PDFs
    # Requires: pip install pytesseract pdf2image
    import pytesseract
    from pdf2image import convert_from_path

    # Convert the PDF to images, one per page
    images = convert_from_path('scanned.pdf')

    # Run OCR on each page
    text = ""
    for i, image in enumerate(images):
        text += f"Page {i+1}:\n"
        text += pytesseract.image_to_string(image)
        text += "\n\n"

    print(text)
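
OCR is slow, so it can be worth checking first whether a page already has an embedded text layer. The sketch below flags pages whose extracted text is very short. The helper names and the 20-character threshold are assumptions chosen for illustration; a suitable threshold depends on the documents involved.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Return True when a page yielded too little text to count as a text layer."""
    return len((extracted_text or "").strip()) < min_chars


def pages_needing_ocr(pdf_path, min_chars=20):
    """List 1-based page numbers whose embedded text is below the threshold."""
    # Imported lazily; only this function needs pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [
        number
        for number, page in enumerate(reader.pages, start=1)
        if needs_ocr(page.extract_text(), min_chars)
    ]
```

Only the pages returned by `pages_needing_ocr("scanned.pdf")` would then need to go through the image conversion and `pytesseract` steps.
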
Add a Watermark
    from pypdf import PdfReader, PdfWriter

    # Create a watermark (or load an existing one)
    watermark = PdfReader("watermark.pdf").pages[0]

    # Apply it to every page
    reader = PdfReader("document.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        page.merge_page(watermark)
        writer.add_page(page)

    with open("watermarked.pdf", "wb") as output:
        writer.write(output)
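Extract Images
    # Using pdfimages (poppler-utils)
    pdfimages -j input.pdf output_prefix

    # This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, and so on.
    # With -j, only JPEG-encoded images are saved as .jpg; others are written as .ppm or .pbm.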
Password Protection
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader("input.pdf")
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)

    # Add a password (user password first, then owner password)
    writer.encrypt("userpassword", "ownerpassword")

    with open("encrypted.pdf", "wb") as output:
        writer.write(output)
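Quick Reference
Task | Tool | Command/code
Merge PDFs | pypdf | writer.add_page(page)
Extract text | pypdf or pdfplumber | page.extract_text()
Extract tables | pdfplumber | page.extract_tables()
Merge from the command line | qpdf | qpdf --empty --pages ...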
Next Steps
- For advanced pypdfium2 usage, see reference.md
- For JavaScript libraries (pdf-lib), see reference.md
- If you need to fill out a PDF form, follow the instructions in forms.md
- For troubleshooting guides, see reference.md
Install Command
    npx skills add https://github.com/composiohq/awesome-claude-skills --skill pdf
Repository
https://github.com/composiohq/awesome-claude-skills
First Seen
Jan 23, 2026