RAG File Parsing: Full-Format Text Extraction for PDF / Word / Excel / HTML

This is part 01 of the [AI Topics in Depth] series.

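Introduction

You want to build a knowledge base so that an AI can answer questions from your own documents (RAG). What is the first step?

Not picking a vector database, and not tuning an embedding model. The first step is turning files into plain text.

It sounds simple, but in practice there are plenty of pitfalls: tables in PDFs come out scrambled, nested styles in Word lose key information, merged cells in Excel turn into a mess, and HTML pages bury the body text under ads.

Today's goal: implement a unified FileParser class in Python that takes any PDF / Word / Excel / HTML file and outputs structured plain text.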

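Why file parsing is the first hurdle in RAG

The full RAG pipeline:

File → parse to text → chunk → Embedding → store in vector database → retrieve → feed to LLM → generate answer

Parsing quality sets the ceiling for every later step. Garbage in, garbage out.

Format	Parsing difficulty
PDF	★★★★★ (hardest; it is essentially a page-layout description format)
Word (.docx)	★★★
Excel (.xlsx)	★★★
HTML	★★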

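Unified output format

Define one structure first, so that whatever format comes in, the output has the same shape:

from dataclasses import dataclass

@dataclass
class ParsedDocument:
    filename: str
    content: str        # full plain text, all pages merged
    file_type: str      # pdf / docx / xlsx / html
    metadata: dict
    pages: list[str]    # text split per page / sheet

The pages field matters: later chunking can use the page information to locate a passage, so an answer can tell the user "Source: page 3".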
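To make the point about pages concrete, here is a minimal sketch of how page numbers could travel with each chunk. The chunks_with_source helper, the dict layout, and the fixed 500-character window are assumptions for illustration only; real chunking strategies are a separate topic.

```python
from dataclasses import dataclass


@dataclass
class ParsedDocument:
    filename: str
    content: str
    file_type: str
    metadata: dict
    pages: list[str]


def chunks_with_source(doc, max_chars=500):
    """Split each page into fixed-size pieces, tagging each with its page number."""
    chunks = []
    for page_number, text in enumerate(doc.pages, start=1):
        # An empty page produces no chunks but still advances the page number
        for start in range(0, len(text), max_chars):
            chunks.append({
                "text": text[start:start + max_chars],
                "source": f"{doc.filename}, page {page_number}",
            })
    return chunks
```

This only works if pages[i] really corresponds to page i + 1, which is why the parsers should not silently drop empty pages from the list.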

PDF parsing

Tool choice: PyMuPDF as the workhorse, pdfplumber for tables.

pip install PyMuPDF pdfplumber

Basic text extraction (PyMuPDF)

import os
import fitz  # PyMuPDF

def parse_pdf(filepath: str) -> ParsedDocument:
    pages = []
    with fitz.open(filepath) as doc:
        for page in doc:
            # Keep empty pages too, so pages[i] is always page i + 1
            pages.append(page.get_text("text").strip())

    return ParsedDocument(
        filename=os.path.basename(filepath),
        # Skip empty pages when building the merged text
        content="\n\n".join(p for p in pages if p),
        file_type="pdf",
        metadata={"page_count": len(pages)},
        pages=pages,
    )

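Table extraction (pdfplumber)

Plain text extraction turns tables into a jumble of misaligned text. pdfplumber can recognize table structure and return each table as a list of rows, which a small helper then converts to Markdown:

def table_to_markdown(table: list[list]) -> str:
    if not table:
        return ""
    # None becomes an empty cell; newlines and pipes would break the row layout
    cleaned = [
        [str(c).replace("\n", " ").replace("|", "\\|") if c is not None else "" for c in row]
        for row in table
    ]
    header = "| " + " | ".join(cleaned[0]) + " |"
    separator = "| " + " | ".join(["---"] * len(cleaned[0])) + " |"
    rows = ["| " + " | ".join(row) + " |" for row in cleaned[1:]]
    return "\n".join([header, separator] + rows)

Converting tables to Markdown pays off: LLMs generally understand Markdown tables far better than cells concatenated into plain text.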
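The helper above only formats rows; the extraction itself is not shown. A minimal sketch of that step, assuming pdfplumber's page.extract_tables() with default settings, could look like this. The extract_pdf_tables name and the page-number-keyed dict are illustrative choices, and the Markdown helper is repeated in simplified form so the block runs on its own.

```python
def table_to_markdown(table):
    """Simplified copy of the article's helper so this block is self-contained."""
    if not table:
        return ""
    cleaned = [["" if c is None else str(c).replace("\n", " ") for c in row] for row in table]
    header = "| " + " | ".join(cleaned[0]) + " |"
    separator = "| " + " | ".join(["---"] * len(cleaned[0])) + " |"
    rows = ["| " + " | ".join(row) + " |" for row in cleaned[1:]]
    return "\n".join([header, separator] + rows)


def extract_pdf_tables(filepath):
    """Return {page_number: [markdown_table, ...]} for pages that contain tables."""
    import pdfplumber  # imported lazily so the helper above has no dependency

    result = {}
    with pdfplumber.open(filepath) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            # extract_tables() yields one list of rows per detected table
            tables = [table_to_markdown(t) for t in page.extract_tables() if t]
            if tables:
                result[number] = tables
    return result
```

Keying the result by page number makes it straightforward to append each table to the matching entry of pages from the PyMuPDF pass. Detection quality depends heavily on whether the PDF draws ruling lines, so the default settings may need tuning for borderless tables.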

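Scanned PDFs

If the PDF is a scan, get_text() returns an empty string, so you need an OCR branch (PyMuPDF renders the page to an image → pytesseract recognizes it). In a real project, try get_text() first and fall back to OCR only when it comes back empty.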
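The "try get_text() first, then OCR" flow described above can be sketched as follows. This is an illustration under assumptions: the 20-character threshold, the 300 dpi render, and the chi_sim+eng language pack are arbitrary choices, Tesseract and its language data must be installed separately, and get_pixmap(dpi=...) needs a reasonably recent PyMuPDF.

```python
import io

MIN_CHARS = 20  # assumed threshold below which a page is treated as scanned


def ocr_page(page, lang="chi_sim+eng", dpi=300):
    """Render a PyMuPDF page to PNG bytes and run Tesseract on the image."""
    import pytesseract          # lazy imports: only needed on the OCR path
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image, lang=lang).strip()


def page_text(page, ocr=ocr_page, min_chars=MIN_CHARS):
    """Return the embedded text layer if it looks real, otherwise OCR the page."""
    text = page.get_text("text").strip()
    if len(text) >= min_chars:
        return text
    # Fall back to OCR; keep the short text layer if OCR finds nothing
    return ocr(page) or text
```

Using a character threshold rather than a strict empty check is a judgment call: some scans carry a tiny text layer (a page number or a watermark) that would otherwise suppress OCR.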

Word parsing

Under the hood a .docx file is XML, so its structure is clearer than a PDF's, and python-docx handles it well.

pip install python-docx

import os
from docx import Document

def parse_docx(filepath: str) -> ParsedDocument:
    doc = Document(filepath)
    paragraphs = []

    for para in doc.paragraphs:
        text = para.text.strip()
        if not text:
            continue
        # Convert heading styles to Markdown to preserve document structure
        style = para.style.name
        level = style.replace("Heading ", "")
        if style.startswith("Heading ") and level.isdigit():
            text = f"{'#' * int(level)} {text}"
        paragraphs.append(text)

    # Extract tables too (they land after all paragraphs, not in their original position)
    for table in doc.tables:
        rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
        paragraphs.append(table_to_markdown(rows))

    content = "\n\n".join(paragraphs)
    return ParsedDocument(
        filename=os.path.basename(filepath),
        content=content,
        file_type="docx",
        metadata={},
        pages=[content],  # .docx has no fixed pages; the whole document is one entry
    )

Key point: converting Word's Heading 1 / Heading 2 to # / ## preserves the document structure, which helps the LLM a lot.
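The heading conversion is easy to get subtly wrong: style names such as "Heading 1 Char" or "Title" also exist, and Word allows nine heading levels while Markdown stops at six. A minimal sketch of a stricter mapping, assuming English built-in style names (the heading_to_markdown name is hypothetical):

```python
import re

# Matches exactly "Heading 1" .. "Heading 9", nothing longer
_HEADING = re.compile(r"^Heading ([1-9])$")


def heading_to_markdown(style_name, text):
    """Prefix text with Markdown '#' marks if style_name is a Word heading style."""
    match = _HEADING.match(style_name or "")
    if match is None:
        return text
    level = min(int(match.group(1)), 6)  # Markdown has only six heading levels
    return "#" * level + " " + text
```

Inside the paragraph loop this would be called as heading_to_markdown(para.style.name, para.text.strip()). Documents that use custom or localized style names would need their own mapping.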

Excel parsing

What makes Excel special: it is already structured data, so the job is to turn the table data into a text format an LLM can understand.

pip install openpyxl

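import os
from openpyxl import load_workbook

def parse_xlsx(filepath: str) -> ParsedDocument:
    wb = load_workbook(filepath, read_only=True, data_only=True)
    pages = []

    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        rows = []
        for row in ws.iter_rows(values_only=True):
            # Skip rows that are completely empty
            if all(cell is None for cell in row):
                continue
            rows.append([str(cell) if cell is not None else "" for cell in row])

        if rows:
            md = table_to_markdown(rows)
            pages.append(f"## Sheet: {sheet_name}\n\n{md}")

    wb.close()
    return ParsedDocument(
        filename=os.path.basename(filepath),
        content="\n\n".join(pages),
        file_type="xlsx",
        metadata={"sheet_count": len(pages)},  # counts non-empty sheets only
        pages=pages,
    )

Notes:

read_only=True: turn this on for large files; rows are streamed instead of loading the whole workbook, which otherwise can exhaust memory.
data_only=True: reads the cached result of each formula; without it you get the formula text, such as =SUM(A1:A10). If the file was never calculated and saved by a spreadsheet application, the cached value is missing and the cell reads as None.
Merged cells need special handling (fill the cells left empty by the merge with the top-left value).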
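The merged-cell note above can be sketched as follows. The fill itself is a pure function over a grid; the loader is a hedged example that opens the workbook without read_only=True, because as far as I know merged-range information is not exposed on read-only worksheets. Both function names are hypothetical, and the sketch assumes the grid is rectangular and covers every merged range.

```python
def fill_merged_cells(grid, merged_ranges):
    """Copy each merged range's top-left value into every cell it covers.

    grid: list of row lists (0-based indexing).
    merged_ranges: 1-based (min_row, min_col, max_row, max_col) tuples.
    """
    filled = [list(row) for row in grid]  # work on a copy, leave the input intact
    for min_row, min_col, max_row, max_col in merged_ranges:
        value = filled[min_row - 1][min_col - 1]
        for r in range(min_row - 1, max_row):
            for c in range(min_col - 1, max_col):
                filled[r][c] = value
    return filled


def load_sheet_filled(filepath, sheet_name):
    """Read one sheet and fill merged cells (normal mode, not read-only)."""
    from openpyxl import load_workbook  # lazy import; only needed for real files

    wb = load_workbook(filepath, data_only=True)
    ws = wb[sheet_name]
    grid = [list(row) for row in ws.iter_rows(min_row=1, min_col=1, values_only=True)]
    ranges = [(r.min_row, r.min_col, r.max_row, r.max_col) for r in ws.merged_cells.ranges]
    wb.close()
    return fill_merged_cells(grid, ranges)
```

The trade-off is memory: sheets with merged regions have to be loaded in full, so one option is to use the streaming read-only path by default and switch to this one only for files known to contain merges.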

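HTML parsing

The hard part is denoising: the "body text" that actually matters may be only 20% of the HTML.

readability-lxml is recommended. It is a Python port of the Readability algorithm, the same family of heuristics behind Firefox Reader View, and it identifies the main content region automatically:

pip install readability-lxml beautifulsoup4 lxml

import os
from bs4 import BeautifulSoup
from readability import Document as ReadabilityDocument

def parse_html_readability(filepath: str) -> ParsedDocument:
    with open(filepath, encoding="utf-8") as f:
        html = f.read()

    doc = ReadabilityDocument(html)
    # summary() returns the main-content HTML; strip the tags to get text
    soup = BeautifulSoup(doc.summary(), "lxml")
    content = "\n".join(
        line for line in soup.get_text(separator="\n", strip=True).split("\n") if line.strip()
    )
    return ParsedDocument(
        filename=os.path.basename(filepath),
        content=content,
        file_type="html",
        metadata={"title": doc.title()},
        pages=[content],  # a web page has no pages; the whole body is one entry
    )

For a live web page, just fetch the html with httpx.get(url); the rest of the logic is the same.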
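For the live-page case, a minimal fetch sketch might look like this. The fetch_html name, the 10-second timeout, and the optional client parameter are assumptions for illustration; the returned string would be passed to ReadabilityDocument in place of the file contents.

```python
def _get_text(client, url):
    """Issue a GET and return the decoded body, raising on 4xx / 5xx."""
    response = client.get(url)
    response.raise_for_status()
    return response.text


def fetch_html(url, client=None, timeout=10.0):
    """Download a page and return its HTML as text.

    client: any object with an httpx-style .get(url); if omitted, a
    temporary httpx.Client is created and closed again.
    """
    if client is not None:
        return _get_text(client, url)

    import httpx  # lazy import, so a stub client can be injected without httpx

    # httpx does not follow redirects unless asked to
    with httpx.Client(timeout=timeout, follow_redirects=True) as http:
        return _get_text(http, url)
```

Passing a shared client is worthwhile when fetching many pages, since it reuses connections; the default path is only meant for one-off calls.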

Unified entry point: the FileParser class

import os
import re

class FileParser:
    SUPPORTED_TYPES = {".pdf": "pdf", ".docx": "docx", ".xlsx": "xlsx", ".html": "html"}

    def parse(self, filepath: str) -> ParsedDocument:
        ext = os.path.splitext(filepath)[1].lower()
        file_type = self.SUPPORTED_TYPES.get(ext)
        if not file_type:
            raise ValueError(f"Unsupported file format: {ext}")

        parser_map = {"pdf": parse_pdf, "docx": parse_docx, "xlsx": parse_xlsx, "html": parse_html_readability}
        return self._post_process(parser_map[file_type](filepath))

    def _post_process(self, doc: ParsedDocument) -> ParsedDocument:
        # Normalize line endings and collapse runs of blank lines
        content = doc.content.replace("\r\n", "\n")
        doc.content = re.sub(r"\n{3,}", "\n\n", content).strip()
        return doc

    def parse_batch(self, filepaths: list[str]) -> list[ParsedDocument]:
        results = []
        for fp in filepaths:
            try:
                results.append(self.parse(fp))
            except Exception as e:
                # One bad file should not abort the whole batch
                print(f"Parse failed [{fp}]: {e}")
        return results

Usage:

parser = FileParser()
doc = parser.parse("./docs/技术方案.pdf")
# ParsedDocument has no total_chars / page_count fields, so derive them
print(f"Characters: {len(doc.content)}, pages: {len(doc.pages)}")

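Production notes

File size limit: large files eat memory; set a 50 MB cap.
Timeout control: some PDFs can hang during parsing (for example, a large scan going through OCR), so cap each parse at 60 seconds. asyncio.wait_for helps only if the parse runs off the event loop (for example via asyncio.to_thread), and even then it cannot stop the blocked worker thread; for a hard limit, run the parse in a subprocess that can be killed.
Encoding detection: HTML files are not always UTF-8; detect the encoding with chardet.
Security checks: restrict uploads to a whitelist of file extensions, and use tempfile for temporary files, cleaning them up promptly.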
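Two of the checks above are small enough to sketch directly. This is an illustration under assumptions: validate_upload and decode_html are hypothetical names, the whitelist mirrors the four formats covered here, and the UTF-8-first strategy is one reasonable ordering rather than the only one.

```python
import os

MAX_BYTES = 50 * 1024 * 1024  # the 50 MB cap from the list above
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".html"}


def validate_upload(filepath, max_bytes=MAX_BYTES):
    """Raise ValueError if the file has a disallowed extension or is too large."""
    ext = os.path.splitext(filepath)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {ext!r}")
    size = os.path.getsize(filepath)
    if size > max_bytes:
        raise ValueError(f"file too large: {size} bytes (limit {max_bytes})")


def decode_html(raw):
    """Decode HTML bytes: try UTF-8 first, then fall back to chardet's guess."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        import chardet  # lazy import; only needed for non-UTF-8 input

        guess = chardet.detect(raw)["encoding"] or "utf-8"
        # errors="replace" keeps parsing alive even if the guess is wrong
        return raw.decode(guess, errors="replace")
```

Checking the size with os.path.getsize before opening the file means an oversized upload is rejected without reading it. To use decode_html, the HTML parser would open the file in binary mode ("rb") instead of assuming encoding="utf-8". An extension check alone does not prove the content matches, so it complements sandboxing rather than replacing it.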

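Summary

File parsing is the first hurdle in RAG: parsing quality sets the ceiling for retrieval and answers.
Every format has its own pitfalls: PDF is the hardest, and HTML is mostly about denoising.
A unified output format (ParsedDocument) is the key design decision: chunking, indexing, and retrieval all build on this structure.
Tool choices: PyMuPDF + pdfplumber for PDF, python-docx for Word, openpyxl for Excel, readability-lxml for HTML.
Tables to Markdown: LLMs generally understand Markdown tables far better than cells concatenated into plain text.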

Next in the [AI Topics in Depth] series: RAG document chunking strategies, comparing fixed-length, recursive, and semantic splitting.