Multi-level Inheritance | A Document Parser Hierarchy - 夜雨聆风


"""任务：实现多层继承的文档处理器场景：基础处理器 -> 文本处理器 -> 特定格式处理器"""import osimport refrom abc import ABC, abstractmethodclass BaseProcessor:    """基础处理器"""    def validate(self, filepath: str) -> bool:        """验证文件存在"""        if not os.path.exists(filepath):            raise FileNotFoundError(f'文件不存在: {filepath}')        return True    @abstractmethod    def process(self, filepath: str):        """抽象方法：子类必须实现"""        pass    def get_file_info(self, filepath: str) -> dict:        """获取文件信息"""        return {            'path': filepath,            'size': os.path.getsize(filepath) if os.path.exists(filepath) else 0,            'name': os.path.basename(filepath)        }class TextProcessor(BaseProcessor):    """文本处理器（中间层）"""    def clean_text(self, text: str) -> str:        """清洗文本"""        text = re.sub(r'[ \t]+', ' ', text)        return text.strip()    def process(self, filepath):        """实现基础处理流程"""        self.validate(filepath)        try:            with open(filepath, 'r', encoding='utf-8')as f:                text = f.read()        except:            text = f'模拟读取 {filepath}'        return self.clean_text(text)class MarkdownProcessor(TextProcessor):    """Markdown处理器"""    def extract_headings(self, text):        """提取标题"""        headings = re.findall(r'^#+\s+(.+)$', text, re.MULTILINE)        return headings    def extract_code_blocks(self, text: str) -> list:        """提取代码块"""        import re        code_blocks = re.findall(r'```[\w]*\n(.*?)```', text, re.DOTALL)        return code_blocks    def process(self, filepath):        """重写处理逻辑"""        cleaned_text = super().process(filepath)        # 额外处理        headings = self.extract_headings(cleaned_text)        code_blocks = self.extract_code_blocks(cleaned_text)        # 获取文件信息（调用祖父类方法）        file_info = self.get_file_info(filepath)        return {            'text': cleaned_text,            'headings': headings,            'code_blocks': code_blocks,            'file_info': file_info        }print("=== 测试多层继承 
===\n")# 创建测试文件with open('test.md', 'w', encoding='utf-8') as f:    f.write("""# 标题1这是一段文本。## 标题2更多内容。```python""")processor = MarkdownProcessor()assert isinstance(processor, MarkdownProcessor) assert isinstance(processor, TextProcessor) assert isinstance(processor, BaseProcessor)result = processor.process('test.md')print(f"提取的标题: {result['headings']}") print(f"代码块数: {len(result['code_blocks'])}") print(f"文件信息: {result['file_info']}")assert len(result['headings']) == 2 assert result['headings'][0] == '标题1'os.remove('test.md')print("\n多层继承测试通过\n")
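One detail in the listing is worth checking: the abstractmethod decorator is only enforced when the class is built with ABCMeta, which is what inheriting from ABC does. The sketch below uses two hypothetical toy classes, PlainBase and AbstractBase (neither is part of the original code), to show the difference.

```python
from abc import ABC, abstractmethod


class PlainBase:
    """Hypothetical base that does NOT inherit from ABC."""

    @abstractmethod
    def process(self, filepath: str):
        pass


class AbstractBase(ABC):
    """Hypothetical base that inherits from ABC."""

    @abstractmethod
    def process(self, filepath: str):
        pass


# Without ABC the decorator is not enforced, so this succeeds.
plain = PlainBase()

# With ABC, instantiating a class that still has abstract methods fails.
try:
    AbstractBase()
    abc_enforced = False
except TypeError:
    abc_enforced = True

print(abc_enforced)
```

So if the goal is to force every subclass to implement process, the base class should derive from ABC.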

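I. Scenario

This is a prototype of the base architecture for a multi-level document parsing system. It could serve as a starting point for parsing blog posts, generating a table of contents automatically, extracting code snippets, or parsing Markdown files in general.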
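As one illustration of the table-of-contents use case, here is a minimal sketch. The helper name build_toc and the two-space indent per heading level are my own assumptions, not part of the original code. It captures the run of # characters as well, so the heading level is known.

```python
import re


def build_toc(markdown_text: str) -> list[str]:
    """Return one indented TOC line per Markdown heading (hypothetical helper)."""
    toc = []
    # Group 1 is the run of '#', group 2 is the heading text.
    for hashes, title in re.findall(r'^(#+)[ \t]+(.+)$', markdown_text, re.MULTILINE):
        level = len(hashes)
        toc.append('  ' * (level - 1) + '- ' + title.strip())
    return toc


sample = "# Guide\nIntro text.\n## Install\nSteps.\n## Usage\n"
toc_lines = build_toc(sample)
print("\n".join(toc_lines))
```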

II. My understanding / where I first got stuck

1. The overridden process method:

def process(self, filepath):
    cleaned_text = super().process(filepath)  # call the parent's processing
    headings = self.extract_headings(cleaned_text)  # extract headings
    code_blocks = self.extract_code_blocks(cleaned_text)  # extract code blocks
    file_info = self.get_file_info(filepath)  # file info
    return {
        'text': cleaned_text,
        'headings': headings,
        'code_blocks': code_blocks,
        'file_info': file_info
    }

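In plain words:

super().process: first calls the parent, TextProcessor, to read the file and clean the text
self.extract_headings: MarkdownProcessor itself extracts the headings
self.extract_code_blocks: MarkdownProcessor itself extracts the code blocks
self.get_file_info: calls the method defined on the grandparent, BaseProcessor, to get the file info (it is inherited, so no super() is needed)
return: packs all the results into one dictionary and returns it

In one sentence: let the parent class read and clean the text first, then extract the headings and code blocks on top of that, and return everything together.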
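The call order described above can be made visible with a stripped-down sketch. The classes Grandparent, Parent, and Child and the calls list are hypothetical stand-ins for the three processors; they only record which method ran.

```python
calls = []  # records the order in which methods run


class Grandparent:
    def get_file_info(self, filepath):
        calls.append('Grandparent.get_file_info')
        return {'path': filepath}


class Parent(Grandparent):
    def process(self, filepath):
        calls.append('Parent.process')
        return 'cleaned text'


class Child(Parent):
    def process(self, filepath):
        calls.append('Child.process')
        text = super().process(filepath)     # runs Parent.process
        info = self.get_file_info(filepath)  # found on Grandparent via the MRO
        return {'text': text, 'file_info': info}


result = Child().process('demo.md')
print(calls)
print([cls.__name__ for cls in Child.__mro__])
```

The second print shows the method resolution order, which is the list Python walks when it looks up get_file_info on a Child instance.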

2. The regular expressions explained

headings = re.findall(r'^#+\s+(.+)$', text, re.MULTILINE)

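Extracts Markdown headings: for every line of the form "# Title", it pulls out the heading text.

^ = start of a line
#+ = one or more # characters
\s+ = one or more whitespace characters (the space after the # signs)
(.+) = captures the heading text (the part we want)
$ = end of a line
re.MULTILINE = makes ^ and $ match at the start and end of every line, not just of the whole string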
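A quick way to see what re.MULTILINE changes is to run the same pattern with and without the flag. The sample string here is made up for illustration.

```python
import re

sample = "# Title 1\nSome text.\n## Title 2\nMore text.\n"
pattern = r'^#+\s+(.+)$'

# With the flag, ^ and $ match at every line boundary.
with_flag = re.findall(pattern, sample, re.MULTILINE)

# Without it, ^ only matches at the very start and $ only at the very end,
# so a multi-line document yields no headings at all.
without_flag = re.findall(pattern, sample)

print(with_flag)
print(without_flag)
```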

code_blocks = re.findall(r'```[\w]*\n(.*?)```', text, re.DOTALL)

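Extracts Markdown code blocks: everything between an opening ``` fence and the closing ``` fence.

` = a backtick (the code-fence character)
three backticks ``` = match the opening fence
[\w]* = matches the language name, e.g. python; the trailing * means zero or more characters
\n = a newline
(.*?) = captures the code, non-greedily, so it stops at the first closing fence
three backticks ``` = match the closing fence
re.DOTALL = lets . match newlines too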
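To see why both the non-greedy (.*?) and re.DOTALL matter, the sketch below runs three variants of the pattern on a made-up sample with two code blocks. The fence variable is just a convenience of this example, used to build the three-backtick string.

```python
import re

fence = '`' * 3  # three backticks, built this way to keep the example readable
sample = (
    f"{fence}python\nprint('a')\n{fence}\n"
    "text between blocks\n"
    f"{fence}\nls -l\n{fence}\n"
)

# Non-greedy + DOTALL: one capture per code block.
lazy = re.findall(fence + r'[\w]*\n(.*?)' + fence, sample, re.DOTALL)

# Greedy + DOTALL: runs on to the LAST fence, merging both blocks into one.
greedy = re.findall(fence + r'[\w]*\n(.*)' + fence, sample, re.DOTALL)

# Non-greedy without DOTALL: '.' cannot cross a newline, so nothing matches.
no_dotall = re.findall(fence + r'[\w]*\n(.*?)' + fence, sample)

print(len(lazy), len(greedy), len(no_dotall))
```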

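^ has two meanings:

Inside [ ] = negation

[^0-9] → any character that is not a digit

At the start of a pattern = anchors the match to the beginning of the string (or of every line, with re.MULTILINE)

^#+\s+ → the line starts with one or more # signs followed by whitespace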
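Both meanings of ^ can be checked in a few lines. The input strings are made up for illustration.

```python
import re

# Inside brackets: [^0-9] matches any character that is NOT a digit.
non_digits = re.findall(r'[^0-9]', 'a1b2')

# Outside brackets: ^ anchors the match to the start of the string
# (or of each line when re.MULTILINE is set).
at_start = re.findall(r'^#+', '# one\n## two', re.MULTILINE)

# A '#' in the middle of a line is not at the anchor, so it is ignored.
mid_line = re.findall(r'^#+', 'text # not a heading')

print(non_digits, at_start, mid_line)
```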

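III. Key insights from the code

This is three-level inheritance: Base → Text → Markdown. The upper levels provide the basic features and the lower levels extend them.

Grandparent (BaseProcessor): validation and file info (usable by every processor)
Parent (TextProcessor): reading and cleaning text (usable for every text file)
Child (MarkdownProcessor): Markdown-specific parsing (only for MD files)

New features can be added by writing a subclass, without changing the parent classes.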
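To illustrate adding a feature without touching the parent, the sketch below defines a trimmed-down stand-in for TextProcessor (it cleans a string instead of reading a file, purely to stay self-contained) and a hypothetical WordCountProcessor subclass that is not part of the original code.

```python
import re


class TextProcessor:
    """Trimmed-down stand-in: cleans a string rather than reading a file."""

    def clean_text(self, text: str) -> str:
        return re.sub(r'[ \t]+', ' ', text).strip()

    def process(self, text: str) -> str:
        return self.clean_text(text)


class WordCountProcessor(TextProcessor):
    """Hypothetical new subclass; TextProcessor itself is left unchanged."""

    def process(self, text: str) -> dict:
        cleaned = super().process(text)  # reuse the parent's cleaning step
        return {'text': cleaned, 'words': len(cleaned.split())}


result = WordCountProcessor().process('  hello    wide   world  ')
print(result)
```

The same pattern would apply to any other format: subclass the layer whose behavior you want to reuse, call super().process, and add the format-specific step on top.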

IV. Key code / reusable snippets

File validation

def validate(self, filepath: str) -> bool:
    if not os.path.exists(filepath):
        raise FileNotFoundError(f'File not found: {filepath}')
    return True

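File info

def get_file_info(self, filepath: str) -> dict:
    return {
        'path': filepath,
        # os.path.getsize raises if the file is missing, so guard it as in the full listing
        'size': os.path.getsize(filepath) if os.path.exists(filepath) else 0,
        'name': os.path.basename(filepath)
    }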

Calling the parent class with super() (essential for inheritance)

cleaned_text = super().process(filepath)