QuantaAlpha源码深度解析:LLM驱动的自进化因子挖掘框架
摘要:本文深入剖析QuantaAlpha——一个基于大语言模型(LLM)的自进化量化因子挖掘框架。通过轨迹级进化、多样化规划和结构化质量控制,QuantaAlpha实现了从自然语言研究方向到可验证Alpha因子的全自动转化。我们将揭示其核心架构设计、进化机制和技术创新点。

一、引言:为什么需要自进化因子挖掘?
量化投资的核心在于发现能够预测资产价格变动的Alpha因子。传统因子挖掘面临三大挑战:
1. 人工依赖重:需要领域专家手工设计因子表达式
2. 搜索空间有限:人类思维难以穷尽高维组合可能性
3. 过拟合风险:复杂因子容易在历史数据上表现优异,但泛化能力差
近年来,大语言模型(LLM)展现出强大的代码生成和逻辑推理能力。QuantaAlpha创造性地将LLM与进化算法结合,构建了一个自进化系统:它不仅能生成因子,还能根据回测反馈持续优化策略,形成“假设-实现-验证-进化”的闭环。
核心成果
在CSI 300数据集上的实验显示:
• 信息系数(IC):0.1501
• 年化超额收益(ARR):27.75%
• 最大回撤(MDD):仅7.98%
• 零样本迁移:在CSI 500和S&P 500上无需重新训练即可保持优异表现
二、系统架构概览
QuantaAlpha的整体架构遵循四层流水线设计:
2.1 核心模块职责
| 模块 | 源码位置 |
|---|---|
| CLI入口 | quantaalpha/cli.py |
| 流水线编排 | pipeline/factor_mining.py |
| 假设生成 | factors/proposal.py |
| 进化引擎 | pipeline/evolution/ |
| 质量门控 | factors/regulator/ |
| 回测引擎 | backtest/runner.py |
三、核心技术深度剖析
3.1 多样化规划(Diversified Planning)
传统方法通常从单一方向开始探索,容易陷入局部最优。QuantaAlpha引入并行规划机制,将用户的初始想法扩展为多个正交的研究方向。
工作原理
# pipeline/planning.py 核心逻辑
def generate_parallel_directions(initial_direction, n=2):
    """
    使用LLM将初始方向扩展为n个差异化方向
    例如:"价量因子" → ["动量反转类", "波动率异常类", ...]
    """
    prompt = f"""
    基于研究方向'{initial_direction}',请生成{n}个差异化的子方向:
    - 每个方向应具有独特的理论基础
    - 避免语义重叠
    - 输出JSON格式
    """
    return llm.generate(prompt)
配置示例
# configs/experiment.yaml
planning:
  enabled: true
  num_directions: 2   # 生成2个并行方向
  use_llm: true       # 启用LLM生成(否则使用内置模板)
  max_attempts: 5     # JSON解析失败时的重试次数
这种设计类似于集成学习的思想:通过多样性提升整体探索能力。
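结合上面的配置,规划阶段对LLM输出的解析通常需要重试与回退保护。下面是一个示意性的简化实现(`generate_directions`、`llm_call` 及回退模板均为本文为说明而假设的名称,非源码原样):

```python
import json

def generate_directions(initial_direction, n=2, max_attempts=5, llm_call=None):
    """带重试的方向生成示意:JSON解析失败时重试,耗尽后回退到内置模板。"""
    fallback = [f"{initial_direction}-子方向{i + 1}" for i in range(n)]
    if llm_call is None:  # 对应 use_llm: false 的内置模板分支
        return fallback
    for _ in range(max_attempts):  # 对应配置中的 max_attempts
        try:
            data = json.loads(llm_call(initial_direction, n))
            dirs = data.get("directions", [])
            if len(dirs) == n:
                return dirs
        except json.JSONDecodeError:
            continue
    return fallback

# 模拟一个总是返回合法JSON的LLM调用
mock = lambda d, n: json.dumps({"directions": [f"{d}-动量", f"{d}-波动率"][:n]})
dirs = generate_directions("价量因子", 2, llm_call=mock)
```

当LLM不可用或多次解析失败时,该骨架会退回到模板方向,这与配置中 `use_llm` / `max_attempts` 的语义一致。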
3.2 五步工作流循环(5-Step Workflow Loop)
QuantaAlpha的核心执行单元是一个五步循环,定义在pipeline/loop.py的AlphaAgentLoop类中。这是理解整个系统的基础。
3.2.1 核心执行流程
class AlphaAgentLoop(LoopBase):
    """
    五步工作流:
    Step 1: factor_propose   - LLM生成研究假设
    Step 2: factor_construct - 将假设转化为因子表达式
    Step 3: factor_calculate - 代码实现与调试
    Step 4: factor_backtest  - 回测验证
    Step 5: feedback         - 生成反馈并更新轨迹
    """

    @stop_event_check  # 支持外部中断
    def factor_propose(self, prev_out):
        """步骤1:假设生成"""
        idea = self.hypothesis_generator.gen(self.trace)
        self._last_hypothesis = idea
        return idea

    @stop_event_check
    def factor_construct(self, prev_out):
        """步骤2:因子构造"""
        factor = self.factor_constructor.convert(
            prev_out["factor_propose"],  # 使用上一步的假设
            self.trace
        )
        return factor

    @stop_event_check
    def factor_calculate(self, prev_out):
        """步骤3:代码实现"""
        factor = self.coder.develop(prev_out["factor_construct"])
        return factor

    @stop_event_check
    def factor_backtest(self, prev_out):
        """步骤4:回测验证"""
        exp = self.runner.develop(
            prev_out["factor_calculate"],
            use_local=self.use_local
        )
        self._last_experiment = exp
        return exp

    @stop_event_check
    def feedback(self, prev_out):
        """步骤5:反馈生成"""
        feedback = self.summarizer.generate_feedback(
            prev_out["factor_backtest"],
            prev_out["factor_propose"],
            self.trace
        )
        # 关键:将(hypothesis, experiment, feedback)加入历史轨迹
        self.trace.hist.append((
            prev_out["factor_propose"],
            prev_out["factor_backtest"],
            feedback
        ))
        self._last_feedback = feedback
        # 自动保存到因子库
        manager.add_factors_from_experiment(...)
3.2.2 数据流转示意图
┌──────────────┐
│ Trace │ ← 包含完整的历史轨迹 (hypothesis, experiment, feedback)
└──────┬───────┘
│
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Propose │────▶│ Construct │────▶│ Calculate │
│ (LLM Gen) │ │ (Expr Gen) │ │ (Code Dev) │
└──────────────┘ └──────────────┘ └──────┬───────┘
│
▼
┌──────────────┐ ┌──────────────┐ │
│ Feedback │◀────│ Backtest │◀──────────┘
│ (Summary) │ │ (Qlib) │
└──────┬───────┘ └──────────────┘
│
└──▶ 更新 Trace.hist,进入下一轮循环
3.2.3 关键设计要点
1. 状态保持:通过 self.trace.hist 保存所有历史,供后续LLM调用时作为上下文
2. 中断保护:@stop_event_check 装饰器允许外部随时终止实验
3. 自动持久化:每轮结束后自动将因子保存到 all_factors_library.json
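上述五步循环的调度骨架可以抽象为一个极简的驱动器示意(省略LLM与回测细节,`MiniLoop` 与各 handler 均为本文假设的占位实现):

```python
class MiniLoop:
    """五步循环的简化示意:按固定顺序执行各步,并把结果写回轨迹。"""
    STEPS = ["propose", "construct", "calculate", "backtest", "feedback"]

    def __init__(self):
        self.hist = []  # 对应源码中的 trace.hist

    def run_round(self, handlers):
        out = {}
        for step in self.STEPS:
            out[step] = handlers[step](out)  # 每步都能读取前序步骤的输出
        # 对应源码:将(hypothesis, experiment, feedback)三元组加入历史
        self.hist.append((out["propose"], out["backtest"], out["feedback"]))
        return out

# 用占位函数模拟一轮完整执行
handlers = {
    "propose": lambda o: "hypothesis-1",
    "construct": lambda o: f"factor({o['propose']})",
    "calculate": lambda o: f"code[{o['construct']}]",
    "backtest": lambda o: {"RankIC": 0.12},
    "feedback": lambda o: "promising",
}
loop = MiniLoop()
result = loop.run_round(handlers)
```

可以看到,`prev_out` 式的字典传递让每一步天然依赖上游输出,而轨迹只在反馈步统一落盘,这正是上文三个设计要点的最小化体现。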
3.3 轨迹级进化(Trajectory-Level Evolution)
这是QuantaAlpha最核心的创新。不同于传统的参数调优,系统在策略层面进行进化。
3.3.1 状态机驱动的进化控制器
进化过程由EvolutionController统一管理,它本质上是一个有限状态机(FSM):
# pipeline/evolution/controller.py
class EvolutionController:
    def __init__(self, config: EvolutionConfig):
        self._current_round = 0                    # 当前轮次
        self._current_phase = RoundPhase.ORIGINAL  # 当前阶段
        self._directions_completed = set()         # 已完成的方向
        self._mutation_targets = []                # 突变目标池
        self._crossover_groups = []                # 交叉组合池

    def get_next_task(self) -> Optional[dict]:
        """状态转移函数:根据当前状态决定下一个任务"""
        if self._current_phase == RoundPhase.ORIGINAL:
            return self._get_original_task()
        elif self._current_phase == RoundPhase.MUTATION:
            return self._get_mutation_task()
        elif self._current_phase == RoundPhase.CROSSOVER:
            return self._get_crossover_task()
状态转移逻辑(原文为状态图,主要转移如下):
• ORIGINAL:所有方向完成后进入 MUTATION;若 mutation_disabled,则直接进入 CROSSOVER
• MUTATION:所有突变完成后进入 CROSSOVER;若 crossover_disabled,则继续 MUTATION
• CROSSOVER:所有交叉完成后回到 MUTATION;若 mutation_disabled,则继续 CROSSOVER
• 任一阶段达到 max_rounds 即终止
这种设计的关键优势是支持灵活的配置组合:
• 双开模式:Original → Mutation → Crossover → Mutation → …
• 纯突变模式:Original → Mutation → Mutation → …
• 纯交叉模式:Original → Crossover → Crossover → …
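这三种模式可以归结为一个简单的相位调度函数。下面是一个帮助理解的示意实现(`schedule_phases` 为本文假设的名称,非源码原样):

```python
def schedule_phases(max_rounds, mutation=True, crossover=True):
    """根据开关生成进化相位序列:首轮为ORIGINAL,其后在启用的相位间循环交替。"""
    phases = ["ORIGINAL"]
    # 只保留被启用的相位,形成交替循环
    cycle = [p for p, on in (("MUTATION", mutation), ("CROSSOVER", crossover)) if on]
    i = 0
    while len(phases) < max_rounds and cycle:
        phases.append(cycle[i % len(cycle)])
        i += 1
    return phases

phases_dual = schedule_phases(5)                   # 双开模式
phases_mut = schedule_phases(4, crossover=False)   # 纯突变模式
```

禁用某一算子后,调度自然退化为纯突变或纯交叉序列,这与上文FSM的三种配置组合一致。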
3.3.2 正交变异算子(Orthogonal Mutation Operator)
核心原理:变异不是随机扰动,而是语义层面的正交探索。
# pipeline/evolution/mutation.py
class MutationOperator:
    def generate_mutation(self, parent: StrategyTrajectory):
        """
        生成与父代正交的新假设

        关键步骤:
        1. 提取父代的假设、因子表达式、回测指标、反馈
        2. 调用LLM生成"对立方向"的探索策略
        3. 确保新策略与父代在理论基础上差异化
        """
        prompt = f"""
        父代假设:{parent.hypothesis}
        父代因子:{parent.factors}
        父代表现:RankIC={parent.backtest_metrics.get('RankIC')}
        反馈意见:{parent.feedback}

        请生成一个**正交**的新假设:
        - 如果父代关注动量,你应该探索均值回归
        - 如果父代使用价量数据,你可以尝试订单流特征
        - 避免重复父代已失败的思路
        """
        return llm.generate(prompt)
正交性保障机制:
系统通过多维度约束确保变异的方向性,约束维度包括:市场行为、数据源、时间尺度、数学变换。
Fallback机制:当LLM调用失败时,系统使用规则引擎生成备选方案:
def _generate_fallback_hypothesis(self, parent):
    """基于关键词匹配的启发式变异"""
    if "momentum" in parent.hypothesis.lower():
        return "Explore mean reversion: price tends to revert to historical average"
    elif "volume" in parent.hypothesis.lower():
        return "Explore pure price patterns: technical indicators without volume"
    # ...更多规则
这确保了系统的鲁棒性,即使LLM服务不稳定也能继续运行。
3.3.3 策略交叉算子(Strategy Crossover Operator)
核心思想:从多个成功策略中提取互补优势,产生协同效应。
# pipeline/evolution/crossover.py
class CrossoverOperator:
    def generate_crossover(self, parents: List[StrategyTrajectory]):
        """
        融合多个父代策略

        融合策略:
        1. 分析各父代的强项和弱项
        2. 识别互补性(如一个擅长牛市,一个擅长熊市)
        3. 生成综合性的新假设
        """
        prompt = f"""
        父代1: {parents[0].hypothesis}, RankIC={parents[0].metrics['RankIC']}
        父代2: {parents[1].hypothesis}, RankIC={parents[1].metrics['RankIC']}

        请设计一个混合策略:
        - 结合父代1的{strength_1}和父代2的{strength_2}
        - 避免它们的共同弱点:{common_weakness}
        - 探索可能的协同效应
        """
父代选择策略的数学原理:
系统实现了5种选择策略,其中加权采样的实现尤为精巧:
def _weighted_sample(self, candidates, inverse=False):
    """
    性能加权采样

    Args:
        inverse: False=高性能高权重, True=低性能高权重(鼓励探索)
    """
    metrics = [t.get_primary_metric() for t in candidates]
    # 归一化到[0, 1]
    normalized = [(m - min(metrics)) / (max(metrics) - min(metrics))
                  for m in metrics]
    if inverse:
        # 反向加权:低性能获得更高权重
        # 公式:w_i = (1 - normalized_i) + 0.1
        weights = [1 - n + 0.1 for n in normalized]
    else:
        # 正向加权:高性能获得更高权重
        weights = [n + 0.1 for n in normalized]
    # 无放回加权采样
    return weighted_random_sample(candidates, weights)
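源码中的 `weighted_random_sample` 另有实现,下面给出一个自包含的可运行简化版,用逐轮轮盘赌实现无放回加权采样(函数名与实现细节为本文假设的示意,非源码原样):

```python
import random

def weighted_sample(candidates, metrics, k=2, inverse=False, rng=None):
    """按归一化性能加权的无放回采样;inverse=True 时偏向低性能个体(鼓励探索)。"""
    rng = rng or random.Random(0)  # 固定种子便于复现
    lo, hi = min(metrics), max(metrics)
    span = (hi - lo) or 1.0  # 避免全部相等时除零
    norm = [(m - lo) / span for m in metrics]
    # 与上文公式一致:w_i = (1 - n_i) + 0.1 或 n_i + 0.1
    weights = [(1 - n) + 0.1 if inverse else n + 0.1 for n in norm]
    pool = list(zip(candidates, weights))
    picked = []
    for _ in range(min(k, len(pool))):
        total = sum(w for _, w in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for i, (cand, w) in enumerate(pool):
            acc += w
            if acc >= r:
                picked.append(cand)
                pool.pop(i)  # 无放回:选中后移出候选池
                break
    return picked

picked = weighted_sample(["A", "B", "C"], [0.05, 0.12, 0.30], k=2)
```

加在权重上的常数 0.1 保证最差个体也有非零中选概率,与源码的公式语义一致。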
多样性优先的组合评分:
def select_crossover_pairs(self, candidates):
    """选择交叉组合时优先考虑多样性"""
    scored_combos = []
    for combo in all_combinations:
        # 多样性得分 = 方向数×2 + 阶段数 + 平均性能
        directions = len(set(t.direction_id for t in combo))
        phases = len(set(t.phase for t in combo))
        avg_metric = mean(t.metric for t in combo)
        scored_combos.append((directions * 2 + phases + avg_metric, combo))
    # 选择得分最高的组合
    return top_k(scored_combos)
这种设计避免了近亲繁殖(如两个相似的突变体交叉),提升了进化的探索效率。
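多样性评分可以写成如下自包含的可运行示意(轨迹用 dict 表示,字段名为本文假设):

```python
from itertools import combinations
from statistics import mean

def score_pairs(trajs, k=1):
    """枚举所有两两组合,按 方向数×2 + 阶段数 + 平均指标 排序,取前k个组合。"""
    scored = []
    for combo in combinations(trajs, 2):
        directions = len({t["direction_id"] for t in combo})
        phases = len({t["phase"] for t in combo})
        avg = mean(t["metric"] for t in combo)
        scored.append((directions * 2 + phases + avg, combo))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [c for _, c in scored[:k]]

trajs = [
    {"direction_id": 0, "phase": "ORIGINAL", "metric": 0.12},
    {"direction_id": 0, "phase": "MUTATION", "metric": 0.15},
    {"direction_id": 1, "phase": "MUTATION", "metric": 0.10},
]
best = score_pairs(trajs)[0]
```

在这个例子里,尽管前两条轨迹的平均指标更高,得分最高的组合仍是跨方向、跨阶段的一对:方向多样性的权重(×2)压过了性能差距,正体现了"避免近亲繁殖"的设计意图。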
3.3.4 候选轨迹的动态筛选
交叉操作的候选集选择遵循时间局部性原则:
def _get_crossover_candidates(self):
    """
    根据进化历史动态选择候选轨迹

    策略:
    - 首次交叉:原始轮次 + 最新突变轮次
    - 后续交叉:最新突变轮次 + 最新交叉轮次
    """
    if not self.has_previous_crossover():
        # 第一次交叉:结合原始探索和突变结果
        candidates = original_trajs + latest_mutation_trajs
    else:
        # 后续交叉:结合最新的两种进化结果
        candidates = latest_mutation_trajs + latest_crossover_trajs
    return candidates
这确保了交叉操作始终基于最新的进化成果,而非任意久远的历史轨迹。
3.4 假设到因子的转化链
从抽象假设到可执行代码,QuantaAlpha设计了四段式转化:
关键代码分析
# factors/proposal.py - AlphaAgentHypothesis2FactorExpression
class AlphaAgentHypothesis2FactorExpression:
    def convert(self, hypothesis, trace):
        """将假设转化为因子表达式"""
        # 1. 准备上下文(包含历史轨迹反馈)
        context = self.prepare_context(hypothesis, trace)

        # 2. 调用LLM生成JSON格式的因子定义
        response = APIBackend().chat_completion(
            system_prompt="你是一个量化研究员...",
            user_prompt=f"基于假设{hypothesis},生成因子表达式..."
        )

        # 3. 解析并验证表达式
        for factor_name, factor_data in response.items():
            expression = factor_data["expression"]

            # 3.1 语法检查
            if not regulator.is_parsable(expression):
                retry_with_feedback()

            # 3.2 一致性检查(可选)
            if consistency_enabled:
                passed, corrected = quality_gate.evaluate(
                    hypothesis, description, formulation, expression
                )
                if corrected:
                    expression = corrected  # LLM自动修正

            # 3.3 复杂度检查
            if eval_dict['symbol_length'] > 200:
                reject_or_simplify()

            # 3.4 冗余检测(与因子库对比)
            if redundancy_checker.is_duplicate(expression):
                skip_factor()

        return FactorExperiment(tasks)
动态历史窗口
为避免超出LLM上下文限制,系统采用自适应历史截断:
def gen(self, trace):
    history_limit = 6  # 默认保留最近6轮历史
    while history_limit >= 1:
        try:
            context = prepare_context(trace, history_limit)
            return llm.generate(context)
        except InputLengthError:
            history_limit -= 1  # 逐步减少历史记录
            logger.warning(f"缩减历史窗口至{history_limit}")
这种设计既保证了足够的上下文信息,又避免了token超限。
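这一截断逻辑可以剥离出一个可独立运行的骨架(`InputLengthError` 和 `fake_call` 均为本文模拟的占位,用于演示窗口逐步收缩的过程):

```python
class InputLengthError(Exception):
    """模拟LLM上下文超长时抛出的异常。"""

def generate_with_shrinking_history(history, call, start_limit=6):
    """从最近 start_limit 轮历史开始尝试,超长时逐步缩减窗口直到成功。"""
    limit = start_limit
    while limit >= 1:
        try:
            return call(history[-limit:])  # 只传入最近 limit 轮
        except InputLengthError:
            limit -= 1
    raise RuntimeError("即使只保留1轮历史仍然超长")

# 模拟:上下文超过3轮即视为超长
def fake_call(ctx):
    if len(ctx) > 3:
        raise InputLengthError()
    return f"ok with {len(ctx)} rounds"

out = generate_with_shrinking_history(list(range(6)), fake_call)
```

该骨架从6轮历史开始连续失败三次后,在3轮窗口处成功返回,与源码"逐步减少历史记录"的行为一致。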
3.5 AST解析引擎(Abstract Syntax Tree Engine)
AST引擎是质量门控的核心组件,用于因子表达式的语法解析和冗余检测,位于factors/coder/factor_ast.py。
3.5.1 语法定义
系统使用pyparsing库定义了完整的因子表达式语法:
import sys
from dataclasses import dataclass
from typing import List

from pyparsing import (
    Word, alphas, alphanums, infixNotation, opAssoc,
    oneOf, Optional, delimitedList, Forward, Combine, Literal,
    Regex, ParserElement
)

# 启用packrat优化性能
ParserElement.enablePackrat()
sys.setrecursionlimit(4000)  # 提高递归限制

class Node:
    """AST节点基类"""

# AST节点类型
@dataclass
class VarNode(Node):  # 变量节点:$close, $open
    name: str

@dataclass
class NumberNode(Node):  # 数字节点:14, 1e-8
    value: float

@dataclass
class FunctionNode(Node):  # 函数节点:ts_mean(...), ts_rank(...)
    name: str
    args: List[Node]

@dataclass
class BinaryOpNode(Node):  # 二元运算:+, -, *, /
    op: str
    left: Node
    right: Node

@dataclass
class ConditionalNode(Node):  # 条件表达式:a ? b : c
    condition: Node
    true_expr: Node
    false_expr: Node
3.5.2 表达式解析
# 定义语法规则
var = Combine(Optional(Literal("$")) + Word(alphas, alphanums + "_"))
number = Regex(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")

expr = Forward()  # 前向声明,支持嵌套表达式
# 运算符优先级(infixNotation的列表从最高优先级开始)
expr <<= infixNotation(
    operand,
    [
        (Literal("-"), 1, opAssoc.RIGHT, create_unary_op_node),   # 一元负号
        (oneOf("* /"), 2, opAssoc.LEFT, create_binary_op_node),   # 乘除
        (oneOf("+ -"), 2, opAssoc.LEFT, create_binary_op_node),   # 加减
        (oneOf("> < >= <= == !="), 2, opAssoc.LEFT, ...),         # 比较
        (oneOf("&& &"), 2, opAssoc.LEFT, ...),                    # 逻辑与
        (oneOf("|| |"), 2, opAssoc.LEFT, ...),                    # 逻辑或
        (("?", ":"), 3, opAssoc.RIGHT, create_conditional_node),  # 三元运算符
    ]
)

def parse_expression(text: str) -> Node:
    """将字符串解析为AST"""
    result = expr.parseString(text, parseAll=True)
    return result[0]
示例:
expr = "ts_mean($close / $open, 20)"
ast = parse_expression(expr)
# AST结构:
#        FUNC(ts_mean)
#         /         \
#      OP(/)      NUM(20)
#      /    \
# VAR($close) VAR($open)
3.5.3 最大公共子树匹配算法
这是冗余检测的核心算法,用于判断两个因子是否相似:
def find_largest_common_subtree(root1: Node, root2: Node):
    """
    寻找两棵AST的最大公共子树

    算法步骤:
    1. 提取tree1的所有子树
    2. 提取tree2的所有子树
    3. 两两比较,找到最大的相同子树
    """
    def get_all_subtrees(root: Node) -> List[Node]:
        """递归提取所有子树根节点"""
        result = [root]
        if isinstance(root, FunctionNode):
            for arg in root.args:
                result.extend(get_all_subtrees(arg))
        elif isinstance(root, BinaryOpNode):
            result.extend(get_all_subtrees(root.left))
            result.extend(get_all_subtrees(root.right))
        return result

    def are_subtrees_equal(node1: Node, node2: Node) -> bool:
        """
        递归比较两棵子树是否相同

        特殊处理:交换律优化
        对于 a+b 和 b+a,虽然结构不同但语义相同
        """
        if not are_nodes_equal(node1, node2):
            return False
        if isinstance(node1, BinaryOpNode):
            if is_commutative_op(node1.op):  # +, *, ==等满足交换律
                # 尝试两种顺序
                return (are_subtrees_equal(node1.left, node2.left) and
                        are_subtrees_equal(node1.right, node2.right)) or \
                       (are_subtrees_equal(node1.left, node2.right) and
                        are_subtrees_equal(node1.right, node2.left))
            else:
                return are_subtrees_equal(node1.left, node2.left) and \
                       are_subtrees_equal(node1.right, node2.right)
        return True  # 叶子节点:节点相等即子树相等

    # 主逻辑
    subtrees1 = get_all_subtrees(root1)
    subtrees2 = get_all_subtrees(root2)

    max_match = None
    max_size = 0
    for st1 in subtrees1:
        size1 = get_subtree_size(st1)
        if size1 <= max_size:
            continue  # 剪枝:不可能更大
        for st2 in subtrees2:
            size2 = get_subtree_size(st2)
            if size2 != size1 or size2 <= max_size:
                continue
            if are_subtrees_equal(st1, st2):
                max_size = size1
                max_match = SubtreeMatch(st1, st2, size1)
    return max_match
复杂度分析:
• 时间复杂度:O(n₁ × n₂ × h),其中n为节点数,h为树高
• 空间复杂度:O(n₁ + n₂),用于存储所有子树
• 优化策略:通过剪枝(size1 <= max_size 时跳过)减少比较次数
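为了直观展示该算法,下面给出一个可运行的极简版本,用嵌套元组 `(算符, 子节点...)` 代替上文的Node类(这是本文的示意性简化,非源码原样):

```python
def subtrees(t):
    """递归枚举所有子树(叶子为字符串,内部节点为元组)。"""
    yield t
    if isinstance(t, tuple):
        for child in t[1:]:
            yield from subtrees(child)

def size(t):
    """子树节点数。"""
    return 1 if not isinstance(t, tuple) else 1 + sum(size(c) for c in t[1:])

def largest_common_subtree(a, b):
    """带剪枝的最大公共子树搜索(不含交换律优化)。"""
    best, best_size = None, 0
    for s1 in subtrees(a):
        if size(s1) <= best_size:
            continue  # 剪枝:不可能更大
        for s2 in subtrees(b):
            if s1 == s2 and size(s1) > best_size:
                best, best_size = s1, size(s1)
    return best, best_size

# ts_mean($close/$open, 20)  vs  ts_rank($close/$open, 5)
a = ("ts_mean", ("/", "$close", "$open"), "20")
b = ("ts_rank", ("/", "$close", "$open"), "5")
match, n = largest_common_subtree(a, b)
```

两个表达式的函数与窗口参数都不同,但共享子树 `$close / $open`(3个节点),这正是冗余检测想捕捉的"结构相似"信号。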
实际应用:
def match_alphazoo(prop_expr, factor_df):
    """
    在因子库中查找与新表达式最相似的因子

    Returns:
        max_size: 最大公共子树大小
        matched_subtree: 匹配的子树
        matched_alpha: 匹配的因子名称
    """
    max_size = 0
    matched_subtree = None
    matched_alpha = None
    for index, (name, alpha_expr) in factor_df.iterrows():
        try:
            match = compare_expressions(prop_expr, alpha_expr)
            if match is not None and match.size > max_size:
                max_size = match.size
                matched_subtree = match.root1
                matched_alpha = name
        except Exception as e:
            print(f"Error comparing alpha: {e}")
    return max_size, matched_subtree, matched_alpha
3.6 质量门控(Quality Gate)
为防止低质量因子进入回测环节,系统设置了三层过滤网:
3.6.1 一致性检查(Consistency Check)
问题:LLM生成的假设、描述、公式、表达式之间可能存在逻辑矛盾。
解决方案:使用LLM作为裁判进行语义对齐验证。
# factors/regulator/consistency_checker.py
class FactorConsistencyChecker:
    def check_consistency(self, hypothesis, description, formulation, expression):
        """
        检查四个层级的逻辑一致性:
        1. 假设 → 描述:描述是否准确反映假设意图
        2. 描述 → 公式:公式是否符合描述的逻辑
        3. 公式 → 表达式:表达式是否正确实现公式
        """
        prompt = f"""
        请评估以下因子的一致性:
        - 假设:{hypothesis}
        - 描述:{description}
        - 公式:{formulation}
        - 表达式:{expression}

        输出JSON:
        {{
            "is_consistent": true/false,
            "severity": "none/minor/major/critical",
            "corrected_expression": "修正后的表达式(如有)"
        }}
        """
        return llm.evaluate(prompt)
如果检测到不一致,系统会尝试自动修正(最多3次),而非直接丢弃。
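这一"检测-修正-复检"的控制流可以示意如下(`check_with_autocorrect` 与 `fake_checker` 为本文假设的占位实现,最多修正3次的语义与上文一致):

```python
def check_with_autocorrect(expression, checker, max_fixes=3):
    """若检查返回修正表达式则替换后复检,最多尝试 max_fixes 次。"""
    for _ in range(max_fixes):
        result = checker(expression)
        if result["is_consistent"]:
            return expression, True
        if not result.get("corrected_expression"):
            break  # 不一致且无修正建议,放弃
        expression = result["corrected_expression"]  # 采纳修正后复检
    return expression, False

# 模拟:缺少负号时判为不一致并给出修正,修正后通过
def fake_checker(expr):
    if expr.startswith("-"):
        return {"is_consistent": True}
    return {"is_consistent": False, "corrected_expression": "-" + expr}

expr, ok = check_with_autocorrect("($volume) * $return", fake_checker)
```

与直接丢弃相比,这种"给LLM一次改错机会"的策略能挽回大量仅有局部瑕疵的因子。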
3.6.2 复杂度控制(Complexity Control)
动机:过于复杂的因子容易过拟合。
指标:
• symbol_length:表达式字符串长度(阈值:200)
• base_features:使用的底层特征数量(阈值:5)
• free_args_ratio:自由参数占比(阈值:0.5)
# 复杂度检查逻辑
if symbol_length > 200:
    logger.warning("因子过于复杂,建议简化")
    # 可选择拒绝或要求LLM重新生成简化版本
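这三个指标都可以直接从表达式字符串估算。下面是一个示意性实现(正则规则与统计口径为本文假设,源码的精确定义可能不同):

```python
import re

def complexity_metrics(expr):
    """粗略估算 symbol_length / base_features / free_args_ratio 三个指标。"""
    base_features = set(re.findall(r"\$\w+", expr))          # $close, $volume 等底层特征
    numbers = re.findall(r"(?<![\w$])\d+(?:\.\d+)?", expr)   # 裸数值视为自由参数
    tokens = re.findall(r"\$\w+|\w+", expr)                  # 粗粒度token切分
    return {
        "symbol_length": len(expr),
        "base_features": len(base_features),
        "free_args_ratio": len(numbers) / max(len(tokens), 1),
    }

m = complexity_metrics("ts_mean($close / $open, 20)")
# 对照上文阈值做门控判断
ok = (m["symbol_length"] <= 200 and m["base_features"] <= 5
      and m["free_args_ratio"] <= 0.5)
```

该示例表达式长度27、底层特征2个、自由参数占比0.25,三项均低于阈值,应当放行。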
3.6.3 冗余检测(Redundancy Detection)
目标:避免生成与已有因子高度相似的表达式。
技术:基于AST(抽象语法树)的子树匹配算法(详见3.5节)。
class RedundancyChecker:
    def is_duplicate(self, new_expression, threshold=5):
        """
        将新表达式解析为AST,与因子库中的所有因子进行子树匹配
        如果最大公共子树大小超过阈值,判定为重复
        """
        new_ast = parse_to_ast(new_expression)
        for existing_expr in factor_zoo:
            existing_ast = parse_to_ast(existing_expr)
            common_subtree_size = ast_match(new_ast, existing_ast)
            if common_subtree_size > threshold:
                return True, existing_expr
        return False, None
这确保了因子库的多样性,避免资源浪费在相似策略上。
3.7 LLM客户端架构
LLM是整个系统的核心驱动力,位于llm/client.py。本节深入分析其健壮性设计。
3.7.1 健壮JSON解析器
LLM返回的JSON经常存在格式问题,系统实现了5层容错解析:
import json
import re

def robust_json_parse(text: str, max_retries: int = 3) -> dict:
    """
    多层级JSON解析策略
    """
    # 策略1:直接解析
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # 策略2:提取Markdown代码块中的JSON
    json_block_pattern = r'```(?:json)?\s*\n?([\s\S]*?)\n?```'
    matches = re.findall(json_block_pattern, text)
    for match in matches:
        try:
            return json.loads(match.strip())
        except json.JSONDecodeError:
            continue

    # 策略3:查找第一个完整的JSON对象(处理多余文本)
    brace_count = 0
    start_idx = end_idx = -1
    in_string = escape_next = False
    for i, char in enumerate(text):
        if escape_next:
            escape_next = False
            continue
        if char == '\\':
            escape_next = True
            continue
        if char == '"':
            in_string = not in_string
            continue
        if in_string:
            continue
        if char == '{':
            if brace_count == 0:
                start_idx = i
            brace_count += 1
        elif char == '}':
            brace_count -= 1
            if brace_count == 0 and start_idx != -1:
                end_idx = i
                break
    if start_idx != -1 and end_idx != -1:
        json_str = text[start_idx:end_idx + 1]
        try:
            return json.loads(json_str)
        except json.JSONDecodeError:
            # 策略4:修复LaTeX转义符(fix_latex_escapes定义见源码)
            fixed_str = fix_latex_escapes(json_str)
            try:
                return json.loads(fixed_str)
            except json.JSONDecodeError:
                pass

    # 策略5:宽松正则提取
    potential_jsons = re.findall(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text)
    for pj in potential_jsons:
        try:
            result = json.loads(pj)
            if isinstance(result, dict) and len(result) > 0:
                return result
        except json.JSONDecodeError:
            continue

    raise json.JSONDecodeError("Could not parse JSON", text, 0)
典型问题场景:

# 场景1:LLM返回带解释的JSON
text1 = """
这是一个符合要求的因子:
{
    "factor_name": "alpha_001",
    "expression": "ts_mean($close, 20)"
}
希望这对你有帮助!
"""
# 策略3会提取出中间的JSON对象

# 场景2:Markdown代码块
text2 = """
```json
{
    "factor_name": "alpha_002"
}
```
"""
# 策略2会提取代码块内容

# 场景3:LaTeX转义符
text3 = '{"formula": "\frac{a}{b}"}'
# 策略4会将非法的 \f 转义修复为 \\f

这种多层容错机制显著提升了系统的**鲁棒性**,即使LLM输出不规范也能正常解析。
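把前两层策略剥离出来即可得到一个可运行的最小版本(`parse_llm_json` 为本文假设的简化函数,用于演示分层回退的解析流程):

```python
import json
import re

def parse_llm_json(text):
    """两层容错示意:先直接解析,失败后再提取Markdown代码块中的JSON。"""
    # 第1层:整段文本本身就是合法JSON
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 第2层:从 ```json ... ``` 代码块中提取
    for block in re.findall(r"```(?:json)?\s*\n?([\s\S]*?)\n?```", text):
        try:
            return json.loads(block.strip())
        except json.JSONDecodeError:
            continue
    raise ValueError("no parsable JSON found")

# 带前后缀噪声 + 代码块包裹的典型LLM回复
reply = '这是结果:\n```json\n{"factor_name": "alpha_002"}\n```\n希望有帮助!'
data = parse_llm_json(reply)
```

先尝试最严格的路径、失败后逐层放宽,是这类解析器共同的设计模式;完整实现只是在此之上又叠加了括号配对、转义修复与宽松正则三层。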
3.7.2 SQLite缓存机制
为避免重复调用LLM,系统实现了基于MD5哈希的缓存:
class SQliteLazyCache(SingletonBaseClass):
    """单例模式的SQLite缓存"""
    def __init__(self, cache_location: str):
        super().__init__()  # 确保全局唯一实例
        self.conn = sqlite3.connect(cache_location, timeout=20)
        self.c = self.conn.cursor()
        # 创建三张表
        self.c.execute("""
            CREATE TABLE IF NOT EXISTS chat_cache (
                md5_key TEXT PRIMARY KEY,  -- MD5哈希作为主键
                chat TEXT                  -- 缓存的响应
            )
        """)
        self.c.execute("""
            CREATE TABLE IF NOT EXISTS embedding_cache (...)
        """)
        self.c.execute("""
            CREATE TABLE IF NOT EXISTS message_cache (...)
        """)

    def chat_get(self, key: str) -> str | None:
        """根据key获取缓存"""
        md5_key = md5_hash(key)  # MD5哈希缩短key长度
        self.c.execute("SELECT chat FROM chat_cache WHERE md5_key=?", (md5_key,))
        result = self.c.fetchone()
        return result[0] if result else None

    def chat_set(self, key: str, value: str):
        """设置缓存(INSERT OR REPLACE实现upsert)"""
        md5_key = md5_hash(key)
        self.c.execute(
            "INSERT OR REPLACE INTO chat_cache (md5_key, chat) VALUES (?, ?)",
            (md5_key, value)
        )
        self.conn.commit()
缓存键生成:
def _try_create_chat_completion_or_embedding(...):
    # 缓存键 = prompt内容 + seed
    input_content_json = json.dumps(messages)
    input_content_json = (
        chat_cache_prefix +
        input_content_json +
        f"<seed={seed}/>"  # seed确保重试时产生不同结果
    )
    if self.use_chat_cache:
        cache_result = self.cache.chat_get(input_content_json)
        if cache_result is not None:
            return cache_result  # 命中缓存,直接返回
    # 未命中,调用API
    response = call_llm_api(messages)
    if self.dump_chat_cache:
        self.cache.chat_set(input_content_json, response)  # 写入缓存
    return response
性能提升:在进化实验中,相同的prompt可能被多次调用(如交叉操作的父代选择),缓存可减少**30-50%**的API调用。
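缓存的读写语义可以用内存SQLite完整复现。下面是一个自包含的示意(`MiniChatCache` 为本文假设的简化类,表结构与上文 `chat_cache` 一致):

```python
import hashlib
import sqlite3

class MiniChatCache:
    """MD5键 + INSERT OR REPLACE 的最小缓存示意。"""
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS chat_cache (md5_key TEXT PRIMARY KEY, chat TEXT)"
        )

    @staticmethod
    def _key(key):
        # 对任意长度的prompt取MD5,得到定长主键
        return hashlib.md5(key.encode("utf-8")).hexdigest()

    def get(self, key):
        row = self.conn.execute(
            "SELECT chat FROM chat_cache WHERE md5_key=?", (self._key(key),)
        ).fetchone()
        return row[0] if row else None

    def set(self, key, value):
        # INSERT OR REPLACE:同键写入即覆盖(upsert)
        self.conn.execute(
            "INSERT OR REPLACE INTO chat_cache (md5_key, chat) VALUES (?, ?)",
            (self._key(key), value),
        )
        self.conn.commit()

cache = MiniChatCache()
prompt_key = '[{"role": "user", "content": "hi"}]<seed=0/>'
cache.set(prompt_key, "hello")
cache.set(prompt_key, "hello-v2")  # upsert 覆盖旧值
```

键中拼入 `<seed=.../>` 的意义在此也很直观:同一prompt换一个seed就是另一条缓存记录,重试因此能拿到不同结果。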
3.8 并行执行架构
为加速大规模实验,系统支持两种并行模式:
3.8.1 分支并行(Branch Parallelism)
在规划阶段,多个研究方向同时运行:
# pipeline/factor_mining.py
if parallel_execution and len(directions) > 1:
    processes = []
    for idx, direction in enumerate(directions):
        p = Process(target=_run_branch, args=(direction, ...))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()  # 等待所有分支完成
3.8.2 进化任务并行(Evolution Task Parallelism)
在进化阶段,同一轮次的多个任务并发执行:
def _run_tasks_parallel(tasks):
    """并行执行多个进化任务(突变/交叉)"""
    result_queue = Queue()
    processes = []
    for idx, task in enumerate(tasks):
        p = Process(
            target=_parallel_task_worker,
            args=(task, directions, step_n, ..., result_queue, idx)
        )
        p.start()
        processes.append(p)
    # 收集结果
    results = [result_queue.get() for _ in tasks]
    return results
注意:为避免文件锁冲突,子进程会禁用文件锁:
RD_AGENT_SETTINGS.use_file_lock = False
RD_AGENT_SETTINGS.pickle_cache_folder_path_str = f"pickle_cache_{task_idx}"
这种Map-Reduce范式的并行架构使得系统能够充分利用多核CPU资源,显著提升了实验效率。
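这种"Process + Queue"的收集模式可以压缩成一个可直接运行的最小示例(`_worker` 与任务内容均为本文假设的占位,仅演示结果回收路径):

```python
from multiprocessing import Process, Queue

def _worker(task, q, idx):
    """模拟一个进化任务:完成后把 (索引, 结果) 放入队列。"""
    q.put((idx, f"done:{task}"))

def run_tasks_parallel(tasks):
    q = Queue()
    procs = []
    for idx, task in enumerate(tasks):
        p = Process(target=_worker, args=(task, q, idx))
        p.start()
        procs.append(p)
    # 先收结果再 join:避免队列写满时子进程阻塞在 put 上
    results = [q.get() for _ in tasks]
    for p in procs:
        p.join()
    return dict(results)

results = run_tasks_parallel(["mutation-0", "mutation-1", "crossover-0"])
```

每个结果都带上任务索引,使得乱序到达的输出仍能按任务对齐,这也是源码把 `idx` 传给 worker 的原因。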
3.9 提示词工程(Prompt Engineering)
提示词是LLM驱动系统的核心。QuantaAlpha采用了结构化提示词模板,所有提示词都存储在YAML文件中,便于维护和优化。本节深入分析关键场景的提示词设计。
3.9.1 多样化规划提示词
文件位置:pipeline/prompts/planning_prompts.yaml
设计目标:将用户的初始研究方向扩展为多个正交的探索方向。
system: |-
  You are a senior quant researcher. Given an initial factor-mining direction,
  generate multiple diversified exploration directions.
  Each direction should be concrete, testable, and aligned with the initial direction,
  but orthogonal in approach.

user: |-
  Initial direction:
  {initial_direction}

  Generate EXACTLY {n} diversified exploration directions. Requirements:
  1) Diversity: directions must differ in data signals, time horizons, or reasoning frameworks.
  2) Actionability: each direction must be specific enough to generate a hypothesis.
  3) Format: return JSON wrapped in a ```json code block.

output_format: |-
  ```json
  {"directions": ["direction 1", "direction 2", "..."]}
  ```
  The array must contain exactly {n} strings. No extra text.
**关键设计要点**:
1. **角色定位明确**:"senior quant researcher"建立专业语境
2. **约束具体化**:明确要求"concrete, testable, orthogonal"
3. **格式强制**:要求JSON输出,避免自由文本导致解析困难
4. **数量控制**:"EXACTLY {n}"确保输出符合预期
**示例**:
# 输入
initial_direction = "价量因子挖掘"
n = 3

# LLM输出
{
    "directions": [
        "基于成交量放大的动量效应",
        "波动率与价格反转的非线性关系",
        "微观结构中的流动性溢价特征"
    ]
}
3.9.2 假设生成提示词
文件位置:factors/prompts/proposal.yaml
设计目标:基于历史轨迹和反馈生成新的研究假设。
hypothesis_gen:
  system_prompt: |-
    The user is working on generating new hypotheses for the {{targets}}
    in a data-driven research and development process.
    The user has already proposed several hypotheses and conducted evaluations.
    Your task is to check whether a similar hypothesis has already been generated.
    If one exists and you agree with it, feel free to use it.
    If you disagree, please generate an improved version.
    {% if hypothesis_specification %}
    **Important:** If the hypothesis_specification outlines the next steps,
    ensure you adhere to those instructions.
    {% endif %}

  user_prompt: |-
    {% if hypothesis_and_feedback|length == 0 %}
    It is the first round of hypothesis generation.
    {% else %}
    It is not the first round. The former hypothesis and feedbacks are:
    {{ hypothesis_and_feedback }}
    Focus on the last one & the new hypothesis that it provides
    and reasoning to see if you agree.
    {% endif %}
    {% if RAG %}
    Additional reference information: {{RAG}}
    **Note:** Assess whether the RAG aligns with the {{targets}}.
    If it does not, it should not be used.
    {% endif %}
核心机制:
1. 历史感知:通过 hypothesis_and_feedback 传递完整的历史轨迹
2. 去重检查:要求LLM判断是否已有相似假设
3. RAG集成:可选的外部知识注入(检索增强生成)
4. 条件渲染:使用Jinja2模板根据上下文动态调整提示词
实际应用场景:
# 第1轮:无历史信息
hypothesis_1 = llm.generate(
system_prompt=template.render(targets="alpha factors"),
user_prompt="It is the first round..."
)
# 输出:"Explore momentum effect in high-volume stocks"
# 第5轮:有4轮历史反馈
hypothesis_5 = llm.generate(
system_prompt=template.render(targets="alpha factors"),
user_prompt=f"""
Former hypotheses and feedbacks:
Round 1: Momentum strategy - RankIC=0.12, feedback: good but overfits
Round 2: Mean reversion - RankIC=0.08, feedback: too simple
Round 3: Volume-price divergence - RankIC=0.15, feedback: promising
Round 4: Order flow imbalance - RankIC=0.11, feedback: needs refinement
Focus on Round 4's suggestion to refine order flow features.
"""
)
# 输出:"Combine order flow imbalance with volume-weighted price impact"
3.9.3 正交变异提示词
文件位置:pipeline/prompts/evolution_prompts.yaml
设计目标:生成与父代策略完全正交的新假设,避免重复探索。
mutation:
  system: |
    You are a quantitative finance strategy expert specializing in
    generating diversified factor mining strategies.

    "Orthogonal" means the new strategy should:
    1. Explore a completely different market hypothesis from the parent
    2. Use different data dimensions or feature types
    3. Be based on different investment logic or market perspectives
    4. Avoid generating factors highly correlated with the parent strategy

  user: |
    Based on the following parent strategy, generate an orthogonal new strategy.

    ## Parent Strategy Information
    ### Hypothesis
    {parent_hypothesis}
    ### Factor Expressions
    {parent_factors}
    ### Backtest Metrics
    {parent_metrics}
    ### Evaluation Feedback
    {parent_feedback}
    ---
    ## Requirements
    Please generate a new strategy direction that is **orthogonal** to the parent:
    1. **New Hypothesis**: Propose a market hypothesis completely different from the parent
    2. **Exploration Direction**: Describe the data dimensions and feature types
    3. **Orthogonality Reasoning**: Explain why this new strategy is orthogonal
    4. **Expected Characteristics**: Expected characteristics of factors

    Output in JSON format:
    ```json
    {{
        "new_hypothesis": "Description of the new market hypothesis",
        "exploration_direction": "Exploration direction description",
        "orthogonality_reason": "Reasoning for orthogonality with parent",
        "expected_characteristics": "Expected factor characteristics"
    }}
    ```
正交性保障机制:提示词从市场行为、数据源、时间尺度、数学变换四个维度约束新假设与父代的差异。
Fallback机制:当LLM调用失败时,使用预定义的备选假设:
fallback_templates:
  - "Explore mean reversion characteristics opposite to price momentum"
  - "Study nonlinear relationships between volume and price"
  - "Analyze trend transition signals across cycles"
  - "Mine liquidity features in market microstructure"
  - "Build factors based on volatility regime switching"
  - "Explore sector rotation and individual stock alpha relationship"
实际应用案例:
# 父代策略
parent = {
"hypothesis": "High volume stocks show momentum effect",
"factors": "ts_rank($volume / ts_mean($volume, 20), 5) * $return",
"metrics": "RankIC=0.12, ARR=15.3%",
"feedback": "Good performance but overfits to bull markets"
}
# LLM生成的正交变异
mutation_result = {
"new_hypothesis": "Low volume stocks exhibit mean reversion due to liquidity constraints",
"exploration_direction": "Focus on illiquidity premium and price reversal in low-volume regimes",
"orthogonality_reason": "Opposite market behavior (reversion vs momentum), different volume regime",
"expected_characteristics": "Negative correlation with momentum factors, better in bear markets"
}
3.9.4 策略交叉提示词
文件位置:pipeline/prompts/evolution_prompts.yaml
设计目标:融合多个父代策略的优势,产生协同效应。
crossover:
  system: |
    You are a quantitative finance strategy fusion expert, specializing in
    combining the advantages of multiple strategies into a stronger hybrid strategy.

    When fusing strategies, consider:
    1. The core hypotheses and market views of each parent strategy
    2. Factor types and features that performed well
    3. Complementarity and potential synergies between strategies
    4. Avoid inheriting common weaknesses from parent strategies

  user: |
    Based on the following multiple parent strategies, generate a fused hybrid strategy.

    ## Parent Strategy Information
    {parent_summaries}
    ---
    ## Requirements
    1. **Hybrid Hypothesis**: A new market hypothesis that fuses the advantages
    2. **Fusion Logic**: Explain how to combine the strengths of each parent
    3. **Innovation Points**: New characteristics brought by the hybrid strategy
    4. **Expected Benefits**: Why the hybrid may outperform individual parents

    Output in JSON format:
    ```json
    {{
        "hybrid_hypothesis": "Description of the fused market hypothesis",
        "fusion_logic": "Fusion logic explanation",
        "innovation_points": "Innovation points description",
        "expected_benefits": "Expected benefits explanation"
    }}
    ```

# 父代策略格式化模板
parent_template: |
  ### Parent {idx}: {phase_name}
  **Direction ID**: {direction_id}
  **Hypothesis**: {hypothesis}
  **Factors**:
  {factors}
  **Metrics**:
  {metrics}
  **Feedback**:
  {feedback}
  ---
交叉融合策略:
# 两个父代策略
parent_1 = {
    "hypothesis": "Momentum in high-volume stocks",
    "metrics": "RankIC=0.12, strong in bull markets",
    "feedback": "Overfits to upward trends"
}
parent_2 = {
    "hypothesis": "Mean reversion in volatile stocks",
    "metrics": "RankIC=0.10, strong in bear markets",
    "feedback": "Weak in trending markets"
}

# LLM生成的混合策略
crossover_result = {
    "hybrid_hypothesis": """
        Regime-dependent strategy: Apply momentum in trending markets
        and mean reversion in range-bound markets, using volatility
        as regime indicator
    """,
    "fusion_logic": """
        Combine parent_1's momentum strength in bull markets with
        parent_2's reversion capability in bear markets.
        Use volatility threshold to switch between strategies.
    """,
    "innovation_points": "Market regime detection + adaptive strategy selection",
    "expected_benefits": "Better risk-adjusted returns across all market conditions"
}
3.9.5 一致性检查提示词
文件位置:factors/regulator/consistency_prompts.yaml
设计目标:验证因子的假设、描述、公式、表达式之间的逻辑一致性。
consistency_check_system: |-
  You are an expert financial factor analyst.
  Your task is to verify the logical consistency between:
  1. **Hypothesis**: The market hypothesis
  2. **Factor Description**: Natural language description
  3. **Factor Formulation**: Mathematical formula (LaTeX)
  4. **Factor Expression**: Symbolic expression

  Check:
  1. **Hypothesis → Description**: Does description follow from hypothesis?
  2. **Description → Formulation**: Does formula represent description?
  3. **Formulation → Expression**: Does expression implement formula?

  **Important Rules:**
  - Minor differences in window sizes (e.g., 10 vs 15 days) are acceptable
  - Focus on whether the core logic and economic meaning are preserved
  - Be lenient on implementation details that don't change fundamental behavior

  **Severity Levels:**
  - **none**: No issues found
  - **minor**: Small inconsistencies that don't affect economic meaning
  - **major**: Significant inconsistencies affecting interpretation
  - **critical**: Expression completely contradicts hypothesis

  Output Format (JSON):
  {
      "is_consistent": true/false,
      "severity": "none/minor/major/critical",
      "hypothesis_to_description": "Analysis...",
      "description_to_formulation": "Analysis...",
      "formulation_to_expression": "Analysis...",
      "overall_feedback": "Overall assessment",
      "corrected_expression": "Corrected expression if needed (null if no correction)"
  }

consistency_check_user: |-
  Please analyze the consistency of the following factor:

  **Hypothesis:**
  {{ hypothesis }}

  **Factor Name:** {{ factor_name }}

  **Factor Description:**
  {{ factor_description }}

  **Factor Formulation (LaTeX):**
  {{ factor_formulation }}

  **Factor Expression:**
  {{ factor_expression }}

  Please check the three consistency dimensions and output JSON.
实际检测案例:
# 输入的因子定义
factor = {
    "hypothesis": "Stocks with increasing volume show momentum",
    "description": "Measure volume growth rate and multiply with returns",
    "formulation": r"\frac{volume_t}{mean(volume, 20)} \times return_t",
    "expression": "($volume / ts_mean($volume, 20)) * $return"
}

# LLM的一致性检查输出
result = {
    "is_consistent": True,
    "severity": "none",
    "hypothesis_to_description": "Description accurately captures volume-momentum relationship",
    "description_to_formulation": "Formula correctly represents volume ratio times return",
    "formulation_to_expression": "Expression perfectly implements the mathematical formula",
    "overall_feedback": "All components are logically consistent",
    "corrected_expression": None
}
不一致修正案例:
# 发现不一致的因子
factor_inconsistent = {
"hypothesis": "Volume spike predicts price reversal",
"description": "Detect abnormal volume and bet on mean reversion",
"formulation": r"-\frac{volume_t - mean(volume, 20)}{std(volume, 20)} \times return_t",
"expression": "($volume / ts_mean($volume, 20)) * $return" # ❌ 错误:应该是负号且用z-score
}
# LLM检测到不一致并修正
result = {
"is_consistent": False,
"severity": "major",
"hypothesis_to_description": "Consistent: both describe volume-reversal relationship",
"description_to_formulation": "Consistent: z-score normalization and negative sign correct",
"formulation_to_expression": "INCONSISTENT: Expression missing negative sign and uses ratio instead of z-score",
"overall_feedback": "Expression does not match formulation",
"corrected_expression": "-(($volume - ts_mean($volume, 20)) / ts_std($volume, 20)) * $return"
}
3.9.6 代码实现提示词
文件位置:factors/coder/prompts.yaml
设计目标:将因子表达式转换为可执行的Python代码,并提供调试反馈。
evolving_strategy_factor_implementation_v1_system: |-
  User is trying to implement factors in the scenario:
  {{ scenario }}

  To help you write correct code, the user provides:
  1. Correct code for similar factors (learn from these)
  2. Failed former code and feedback (analyze and correct)
  3. Suggestions and similar fail-to-correct pairs

  You must write code based on your former latest attempt.
  Read the former attempt carefully and do not modify the right parts.

  Response format (JSON):
  {
      "code": "The Python code as a string."
  }

evaluator_code_feedback_v1_system: |-
  Your job is to check whether user's code aligns with the factor and scenario.

  User provides:
  - Source Python code
  - Execution error message (if failed)
  - Ground truth code (optional, for reference only)
  - Factor value comparison analysis

  **Rules:**
  - Do NOT leak ground truth code to user
  - Provide clear, short suggestions (no code snippets)
  - Point out only critical issues
  - If no big issue: respond "No critics found"

  Output format:
  critic 1: The critic message
  critic 2: The critic message
代码生成流程:
# 第1次尝试:LLM生成代码
attempt_1 = {
    "code": """
import pandas as pd

def calculate_factor(df):
    volume_ratio = df['$volume'] / df['$volume'].rolling(20).mean()
    return volume_ratio * df['$return']
"""
}

# 执行反馈
feedback_1 = """
Error: KeyError: '$volume'
The DataFrame uses columns without '$' prefix in the function scope.
"""

# 第2次尝试:LLM根据反馈修正
attempt_2 = {
    "code": """
import pandas as pd

def calculate_factor(df):
    # Fixed: access columns correctly
    volume_col = '$volume'
    return_col = '$return'
    volume_ratio = df[volume_col] / df[volume_col].rolling(20).mean()
    return volume_ratio * df[return_col]
"""
}

# 最终评估
final_decision = {
    "final_decision": True,
    "final_feedback": "Code executes successfully and produces expected factor values"
}
3.9.7 提示词设计的最佳实践
通过分析QuantaAlpha的提示词体系,我们可以总结出以下最佳实践:
**1. Enforce structured output**
# ✅ Recommended: an explicit JSON format requirement
Output in JSON format:
{{
  "field_1": "description",
  "field_2": "description"
}}
# ❌ Avoid: open-ended text output
Please provide your answer.
**2. Clear role definition**
# ✅ Recommended: a specific expert identity
You are a quantitative finance strategy expert specializing in...
# ❌ Avoid: a vague role definition
You are an AI assistant.
**3. Concrete constraints**
# ✅ Recommended: list specific requirements
Requirements:
1. Diversity: directions must differ in data signals
2. Actionability: each direction must be specific enough
3. Format: return JSON wrapped in a code block
# ❌ Avoid: vague requirements
Please generate diverse directions.
**4. Example-driven prompting (few-shot learning)**
# ✅ Recommended: provide an example
Here is an example structure for the output:
{
  "output_format_decision": true,
  "output_format_feedback": "The output format is correct."
}
# ❌ Avoid: no guiding example
**5. Fault tolerance and fallbacks**
# ✅ Recommended: provide built-in alternatives
fallback_templates:
  - "Explore mean reversion..."
  - "Study nonlinear relationships..."
# ❌ Avoid: depending on the LLM alone
**6. Conditional rendering and context awareness**
# ✅ Recommended: adapt the prompt to the context
{% if hypothesis_and_feedback|length == 0 %}
It is the first round.
{% else %}
Former hypotheses and feedbacks: {{ hypothesis_and_feedback }}
{% endif %}
# ❌ Avoid: a static prompt
**7. Chain-of-thought guidance**
# ✅ Recommended: ask for the reasoning process
Please provide:
1. Analysis of each dimension
2. Reasoning for your conclusion
3. Final score with justification
# ❌ Avoid: asking only for the final score
Give me a score.
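Practices 1 and 5 combine naturally in code: parse strict JSON out of the model's reply, retry up to a fixed attempt budget, and fall back to a built-in template when parsing keeps failing. A dependency-free sketch (the `parse_json_response` and `robust_generate` names are assumptions, not QuantaAlpha's API):

```python
import json
import re

def parse_json_response(text):
    """Extract and parse the first JSON object in an LLM reply."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # tolerate fences and prose
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

def robust_generate(llm_call, fallback, max_attempts=5):
    """Retry on malformed JSON; return a built-in fallback if all attempts fail."""
    for _ in range(max_attempts):
        try:
            return parse_json_response(llm_call())
        except ValueError:                         # JSONDecodeError is a subclass
            continue
    return fallback
```

This mirrors the `max_attempts: 5` retry setting and the `fallback_templates` list shown earlier.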
3.9.8 Prompt Versioning and A/B Testing
QuantaAlpha manages its prompts in YAML configuration files, which supports rapid iteration and A/B testing:
# prompts_v1.yaml
mutation:
system: "You are a quantitative expert..."
user: "Generate orthogonal strategy..."
# prompts_v2.yaml (improved version)
mutation:
system: "You are a world-class quant researcher with 20 years experience..."
user: |
Generate orthogonal strategy with detailed reasoning:
1. Analyze parent strategy weaknesses
2. Identify unexplored market dimensions
3. Propose novel hypothesis
Advantages:
- Version control: roll back to an earlier prompt version at any time
- Experiment comparison: run different prompt versions side by side and compare performance
- Team collaboration: non-engineers can edit and refine prompts
- Hot reload: new prompts can be loaded without restarting the service
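The A/B-testing idea can be sketched with plain dictionaries standing in for the two YAML files; the `pick_prompt_version` helper and the traffic-split logic below are illustrative assumptions, not the project's actual mechanism:

```python
import random

# Inlined stand-ins for prompts_v1.yaml / prompts_v2.yaml, keeping the
# sketch dependency-free; in the project these live in YAML files.
PROMPTS = {
    "v1": {"mutation_system": "You are a quantitative expert..."},
    "v2": {"mutation_system": "You are a world-class quant researcher..."},
}

def pick_prompt_version(traffic_split=0.5, rng=random):
    """A/B test: route a request to v2 with probability `traffic_split`."""
    return "v2" if rng.random() < traffic_split else "v1"

def get_prompt(name, version):
    """Look up a prompt by logical name under the chosen version."""
    return PROMPTS[version][name]
```

Keeping the lookup behind a logical name (`mutation_system`) is what makes rollback and side-by-side comparison cheap: callers never hard-code a version.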
4. Key Technical Details
4.1 Environment Isolation and Workspace Management
Each experiment runs in its own workspace to avoid interference between runs:
# core/conf.py
class RD_AGENT_SETTINGS:
workspace_path = "/path/to/experiment/workspace"
pickle_cache_folder_path_str = "/path/to/cache"
cache_with_pickle = True # enable caching to speed up repeated computation
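The `cache_with_pickle` flag suggests disk memoization of expensive intermediate results. A minimal sketch of such a cache; the `pickle_cached` decorator and its on-disk layout are assumptions for illustration, not the project's actual implementation:

```python
import hashlib
import pickle
from pathlib import Path

def pickle_cached(cache_dir):
    """Memoize a function's results on disk, keyed by a hash of its arguments."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)

    def decorator(func):
        def wrapper(*args, **kwargs):
            raw = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = folder / (hashlib.md5(raw).hexdigest() + ".pkl")
            if path.exists():                       # cache hit: skip the work
                return pickle.loads(path.read_bytes())
            result = func(*args, **kwargs)
            path.write_bytes(pickle.dumps(result))  # cache miss: persist
            return result
        return wrapper
    return decorator
```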
4.2 Timeout Protection
Prevents a single experiment from running indefinitely:
# pipeline/factor_mining.py
import functools
import signal

def handle_timeout(signum, frame):
    raise TimeoutError("experiment exceeded the time limit")

def force_timeout(seconds=LLM_SETTINGS.factor_mining_timeout):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            signal.signal(signal.SIGALRM, handle_timeout)
            signal.alarm(seconds)  # deliver SIGALRM after `seconds`
            try:
                return func(*args, **kwargs)
            finally:
                signal.alarm(0)  # cancel the pending alarm on exit
        return wrapper
    return decorator

@force_timeout()
def main(...):
    # main experiment logic
4.3 Factor Library Persistence
Every successful factor is automatically saved to a JSON file:
// all_factors_library.json
[
{
"factor_name": "alpha_001",
"factor_description": "Momentum factor based on volume expansion",
"factor_expression": "ts_rank(volume / ts_mean(volume, 20), 5)",
"rank_ic": 0.142,
"source_trajectory": "mutation_03_01"
}
]
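Appending to such a JSON library takes only a few lines; the `save_factor` helper below is an illustration, not the project's actual persistence code:

```python
import json
from pathlib import Path

def save_factor(library_path, record):
    """Append one successful-factor record to the JSON library file."""
    path = Path(library_path)
    library = json.loads(path.read_text()) if path.exists() else []
    library.append(record)                          # keep earlier factors
    path.write_text(json.dumps(library, ensure_ascii=False, indent=2))
```

Reading the whole file and rewriting it keeps the library a single valid JSON array, at the cost of O(n) writes, which is acceptable for a library of at most a few thousand factors.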
5. Hands-On Walkthrough
5.1 Launching Factor Mining
# Basic usage
./run.sh "price-volume factor mining"
# Specify a config file
python launcher.py mine \
  --direction "microstructure factors" \
  --config configs/experiment.yaml
5.2 Inspecting the Evolution Process
Log directory layout:
log/
├── branch_01/              # first planning direction
│   ├── original_00_00/     # initial exploration
│   ├── mutation_01_00/     # first mutation round
│   └── crossover_02_00/    # first crossover round
├── branch_02/              # second planning direction
└── trajectory_pool.json    # trajectory-pool snapshot
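The directory names appear to encode the operation, generation, and index (e.g. `mutation_01_00`). Assuming that naming scheme holds, a small parser could reconstruct the evolution history from the log tree; `parse_trajectory_name` is a hypothetical helper, not part of the project:

```python
import re

def parse_trajectory_name(name):
    """Split a directory name like 'mutation_01_00' into its components."""
    m = re.fullmatch(r"(original|mutation|crossover)_(\d+)_(\d+)", name)
    if m is None:
        return None                                # not a trajectory directory
    op, generation, index = m.groups()
    return {"op": op, "generation": int(generation), "index": int(index)}
```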
5.3 回测验证
# 使用自定义因子回测
python -m quantaalpha.backtest.run_backtest \
-c configs/backtest.yaml \
--factor-source custom \
--factor-json all_factors_library.json
# 结合基线因子(Alpha158)
python -m quantaalpha.backtest.run_backtest \
--factor-source combined \
--factor-json all_factors_library.json
6. Summary of Technical Innovations
6.1 Comparison with Traditional Approaches

| Dimension | Traditional approach | QuantaAlpha |
|---|---|---|
| Hypothesis generation | Hand-designed by domain experts | LLM-generated from research directions |
| Factor implementation | Manually coded and debugged | LLM code generation with automated debugging feedback |
| Strategy optimization | Parameter tuning on fixed ideas | Trajectory-level evolution (mutation and crossover) |
| Quality control | Manual review | Structured gates (consistency and complexity checks) |
| Knowledge accumulation | Scattered and hard to reuse | Persistent factor library and trajectory pool |
6.2 Core Innovations
1. Trajectory-level evolution paradigm
   - Evolves not just parameters but the strategy itself
   - Preserves the full hypothesis-experiment-feedback chain
2. LLM generation under structured constraints
   - Consistency checks enforce logical rigor
   - Complexity control guards against overfitting
3. Orthogonal exploration mechanism
   - The mutation stage deliberately diverges from the parent's line of thought
   - Avoids premature convergence
4. Zero-shot transfer
   - Factors mined on CSI 300 apply directly to CSI 500 and S&P 500
   - Evidence of strategy generality rather than data fitting
5. Systematic prompt engineering
   - YAML-managed prompts with version control and A/B testing
   - Seven categories of specialized prompts covering the full workflow
   - Fallback mechanisms for robustness
   - Few-shot examples and chain-of-thought guidance improve generation quality
7. Limitations and Future Directions
Current limitations
1. Strong LLM dependence: generation quality is bounded by the capability of the base model
2. High compute cost: a single full experiment takes hours to days
3. Black-box behavior: the decision logic of the evolution process is not fully transparent
Directions for improvement
1. Multimodal input: incorporate unstructured data such as candlestick charts and news sentiment
2. Reinforcement-learning integration: optimize directly against backtest metrics as reward signals
3. Better interpretability: visualize the evolution tree and trace each factor's origin
4. Distributed scaling: parallelize across machine clusters
8. Conclusion
QuantaAlpha marks a paradigm shift in quantitative research: from human-driven to autonomously intelligent. It is not merely a tool but a continuously evolving research partner, one that understands your research intent, explores unknown territory on its own, and distills its findings into reusable knowledge.
As LLM capabilities keep improving and evolutionary algorithms mature, systems of this kind should play an increasingly important role in finance, drug discovery, materials science, and beyond.
References
1. Han, J., et al. (2026). QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining. arXiv:2602.07085
2. Microsoft Qlib Team. Qlib: An AI-oriented Quantitative Investment Platform
3. RD-Agent: Automated R&D Framework (NeurIPS 2025)
Authors: the QuantaAlpha team (a joint effort of Tsinghua University, Peking University, the Chinese Academy of Sciences, CMU, and HKUST)
If you found this article helpful, please Star the project to support the open-source community! ⭐

夜雨聆风