codex 使用笔记:AI Research Skills-夜雨聆风

codex 使用笔记:AI Research Skills

skill路径地址：https://github.com/Orchestra-Research/AI-Research-SKILLs/tree/main

codex 安装说明：直接跟codex说“https://github.com/Orchestra-Research/AI-Research-SKILLs/tree/main 帮我安装这里面的所有skills” 即可

1 通用使用方法

直接点名 skill 名称，或者描述一个明显匹配的任务

用 vllm skill 帮我把这个模型部署成 OpenAI-compatible API。

用 ml-paper-writing skill 帮我把实验结果整理成 NeurIPS 风格的 related work 和 method。

如果不确定该用哪个 skill，可以直接说目标，例如“我要微调 Llama 3”、“我要写系统论文”。Codex 会根据 skill 的描述自动选择合适的 skill。

2 主要几种功能

2.1 研究编排与写作

Skill	什么时候用	典型提示词
`0-autoresearch-skill`	想让 agent 端到端推进 AI 研究项目，从文献、想法、实验到论文。	`用 autoresearch 帮我围绕 <主题> 建立研究计划、实验循环和产出结构。`
`brainstorming-research-ideas`	需要系统地产生高价值研究选题、找切入点或 pivot。	`用 brainstorming-research-ideas 帮我为 <方向> 生成 10 个可验证研究想法。`
`creative-thinking-for-research`	想用类比、约束重组、跨域组合等方式产生更不寻常的想法。	`用 creative-thinking-for-research 从 <领域A> 和 <领域B> 交叉生成新课题。`
`ml-paper-writing`	写 NeurIPS、ICML、ICLR、ACL、AAAI、COLM 等 ML/AI 论文。	`用 ml-paper-writing 帮我把这些实验结果写成 introduction 和 method。`
`systems-paper-writing`	写 OSDI、SOSP、ASPLOS、NSDI、EuroSys 等系统论文。	`用 systems-paper-writing 帮我重构这篇系统论文的 motivation 和 evaluation。`
`academic-plotting`	为论文生成高质量图表、系统架构图、实验图。	`用 academic-plotting 根据这些结果生成论文级柱状图和 caption。`
`presenting-conference-talks`	从论文生成会议报告、Beamer/PPTX、speaker notes。	`用 presenting-conference-talks 把这篇论文做成 15 分钟 oral talk。`

2.2 模型架构与训练配方

Skill	什么时候用	典型提示词
`litgpt`	学习或实现干净的 LLM 架构，训练/微调 Llama、Gemma、Phi、Qwen、Mistral。	`用 litgpt 帮我实现一个最小可读的 Llama fine-tuning 脚本。`
`nanogpt`	教学、从零理解 GPT、复现小型 GPT-2 或 Shakespeare 训练。	`用 nanogpt 帮我讲清楚 GPT 训练循环并改成我的数据集。`
`mamba`	研究或使用状态空间模型、线性时间长序列模型。	`用 mamba skill 帮我比较 Mamba 和 Transformer 在长上下文任务上的取舍。`
`rwkv`	研究 RNN+Transformer 混合架构、无限上下文、无 KV cache 推理。	`用 rwkv 帮我设计一个 RWKV 推理 demo。`
`torchtitan`	使用 PyTorch 原生分布式预训练 Llama/DeepSeek，自带 4D 并行。	`用 torchtitan 帮我规划 64 张 GPU 的 Llama 预训练配置。`
`ml-training-recipes`	通用 PyTorch 训练配方、调 loss spike、OOM、优化器、学习率、混合精度。	`用 ml-training-recipes 帮我诊断这个训练 loss 爆炸问题。`

2.3 Tokenization 与数据处理

Skill	什么时候用	典型提示词
`huggingface-tokenizers`	训练 BPE、WordPiece、Unigram tokenizer，追求高速生产级 tokenization。	`用 huggingface-tokenizers 帮我为中文英文混合语料训练 tokenizer。`
`sentencepiece`	多语言、CJK、原始 Unicode 文本、无需预分词的 tokenizer。	`用 sentencepiece 帮我训练一个适合日中英语料的 Unigram tokenizer。`
`ray-data`	分布式 ETL、批量推理、多模态数据加载、CPU/GPU streaming 数据处理。	`用 ray-data 把这些 Parquet 数据做分布式预处理管线。`
`nemo-curator`	大规模训练数据清洗、去重、质量过滤、PII/NSFW 处理。	`用 nemo-curator 帮我设计网页语料清洗和去重流程。`

2.4 微调与后训练

Skill	什么时候用	典型提示词
`axolotl`	YAML 驱动的 LLM 微调，LoRA/QLoRA、DPO/KTO/ORPO/GRPO、多模态。	`用 axolotl 给我写一个 Llama 3 QLoRA 微调配置。`
`llama-factory`	WebUI 或低代码微调，支持大量模型和量化 QLoRA。	`用 llama-factory 帮我设置一个无代码 SFT 微调流程。`
`unsloth`	快速低显存 LoRA/QLoRA，追求 2-5x 训练加速。	`用 unsloth 把这个 QLoRA 微调脚本改到更省显存。`
`peft`	HuggingFace PEFT，LoRA、QLoRA、DoRA、多 adapter。	`用 peft 帮我给这个 transformers 模型加 LoRA adapter。`
`trl-fine-tuning`	SFT、DPO、PPO、GRPO、reward model 等 RLHF/偏好对齐。	`用 trl-fine-tuning 帮我从 SFT 数据到 DPO 训练搭一条 pipeline。`
`grpo-rl-training`	用 TRL 做 GRPO，适合理解、推理、任务奖励优化。	`用 grpo-rl-training 帮我写一个数学推理 GRPO 训练方案。`
`openrlhf`	Ray + vLLM + ZeRO 的高性能 PPO/GRPO/RLOO/DPO 训练。	`用 openrlhf 帮我规划 7B 模型的分布式 GRPO 训练。`
`simpo`	Reference-free 偏好优化，比 DPO 更简单高效。	`用 simpo 帮我把 DPO 训练方案改成无 reference model 的 SimPO。`
`verl`	Volcano Engine RL，大规模 RLHF/GRPO/PPO 后训练。	`用 verl 帮我搭一个可扩展的 GRPO 后训练任务。`
`slime`	Megatron + SGLang 的 RL 训练，适合 GLM 和自定义数据生成。	`用 slime 帮我设计 Megatron 集成的 RL post-training 流程。`
`miles`	slime 的企业级 fork，FP8/INT4、MoE、speculative RL。	`用 miles 帮我优化 MoE 模型的 RL 训练吞吐。`
`torchforge`	Meta PyTorch-native agentic RL，算法和基础设施解耦。	`用 torchforge 帮我写一个可替换算法的 RL 训练骨架。`

2.5 机制可解释性

Skill	什么时候用	典型提示词
`transformer-lens`	HookPoint、activation cache、attention pattern、activation patching。	`用 transformer-lens 分析这个 transformer 的 induction heads。`
`saelens`	训练/分析 Sparse Autoencoder，研究 superposition 与可解释特征。	`用 saelens 帮我训练 SAE 并解释 top features。`
`pyvene`	因果干预、activation patching、interchange intervention training。	`用 pyvene 设计一个检验模型内部因果机制的实验。`
`nnsight`	远程或本地解释大型模型内部，可跑 70B+ 解释实验。	`用 nnsight 帮我在大模型上做 activation intervention。`

2.6 分布式训练与基础设施

Skill	什么时候用	典型提示词
`megatron-core`	训练 1B+ 到超大模型，tensor/pipeline/sequence/context/expert 并行。	`用 megatron-core 帮我配置 32 GPU 的 tensor + pipeline parallel。`
`deepspeed`	ZeRO、pipeline parallel、FP16/BF16/FP8、1-bit Adam。	`用 deepspeed 帮我把这个训练脚本改成 ZeRO-3。`
`pytorch-fsdp2`	PyTorch FSDP2、DTensor、DeviceMesh、distributed checkpoint。	`用 pytorch-fsdp2 帮我给这个模型加 fully_shard。`
`accelerate`	最少代码改动添加 DDP/FSDP/DeepSpeed/Megatron 分布式训练。	`用 accelerate 把单卡 PyTorch 脚本改成多卡训练。`
`pytorch-lightning`	Trainer、callbacks、自动 DDP/FSDP/DeepSpeed，减少训练样板代码。	`用 pytorch-lightning 重构这个训练循环。`
`ray-train`	多机分布式训练、Ray Tune、容错、弹性扩缩容。	`用 ray-train 帮我做多节点训练和超参搜索。`
`modal`	Serverless GPU、ML API 部署、批处理任务自动扩缩。	`用 modal 把这个推理脚本部署成可调用 API。`
`skypilot`	多云 GPU 调度、spot recovery、跨云成本优化。	`用 skypilot 帮我找最低成本的 8xA100 训练方案。`
`lambda-labs`	Lambda GPU 云，SSH、持久文件系统、多节点训练。	`用 lambda-labs 帮我规划 H100 实例训练环境。`

2.7 推理、Serving 与加速

Skill	什么时候用	典型提示词
`vllm`	PagedAttention、continuous batching、OpenAI-compatible API、高吞吐 serving。	`用 vllm 把这个 HuggingFace 模型部署成 OpenAI API。`
`tensorrt-llm`	NVIDIA GPU 上极致吞吐/低延迟，FP8/INT4、in-flight batching。	`用 tensorrt-llm 优化 H100 上的 70B 推理。`
`llama-cpp`	CPU、Apple Silicon、消费级 GPU、GGUF 本地推理。	`用 llama-cpp 帮我在 Mac/CPU 上跑这个模型。`
`sglang`	结构化生成、RadixAttention、agent 工作流、prefix sharing。	`用 sglang 帮我做 JSON constrained decoding serving。`
`speculative-decoding`	draft model、Medusa、lookahead，降低延迟提升 1.5-3.6x。	`用 speculative-decoding 给这个服务设计低延迟方案。`
`flash-attention`	长序列 attention 加速，减少显存，PyTorch SDPA/flash-attn。	`用 flash-attention 优化这个长上下文训练脚本。`

2.8 量化、压缩与模型合并]

Skill	什么时候用	典型提示词
`bitsandbytes`	INT8/NF4/FP4、QLoRA、8-bit optimizer，显存减少 50-75%。	`用 bitsandbytes 把这个 13B 模型改成 4-bit QLoRA。`
`gptq`	4-bit post-training quantization，大模型部署到消费级 GPU。	`用 gptq 给这个模型做 4-bit 量化并评估 perplexity。`
`awq`	Activation-aware 4-bit 量化，速度和准确率兼顾，适合 instruction/VLM。	`用 awq 帮我量化一个 70B instruction model。`
`hqq`	无校准数据的 4/3/2-bit Half-Quadratic Quantization。	`用 hqq 帮我在没有 calibration data 时量化模型。`
`gguf`	llama.cpp/GGUF 量化，CPU/Metal/消费级设备部署。	`用 gguf 帮我把模型转成适合 llama.cpp 的 Q4_K_M。`
`knowledge-distillation`	teacher-student 蒸馏，大模型能力转移到小模型。	`用 knowledge-distillation 设计 70B 到 7B 的蒸馏流程。`
`model-pruning`	Wanda、SparseGPT、N:M sparsity，减少模型大小和推理开销。	`用 model-pruning 帮我做 50% sparsity 的剪枝实验。`
`model-merging`	mergekit、SLERP、TIES、DARE，把多个微调模型能力合并。	`用 model-merging 把 math、code、chat 三个 adapter 合成一个模型。`
`moe-training`	Mixture of Experts，稀疏模型、路由、负载均衡、expert parallel。	`用 moe-training 帮我设计 Mixtral 风格 MoE 训练方案。`
`long-context`	RoPE、YaRN、ALiBi、position interpolation，把模型扩到 32k-128k。	`用 long-context 帮我把 4k 模型扩展到 32k context。`

2.9 评估与实验追踪

kill	什么时候用	典型提示词
`lm-evaluation-harness`	MMLU、GSM8K、HumanEval、TruthfulQA 等标准 LLM benchmark。	`用 lm-evaluation-harness 评估这个模型的 MMLU 和 GSM8K。`
`bigcode-evaluation-harness`	HumanEval、MBPP、MultiPL-E、pass@k，代码模型评测。	`用 bigcode-evaluation-harness 比较两个代码模型的 pass@k。`
`nemo-evaluator`	NVIDIA 评估 SDK，100+ benchmarks、多 backend、Docker/Slurm/云。	`用 nemo-evaluator 设计 Slurm 上的可复现模型评测。`
`weights-and-biases`	W&B 实验追踪、sweeps、artifacts、model registry。	`用 weights-and-biases 给训练脚本加自动 logging 和 sweeps。`
`mlflow`	开源 ML lifecycle，tracking、registry、deployment、autologging。	`用 mlflow 帮我组织实验追踪和模型注册。`
`tensorboard`	scalars、histograms、embeddings、profiling、训练可视化。	`用 tensorboard 帮我给训练循环加 loss 和 profiler 可视化。`
`swanlab`	开源/自托管实验追踪，轻量 dashboard 和媒体日志。	`用 swanlab 给这个训练项目加本地实验追踪。`

2.10 安全、对齐与 Guardrails

Skill	什么时候用	典型提示词
`constitutional-ai`	使用自我批判/修订和 RLAIF 做 harmlessness 对齐。	`用 constitutional-ai 帮我设计一个安全对齐数据生成流程。`
`llamaguard`	Meta LlamaGuard 输入/输出安全分类，内容安全过滤。	`用 llamaguard 给我的聊天服务加输入输出安全检查。`
`nemo-guardrails`	Colang 可编程 guardrails、事实核查、PII、jailbreak 检测。	`用 nemo-guardrails 给 RAG app 加防幻觉和 PII 过滤。`
`prompt-guard`	Meta 86M prompt injection/jailbreak detector，保护 RAG 和 agent。	`用 prompt-guard 帮我防御 prompt injection。`

2.11 Agent、RAG 与结构化生成

Skill	什么时候用	典型提示词
`langchain`	Agent、chains、tool calling、memory、RAG 快速原型和生产应用。	`用 langchain 帮我构建一个带工具调用的 agent。`
`llamaindex`	文档 ingestion、index、query engine、RAG 数据框架。	`用 llamaindex 为这些 PDF 做文档问答系统。`
`crewai`	多 agent 角色协作、顺序/层级执行、自主工作流。	`用 crewai 设计一个 researcher/writer/reviewer 多 agent 流程。`
`autogpt`	持续运行的自主 agent、可视化 workflow、复杂自动化。	`用 autogpt 帮我规划一个持续执行的调研 agent。`
`a-evolve`	自动进化/优化 agent prompt、skill、workflow 和评测循环。	`用 a-evolve 帮我建立一个自我改进 agent 的 benchmark loop。`
`chroma`	本地/open-source embedding DB，简单 RAG 和语义搜索。	`用 chroma 帮我快速搭一个本地 RAG demo。`
`faiss`	高性能向量相似搜索，十亿级向量、GPU index。	`用 faiss 帮我做大规模 embedding 检索实验。`
`sentence-transformers`	生成文本/图像 embedding，语义搜索、聚类、RAG。	`用 sentence-transformers 给我的语料生成 multilingual embeddings。`
`pinecone`	托管向量数据库，生产 RAG、低延迟、自动扩缩。	`用 pinecone 设计一个生产级 RAG 索引方案。`
`qdrant`	Rust 向量库，hybrid search、filtering、生产 RAG。	`用 qdrant 帮我实现带 metadata filter 的 hybrid search。`
`dspy`	声明式 LM programming，自动优化 prompt，模块化 RAG/agent。	`用 dspy 优化我的 RAG prompt 和 evaluator。`
`instructor`	Pydantic 结构化输出、自动重试、类型安全 extraction。	`用 instructor 把 LLM 输出稳定解析成 Pydantic schema。`
`guidance`	regex/grammar 约束生成，多步 workflow，保证格式。	`用 guidance 让模型只输出符合 grammar 的 JSON。`
`outlines`	FSM/grammar/Pydantic constrained decoding，本地模型/vLLM 支持。	`用 outlines 给 vLLM 加强制 JSON schema 输出。`

2.12 观测与调试

Skill	什么时候用	典型提示词
`langsmith`	LangChain/LLM app tracing、评测、production monitoring。	`用 langsmith 帮我定位这个 agent 调用链为什么失败。`
`phoenix`	开源 AI observability，OpenTelemetry tracing、LLM eval。	`用 phoenix 给我的 RAG app 加 trace 和 eval dashboard。`

2.13多模态、语音、图像与机器人

Skill	什么时候用	典型提示词
`clip`	图文匹配、zero-shot 图像分类、跨模态检索。	`用 clip 做一个文本搜图 demo。`
`whisper`	多语言语音识别、转录、翻译、语言识别。	`用 whisper 帮我批量转录这些音频。`
`llava`	视觉对话、图片问答、多轮 image chat。	`用 llava 帮我搭一个图片问答 chatbot。`
`stable-diffusion`	文生图、图生图、inpainting、diffusers pipeline。	`用 stable-diffusion 生成一组产品海报图。`
`segment-anything`	SAM 零样本图像分割，点/框/mask prompt。	`用 segment-anything 把图片里所有物体分割出来。`
`blip-2`	图像 caption、VQA、image-text retrieval。	`用 blip-2 给图片生成 caption 并做 VQA。`
`audiocraft`	MusicGen/AudioGen，文本生成音乐或声音效果。	`用 audiocraft 根据这段描述生成背景音乐方案。`
`openpi`	Physical Intelligence OpenPI 机器人策略 fine-tuning/serving。	`用 openpi 帮我把 pi0 模型适配到自定义机器人数据。`
`openvla-oft`	OpenVLA-OFT/OFT+ 机器人视觉语言动作模型训练与评估。	`用 openvla-oft 复现 LIBERO 上的 VLA action head 训练。`
`cosmos-policy`	NVIDIA Cosmos Policy 在 LIBERO/RoboCasa 的机器人策略评估。	`用 cosmos-policy 帮我做 headless GPU 机器人策略评测。`