LLM Wiki
A pattern for building personal knowledge bases using LLMs.
This is an idea file, designed to be copy-pasted into your own LLM agent (e.g. OpenAI Codex, Claude Code, OpenCode / Pi, etc.). Its goal is to communicate the high-level idea; your agent will build out the specifics in collaboration with you.
The core idea
Most people’s experience with LLMs and documents looks like RAG: you upload a collection of files, the LLM retrieves relevant chunks at query time, and generates an answer. This works, but the LLM is rediscovering knowledge from scratch on every question. There’s no accumulation. Ask a subtle question that requires synthesizing five documents, and the LLM has to find and piece together the relevant fragments every time. Nothing is built up. NotebookLM, ChatGPT file uploads, and most RAG systems work this way.
The idea here is different. Instead of just retrieving from raw documents at query time, the LLM incrementally builds and maintains a persistent wiki — a structured, interlinked collection of markdown files that sits between you and the raw sources. When you add a new source, the LLM doesn’t just index it for later retrieval. It reads it, extracts the key information, and integrates it into the existing wiki — updating entity pages, revising topic summaries, noting where new data contradicts old claims, strengthening or challenging the evolving synthesis. The knowledge is compiled once and then kept current, not re-derived on every query.
This is the key difference: the wiki is a persistent, compounding artifact. The cross-references are already there. The contradictions have already been flagged. The synthesis already reflects everything you’ve read. The wiki keeps getting richer with every source you add and every question you ask.
You never (or rarely) write the wiki yourself — the LLM writes and maintains all of it. You’re in charge of sourcing, exploration, and asking the right questions. The LLM does all the grunt work — the summarizing, cross-referencing, filing, and bookkeeping that makes a knowledge base actually useful over time. In practice, I have the LLM agent open on one side and Obsidian open on the other. The LLM makes edits based on our conversation, and I browse the results in real time — following links, checking the graph view, reading the updated pages. Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase.
This can apply to a lot of different contexts. A few examples:
- Personal: tracking your own goals, health, psychology, self-improvement — filing journal entries, articles, podcast notes, and building up a structured picture of yourself over time.
- Research: going deep on a topic over weeks or months — reading papers, articles, reports, and incrementally building a comprehensive wiki with an evolving thesis.
- Reading a book: filing each chapter as you go, building out pages for characters, themes, plot threads, and how they connect. By the end you have a rich companion wiki. Think of fan wikis like Tolkien Gateway — thousands of interlinked pages covering characters, places, events, languages, built by a community of volunteers over years. You could build something like that personally as you read, with the LLM doing all the cross-referencing and maintenance.
- Business/team: an internal wiki maintained by LLMs, fed by Slack threads, meeting transcripts, project documents, customer calls. Possibly with humans in the loop reviewing updates. The wiki stays current because the LLM does the maintenance that no one on the team wants to do.
- Competitive analysis, due diligence, trip planning, course notes, hobby deep-dives — anything where you’re accumulating knowledge over time and want it organized rather than scattered.
Architecture
There are three layers:
Raw sources — your curated collection of source documents. Articles, papers, images, data files. These are immutable — the LLM reads from them but never modifies them. This is your source of truth.
The wiki — a directory of LLM-generated markdown files. Summaries, entity pages, concept pages, comparisons, an overview, a synthesis. The LLM owns this layer entirely. It creates pages, updates them when new sources arrive, maintains cross-references, and keeps everything consistent. You read it; the LLM writes it.
The schema — a document (e.g. CLAUDE.md for Claude Code or AGENTS.md for Codex) that tells the LLM how the wiki is structured, what the conventions are, and what workflows to follow when ingesting sources, answering questions, or maintaining the wiki. This is the key configuration file — it’s what makes the LLM a disciplined wiki maintainer rather than a generic chatbot. You and the LLM co-evolve this over time as you figure out what works for your domain.
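To make this concrete, a starting schema might look something like the sketch below. The layout, conventions, and workflow here are illustrative assumptions, not a prescribed format — your own schema will diverge as you and the LLM refine it:

```markdown
# AGENTS.md — wiki schema (illustrative sketch)

## Layout
- raw/      immutable source documents; read, never modify
- wiki/     LLM-maintained pages, connected with [[wikilinks]]
- index.md  catalog of all wiki pages; update on every ingest
- log.md    append-only history of ingests, queries, lint passes

## Conventions
- One page per entity or concept; kebab-case filenames.
- Every claim cites the source page it came from.
- Flag contradictions between sources explicitly rather than silently overwriting.

## Ingest workflow
1. Read the new file in raw/ and discuss key takeaways with the user.
2. Write a summary page in wiki/; update index.md and affected entity/concept pages.
3. Append a log.md entry: `## [YYYY-MM-DD] ingest | Title`.
```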
Operations
Ingest. You drop a new source into the raw collection and tell the LLM to process it. An example flow: the LLM reads the source, discusses key takeaways with you, writes a summary page in the wiki, updates the index, updates relevant entity and concept pages across the wiki, and appends an entry to the log. A single source might touch 10-15 wiki pages. Personally I prefer to ingest sources one at a time and stay involved — I read the summaries, check the updates, and guide the LLM on what to emphasize. But you could also batch-ingest many sources at once with less supervision. It’s up to you to develop the workflow that fits your style and document it in the schema for future sessions.
Query. You ask questions against the wiki. The LLM searches for relevant pages, reads them, and synthesizes an answer with citations. Answers can take different forms depending on the question — a markdown page, a comparison table, a slide deck (Marp), a chart (matplotlib), a canvas. The important insight: good answers can be filed back into the wiki as new pages. A comparison you asked for, an analysis, a connection you discovered — these are valuable and shouldn’t disappear into chat history. This way your explorations compound in the knowledge base just like ingested sources do.
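For instance, a filed answer might be a small comparison page like this hypothetical sketch (names and layout invented for illustration):

```markdown
---
tags: [comparison]
created: 2026-04-02
---
# Source A vs. Source B on topic X

| Question        | Source A's claim | Source B's claim |
| --------------- | ---------------- | ---------------- |
| Headline result | ...              | ...              |
| Methodology     | ...              | ...              |

See also: [[source-a]], [[source-b]], [[topic-x]]
```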
Lint. Periodically, ask the LLM to health-check the wiki. Look for: contradictions between pages, stale claims that newer sources have superseded, orphan pages with no inbound links, important concepts mentioned but lacking their own page, missing cross-references, data gaps that could be filled with a web search. The LLM is good at suggesting new questions to investigate and new sources to look for. This keeps the wiki healthy as it grows.
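One of these checks — orphan detection — is simple enough to sketch as a script. Assuming Obsidian-style `[[wikilinks]]` and a flat `wiki/` directory (both assumptions about your setup), a page is an orphan if no other page links to it:

```shell
# Hypothetical orphan check: list wiki pages with no inbound [[wikilinks]].
# Tiny sample wiki for demonstration:
mkdir -p wiki
printf 'See [[alpha]] for background.\n'  > wiki/index.md
printf 'Alpha builds on [[beta]].\n'      > wiki/alpha.md
printf 'Beta is linked from alpha.\n'     > wiki/beta.md
printf 'Nothing links to this page.\n'    > wiki/orphan.md

for f in wiki/*.md; do
  name=$(basename "$f" .md)
  # An inbound link is "[[name]]" appearing in any page other than $f itself.
  if ! grep -l "\[\[$name\]\]" wiki/*.md | grep -qv "^$f$"; then
    echo "orphan: $name"
  fi
done
```

Note that the index page gets flagged too, since nothing links to it; a real version would exempt it and other deliberate root pages.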
Indexing and logging
Two special files help the LLM (and you) navigate the wiki as it grows. They serve different purposes:
index.md is content-oriented. It’s a catalog of everything in the wiki — each page listed with a link, a one-line summary, and optionally metadata like date or source count. Organized by category (entities, concepts, sources, etc.). The LLM updates it on every ingest. When answering a query, the LLM reads the index first to find relevant pages, then drills into them. This works surprisingly well at moderate scale (~100 sources, ~hundreds of pages) and avoids the need for embedding-based RAG infrastructure.
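As a sketch (categories and entries here are invented for illustration), an index.md might look like:

```markdown
# Index

## Entities
- [[ada-lovelace]] — mathematician; appears in 3 sources (updated 2026-04-02)

## Concepts
- [[analytical-engine]] — Babbage's proposed general-purpose computer

## Sources
- [[2026-04-02-example-article]] — summary of "Example Article"
```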
log.md is chronological. It’s an append-only record of what happened and when — ingests, queries, lint passes. A useful tip: if each entry starts with a consistent prefix (e.g. ## [2026-04-02] ingest | Article Title), the log becomes parseable with simple unix tools — grep "^## \[" log.md | tail -5 gives you the last 5 entries. The log gives you a timeline of the wiki’s evolution and helps the LLM understand what’s been done recently.
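As a concrete illustration (the entries below are made up, but follow the prefix convention above), a log in this shape becomes trivially sliceable:

```shell
# Build a tiny sample log.md in the assumed "## [date] kind | title" convention.
cat > log.md <<'EOF'
## [2026-03-30] ingest | Example Article A
Summary page created; 12 wiki pages touched.
## [2026-04-01] query | How do sources A and B disagree?
Answer filed as wiki/comparisons/a-vs-b.md.
## [2026-04-02] lint | full pass
Flagged 2 stale claims and 1 orphan page.
EOF

grep "^## \[" log.md | tail -2   # the two most recent entries
grep -c "ingest" log.md          # how many ingest lines so far
```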
Optional: CLI tools
At some point you may want to build small tools that help the LLM operate on the wiki more efficiently. A search engine over the wiki pages is the most obvious one — at small scale the index file is enough, but as the wiki grows you want proper search. qmd is a good option: it’s a local search engine for markdown files with hybrid BM25/vector search and LLM re-ranking, all on-device. It has both a CLI (so the LLM can shell out to it) and an MCP server (so the LLM can use it as a native tool). You could also build something simpler yourself — the LLM can help you vibe-code a naive search script as the need arises.
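A naive version of such a search script — the kind of stand-in you might vibe-code before adopting qmd — could just count query-term hits per page (all file names and content here are illustrative):

```shell
# Naive wiki search: rank pages by how many query terms they contain.
# Sample pages for demonstration:
mkdir -p wiki
printf 'Transformers use attention layers.\n'   > wiki/attention.md
printf 'MoE routing sends tokens to experts.\n' > wiki/moe.md

search() {
  for f in wiki/*.md; do
    score=0
    for term in "$@"; do
      # grep -ci counts matching lines, case-insensitively; crude but fine at small scale.
      hits=$(grep -ci "$term" "$f")
      score=$((score + hits))
    done
    [ "$score" -gt 0 ] && echo "$score $f"
  done | sort -rn
}

search attention layers
```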
Tips and tricks
- Obsidian Web Clipper is a browser extension that converts web articles to markdown. Very useful for quickly getting sources into your raw collection.
- Download images locally. In Obsidian Settings → Files and links, set “Attachment folder path” to a fixed directory (e.g. raw/assets/). Then in Settings → Hotkeys, search for “Download” to find “Download attachments for current file” and bind it to a hotkey (e.g. Ctrl+Shift+D). After clipping an article, hit the hotkey and all images get downloaded to local disk. This is optional but useful — it lets the LLM view and reference images directly instead of relying on URLs that may break. Note that LLMs can’t natively read markdown with inline images in one pass — the workaround is to have the LLM read the text first, then view some or all of the referenced images separately to gain additional context. It’s a bit clunky but works well enough.
- Obsidian’s graph view is the best way to see the shape of your wiki — what’s connected to what, which pages are hubs, which are orphans.
- Marp is a markdown-based slide deck format. Obsidian has a plugin for it. Useful for generating presentations directly from wiki content.
- Dataview is an Obsidian plugin that runs queries over page frontmatter. If your LLM adds YAML frontmatter to wiki pages (tags, dates, source counts), Dataview can generate dynamic tables and lists.
- The wiki is just a git repo of markdown files. You get version history, branching, and collaboration for free.
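For example, if the schema has the LLM emit frontmatter like the page below, a Dataview query such as `TABLE created, sources FROM "wiki" WHERE contains(tags, "entity") SORT created DESC` can render a live table of all entity pages. The frontmatter fields here are invented for illustration; check the Dataview documentation for exact query syntax:

```markdown
---
tags: [entity, person]
created: 2026-04-02
sources: 3
---
# Ada Lovelace
First described the [[analytical-engine]] as a general-purpose machine...
```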
Why this works
The tedious part of maintaining a knowledge base is not the reading or the thinking — it’s the bookkeeping. Updating cross-references, keeping summaries current, noting when new data contradicts old claims, maintaining consistency across dozens of pages. Humans abandon wikis because the maintenance burden grows faster than the value. LLMs don’t get bored, don’t forget to update a cross-reference, and can touch 15 files in one pass. The wiki stays maintained because the cost of maintenance is near zero.
The human’s job is to curate sources, direct the analysis, ask good questions, and think about what it all means. The LLM’s job is everything else.
The idea is related in spirit to Vannevar Bush’s Memex (1945) — a personal, curated knowledge store with associative trails between documents. Bush’s vision was closer to this than to what the web became: private, actively curated, with the connections between documents as valuable as the documents themselves. The part he couldn’t solve was who does the maintenance. The LLM handles that.
Note
This document is intentionally abstract. It describes the idea, not a specific implementation. The exact directory structure, the schema conventions, the page formats, the tooling — all of that will depend on your domain, your preferences, and your LLM of choice. Everything mentioned above is optional and modular — pick what’s useful, ignore what isn’t. For example: your sources might be text-only, so you don’t need image handling at all. Your wiki might be small enough that the index file is all you need, no search engine required. You might not care about slide decks and just want markdown pages. You might want a completely different set of output formats. The right way to use this is to share it with your LLM agent and work together to instantiate a version that fits your needs. The document’s only job is to communicate the pattern. Your LLM can figure out the rest.