How Cursor Indexes Codebases Fast

Merkle Trees Explained Simply

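A Merkle tree is a tree structure in which every leaf node is labeled with the cryptographic hash of a data block, and every non-leaf node is labeled with the cryptographic hash of its child nodes' labels. This creates a hierarchy in which a change at any level can be detected efficiently by comparing hashes.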

Think of them as a fingerprinting system for data:

Each piece of data (like a file) gets its own unique fingerprint (hash)

Pairs of fingerprints are combined and given a new fingerprint

This process continues until you have just one master fingerprint (the root hash)

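The root hash summarizes all the data contained in the individual blocks and serves as a cryptographic commitment to the entire dataset. The beauty of this approach is that if any single block changes, every fingerprint above it changes too, and ultimately so does the root hash.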
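To make the fingerprinting idea concrete, here is a minimal sketch of a binary Merkle tree in Python. It is a generic illustration, not Cursor's code: SHA-256, the `merkle_root` helper, and the choice to duplicate the last hash on odd-sized levels are all assumptions made for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Fingerprint a block of bytes."""
    return hashlib.sha256(data).hexdigest()


def merkle_root(blocks: list[bytes]) -> str:
    """Hash every block, then hash pairs upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [sha256_hex(block) for block in blocks]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: repeat the last hash so every node has a pair.
            level.append(level[-1])
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("utf-8"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


files = [b"def a(): pass", b"def b(): pass", b"def c(): pass"]
root_before = merkle_root(files)

files[1] = b"def b(): return 1"  # edit a single block
root_after = merkle_root(files)

print(root_before != root_after)  # True: one changed block changes the root
```

Comparing two root hashes is therefore enough to tell whether anything in the dataset changed.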

How Cursor Uses Merkle Trees for Codebase Indexing

Step 1: Code Chunking and Processing

Cursor first chunks your codebase files locally, splitting code into semantically meaningful pieces before any processing occurs.

Step 2: Merkle Tree Construction and Synchronization

When codebase indexing is enabled, Cursor scans the folder opened in the editor and computes a Merkle tree of hashes of all valid files. This Merkle tree is then synchronized with Cursor's server, as detailed in Cursor's security documentation.

Step 3: Embedding Generation

After the chunks are sent to Cursor's server, embeddings are created using either OpenAI's embedding API or a custom embedding model (I couldn’t verify this). These vector representations capture the semantic meaning of the code chunks.

Step 4: Storage and Indexing

The embeddings, along with metadata like start/end line numbers and file paths, are stored in a remote vector database (Turbopuffer). To maintain privacy while still enabling path-based filtering, Cursor stores an obfuscated relative file path with each vector. Importantly, according to Cursor's founder, "None of your code is stored in our databases. It's gone after the life of the request."

Step 5: Periodic Updates Using Merkle Trees

Every 10 minutes, Cursor checks for hash mismatches, using the Merkle tree to identify which files have changed. Only the changed files need to be uploaded, significantly reducing bandwidth usage, as explained in Cursor's security documentation. This is where the Merkle tree structure provides its greatest value: enabling efficient incremental updates.
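The mismatch check can be sketched as a walk over two folder-shaped hash trees that skips any subtree whose hash is unchanged. This is an illustration under stated assumptions, not Cursor's implementation: the nested-dict folder model and the `build` and `changed_files` helpers are hypothetical, and deleted files are not handled.

```python
import hashlib


def build(node):
    """Return (hash, children) for a file (bytes) or a folder (dict)."""
    if isinstance(node, bytes):
        return hashlib.sha256(node).hexdigest(), None
    children = {name: build(child) for name, child in sorted(node.items())}
    # A folder's hash covers the names and hashes of everything inside it.
    joined = "".join(f"{name}:{h};" for name, (h, _) in children.items())
    return hashlib.sha256(joined.encode("utf-8")).hexdigest(), children


def changed_files(old, new, prefix=""):
    """Yield paths that are new or modified, pruning subtrees with equal hashes."""
    if old is not None and old[0] == new[0]:
        return  # identical subtree: nothing below it needs checking
    new_children = new[1]
    if new_children is None:
        yield prefix  # a file whose hash is missing or different on the old side
        return
    old_children = old[1] if old is not None and old[1] is not None else {}
    for name, child in new_children.items():
        path = f"{prefix}/{name}" if prefix else name
        yield from changed_files(old_children.get(name), child, path)


v1 = {"src": {"a.py": b"x = 1", "b.py": b"y = 2"}, "README.md": b"hi"}
v2 = {"src": {"a.py": b"x = 1", "b.py": b"y = 3"}, "README.md": b"hi"}

print(list(changed_files(build(v1), build(v2))))  # ['src/b.py']
```

Because `README.md` and `src/a.py` hash to the same values in both versions, the walk never treats them as upload candidates.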

Code Chunking Strategies

The effectiveness of the codebase indexing largely depends on how code is chunked. While my previous explanation didn't go into detail about chunking methods, this blog post about building a Cursor-like codebase feature provides some insights:

While simple approaches split code by characters, words, or lines, they often miss semantic boundaries—resulting in degraded embedding quality.

You can split code based on a fixed token count, but this can cut off code blocks like functions or classes mid-way.

A more effective approach is to use an intelligent splitter that understands code structure, such as recursive text splitters that use high-level delimiters (e.g., class and function definitions) to split at appropriate semantic boundaries.

An even more elegant solution is to split the code based on its Abstract Syntax Tree (AST) structure. Traversing the AST depth-first, the splitter breaks code into sub-trees that fit within token limits. To avoid creating too many small chunks, sibling nodes are merged into larger chunks as long as they stay under the token limit. Tools like tree-sitter can be used for this AST parsing, supporting a wide range of programming languages.
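The AST-based strategy can be sketched with Python's built-in `ast` module standing in for tree-sitter, so this version only handles Python source. The `count_tokens` helper is a crude whitespace-based stand-in for a real tokenizer, and `chunk_source` is a hypothetical name. When an oversized node is split, its header line (for example `class Calculator:`) is dropped, and decorators are not included in a node's text; a production splitter would need to handle both.

```python
import ast


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def chunk_source(source: str, max_tokens: int = 40) -> list[str]:
    lines = source.splitlines()

    def text_of(node: ast.stmt) -> str:
        return "\n".join(lines[node.lineno - 1 : node.end_lineno])

    def split(nodes: list[ast.stmt]) -> list[str]:
        chunks, current = [], ""
        for node in nodes:
            piece = text_of(node)
            if count_tokens(piece) > max_tokens and getattr(node, "body", None):
                # Too large: flush the pending chunk, then descend into children.
                if current:
                    chunks.append(current)
                    current = ""
                chunks.extend(split(node.body))
            elif current and count_tokens(current) + count_tokens(piece) > max_tokens:
                # Merging this sibling would exceed the limit: start a new chunk.
                chunks.append(current)
                current = piece
            else:
                # Merge small siblings into one chunk.
                current = f"{current}\n{piece}" if current else piece
        if current:
            chunks.append(current)
        return chunks

    return split(ast.parse(source).body)


sample = '''
def add(a, b):
    return a + b

def sub(a, b):
    return a - b

class Calculator:
    def mul(self, a, b):
        return a * b

    def div(self, a, b):
        return a / b
'''

for chunk in chunk_source(sample, max_tokens=12):
    print(chunk)
    print("---")
```

With a generous limit the whole sample stays in one chunk; with a tight limit the class is split into its methods instead of being cut at an arbitrary line.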

How Embeddings Are Used at Inference Time

After covering how Cursor creates and stores code embeddings, a natural question arises: how are these embeddings actually used once they've been generated? This section explains the practical application of these embeddings during normal usage.

Semantic Search and Context Retrieval

When you interact with Cursor's AI features like asking questions about your codebase (using @Codebase or ⌘ Enter), the following process occurs:

Query Embedding: Cursor computes an embedding for your question or the code context you're working with.

Vector Similarity Search: This query embedding is sent to Turbopuffer (Cursor's vector database), which performs a nearest-neighbor search to find code chunks semantically similar to your query.

Local File Access: Cursor's client receives the results, which include obfuscated file paths and line ranges of the most relevant code chunks. Importantly, the actual code content remains on your machine and is retrieved locally.

Context Assembly: The client reads these relevant code chunks from your local files and sends them as context to the server for the LLM to process alongside your question.

Informed Response: The LLM now has the necessary context from your codebase to provide a more informed and relevant response to your question or to generate appropriate code completions.
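The retrieval steps above can be sketched end to end with toy data. Everything here is illustrative: the three-dimensional vectors are hand-written rather than produced by an embedding model, the in-memory list stands in for the remote vector database, paths are left unobfuscated for readability, and `IndexedChunk`, `nearest`, and `read_lines` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class IndexedChunk:
    vector: list[float]  # embedding stored remotely
    path: str            # obfuscated in the real system; plain here
    start_line: int
    end_line: int


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query: list[float], index: list[IndexedChunk], k: int = 2) -> list[IndexedChunk]:
    """Brute-force nearest-neighbor search by cosine similarity."""
    return sorted(index, key=lambda c: cosine(query, c.vector), reverse=True)[:k]


def read_lines(files: dict[str, str], hit: IndexedChunk) -> str:
    """Client side: the code itself is read from local files, not from the index."""
    lines = files[hit.path].splitlines()
    return "\n".join(lines[hit.start_line - 1 : hit.end_line])


local_files = {
    "auth.py": "def login(user):\n    return check(user)\n",
    "math_utils.py": "def add(a, b):\n    return a + b\n",
}
index = [
    IndexedChunk([0.9, 0.1, 0.0], "auth.py", 1, 2),
    IndexedChunk([0.0, 0.2, 0.9], "math_utils.py", 1, 2),
]

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "how does login work?"
hits = nearest(query_vector, index, k=1)
context = [read_lines(local_files, hit) for hit in hits]
print(context[0])
```

Note that the index only ever returns a path and a line range; the text that ends up in `context` comes from `local_files`.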

This embedding-powered retrieval allows for:

Contextual Code Generation: When writing new code, Cursor can reference similar implementations in your existing codebase, maintaining consistent patterns and styles.

Codebase Q&A: You can ask questions about your codebase and get answers informed by your actual code rather than generic responses.

Smart Code Completion: Code completions can be enhanced with awareness of your project's specific conventions and patterns.

Intelligent Refactoring: When refactoring code, the system can identify all related pieces across your codebase that might need similar changes.

Why Cursor Uses Merkle Trees

Many of these details are security-related, and thus can be found in Cursor’s security documentation.

1. Efficient Incremental Updates

By using a Merkle tree, Cursor can quickly identify exactly which files have changed since the last synchronization. Instead of re-uploading the entire codebase, it only needs to upload the specific files that have been modified. This is important for large codebases where re-indexing everything would be too expensive in terms of bandwidth and processing time.

2. Data Integrity Verification

The Merkle tree structure allows Cursor to efficiently verify that the files being indexed match what's stored on the server. The hierarchical hash structure makes it easy to detect any inconsistencies or corrupted data during transfer.

3. Optimized Caching

Cursor stores embeddings in a cache indexed by the hash of the chunk, ensuring that indexing the same codebase a second time is much faster. This is great for teams where multiple developers might be working with the same codebase.
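A cache keyed by chunk hash can be sketched in a few lines. This is a toy model, not Cursor's cache: `EmbeddingCache` and `fake_embed` are hypothetical, and the in-memory dict stands in for whatever storage the real service uses.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """Look up embeddings by the SHA-256 of the chunk text."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def get(self, chunk: str) -> list[float]:
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Only unseen content pays for an embedding call.
            self.misses += 1
            self._store[key] = self._embed_fn(chunk)
        return self._store[key]


def fake_embed(chunk: str) -> list[float]:
    # Placeholder for a real embedding call; returns a toy 2-d vector.
    return [float(len(chunk)), float(chunk.count("\n"))]


cache = EmbeddingCache(fake_embed)
chunks = ["def a(): pass", "def b(): pass"]

for chunk in chunks:  # first indexing run: every chunk is a miss
    cache.get(chunk)
for chunk in chunks:  # second run over identical content: all hits
    cache.get(chunk)

print(cache.misses)  # 2
```

Because the key depends only on content, a teammate indexing the same unchanged chunk would hit the same entry.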

4. Privacy-Preserving Indexing

To protect sensitive information in file paths, Cursor implements path obfuscation by splitting the path by '/' and '.' characters and encrypting each segment with a secret key stored on the client. While this still reveals some information about directory hierarchy, it hides most sensitive details.
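The shape of this obfuscation can be sketched as follows. One important substitution: the source describes encrypting each segment, but Python's standard library has no symmetric cipher, so this sketch uses a keyed HMAC-SHA256 (a one-way keyed hash) as a stand-in. The `obfuscate_path` helper, the key value, and the 8-character truncation are all assumptions for illustration.

```python
import hashlib
import hmac
import re


def obfuscate_path(path: str, key: bytes, length: int = 8) -> str:
    """Replace each path segment with a keyed digest, keeping '/' and '.' in place."""
    parts = re.split(r"([/.])", path)  # the capture group keeps the separators
    out = []
    for part in parts:
        if part in ("/", ".", ""):
            out.append(part)
        else:
            digest = hmac.new(key, part.encode("utf-8"), hashlib.sha256).hexdigest()
            out.append(digest[:length])
    return "".join(out)


key = b"client-side-secret"  # hypothetical key; kept on the client

a = obfuscate_path("src/billing/invoice.py", key)
b = obfuscate_path("src/billing/refund.py", key)
print(a)
print(b)
```

The two outputs share their first two segments, which shows the trade-off described above: the directory hierarchy and the fact that both files share an extension remain visible, while the names themselves do not.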

5. Git History Integration

When codebase indexing is enabled in a Git repository, Cursor also indexes the Git history. It stores commit SHAs, parent information, and obfuscated file names. To enable sharing the data structure for users in the same Git repo and on the same team, the secret key for obfuscating file names is derived from hashes of recent commit contents.

Embedding Models and Considerations

The choice of embedding model significantly impacts the quality of code search and understanding. While some systems use open-source models like all-MiniLM-L6-v2, Cursor likely uses either OpenAI's embedding models or custom embedding models specifically tuned for code. For specialized code embeddings, models like Microsoft's unixcoder-base or Voyage AI's voyage-code-2 are good for code-specific semantic understanding.

The embedding challenge is made more complex because embedding models have token limits. OpenAI's text-embedding-3-small model, for example, has a token limit of 8192. Effective chunking helps stay within token limits while preserving semantic meaning.

The Handshake Process

A key aspect of Cursor's Merkle tree implementation is the handshake process that occurs during synchronization. Logs from the Cursor application reveal that when initializing codebase indexing, Cursor creates a "merkle client" and performs a "startup handshake" with the server. This handshake involves sending the root hash of the locally computed Merkle tree to the server, as seen in Issue #2209 on GitHub and Issue #981 on GitHub.

The handshake process allows the server to determine which parts of the codebase need to be synced. Based on the handshake logs, we can see that Cursor computes the initial hash of the codebase and sends it to the server for verification, as documented in Issue #2209 on GitHub.
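As a rough mental model of the handshake, the sketch below has a client send its root hash to an in-memory stand-in for the server, which answers whether that root is already known. The actual wire protocol is not public, so `FakeIndexServer`, `startup_handshake`, `record_sync`, and the codebase identifier are all hypothetical, and the root is computed over a simplified two-level hash tree.

```python
import hashlib


def root_hash_of(files: dict[str, bytes]) -> str:
    """Two-level hash tree: hash each file, then hash the sorted (path, hash) list."""
    leaves = [
        f"{path}:{hashlib.sha256(data).hexdigest()}"
        for path, data in sorted(files.items())
    ]
    return hashlib.sha256("\n".join(leaves).encode("utf-8")).hexdigest()


class FakeIndexServer:
    """In-memory stand-in for the remote side; not Cursor's real API."""

    def __init__(self) -> None:
        self._known_roots: dict[str, str] = {}

    def startup_handshake(self, codebase_id: str, root_hash: str) -> bool:
        """Return True when the server already holds this exact root hash."""
        return self._known_roots.get(codebase_id) == root_hash

    def record_sync(self, codebase_id: str, root_hash: str) -> None:
        self._known_roots[codebase_id] = root_hash


server = FakeIndexServer()
files = {"a.py": b"x = 1"}

first = server.startup_handshake("demo", root_hash_of(files))   # nothing synced yet
server.record_sync("demo", root_hash_of(files))
second = server.startup_handshake("demo", root_hash_of(files))  # roots match

files["a.py"] = b"x = 2"
third = server.startup_handshake("demo", root_hash_of(files))   # local edit detected

print(first, second, third)  # False True False
```

When the roots match, no further comparison is needed; when they differ, the two sides would go on to compare subtrees to find what changed.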

Technical Implementation Challenges

While the Merkle tree approach offers many advantages, it's not without implementation challenges. Cursor's indexing feature often experiences heavy load, causing many requests to fail. This can result in files needing to be uploaded several times before they get fully indexed. Users might notice higher than expected network traffic to 'repo42.cursor.sh' as a result of these retry mechanisms, as mentioned in Cursor's security documentation.

Another challenge relates to embedding security. Academic research has shown that reversing embeddings is possible in some cases. While current attacks typically rely on having access to the embedding model and working with short strings, there is a potential risk that an adversary who gains access to Cursor's vector database could extract information about indexed codebases from the stored embeddings.
