April 1, 2026

Your Repo Is a Knowledge Graph. You Just Don't Query It Yet. | Harness Blog

Why Source Code Management must become Source Context Management in the age of AI agents

The Premise

For decades, SCM has meant one thing: Source Code Management. Git commits, branches, pull requests, and version history. The plumbing of software delivery. But as AI agents show up in every phase of the software development lifecycle, from writing a spec to shipping code to reviewing a PR, the acronym is quietly undergoing its most important transformation yet.

SCM is becoming Source Context Management.

And this isn't a rebrand. It's a rethinking of what a source repository is, what it stores, and what it serves, not just to developers, but to the agents working alongside them.

The Context Crisis in Agentic SDLC

AI agents in software development are powerful but contextually blind by default. Ask a coding agent to implement a feature and it will reach out and read files, one by one, directory by directory, until it has assembled enough context to act. Ask a code review agent to assess a PR and it will crawl through the codebase to understand what changed and why it matters.

Anthropic's 2026 Agentic Coding Trends Report documents this shift in detail: the SDLC is changing dramatically as single agents evolve into coordinated multi-agent teams operating across planning, coding, review, and deployment. The report projects the AI agents market to grow from $7.84 billion in 2025 to $52.62 billion by 2030. But as agents multiply across the lifecycle, so does their hunger for codebase context, and so does the cost of getting that context wrong.

This approach has two brutal failure modes:

  1. Context window bloat. Feeding raw source files to an LLM is expensive, slow, and lossy. A 300,000-line codebase doesn't fit in any context window. The agent is forced to guess what's relevant, and it often guesses wrong.
  2. Semantic blindness. Reading files doesn't tell an agent why code is structured the way it is, what modules depend on what, which functions are high-risk, or what the design philosophy behind a component is. Text is not meaning.

The result? Agents that hallucinate implementations because they missed a key abstraction three directories away. Code reviewers that flag style issues but miss architectural regressions. PRD generators that know the syntax of your codebase but not its soul.

The bottleneck is not the model. It is the absence of a pre-computed, semantically rich, always-available representation of the entire codebase: a context engine.

A Tale of Two Agents

Consider a simple task: "Add rate limiting to the /checkout endpoint."

Without a context engine, a coding agent opens checkout.go, reads the handler function, and writes a token-bucket rate limiter inline at the top of the handler. The code compiles. The tests pass. The PR looks clean.

The agent missed three things:

  • The service already uses a middleware-based rate limiting pattern in middleware/ratelimit.go for every other endpoint. The agent created a second, inconsistent approach.
  • A shared RateLimitConfig interface exists that all rate limiters must implement for centralized configuration management. The agent's inline implementation ignores it.
  • Every rate-limit event flows through a centralized metrics.Emit() call for observability dashboards. The agent's version remains invisible to ops.

The code works. The team that maintains it finds it wrong in every way that matters. A senior engineer catches these issues in review, requests changes, and the cycle restarts. Multiply this by every agent-generated PR across every team, every day.

With a context engine, the same agent queries before writing code: "How is rate limiting implemented in this service?" The context engine returns:

  1. The existing middleware pattern in middleware/ratelimit.go
  2. The RateLimitConfig interface it must implement
  3. The metrics.Emit() integration point for observability
  4. The test conventions in middleware/ratelimit_test.go

The agent writes a new rate limiter that follows the established pattern, implements the shared interface, emits metrics through the standard pipeline, and includes tests that match the existing style. The PR wins approval on the first pass.
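The pattern the context engine surfaces can be sketched in miniature. The article's example is Go; the Python below mirrors its `RateLimitConfig` interface and `metrics.Emit()` hook with illustrative names (`CheckoutRateLimit`, `metrics_emit`, `rate_limited`) that are assumptions, not real APIs:

```python
import time
from typing import Callable, Protocol


class RateLimitConfig(Protocol):
    """Shared limiter interface (mirrors the article's Go interface)."""
    def limit(self) -> int: ...           # max requests per window
    def window_seconds(self) -> float: ...


class CheckoutRateLimit:
    """Concrete config for the /checkout endpoint (illustrative values)."""
    def limit(self) -> int:
        return 10

    def window_seconds(self) -> float:
        return 1.0


def metrics_emit(event: str, **fields) -> None:
    """Stand-in for the centralized metrics.Emit() pipeline."""
    print(f"metric={event} {fields}")


def rate_limited(config: RateLimitConfig, handler: Callable) -> Callable:
    """Middleware wrapper: the established pattern, not an inline limiter."""
    calls: list[float] = []

    def wrapped(request: dict) -> dict:
        now = time.monotonic()
        # Keep only timestamps inside the sliding window.
        calls[:] = [t for t in calls if now - t < config.window_seconds()]
        if len(calls) >= config.limit():
            metrics_emit("rate_limit.rejected", path=request["path"])
            return {"status": 429}
        calls.append(now)
        return handler(request)

    return wrapped


# The endpoint plugs into the shared pattern instead of reinventing it.
checkout = rate_limited(CheckoutRateLimit(), lambda req: {"status": 200})
```

Because the limiter implements the shared interface and emits through the standard pipeline, it shows up in the same dashboards and config system as every other endpoint.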

The difference is context quality, not model quality.

Beyond LSP: From Interactive Intelligence to Agentic Intelligence

The Language Server Protocol (LSP) transformed developer tooling in the past decade. By standardizing the interface between editors and language-aware backends, LSP gave every IDE, from VS Code to Neovim, access to autocomplete, go-to-definition, hover documentation, and real-time diagnostics. LSP was designed to serve a specific consumer: a human developer, working interactively, in a single file at a time. That design made the right trade-offs for its era:

  • Interactive response optimization. Servers pre-compute and cache indices for low-latency, cursor-anchored queries rather than producing complete semantic snapshots of entire repositories on demand
  • Position orientation. Most queries anchor to a file and cursor position, perfect for an editor but limiting for full-repo semantic traversal
  • Session binding. Requires an active language server process, tightly coupled to an open editor session
  • Single-client design. The protocol assumes one client per server instance, not built for concurrent multi-agent access

For interactive development, these are strengths. LSP excels at what it was built to do.

Agents are a different class of consumer. They don't sit in a file waiting for cursor events. They operate across entire repositories, across SDLC phases, often in parallel. They need the full semantic picture before they start, not incrementally as they navigate.

Agents need not a replacement for LSP, but a complement: something pre-built, always available, queryable at repo scale, and semantically complete, ready before anyone opens a file.

Enter LST: The Foundation of Source Context Management

Lossless Semantic Trees (LST), pioneered by the OpenRewrite project (born at Netflix, commercialized by Moderne), take a different approach to code representation.

Unlike the traditional Abstract Syntax Tree (AST), an LST:

  • Preserves formatting. Whitespace, comments, style decisions are retained, enabling round-trip code transformation without destructive rewrites
  • Is fully type-attributed. Every node in the tree knows the full type of every symbol, including fields defined in external binary dependencies
  • Is pre-computed and cacheable. LSTs are generated once, stored, and queried repeatedly without needing a live language server
  • Scales to entire repositories. Moderne's platform has demonstrated querying and transforming hundreds of millions of lines of code in seconds using pre-stored LSTs

This is the first layer of a Source Context Management system. Not raw files. Not a running language server. A pre-indexed semantic tree of the entire codebase, queryable by agents at any time.
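The properties above can be made concrete with a toy node shape. This is an illustrative sketch, not OpenRewrite's actual Java model: each node carries its preceding whitespace and comments (for lossless round-tripping) plus a fully resolved type signature.

```python
from dataclasses import dataclass, field


@dataclass
class LstNode:
    """Illustrative LST node; not OpenRewrite's real API."""
    kind: str                    # e.g. "MethodInvocation"
    text: str                    # the token(s) this node prints
    prefix: str = ""             # preceding whitespace/comments, kept for lossless printing
    type_signature: str = ""     # fully resolved type, even from binary dependencies
    children: list["LstNode"] = field(default_factory=list)

    def print_source(self) -> str:
        """Round-trip back to source text via the stored prefixes."""
        return self.prefix + self.text + "".join(c.print_source() for c in self.children)
```

The `prefix` field is what makes the tree lossless: printing the tree reproduces the original file byte for byte, comments included, which is what allows safe automated rewrites.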

The Three-Layer Architecture of a Context Engine

A proper Source Context Management system is not a single component. It is a three-layer stack that turns a repository from a file store into something agents can actually reason over.

Layer 1: Semantic Indexing (LST + Embeddings)

Every file in the repository is parsed into an LST and simultaneously embedded into a vector representation. This creates two complementary indices:

  • Structural index (LST): knows types, dependencies, call hierarchies, inheritance chains
  • Semantic index (vectors): knows meaning, intent, similar patterns, conceptual proximity
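A toy version of the two indices makes the division of labor visible. Everything here is a stand-in: the symbol names are hypothetical, the structural facts would really come from LSTs, and the "embeddings" are bag-of-words token sets rather than a learned model.

```python
# Structural index: exact facts (calls, interfaces) per symbol.
structural_index = {
    "middleware.rate_limited": {"calls": ["metrics.emit"], "implements": ["RateLimitConfig"]},
    "checkout.handle": {"calls": ["middleware.rate_limited"], "implements": []},
}

# Semantic index: token sets standing in for embedding vectors.
semantic_index = {
    "middleware.rate_limited": {"rate", "limit", "middleware", "request", "throttle"},
    "checkout.handle": {"checkout", "payment", "order", "cart"},
}


def semantic_search(query: str) -> str:
    """Return the symbol whose token set best overlaps the query (Jaccard)."""
    q = set(query.lower().split())
    return max(
        semantic_index,
        key=lambda sym: len(q & semantic_index[sym]) / len(q | semantic_index[sym]),
    )


def structural_context(symbol: str) -> dict:
    """Exact facts the structural (LST) layer knows about a symbol."""
    return structural_index[symbol]
```

An agent's natural-language question lands in the semantic index first; the structural index then supplies the precise interfaces and call edges for the symbol it found.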

Layer 2: Code Graph

The LST and semantic indices are projected into a code knowledge graph, a property graph where nodes are functions, classes, modules, interfaces, and comments, and edges are relationships: calls, imports, inherits, implements, modifies, tests.

This graph enables queries like:

  • "What is the blast radius if I change this interface?"
  • "Which modules have never been touched by the team that owns this service?"
  • "What are all the callers of this deprecated function across microservices?"
  • "Which code paths are covered by zero tests?"

Layer 3: Agentic Integration (MCP / API)

The context engine exposes itself through a Model Context Protocol (MCP) server or REST API, so any agent (coding, review, risk assessment, or documentation) can query it directly, retrieving precisely the subgraph or semantic chunk it needs without ever touching the raw file system.

The key insight: agents never read files. They query the context engine.
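The shape of that query surface can be sketched as JSON in, JSON out. This is not the real MCP SDK; the tool names and the canned graph answers are illustrative assumptions standing in for live graph queries.

```python
import json


def query_graph(tool: str, args: dict) -> dict:
    """Dispatch a named tool against the code graph (canned answers here)."""
    graph_answers = {
        "get_callers": {"payments.Charge": ["checkout.Handle", "admin.Refund"]},
        "get_interfaces": {"middleware.RateLimiter": ["RateLimitConfig"]},
    }
    table = graph_answers.get(tool)
    if table is None:
        raise KeyError(tool)
    return {"result": table.get(args["symbol"], [])}


def handle_request(raw: str) -> str:
    """One agent-facing tool call: parse, dispatch, serialize."""
    req = json.loads(raw)
    try:
        return json.dumps(query_graph(req["tool"], req.get("args", {})))
    except KeyError as exc:
        return json.dumps({"error": f"unknown tool or argument: {exc}"})
```

The point of the protocol layer is uniformity: a coding agent and a review agent issue the same call shape and never need file-system access.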

One Context Engine, Entire SDLC

A single context engine can serve every phase of the software development lifecycle.

Product Requirements (PRD Generation)

A PRD agent queries the context engine to understand existing capabilities, technical constraints, and module boundaries before generating a requirements document. It produces specs grounded in what the system actually is, not what someone thinks it is.

Technical Specification

A spec agent traverses the code graph to identify affected components, surface similar prior implementations, flag integration points, and propose an architecture, all without reading a single file directly.

Implementation (Coding)

A coding agent retrieves the precise subgraph surrounding the feature area: the types it needs to implement, the interfaces it must satisfy, the patterns used in adjacent modules, the test conventions for this package. It writes code that fits the codebase, not just code that compiles.

Pull Request & Code Review

A review agent queries the context engine to understand the semantic diff, not just what lines changed, but what that change means for the rest of the system. It can immediately surface:

  • Functions that are now unreachable
  • Breaking changes to downstream consumers
  • Regressions in design patterns
  • Missing test coverage for the changed blast radius

Risk Assessment

A risk agent scores every PR against the code graph, identifying high-centrality nodes (code that many things depend on), historically buggy modules, and changes that cross team ownership boundaries. No DORA metrics spreadsheet required.

Documentation & Design Principles

A documentation agent can traverse the code graph to generate living documentation (architecture diagrams, module dependency maps, API contracts) that updates automatically as the codebase evolves. Design principles can be encoded as graph constraints and validated on every merge.
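"Design principles encoded as graph constraints" can be sketched as a rule checked against import edges on every merge. The module names and the rule itself (public API code must not reach into internal storage directly) are hypothetical:

```python
# Hypothetical import edges extracted from the code graph.
imports = {
    "api/handlers": ["service/orders"],
    "service/orders": ["internal/db"],
    "api/reports": ["internal/db"],   # violates the layering rule
}


def violations(src_prefix: str, forbidden_prefix: str) -> list[tuple[str, str]]:
    """Edges where a module under src_prefix imports one under forbidden_prefix."""
    return [
        (mod, dep)
        for mod, deps in imports.items() if mod.startswith(src_prefix)
        for dep in deps if dep.startswith(forbidden_prefix)
    ]
```

A merge gate that runs such checks turns a design document's "should" into an enforced invariant of the graph.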

Incident Response

When a production incident occurs, an on-call agent queries the context engine with the failing component and gets an immediate blast-radius map, the last 10 changes to that subgraph, the owners, and the test coverage status. Time-to-understanding drops from hours to seconds.

The Business Imperative

The business case is simple:

  • Developer productivity. Agents with accurate context write correct code on the first pass. Fewer review cycles, fewer reverted commits, fewer rollbacks.
  • Delivery velocity. Pre-computed context means agents don't spend half their time reading the codebase. Tasks that take minutes of agent compute today can take seconds.
  • Risk reduction. A code graph makes the blast radius of every change visible before it merges. Risk moves left, from production incidents to pre-merge awareness.
  • Institutional memory. The context engine captures why code is structured the way it is, not just what it does. New engineers (and new agents) onboard against the graph, not against tribal knowledge.

The Open Source Ecosystem Is Already Here

This is not a theoretical architecture. Tools exist today:

  • OpenRewrite / Moderne. LST generation and large-scale codebase transformation
  • tree-sitter. Universal parser for building ASTs across 150+ languages
  • CodeRAG. Graph-based code analysis for AI-assisted development
  • ArangoDB / FalkorDB / Neo4j / Memgraph. Graph databases well-suited for code relationship storage
  • Chroma / Qdrant / Milvus. Vector databases for semantic code embeddings
  • MCP (Model Context Protocol). Anthropic's open protocol for agent-to-tool communication
  • Context-aware code review engines. Platforms that leverage semantic code understanding to power AI-assisted review beyond surface-level linting

The missing piece is not any individual component. It is the platform that assembles them into a unified, repo-attached context engine that every agent in the SDLC can query through a single interface.

The Challenges to Solve

Source Context Management faces real engineering challenges:

  1. Security and access control. A context engine is a pre-analyzed, queryable understanding of the codebase: dependency chains, blast-radius maps, test coverage gaps, ownership boundaries. In the wrong hands, this becomes a penetration testing roadmap. Agents querying the context engine must respect the same repo-level permissions that developers have, enforced at the graph query level, not just the API boundary. Context leakage across team or tenant boundaries is the single highest-severity threat vector this architecture introduces. Any serious deployment must treat threat modeling as a first-class architectural concern. Anthropic's 2026 Agentic Coding Trends Report makes the same call, listing "security-first architecture" as one of eight defining trends for agentic coding.
  2. Polyglot repos. Most enterprise codebases span multiple languages. The context engine must support unified graph construction across Java, Python, TypeScript, Go, and more simultaneously.
  3. Index freshness. The context engine must update incrementally on every commit, much like Git's own index, which uses content-addressable hashing and stat caching to detect exactly what changed without re-reading every file. A context engine that rebuilds from scratch on every push will not scale; one that recomputes only the affected subgraph on each commit will. A stale graph is worse than no graph, because it gives agents false confidence.
  4. Graph scale. A 10-million-line monorepo produces a graph with hundreds of millions of edges. Query performance at this scale requires dedicated graph infrastructure.
  5. Evaluation. How do you measure whether an agent's output improved because the context engine was accurate? Building evals for context quality is an unsolved problem.
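The index-freshness challenge (point 3) comes down to change detection. A minimal sketch of the content-hashing idea, the same trick Git's index uses, with hypothetical file contents:

```python
import hashlib


def content_hash(text: str) -> str:
    """Content-addressable hash of a file's text."""
    return hashlib.sha256(text.encode()).hexdigest()


def files_to_reindex(previous: dict[str, str], current_files: dict[str, str]) -> list[str]:
    """Paths whose content hash differs from the stored index (new or edited)."""
    return [
        path for path, text in current_files.items()
        if previous.get(path) != content_hash(text)
    ]
```

Only the returned paths get re-parsed into LSTs and re-projected into the graph; everything else keeps its cached subgraph, which is what makes per-commit updates tractable.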

The Reframe: What Is a Repository?

This is the shift:

A repository is not a collection of files. A repository is a knowledge graph with a version history attached.

Git's job is to version that knowledge. The context engine's job is to make it queryable. The agent's job is to act on it.

Follow this model and the consequences are concrete. Every CI/CD pipeline should include a context engine update step, as natural as running tests. Every developer platform should expose a context engine API alongside its code hosting API. Every AI coding tool should be evaluated not just on model quality but on context engine quality.

Source code repositories that don't invest in their context layer will produce agents that are fast but wrong. Repositories with rich, well-maintained context engines will produce agents that feel like senior engineers, because they have the same depth of understanding of the codebase that a senior engineer carries in their head.

Conclusion: The Next Infrastructure Primitive

The LSP gave us IDE intelligence. Git gave us version control. Docker gave us portable environments. Kubernetes gave us cluster orchestration. Each of these was an infrastructure primitive that unlocked a new generation of developer tooling.

The Source Context Engine is the next infrastructure primitive.

It is the prerequisite for every agentic SDLC capability worth building. And like every infrastructure primitive before it, the teams and platforms that build it first will be hard to catch.

SCM is no longer just about managing source code. It's about managing the context that makes the source code understandable.

Ompragash Viswanathan

Ompragash has a knack for Automation and AI and currently serves as a Senior Software Engineer at Harness. When he’s not at work, you’ll often find him tinkering with Ansible, crafting efficient pipelines, or automating complex routines. He can be found on GitHub: @ompragash.
