Unlock the hidden structure and insights within your codebase by transforming it into a powerful, queryable knowledge graph!
This project is your gateway to transforming complex source code into structured, queryable knowledge graphs. By leveraging semantic analysis (primarily via VS Code Language Server capabilities and historically through ANTLR), we extract meaningful information about your code's entities, relationships, and architecture. Dive deep into your codebase like never before!
- Knowledge Graph Generation: Converts source code from various languages into a rich graph structure.
- VS Code Integration: Primarily utilizes VS Code's powerful language servers for parsing and symbol extraction, providing broad language support out-of-the-box.
- Language Agnostic (via VS Code): The VS Code module aims to support any language for which a good Language Server Protocol (LSP) implementation exists.
- Detailed Symbol & Relationship Extraction: Captures classes, functions, variables, calls, inheritance, implementations, and more.
- File System Awareness: Includes tools for intelligently walking file trees, respecting
.gitignorepatterns. - MinHashing for Similarity: Implements MinHash for locality-sensitive hashing, useful for detecting near-duplicate code snippets or tracking semantic drift.
- (Historical) ANTLR-based Parsing & Querying: Features a sophisticated, though now deprecated, ANTLR-based parsing pipeline with a custom AST Query Language (
bevel_ast_ql) for fine-grained code analysis. - Codebase Combiner Utility: Handy scripts to bundle your entire project (or a summary) into a single text file for easy sharing or LLM analysis.
Understanding large, evolving codebases is a monumental task. This project aims to alleviate that by:
- Deep Code Understanding: Visualize and query relationships between components.
- Impact Analysis: Understand the ripple effects of changes.
- Architectural Insights: Discover design patterns, dependencies, and potential issues.
- Foundation for Custom Tooling: Build linters, documentation generators, or advanced refactoring tools on top of the graph.
- AI & LLM Augmentation: Provide rich, structured context about code to Large Language Models.
The project is organized into several key modules:
-
vscode/(Primary & Active):- Interfaces with VS Code's language services to extract symbols, references, and definitions.
VsCodeParser.kt: Orchestrates symbol extraction from files.VsCodeConnectionParser.kt: Infers connections (calls, inheritance, etc.) from symbols.languageSpecs/: Contains language-specific configurations and heuristics to refine VS Code's output.- This is the recommended and actively developed approach for parsing.
-
antlr/(⚠️ Largely Deprecated & Non-Functional):- Contains ANTLR grammars for various languages (Kotlin, C#, JavaScript).
- Includes a custom AST Query Language (
bevel_ast_ql/) for defining patterns to extract nodes and relationships from ANTLR parse trees. QueryBasedAntlrParser.kt&ConverterBasedAntlrParser.kt: Historical parsers using this system.- Important: This module underwent a large-scale refactor and is no longer functional in its current state. It's preserved for its valuable query language design and potential future reintegration if specific deep-parsing needs arise that VS Code LSP cannot fulfill. The documentation in
docs/primarily refers to this deprecated system.
-
providers/:GitignoreAwareFileWalker.kt: Efficiently traverses project directories, respecting.gitignore.MinHasher.kt: Implements MinHash for code similarity analysis.
-
regex/(⚠️ Deprecated):- Contains older, regex-based parsers for specific frameworks (e.g., AngularJS). Not actively maintained.
-
Root-level Scripts (
combine_*.sh,combine_files.py):- Utilities to package the codebase itself into a single text file. Useful for providing context to LLMs or for archiving.
- File Discovery: The
GitignoreAwareFileWalkerscans the target project for relevant source files. - Symbol Extraction (via VS Code): For each file, the
VsCodeParsercommunicates with VS Code (or a compatible LSP client) to request document symbols, workspace symbols, definitions, and references. - Graph Node Creation: Extracted symbols (classes, functions, variables, etc.) are transformed into nodes in the knowledge graph.
- Relationship Inference: The
VsCodeConnectionParseranalyzes the symbol information (e.g., call hierarchies, type definitions, inheritance) to create connections (edges) between nodes. - Graph Augmentation & Refinement: Language-specific logic in
vscode/languageSpecs/can further refine the graph, adding more detailed connections or node properties. - (Optional) Hashing:
MinHashercan be used to generate semantic fingerprints of code blocks or files.
- Java Development Kit (JDK) 17 or higher
- Gradle (the project uses the Gradle wrapper, so it will be downloaded automatically)
- (For
vscodemodule functionality) A running instance of VS Code or a compatible LSP server setup that the tool can communicate with (details depend on the specific runner implementation). - Python 3 for the
combine_*.shscripts (pip3 install -r requirements.txt).
-
Clone the repository:
git clone https://github.com/YOUR_USERNAME/code-to-knowledge-graph.git cd code-to-knowledge-graph -
Build the project using Gradle:
./gradlew build
This will compile the Kotlin/Java code, run tests, and produce necessary artifacts.
The primary way to use the knowledge graph generation capabilities is programmatically by integrating the vscode module's parsers into your own applications or analysis scripts.
// Example (Conceptual - actual API may vary)
// import software.bevel.code_to_knowledge_graph.vscode.VsCodeParser
// import software.bevel.code_to_knowledge_graph.vscode.VsCodeConnectionParser
// ...
// val projectPath = "/path/to/your/codebase"
// val vsCodeParser = VsCodeParser(...)
// val connectionParser = VsCodeConnectionParser(...)
// val initialGraph = vsCodeParser.parseProject(projectPath)
// val finalGraph = connectionParser.enhanceGraph(initialGraph)
// Now you have 'finalGraph' to work with!For development and testing the vscode module, you'll typically need a way to simulate or directly interact with VS Code's LSP. The BatchProcessor and related classes in vscode/src/main/kotlin/ hint at a mechanism for this.
As mentioned, the antlr/ module and its associated bevel_ast_ql query language are currently deprecated and non-functional due to a major refactor. The project's primary focus for parsing is now the vscode/ module.
The documentation found in:
docs/queries.mddocs/programming_constructs.md
...describes how to use the bevel_ast_ql system. While this system is not active, the concepts and query examples might still provide valuable insights into thinking about code-to-graph transformations, should you wish to explore or revive parts of that system.
This project also includes a handy set of scripts to bundle your (or any) codebase into a single text file, respecting .gitignore. This is excellent for feeding context to LLMs.
- Full Combine (respects .gitignore):
./combine_codebase.sh path/to/your/project output_combined.txt
- Compact Combine (first 50 lines per file, respects .gitignore):
./combine_compact.sh path/to/your/project output_compact.txt
- Summary View (file structure + first 10 lines per file, respects .gitignore):
(Requires Python dependencies:
./combine_summary.sh path/to/your/project output_summary.txt
pip3 install -r requirements.txt)
Contributions are welcome! Whether it's improving the VS Code integration, adding new language-specific handlers, enhancing the graph model, or fixing bugs, your help is appreciated.
- Fork the repository.
- Create your feature branch (
git checkout -b feature/AmazingFeature). - Commit your changes (
git commit -m 'Add some AmazingFeature'). - Push to the branch (
git push origin feature/AmazingFeature). - Open a Pull Request.
Please make sure your code adheres to the existing style and that all tests pass.
This project is licensed under the Mozilla Public License Version 2.0. See the LICENSE file for details.
The NOTICE file contains information about licenses of third-party dependencies.
- The ANTLR project for their powerful parser generator (though our ANTLR module is currently deprecated).
- The Dynatrace hash4j library for MinHash implementation.
- The broader Language Server Protocol (LSP) community for enabling robust cross-editor language intelligence.
- All contributors and users of this project.
Happy Coding & Graphing! 🧑💻➡️📊

