Show HN: OpenKB: Open LLM Knowledge Base
VectifyAI has released OpenKB, an open-source knowledge base system for large language models that builds on Andrej Karpathy's original work and adds support for long PDF documents through a feature called Pageindex.
Hacker News user mingtianzhang, posting on behalf of VectifyAI, announced the release of OpenKB, an open-source knowledge base system designed to work with large language models. The project is publicly available at github.com/VectifyAI/OpenKB and was submitted to Hacker News under the "Show HN" format, the community's designated space for creators to share projects they have built. The post gathered 5 points and 2 comments within its first 11 hours, modest numbers that reflect the technically niche audience this kind of infrastructure project tends to attract initially.
Why This Matters
PDF handling is the unglamorous problem that every enterprise AI project eventually crashes into, and OpenKB is one of the few open-source projects that acknowledges it directly rather than pretending flat text files represent real-world usage. The retrieval-augmented generation market has seen significant investment from commercial players like Pinecone and Weaviate, but self-hosted, transparent alternatives remain underdeveloped. With at least 2 other "Show HN" knowledge base projects appearing on Hacker News in the same week as this announcement, including "An AI-powered knowledge base that thinks" by DSpider, developer appetite for open-source tooling in this category is clearly accelerating. VectifyAI is entering a crowded field, but their specific focus on long-document indexing gives them a legitimate technical angle to own.
Daily briefing from 50+ sources. Free, 5-minute read.
The Full Story
Andrej Karpathy, the AI researcher who was a founding member of OpenAI and later served as Tesla's director of AI, has been publicly exploring what thoughtful knowledge management for LLMs could look like. His ideas around open knowledge bases have become a reference point for developers thinking about how AI systems should store and retrieve information beyond their training data. VectifyAI took that conceptual foundation and built something runnable from it. The core problem OpenKB tries to solve is straightforward to describe but surprisingly hard to execute. Language models are trained on data up to a cutoff date and have no inherent ability to consult new documents, proprietary files, or specialized knowledge repositories on their own. Retrieval-augmented generation, commonly called RAG, is the dominant approach for solving this: it involves indexing external documents so a model can pull relevant chunks into its context window before generating a response. Where most RAG tutorials fall apart is on real-world documents, which are long, messily formatted, and rarely arrive as clean plain text.
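The retrieval step described above can be sketched in a few lines. This is a generic illustration, not OpenKB's actual code: the toy bag-of-words "embedding" below stands in for the neural embedding model a real RAG system would use, and the chunk texts are invented examples.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; real systems use a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank indexed chunks by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The warranty covers parts and labor for two years.",
    "Firmware updates are released quarterly.",
    "Refunds are processed within 14 business days.",
]
context = retrieve("how long is the warranty", chunks, k=1)
# The retrieved chunk is placed in the model's context window before generation.
prompt = f"Answer using this context:\n{context[0]}\n\nQuestion: how long is the warranty?"
```

The same shape, with a real embedding model and a vector store, is what frameworks like LlamaIndex and LangChain package up behind their APIs.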
That is where Pageindex comes in. The feature, which VectifyAI describes as their primary technical differentiator, maintains awareness of page boundaries when processing PDFs. This sounds simple, but it solves a real problem: when a system slices a 200-page technical manual into chunks without knowing where pages begin and end, it loses the ability to tell a user that a particular answer came from page 47. For enterprise applications, compliance work, or any domain where source attribution matters, that kind of precision is not optional.
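The idea of page-boundary-aware indexing can be illustrated with a short sketch. This is an assumption about the general technique, not OpenKB's implementation: the `pages` list is a hypothetical stand-in for the per-page text any PDF extractor produces, and the point is simply that each chunk carries its source page so an answer can cite "page 47" instead of an anonymous offset.

```python
def chunk_pages(pages: list[str], chunk_size: int = 200) -> list[dict]:
    """Split extracted page texts into chunks that remember their source page."""
    chunks = []
    for page_num, text in enumerate(pages, start=1):
        words = text.split()
        # Chunking is done within each page, so no chunk straddles a page boundary.
        for i in range(0, len(words), chunk_size):
            chunks.append({
                "page": page_num,  # preserved for source attribution
                "text": " ".join(words[i:i + chunk_size]),
            })
    return chunks

pages = ["Intro text on page one.", "Details about limits appear here on page two."]
chunks = chunk_pages(pages, chunk_size=5)
# Each retrieved chunk can now be cited by page when surfaced to the user.
```

A naive pipeline that concatenates the whole PDF into one string before chunking throws this metadata away, which is exactly the attribution problem described above.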
When a commenter on Hacker News, posting under the handle 0xAgentKitchen, pointed out that there are already dozens of Karpathy-inspired knowledge base projects and suggested VectifyAI add a comparison table to their README, mingtianzhang responded directly and acknowledged the feedback. The reply was candid: "Our biggest differentiation is that OpenKB can handle long PDFs and images, something that isn't trivial or easily 'vibe-coded.'" That is a pointed dig at the wave of AI projects assembled quickly with minimal engineering depth, and it signals that VectifyAI sees serious document processing as the moat worth building.
The project is fully open source, meaning developers can inspect the code, fork it, and extend it without licensing fees or vendor lock-in. For teams building AI tools and platforms that need to remain self-hosted for compliance or cost reasons, that transparency is a genuine selling point that commercial vector database products cannot match on their own.
Key Details
- VectifyAI posted OpenKB to Hacker News, where it received 5 points and 2 comments within the first 11 hours.
- The GitHub repository is located at github.com/VectifyAI/OpenKB and is publicly available under an open-source license.
- The project is built as an open-source version of Andrej Karpathy's knowledge base concept, extended with Pageindex for PDF support.
- Pageindex specifically enables page-boundary-aware indexing of long PDFs and image-embedded documents.
- At least 2 competing knowledge base "Show HN" posts appeared on Hacker News within the same week, reflecting high developer activity in this category.
- Commercial vector database competitors in the same space include Pinecone, Weaviate, Milvus, and the open-source Chroma library.
What's Next
VectifyAI has publicly committed to exploring a comparison table in their README after community feedback, which will be a meaningful test of whether they can articulate their technical advantages clearly enough to stand out among competing projects. The team's next challenge is producing documentation and real-world benchmarks that demonstrate Pageindex's performance on specific document types, such as 100-plus-page financial reports or multi-column research papers. Developers evaluating the project for production use will want those numbers before committing to integrate it into a larger agent architecture.
How This Compares
The comparison to other open-source RAG and knowledge base tools is worth taking seriously. LlamaIndex and LangChain, two of the most widely adopted frameworks in this space, both offer PDF ingestion pipelines, but they are general-purpose frameworks rather than purpose-built knowledge base systems. OpenKB is positioning itself as the latter, which means it could integrate with those frameworks rather than compete with them directly.
Within the narrower category of Karpathy-inspired knowledge base clones, the 0xAgentKitchen comment on Hacker News identified a real crowding problem. Without a clear comparison against the other 30-odd similar projects, developers have no efficient way to evaluate whether OpenKB's Pageindex feature justifies switching from whatever they already use. The projects by DSpider and smissingham that appeared on Hacker News in the same week share the same audience and the same pitch, which means differentiation through documentation and benchmarks is not a nice-to-have for VectifyAI, it is existential.
On the commercial side, AWS Kendra, Microsoft Azure AI Search, and Google's Vertex AI Search all handle long PDFs in enterprise settings, but they come with cost structures and data residency implications that make many developers prefer self-hosted alternatives. OpenKB's open-source positioning is a direct answer to that preference. The question is whether the implementation can match the reliability that paid services have spent years tuning.
FAQ
Q: What is a knowledge base for an LLM and why do you need one? A: A knowledge base lets a language model answer questions using documents you provide, rather than relying only on its training data. This is essential for applications that need current information, proprietary company documents, or specialized content like legal filings or technical manuals. The model retrieves relevant passages from the knowledge base before generating each response.
Q: What does Pageindex actually do in OpenKB? A: Pageindex tracks page boundaries when processing PDF files so the system knows exactly which page a piece of information came from. Without this, a long document gets sliced into chunks that lose their original location context, making accurate source attribution difficult. For research, compliance, or enterprise use cases, knowing the precise source page is often a hard requirement.
Q: How is OpenKB different from just using LangChain or LlamaIndex? A: LangChain and LlamaIndex are broad frameworks that include document ingestion as one of many features. OpenKB is built specifically as a knowledge base system, which means the design decisions around indexing, retrieval, and long-document handling are central rather than add-ons. Developers can find guides comparing these approaches to decide which fits their stack.
OpenKB is a small project right now, but the specific problem it is attacking, reliable long-PDF indexing for LLM applications, is one of the more practical and underserved needs in the current AI development ecosystem. Whether VectifyAI can build the community and documentation to separate itself from the crowded field of similar projects will determine how far it goes.