Joseph T. French
Founder, RoboSystems AI
"The combination of an embedded columnar graph engine with native DuckDB integration, vector support, and Icebug analytics is genuinely unique — and it's what makes RoboSystems possible."
Building an AI-Ready Financial Intelligence Platform on LadybugDB
I started building RoboSystems on Neo4j. It's what I knew, and the Cypher query language was a natural fit for the financial knowledge graphs I was constructing. The prototype worked. Then I did the math on what it would actually cost to run in production, and more importantly, what it would cost my users.
RoboSystems is an open-source financial data platform. The whole point is that companies can fork it and deploy it in their own AWS account, owning their own data. I couldn't build that on top of a database that required an enterprise license to run at scale. Neo4j Community Edition has hard limits that make it impractical for production workloads, and Neo4j Enterprise would mean imposing a significant licensing toll on every user who deploys the platform. That's fundamentally incompatible with building open-source infrastructure.
So I migrated to Kuzu, an embedded graph database that supported Cypher and seemed like the right fit. I spent a few months rebuilding the graph layer on Kuzu. Then, less than a week after I open-sourced RoboSystems, Apple acquired Kuzu and the open-source project was shut down without warning. I was stuck for about a month with no viable option. I needed something embeddable, something that spoke Cypher, something that could handle hundreds of millions of nodes on a single instance without a cluster, and something with a license that wouldn't create a tax on every deployment. Nothing fit.
Then LadybugDB emerged, a community fork of Kuzu with a new maintainer committed to keeping it open source. What started as the only remaining option quickly became a genuine technical advantage. The DuckDB integration, the columnar storage model, the vector extension, the Icebug graph algorithms, all of it turned out to be better than what I had before, not just a substitute. LadybugDB is now the foundation of the entire platform.
The Problem
Every company that files annual and quarterly financial reports with the SEC does so in XBRL (eXtensible Business Reporting Language), a structured data format that is surprisingly difficult to work with at scale. There are more than ten thousand reporting companies, hundreds of thousands of filings going back to 2009, and hundreds of millions of individual data points. I wanted to shred all of that into a queryable knowledge graph, not just the raw numbers, but the semantic structure: what elements mean, how they relate to each other across companies, and how financial statements are constructed. And I wanted AI agents to be able to query it naturally through MCP tools without needing to write complex Cypher.
I needed something I could embed on a single EC2 instance, distribute as a file, and scale horizontally by simply launching more read-only replicas that pull the latest snapshot from S3. LadybugDB gave me all of that, plus native DuckDB integration that turned out to be the real unlock for the data pipeline.
The Architecture
The SEC pipeline is a six-stage process orchestrated by Dagster:
- Download — Discover and download XBRL filings from SEC EDGAR, partitioned by quarter and form type
- Process — Shred XBRL into parquet files using a custom processor that extracts entities, facts, elements, labels, structures, and associations — 14 node and 21 relationship types
- Stage — Ingest parquet into DuckDB using glob patterns, with deduplication via GROUP BY and FIRST() aggregation
- Enrich — Generate knowledge artifacts using Icebug's graph analytics (PageRank, Core Decomposition, BFS) via zero-copy Arrow-to-CSR construction, then attach 384-dimensional embeddings for semantic search
- Materialize — COPY from DuckDB staging tables directly into LadybugDB, with batch processing for tables exceeding 20M rows
- Publish — Backup to S3 and distribute to a fleet of read-only shared replicas behind an ALB
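The deduplication in the staging step can be sketched in plain Python. In production this is a DuckDB GROUP BY with FIRST() aggregation; the row shape below is hypothetical, chosen only to show the keep-first semantics:

```python
def dedup_first(rows, key_fields):
    """Keep the first row seen for each key, mimicking
    DuckDB's GROUP BY ... FIRST() deduplication."""
    seen = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen[key] = row
    return list(seen.values())

# Hypothetical association rows: the same edge repeats across filings.
rows = [
    {"src": "Revenues", "dst": "NetIncomeLoss", "type": "CALCULATION", "filing": "A"},
    {"src": "Revenues", "dst": "NetIncomeLoss", "type": "CALCULATION", "filing": "B"},
    {"src": "Assets", "dst": "Liabilities", "type": "CALCULATION", "filing": "A"},
]
unique = dedup_first(rows, ("src", "dst", "type"))
print(len(unique))  # → 2
```

At SEC scale this collapses tens of millions of repeated relationship rows into a few million unique edges before they ever touch the graph.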
The entire pipeline runs nightly after filings stop flowing at around 9pm ET. A company like Uber files its 10-K on a Friday, and by Saturday morning it has been shredded, staged, enriched, materialized, and made available for AI-powered analysis.
Why LadybugDB Works Here
A few things make LadybugDB uniquely suited for this:
DuckDB integration is first-class. The materialization path from DuckDB staging tables into LadybugDB via ATTACH and COPY is fast and reliable. For the SEC dataset, I'm moving over a hundred million rows through this pipeline, and the direct COPY path handles it at several million rows per minute without issues. For large tables like Fact (which can exceed 100M rows), I use batched materialization with 20M-row chunks.
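The chunking logic behind the batched materialization is simple to sketch. This is an illustrative windowing helper, not RoboSystems' actual code; each (offset, limit) window would drive one COPY statement:

```python
def batch_ranges(total_rows, batch_size=20_000_000):
    """Yield (offset, limit) windows for chunked COPY statements
    over tables too large to materialize in one pass."""
    for offset in range(0, total_rows, batch_size):
        yield offset, min(batch_size, total_rows - offset)

# A 110M-row Fact table becomes six COPY batches.
batches = list(batch_ranges(110_000_000))
print(len(batches))  # → 6
print(batches[-1])   # → (100000000, 10000000)
```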
The vector extension is production-ready. Every element, label, and structure node in the SEC graph carries a 384-dimensional embedding (BAAI/bge-small-en-v1.5). LadybugDB stores these as FLOAT[384] columns with HNSW indexes created post-materialization. This is what powers semantic search. An AI agent can resolve a natural language concept like "revenue" to the correct XBRL element across different taxonomies using vector similarity, then traverse the graph to pull the actual numbers.
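The concept-resolution step reduces to nearest-neighbor search over those embeddings. A minimal sketch with toy 3-dimensional vectors standing in for the 384-dimensional bge-small ones (in production the HNSW index does this lookup; the element vectors here are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def resolve_concept(query_vec, element_vecs):
    """Return the XBRL element whose embedding is closest to the query."""
    return max(element_vecs, key=lambda name: cosine(query_vec, element_vecs[name]))

# Toy embeddings for two candidate elements.
elements = {
    "Revenues":      [0.9, 0.1, 0.0],
    "NetIncomeLoss": [0.1, 0.9, 0.1],
}
print(resolve_concept([0.8, 0.2, 0.0], elements))  # → Revenues
```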
Embeddable means operationally simple. Each LadybugDB instance runs on a single EC2 node (ARM64 Graviton). The shared repository tier uses r7g.2xlarge instances (64GB RAM) for the shared master, with a fleet of read-only replicas behind an ALB on m7g.large instances. There's no cluster to manage, no replication protocol to debug. The replicas simply download the published .lbug file from S3 on startup.
Icebug: Graph Analytics That Changed Everything
The real breakthrough came when I integrated Icebug for offline graph analytics. The SEC XBRL data contains millions of edges representing how financial elements relate to each other — calculation relationships (revenue minus expenses equals income) and presentation relationships (how statements are laid out).
I use Graph.fromCSR() to construct Icebug graphs directly from Arrow arrays exported by DuckDB, zero-copy with no per-element Python loops. From roughly 50M raw association rows (the same element relationships repeat across 100K filings), DuckDB deduplicates down to 3M unique edges, which are then fed into Icebug for three key analyses:
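The CSR layout that a fromCSR-style constructor consumes is just two arrays: per-node offsets and a flat target list. In the real pipeline these come zero-copy from Arrow buffers; the sketch below builds them from plain Python lists to show the shape of the data:

```python
def edges_to_csr(num_nodes, edges):
    """Build CSR arrays (offsets, targets) from (src, dst) pairs.
    offsets[i]..offsets[i+1] indexes node i's neighbors in targets."""
    degree = [0] * num_nodes
    for src, _ in edges:
        degree[src] += 1
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + degree[i]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = edges_to_csr(4, [(0, 1), (0, 2), (2, 3)])
print(offsets)  # → [0, 2, 2, 3, 3]
print(targets)  # → [1, 2, 3]
```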
- PageRank identifies the most important financial elements across the entire corpus
- Core Decomposition reveals the structural backbone of financial reporting
- BFS from known roots (NetIncomeLoss, Assets, CashAndCashEquivalentsPeriodIncreaseDecrease) classifies every element by its primary financial statement
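The BFS classification in the last bullet can be sketched directly: seed each statement root, then flood-fill labels outward through the calculation edges. The adjacency and element names below are illustrative:

```python
from collections import deque

def classify_by_statement(adjacency, roots):
    """Label every reachable element with the statement whose
    root reaches it first (breadth-first, so shortest hop wins)."""
    label = {}
    queue = deque()
    for root, statement in roots.items():
        label[root] = statement
        queue.append(root)
    while queue:
        node = queue.popleft()
        for nbr in adjacency.get(node, []):
            if nbr not in label:
                label[nbr] = label[node]
                queue.append(nbr)
    return label

# Hypothetical calculation edges between XBRL elements.
adj = {
    "NetIncomeLoss": ["Revenues", "CostOfRevenue"],
    "Assets": ["CashAndCashEquivalents"],
}
roots = {"NetIncomeLoss": "IncomeStatement", "Assets": "BalanceSheet"}
labels = classify_by_statement(adj, roots)
print(labels["Revenues"])                # → IncomeStatement
print(labels["CashAndCashEquivalents"])  # → BalanceSheet
```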
These analytics, combined with the embeddings, produce a confidence-scored classification for every concept in the graph. The result: instead of requiring an LLM to generate complex Cypher queries on a schema with 14 node types and 21 relationship types, I can give it simple, tailored MCP tools that just provide the answer with filtering options. A user might ask an AI agent "what was Uber's revenue in Q4 2024?" and the tool resolves "revenue" to the correct XBRL element using embedding similarity + graph-structural confidence, then does a straightforward filtered traversal. Less time to retrieve, dramatically less hallucination.
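One way such a blended confidence score could work is a weighted mix of embedding similarity and normalized PageRank; the 0.7/0.3 weighting and the candidate numbers below are illustrative, not RoboSystems' actual values:

```python
def concept_confidence(similarity, pagerank, max_pagerank, alpha=0.7):
    """Blend embedding similarity with a PageRank-derived importance
    score into a single confidence value in roughly [0, 1]."""
    structural = pagerank / max_pagerank if max_pagerank else 0.0
    return alpha * similarity + (1 - alpha) * structural

# "revenue" matches two candidates; graph importance breaks the tie.
candidates = {
    "Revenues":                              (0.91, 0.045),
    "RevenueRemainingPerformanceObligation": (0.89, 0.002),
}
max_pr = max(pr for _, pr in candidates.values())
best = max(candidates, key=lambda e: concept_confidence(*candidates[e], max_pr))
print(best)  # → Revenues
```

The point is that a structurally central element beats a near-synonym with similar text similarity, which is exactly what keeps the MCP tools from hallucinating obscure elements.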
Enriching the Graph After Construction
The biggest strides I've made haven't been in the raw data pipeline; they've been in enriching the graph after it's built. The philosophy is simple: you can't classify what's in the graph until you can see the full picture. Trying to infer structural patterns during construction, when you're processing one filing at a time, doesn't work. You need the complete graph first.
So after shredding the XBRL, I spin up a temporary LadybugDB instance, load the filing contents into the graph, and run Cypher queries that identify what things actually are: roll-ups, roll-forwards, hierarchies, disclosure patterns. From those patterns, I create new Classification and FactSet nodes that make the graph dramatically easier for downstream tools to navigate. After exporting the enriched nodes to parquet, I throw the temporary graph away. That's a pattern that only works with an embedded database. You can't justify standing up a Neo4j cluster for a short-lived enrichment job.
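The roll-up check at the heart of that classification is easy to illustrate. In production this is a Cypher query over the temporary graph; the sketch below shows the numeric test on a hypothetical parent/children fact group, with children carrying their calculation-arc signs:

```python
def classify_structure(parent_value, child_values, tolerance=0.01):
    """Classify a parent/children fact group: a 'roll-up' is a parent
    whose value equals the (signed) sum of its children."""
    if abs(parent_value - sum(child_values)) <= tolerance:
        return "roll-up"
    return "unclassified"

# NetIncomeLoss = Revenues - CostOfRevenue, signs from the calculation arcs.
print(classify_structure(40.0, [100.0, -60.0]))  # → roll-up
print(classify_structure(40.0, [100.0, -50.0]))  # → unclassified
```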
This is where the real value is. A graph full of raw XBRL data is useful. A graph where every structure has been classified, every element has been scored by importance, and every concept has been mapped to a canonical meaning is something an AI agent can actually work with reliably.
Beyond Shared Data: User Graphs and Connectors
The SEC dataset proves the pattern, but the long-term potential is in user graphs. RoboSystems uses the same infrastructure to give every customer their own dedicated LadybugDB instance where they can:
- Upload parquet files to S3, stage in DuckDB, and materialize into their own graph
- Connect accounting systems like QuickBooks through OAuth — the pipeline extracts, transforms via dbt, and materializes transactions directly into the graph
- Create subgraphs — isolated databases on the same instance for workspaces, team collaboration, or AI memory
The subgraph feature has been particularly interesting. A subgraph can serve as an AI memory layer, where the MCP client has tools for creating nodes and relationships in the graph schema, writing content for later retrieval. A fork of the main graph lets you experiment with different data transformations without touching production.
The Distribution Model
RoboSystems is open source. The entire codebase (the SEC pipeline, the graph API, the MCP tools) is designed to be forked and deployed into a customer's own AWS account. The infrastructure is defined in CloudFormation templates parameterized by a single graph.yml config file. GitHub Actions handles deployment. There's no managed service vendor lock-in.
This only works because LadybugDB is embeddable. If the graph engine required a separately managed cluster, the fork-and-deploy model would collapse under operational complexity. Instead, each tier is a single EC2 instance running the Graph API with LadybugDB embedded — the same Docker image and deployment pattern whether you're running a 5M-node graph or a 200M-node one, just on a bigger instance.
What I've Learned
Building on LadybugDB, a few things stand out:
The DuckDB-to-LadybugDB pipeline is the killer feature. Being able to stage, transform, enrich, and deduplicate data in DuckDB, then materialize directly into the graph without an intermediate serialization step, is what makes the SEC pipeline viable. I couldn't do this with a client-server graph database.
Embeddings in the graph change what's possible. Storing vectors alongside graph structure means you can do hybrid queries, semantic similarity to find the right nodes, then graph traversal to get the relationships. This is what makes the MCP tools accurate enough for production use.
Icebug completes the picture. Raw graph structure plus embeddings is good. Adding PageRank, core decomposition, and BFS-derived classifications makes it great. The confidence scores from graph analytics are what let me build tools that just work, instead of tools that sometimes work.
Operational simplicity compounds. Every operational decision I didn't have to make (no cluster management, no replication protocol, no connection-pool tuning) freed up time to build actual product features. The constraint of simplicity pushed me toward better architecture.
I'm committed to supporting LadybugDB and calling it out at every opportunity because I believe this is a foundational open-source technology. The combination of an embedded columnar graph engine with native DuckDB integration, vector support, and Icebug analytics is genuinely unique — and it's what makes RoboSystems possible.
RoboSystems is an open-source financial intelligence platform. The SEC knowledge graph is available as a shared repository, and the full platform can be forked and deployed to any AWS account. Learn more at robosystems.ai.