# Content Database System
This project includes a SQLite-based content database for managing article metadata, tracking external source references, and generating AI-powered summaries. The database is stored locally in `.cache/knowledge.db` (gitignored) and serves as a foundation for content quality tooling.
## Quick Start
```bash
# Scan all content files and populate database
npm run kb:scan

# Generate AI summaries for articles
npm run kb:summarize

# Export sources to resources.yaml
node scripts/export-resources.mjs

# View database statistics
npm run kb:stats
```

## Architecture
### Database Schema

#### articles

Stores content extracted from MDX files.
| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | Entity ID from filename |
| path | TEXT | Relative path to source file |
| title | TEXT | Article title from frontmatter |
| description | TEXT | Article description |
| content | TEXT | Plain text content (JSX removed) |
| word_count | INTEGER | Word count for prioritization |
| quality | INTEGER | Quality rating from frontmatter |
| content_hash | TEXT | MD5 hash for change detection |
| created_at | TEXT | When article was first indexed |
| updated_at | TEXT | When article was last updated |
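Because the module also exports the raw better-sqlite3 handle (`db`, see the Core Module API below), the table can be queried directly when the wrapped helpers don't cover a case. A minimal sketch, assuming only the column names above and the standard better-sqlite3 prepared-statement API; the query itself is illustrative:

```js
import { db } from './scripts/lib/knowledge-db.mjs';

// Ten longest articles that still lack a quality rating (illustrative query)
const rows = db
  .prepare(
    `SELECT id, title, word_count
     FROM articles
     WHERE quality IS NULL
     ORDER BY word_count DESC
     LIMIT 10`
  )
  .all();

console.log(rows);
```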
#### sources

Stores metadata about external references discovered in articles.
| Column | Type | Description |
|---|---|---|
| id | TEXT PRIMARY KEY | SHA256 hash of URL/DOI (16 chars) |
| url | TEXT | Full URL of external source |
| doi | TEXT | Digital Object Identifier (if paper) |
| title | TEXT | Source title/headline |
| authors | TEXT (JSON) | Array of author names |
| year | INTEGER | Publication year |
| source_type | TEXT | Type: paper, blog, report, web, etc. |
| content | TEXT | Fetched source content |
| fetch_status | TEXT | pending, fetched, failed, manual |
| fetched_at | TEXT | When source was last fetched |
#### article_sources

Junction table linking articles to their cited sources.
| Column | Type | Description |
|---|---|---|
| article_id | TEXT | Foreign key to articles |
| source_id | TEXT | Foreign key to sources |
| citation_context | TEXT | Quote where source is cited |
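The junction table is what the per-article lookups walk through. A hedged sketch of the underlying join, using the exported `db` handle and the column names above (the `sources.getForArticle()` helper shown later presumably wraps a query along these lines):

```js
import { db } from './scripts/lib/knowledge-db.mjs';

// All sources cited by one article, with the quote each was cited in
const cited = db
  .prepare(
    `SELECT s.url, s.title, s.source_type, links.citation_context
     FROM article_sources AS links
     JOIN sources AS s ON s.id = links.source_id
     WHERE links.article_id = ?`
  )
  .all('deceptive-alignment');
```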
#### summaries

Stores AI-generated summaries of articles and sources.
| Column | Type | Description |
|---|---|---|
| entity_id | TEXT PRIMARY KEY | ID of summarized entity |
| entity_type | TEXT | 'article' or 'source' |
| one_liner | TEXT | Single-sentence summary (max 25 words) |
| summary | TEXT | 2-3 paragraph summary |
| key_points | TEXT (JSON) | 3-5 bullet points |
| key_claims | TEXT (JSON) | Claims with values |
| model | TEXT | Model used (e.g., claude-3-5-haiku) |
| tokens_used | INTEGER | Total tokens consumed |
| generated_at | TEXT | When summary was generated |
#### entity_relations

Entity relationships loaded from `entities.yaml`.
| Column | Type | Description |
|---|---|---|
| from_id | TEXT | Source entity ID |
| to_id | TEXT | Target entity ID |
| relationship | TEXT | Relationship type |
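The `relations` export has no worked example further down, so here is a hedged sketch that reads the table directly via the `db` handle instead; only the column names documented above are assumed:

```js
import { db } from './scripts/lib/knowledge-db.mjs';

// Outgoing relations for a single entity
const outgoing = db
  .prepare('SELECT to_id, relationship FROM entity_relations WHERE from_id = ?')
  .all('deceptive-alignment');
// => [{ to_id: '...', relationship: '...' }, ...]
```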
## Commands Reference

### `npm run kb:scan`

Scans all MDX files and populates the database.
```bash
# Standard scan (skips unchanged files)
npm run kb:scan

# Force rescan all files
node scripts/scan-content.mjs --force

# Show per-file progress
node scripts/scan-content.mjs --verbose

# Show database stats only
node scripts/scan-content.mjs --stats
```

What it does:
- Finds all `.mdx` and `.md` files in `src/content/docs/`
- Extracts frontmatter (title, description, quality, sources)
- Extracts plain text content (removes imports, JSX, HTML comments)
- Discovers URLs from markdown links and DOIs
- Infers source types from domains (arxiv.org → paper, lesswrong.com → blog)
- Loads entity relations from `entities.yaml`
- Skips unchanged files via content hash comparison (see the sketch after this list)
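The hash comparison is easy to reproduce outside the scanner. A minimal sketch, assuming the scanner MD5-hashes the extracted plain text (per the `content_hash` column) and consults `articles.hasChanged()` as documented in the Core Module API below:

```js
import { createHash } from 'node:crypto';
import { articles } from './scripts/lib/knowledge-db.mjs';

// Plain-text content as extracted from the MDX file (placeholder here)
const plainText = '...extracted article text...';
const newHash = createHash('md5').update(plainText).digest('hex');

// Only re-index when the stored hash differs
if (articles.hasChanged('deceptive-alignment', newHash)) {
  // ...re-run extraction and call articles.upsert({ ..., content_hash: newHash })
}
```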
### `npm run kb:summarize`

Generates AI summaries using the Anthropic API.
```bash
# Summarize 10 articles (default)
npm run kb:summarize

# Summarize specific count
node scripts/generate-summaries.mjs --batch 50

# Summarize sources instead of articles
node scripts/generate-summaries.mjs --type sources

# Use higher-quality model
node scripts/generate-summaries.mjs --model sonnet

# Summarize specific article
node scripts/generate-summaries.mjs --id deceptive-alignment

# Re-summarize changed content
node scripts/generate-summaries.mjs --resummary

# Preview without API calls
node scripts/generate-summaries.mjs --dry-run
```

Models available:
| Model | ID | Cost (per 1M input tokens) | Use case |
|---|---|---|---|
| haiku | claude-3-5-haiku-20241022 | ~$0.25 | Bulk summarization |
| sonnet | claude-sonnet-4-20250514 | ~$3.00 | Higher quality |
| opus | claude-opus-4-20250514 | ~$15.00 | Best quality |
Cost estimates:
| Task | Model | Estimated Cost |
|---|---|---|
| Summarize 311 articles | Haiku | ~$2-3 |
| Summarize 793 sources | Haiku | ~$10-15 |
| Single article improvement | Sonnet | ~$0.20 |
### `npm run kb:stats`

Displays database statistics.

```bash
npm run kb:stats
```

## Core Module API
The database is accessed via `scripts/lib/knowledge-db.mjs`.

### Import

```js
import {
  db,                  // Better-sqlite3 instance
  articles,            // Article operations
  sources,             // Source operations
  summaries,           // Summary operations
  relations,           // Entity relation operations
  getResearchContext,  // Full context for article
  getStats,            // Database statistics
  CACHE_DIR,           // Path to .cache/
  SOURCES_DIR,         // Path to .cache/sources/
} from './scripts/lib/knowledge-db.mjs';
```

### articles API
```js
// Get article by ID
const article = articles.get('deceptive-alignment');

// Get article with its summary
const withSummary = articles.getWithSummary('deceptive-alignment');

// Get all articles
const all = articles.getAll();

// Find articles needing summaries
const unsummarized = articles.needingSummary();

// Search articles
const results = articles.search('reward hacking');

// Check if content changed
const changed = articles.hasChanged('id', newHash);

// Insert/update article
articles.upsert({
  id: 'my-article',
  path: 'knowledge-base/risks/my-article.mdx',
  title: 'My Article',
  description: 'Description here',
  content: 'Full text content...',
  word_count: 1500,
  quality: 3,
  content_hash: 'abc123...'
});
```

### sources API
```js
// Get source by ID or URL
const source = sources.get('abc123def456');
const byUrl = sources.getByUrl('https://arxiv.org/...');

// Get sources for an article
const articleSources = sources.getForArticle('deceptive-alignment');

// Get pending sources for fetching
const pending = sources.getPending(100);

// Link source to article
sources.linkToArticle('article-id', 'source-id', 'citation context');

// Mark source fetch status
sources.markFetched('source-id', 'content...');
sources.markFailed('source-id', 'Error message');

// Get statistics
const stats = sources.stats();
// { total: 793, pending: 793, fetched: 0, failed: 0, manual: 0 }
```

### summaries API
```js
// Get summary by entity ID
const summary = summaries.get('deceptive-alignment');

// Get all summaries of a type
const articleSummaries = summaries.getAll('article');

// Insert/update summary
summaries.upsert('entity-id', 'article', {
  oneLiner: 'Single sentence...',
  summary: 'Full summary...',
  keyPoints: ['Point 1', 'Point 2'],
  keyClaims: [{ claim: '...', value: '...' }],
  model: 'claude-3-5-haiku-20241022',
  tokensUsed: 1247
});

// Get statistics
const stats = summaries.stats();
// { article: { count: 311, tokens: 387000 }, source: { count: 0, tokens: 0 } }
```

### Research Context
Get comprehensive context for improving an article:

```js
const context = getResearchContext('deceptive-alignment');
// Returns:
// {
//   article: { ...article, summary: {...} },
//   relatedArticles: [...],
//   sources: [...],
//   claims: [...],
//   stats: { relatedCount, sourcesTotal, sourcesFetched, claimsCount }
// }
```

## Source Type Inference
When scanning content, source types are inferred from domains:
| Domain Pattern | Inferred Type |
|---|---|
| arxiv.org, doi.org, nature.com | paper |
| lesswrong.com, alignmentforum.org | blog |
| substack.com | blog |
| .gov, congress.gov, whitehouse.gov | government |
| wikipedia.org | reference |
| .pdf (any domain) | report |
| (default) | web |
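The actual mapping lives in the scanner; the sketch below only illustrates the rule ordering the table implies (the PDF check first, since it applies to any domain, then specific domains, then the default). The `inferSourceType` name is hypothetical:

```js
// Illustrative only; mirrors the table above, not the scanner's exact code
function inferSourceType(url) {
  const { hostname, pathname } = new URL(url);
  const host = hostname.replace(/^www\./, '');

  if (pathname.toLowerCase().endsWith('.pdf')) return 'report';
  if (['arxiv.org', 'doi.org', 'nature.com'].some((d) => host.endsWith(d))) return 'paper';
  if (['lesswrong.com', 'alignmentforum.org', 'substack.com'].some((d) => host.endsWith(d))) return 'blog';
  if (host.endsWith('.gov')) return 'government';
  if (host.endsWith('wikipedia.org')) return 'reference';
  return 'web';
}

inferSourceType('https://arxiv.org/abs/2312.06942'); // 'paper'
```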
## Directory Structure

```
project/
├── .cache/                      # Gitignored
│   ├── knowledge.db             # SQLite database
│   └── sources/                 # Cached source documents
│       ├── pdf/
│       ├── html/
│       └── text/
├── scripts/
│   ├── lib/
│   │   ├── knowledge-db.mjs     # Core DB module
│   │   ├── file-utils.mjs       # File discovery
│   │   ├── mdx-utils.mjs        # MDX parsing
│   │   └── output.mjs           # Terminal formatting
│   ├── scan-content.mjs         # Content scanner
│   └── generate-summaries.mjs   # AI summarization
├── src/content/docs/            # Source MDX files
└── .env                         # API credentials
```

## Workflow Examples
### After Editing Content
Section titled “After Editing Content”# 1. Scan for changes (fast, uses hash comparison)npm run kb:scan
# 2. Generate summaries for new/changed articlesnpm run kb:summarize --resummaryBulk Initial Setup
Section titled “Bulk Initial Setup”# 1. Scan all contentnpm run kb:scan --force
# 2. Generate summaries in batchesnode scripts/generate-summaries.mjs --batch 100node scripts/generate-summaries.mjs --batch 100node scripts/generate-summaries.mjs --batch 100# ... repeat until doneCheck Database State
Section titled “Check Database State”# View statisticsnpm run kb:stats
# Or programmaticallynode -e "import { getStats } from './scripts/lib/knowledge-db.mjs';console.log(JSON.stringify(getStats(), null, 2));"Exporting to YAML (Resources System)
The database serves as a processing layer. Canonical data is exported to YAML files for the site build.

### Export Resources
Section titled “Export Resources”# Export cited sources to resources.yamlnode scripts/export-resources.mjs
# Export ALL sources (including uncited)node scripts/export-resources.mjs --all
# Preview without writingnode scripts/export-resources.mjs --dry-runThe export script:
- Reads sources from SQLite with their AI summaries
- Merges with existing `src/data/resources.yaml` (preserves manual edits; see the sketch after this list)
- Includes `cited_by` references showing which articles cite each source
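How "preserves manual edits" works is not specified here; one plausible merge rule is to let values already present in the YAML win over freshly exported ones. A hedged sketch under that assumption, using `js-yaml` (the actual script may use a different YAML library or merge policy):

```js
import { readFileSync, writeFileSync } from 'node:fs';
import yaml from 'js-yaml';

// Entries built from the sources + summaries tables (shape per "Resource Schema" below)
const exported = [/* { id, url, title, summary, cited_by, ... } */];

// Existing YAML entries win field-by-field, so hand edits survive re-export
const existing = yaml.load(readFileSync('src/data/resources.yaml', 'utf8')) ?? [];
const byId = new Map(existing.map((r) => [r.id, r]));
for (const entry of exported) {
  byId.set(entry.id, { ...entry, ...byId.get(entry.id) });
}

writeFileSync('src/data/resources.yaml', yaml.dump([...byId.values()]));
```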
### Using Resources in MDX

Once resources are in `resources.yaml`, you can reference them semantically:
```mdx
import { ResourceLink, ResourceList, ResourceCite } from '../../components/wiki';

Recent work on AI control <ResourceCite id="ai-control-2023" /> shows...

See also: <ResourceLink id="superintelligence-2014" />

## Key Papers

<ResourceList ids={["ai-control-2023", "concrete-problems-2016"]} showSummaries />
```

### Resource Schema
Resources in `resources.yaml` have this structure:

```yaml
- id: ai-control-2023
  url: https://arxiv.org/abs/2312.06942
  title: "AI Control: Improving Safety..."
  authors: ["Ryan Greenblatt", "Buck Shlegeris"]
  published_date: "2023-12"
  type: paper  # paper, blog, book, report, talk, podcast, government, reference, web
  summary: "AI-generated summary..."
  key_points:
    - "Point 1"
    - "Point 2"
  cited_by:
    - agentic-ai
    - ai-control
```

## Limitations
Section titled “Limitations”- Source fetching not implemented: The
fetch_statusfield exists but automatic fetching of external sources is not yet built - Claims extraction minimal: The claims table exists but extraction is not fully implemented
- Local only: Database is gitignored and must be regenerated on each machine
- No incremental summary updates: Summaries are regenerated from scratch, not updated
## Future Enhancements

Potential improvements to the system:
- Automatic source fetching (PDFs, web pages)
- Claims extraction and consistency checking across articles
- Similarity search using embeddings
- Migration of entity `sources` to use resource IDs
- Integration with content validation tools