
Content Database System

This project includes a SQLite-based content database for managing article metadata, tracking external source references, and generating AI-powered summaries. The database is stored locally in .cache/knowledge.db (gitignored) and serves as a foundation for content quality tooling.


# Scan all content files and populate database
npm run kb:scan
# Generate AI summaries for articles
npm run kb:summarize
# Export sources to resources.yaml
node scripts/export-resources.mjs
# View database statistics
npm run kb:stats


The articles table stores content extracted from MDX files.

Column       | Type             | Description
id           | TEXT PRIMARY KEY | Entity ID from filename
path         | TEXT             | Relative path to source file
title        | TEXT             | Article title from frontmatter
description  | TEXT             | Article description
content      | TEXT             | Plain text content (JSX removed)
word_count   | INTEGER          | Word count for prioritization
quality      | INTEGER          | Quality rating from frontmatter
content_hash | TEXT             | MD5 hash for change detection
created_at   | TEXT             | When the article was first indexed
updated_at   | TEXT             | When the article was last updated
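
For reference, a minimal sketch of the corresponding DDL, inferred from the column list above rather than copied from the project's actual migration:

// Sketch only: table definition inferred from the column list above.
// knowledge-db.mjs uses better-sqlite3, so the same style is used here.
import Database from 'better-sqlite3';

const db = new Database('.cache/knowledge.db');
db.exec(`
  CREATE TABLE IF NOT EXISTS articles (
    id           TEXT PRIMARY KEY,
    path         TEXT,
    title        TEXT,
    description  TEXT,
    content      TEXT,
    word_count   INTEGER,
    quality      INTEGER,
    content_hash TEXT,
    created_at   TEXT,
    updated_at   TEXT
  )
`);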

The sources table stores metadata about external references discovered in articles.

Column       | Type             | Description
id           | TEXT PRIMARY KEY | SHA-256 hash of the URL/DOI (first 16 chars)
url          | TEXT             | Full URL of the external source
doi          | TEXT             | Digital Object Identifier (if a paper)
title        | TEXT             | Source title/headline
authors      | TEXT (JSON)      | Array of author names
year         | INTEGER          | Publication year
source_type  | TEXT             | Type: paper, blog, report, web, etc.
content      | TEXT             | Fetched source content
fetch_status | TEXT             | pending, fetched, failed, or manual
fetched_at   | TEXT             | When the source was last fetched
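
The id column is straightforward to reproduce; a sketch, assuming the ID is the first 16 hex characters of a SHA-256 digest of the URL (or DOI) with no extra normalization:

// Sketch: derive a source ID as described above. Whether the URL is
// normalized before hashing is an assumption.
import { createHash } from 'node:crypto';

function sourceId(urlOrDoi) {
  return createHash('sha256').update(urlOrDoi).digest('hex').slice(0, 16);
}

sourceId('https://arxiv.org/abs/2312.06942'); // => 16-char hex ID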

A junction table links each article to the sources it cites.

Column           | Type | Description
article_id       | TEXT | Foreign key to articles
source_id        | TEXT | Foreign key to sources
citation_context | TEXT | Quote where the source is cited

The summaries table stores AI-generated summaries of articles and sources.

Column       | Type             | Description
entity_id    | TEXT PRIMARY KEY | ID of the summarized entity
entity_type  | TEXT             | 'article' or 'source'
one_liner    | TEXT             | Single-sentence summary (max 25 words)
summary      | TEXT             | 2-3 paragraph summary
key_points   | TEXT (JSON)      | 3-5 bullet points
key_claims   | TEXT (JSON)      | Claims with values
model        | TEXT             | Model used (e.g., claude-3-5-haiku)
tokens_used  | INTEGER          | Total tokens consumed
generated_at | TEXT             | When the summary was generated

The relations table stores entity relationships loaded from entities.yaml.

Column       | Type | Description
from_id      | TEXT | Source entity ID
to_id        | TEXT | Target entity ID
relationship | TEXT | Relationship type

The scanner (scripts/scan-content.mjs) scans all MDX files and populates the database.

# Standard scan (skips unchanged files)
npm run kb:scan
# Force rescan all files
node scripts/scan-content.mjs --force
# Show per-file progress
node scripts/scan-content.mjs --verbose
# Show database stats only
node scripts/scan-content.mjs --stats

What it does:

  1. Finds all .mdx and .md files in src/content/docs/
  2. Extracts frontmatter (title, description, quality, sources)
  3. Extracts plain text content (removes imports, JSX, HTML comments)
  4. Discovers URLs from markdown links and DOIs
  5. Infers source types from domains (arxiv.org → paper, lesswrong.com → blog)
  6. Loads entity relations from entities.yaml
  7. Skips unchanged files via content-hash comparison (sketched below)
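
A minimal sketch of the change-detection step, combining the MD5 content_hash column with the articles.hasChanged helper documented below (the surrounding control flow is an assumption):

// Sketch of step 7: skip files whose content hash is unchanged.
// MD5 matches the content_hash column; the rest is illustrative.
import { createHash } from 'node:crypto';
import { articles } from './scripts/lib/knowledge-db.mjs';

function indexFile(id, content) {
  const hash = createHash('md5').update(content).digest('hex');
  if (!articles.hasChanged(id, hash)) return; // unchanged: skip re-indexing
  // ...extract frontmatter, word count, etc., then:
  // articles.upsert({ id, content, content_hash: hash, /* ... */ });
}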

The summarizer (scripts/generate-summaries.mjs) generates AI summaries using the Anthropic API.

# Summarize 10 articles (default)
npm run kb:summarize
# Summarize specific count
node scripts/generate-summaries.mjs --batch 50
# Summarize sources instead of articles
node scripts/generate-summaries.mjs --type sources
# Use higher-quality model
node scripts/generate-summaries.mjs --model sonnet
# Summarize specific article
node scripts/generate-summaries.mjs --id deceptive-alignment
# Re-summarize changed content
node scripts/generate-summaries.mjs --resummary
# Preview without API calls
node scripts/generate-summaries.mjs --dry-run
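
A hedged sketch of the core summarization loop: the model ID, default batch of 10, and the summaries.upsert shape come from this page, while the prompt wording and JSON response handling are assumptions about the script:

// Sketch of the summarization loop; the prompt and response format are
// assumptions, not the script's actual code.
import Anthropic from '@anthropic-ai/sdk';
import { articles, summaries } from './scripts/lib/knowledge-db.mjs';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const model = 'claude-3-5-haiku-20241022';

for (const article of articles.needingSummary().slice(0, 10)) {
  const msg = await client.messages.create({
    model,
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Summarize as JSON with oneLiner, summary, keyPoints:\n\n${article.content}`,
    }],
  });
  summaries.upsert(article.id, 'article', {
    ...JSON.parse(msg.content[0].text),
    model,
    tokensUsed: msg.usage.input_tokens + msg.usage.output_tokens,
  });
}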

Models available:

Model  | ID                        | Cost (per 1M input tokens) | Use case
haiku  | claude-3-5-haiku-20241022 | ~$0.25                     | Bulk summarization
sonnet | claude-sonnet-4-20250514  | ~$3.00                     | Higher quality
opus   | claude-opus-4-20250514    | ~$15.00                    | Best quality

Cost estimates:

Task                       | Model  | Estimated Cost
Summarize 311 articles     | Haiku  | ~$2-3
Summarize 793 sources      | Haiku  | ~$10-15
Single article improvement | Sonnet | ~$0.20

Displays database statistics.

npm run kb:stats

The database is accessed via scripts/lib/knowledge-db.mjs.

import {
  db,                 // Better-sqlite3 instance
  articles,           // Article operations
  sources,            // Source operations
  summaries,          // Summary operations
  relations,          // Entity relation operations
  getResearchContext, // Full context for an article
  getStats,           // Database statistics
  CACHE_DIR,          // Path to .cache/
  SOURCES_DIR,        // Path to .cache/sources/
} from './scripts/lib/knowledge-db.mjs';
// Get article by ID
const article = articles.get('deceptive-alignment');
// Get article with its summary
const withSummary = articles.getWithSummary('deceptive-alignment');
// Get all articles
const all = articles.getAll();
// Find articles needing summaries
const unsummarized = articles.needingSummary();
// Search articles
const results = articles.search('reward hacking');
// Check if content changed
const changed = articles.hasChanged('id', newHash);
// Insert/update article
articles.upsert({
  id: 'my-article',
  path: 'knowledge-base/risks/my-article.mdx',
  title: 'My Article',
  description: 'Description here',
  content: 'Full text content...',
  word_count: 1500,
  quality: 3,
  content_hash: 'abc123...'
});

// Get source by ID or URL
const source = sources.get('abc123def456');
const byUrl = sources.getByUrl('https://arxiv.org/...');
// Get sources for an article
const articleSources = sources.getForArticle('deceptive-alignment');
// Get pending sources for fetching
const pending = sources.getPending(100);
// Link source to article
sources.linkToArticle('article-id', 'source-id', 'citation context');
// Mark source fetch status
sources.markFetched('source-id', 'content...');
sources.markFailed('source-id', 'Error message');
// Get statistics
const sourceStats = sources.stats();
// { total: 793, pending: 793, fetched: 0, failed: 0, manual: 0 }

// Get summary by entity ID
const summary = summaries.get('deceptive-alignment');
// Get all summaries of a type
const articleSummaries = summaries.getAll('article');
// Insert/update summary
summaries.upsert('entity-id', 'article', {
  oneLiner: 'Single sentence...',
  summary: 'Full summary...',
  keyPoints: ['Point 1', 'Point 2'],
  keyClaims: [{ claim: '...', value: '...' }],
  model: 'claude-3-5-haiku-20241022',
  tokensUsed: 1247
});
// Get statistics
const summaryStats = summaries.stats();
// { article: { count: 311, tokens: 387000 }, source: { count: 0, tokens: 0 } }

Get comprehensive context for improving an article:

const context = getResearchContext('deceptive-alignment');
// Returns:
// {
//   article: { ...article, summary: {...} },
//   relatedArticles: [...],
//   sources: [...],
//   claims: [...],
//   stats: { relatedCount, sourcesTotal, sourcesFetched, claimsCount }
// }
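
As an illustration of consuming that shape (field names follow the return value above; the loop itself is hypothetical):

// Sketch: report fetch coverage and list cited sources for one article.
const ctx = getResearchContext('deceptive-alignment');

console.log(`${ctx.stats.sourcesFetched}/${ctx.stats.sourcesTotal} sources fetched`);
for (const src of ctx.sources) {
  console.log(`- ${src.title ?? src.url}`);
}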

When scanning content, source types are inferred from domains:

Domain Pattern                     | Inferred Type
arxiv.org, doi.org, nature.com     | paper
lesswrong.com, alignmentforum.org  | blog
substack.com                       | blog
.gov, congress.gov, whitehouse.gov | government
wikipedia.org                      | reference
.pdf (any domain)                  | report
(default)                          | web
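
A sketch of that inference, assuming hostname-suffix matching and that the .pdf rule takes precedence (the real rule order may differ):

// Sketch of source-type inference from the table above.
// Rule order (e.g. whether .pdf beats arxiv.org) is an assumption.
function inferSourceType(rawUrl) {
  const { hostname, pathname } = new URL(rawUrl);
  if (pathname.toLowerCase().endsWith('.pdf')) return 'report';
  const endsWithAny = (domains) => domains.some((d) => hostname.endsWith(d));
  if (endsWithAny(['arxiv.org', 'doi.org', 'nature.com'])) return 'paper';
  if (endsWithAny(['lesswrong.com', 'alignmentforum.org', 'substack.com'])) return 'blog';
  if (hostname.endsWith('.gov')) return 'government';
  if (hostname.endsWith('wikipedia.org')) return 'reference';
  return 'web';
}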

project/
├── .cache/                      # Gitignored
│   ├── knowledge.db             # SQLite database
│   └── sources/                 # Cached source documents
│       ├── pdf/
│       ├── html/
│       └── text/
├── scripts/
│   ├── lib/
│   │   ├── knowledge-db.mjs     # Core DB module
│   │   ├── file-utils.mjs       # File discovery
│   │   ├── mdx-utils.mjs        # MDX parsing
│   │   └── output.mjs           # Terminal formatting
│   ├── scan-content.mjs         # Content scanner
│   └── generate-summaries.mjs   # AI summarization
├── src/content/docs/            # Source MDX files
└── .env                         # API credentials

Common workflows:

# Workflow: after content changes
# 1. Scan for changes (fast, uses hash comparison)
npm run kb:scan
# 2. Generate summaries for new/changed articles
npm run kb:summarize -- --resummary

# Workflow: full rebuild
# 1. Rescan all content
npm run kb:scan -- --force
# 2. Generate summaries in batches; repeat until done
node scripts/generate-summaries.mjs --batch 100

# Workflow: check status
npm run kb:stats
# Or programmatically (--input-type=module lets node -e use a static import)
node --input-type=module -e "
import { getStats } from './scripts/lib/knowledge-db.mjs';
console.log(JSON.stringify(getStats(), null, 2));
"

The database serves as a processing layer. Canonical data is exported to YAML files for the site build.

# Export cited sources to resources.yaml
node scripts/export-resources.mjs
# Export ALL sources (including uncited)
node scripts/export-resources.mjs --all
# Preview without writing
node scripts/export-resources.mjs --dry-run

The export script:

  • Reads sources from SQLite with their AI summaries
  • Merges with the existing src/data/resources.yaml, preserving manual edits (see the sketch after this list)
  • Includes cited_by references showing which articles cite each source
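
A minimal sketch of that merge step, assuming the js-yaml package and a by-id merge in which database fields fill gaps but manual edits always win:

// Sketch of the export merge; js-yaml, sources.getAll(), and the exact
// merge policy are assumptions about the real script.
import { readFileSync, writeFileSync } from 'node:fs';
import yaml from 'js-yaml';
import { sources, summaries } from './scripts/lib/knowledge-db.mjs';

const existing = yaml.load(readFileSync('src/data/resources.yaml', 'utf8')) ?? [];
const byId = new Map(existing.map((r) => [r.id, r]));

for (const src of sources.getAll()) {
  const summary = summaries.get(src.id);
  byId.set(src.id, {
    id: src.id,
    url: src.url,
    title: src.title,
    type: src.source_type,
    summary: summary?.summary,
    ...byId.get(src.id), // existing manual edits take precedence
  });
}

writeFileSync('src/data/resources.yaml', yaml.dump([...byId.values()]));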

Once resources are in resources.yaml, you can reference them semantically:

import { ResourceLink, ResourceList, ResourceCite } from '../../components/wiki';

Recent work on AI control <ResourceCite id="ai-control-2023" /> shows...

See also: <ResourceLink id="superintelligence-2014" />

## Key Papers

<ResourceList
  ids={["ai-control-2023", "concrete-problems-2016"]}
  showSummaries
/>

Resources in resources.yaml have this structure:

- id: ai-control-2023
  url: https://arxiv.org/abs/2312.06942
  title: "AI Control: Improving Safety..."
  authors: ["Ryan Greenblatt", "Buck Shlegeris"]
  published_date: "2023-12"
  type: paper # paper, blog, book, report, talk, podcast, government, reference, web
  summary: "AI-generated summary..."
  key_points:
    - "Point 1"
    - "Point 2"
  cited_by:
    - agentic-ai
    - ai-control

Current limitations:

  1. Source fetching not implemented: The fetch_status field exists, but automatic fetching of external sources is not yet built
  2. Claims extraction minimal: The claims table exists but extraction is not fully implemented
  3. Local only: Database is gitignored and must be regenerated on each machine
  4. No incremental summary updates: Summaries are regenerated from scratch, not updated

Potential improvements to the system:

  • Automatic source fetching (PDFs, web pages)
  • Claims extraction and consistency checking across articles
  • Similarity search using embeddings
  • Migration of entity sources to use resource IDs
  • Integration with content validation tools