• 8 Posts
  • 16 Comments
Joined 4 months ago
Cake day: November 22nd, 2025





  • I love this idea, but I must be honest: fine-tuning a Mistral model to embed the "voice" of Marx, Engels, Lenin, Stalin, and Mao directly into the weights would be an immense machine-learning and engineering task. Now, an abliterated DeepSeek 7B ("abliterated" meaning uncensored) = true power…

  • percyraskovato · Crush Agentic AI · Scope your prompts
    2 points · 4 months ago

    here is my CLAUDE.md for a project:


    CLAUDE.md

    This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

    ⚠️ CRITICAL SCALE UPDATE

    Corpus Scale: 200GB raw archive → 50GB optimized (75% reduction through strategic filtering)

    Corpus Analysis: ✅ Complete - 46GB English content analyzed (55,753 documents across 6 sections)

    For architecture overview, see: Architecture Overview (includes corpus foundation)

    The architecture includes:

    • Corpus Foundation: Systematic 46GB analysis informing all data decisions
    • Metadata Schema: 5-layer model achieving 85%+ author coverage
    • Chunking Strategies: 4 adaptive strategies based on document structure
    • Knowledge Graph: ~2,500 entities forming hybrid retrieval foundation
    • Infrastructure: Simplified GCP architecture with Weaviate + Runpod embeddings
    • Parallel Development: 6-instance coordination strategy

    Project Overview

    The Marxists Internet Archive (MIA) RAG Pipeline converts 126,000+ pages of Marxist theory (HTML + PDFs) into a queryable RAG system. This is a local, private, fully-owned knowledge base designed for material analysis research, class composition studies, and theoretical framework development.

    Note: The reference implementation below works for small-scale testing. For production processing, see Architecture Overview for complete details.

    🏗️ Parallel Development Architecture

    This project uses a 6-instance parallel development model where different Claude Code instances work on separate modules simultaneously. Each instance has specific boundaries to prevent conflicts.

    Instance Boundaries

    Instance 1 (Storage & Pipeline):

    • src/mia_rag/storage/ - GCS storage management
    • src/mia_rag/pipeline/ - Document processing pipeline
    • tests/unit/instance1_* - Instance 1 tests

    Instance 2 (Embeddings):

    • src/mia_rag/embeddings/ - Runpod embedding generation
    • tests/unit/instance2_* - Instance 2 tests

    Instance 3 (Weaviate):

    • src/mia_rag/vectordb/ - Weaviate vector database
    • tests/unit/instance3_* - Instance 3 tests

    Instance 4 (API):

    • src/mia_rag/api/ - FastAPI query interface
    • tests/unit/instance4_* - Instance 4 tests

    Instance 5 (MCP):

    • src/mia_rag/mcp/ - Model Context Protocol integration
    • tests/unit/instance5_* - Instance 5 tests

    Instance 6 (Monitoring & Testing):

    • src/mia_rag/monitoring/ - Prometheus/Grafana monitoring
    • tests/(integration|scale|contract)/ - Cross-instance tests

    Shared Resources (require coordination):

    • src/mia_rag/interfaces/ - Interface contracts (RFC process required)
    • src/mia_rag/common/ - Shared utilities

    Working in Parallel

    Before starting work:

    1. Check planning/ directory for active projects and issues
    2. Verify your instance assignment in .instance file
    3. Run boundary check: poetry run python scripts/check_boundaries.py --instance instance{N} --auto

    Branch naming convention:

    • Instance work: instance{N}/{module}-{feature} (e.g., instance1/storage-gcs-retry)
    • Interface changes: rfc/{number}-{description} (e.g., rfc/001-metadata-schema)
    • Releases: release/v{version} (e.g., release/v0.2.0)
    • Hotfixes: hotfix/{description} (e.g., hotfix/memory-leak)
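
    The convention above can be sanity-checked mechanically; a minimal sketch (the patterns are an illustrative guess at the convention, not the project's actual hooks):

```python
import re

# One pattern per branch family from the naming convention above.
BRANCH_PATTERNS = {
    "instance": re.compile(r"^instance[1-6]/[a-z0-9-]+-[a-z0-9-]+$"),
    "rfc": re.compile(r"^rfc/\d{3}-[a-z0-9-]+$"),
    "release": re.compile(r"^release/v\d+\.\d+\.\d+$"),
    "hotfix": re.compile(r"^hotfix/[a-z0-9-]+$"),
}

def branch_is_valid(name: str) -> bool:
    return any(p.match(name) for p in BRANCH_PATTERNS.values())
```

    A pre-push hook could call `branch_is_valid` and refuse to push anything that matches no family.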

    CI/CD workflows:

    • instance-tests.yml - Runs tests for changed instances only
    • conflict-detection.yml - Detects boundary violations in PRs
    • daily-integration.yml - Merges instance branches into shared integration branch

    Development Commands

    Setup and Installation

    # Install Poetry dependencies (core + dev)
    poetry install
    
    # Install specific instance dependencies
    poetry install --extras instance1  # Storage & Pipeline
    poetry install --extras instance2  # Embeddings
    poetry install --extras instance3  # Weaviate
    poetry install --extras instance4  # API
    poetry install --extras instance5  # MCP
    poetry install --extras instance6  # Monitoring
    
    # Install all dependencies (integration testing)
    poetry install --extras all
    

    Testing

    # Run all tests for your instance
    poetry run pytest -m instance1  # Replace with your instance number
    
    # Run specific test types
    poetry run pytest -m unit        # Unit tests only
    poetry run pytest -m integration # Integration tests
    poetry run pytest -m contract    # Contract tests (interface validation)
    
    # Run tests for a specific file
    poetry run pytest tests/unit/instance1_storage/test_gcs_storage.py
    
    # Run with coverage
    poetry run pytest --cov=src/mia_rag --cov-report=html
    
    # Run specific test by name
    poetry run pytest -k "test_embedding_generation"
    

    Linting and Code Quality

    # Run Ruff linting
    poetry run ruff check .
    
    # Auto-fix issues
    poetry run ruff check --fix .
    
    # Format code
    poetry run ruff format .
    
    # Type checking
    poetry run mypy src/
    
    # Check cyclomatic complexity (for refactoring)
    poetry run radon cc src/ -a -nb
    

    Git Workflow

    # Install git hooks
    bash scripts/install-hooks.sh
    
    # Check boundaries before commit
    poetry run python scripts/check_boundaries.py --instance instance1 --auto
    
    # Check interface compliance
    poetry run python scripts/check_interfaces.py --check-all
    
    # Commit with conventional commit format
    git commit -m "feat(storage): add GCS retry logic"
    # Types: feat, fix, docs, style, refactor, test, chore
    

    Running the Pipeline (Reference Implementation)

    # Step 1: Download MIA metadata
    python mia_processor.py --download-json
    
    # Step 2: Process archive (HTML/PDF → Markdown)
    python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/ --output ~/marxists-processed/
    
    # Step 3: Ingest to vector database
    python rag_ingest.py --db chroma --markdown-dir ~/marxists-processed/markdown/ --persist-dir ./mia_vectordb/
    
    # Step 4: Query the system
    python query_example.py --db chroma --query "What is surplus value?" --persist-dir ./mia_vectordb/
    

    Code Architecture

    Reference Implementation (Legacy)

    The original monolithic implementation consists of:

    • mia_processor.py - HTML/PDF to Markdown conversion
    • rag_ingest.py - Chunking and vector database ingestion
    • query_example.py - Query interface

    These scripts still work but are being refactored into the modular src/mia_rag/ structure.

    Refactored Architecture

    Domain Models (scripts/domain/):

    • boundaries.py - Instance boundary specifications
    • instance.py - Instance configuration and metadata
    • interfaces.py - Interface contract definitions
    • recovery.py - Recovery state and operations
    • metrics.py - Metrics and performance tracking

    Design Patterns (scripts/patterns/):

    • specifications.py - Specification pattern for boundary checking
    • validators.py - Chain of Responsibility pattern for validation
    • visitors.py - Visitor pattern for interface analysis
    • commands.py - Command pattern for operations
    • recovery.py - Template Method pattern for recovery strategies
    • repositories.py - Repository pattern for data access
    • builders.py - Builder pattern for complex objects

    Key Refactored Scripts:

    • scripts/check_boundaries.py - Uses Specification pattern (✅ Refactored)
    • scripts/check_conflicts.py - Uses Chain of Responsibility (✅ Refactored)
    • scripts/check_interfaces.py - Uses Visitor pattern (✅ Refactored)
    • scripts/instance_map.py - Uses Command pattern (✅ Refactored)
    • scripts/instance_recovery.py - Uses Template Method pattern (✅ Refactored)

    Complexity Targets (enforced by Ruff):

    • Max branches: 12 per function
    • Max statements: 50 per function
    • Max arguments: 7 per function
    • Max returns: 6 per function
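
    In pyproject.toml these limits would map onto Ruff's pylint settings roughly as follows (a sketch; the corresponding PLR09xx rules must also be selected for the limits to be enforced):

```toml
[tool.ruff.lint.pylint]
max-branches = 12    # PLR0912
max-statements = 50  # PLR0915
max-args = 7         # PLR0913
max-returns = 6      # PLR0911
```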

    Package Structure

    src/mia_rag/
    ├── interfaces/          # Interface contracts (shared)
    │   ├── __init__.py
    │   └── contracts.py
    ├── common/              # Shared utilities (coordination required)
    ├── storage/             # Instance 1: GCS storage
    ├── pipeline/            # Instance 1: Document processing
    ├── embeddings/          # Instance 2: Runpod embeddings
    ├── vectordb/            # Instance 3: Weaviate
    ├── api/                 # Instance 4: FastAPI
    ├── mcp/                 # Instance 5: MCP server
    └── monitoring/          # Instance 6: Prometheus/Grafana
    

    Corpus Analysis Foundation

    CRITICAL: All implementation decisions must be informed by the completed corpus analysis (46GB English content, 55,753 documents).

    Essential Reading Before Coding

    Metadata & Schemas:

    Chunking & Document Structure:

    • …/specs/07-chunking-strategies-spec.md - 4 adaptive chunking strategies
      • 70% of documents have good heading hierarchies → semantic-break chunking
      • 40% are heading-less → paragraph-cluster chunking fallback
      • Glossary → entry-based chunking (special case)
      • Target: 650-750 tokens/chunk average, >70% with heading context
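
    The semantic-break strategy can be sketched as follows (a simplified illustration: token counts are approximated with a whitespace split, and the real tokenizer and thresholds live in the spec):

```python
import re

# Split a Markdown document just before every ATX heading so each section
# keeps its heading, then greedily pack paragraphs into chunks near the
# 650-750-token target.
TARGET_TOKENS = 700

def semantic_break_chunks(markdown: str) -> list[str]:
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        current: list[str] = []
        count = 0
        for para in section.split("\n\n"):
            if not para.strip():
                continue
            words = len(para.split())
            if current and count + words > TARGET_TOKENS:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            chunks.append("\n\n".join(current))
    return chunks
```

    Because the split happens on heading boundaries first, every chunk that begins a section carries its heading context, which is what the >70% heading-context target measures.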

    Knowledge Graph & Entities:

    • …/specs/08-knowledge-graph-spec.md - Hybrid retrieval architecture
      • ~2,500 Glossary entities form canonical node set
      • 10 node types, 14 edge types for vector + graph retrieval
      • 5k-10k cross-references extracted from corpus
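
    A minimal sketch of how typed nodes and edges might be represented (the enum members shown are illustrative only; the spec defines the full 10 node and 14 edge types):

```python
from dataclasses import dataclass
from enum import Enum

class NodeType(Enum):
    PERSON = "person"
    CONCEPT = "concept"
    WORK = "work"  # subset for illustration

class EdgeType(Enum):
    AUTHORED = "authored"
    REFERENCES = "references"  # subset for illustration

@dataclass(frozen=True)
class Node:
    id: str
    type: NodeType
    label: str

@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    type: EdgeType

marx = Node("person/marx", NodeType.PERSON, "Karl Marx")
capital = Node("work/capital-v1", NodeType.WORK, "Capital, Volume I")
edge = Edge(marx.id, capital.id, EdgeType.AUTHORED)
```

    Typed edges like this are what lets graph traversal complement vector search in the hybrid retrieval design.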

    Section-Specific Analyses

    When implementing processing for specific corpus sections, consult: