that is precisely what it is: a thin wrapper around the ChatGPT API with a RAG database of all their articles
btw
https://github.com/Kimonarrow/ChatGPT-4o-Jailbreak
knock yourself out! jailbreak their AI and go wild
i love this idea but i must be honest: fine-tuning a Mistral model to specifically embed the weights with the “voice” of Marx, Engels, Lenin, Stalin, and Mao would be an immense machine learning and engineering task. Now, abliterated DeepSeek 7B (“abliterated” means uncensored) = true power…
thats a tooling/prompting/context window management problem. it can be solved with proper programming procedures and smart memory management
https://github.com/percy-raskova/marxists.org-rag-db
one step ahead of you
i’m actually working on this! https://github.com/percy-raskova/pw-mcp
it’s literally just them doing the whole “selling newspapers” thing but with generative AI
hey! i’m the developer of the Marxist RAG and the ProleWiki RAG, so if you have any questions/ideas/suggestions or want to participate, reply and let me know!!!
Yeah, I’m actually in the process of fine-tuning and developing an actual AI
the trotskyists just use a wrapper around vanilla ChatGPT and give it a database of Trotskyist articles lmfao
vanilla ChatGPT is free btw
excellent! if it’s ok, i can DM you what we had in mind for a program
percyraskova to Comradeship // Freechat • Groups to join IRL?
(Not me but from a curious redditor) Does anyone live in New York? By chance, do you know any groups that are having the same conversations we have online, but IRL?
anyone near Baltimore, feel free to reach out. i’m trying to start a primary org with my partner and we’re drafting up bylaws, points of unity, and a CONOPS. we have an idea of a survival program but need to flesh it out
percyraskova to Crush Agentic AI • If you're scared about Crush or other agents accessing too much of your files, isolate them in a Docker container
This is generally a good strategy, with one security caveat: the Docker daemon runs as root, and anyone in the `docker` group effectively has root access to the host. Make sure your AI agent doesn’t have the capacity to exploit that, especially if you’re mounting your hard drive into the container as a data volume that saves back to your local machine!
IIRC Podman is a daemonless alternative to Docker that runs containers rootless by default, so it doesn’t have the root-permission issue described above
Also docker container networking can be a nightmare, but you’re overall correct. I think this is the move!
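To make the caveat above concrete, here is a rough sketch (built in Python just so the flags are easy to audit; the image name and mount path are made-up placeholders) of a locked-down `docker run` invocation:

```python
# Sketch: build a locked-down `docker run` argv for an untrusted agent.
# The image name and mount path are hypothetical placeholders.

def sandbox_cmd(image: str, workdir: str) -> list[str]:
    """Deny root, all capabilities, network access, and filesystem writes."""
    return [
        "docker", "run", "--rm",
        "--user", "1000:1000",        # unprivileged UID:GID, not root
        "--cap-drop", "ALL",          # drop every Linux capability
        "--network", "none",          # no network from inside the container
        "--read-only",                # read-only root filesystem
        "-v", f"{workdir}:/data:ro",  # mount project files read-only
        image,
    ]

cmd = sandbox_cmd("my-agent:latest", "/home/me/project")
```

An agent run this way can read the mounted files but can’t write to them, escalate privileges, or phone home.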
percyraskova to Crush Agentic AI • What are the advantages of using Crush + DeepSeek agentic AI over Microsoft Copilot?
For one, you truly own your data. You can control which API endpoints you ping and which models you use. You can use it with Ollama. You aren’t handing your queries and data over to Microsoft.
here is my CLAUDE.md for a project:
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
⚠️ CRITICAL SCALE UPDATE
Corpus Scale: 200GB raw archive → 50GB optimized (75% reduction through strategic filtering)
Corpus Analysis: ✅ Complete - 46GB English content analyzed (55,753 documents across 6 sections)
For architecture overview, see: Architecture Overview (includes corpus foundation)
The architecture includes:
- Corpus Foundation: Systematic 46GB analysis informing all data decisions
- Metadata Schema: 5-layer model achieving 85%+ author coverage
- Chunking Strategies: 4 adaptive strategies based on document structure
- Knowledge Graph: ~2,500 entities forming hybrid retrieval foundation
- Infrastructure: Simplified GCP architecture with Weaviate + Runpod embeddings
- Parallel Development: 6-instance coordination strategy
Project Overview
The Marxists Internet Archive (MIA) RAG Pipeline converts 126,000+ pages of Marxist theory (HTML + PDFs) into a queryable RAG system. This is a local, private, fully-owned knowledge base designed for material analysis research, class composition studies, and theoretical framework development.
Note: The reference implementation below works for small-scale testing. For production processing, see Architecture Overview for complete details.
🏗️ Parallel Development Architecture
This project uses a 6-instance parallel development model where different Claude Code instances work on separate modules simultaneously. Each instance has specific boundaries to prevent conflicts.
Instance Boundaries
Instance 1 (Storage & Pipeline):
- `src/mia_rag/storage/` - GCS storage management
- `src/mia_rag/pipeline/` - Document processing pipeline
- `tests/unit/instance1_*` - Instance 1 tests
Instance 2 (Embeddings):
- `src/mia_rag/embeddings/` - Runpod embedding generation
- `tests/unit/instance2_*` - Instance 2 tests
Instance 3 (Weaviate):
- `src/mia_rag/vectordb/` - Weaviate vector database
- `tests/unit/instance3_*` - Instance 3 tests
Instance 4 (API):
- `src/mia_rag/api/` - FastAPI query interface
- `tests/unit/instance4_*` - Instance 4 tests
Instance 5 (MCP):
- `src/mia_rag/mcp/` - Model Context Protocol integration
- `tests/unit/instance5_*` - Instance 5 tests
Instance 6 (Monitoring & Testing):
- `src/mia_rag/monitoring/` - Prometheus/Grafana monitoring
- `tests/(integration|scale|contract)/` - Cross-instance tests
Shared Resources (require coordination):
- `src/mia_rag/interfaces/` - Interface contracts (RFC process required)
- `src/mia_rag/common/` - Shared utilities
Working in Parallel
Before starting work:
- Check the `planning/` directory for active projects and issues
- Verify your instance assignment in the `.instance` file
- Run the boundary check: `poetry run python scripts/check_boundaries.py --instance instance{N} --auto`
Branch naming convention:
- Instance work: `instance{N}/{module}-{feature}` (e.g., `instance1/storage-gcs-retry`)
- Interface changes: `rfc/{number}-{description}` (e.g., `rfc/001-metadata-schema`)
- Releases: `release/v{version}` (e.g., `release/v0.2.0`)
- Hotfixes: `hotfix/{description}` (e.g., `hotfix/memory-leak`)
CI/CD workflows:
- `instance-tests.yml` - Runs tests for changed instances only
- `conflict-detection.yml` - Detects boundary violations in PRs
- `daily-integration.yml` - Merges instance branches into a shared integration branch
Development Commands
Setup and Installation
```bash
# Install Poetry dependencies (core + dev)
poetry install

# Install specific instance dependencies
poetry install --extras instance1  # Storage & Pipeline
poetry install --extras instance2  # Embeddings
poetry install --extras instance3  # Weaviate
poetry install --extras instance4  # API
poetry install --extras instance5  # MCP
poetry install --extras instance6  # Monitoring

# Install all dependencies (integration testing)
poetry install --extras all
```
Testing
```bash
# Run all tests for your instance
poetry run pytest -m instance1  # Replace with your instance number

# Run specific test types
poetry run pytest -m unit         # Unit tests only
poetry run pytest -m integration  # Integration tests
poetry run pytest -m contract     # Contract tests (interface validation)

# Run tests for a specific file
poetry run pytest tests/unit/instance1_storage/test_gcs_storage.py

# Run with coverage
poetry run pytest --cov=src/mia_rag --cov-report=html

# Run specific test by name
poetry run pytest -k "test_embedding_generation"
```
Linting and Code Quality
```bash
# Run Ruff linting
poetry run ruff check .

# Auto-fix issues
poetry run ruff check --fix .

# Format code
poetry run ruff format .

# Type checking
poetry run mypy src/

# Check cyclomatic complexity (for refactoring)
poetry run radon cc src/ -a -nb
```
Git Workflow
```bash
# Install git hooks
bash scripts/install-hooks.sh

# Check boundaries before commit
poetry run python scripts/check_boundaries.py --instance instance1 --auto

# Check interface compliance
poetry run python scripts/check_interfaces.py --check-all

# Commit with conventional commit format
git commit -m "feat(storage): add GCS retry logic"
# Types: feat, fix, docs, style, refactor, test, chore
```
Running the Pipeline (Reference Implementation)
```bash
# Step 1: Download MIA metadata
python mia_processor.py --download-json

# Step 2: Process archive (HTML/PDF → Markdown)
python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/ --output ~/marxists-processed/

# Step 3: Ingest to vector database
python rag_ingest.py --db chroma --markdown-dir ~/marxists-processed/markdown/ --persist-dir ./mia_vectordb/

# Step 4: Query the system
python query_example.py --db chroma --query "What is surplus value?" --persist-dir ./mia_vectordb/
```
Code Architecture
Reference Implementation (Legacy)
The original monolithic implementation consists of:
- `mia_processor.py` - HTML/PDF to Markdown conversion
- `rag_ingest.py` - Chunking and vector database ingestion
- `query_example.py` - Query interface
These are working but are being refactored into the modular `src/mia_rag/` structure.
Refactored Architecture
Domain Models (`scripts/domain/`):
- `boundaries.py` - Instance boundary specifications
- `instance.py` - Instance configuration and metadata
- `interfaces.py` - Interface contract definitions
- `recovery.py` - Recovery state and operations
- `metrics.py` - Metrics and performance tracking
Design Patterns (`scripts/patterns/`):
- `specifications.py` - Specification pattern for boundary checking
- `validators.py` - Chain of Responsibility pattern for validation
- `visitors.py` - Visitor pattern for interface analysis
- `commands.py` - Command pattern for operations
- `recovery.py` - Template Method pattern for recovery strategies
- `repositories.py` - Repository pattern for data access
- `builders.py` - Builder pattern for complex objects
Key Refactored Scripts:
- `scripts/check_boundaries.py` - Uses Specification pattern (✅ Refactored)
- `scripts/check_conflicts.py` - Uses Chain of Responsibility (✅ Refactored)
- `scripts/check_interfaces.py` - Uses Visitor pattern (✅ Refactored)
- `scripts/instance_map.py` - Uses Command pattern (✅ Refactored)
- `scripts/instance_recovery.py` - Uses Template Method pattern (✅ Refactored)
Complexity Targets (enforced by Ruff):
- Max branches: 12 per function
- Max statements: 50 per function
- Max arguments: 7 per function
- Max returns: 6 per function
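Assuming these targets are enforced via Ruff's pylint-derived rules, they would correspond to a `pyproject.toml` fragment along these lines (a sketch, not necessarily the project's actual configuration):

```toml
[tool.ruff.lint.pylint]
max-branches = 12    # PLR0912: too-many-branches
max-statements = 50  # PLR0915: too-many-statements
max-args = 7         # PLR0913: too-many-arguments
max-returns = 6      # PLR0911: too-many-return-statements
```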
Package Structure
```
src/mia_rag/
├── interfaces/      # Interface contracts (shared)
│   ├── __init__.py
│   └── contracts.py
├── common/          # Shared utilities (coordination required)
├── storage/         # Instance 1: GCS storage
├── pipeline/        # Instance 1: Document processing
├── embeddings/      # Instance 2: Runpod embeddings
├── vectordb/        # Instance 3: Weaviate
├── api/             # Instance 4: FastAPI
├── mcp/             # Instance 5: MCP server
└── monitoring/      # Instance 6: Prometheus/Grafana
```
Corpus Analysis Foundation
CRITICAL: All implementation decisions must be informed by the completed corpus analysis (46GB English content, 55,753 documents).
Essential Reading Before Coding
Metadata & Schemas:
- …/docs/explanation/corpus-analysis/06-metadata-unified-schema.md - 5-layer metadata model
- Achieves 85%+ author coverage through multi-source extraction
- Section-specific rules: Archive (100% path), ETOL (85% title+keywords), EROL (95% org from title)
- Encoding normalization: 62% ISO-8859-1 → UTF-8 conversion required
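The normalization step above can be sketched with the stdlib alone: try UTF-8 first, fall back to ISO-8859-1. This is an illustrative assumption (the function name is made up, and a real pipeline might use charset detection instead), not the pipeline's actual code:

```python
# Sketch: normalize a raw byte stream to UTF-8 text. A real pipeline might
# use charset detection; this is the simplest fallback scheme consistent
# with a corpus that is mostly ISO-8859-1.

def to_utf8_text(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")       # already valid UTF-8: pass through
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")  # legacy Latin-1 fallback

legacy = "Théorie".encode("iso-8859-1")  # simulated legacy-encoded file
assert to_utf8_text(legacy) == "Théorie"
```

Every non-ASCII ISO-8859-1 byte sequence that is invalid UTF-8 triggers the fallback, so mixed corpora decode without data loss.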
Chunking & Document Structure:
- …/specs/07-chunking-strategies-spec.md - 4 adaptive chunking strategies
- 70% documents have good heading hierarchies → semantic-break chunking
- 40% heading-less → paragraph-cluster chunking fallback
- Glossary → entry-based chunking (special case)
- Target: 650-750 tokens/chunk average, >70% with heading context
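The dispatch between the two main strategies could be sketched like this (the function names, the word-count token proxy, and the `##` heading heuristic are all illustrative assumptions, not the pipeline's actual implementation):

```python
# Sketch of the adaptive dispatch described above: semantic-break chunking
# for documents with headings, paragraph-cluster chunking as the fallback.
# Word count stands in for a real tokenizer; all names are illustrative.

TARGET = 700  # midpoint of the 650-750 token/chunk target

def n_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for the embedding tokenizer

def cluster(units: list[str], target: int = TARGET) -> list[str]:
    """Greedily pack units (sections or paragraphs) up to the target size."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for unit in units:
        if current and size + n_tokens(unit) > target:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(unit)
        size += n_tokens(unit)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def chunk_document(markdown: str) -> list[str]:
    if markdown.startswith("## ") or "\n## " in markdown:
        # semantic-break: split at headings so each chunk keeps its context
        head, *rest = markdown.split("\n## ")
        sections = ([head] if head.strip() else []) + ["## " + s for s in rest]
        return cluster(sections)
    # heading-less fallback: cluster paragraphs instead
    paras = [p for p in markdown.split("\n\n") if p.strip()]
    return cluster(paras)
```

Because heading sections are packed by the same greedy routine, every chunk either starts with its heading (semantic-break) or is a contiguous paragraph run (fallback).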
Knowledge Graph & Entities:
- …/specs/08-knowledge-graph-spec.md - Hybrid retrieval architecture
- ~2,500 Glossary entities form canonical node set
- 10 node types, 14 edge types for vector + graph retrieval
- 5k-10k cross-references extracted from corpus
Section-Specific Analyses
When implementing processing for specific corpus sections, consult:
- Archive (4.3GB, 15,637 files): …/docs/explanation/corpus-analysis/01-archive-section-analysis.md
- History (33GB, 33,190 files - ETOL/EROL/Other): […/docs/
percyraskova to Crush Agentic AI • Good practice with coding: ask the LLM to audit and refactor the code at the end
You can actually implement a lot of this with static analysis and linting tools that guarantee a consistent output. You have the right idea, but for something like linting and formatting IMO it is better to use a static analysis tool that gives you the same result every single time. Turn it into a pre-commit hook to ensure consistency of code and that your LLM agent stays on track.
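For example, a pre-commit hook along these lines (a sketch: it assumes `ruff` is installed in the environment, and the structure is made up for illustration) makes the normalization deterministic:

```python
#!/usr/bin/env python3
# Sketch of a .git/hooks/pre-commit script that runs deterministic static
# analysis so the agent's output is normalized the same way every time.
# Assumes ruff is installed (e.g. inside the project's poetry env).
import subprocess
import sys

CHECKS: list[list[str]] = [
    ["ruff", "format", "--check", "."],  # formatting is byte-for-byte stable
    ["ruff", "check", "."],              # lint rules catch agent drift
]

def run_checks(checks: list[list[str]] = CHECKS) -> int:
    """Run each check; return 1 (block the commit) on the first failure."""
    for cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            print(f"pre-commit: {' '.join(cmd)} failed; fix before committing")
            return 1
    return 0

# As an installed hook this file would end with: sys.exit(run_checks())
```

A nonzero exit from the hook aborts `git commit`, so the agent's changes never land unless the deterministic checks pass.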





https://github.com/percy-raskova/marxists.org-rag-db
It is in progress, if you’d like to help develop this let me know!