How many keywords can you stuff in a title, right?

I’m posting this in the prolewiki community because we’ll be discussing ProleWiki’s own in-development RAG for LLMs, but first: you probably saw that the WSWS, i.e. the trots, published ‘Socialism AI’.

In their press release, they congratulate themselves about how cool this is for the workers’ movement and socialism, great victory this and great victory that, blah blah blah. You know how trots are.

Their system is usable through ai.wsws.org or something, iirc. It’s a web interface, so yes, it’s nice that it comes as a package you can run from any device without having to fiddle with it. But there are also a lot of problems with it, especially coming from self-proclaimed communists. Though with how much of a joke trots are to everyone, I feel like I’m not really throwing oil on the fire with this post lol.

We looked into how their system works, because they give absolutely zero indication of the technical implementation, and found several copyright notices in the Terms of Service. They say that the output from their AI belongs to them, for example. US courts have found that LLM output is public domain, but sure, I guess; not really my area of expertise.

We’ll get into it.

Understanding what WSWS did

  • WSWS did not train a model from the ground up
  • WSWS did not fine-tune an existing open-source model
  • WSWS is not running and hosting their own model

What WSWS does (and you can find this out just from browser tools, i.e. F12 on their homepage) is use the ChatGPT and DeepSeek APIs.

Their pipeline is like this (as far as we can ascertain from simple browser tools):

You send your prompt -> they add their own instructions to it -> LLM fetches WSWS blog articles to answer your prompt -> LLM reads blog articles -> LLM answers your prompt with the WSWS blog articles as sources.

This is what we call RAG, or Retrieval-Augmented Generation. The technique is legit, I’m not disputing that; it’s just that the way they did it is both inefficient and concerning.
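Put together, the pipeline above boils down to something like this (a minimal sketch in Python; every function name and string here is invented, since they publish nothing about their actual implementation):

```python
# Hypothetical sketch of the Socialism AI RAG pipeline as we understand it
# from the outside. The function names, the instruction text, and the
# placeholder article are all made up; this is not their actual code.

def retrieve_articles(user_prompt):
    # Placeholder for their retrieval step: in the real system this would
    # fetch matching WSWS article pages (apparently as raw HTML).
    return ["<html><body>Some full WSWS article page...</body></html>"]

def build_rag_prompt(user_prompt):
    # Stuff instructions + retrieved pages + the user's question into one
    # big prompt, which is then sent to the ChatGPT or DeepSeek API.
    system_instructions = "Answer using only the WSWS sources below."
    articles = retrieve_articles(user_prompt)
    return "\n\n".join([system_instructions, *articles,
                        "Question: " + user_prompt])

print(build_rag_prompt("When was Lenin born?"))
```

The whole “product” is essentially this glue code between your browser and someone else’s API, which is why the closed-source framing is so strange.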

The problems I have with that way of doing things

We’ll get into the technical problems when I detail what the ProleWiki MCP will look like.

First, it’s very closed-source and obfuscated. Mind you, I did not create an account (too much hassle if I want to retain my privacy on it), but you have to understand that your prompt + LLM output transits through OpenAI and DeepSeek. There is no privacy when using this service; with OpenAI it goes straight to the feds.

Secondly, they sell paid tiers, starting at $5 per month for 150 messages, which is… absolutely nothing.

Thirdly, everything is closed off. They did not release any documentation on how this works or how you could run it yourself.

Selling paid tiers is not a problem in itself, at least for me personally. You have to break even, and they do pay for API access to OpenAI and DeepSeek (though DeepSeek is very cheap). The problem I have is that they should at least offer an open-source implementation for people who know how to use it, or at the very least make the RAG files available. This is not the case.

I’m also a proponent of paying it forward. Yes, this costs them money, but they could find a way to break even in ways that don’t consist of just selling another SaaS (software-as-a-service). Let people pay it forward for others, or something. Accept that you will lose some money running this and cover it with dues, or with people in the party who have money and don’t mind maintaining the service. Accept donations. There are lots of ways you can do this that are not so commercial, i.e. “if you can’t pay, you must vacate the premises”.

The technical implementation: ProleWiki MCP vs. Socialism AI

A few months ago we started working with a dev who was making the Marxists Internet Archive available for RAG use. This project evolved, and they are now making a ProleWiki MCP with the pages we sent them. It’ll still be RAG, but more efficient.

So first, let’s look at how the Socialism AI RAG works. If you remember the pipeline:

You send your prompt -> they add their own instructions to it -> LLM fetches WSWS blog articles to answer your prompt (<-- we are here) -> LLM reads blog articles -> LLM answers your prompt with the WSWS blog articles as sources.

The problem we’ve found is what kind of data, exactly, the LLM gets access to. Imagine it like a bin the LLM can sift through to make an answer. If you provide it with the link to the page, it parses that page as HTML code, with all its tags, headers, script calls, etc. Imagine me giving you a page full of HTML code and asking, “can you answer when Lenin was born from this info?” You can, but it’s going to take a while, and a lot of it is simply unnecessary. And you only have this one page to make an answer from. If Lenin’s DOB is not neatly written on it, you have to do extra thinking to put it together. (This is the context window at work: the LLM simply won’t look through 250k WSWS articles, it has to pick and choose which articles are more likely to help answer the question.)
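To get a feel for how much of a raw page is noise, here’s a stdlib-only Python sketch that strips an invented HTML snippet down to its visible text (the page content is made up for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the visible text, dropping tags, scripts, and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

# Invented example page: one useful sentence buried in markup and scripts.
page = ('<!DOCTYPE html><html><head><script>trackVisitor();</script></head>'
        '<body><h1>Lenin</h1><p>Lenin was born on 22 April 1870.</p></body></html>')

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(p.strip() for p in extractor.parts if p.strip())
print(text)  # only the heading and the sentence survive
```

Even in this tiny example, the raw HTML is several times longer than the actual content; on a real article page, with navigation, trackers, and styling, the ratio is far worse, and every wasted character is a wasted token.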

Therefore, we can optimize this bin. Instead of giving you full pages to pick from, we can give you individual lines. In our RAG for ProleWiki, what our dev did was some math that extracts every line from our pages on the principle of 1 line = 1 idea. Then it puts these ideas together in a matrix (as embedding vectors) and sorts them by semantic closeness.

What this means is that if you’re the LLM, you don’t get a full page on the October Revolution or Lenin to answer a question with. Our page on Lenin is quite lengthy, and if you asked a question whose answer is not on that page (for example, you can see the self-exile section is empty), an LLM that pulled up the whole page before answering might not answer your question as well as it could.

With the semantic matrix, instead of picking from pages, it picks from lines to make a coherent answer. Instead of looking at just Lenin’s page and filling its entire context window with it, it looks at semantic information relating to Lenin’s self-exile across ProleWiki - or across other sources you add to the corpus, the ‘bin’ - and then builds an answer on that.
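The retrieval step can be sketched like this. In the real system an embedding model turns each line into a high-dimensional vector; the three-number vectors and the three corpus lines below are made up purely for illustration:

```python
import math

# Toy corpus: each line is one idea, mapped to a made-up embedding vector.
# A real pipeline would compute these vectors with an embedding model.
corpus = {
    "Lenin went into self-exile in Western Europe.": [0.9, 0.1, 0.0],
    "The USSR was founded in 1922.":                 [0.2, 0.9, 0.1],
    "The October Revolution took place in 1917.":    [0.1, 0.3, 0.9],
}

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, i.e. closest in meaning.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_line(query_vec):
    # Pick the single line whose vector is closest to the query's vector,
    # regardless of which page that line originally came from.
    return max(corpus, key=lambda line: cosine(corpus[line], query_vec))

# Pretend this is the embedding of "Where did Lenin go during his exile?"
query = [0.85, 0.15, 0.05]
print(top_line(query))  # the self-exile line wins
```

The point is that the unit of retrieval is the line, not the page: the context window gets filled with the handful of lines that actually match the question.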

tl;dr:

This means if we have information about Lenin’s self-exile on say the USSR page (because why not!), it will pull exactly that thread from that page.

And this is much more powerful than what the WSWS did, and it’s why they offer such measly usage rates. They are filling up the context window and sending noise tokens, because they’re giving the model an entire <!DOCTYPE html><html><head>… page instead of just the relevant content. Again - as far as we can tell from looking in from the outside.

But where does the MCP come in?

MCPs are kinda new, and were made for AI to work with. I wouldn’t be the best person to explain them, but basically an MCP lets an LLM look at some data (a website, files, etc.) and work with that data in some way. They’re mostly used in agentic work: tools such as ‘view file’ or ‘edit file’ are exposed to the LLM so it can perform these operations itself, instead of having you do it and then confirm. So if you have an agent (such as crush, our favorite here on lemmygrad), an LLM can and will view and edit the files you tell it to. Those are an example of two tools.

With an MCP, you give the LLM access to data it can read, and you can also give it its own tools. You could make a tool called “prolewiki-fetch”. When the LLM decides to use this tool, it communicates with the ProleWiki MCP you have installed locally: “okay, let’s use the prolewiki-fetch tool to look at data from ProleWiki to answer this question.” Then the MCP does its magic and sends the information back to the LLM.
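The shape of that tool-call loop looks roughly like this. To be clear, this is NOT the real Model Context Protocol wire format, just an illustration of the idea; the “prolewiki-fetch” tool and its lookup table are invented:

```python
# Stand-in for a local corpus the MCP server would query. Invented data.
FAKE_WIKI = {
    "Lenin": "Lenin went into self-exile in Western Europe after 1900.",
}

def prolewiki_fetch(topic):
    # Stand-in for the local MCP server looking up ProleWiki content.
    return FAKE_WIKI.get(topic, "No entry found.")

# The MCP server advertises its tools to the LLM by name.
TOOLS = {"prolewiki-fetch": prolewiki_fetch}

def handle_tool_call(call):
    # The LLM emits a structured tool call; the MCP server executes it
    # locally and sends the result back into the model's context.
    return TOOLS[call["tool"]](call["argument"])

# What the loop looks like when the LLM decides to use the tool:
llm_request = {"tool": "prolewiki-fetch", "argument": "Lenin"}
print(handle_tool_call(llm_request))
```

The key property is that the lookup happens on your machine: the model only ever sees the tool’s name and its returned text, not your whole corpus.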

And not only that: as we said, you can also run this locally. We are still figuring out how we’ll package all of this, but most likely we’ll make the source files available so that anyone can build their own RAG, or make their own cloud web interface if they want.

Likewise for the MCP: it will be downloadable with our source files, so that you can just add it to your agent interface and start using it to query the LLM and get answers with ProleWiki content.

Communism is not in a position of strength currently, so I don’t see any reason we should be trying to hide and obfuscate any of our content. On the contrary, proletarian education demands that it be accessible without discrimination. Unlike trots, we trust the people to make the right decisions collectively - if someone wants to use ProleWiki content to train a model and paywall it, let them. There will be 10 more models that won’t be paywalled.

In fact speaking of models, our dev is also working on something there… but I was asked not to say too much about it as it’s very experimental 🤐

  • Philo_and_sophy · 19 hours ago

    While I have no love for chatgpt wrappers, this framing is incongruous

    You’re comparing a glorified search server with an ml model. Vastly different technologies and implications for comrades

    A more genuine comparison would be a community prolewiki/marxists.org fine-tuned model vs a thinly wrapped chatgpt model that can’t escape its liberal RLHF

    Setting up an ML MCP server is very useful, but no model will ever use it correctly unless ML thought is actually baked in (and at a pretty deep layer I’m finding)