I finally get vector embeddings and this whole thing about transformers.

Here’s a Deepseek explanation:

If it sounds confusing, it was to me too until I started working on a project for myself (a discovery app for Linux so I can find apps more easily); then it all made sense.

What I’m using for this app is a sentence-transformer model. It’s a tiny (~90MB) pre-trained LLM that embeds vectors over 384 dimensions.

Here's what that means, using the example in the picture above:

# Your transformer model does this:

“read manga” → [0.12, -0.45, 0.87, …, 0.23] # 384 numbers!

“comic reader” → [0.11, -0.44, 0.86, …, 0.22] # Very similar numbers!

“email client” → [-0.89, 0.32, -0.15, …, -0.67] # Very different numbers!
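For reference, here's roughly what producing those numbers looks like in code. This is just a minimal sketch using the sentence-transformers library; all-MiniLM-L6-v2 is one ~90MB model that outputs 384-dimensional vectors like the one I described (swap in whichever sentence-transformer you prefer):

```python
# Minimal sketch: turn sentences into 384-dimensional embedding vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~90MB, 384 dimensions

sentences = ["read manga", "comic reader", "email client"]
embeddings = model.encode(sentences)

print(embeddings.shape)  # (3, 384): one 384-number vector per sentence
```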

These embeddings capture semantic meaning, just by transforming sentences into a series of numbers. You can see ‘comic reader’ is close to ‘read manga’: not exactly the same, but much closer than “email client” is. So mathematically, we know that ‘read manga’ and ‘comic reader’ must have similar meanings in language.

When you search for “read manga” in my app, it transforms that query into a vector. Then, using cosine similarity, which basically measures how closely two embeddings point in the same direction (the angle between them), you can calculate the similarity between two concepts:
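Under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. A minimal sketch, reusing the model from the snippet above (exact numbers depend on the model, but ‘comic reader’ should score far higher than ‘email client’):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' lengths (norms):
    # 1.0 means they point the same way, values near 0 mean unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = model.encode("read manga")
print(cosine_similarity(query_vec, model.encode("comic reader")))  # high
print(cosine_similarity(query_vec, model.encode("email client")))  # much lower
```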

With a transformer, at no point do you have to pre-write a synonyms list (for example, that ‘video’ is a synonym of mp4, film, movie, etc. - traditionally this would be hard-coded, with someone writing out the synonyms one by one).

Instead, the sentence-transformer model comes pre-trained: it learned these semantic relationships on its own, and you can just use it; it works out of the box.

It is itself a very small LLM, and both it and the big LLMs use the transformer architecture; the sentence-transformer just stops once it has the vector embeddings instead of going on to generate text.

And this is how Deepseek, in the example above, is able to tell me about Komikku, which is a real app for reading manga, instead of randomly naming VLC or inventing a fake name. You’ll notice it was also able to write a pretty convincing example of vector embeddings, with numbers that are actually close together or far apart.

This stuff gets complex fast; even now I’m not sure I’m accurately representing how LLMs work lol. But this is a fundamental mechanism of LLMs, including the CLIP models in image generation (which encode your prompt into something the checkpoint can understand).

FAISS is the second step; it’s what allows us to query that matrix of vectors. After we’ve transformed your search query into a vector, we need to compare it to all the other vectors, and that’s what FAISS is for, much like SQL is how we talk to a database.
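Not my app’s exact code, but a minimal sketch of the idea, assuming the faiss-cpu package and the model from the snippets above (the app descriptions here are made-up placeholders):

```python
import faiss
import numpy as np

# Made-up placeholder descriptions; in the real app these come from each
# application's short description.
descriptions = [
    "Komikku: a manga reader",
    "Thunderbird: an email client",
    "VLC: a media player for video and music",
]

# Normalizing the vectors makes the inner product equal to cosine similarity.
doc_vecs = model.encode(descriptions, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])      # 384-dimensional index
index.add(np.asarray(doc_vecs, dtype="float32"))

# Embed the search query the same way and ask for the top 3 matches.
query_vec = model.encode(["read manga"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 3)

for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {descriptions[i]}")
```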

I don’t know if this is true of everyone but I learn best by having actual real projects to delve into. Using LLMs like this, anyone can start learning about concepts that seem completely above their level. The LLM codes the app but even with this level of abstraction/black box (I’m not entirely sure of the app’s logic, though the LLM can explain it to me), you learn new things. And most of all: you can start making new solutions that you couldn’t before.

The app

Sentence transformers make for a very powerful search engine, and not only that, anyone can do it. I’m making my app with crush, and it works. In fact, it took less than 24 hours to get it all up and running (and I was asleep for some of those hours).

Example of usage. Similarity is the cosine similarity we talked about: how close the two embeddings (your search and the app’s description) are to each other.

No synonym dictionary, no regex fuzzy matching, no hard-coded concepts. Everything is built and searched automatically, and that is very powerful. It also means you can run very long, specific queries and it will still find results.

The next steps are 1. optimization and 2. further improving the search results. This is just simple cosine similarity, but we could do more if we had more data than just the short description. That’s a bit of a challenge since I’m not sure where to get that data from exactly, but we’ll get there. I’m sure there’s also a bunch more math we could add to refine the results.

As you can see, the best similarity for the query above (which is admittedly very specific) is only 58.4% - the cosine similarity expressed as a percentage. I want to be able to reliably get up to at least 75%. To do that, you can either refine how the embedding works or add complementary methods on top of what’s already there.

I keep saying it: LLMs and this whole tech allow us to approach problems in a different way. This is completely different from doing fuzzy search, which was the standard for years. 12 hours is all you need to deploy a search engine now.

As for optimization: the app needs the ~90MB model on your machine to embed your search query, and it creates some heavy JSON files (10+ MB). That’s okay for the prototype, and I’m personally fine with going up to ~250MB total disk space, but no more. The bigger problem is that development currently runs on PyTorch, which takes up several gigabytes of space, which is just ludicrous. But first I want to finish the prototype and implement everything; then I can look at optimization and refactoring.

Oh yeah and it can also do that out of the box:

(Katawa Shoujo being included was funny lol, but the rest works: we name a completely different app in the query and it still finds results, but for books, novels, ebooks, etc.!)

If I ever finish this - I’m having some issues with crush I want to fix, and I don’t want to rush and burn myself out - I’ll put it up on my codeberg under MIT licence.

And with Deepseek being so cheap, the total cost so far is not even $2.50 lol. You can power this sort of development with $1 ko-fi donations.

PS: If there’s something you’re really looking for, I can run some searches for apps and send you the results. It could be worth building a small index of unsatisfactory results too.

    • ☆ Yσɠƚԋσʂ ☆ · 10 days ago

      It’s really unbecoming behavior to berate people who are learning and sharing what they’ve figured out. The LLM clearly explained the concept in an accessible way. Complaining about “AI slop” isn’t useful. Stop gatekeeping and telling people how to learn.

      • Philo_and_sophy · 10 days ago

        If that reads as beratement, I do apologize as a self crit. Yet if I were to hold the same level of consideration, there are many comments in this post that also warrant removal because I certainly don’t feel respected

        To the point of complaining isn’t useful, that’s why I also included resources for anyone who wants to learn. Not a defense, but reflecting intention

        Moreover, I accept the criticism that there are more helpful approaches. It would take much more time to correct the factual errors but that would be a start

        Personally I don’t see how asking people not to post AI generated information vs linking to human based and vetted resources is gatekeeping, but I accept that standard if it is

        All that said, i just realized we don’t have any explicit rules against misinformation, which was my fundamental issue.

        • amemorablename · 10 days ago

          It sounds like there are two main things at work here:

          1. Validity of information provided by AI without cross-referencing (backing up what they say with other sources). I think this is a valid concern, though trying to have that discussion within this thread probably isn’t the best place for it. It may be better to make a dedicated thread and link to this one if you want to use it as a reference for incorrect output.

          2. Correcting information you recognize to be incorrect. If you can provide sourcing on what’s incorrect about the output, that would certainly help the case that we should be wary of posting AI output without other sources.

          But your original comment reads to me like the usual reaction of people hating AI because it’s AI, rather than directly addressing these issues. I don’t think it’s too late to course correct and focus on correctness of information.

          • Philo_and_sophy · 10 days ago

            Agreed on the first two points fam. Though I’m legitimately very puzzled how multiple people read that comment as anti AI when the point was preserving the human and social connections underlying this tech imho

            We as socialists/communists have a more rooted appreciation for humanity and society than our liberal public, so I assumed that value would be assumed in my comment rather than the anti ai sentiment.

            Probably a mix of poor assumptions and being bad at off the cuff words/comments 🤷🏿‍♀️

            To the point of course correction, I’m tapping out actually. This is the culmination of months of trying to combat misinformation in this community, specifically from this poster.

            I have nothing but respect for them as a comrade and genuinely support their work in AI, but they post things that gain traction in the community even though it’s factually incorrect (most recently about SocialismAI vs their MCP server project)

            It’s really hard (i.e. rigorous and time consuming) to manually correct information, hence me making the request that I did. And if we start using AI for corrections, then we’ve lost the script completely imho

Thanks for the helpful comment

            • CriticalResist8 (OP) · 10 days ago

              The two systems (WSWS AI and MCP) are RAGs. Just that one was made in the simplest way possible and the other was made “correctly”. Conceptually both perform the same function: retrieving information from a pool of texts and basing the LLM’s response on the information found there. It seems we might disagree on definition there, rereading your comment from back then, and that’s fine. But it doesn’t mean I’m spreading misinformation just because we don’t work on the same definition of what a RAG is.

              You can’t just say “I’m right, you’re wrong” and expect people to follow along. Nobody knows literally everything.

              I’m not sure what else to say. If you must impart knowledge then make your own posts to share and teach people. And if you have a problem with me then I can only recommend you block me. “Months” is stretching it, the only other time I know we interacted was with that MCP post which was 17 days ago. One time two months ago you asked about Mistral (and a definition of what free means to you so that’s why we may be working on different definitions) and that same month asked about the ProleWiki dump files - I checked my inbox. But those two occurrences were not combating misinformation but asking a question.

              Again in your comment you talk about me (directly) sharing misinformation but make no indication of what that misinformation is, neither broadly nor specifically. It comes across as you don’t like something about how I presented that information, but we don’t know what and we are left to guess. The way you say it, even if it wasn’t intended (and why I’m pointing it out) is as if I go around just saying shit for the lolz.

              I’ve been wanting to have a glossary/encyclopedia of LLM-related terms, especially pertaining to users who want to start understanding all the terms they might hear about AI. If you want to contribute, I’ve been hopeful for someone to make that glossary.

              One last thing I want to point just for thoroughness,

              when the point was preserving the human and social connections underlying this tech imho

              This implies there are “right” ways to use AI and “wrong” ways. But this is exactly in-the-box thinking. What is the right way to use AI to you may not be important at all to someone else. To be cohesive we can’t on the one hand say “yes you can use AI to code an app instead of asking your developer friend” and on the other “oh but don’t trust it to explain that app go read the books written by people on it”. Either we accept AI will replace some ‘social’ connections or we think it shouldn’t replace anything and burn it down entirely. But the moment you pass a prompt to an LLM, any prompt, is the moment you are not asking Google/your friend/a lemmy thread instead.

            • amemorablename · 10 days ago

              I think it’s largely because situations keep coming up on here where somebody posts something AI and it gets dismissive comments directed toward it. It’s not that AI can’t have problems. It’s that the content of criticisms is often very shallow.

              And again, I think AI in the realm of fact is a real concern. I’m not expecting you to manually correct AI output every time you see it somewhere, but if you can correct in this context, you could leverage it in a broader argument about risks of engaging with AI output in matters of fact. Without those corrections, we have to take you at your word that you’ve observed factual errors and the position comes out weaker as a result, even though I think it is a good one, broadly speaking.

              • Philo_and_sophy · 10 days ago

                I’m only pulling off the top, but there are more issues that hopefully will be addressed by others

                What I’m using for this app is a sentence-transformer model. It’s a tiny (~90MB) pre-trained LLM that embeds vectors over 384 dimensions.

                The poster isn’t using an LLM but a text encoder. This is important because the model will never generate text, only vector embeddings. They allude to as much in their post, but I don’t know if they are aware by their own admission

                On the more pedantic, yet still meaningful side: The first L in LLM is for large, and by their own admission it’s a tiny model

                And also these models generate embeddings vs embed vectors. No model embeds vectors afaik.

                • amemorablename · 10 days ago

                  Hmm, okay. I was more thinking of errors the model itself made in describing how things work. But correcting other people is reasonable too.

                  That said, I’m not so sure about this point:

                  No model embeds vectors afaik.

                  From what I can find through a cursory search, there is something called vector embedding going on, at least in the context of LLMs. I guess this smaller model is a different story if it is a wholly different kind of architecture, as you say.

                  https://medium.com/@narendra.squadsync/vector-embeddings-in-large-language-models-llms-3e746f1063f3

                  https://labs.adaline.ai/p/how-do-embeddings-work-in-llms

                  https://ml-digest.com/architecture-training-of-the-embedding-layer-of-llms/

                  (I don’t know if these sources are reliable, it’s just what I could find.)

                  • Philo_and_sophy · 10 days ago

                    Respectfully, there’s no embedding of vectors, even in these examples. These are all examples of models which generate embeddings or the embedding layers in the models themselves. You’re implicitly proving my point about how hard it is to understand this work

                    An easy thought experiment is if you incrementally “embed vectors” or accumulate any data into a given model, it eventually expands to being unhostable.

                    In reality, we train models and transformers so they have internal latent representations via their weights.

Many neural nets have embedding layers, but those are about mapping and shaping the data as it flows through the model. But again, these are trained, not “embedded”

                    But to the larger point, these are quality resources from people which have been vetted (via PageRank I assume). It’s so much easier to have these discussions with a static knowledge base, vs AI output that can never be replicated verbatim

                • m532 · 10 days ago

                  I wonder what those “more issues” are when you want someone to “delete your post” over a bunch of nitpicks (some of them wrong even)…

                  And also these models generate embeddings vs embed vectors. No model embeds vectors afaik.

                  An encoder model generates embeddings for the input. The embeddings are tensors. Vectors are 1-dimensional tensors. Most models use higher-dimensional tensors, but those could also be view-ed as 1-dimensional. So, every model with embeddings embeds vectors.

                  The first L in LLM is for large, and by their own admission it’s a tiny model

When LLMs were invented, 90MB models were large.

                  • Philo_and_sophy · 9 days ago

                    You’re contradicting yourself in your own paragraph fam

                    An encoder model generates embeddings for the input.

                    But also

                    So, every model with embeddings embeds vectors.

                    Which one is it, do models generate embeddings or do they embed vectors?


                    And to be clear, I believe that your assertion that models “embed vectors” is incorrect, I just want you to clarify your rebuttal 🤷🏿‍♀️

                    To the last point, is it still the 1900s? Do we still call movies talkies? Language matters, especially when your intent is to educate

LLMs are integrally tied to transformer architectures, and transformers enabled devs to scale language models into LLMs

GPT-1, the first LLM, was 117 million parameters, which is much larger than OP’s tiny “LLM”