Note: I had stopped writing posts in 2017. Slowly getting back into it in 2024, mostly for AI.

Vector embeddings

Jan 11, 2024 | LLM

These seemed like the core ideas, so I wanted to clarify them conceptually. ‘Vector embeddings’ are a way to represent different data types (like words, sentences, articles, etc.) as points in a multidimensional space. “Embeddings” emphasizes the notion of representing data in a meaningful and structured way, while “vectors” refers to the numerical representation itself. Somewhat regrettably, ’embeddings’ and ‘vectors’ are sometimes used interchangeably in this context. Unexpectedly, even OpenAI’s page seems to do this.

This post from Dharmesh Shah (CTO/founder of Hubspot) nicely explains the concept of vector embedding from the ground up. What I found noteworthy:

  • OpenAI’s vector embedding model is called ada-002 (full name: text-embedding-ada-002). There are open-source models too. These models take English text and map it into a space with 1,536 dimensions.
  • So any text (phrase, paragraph, document) can now get coordinates for a single point in 1,536-dimensional space. And if you plot another piece of text and compare the two sets of coordinates, you’d be comparing what the two texts mean and whether they are similar to each other or not.
  • This implicit semantic comparison is like human understanding, then… just represented mathematically. That’s mind-blowing.
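The comparison of two points described above is typically done with cosine similarity. A minimal sketch, using made-up 4-dimensional toy vectors in place of real 1,536-dimensional ada-002 embeddings (the words and values below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for real embeddings returned by an embedding model.
cat = [0.9, 0.1, 0.3, 0.2]
kitten = [0.85, 0.15, 0.35, 0.25]
car = [0.1, 0.9, 0.2, 0.8]

print(cosine_similarity(cat, kitten))  # high: similar meaning
print(cosine_similarity(cat, car))     # lower: different meaning
```

With a real embedding model you'd fetch the vectors from an API or library first; the similarity math stays the same.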

Later on, I stumbled onto Pinecone.io’s basic description of vector embeddings for developers here, and the following made a lot of sense:

  • Mathematically, a vector is an object that has both a magnitude and a direction. We can picture a vector as a directed line segment, whose length is the magnitude of the vector and with an arrow indicating the direction. Developers can represent vectors as an array of numerical values.
  • The big issue is converting raw data (text, images, audio, etc.) into a vector representation. For that we need embedding models.
  • Embedding models are built by passing a large amount of labeled data through a neural network (NN), iteratively training it until its predictions are correct. At that point, the embedding model is the NN without its last layer (the last layer is what produces the final output; the layer before it holds the vector embeddings).
  • With the right embedding model, vectors that are close together have a contextual similarity and start reflecting the knowledge (meaning) of the domain the model was trained on.
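The "NN without the last layer" idea can be sketched with a tiny hand-rolled network. The weights below are arbitrary made-up numbers (in practice they come from training on labeled data); the point is that the penultimate layer's activations are what you keep as the embedding:

```python
import math

# 2 inputs -> 3 hidden units (one row of weights per input).
HIDDEN_W = [[0.2, -0.5, 0.1],
            [0.4, 0.3, -0.2]]
# 3 hidden units -> 1 final output.
OUTPUT_W = [0.7, -0.1, 0.5]

def hidden_layer(x):
    """Penultimate layer: these activations are the vector embedding."""
    return [math.tanh(sum(xi * w for xi, w in zip(x, col)))
            for col in zip(*HIDDEN_W)]

def full_network(x):
    """The whole trained NN, last layer included: gives the prediction."""
    h = hidden_layer(x)
    return sum(hi * w for hi, w in zip(h, OUTPUT_W))

x = [1.0, 0.5]
embedding = hidden_layer(x)   # 3-dim "embedding" of this input
prediction = full_network(x)  # what the NN was actually trained to output
```

Dropping the output layer turns a predictor into an embedding model: the hidden representation, not the final answer, is what captures the learned structure of the domain.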

Sources: Elastic.co, Agent.ai, Pinecone.io