These days everyone is stunned by the generative power of the GPT models, myself included. Today, however, I want to discuss a GPT feature that is largely overlooked in media coverage: the embeddings endpoint of the OpenAI API. It ‘translates’ any text into a 1536-dimensional numerical vector. Personally, I prefer the term ‘vectorisation’ to ‘embedding’.
The idea of turning individual words into vectors is about 10 years old now. These word vectorisations were trained on large corpora – or what we considered large corpora 10 years ago. They are useful because the way words are distributed in the vector space reflects the relationships between them. The most famous expression of this is probably the equation king – man ≈ queen – woman (equivalently, king – man + woman ≈ queen), which holds approximately when word vectors are trained with tools like GloVe or Word2Vec. These word vectorisations have been in use ever since, forming the basis for much of the progress in machine learning for language-related tasks.
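As a small illustration (my own, not from anything in this post), the analogy can be checked with pre-trained GloVe vectors via the gensim library; the particular model name, glove-wiki-gigaword-100, is just one readily downloadable choice:

```python
# Sketch: checking the king/queen analogy with pre-trained GloVe vectors.
# The model name and the download step are assumptions about your environment.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-100")  # downloads on first use

# king - man + woman should land near "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```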
Now, GPT offers the same kind of analysis for entire texts. It is not the first model to go beyond single words, but its high quality and affordability make it very attractive. If you experiment with it, you quickly see how useful it can be. For example, a question and its answer are mapped to fairly similar vectors, and a text in one language and its translation into another are mapped to nearby vectors.
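Here is a minimal sketch of what that looks like in code, assuming the openai Python package (version 1.x) and an API key in the environment; the example sentences and the choice of the text-embedding-ada-002 model are my own illustrative assumptions:

```python
# Sketch: vectorise a few texts with the embeddings endpoint and compare them.
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

texts = [
    "What is the capital of France?",
    "The capital of France is Paris.",
    "Die Hauptstadt von Frankreich ist Paris.",  # German translation of the answer
]
response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vectors = [np.array(item.embedding) for item in response.data]  # 1536-dim each

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # question vs. answer: fairly similar
print(cosine(vectors[1], vectors[2]))  # sentence vs. its translation: very similar
```

Texts that a human would consider related should come out with noticeably higher cosine similarity than unrelated ones.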
This vectorisation can be used for anything that can be done with vectors. Text comparison, and thereby text search, is one obvious use case. We at Creative Virtual will be using it this way in our upcoming Gluon release for intent matching – still keeping rules as the fallback option if and when needed. Another way we are already using it is for text clustering. Finally, you could use the text vectors as the input layer of a neural network and train it for whatever task you want, thereby ‘inheriting’ many of GPT’s text understanding capabilities.
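For the clustering use case, a rough sketch could look like the following; the sample utterances, the cluster count, and the use of scikit-learn’s KMeans are illustrative choices on my part, not a description of how Gluon does it:

```python
# Sketch: cluster a handful of utterances by their embedding vectors.
import numpy as np
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()

texts = [
    "I forgot my password",
    "How do I reset my login?",
    "Cancel my subscription please",
    "I want to stop paying for this service",
]
response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
X = np.array([item.embedding for item in response.data])

# Two clusters: roughly, account access vs. cancellation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for text, label in zip(texts, kmeans.labels_):
    print(label, text)
```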
So, if you have access to the OpenAI API and if you are running out of ideas for what to do with the chat endpoint, give the embeddings endpoint a chance. Vectorise away!