Paper Reading: Text and Code Embeddings by Contrastive Pre-Training
Task definition:
Embeddings are numerical representations of concepts converted to number sequences, which make it easy for a computer to understand the relationships between those concepts. They are usually learned with unsupervised methods such as word2vec, GloVe, and BERT. Text embeddings are used in many applications, such as classification, semantic search, and clustering, to name a few. This paper discusses how to further improve commonly used text embeddings using OpenAI GPT models.
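As a quick illustration of the semantic-search use case, here is a minimal sketch that ranks a tiny corpus against a query by cosine similarity of their embeddings. The corpus, embedding dimension, and random vectors below are placeholders, not outputs of any particular model; in practice the vectors would come from an embedding model like the ones mentioned above.

```python
import numpy as np

# Hypothetical, pre-computed embeddings for a tiny corpus. In practice these
# would come from an embedding model (word2vec, GloVe, BERT, or a GPT-based
# embedding model); random vectors are used here only to keep the sketch runnable.
corpus = ["how to bake bread", "transformer language models", "sourdough starter tips"]
corpus_embeddings = np.random.rand(3, 8)   # shape: (num_docs, embedding_dim)
query_embedding = np.random.rand(8)        # embedding of the search query

def cosine_similarity(a, b):
    """Cosine similarity: higher means the two vectors point in more similar directions."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Semantic search = rank documents by the similarity of their embeddings to the query.
scores = [cosine_similarity(query_embedding, doc_emb) for doc_emb in corpus_embeddings]
best = int(np.argmax(scores))
print(f"Best match: {corpus[best]!r} (score={scores[best]:.3f})")
```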
SUMMARY OF THE PAPER:
Deep unsupervised learning with generative and embedding models has been a dramatic success in the last few years. Generative models produce realistic content and benefit many downstream applications. In generative models, the input is distributed over multiple hidden states of the model. While some generative models can learn a single representation of the input, most autoregressive models do not. However, learning such a single representation may be required for some tasks, like neural information retrieval (Neural IR). Embeddings are useful for working with natural language and code because they can be readily consumed and compared by other machine learning models and algorithms. Embeddings that are numerically similar are also semantically similar. Embedding models are explicitly optimized to learn a low-dimensional representation that captures the semantic meaning of the input data.
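To make the contrast between "input distributed over multiple hidden states" and "a single representation" concrete, here is a minimal PyTorch sketch (not the paper's implementation). The toy encoder, dimensions, and pooling choices are illustrative assumptions: a transformer returns one hidden state per token, and two common ways of collapsing them into a single embedding vector are shown.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer encoder: it returns one hidden state per
# input token, i.e. the input is distributed over multiple hidden states.
vocab_size, hidden_dim, seq_len, batch = 100, 16, 6, 2
encoder = nn.Sequential(
    nn.Embedding(vocab_size, hidden_dim),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
        num_layers=1,
    ),
)

tokens = torch.randint(0, vocab_size, (batch, seq_len))
hidden_states = encoder(tokens)            # shape: (batch, seq_len, hidden_dim)

# Two common ways to collapse the per-token states into a single embedding:
mean_pooled = hidden_states.mean(dim=1)    # average over the sequence
last_token = hidden_states[:, -1, :]       # hidden state of the final token

print(mean_pooled.shape, last_token.shape) # both: (batch, hidden_dim)
```

Either pooled vector can then be consumed or compared by other models, for example with cosine similarity as in the semantic-search sketch above.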