Is this the end of embeddings?
Hear from Josh and Tashrish on the pros and cons of using embeddings-based RAG for document data extraction and our experiments with switching to completions-only RAG.
Are embeddings still the best for RAG?
We’ve recently explored some new approaches to retrieval-augmented generation (RAG) that rely solely on completions without using embeddings.
Our latest blog post offers an in-depth look at how this completions-only method compares to embedding-based approaches, and why we believe it may be the future for certain RAG use cases as language models continue to improve.
https://lnkd.in/ga-rTfEf
A sneak peek at some of the directions I've been exploring lately.
**rellm** (as in "Reimagine with LLMs", a placeholder name I generated with GPT-4o) is an "impractical" library that attempts to reimagine some of our current algorithms/workflows using LLMs instead of pre-LLM approaches.
The idea is simple: LLMs are very powerful multimodal, unstructured-input-to-structured-output functions with a vast amount of knowledge embedded in them. This lets us potentially reimagine some of the older approaches we have been using and get 1) much better results and 2) much easier code to work with.
My first attempt is a clustering/document-organizing tool. Usually this would be done by generating embeddings, clustering them, and hoping something interesting shows up. One of the main drawbacks is that it is difficult to generate embeddings based on a "criterion", e.g. grouping user reviews by positives, negatives, most urgent, etc.
With LLMs it becomes a simple classification problem conditioned on the criterion, and the modality of the document (text, image, audio, video) no longer matters, since LLMs are becoming natively multimodal.
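To make that concrete, here is a minimal sketch of criterion-conditioned classification with a single LLM call (not the actual rellm API; the model name and labels are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# The "criterion" is just stated in natural language.
CRITERION = "Classify each user review as one of: positive, negative, urgent-issue."

def classify(review: str) -> str:
    # One LLM call replaces the embed -> cluster -> inspect-clusters pipeline.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": CRITERION + " Reply with the label only."},
            {"role": "user", "content": review},
        ],
    )
    return resp.choices[0].message.content.strip()

reviews = ["Love the new UI!", "The app crashes on login, please fix ASAP."]
groups = {r: classify(r) for r in reviews}
```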
That said, using LLMs for a task like this at scale is probably still impractical (which is why I'm calling rellm impractical). But with cost, latency, and model size all going down, we might in the near future see a new paradigm of coding where some of the more "nuanced" bits of functionality are replaced with straightforward LLM calls: the bits that require complicated algorithms and maths to "mimic" human judgement (like perception and language understanding), or the bits that attempt to normalize and standardize notoriously unstructured inputs for the rest of the code to work with.
Here is a small demo of rellm
𝗡𝗲𝘄 𝗜𝘁𝗮𝗹𝗶𝗮𝗻 𝗽𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗱𝗮𝘁𝗮𝘀𝗲𝘁 🇮🇹👍👎
The most common fine-tuning workflow for a language model involves two steps:
🔹 𝘚𝘶𝘱𝘦𝘳𝘷𝘪𝘴𝘦𝘥/𝘐𝘯𝘴𝘵𝘳𝘶𝘤𝘵𝘪𝘰𝘯 𝘍𝘪𝘯𝘦 𝘛𝘶𝘯𝘪𝘯𝘨 (𝘚𝘍𝘛/𝘐𝘍𝘛): train the model to follow instructions.
Datasets for this step include instruction-response pairs.
🔹 𝘗𝘳𝘦𝘧𝘦𝘳𝘦𝘯𝘤𝘦 𝘛𝘶𝘯𝘪𝘯𝘨: align the model with human/AI preferences by training it to favor high-quality responses over poor ones. A simple and effective algorithm to do that is 𝘋𝘪𝘳𝘦𝘤𝘵 𝘗𝘳𝘦𝘧𝘦𝘳𝘦𝘯𝘤𝘦 𝘖𝘱𝘵𝘪𝘮𝘪𝘻𝘢𝘵𝘪𝘰𝘯 (𝘋𝘗𝘖).
Data for this step follows this format: instruction, chosen response, rejected response.
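Concretely, a single preference-tuning example usually looks like this (field names vary by framework; prompt/chosen/rejected is the common convention, e.g. in TRL's DPOTrainer):

```python
preference_example = {
    "prompt": "Explain what fine-tuning is in one sentence.",
    "chosen": "Fine-tuning is the process of further training a pretrained model "
              "on task- or domain-specific data so it performs better on that task.",
    "rejected": "It's when you tune a model, I guess.",
}
```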
During the recent #gemma competition, I trained a nice SFT model and wanted to further improve it with Preference Tuning.
I identified some good datasets (by mii-llm and Ruggero Marino Lazzaroni 🙏) but had limited examples (<3K).
𝗧𝗵𝗲𝗻 𝗜 𝗳𝗼𝘂𝗻𝗱 𝗮 𝗵𝗶𝗱𝗱𝗲𝗻 𝗴𝗲𝗺 -> 💎 𝗲𝘃𝗼𝗹-𝗱𝗽𝗼-𝗶𝘁𝗮 (by Edoardo Federici)
This dataset contains 20K prompts translated from Evol-Instruct, with responses generated using GPT-3.5 Turbo and Claude 3 Opus.
⚠️ It has one limitation: the response from the stronger model (Claude) is always labeled "chosen" and the other one "rejected". It is a good but not perfect approximation.
𝗜 𝘁𝗵𝗼𝘂𝗴𝗵𝘁: 𝗜 𝗰𝗮𝗻 𝗶𝗺𝗽𝗿𝗼𝘃𝗲 𝗶𝘁! 🪄
I used Llama-3.1-70B-Instruct as a Judge 🧑⚖️ to re-rank the responses.
I queried the model via the cheap Hugging Face API PRO.
My prompt was inspired by the Ultrafeedback prompt (available in distilabel by Argilla).
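In code, the re-ranking step looks roughly like this (a simplified sketch: the judge prompt is a rough paraphrase, not the actual Ultrafeedback-inspired prompt, and the example variables are placeholders):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Llama-3.1-70B-Instruct")  # Hugging Face API (PRO)

instruction = "Spiega la fotosintesi in due frasi."            # placeholder example
response_a = "La fotosintesi è il processo con cui ..."        # currently "chosen"
response_b = "Le piante mangiano la luce."                     # currently "rejected"

judge_prompt = (
    "You are an impartial judge. Given an instruction and two responses, "
    "evaluate helpfulness, correctness and instruction-following, then answer "
    "with exactly one of: 'A', 'B', or 'tie'.\n\n"
    f"Instruction: {instruction}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
)

out = client.chat_completion(
    messages=[{"role": "user", "content": judge_prompt}],
    max_tokens=10,
    temperature=0.0,
)
verdict = out.choices[0].message.content.strip()  # 'A', 'B', or 'tie'
```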
📊 Results:
- In 7% of cases, chosen and rejected were swapped 🔀
- Another 7% were ties
- I used the obtained dataset to train 2 models with DPO, achieving significant improvements for Italian! 📈
I've published my new dataset (anakin87/evol-dpo-ita-reranked) on the HF Hub.
𝘾𝙝𝙚𝙘𝙠 𝙩𝙝𝙚 𝙘𝙤𝙢𝙢𝙚𝙣𝙩𝙨 𝙛𝙤𝙧 𝙡𝙞𝙣𝙠𝙨 𝙩𝙤 𝙩𝙝𝙚 𝙙𝙖𝙩𝙖𝙨𝙚𝙩 𝙖𝙣𝙙 𝙘𝙤𝙙𝙚!
#finetuning #llm #dpo
We open-sourced our LLM Router today at the LLM Routing meetup in SF. Have a look and star it on GitHub:
https://lnkd.in/gFt8i5ud
Additionally we open-sourced our dataset and intent-tuned embedding model.
Intent-Tuned Embedding Model: https://lnkd.in/gMuHRzCZ
- Based on: BAAI/bge-base-en-v1.5
- Fine-tuned using contrastive learning with cosine similarity loss
- Merged with the base model at 3:2 ratio
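Roughly, that contrastive fine-tuning looks like this with sentence-transformers (a sketch only: the training pairs and hyperparameters below are illustrative, not the actual ones; the 3:2 merge with the base model is a separate step not shown):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative (prompt, intent) pairs with a target cosine similarity:
# 1.0 for a matching intent, 0.0 otherwise.
train_examples = [
    InputExample(texts=["Summarize this contract in plain English", "summarization"], label=1.0),
    InputExample(texts=["Write a haiku about autumn rain", "summarization"], label=0.0),
    InputExample(texts=["Write a haiku about autumn rain", "creative writing"], label=1.0),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
loader = DataLoader(train_examples, shuffle=True, batch_size=2)
loss = losses.CosineSimilarityLoss(model)  # contrastive objective on cosine similarity

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```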
Targets and Scores:
- 13 leading LLMs:
claude-3-haiku-20240307
claude-3-opus-20240229
claude-3-sonnet-20240229
command-r
command-r-plus
dbrx-instruct
gpt-3.5-turbo-0125
gpt-4-turbo-2024-04-09
llama-3-70b-instruct
mistral-large
mistral-medium
mistral-small
mixtral-8x7b-instruct
- Computed Bradley-Terry scores for each model from pairwise outcomes, as done in the LMSYS Chatbot Arena Leaderboard
- Normalized all scores to a scale from 0 to 1 for interoperability
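For reference, here's a minimal sketch of fitting Bradley-Terry scores from a pairwise win matrix (standard MM/Zermelo updates; the min-max normalization at the end is one way to map scores to [0, 1], not necessarily the exact method used):

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j (count ties as 0.5 for each side)."""
    n = wins.shape[0]
    p = np.ones(n)  # latent strengths
    for _ in range(iters):
        p_new = np.empty(n)
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j]) for j in range(n) if j != i)
            p_new[i] = total_wins / max(denom, 1e-12)
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            p = p_new
            break
        p = p_new
    return (p - p.min()) / (p.max() - p.min())  # normalize to [0, 1]
```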
Available under: pulze/intent-v0.1
Deploy locally with: https://lnkd.in/gtTGVFYy
—
Dataset: https://lnkd.in/gyfxYGtQ
- Prompts and intent categories derived from GAIR-NLP/Auto-J scenario classification dataset
- gpt-4-turbo used as LLM judge for pairwise comparisons
—
I enjoyed giving a technical deep dive and live demo of our router's capabilities, and made it a hands-on presentation alongside presentations from Martian, OpenRouter, and Flashbots.
If you have any questions or wanna learn more just DM me and I can show you how to personalize the router with your own data!
Source: https://lnkd.in/g56P2Z7J
📜 Simple Summary of the Paper
“Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks”
🧠 For Non-Tech Audience
💡 Main Idea:
Think about what you do when you need to find an answer. Normally, you might Google it (search/retrieval) and then write your response.
🔹 AI models using Retrieval-Augmented Generation (RAG) do the same thing—they search for information first, then generate an answer.
🔹 This paper introduces Cache-Augmented Generation (CAG), a new way where AI remembers important knowledge in advance and answers instantly without searching every time.
🚀 Why It’s Important?
✅ Faster answers → No need to search, just recall from memory.
✅ More accurate → No risk of picking wrong sources.
✅ Simpler AI system → No need for complex search tools.
🔹 Example:
• Old Way (RAG) → Like searching Google before answering a question.
• New Way (CAG) → Like having all the important notes ready, so you answer instantly.
📌 Big Idea: CAG is faster, more efficient, and sometimes more accurate than RAG.
💡 For Tech Audience
🚨 Problem with RAG (Retrieval-Augmented Generation)
RAG improves AI by retrieving external knowledge, but it has some problems:
❌ Slow response time → Because AI needs to search first.
❌ Possible wrong answers → If retrieval picks bad documents.
❌ More complex system → Harder to maintain and scale.
🛠 Solution – Cache-Augmented Generation (CAG)
Instead of searching in real-time, CAG preloads important knowledge into memory and uses a precomputed key-value (KV) cache to store it.
🔹 How It Works:
1️⃣ Preload Knowledge → Load the relevant documents into the model's long context.
2️⃣ Precompute KV Cache → Run the model once over that context and store the resulting key-value cache.
3️⃣ Use Cache Instead of Search → At query time, the model answers from the cached context, with no external retrieval.
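A rough sketch of this pattern with Hugging Face transformers (not the paper's code: the model name and files are placeholders, and it assumes a recent transformers version that accepts a prefilled cache in generate()):

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any long-context chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

# 1) Preload knowledge: concatenate the whole (fixed) knowledge base into one prefix.
knowledge = "\n\n".join(open(p).read() for p in ["doc_a.txt", "doc_b.txt"])  # placeholder files
prefix = f"Answer using only the following context.\n\n{knowledge}\n\n"
prefix_inputs = tok(prefix, return_tensors="pt").to(model.device)

# 2) Precompute the KV cache for the prefix once.
with torch.no_grad():
    kv_cache = model(**prefix_inputs, use_cache=True).past_key_values

# 3) Answer questions by reusing the cached prefix instead of retrieving.
def answer(question: str) -> str:
    inputs = tok(prefix + f"Question: {question}\nAnswer:", return_tensors="pt").to(model.device)
    cache = copy.deepcopy(kv_cache)  # keep the precomputed cache pristine across queries
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=128)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

A real implementation would reset or truncate the cache between queries instead of deep-copying it, but the idea is the same: pay the cost of encoding the knowledge once.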
🚀 Why CAG is Better?
✅ No waiting time → AI doesn’t have to search, just recalls from memory.
✅ More reliable answers → No mistakes from bad document retrieval.
✅ Simpler system → No need for complex retrieval setup.
📊 Test & Results
📌 Datasets: SQuAD, HotPotQA
📌 Compared With: Traditional RAG (sparse BM25 retrieval and dense OpenAI-embedding indexes)
📌 Results:
• CAG was faster and more accurate than RAG.
• Big performance boost for large knowledge-based tasks.
🔮 What This Means for the Future?
As LLMs with longer context windows improve, CAG could replace RAG for many AI applications.
✔ For tasks with a fixed knowledge base, CAG is faster, simpler, and more reliable than RAG.
🏆 Final Thought
AI doesn’t always need to search. If it remembers things the right way, it can be faster, smarter, and more efficient 🚀
General-purpose large language models (LLMs) can be customized through fine-tuning, enhancing them for specific tasks with domain-specific data. However, this method risks compromising general capabilities. Retrieval-Augmented Generation (RAG), on the other hand, is well suited to large models because it incorporates external knowledge without altering core abilities.
While fine-tuning suits smaller models and tasks requiring memorization, RAG is preferred for large models and tasks needing frequent updates and external knowledge integration. Your choice between these methods should consider model size, use case, infrastructure, and knowledge update requirements, with a combined approach often yielding optimal results.
Read more here: https://lnkd.in/gh5vsqmb
In this article we discuss, in simple terms, the various strategies for 'breaking up the input', which significantly impact (and complicate) data prep for LLMs.
Retrieval-Augmented Generation (RAG) is essential for equipping your Large Language Models (LLMs) with the most relevant enterprise data. But it is far more nuanced than the 'talk to your PDF' YouTube videos make it look.
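As a tiny illustration, the simplest such strategy is fixed-size chunking with overlap (a sketch only; real pipelines usually split on document structure such as headings, paragraphs, and tables):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size character chunking with overlap between consecutive chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```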
🚀 Day 9: One Hot Encoding in NLP! 🌟
Today, let's dive into one of the foundational techniques in Natural Language Processing: One Hot Encoding! This method is part of the broader category of frequency-based embedding techniques used for representing words.
🔑 What is One Hot Encoding?
One Hot Encoding is a way to represent categorical variables as binary vectors. In the context of NLP, it allows us to convert words or tokens into a format that machine learning algorithms can understand. As part of frequency-based embedding techniques, it creates unique binary vectors for each word based on their presence in the dataset.
📊 How Does It Work?
✂️ Tokenization: First, we split the text into individual words (tokens).
📚 Vocabulary Creation: Next, we build a list of unique words from the tokenized text.
🔢 Indexing Words: Each word in the vocabulary is assigned an index based on its position.
🔗 One-Hot Vectors: For every word, a binary vector is created where one position (corresponding to the word's index) is marked as 1, while all other positions are 0.
Let’s say our vocabulary consists of the words: ['cat', 'dog', 'fish'].
The word 'cat' is represented as [1, 0, 0]
The word 'dog' is represented as [0, 1, 0]
The word 'fish' is represented as [0, 0, 1]
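In code, that toy example is simply:

```python
vocab = ["cat", "dog", "fish"]

def one_hot(word: str) -> list[int]:
    vec = [0] * len(vocab)        # all zeros...
    vec[vocab.index(word)] = 1    # ...except a 1 at the word's index
    return vec

print(one_hot("dog"))  # [0, 1, 0]
```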
✨ Advantages:
Simplicity: Easy to implement and understand.
Binary Representation: Suitable for algorithms that work well with binary data.
⚠️ Disadvantages:
High Dimensionality: With large vocabularies, the vectors become very sparse, leading to inefficiency.
Lack of Context: One Hot Encoding does not capture semantic relationships between words (e.g., 'cat' and 'dog' being more similar than 'cat' and 'car').
Memory Inefficiency: Storing large binary vectors can consume significant memory resources.
Check out my GitHub repository for more insights on this topic!
https://lnkd.in/gzG_Tisb
💬 How do you see One-Hot Encoding fitting into the current landscape of NLP techniques? Share your insights in the comments below!
#NLP #DataScience #OneHotEncoding #MachineLearning #AI #NLTK #ContinuousLearning #Upskilling #LearningJourney
The impact of vocabulary size on LLM's performance.
======================================
OpenAI’s GPT-4 has a vocabulary of around 100,000 tokens, while Google’s Gemini has 250,000 tokens — 2.5 times larger than GPT-4, which was already 3 times bigger than the first modern model, BERT.
It turns out that vocabulary size significantly impacts an LLM’s performance, affecting the balance between cost, latency, and quality — the LLM’s triad metrics.
** Cost. I claim that the bigger the LLM's vocabulary, the lower the cost for the user! More explicitly, increasing the vocabulary size leads to fewer generated tokens, which in turn means a smaller cost for the user. That could be a big competitive advantage.
The immediate implication is that Gemini will output far fewer tokens than GPT. For example, Gemini may output a single token for the word 'LLM', whereas 'LLM' is not part of GPT-4's vocabulary (it is out-of-vocabulary, OOV), so GPT-4 outputs two tokens ('L' & 'LM') and concatenates them later.
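You can inspect this kind of split yourself with a tokenizer library such as tiktoken (the exact split of any given word depends on the tokenizer; 'L' & 'LM' above is an illustrative example):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer
tokens = enc.encode("LLM")
print(len(tokens), [enc.decode([t]) for t in tokens])
```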
Generating more tokens, it seems, is actually OpenAI's strategy. o1 ("Strawberry"), OpenAI's recent model with a built-in chain of thought, excels at reasoning, but this comes at the price of an excessive number of generated tokens (which you don't see), so it's much slower and more expensive.
** Latency. The effect on latency is mixed. On one hand, generating more tokens takes more time; on the other hand, generating each token from a smaller vocabulary may take less time per token, so the two effects may partially cancel out.
** Quality. This section is likely the most intriguing part and it's analyzed in depth in my post below. Take a look.
https://lnkd.in/d_frdWy4
Empowering Recommendation Systems with Distilled GPT-4 Intelligence
After reading about DisRanker, I'm inspired to rethink recommendation systems. The idea of distilling GPT-4's ranking capabilities into a compact 3B-parameter LLM is promising for scoring relevance in real-world applications. Instead of trying to squeeze everything into a BERT-sized model, why not harness a smaller LLM for both efficiency and depth? This approach could capture GPT-4's contextual strengths while staying light on resources, which is ideal for dynamic recommendation tasks. I'm excited about using this framework to combine domain-specific insights with a hybrid ranking loss, ultimately enhancing user experience without overwhelming system capacity.
https://lnkd.in/eRekiArr