Hands-On Large Language Models
With Early Release ebooks, you get books in their earliest form—the author’s
raw and unedited content as they write—so you can take advantage of these
technologies long before the official release of these titles.
The views expressed in this work are those of the authors and do not
represent the publisher’s views. While the publisher and the authors have
used good faith efforts to ensure that the information and instructions
contained in this work are accurate, the publisher and the authors disclaim all
responsibility for errors or omissions, including without limitation
responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes
is subject to open source licenses or the intellectual property rights of others,
it is your responsibility to ensure that your use thereof complies with such
licenses and/or rights.
978-1-098-15090-7
[TO COME]
Chapter 1. Categorizing Text
This will be the 2nd chapter of the final book. Please note that the GitHub
repo will be made active later on.
If you have comments about how we might improve the content and/or
examples in this book, or if you notice missing material within this chapter,
please reach out to the editor at [email protected].
One of the most common tasks in natural language processing, and machine
learning in general, is classification. The goal of the task is to train a model to
assign a label or class to some input text. Categorizing text is used across the
world for a wide range of applications, from sentiment analysis and intent
detection to extracting entities and detecting language.
Let’s start by looking at the most basic application and technique, fully-
supervised text classification.
Figure 1-1. An example of supervised classification. Can we predict whether a movie review is either
positive or negative?
In this pipeline, we always need to train the classifier, but we can choose to
fine-tune the entire LLM, only certain parts of it, or keep it as is. If we
choose not to fine-tune it at all, we refer to this procedure as freezing its layers.
This means that the layers cannot be updated during the training process.
However, it may be beneficial to unfreeze at least some of its layers so that
the LLM can be fine-tuned for the specific classification
task. This process is illustrated in Figure 1-2.
Figure 1-2. A common procedure for supervised text classification. We convert our textual input data to
numerical representations through feature extraction. Then, a classifier is trained to predict labels.
Model Selection
We can use an LLM to represent the text to be fed into our classifier. The
choice of this model, however, may not be as straightforward as you might
think. Models differ in the languages they can handle, their architecture, size,
inference speed, accuracy on certain tasks, and in many other
ways.
Selecting the right model for the job can be a form of art in itself. Trying
thousands of pre-trained models that can be found on Hugging Face’s Hub is
not feasible, so we need to be efficient with the models that we choose.
Having said that, there are a number of models that are a great starting point
and give you an idea of the base performance of these kinds of models.
Consider them solid baselines:
BERT-base-uncased
Roberta-base
Distilbert-base-uncased
Deberta-base
BERT-tiny
Albert-base-v2
Data
import pandas as pd
from datasets import load_dataset
tomatoes = load_dataset("rotten_tomatoes")
TIP
Although this book focuses on LLMs, it is highly advised to compare these examples against classic,
but strong baselines such as representing text with TF-IDF and training a LogisticRegression classifier
on top of that.
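The baseline from the tip above can be sketched in a few lines with scikit-learn. This is a minimal sketch, not a tuned implementation: the four training sentences and two evaluation sentences are hypothetical stand-ins for the Rotten Tomatoes splits, and both TfidfVectorizer and LogisticRegression are left at their default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical); in the chapter this would be the
# Rotten Tomatoes train and evaluation splits.
train_texts = [
    "a charming and delightful film",
    "wonderful acting and a great story",
    "a dull and boring mess",
    "terrible plot and awful acting",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
eval_texts = ["a great and charming story", "an awful, boring film"]

# TF-IDF turns each text into a sparse, weighted bag-of-words vector;
# logistic regression is then trained on those vectors.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

baseline_preds = baseline.predict(eval_texts)
print(baseline_preds)
```

Because the whole pipeline trains in seconds on a CPU, it gives a cheap reference score to compare the LLM-based approaches against.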
Classification Head
Using the Rotten Tomatoes dataset, we can start with the most
straightforward example of a predictive task, namely binary classification.
This is often applied in sentiment analysis, detecting whether a certain
document is positive or negative. This can be customer reviews with a label
indicating whether that review is positive or negative (binary). In our case,
we are going to predict whether a movie review is negative (0) or positive
(1).
Training a classifier with transformer-based models generally follows a two-
step approach:
Figure 1-4. First, we start by using a generic pre-trained LLM (e.g., BERT) to convert our textual data
into numerical representations. During training, we will “freeze” the model such that its weights
will not be updated. This speeds up training significantly but is generally less accurate.
Figure 1-5. After fine-tuning our LLM, we train a classifier on the numerical representations and labels.
Typically, a Feed Forward Neural Network is chosen as the classifier.
These two steps each describe the same model since the classification head is
added directly to the BERT model. As illustrated in Figure 1-6, our classifier
is nothing more than a pre-trained LLM with a linear layer attached to it. It is
feature extraction and classification in one.
Figure 1-6. We adopt the BERT model such that its output embeddings are fed into a classification
head. This head generally consists of a linear layer but might include dropout beforehand.
NOTE
In Chapter 10, we will use the same pipeline shown in Figures 1-4 and 1-5 but will instead fine-tune the
Large Language Model. There, we will go more in-depth into how fine-tuning works and why it
improves upon the pipeline as shown here. For now, it is essential to know that fine-tuning this model
together with the classification head improves the accuracy during the classification task. The reason
for this is that it allows the Large Language Model to better represent the text for classification
purposes. It is fine-tuned toward the domain-specific texts.
Example
Next, we can train the model on our training dataset and predict the labels of
our evaluation dataset:
import numpy as np
from sklearn.metrics import f1_score
Now that we have trained our model, all that is left is evaluation:
TIP
The simpletransformers package has a number of easy-to-use features for different tasks. For
example, you could also use it to create a custom Named Entity Recognition model with only a few
lines of code.
Pre-Trained Embeddings
Figure 1-7. First, we use an LLM that was trained specifically to generate accurate numerical
representations. These tend to be better representative vectors than we receive from a general
Transformer-based model like BERT.
Second, as shown in Figure 1-8, we use the embeddings as input for a logistic
regression model. We are completely separating the feature extraction model
from the classification model.
Figure 1-8. Using the embeddings as our features, we train a logistic regression model on our training
data.
In contrast to our previous example, these two steps each describe a different
model: SBERT for generating features, namely embeddings, and logistic
regression as the classifier. As illustrated in Figure 1-9, our classifier is
a separate model that learns from the embeddings generated by SBERT.
Figure 1-9. The classifier is a separate model that leverages the embeddings from SBERT to learn from.
Example
In practice, you can use any classifier on top of our generated embeddings,
like Decision Trees or Neural Networks.
Zero-shot Classification
We started this chapter with examples where all of our training data has
labels. In practice, however, this might not always be the case. Getting
labeled data is a resource-intensive task that can require significant human
labor. Instead, we can use zero-shot classification models. This method is a
nice example of transfer learning where a model trained for one task is used
for a task different than what it was originally trained for. An overview of
zero-shot classification is given in Figure 1-10. Note that this pipeline also
demonstrates the capabilities of performing multi-label classification if the
probabilities of multiple labels exceed a given threshold.
Figure 1-10. In zero-shot classification, the LLM is not trained on any of the candidate
labels. It learned from different labels and generalized that information to the candidate labels.
Often, zero-shot classification tasks are used with pre-trained LLMs that use
natural language to describe what we want our model to do. It is often
referred to as an emergent feature of LLMs as the models increase in size
(wei2022emergent). As we will see later in this chapter on classification with
generative models, GPT-like models can often do these kinds of tasks quite
well.
Pre-Trained Embeddings
Fortunately, there is a trick that we can use. We can describe our labels based
on what they should represent. For example, a negative label for movie
reviews can be described as “This is a negative movie review”. By describing
and embedding the labels and documents, we have data that we can work
with. This process, as illustrated in Figure 1-11, allows us to generate our
own target labels without the need to actually have any labeled data.
Figure 1-11. To embed the labels, we first need to give them a description. For example, the description
of a negative label could be “A negative movie review”. This description can then be embedded
through sentence-transformers. In the end, both labels as well as all the documents are embedded.
To compare documents with labels, we can use cosine similarity. It is the
cosine of the angle between two vectors, calculated as the
dot product of the embeddings divided by the product of their lengths. It
sounds more complicated than it is and, hopefully, the illustration
in Figure 1-12 provides additional intuition.
Figure 1-12. The cosine similarity is the cosine of the angle between two vectors or embeddings. In this example, we
calculate the similarity between a document and the two possible labels, positive and negative.
For each document, its embedding is compared to that of each label. The
label with the highest similarity to the document is chosen. Figure 1-13 gives
a nice example of how a document is assigned a label.
Figure 1-13. After embedding the label descriptions and the documents, we can use cosine similarity
for each label document pair. For each document, the label with the highest similarity to the document
is chosen.
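To make the label assignment concrete, here is a minimal sketch of the comparison using NumPy. The three-dimensional vectors are hypothetical placeholders; real embeddings from a sentence-transformers model have hundreds of dimensions, but the arithmetic is the same.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of the two label descriptions.
label_embeddings = {
    "negative": np.array([1.0, 0.0, 0.2]),
    "positive": np.array([0.0, 1.0, 0.2]),
}
# Hypothetical embedding of a single review.
doc_embedding = np.array([0.1, 0.9, 0.3])

# Compare the document with every label and keep the most similar one.
scores = {label: cosine_sim(doc_embedding, emb)
          for label, emb in label_embeddings.items()}
predicted_label = max(scores, key=scores.get)
print(predicted_label)  # positive
```

The document vector points mostly in the same direction as the "positive" label vector, so that label receives the higher score.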
Example
Since we are dealing with positive and negative movie reviews, let’s name
the labels “A positive review” and “A negative review”. This allows us to
embed those labels:
Now that we have embeddings for our reviews and the labels, we can apply
cosine similarity between them to see which label fits best with which
review. Doing so requires only a few lines of code:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
And that is it! We only needed to come up with names for our labels to
perform our classification tasks. Let’s see how well this method works:
>>> print(classification_report(eval_df.label, y_pred))
An F-1 score of 0.81 is quite impressive considering we did not use any
labeled data at all! This just shows how versatile and useful embeddings are
especially if you are a bit creative with how they are used.
We use the code we used before to see whether this actually works:
By only changing the phrasing of the labels, we increased our F-1 score quite
a bit!
TIP
In the example, we applied zero-shot classification by naming the labels and embedding them. When
we have a few labeled examples, embedding them and adding them to the pipeline could help increase
the performance. For example, we could average the embeddings of the labeled examples together with
the label embeddings. We could even do a voting procedure by creating different types of
representations (label embeddings, document embeddings, averaged embeddings, etc.) and see which
label is most often found. This would make our zero-shot classification example a few-shot approach.
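One way to read the averaging idea in the tip is as building a single "prototype" vector per label. The sketch below assumes tiny two-dimensional embeddings purely for illustration; the vectors and variable names are hypothetical.

```python
import numpy as np

# Hypothetical embedding of the label description "A positive review".
label_embedding = np.array([0.0, 1.0])
# Hypothetical embeddings of two reviews that were labeled as positive.
example_embeddings = np.array([[0.2, 0.8],
                               [0.4, 0.6]])

# Stack the label description with the labeled examples and take the mean,
# giving one prototype vector that represents the label.
prototype = np.vstack([label_embedding, example_embeddings]).mean(axis=0)
print(prototype)
```

Documents would then be compared against this prototype with cosine similarity, exactly as they were compared against the plain label embedding.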
Figure 1-14. An example of natural language inference (NLI). The hypothesis is contradicted by the
premise, so the two do not agree with one another.
NLI can be used for zero-shot classification by being a bit creative with how
the premise/hypothesis pair is used. We take
the input document, the review that we want to extract sentiment from, and
use it as our premise (yin2019benchmarking). Then, we create a hypothesis
stating that the premise is about our target label. In our movie reviews
example, the hypothesis could be: “This example is a positive movie review”.
When the model finds it to be an entailment, we can label the review as
positive; when it finds a contradiction, we can label it as negative. Using NLI for zero-shot
classification is illustrated with an example in Figure 1-15.
Figure 1-15. An example of zero-shot classification with natural language inference (NLI). The
hypothesis is supported by the premise and the model will return that the review is indeed a positive
movie review.
Example
NOTE
Over the course of the last few years, Hugging Face has strived to become the GitHub of Machine
Learning by hosting pretty much everything related to Machine Learning. As a result, there is a large
number of pre-trained models available on their hub. For zero-shot classification tasks, you can follow
this link: https://huggingface.co/models?pipeline_tag=zero-shot-classification.
# Candidate labels
candidate_labels_dict = {"negative movie review": 0, "positive
candidate_labels = ["negative movie review", "positive movie re
# Create predictions
predictions = pipe(eval_df.text.values.tolist(), candidate_labe
TIP
This guiding process is done mainly through the prompts that you give such
a model. Optimizing the prompts such that the model understands what
kind of answer you are looking for is called prompt engineering. This
section will demonstrate how we can leverage generative models to perform a
wide variety of classification tasks.
This is especially true for extremely large language models, such as GPT-3.
An excellent paper on this subject, “Language Models are Few-Shot
Learners”, shows that these models are competitive on downstream tasks
whilst needing less task-specific data (brown2020language).
In-Context Learning
Figure 1-16. Zero-shot and few-shot classification through prompt engineering with generative models.
Example
One easy way to avoid rate limit errors is to automatically retry requests with
a random exponential backoff. Retrying with exponential backoff means
performing a short sleep when a rate limit error is hit, then retrying the
unsuccessful request. If the request is still unsuccessful, the sleep length is
increased and the process is repeated. This continues until the request is
successful or until a maximum number of retries is reached.
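The retry logic described above can be sketched with only the standard library. This is a minimal sketch under stated assumptions: RateLimitError is a hypothetical stand-in for whatever exception your API client raises, and the sleep function is passed in as a parameter so the behavior can be tried out without actually waiting.

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the rate limit error raised by an API client."""

def retry_with_backoff(func, max_retries=5, initial_delay=1.0, factor=2.0,
                       sleep=time.sleep):
    """Call func(), retrying on RateLimitError with random exponential backoff."""
    delay = initial_delay
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up once the maximum number of retries is reached
            # Sleep for the current delay plus random jitter, then grow the delay.
            sleep(delay * (1 + random.random()))
            delay *= factor
```

The random jitter spreads out retries from parallel clients so that they do not all hit the API again at the same moment.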
Lastly, we need to sign in to OpenAI’s API with an API key that you can get
from your account:
import openai
openai.api_key = "sk-..."
WARNING
When using external APIs, always keep track of your usage. External APIs, such as OpenAI or Cohere,
can quickly become costly if you request too often from their APIs.
Zero-shot Classification
[DOCUMENT]
You might have noticed that we explicitly say to not give any other answers.
These generative models tend to have a mind of their own and return large
explanations as to why something is or isn’t negative. Since we are
evaluating its results, we want either a 0 or a 1 to be returned.
Next, let’s see if it can correctly predict that the review “unpretentious,
charming, quickie, original” is positive:
[DOCUMENT]
The output indeed shows that the review was labeled by OpenAI’s model as
positive! Using this prompt template, we can insert any document at the
“[DOCUMENT]” tag. These models have token limits which means that we
might not be able to insert an entire book into the prompt. Fortunately,
reviews tend not to be the sizes of books but are often quite short.
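As a rough illustration of how such a template can be filled and how the reply can be checked, consider the sketch below. The prompt wording is an assumption made for this example, not the exact prompt used above, and build_prompt and parse_label are hypothetical helper names. No API call is made here.

```python
# Hypothetical prompt wording; the "[DOCUMENT]" tag marks where the review goes.
zeroshot_template = """Predict whether the following document is a positive or negative movie review:

[DOCUMENT]

If it is positive say 1 and if it is negative say 0. Do not give any other answers.
"""

def build_prompt(template, document):
    """Insert the document at the [DOCUMENT] tag."""
    return template.replace("[DOCUMENT]", document)

def parse_label(response_text):
    """Map the model's raw reply to 0 or 1, or None if it is anything else."""
    cleaned = response_text.strip()
    return int(cleaned) if cleaned in {"0", "1"} else None

prompt = build_prompt(zeroshot_template,
                      "unpretentious, charming, quickie, original")
print(prompt)
```

Returning None for unexpected replies, rather than calling int() directly, avoids a crash when the model ignores the instruction and answers with a full sentence.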
Next, we can run this for all reviews in the evaluation dataset and look at its
performance. Do note though that this requires 300 requests to OpenAI’s
API:
> from sklearn.metrics import classification_report
> from tqdm import tqdm
>
> y_pred = [int(gpt_prediction(zeroshot_prompt, doc)) for
> print(classification_report(eval_df.label, y_pred))
An F-1 score of 0.91! That is the highest we have seen thus far and is quite
impressive considering we did not fine-tune the model at all.
NOTE
Although this zero-shot classification with GPT has shown high performance, it should be noted that
fine-tuning generally outperforms in-context learning as presented in this section. This is especially true
if domain-specific data is involved which the model during pre-training is unlikely to have seen. A
model’s adaptability to task-specific nuances might be limited when its parameters are not updated for
the task at hand. Preferably, we would want to fine-tune this GPT model on this data to improve its
performance even further!
Few-shot Classification
[DOCUMENT]
NOTE
Since we added a few examples to the prompt, the generative model consumes more tokens and as a
result could increase the costs of requesting the API. However, that is relatively little compared to fine-
tuning and updating the entire model.
Prediction is the same as before but replacing the zero-shot prompt with the
few-shot prompt:
As before, let’s run the improved prompt against the entire evaluation dataset:
The F1-score is now 0.92 which is a very slight increase compared to what
we had before. This is not unexpected since its score was already quite high
and the task at hand was not particularly complex.
NOTE
We can extend the examples of in-context learning to multi-label classification by engineering the
prompt. For example, we can ask the model to choose one or multiple labels and return them separated
by commas.
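If the model is asked to return labels separated by commas, its reply still has to be turned back into a list. A minimal sketch, assuming a hypothetical set of genre labels and a hypothetical helper named parse_labels:

```python
def parse_labels(response_text, allowed_labels):
    """Split a comma-separated reply and keep only known labels, in order."""
    allowed = {label.lower() for label in allowed_labels}
    found = []
    for part in response_text.split(","):
        label = part.strip().lower()
        if label in allowed and label not in found:
            found.append(label)
    return found

# Hypothetical label set and model reply.
genres = ["comedy", "drama", "horror"]
print(parse_labels("Comedy, drama, musical", genres))  # ['comedy', 'drama']
```

Filtering against the allowed labels guards against the model inventing a label, such as "musical" here, that was never part of the task.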
Figure 1-17. An example of named entity recognition that detects the entities “place” and “time”.
When we think about token classification, one major framework comes to
mind, namely SpaCy (https://spacy.io/). It is an incredible package for
building industrial-strength NLP applications and has been the go-to
framework for NER tasks. So, let’s use it!
Example
To use OpenAI’s models with SpaCy, we will first need to save the API key
as an environment variable. This makes it easier for SpaCy to access it
without the need to save it locally:
import os
os.environ['OPENAI_API_KEY'] = "sk-..."
import spacy
nlp = spacy.blank("en")
# Display entities
html = displacy.render(doc, style="ent")
display(HTML(html))
Figure 1-18. The output of SpaCy using OpenAI’s GPT-3.5 model. Without any training, it correctly
identifies our custom entities.
That is much better! Figure 1-18 shows that the model
has correctly identified our custom entities. Without any fine-tuning or
training of the model, we can easily detect entities that we are interested in.
TIP
Training an NER model from scratch with SpaCy is not possible with only a few lines of code, but it is
also by no means difficult! Their documentation and tutorials are, in our opinion, state-of-the-art and
do an excellent job of explaining how to create a custom model.
Summary
In this chapter, we saw many different techniques for performing a wide
variety of classification tasks, from fine-tuning your entire model to no
tuning at all! Classifying textual data is not as straightforward as it may seem
on the surface and there is an incredible number of creative techniques for
doing so.
In the next chapter, we will continue with classification but focus instead on
unsupervised classification. What can we do if we have textual data without
any labels? What information can we extract? We will focus on clustering our
data as well as naming the clusters with topic modeling techniques.
Chapter 2. Semantic Search
This will be the 3rd chapter of the final book. Please note that the GitHub
repo will be made active later on.
Search was one of the first Large Language Model (LLM) applications to see
broad industry adoption. Months after the release of the seminal BERT: Pre-
training of Deep Bidirectional Transformers for Language Understanding
paper, Google announced it was using it to power Google Search and that it
represented “one of the biggest leaps forward in the history of Search”. Not
to be outdone, Microsoft Bing also stated that “Starting from April of this
year, we used large transformer models to deliver the largest quality
improvements to our Bing customers in the past year”.
This is a clear testament to the power and usefulness of these models. Their
addition instantly and massively improves some of the most mature, well-
maintained systems that billions of people around the planet rely on. The
ability they add is called semantic search, which enables searching by
meaning, and not simply keyword matching.
In this chapter, we’ll discuss three major ways of using language models to
power search systems. We’ll go over code examples where you can use these
capabilities to power your own applications. Note that this is not only useful
for web search, but that search is a major component of most apps and
products. So our focus will not be just on building a web search engine, but
rather on your own dataset. This capability powers lots of other exciting LLM
applications that build on top of search (e.g., retrieval-augmented generation,
or document question answering). Let’s start by looking at these three ways
of using LLMs for semantic search.
1- Dense Retrieval
Say that a user types a search query into a search engine. Dense retrieval
systems rely on the concept of embeddings, the same concept we’ve
encountered in the previous chapters, and turn the search problem into
retrieving the nearest neighbors of the search query (after both the query
and the documents are converted into embeddings). Figure 2-1 shows how
dense retrieval takes a search query, consults its archive of texts, and
outputs a set of relevant results.
Figure 2-1. Dense retrieval is one of the key types of semantic search, relying on the similarity of text
embeddings to retrieve relevant results
2- Reranking
These systems are pipelines of multiple steps. A Reranking LLM is one of
these steps and is tasked with scoring the relevance of a subset of results
against the query, and then the order of results is changed based on these
scores. Figure 2-2 shows how rerankers are different from dense retrieval
in that they take an additional input: a set of search results from a previous
step in the search pipeline.
Figure 2-2. Rerankers, the second key type of semantic search, take a search query and a collection of
results, and re-order them by relevance, often resulting in vastly improved results.
3- Generative Search
The growing LLM capability of text generation led to a new batch of
search systems that include a generation model that simply generates an
answer in response to a query. Figure 2-3 shows a generative search
example.
Figure 2-3. Generative search formulates an answer to a question and cites its information sources.
All three concepts are powerful and can be used together in the same
pipeline. The rest of the chapter covers these three types of systems in more
detail. While these are the major categories, they are not the only LLM
applications in the domain of search.
Dense Retrieval
Recall that embeddings turn text into numeric representations. Those can be
thought of as points in space as we can see in Figure 2-4. Points that are close
together mean that the text they represent is similar. So in this example, text 1
and text 2 are similar to each other (because they are near each other), and
different from text 3 (because it’s farther away).
Figure 2-4. The intuition of embeddings: each text is a point, texts with similar meaning are close to
each other.
This is the property that is used to build search systems. In this scenario,
when a user enters a search query, we embed the query, thus projecting it into
the same space as our text archive. Then we simply find the nearest
documents to the query in that space, and those would be the search results.
Figure 2-5. Dense retrieval relies on the property that search queries will be close to their relevant
results.
Judging by the distances in Figure 2-5, “text 2” is the best result for this
query, followed by “text 1”. Two questions could arise here, however:
Should text 3 even be returned as a result? That’s a decision for you, the
system designer. It’s sometimes desirable to set a minimum similarity score
(or, equivalently, a maximum distance) to filter out irrelevant results (in case the corpus has no
relevant results for the query).
Are a query and its best result semantically similar? Not always. This is why
language models need to be trained on question-answer pairs to become
better at retrieval. This process is explained in more detail in Chapter 13.
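The cutoff raised in the first question can be sketched as a filter on cosine similarity. The two-dimensional vectors and the value of min_similarity below are hypothetical; in practice the cutoff would be chosen by inspecting scores on your own data.

```python
import numpy as np

def filter_by_similarity(query_vec, doc_vecs, min_similarity=0.5):
    """Return (index, similarity) pairs at or above the cutoff, best first."""
    # Normalize so that the dot product equals the cosine similarity.
    query_vec = query_vec / np.linalg.norm(query_vec)
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = doc_vecs @ query_vec
    order = np.argsort(-sims)  # indices sorted from most to least similar
    return [(int(i), float(sims[i])) for i in order if sims[i] >= min_similarity]

# Hypothetical embeddings for text 1, text 2, text 3 and one query.
docs = np.array([[1.0, 0.2], [0.9, 0.5], [-1.0, 0.3]])
query = np.array([1.0, 0.4])
results = filter_by_similarity(query, docs, min_similarity=0.5)
print(results)
```

With these toy vectors, text 2 (index 1) ranks first, text 1 (index 0) second, and text 3 is dropped because its similarity falls below the cutoff.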
Dense Retrieval Example
Let’s take a look at a dense retrieval example by using Cohere to search the
Wikipedia page for the film Interstellar. In this example, we will do the
following:
1. Get the text we want to make searchable, apply some light processing to
chunk it into sentences.
2. Embed the sentences
3. Build the search index
4. Search and see the results
To start, we’ll need to install and import the libraries for the example:
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
text = """
Interstellar is a 2014 epic science fiction film co-written,
It stars Matthew McConaughey, Anne Hathaway, Jessica Chastain
Set in a dystopian future where humanity is struggling to sur
embeds = np.array(response)
print(embeds.shape)
Which outputs:
(15, 4096)
This indicates that we have 15 vectors, each of size 4096.
3. Build The Search Index
Before we can search, we need to build a search index. An index stores the
embeddings and is optimized to quickly retrieve the nearest neighbors
even if we have a very large number of points.
search_index.build(10)
search_index.save('test.ann')
def search(query):
# 1. Get the query's embedding
query_embed = co.embed(texts=[query]).embeddings[0]
The first result has the least distance, and so is the most similar to the query.
Looking at it, it answers the question perfectly. Notice that this wouldn’t
have been possible if we were only doing keyword search because the top
result did not include the words “much” or “make”.
Distance: 1.244138
Distance: 0.917728
Distance: 0.871881
It’s useful to be aware of some of the drawbacks of dense retrieval and how
to address them. What happens, for example, if the texts don’t contain the
answer? We still get results and their distances. For example:
texts
Another caveat of dense retrieval is cases where a user wants to find an exact
match to text they’re looking for. That’s a case that’s perfect for keyword
matching. That’s one reason why hybrid search, which includes both
semantic search and keyword search, is used.
The final thing we’d like to point out is that this is a case where each sentence
contained a piece of information, and we showed queries that specifically ask
for that information. What about questions whose answers span
multiple sentences? This shows one of the important design parameters of
dense retrieval systems: what is the best way to chunk long texts? And why
do we need to chunk them in the first place?
There are several possible ways, and two possible approaches shown in
Figure 2-6 include indexing one vector per document, and indexing multiple
vectors per document.
Figure 2-6. It’s possible to create one vector representing an entire document, but it’s better for longer
documents to be split into smaller chunks that get their own embeddings.
One vector per document
In this approach, we use a single vector to represent the whole document. The
possibilities here include:
This approach can satisfy some information needs, but not others. A lot of the
time, a search is for a specific piece of information contained in an article,
which is better captured if the concept had its own vector.
Multiple vectors per document
In this approach, we chunk the document into smaller pieces, and embed
those chunks. Our search index then becomes that of chunk embeddings, not
entire document embeddings.
The chunking approach is better because it has full coverage of the text and
because the vectors tend to capture individual concepts inside the text. This
leads to a more expressive search index. Figure 2-7 shows a number of
possible approaches.
Figure 2-7. A number of possible options for chunking a document for embedding.
The best way of chunking a long text will depend on the types of texts and
queries your system anticipates. Approaches include:
Each sentence is a chunk. The issue here is this could be too granular and
the vectors don’t capture enough of the context.
Each paragraph is a chunk. This is great if the text is made up of short
paragraphs. Otherwise, it may be that every 4-8 sentences are a chunk.
Some chunks derive a lot of their meaning from the text around them. So
we can incorporate some context via:
Adding the title of the document to the chunk
Adding some of the text before and after them to the chunk. This way,
the chunks can overlap so they include some surrounding text. This is
what we can see in Figure 2-8.
Figure 2-8. Chunking the text into overlapping segments is one strategy to retain more of the context
around different segments.
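The overlapping strategy can be sketched as a sliding window over sentences. This is a minimal sketch, assuming the text has already been split into sentences; chunk_size and overlap are hypothetical parameter names, and sensible values depend on your texts and queries.

```python
def chunk_with_overlap(sentences, chunk_size=3, overlap=1):
    """Group sentences into chunks that share `overlap` sentences with the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # this chunk already reaches the end of the text
    return chunks

sentences = ["S1.", "S2.", "S3.", "S4.", "S5."]
print(chunk_with_overlap(sentences, chunk_size=3, overlap=1))
# ['S1. S2. S3.', 'S3. S4. S5.']
```

Each chunk would then be embedded on its own; the shared sentence gives neighboring chunks some common context.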
The most straightforward way to find the nearest neighbors is to calculate the
distances between the query and the archive. That can easily be done with
NumPy and is a reasonable approach if you have thousands or tens of
thousands of vectors in your archive.
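A minimal sketch of this exact, brute-force search with NumPy is shown below. It uses Euclidean distance on a hypothetical archive of four two-dimensional vectors; the function name nearest_neighbors is our own.

```python
import numpy as np

def nearest_neighbors(query_vec, archive, k=3):
    """Exact search: compute the distance to every vector and keep the k smallest."""
    distances = np.linalg.norm(archive - query_vec, axis=1)
    top_k = np.argsort(distances)[:k]
    return top_k, distances[top_k]

# Hypothetical archive of four 2-dimensional embeddings.
archive = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
indices, dists = nearest_neighbors(np.array([0.9, 0.2]), archive, k=2)
print(indices, dists)
```

Every query touches every vector in the archive, so the cost grows linearly with the archive size, which is what approximate methods are designed to avoid.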
As you scale beyond that to millions of vectors, an optimized approach for the
retrieval is to rely on approximate nearest neighbor search libraries like
Annoy or FAISS. These allow you to retrieve results from massive indexes in
milliseconds and some of them can scale to GPUs and clusters of machines to
serve very large indices.
Another class of vector retrieval systems is vector databases like Weaviate
or Pinecone. A vector database allows you to add or delete vectors without
having to rebuild the index. They also provide ways to filter your search or
customize it in ways beyond merely vector distances.
As we’ve seen in the text classification chapter, we can improve the
performance of an LLM on a task using fine-tuning. As in that case,
retrieval needs to optimize text embeddings and not simply token
embeddings. The process for this fine-tuning is to get training data composed
of queries and relevant results.
Having these examples, we now have three pairs - two positive pairs and one
negative pair. Let’s assume, as we can see in Figure 2-9, that before fine-
tuning, all three queries have the same distance from the result document.
That’s not far-fetched because they all talk about Interstellar.
Figure 2-9. Before fine-tuning, the embeddings of both relevant and irrelevant queries may be close to a
particular document.
The fine-tuning step works to make the relevant queries closer to the
document and, at the same time, to make irrelevant queries farther from the
document. We can see this effect in Figure 2-10.
Figure 2-10. After the fine-tuning process, the text embedding model becomes better at this search task
by incorporating how we define relevance on our dataset using the examples we provided of relevant
and irrelevant documents.
Reranking
A lot of companies have already built search systems. For those companies,
an easier way to incorporate language models is as a final step inside their
search pipeline. This step is tasked with changing the order of the search
results based on relevance to the search query. This one step can vastly
improve search results and it’s in fact what Microsoft Bing added to achieve
the improvements to the search results using BERT-like models.
Figure 2-11 shows the structure of a rerank search system serving as the
second stage in a two-stage search system.
Figure 2-11. LLM Rerankers operate as a part of a search pipeline with the goal of re-ordering a
number of shortlisted search results by relevance
Reranking Example
A reranker takes in the search query and a number of search results, and
returns the optimal ordering of these documents so the most relevant ones to
the query are higher in ranking.
import cohere
API_KEY = ""
co = cohere.Client(API_KEY)
MODEL_NAME = "rerank-english-02" # another option is rerank-mul
query = "film gross"
Output:
This shows the reranker is much more confident about the first result,
assigning it a relevance score of 0.92 while the other results are scored much
lower in relevance.
More often, however, our index would have thousands or millions of entries,
and we need to shortlist, say one hundred or one thousand results and then
present those to the reranker. This shortlisting is called the first stage of the
search pipeline.
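The shape of such a two-stage pipeline can be sketched without any model at all. In the sketch below, the first stage shortlists documents by simple word overlap, and the second stage re-orders the shortlist with a scoring function. Both `first_stage` and `toy_relevance` are hypothetical stand-ins written for this illustration; in a real system the second stage would call an LLM reranker rather than the toy score used here.

```python
def tokenize(text):
    """Lowercase a text and return its set of whitespace-separated words."""
    return set(text.lower().split())


def first_stage(query, documents, k):
    """Shortlist the k documents sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        documents, key=lambda doc: len(query_words & tokenize(doc)), reverse=True
    )
    return ranked[:k]


def toy_relevance(query, document):
    """Hypothetical stand-in for a reranker score: share of document words in the query."""
    words = document.lower().split()
    return len(tokenize(query) & set(words)) / len(words)


def rerank(query, shortlist, score_fn):
    """Re-order the shortlist from most to least relevant according to score_fn."""
    return sorted(shortlist, key=lambda doc: score_fn(query, doc), reverse=True)


documents = [
    "The film opened in theaters in November",
    "The film had a large worldwide gross",
    "The lead actor trained for months",
]
query = "film gross"

shortlist = first_stage(query, documents, k=2)
results = rerank(query, shortlist, toy_relevance)
print(results)
```

The design point is that the cheap first stage touches every document, while the more expensive scoring function only ever sees the shortlist.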
One popular way of building LLM search rerankers is to present the query and
each result to an LLM working as a cross-encoder. This means that a query and a
possible result are presented to the model at the same time, allowing the
model to view the full text of both before it assigns a relevance
score. This method is described in more detail in a paper titled Multi-Stage
Document Ranking with BERT and is sometimes referred to as monoBERT.
To learn more about the development of using LLMs for search, Pretrained
Transformers for Text Ranking: BERT and Beyond is a highly recommended
look at the developments of these models until about 2021.
Generative Search
You may have noticed that dense retrieval and reranking both use
representation language models, and not generative language models. That’s
because they’re better optimized for these tasks than generative models.
At a certain scale, however, generative LLMs started to seem more and more
capable of a form of useful information retrieval. People started asking
models like ChatGPT questions and sometimes got relevant answers. The
media started painting this as a threat to Google which seems to have started
an arms race in using language models for search. Microsoft launched Bing
AI, powered by generative models. Google launched Bard, its own answer in
this space.
The first batch of generative search systems uses generative models simply as
a summarization step at the end of the search pipeline. We can see an
example in Figure 2-12.
Figure 2-12. Generative search formulates answers and summaries at the end of a search pipeline while
citing its sources (returned by the previous steps in the search system).
At the time of this writing, however, language models excel at generating
coherent text but are not reliable at retrieving facts. They don’t yet really
know what they know or don’t know, and tend to answer lots of questions
with coherent text that can be incorrect. This is often referred to as
hallucination. Because of this, and because search is a use case that
often relies on facts or referencing existing documents, generative search
models are trained to cite their sources and include links to them in their
answers.
Generative search is still in its infancy and is expected to improve with time.
It draws from a machine learning research area called retrieval-augmented
generation. Notable systems in the field include RAG, RETRO, and Atlas,
among others.
Evaluation metrics
Evaluating search systems requires three major components: a text archive, a set
of queries, and relevance judgments indicating which documents are relevant
for each query. We see these components in Figure 2-13.
Figure 2-13. To evaluate search systems, we need a test suite including queries and relevance
judgements indicating which documents in our archive are relevant for each query.
Using this test suite, we can proceed to explore evaluating search systems.
Let’s start with a simple example. Assume we pass Query 1 to two
different search systems and get two sets of results. Say we limit the number
of results to three, as we can see in Figure 2-14.
Figure 2-14. To compare two search systems, we pass the same query from our test suite to both
systems and look at their top results
Figure 2-15. Looking at the relevance judgements from our test suite, we can see that System 1 did a
better job than System 2.
This shows us a clear case where System 1 is better than System 2. Intuitively,
we may just count how many relevant results each system retrieved. System
1 got two out of three correct, and System 2 got only one out of three
correct.
But what about a case like Figure 2-16, where both systems get only one
relevant result out of three, but in different positions?
Figure 2-16. We need a scoring system that rewards system 1 for assigning a high position to a relevant
result -- even though both systems retrieved only one relevant result in their top three results.
In this case, we can intuit that System 1 did a better job than system 2
because the result in the first position (the most important position) is correct.
But how can we assign a number or score to how much better that result is?
Mean Average Precision is a measure that is able to quantify this distinction.
The first one is easy, looking at only the first result, we calculate the
precision score: we divide the number of correct results by the total number
of results (correct and incorrect). Figure 2-17 shows that in this case, we have
one correct result out of one (since we’re only looking at the first position
now). So precision here is 1/1 = 1.
Figure 2-17. To calculate Mean Average Precision, we start by calculating precision at each position,
starting by position #1.
We need to continue calculating precision results for the rest of the positions.
The calculation at the second position looks at both the first and second
position. The precision score here is 1 (one out of two results being correct)
divided by 2 (two results we’re evaluating) = 0.5.
Figure 2-18 continues the calculation for the second and third positions. It
then goes one step further -- having calculated the precision for each position,
we average them to arrive at an Average Precision score of 0.61.
This calculation shows the average precision for a single query and its results.
If we calculate the average precision for System 1 on all the queries in our
test suite and get their mean, we arrive at the Mean Average Precision score
that we can use to compare System 1 to other systems across all the queries in
our test suite.
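The calculation walked through above can be sketched as follows. The sketch follows this chapter's description, averaging the precision at every position; note that a commonly used definition of Average Precision averages only over the positions that hold a relevant result, so scores from other tools may differ. The function names and the relevance lists (1 for relevant, 0 for not relevant) are assumptions made for this illustration, and the position of System 2's relevant result is hypothetical.

```python
def precision_at(relevance, k):
    """Precision over the top-k results; `relevance` is a list of 0/1 judgments."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """Average the precision computed at every position, as in the walkthrough."""
    precisions = [precision_at(relevance, k) for k in range(1, len(relevance) + 1)]
    return sum(precisions) / len(precisions)


def mean_average_precision(relevance_per_query):
    """Mean of the average precision scores over all queries in a test suite."""
    scores = [average_precision(relevance) for relevance in relevance_per_query]
    return sum(scores) / len(scores)


system_1 = [1, 0, 0]  # relevant result in the first position
system_2 = [0, 0, 1]  # hypothetical: relevant result in the last position

print(round(average_precision(system_1), 2))  # 0.61
print(average_precision(system_1) > average_precision(system_2))
```

For System 1 the per-position precisions are 1, 0.5, and 0.33, which average to 0.61. Passing one relevance list per query to `mean_average_precision` gives a single score for a system across the whole test suite.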
Summary
In this chapter, we looked at different ways of using language models to
improve existing search systems and even be the core of new, more powerful
search systems. These include:
Text clustering aims to group similar texts based on their semantic content,
meaning, and relationships, as illustrated in Figure 3-1. Just like how we’ve
used distances between text embeddings in dense retrieval in chapter XXX,
clustering embeddings allows us to group the documents in our archive by
similarity.
This freedom also comes with its challenges. Since we are not guided by a
specific task, then how do we evaluate our unsupervised clustering output?
How do we optimize our algorithm? Without labels, what are we optimizing
the algorithm for? When do we know our algorithm is correct? What does it
mean for the algorithm to be “correct”? Although these challenges can be
quite complex, they are not insurmountable but often require some creativity
and a good understanding of the use case.
Striking a balance between the freedom of text clustering and the challenges
it brings can be quite difficult. This becomes even more pronounced if we
step into the world of topic modeling, which has started to adopt the “text
clustering” way of thinking.
With topic modeling, we want to discover abstract topics that appear in large
collections of textual data. We can describe a topic in many ways, but it has
traditionally been described by a set of keywords or key phrases. A topic
about natural language processing (NLP) could be described with terms such
as “deep learning”, “transformers”, and “self-attention”. Traditionally, we
expect a document about a specific topic to contain terms appearing more
frequently than others. This expectation, however, ignores contextual
information that a document might contain. Instead, we can leverage Large
Language Models, together with text clustering, to model contextualized
textual information and extract semantically-informed topics. Figure 3-2
demonstrates this idea of describing clusters through textual representations.
Figure 3-2. Topic modeling is a way to give meaning to clusters of textual documents.
In this chapter, we will provide a guide on how text clustering can be done
with Large Language Models. Then, we will transition into a text-clustering-
inspired method of topic modeling, namely BERTopic.
Text Clustering
One major component of exploratory data analysis in NLP is text clustering.
This unsupervised technique aims to group similar texts or documents
together as a way to easily discover patterns among large collections of
textual data. Before diving into a classification task, text clustering gives us
an intuitive understanding of both the task and its complexity.
The patterns that are discovered from text clustering can be used across a
variety of business use cases, from identifying recurring support issues and
discovering new content to drive SEO practices, to detecting topic trends in
social media and discovering duplicate content. The possibilities are diverse
and with such a technique, creativity becomes a key component. As a result,
text clustering can become more than just a quick method for exploratory
data analysis.
Data
Before we describe how to perform text clustering, we will first introduce the
data that we are going to be using throughout this chapter. To keep up with
the theme of this book, we will be clustering a variety of ArXiv articles in the
domain of machine learning and natural language processing. The dataset
contains roughly XXX articles between XXX and XXX.
We start by importing our dataset using HuggingFace’s datasets package and
extracting metadata that we are going to use later on, like the abstracts, years,
and categories of the articles.
Now that we have our data, we can perform text clustering. To perform text
clustering, a number of techniques can be employed, from graph-based neural
networks to centroid-based clustering techniques. In this section, we will go
through a well-known pipeline for text clustering that consists of three major
steps:
1. Embed documents
2. Reduce dimensionality
3. Cluster embeddings
1. Embed documents
The first step in clustering textual data is converting our textual data to text
embeddings. Recall from previous chapters that embeddings are numerical
representations of text that capture its meaning. Producing embeddings
optimized for semantic similarity tasks is especially important for clustering.
By mapping each document to a numerical representation such that
semantically similar documents are close, clustering will become much more
powerful. A set of popular Large Language Models optimized for these kinds
of tasks can be found in the well-known sentence-transformers framework
(reimers2019sentence). Figure 3-3 shows this first step of converting
documents to numerical representations.
Figure 3-3. Step 1: We convert documents to numerical representations, namely embeddings.
2. Reduce dimensionality
NOTE
Dimensionality reduction techniques, however, are not flawless. They cannot perfectly capture high-
dimensional data in a lower-dimensional representation. Information will always be lost with this
procedure. There is a balance between reducing dimensionality and keeping as much information as
possible.
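To make this information loss concrete, here is a small sketch that uses PCA, chosen only because it fits in a few lines of NumPy. It is a stand-in for illustration and not necessarily the technique used elsewhere in this chapter. The embeddings are random placeholders, and the choice of five output dimensions is arbitrary.

```python
import numpy as np

# Random placeholder "embeddings": 200 documents, 50 dimensions each
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 50))

# PCA via singular value decomposition of the centered data
centered = embeddings - embeddings.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Project onto the first 5 principal components
n_components = 5
reduced = centered @ components[:n_components].T

# Share of the total variance that survives the reduction
explained = singular_values**2 / np.sum(singular_values**2)
retained = explained[:n_components].sum()
print(reduced.shape, retained)
```

The `retained` value is below 1, which is the trade-off described in the note: fewer dimensions are easier to cluster, but some of the original variance is gone.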
3. Cluster embeddings
As shown in Figure 3-5, the final step in our pipeline is to cluster the
previously reduced embeddings. Many algorithms out there handle clustering
tasks quite well, from centroid-based methods like k-Means to hierarchical
methods like Agglomerative Clustering. The choice is up to the user and is
highly influenced by the respective use case. Our data might contain some
noise, so a clustering algorithm that detects outliers would be preferred. If our
data comes in daily, we might want to look for an online or incremental
approach instead, so that we can detect whether new clusters have formed.
Figure 3-5. Step 3: We cluster the documents using the embeddings that were reduced in their
dimensionality.
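As one concrete instance of the centroid-based family mentioned above, here is a minimal k-means sketch in NumPy. It is a toy illustration under simplifying assumptions (a naive initialization from the first `k` points and a fixed number of iterations), and unlike the outlier-aware algorithms mentioned above it forces every point into a cluster. The points are made up.

```python
import numpy as np


def kmeans(points, k, n_iter=10):
    """Very small k-means: assign points to the nearest centroid, then update centroids."""
    centroids = points[:k].copy()  # naive initialization, for illustration only
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Made-up 2-dimensional reduced embeddings forming two obvious groups
points = np.array(
    [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
)
labels, centroids = kmeans(points, k=2)
print(labels)
```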
import seaborn as sns

# Visualize clusters (df is assumed to hold the x, y, and cluster columns)
df["cluster"] = df["cluster"].astype(int).astype(str)
sns.scatterplot(data=df, x='x', y='y', hue='cluster',
                linewidth=0, legend=False, s=3, alpha=0.3)
As we can see in Figure 3-6, HDBSCAN tends to capture major clusters quite well.
Note how clusters of points are colored in the same color, indicating that
HDBSCAN put them in a group together. Since we have a large number of
clusters, the plotting library cycles the colors between clusters, so don’t think
that all blue points are one cluster, for example.
Figure 3-6. The generated clusters (colored) and outliers (grey) are represented as a 2D visualization.
NOTE
Using any dimensionality reduction technique for visualization purposes creates information loss. It is
merely an approximation of what our original embeddings look like. Although it is informative, it
might push clusters together or drive them further apart than they actually are. Human evaluation,
inspecting the clusters ourselves, is, therefore, a key component of cluster analysis!
These printed documents tell us that the cluster likely contains documents
that talk about XXX. We can do this for every created cluster out there but
that can be quite a lot of work, especially if we want to experiment with our
hyperparameters. Instead, we would like to create a method for automatically
extracting representations from these clusters without us having to go through
all documents.
This is where topic modeling comes in. It allows us to model these clusters
and give singular meaning to them. Although there are many techniques out
there, we choose a method that builds upon this clustering philosophy as it
allows for significant flexibility.
Topic Modeling
Traditionally, topic modeling is a technique that aims to find latent topics or
themes in a collection of textual data. For each topic, a set of keywords or
phrases is identified that best represents and captures the meaning of the topic.
This technique is ideal for finding common themes in large corpora as it
gives meaning to sets of similar content. An illustrated overview of topic
modeling in practice can be found in Figure 3-7.
To this day, the technique is still a staple in many topic modeling use cases,
and with its strong theoretical background and practical applications, it is
unlikely to go away soon. However, with the seemingly exponential growth
of Large Language Models, we start to wonder if we can leverage these Large
Language Models in the domain of topic modeling.
There have been several models adopting Large Language Models for topic
modeling, like the embedded topic model and the contextualized topic model.
However, with the rapid developments in natural language processing, these
models have a hard time keeping up.
BERTopic
Although there are quite a few methods for doing so, there is a trick in
BERTopic that allows it to quickly describe a cluster, and therefore make it a
topic, whilst generating a highly modular pipeline. The underlying algorithm
of BERTopic contains, roughly, two major steps.
Figure 3-8 describes the same steps as before, namely using sentence-
transformers for embedding the documents, UMAP for dimensionality
reduction, and HDBSCAN for clustering.
Figure 3-8. The first part of BERTopic’s pipeline is clustering textual data.
However, words like “the”, “and”, and “I” appear quite frequently in most
English texts and are likely to be overrepresented. To give proper weight to
these words, BERTopic uses a technique called c-TF-IDF, which stands for
class-based term-frequency inverse-document frequency. c-TF-IDF is a class-
based adaptation of the classic TF-IDF procedure. Instead of considering the
importance of words within documents, c-TF-IDF considers the importance
of words between clusters of documents.
To weight this count, we take the logarithm of one plus the average number
of words per cluster *A* divided by the frequency of term *x* across all
clusters. One is added within the logarithm to guarantee positive values,
which is also often done within TF-IDF.
Figure 3-9. The second part of BERTopic’s pipeline is representing the topics. The calculation of the
weight of term *x* in a class *c*.
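The weighting described above can be written as W(x, c) = tf(x, c) * log(1 + A / f(x)), where tf(x, c) is the count of term x in cluster c, f(x) is its frequency across all clusters, and A is the average number of words per cluster. The sketch below is a plain-Python illustration of that formula using raw counts and whitespace tokenization; it is not BERTopic's implementation, which may normalize the counts differently. The function name `c_tf_idf` and the toy clusters are assumptions made for this example.

```python
import math
from collections import Counter


def c_tf_idf(cluster_docs):
    """Weight each term per cluster as tf * log(1 + A / f_x)."""
    # Treat each cluster as one long document and count its terms
    term_counts = {
        cluster: Counter(" ".join(docs).lower().split())
        for cluster, docs in cluster_docs.items()
    }
    # f_x: frequency of each term across all clusters
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per cluster
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        cluster: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for cluster, counts in term_counts.items()
    }


# Toy clusters, for illustration only
clusters = {
    "qa": ["question answer model", "question"],
    "summarization": ["summary model", "abstractive summary"],
}
weights = c_tf_idf(clusters)
top_terms = sorted(weights["qa"], key=weights["qa"].get, reverse=True)
print(top_terms)
```

The word "model" appears in both clusters, so it is weighted below "answer", which appears in only one. This is how words that are frequent everywhere end up with less weight.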
Putting the two steps together, clustering and representing topics, results in
the full pipeline of BERTopic, as illustrated in Figure 3-10. With this
pipeline, we can cluster semantically similar documents and from the clusters
generate topics represented by several keywords. The higher the weight of a
keyword for a topic, the more representative it is of that topic.
Figure 3-10. The full pipeline of BERTopic, roughly, consists of two steps, clustering and topic
representation.
NOTE
Interestingly, the c-TF-IDF trick does not use a Large Language Model and therefore does not take the
context and semantic nature of words into account. However, like with neural search, it allows for an
efficient starting point after which we can use the more compute-heavy techniques, such as GPT-like
models.
One major advantage of this pipeline is that the two steps, clustering and
topic representation, are relatively independent of one another. When we
generate our topics using c-TF-IDF, we do not use the models from the
clustering step, and, for example, do not need to track the embeddings of
every single document. As a result, this allows for significant modularity not
only with respect to the topic generation process but the entire pipeline.
NOTE
With clustering, each document is assigned to only a single cluster or topic. In practice, documents
might contain multiple topics, and assigning a multi-topic document to a single topic would not always
be the most accurate method. We will go into this later, as BERTopic has a few ways of handling this,
but it is important to understand that at its core, topic modeling with BERTopic is a clustering task.
You can think of this modularity as building with lego blocks: each part of
the pipeline is completely replaceable with another, similar algorithm. This
“lego block” way of thinking is illustrated in Figure 3-11. The figure also
shows an additional algorithmic lego block that we can use. Although we use
c-TF-IDF to create our initial topic representations, there are a number of
interesting ways we can use LLMs to fine-tune these representations. In the
“Representation Models” section below, we will go into extensive detail on
how this algorithmic lego block works.
Figure 3-11. The modularity of BERTopic is a key component and allows you to build your own topic
model however you want.
Code Overview
Enough talk! This is a hands-on book, so it is finally time for some hands-on
coding. The default pipeline, as illustrated previously in Figure 3-10, only
requires a few lines of code:
However, the modularity that BERTopic is known for and that we have
visualized thus far can also be visualized through a coding example. First, let
us import some relevant packages:
As you might have noticed, most of the imports, like UMAP and HDBSCAN,
are part of the default BERTopic pipeline. Next, let us build the default
pipeline of BERTopic a bit more explicitly and go through each individual step:
This code allows us to go through all steps of the algorithm explicitly and
essentially lets us build the topic model however we want. The resulting topic
model, as defined in the variable topic_model , now represents the base
pipeline of BERTopic as illustrated back in Figure 3-10.
Example
We are going to keep using the abstracts of ArXiv articles throughout this use
case. To recap what we did with text clustering, we start by importing our
dataset using HuggingFace’s datasets package and extracting metadata that we
are going to use later on, like the abstracts, years, and categories of the
articles.
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(abstracts)
With this pipeline, you will have three variables, namely
topic_model , topics , and probs :
topic_model is the model that we have just trained and contains
information about the model and the topics that we created.
topics are the topics assigned to each abstract.
probs are the probabilities that each abstract belongs to its assigned topic.
Before we start to explore our topic model, there is one change that we
need to make so that the results are reproducible. As mentioned before, one of the
underlying models of BERTopic is UMAP. This model is stochastic in nature
which means that every time we run BERTopic, we will get different results.
We can prevent this by passing a `random_state` to the UMAP model.
>>> topic_model.get_topic_info()
Topic Count Name
0 -1 11648 -1_of_the_and_to
1 0 1554 0_question_answer_questions_qa
2 1 620 1_hate_offensive_toxic_detection
3 2 578 2_summarization_summaries_summary_abstractive
4 3 568 3_parsing_parser_dependency_amr
... ... ... ...
317 316 10 316_prf_search_conversational_spoke
318 317 10 317_crowdsourcing_workers_annotators_underl
319 318 10 318_curriculum_nmt_translation_dcl
320 319 10 319_botsim_menu_user_dialogue
321 320 10 320_color_colors_ib_naming
There are many topics generated from our model, XXX! Each of these topics
is represented by several keywords, which are concatenated with a “_” in the
Name column. This Name column allows us to quickly get a feeling of what
the topic is about as it shows the four keywords that best represent it.
NOTE
You might also have noticed that the very first topic is labeled -1. That topic contains all documents
that could not be fitted within a topic and are considered to be outliers. This is a result of the clustering
algorithm, HDBSCAN, that does not force all points to be clustered. To remove outliers, we could
either use a non-outlier algorithm like k-Means or use BERTopic’s reduce_outliers() function to
remove some of the outliers and assign them to topics.
>>> topic_model.get_topic(2)
[('summarization', 0.029974019692323675),
('summaries', 0.018938088406361412),
('summary', 0.018019112468622436),
('abstractive', 0.015758156442697138),
('document', 0.011038627359130419),
('extractive', 0.010607624721836042),
('rouge', 0.00936377058925341),
('factual', 0.005651676100789188),
('sentences', 0.005262910357048789),
('mds', 0.005050565343932314)]
This gives us a bit more context about the topic and helps us understand what
the topic is about. For example, it is interesting to see the word “rouge”
appear since ROUGE is a common metric for evaluating summarization models.
It returns that topic 17 has a relatively high similarity (0.675) with our search
term. If we then inspect the topic, we can see that it is indeed a topic about
topic modeling:
>>> topic_model.get_topic(17)
[('topic', 0.0503756681079549),
('topics', 0.02834246786579726),
('lda', 0.015441277604137684),
('latent', 0.011458141214781893),
('documents', 0.01013764950401255),
('document', 0.009854201885298964),
('dirichlet', 0.009521114618288628),
('modeling', 0.008775384549157435),
('allocation', 0.0077508974418589605),
('clustering', 0.005909325849593925)]
Although we know that this topic is about topic modeling, let us see if the
BERTopic abstract is also assigned to this topic:
It is! It seems that the topic is not just about LDA-based methods but also
cluster-based techniques, like BERTopic.
We can use this technique to see what the topic distribution is of the first
sentence in the BERTopic paper:
(Interactive) Visualizations
Going through XXX topics manually can be quite a task. Instead, several
helpful visualization functions allow us to get a broad overview of the topics
that were generated. Many of them are interactive, thanks to the Plotly
visualization framework.
topic_model.visualize_topics()
Figure 3-14. The intertopic distance map of topics represented in 2D space.
NOTE
We only visualized a selection of topics since showing all 300 topics would result in quite a messy
visualization. Also, instead of passing `abstracts`, we passed `titles` since we only want to view the
titles of each paper when we hover over a document and not the entire abstract.
The bar chart in Figure 3-16 gives a nice indication of which keywords are
most important to a specific topic. Take topic 2 for example–it seems that the
word “summarization” is most representative of that topic and that other
words are very similar in importance.
Representation Models
With the neural-search style modularity that BERTopic employs, it can
leverage many different types of Large Language Models whilst minimizing
computing. This allows for a large range of topic fine-tuning methods, from
part-of-speech to text-generation methods, like ChatGPT. Figure 3-17
demonstrates the variety of LLMs that we can leverage to fine-tune topic
representations.
Figure 3-17. After applying the c-TF-IDF weighting, topics can be fine-tuned with a wide variety of
representation models, many of which are Large Language Models.
Topics generated with c-TF-IDF serve as a good first ranking of words with
respect to their topic. In this section, these initial rankings of words can be
considered candidate keywords for a topic as we might change their rankings
based on any representation model. We will go through several representation
models that can be used within BERTopic and that are also interesting from a
Large Language Modeling standpoint.
Before we start, we first need to do two things. First, we are going to save our
original topic representations so that it will be much easier to compare with
and without representation models:
Second, let’s create a short wrapper that we can use to quickly visualize the
differences in topic words to compare with and without representation
models:
In BERTopic, we want to use something similar but on a topic level and not a
document level. As shown in Figure 3-18, KeyBERTInspired uses c-TF-IDF
to create a set of representative documents per topic by randomly sampling
500 documents per topic, calculating their c-TF-IDF values, and finding the
most representative documents. These documents are embedded and
averaged to be used as an updated topic embedding. Then, the similarity
between our candidate keywords and the updated topic embedding is
calculated to re-rank our candidate keywords.
Figure 3-18. The procedure of the KeyBERTInspired representation model
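The final re-ranking step of this procedure can be sketched as follows. This is only an illustration of the idea, not BERTopic's code: the embeddings are made-up three-dimensional vectors standing in for the output of an embedding model, and `rerank_keywords` is a hypothetical helper name.

```python
import numpy as np


def rerank_keywords(keyword_embeddings, document_embeddings):
    """Rank keywords by cosine similarity to the averaged document embedding."""
    # Average the representative documents into an updated topic embedding
    topic_embedding = document_embeddings.mean(axis=0)
    topic_embedding = topic_embedding / np.linalg.norm(topic_embedding)
    scores = {}
    for keyword, vector in keyword_embeddings.items():
        scores[keyword] = float(vector @ topic_embedding / np.linalg.norm(vector))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up 3-dimensional embeddings, for illustration only
document_embeddings = np.array([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.9, 0.1, 0.0]])
keyword_embeddings = {
    "the": np.array([0.0, 0.0, 1.0]),
    "document": np.array([0.5, 0.5, 0.0]),
    "summarization": np.array([1.0, 0.0, 0.0]),
}

ranked = rerank_keywords(keyword_embeddings, document_embeddings)
print(ranked)
```

Candidate keywords whose embeddings point in the same direction as the topic embedding move to the top, while a frequent but uninformative word like "the" drops to the bottom.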
# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()
The updated model shows that the topics are much easier to read compared to
the original model. It also shows the downside of using embedding-based
techniques. Words in the original model, like “amr” and “qa” are perfectly
reasonable words
Part-of-Speech
c-TF-IDF does not distinguish between the types of words it deems to be
important. Nouns, verbs, adjectives, and even prepositions
can all end up as important keywords. When we want to have human-
readable labels that are straightforward and intuitive to interpret, we might
want topics that are described by, for example, nouns only.
As shown in Figure 3-19, we can use SpaCy to make sure that only nouns
end up in our topic representations. As with most representation models, this
is highly efficient since the nouns are extracted from only a small but
representative subset of the data.
# Part-of-Speech tagging
from bertopic.representation import PartOfSpeech
representation_model = PartOfSpeech("en_core_web_sm")
Figure 3-20. The procedure of the Maximal Marginal Relevance representation model. The diversity of
the resulting keywords is represented by lambda (λ).
The resulting topics are much more diverse! Topic XXX, which originally
used a lot of “summarization” words, now only contains the word
“summarization”. Also, duplicates, like “embedding” and “embeddings” are
now removed.
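The idea behind Maximal Marginal Relevance can be sketched as a greedy selection: start with the keyword most similar to the topic, then repeatedly add the keyword that is still relevant to the topic but least similar to the keywords already chosen. The sketch below is an illustration under assumptions, not BERTopic's implementation: the two-dimensional vectors are made up, and the way the `diversity` parameter weights relevance against redundancy is one common convention.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(topic_vector, keyword_vectors, top_n, diversity):
    """Greedily pick keywords that are relevant to the topic but unlike earlier picks."""
    relevance = {k: cosine(v, topic_vector) for k, v in keyword_vectors.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(top_n, len(keyword_vectors)):
        remaining = [k for k in keyword_vectors if k not in selected]
        scores = {
            k: (1 - diversity) * relevance[k]
            - diversity
            * max(cosine(keyword_vectors[k], keyword_vectors[s]) for s in selected)
            for k in remaining
        }
        selected.append(max(scores, key=scores.get))
    return selected


# Made-up 2-dimensional embeddings, for illustration only
topic_vector = [1.0, 0.0]
keyword_vectors = {
    "summarization": [1.0, 0.1],
    "summaries": [1.0, 0.12],
    "rouge": [0.6, 0.8],
}

print(mmr(topic_vector, keyword_vectors, top_n=2, diversity=0.7))
print(mmr(topic_vector, keyword_vectors, top_n=2, diversity=0.0))
```

With a high `diversity`, the near-duplicate "summaries" is skipped in favor of "rouge". With `diversity=0.0`, the selection falls back to pure relevance and picks both "summarization" words.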
Text Generation
Text generation models have shown great potential in 2023. They perform
well across a wide range of tasks and allow for extensive creativity in
prompting. Their capabilities are not to be underestimated and not using them
in BERTopic would frankly be a waste. We talked at length about these
models in Chapter XXX, but it’s useful now to see how they tie into the topic
modeling process.
Figure 3-21. Use text generative LLMs and prompt engineering to create labels for topics from
keywords and documents related to each topic.
Prompting
prompt = """
I have a topic that contains the following documents: \n
The topic is described by the following keywords: [KEYWORDS]
Second, the keywords that make up a topic are also passed to the prompt and
referenced using the “[KEYWORDS]” tag. These keywords could also
already be optimized using KeyBERTInspired, PartOfSpeech, or any
representation model.
Third, we give specific instructions to the Large Language Model. This is just
as important as the steps before since this will decide how the model
generates the label.
"""
I have a topic that contains the following documents:
- Our videos are also made possible by your support on patreon.
- If you want to help us make more videos, you can do so on pat
- If you want to help us make more videos, you can do so there.
- And if you want to support us in our endeavor to survive in t
BERTopic allows for using such a model to generate topic labels. We create a
prompt and ask it to create topics based on the keywords of each topic,
labeled with the `[KEYWORDS]` tag.
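Conceptually, filling such a prompt is plain string substitution. The sketch below shows the idea: the `[KEYWORDS]` tag comes from the text above, while the `[DOCUMENTS]` tag, the helper name `fill_prompt`, and the example keywords and documents are assumptions made for this illustration.

```python
def fill_prompt(template, keywords, documents):
    """Substitute the tags in a prompt template with a topic's keywords and documents."""
    document_list = "\n".join(f"- {doc}" for doc in documents)
    return (
        template
        .replace("[KEYWORDS]", ", ".join(keywords))
        .replace("[DOCUMENTS]", document_list)
    )


template = (
    "I have a topic that contains the following documents:\n[DOCUMENTS]\n"
    "The topic is described by the following keywords: [KEYWORDS]\n"
)
filled = fill_prompt(
    template,
    keywords=["summarization", "summaries", "abstractive"],
    documents=["A first example abstract.", "A second example abstract."],
)
print(filled)
```

The filled prompt is what would be sent to the text generation model, once per topic.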
There are interesting topic labels that are created but we can also see that the
model is not perfect by any means.
OpenAI
When we are talking about generative AI, we cannot forget about ChatGPT
and its incredible performance. Although not open source, it makes for an
interesting model that has changed the AI field in just a few months. We can
select any text generation model from OpenAI’s collection to use in
BERTopic.
Cohere
As with OpenAI, we can use Cohere’s API within BERTopic on top of its
pipeline to further fine-tune the topic representations with a generative text
model. Make sure to grab an API key and you can start generating topic
representations.
import cohere
from bertopic.representation import Cohere
# Cohere Representation Model
co = cohere.Client(my_api_key)
representation_model = Cohere(co)
LangChain
To take things a step further with Large Language Models, we can leverage
the LangChain framework. It allows for any of the previous text generation
methods to be supplemented with additional information or even chained
together. Most notably, LangChain connects language models to other
sources of data to enable them to interact with their environment.
For example, we could use it to build a vector database with OpenAI and
apply ChatGPT on top of that database. As we want to minimize the amount
of information LangChain needs, the most representative documents are
passed to the package. Then, we could use any LangChain-supported
language model to extract the topics. The example below demonstrates the
use of OpenAI with LangChain.
The field of topic modeling is quite broad and ranges from many different
applications to variations of the same model. This also holds for BERTopic
as it has implemented a wide range of variations for different purposes, such
as dynamic, (semi-) supervised, online, hierarchical, and guided topic
modeling. Figure 3-22 shows a number of topic modeling variations and
how to implement them in BERTopic.
Figure 3-22. Topic Modeling Variations in BERTopic
Summary
In this chapter we discussed a cluster-based method for topic modeling,
BERTopic. By leveraging a modular structure, we used a variety of Large
Language Models to create document representations and fine-tune topic
representations. We extracted the topics found in ArXiv abstracts and saw
how we could use BERTopic’s modular structure to develop different kinds
of topic representations.
Chapter 4. Tokens & Token Embeddings
This will be the 8th chapter of the final book. Please note that the GitHub
repo will be made active later on.
The majority of the embeddings we’ve looked at so far are text embeddings,
vectors that represent an entire sentence, passage, or document. Token
embeddings, in contrast, assign one vector to each word or token. Figure 4-1
shows this distinction.
Figure 4-1. The difference between text embeddings (one vector for a sentence or paragraph) and token
embeddings (one vector per word or token).
LLM Tokenization
Figure 4-2. High-level view of a language model and its input prompt.
Let us look closer into that generation process to examine more of the steps
involved in text generation. Let’s start by loading our model and its
tokenizer.
We can then proceed to the actual generation. Notice that the generation code
always includes a tokenization step prior to the generation step.
Looking at this code, we can see that the model does not in fact receive the
text prompt. Instead, the tokenizers processed the input prompt, and returned
the information the model needed in the variable input_ids, which the model
used as its input.
This reveals the inputs that LLMs respond to. A series of integers as shown in
Figure 4-3. Each one is the unique ID for a specific token (character, word or
part of word). These IDs reference a table inside the tokenizer containing all
the tokens it knows.
Figure 4-3. A tokenizer processes the input prompt and prepares the actual input into the language
model: a list of token ids.
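To make the idea of a lookup table concrete, here is a minimal sketch of the two-way mapping between tokens and IDs. The six-entry vocabulary and the `encode` and `decode` helpers are hypothetical and only for illustration; a real tokenizer also handles splitting the raw text and has tens of thousands of entries.

```python
# Toy illustration: a tokenizer's vocabulary is a two-way table between
# token strings and integer IDs. This vocabulary is hypothetical.
toy_vocab = ["<s>", "Write", "an", "email", "to", "Sarah"]
token_to_id = {token: i for i, token in enumerate(toy_vocab)}


def encode(tokens):
    """Map already-split tokens to their integer IDs."""
    return [token_to_id[token] for token in tokens]


def decode(ids):
    """Map integer IDs back to their token strings."""
    return [toy_vocab[i] for i in ids]


toy_ids = encode(["<s>", "Write", "an", "email", "to", "Sarah"])
print(toy_ids)          # [0, 1, 2, 3, 4, 5]
print(decode(toy_ids))  # ['<s>', 'Write', 'an', 'email', 'to', 'Sarah']
```

The model only ever sees the list of integers; the strings exist solely inside the tokenizer.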
If we want to inspect those IDs, we can use the tokenizer’s decode method to
translate the IDs back into text that we can read:
for token_id in input_ids[0]:
    print(tokenizer.decode(token_id))
Which prints:
<s>
Write
an
email
apolog
izing
to
Sarah
for
the
trag
ic
garden
ing
m
ish
ap
.
Exp
lain
how
it
happened
.
This is how the tokenizer broke down our input prompt. Notice the following:
The first token is the token with ID #1, which is <s>, a special token
indicating the beginning of the text
Some tokens are complete words (e.g., Write, an, email)
Some tokens are parts of words (e.g., apolog, izing, trag, ic)
Punctuation characters are their own token
Notice how the space character does not have its own token. Instead,
partial tokens (like ‘izing’ and ‘ic’) have a special hidden character at their
beginning that indicates that they’re connected with the token that precedes
them in the text.
There are three major factors that dictate how a tokenizer breaks down an
input prompt. First, at model design time, the creator of the model chooses a
tokenization method. Popular methods include Byte-Pair Encoding (BPE for
short, widely used by GPT models), WordPiece (used by BERT), and
SentencePiece (used by LLaMA). These methods are similar in that they aim
to optimize an efficient set of tokens to represent a text dataset, but they
arrive at it in different ways. Second, the creator chooses the parameters and
special tokens used to initialize the tokenizer. Third, the tokenizer is trained
on a specific dataset, which determines the vocabulary it ends up with.
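As a rough illustration of how a method like BPE builds its vocabulary, the sketch below repeatedly merges the most frequent adjacent pair of symbols in a tiny corpus. The corpus, the helper names, and the choice of two merges are all assumptions made for this example; production tokenizers add byte-level handling, pre-tokenization, and many thousands of merges.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical corpus: word -> count. Each word starts as a tuple of characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(word): count for word, count in corpus.items()}

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print(words)
```

After two merges, the frequent string "low" has become a single symbol, while the rarer endings of "lower" and "lowest" are still spelled out character by character.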
In addition to being used to process the input text into a language model,
tokenizers are used on the output of the language model to turn the resulting
token ID into the output word or token associated with it as Figure 4-4 shows.
Figure 4-4. Tokenizers are also used to process the output of the model by converting the output token
ID into the word or token associated with that ID.
Word tokens
This approach was common with earlier methods like Word2Vec but is
being used less and less in NLP. Its usefulness, however, led it to be used
outside of NLP for use cases such as recommendation systems, as we’ll
see later in the chapter.
Figure 4-5. There are multiple methods of tokenization that break down the text to different sizes of
components (words, subwords, characters, and bytes).
Subword Tokens
When compared to character tokens, subword tokenization benefits from the
ability to fit more text within the limited context length of a Transformer
model. So with a model with a context length of 1024, you may be able to fit
about three times as much text using subword tokenization as using character
tokens (subword tokens often average three characters per token).
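The arithmetic behind that estimate can be written out directly. The figure of three characters per token is the rough average quoted above, not a measured value for any particular tokenizer.

```python
context_length = 1024        # number of tokens the model can attend to
chars_per_subword_token = 3  # rough average quoted in the text
chars_per_char_token = 1     # one character per token by definition

chars_with_subwords = context_length * chars_per_subword_token
chars_with_characters = context_length * chars_per_char_token

print(chars_with_subwords, chars_with_characters)   # 3072 1024
print(chars_with_subwords / chars_with_characters)  # 3.0
```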
Character Tokens
This is another method that is able to deal successfully with new words
because it has the raw letters to fall back on. While that makes the
representation easier to tokenize, it makes the modeling more difficult.
Where a model with subword tokenization can represent “play” as one
token, a model using character-level tokens needs to model the
information to spell out “p-l-a-y” in addition to modeling the rest of the
sequence.
Byte Tokens
One additional tokenization method breaks down tokens into the
individual bytes that are used to represent Unicode characters. Papers like
“CANINE: Pre-training an Efficient Tokenization-Free Encoder for
Language Representation” outline methods like this, which are also called
“tokenization-free encoding.” Other works like “ByT5: Towards a token-free
future with pre-trained byte-to-byte models” show that this can be a
competitive method.
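A quick way to see what byte tokens look like is to encode a string as UTF-8 using only the Python standard library. The sample string is an arbitrary choice for this sketch: ASCII letters take one byte each, while the emoji takes four.

```python
# "play" followed by a space and a musical note emoji (U+1F3B5).
text = "play \U0001F3B5"

# Every byte (0-255) can serve as a token ID, so no input is ever "unknown".
byte_tokens = list(text.encode("utf-8"))

print(byte_tokens)                 # [112, 108, 97, 121, 32, 240, 159, 142, 181]
print(len(text), len(byte_tokens)) # 6 9

# Decoding the full byte sequence recovers the original text.
print(bytes(byte_tokens).decode("utf-8") == text)  # True
```

The trade-off is sequence length: six characters become nine tokens here, and text in non-Latin scripts grows even more.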
Earlier, we pointed out three major factors that dictate the tokens that
appear within a tokenizer: the tokenization method, the parameters and
special tokens we use to initialize the tokenizer, and the dataset the tokenizer
is trained on. Let’s compare and contrast a number of actual, trained
tokenizers to see how these choices change their behavior.
This will allow us to see how each tokenizer deals with a number of different
kinds of tokens:
Capitalization
Languages other than English
Emojis
Programming code with its keywords and the whitespace often used for
indentation (in languages like Python, for example)
Numbers and digits
Let’s go from older to newer tokenizers and see how they tokenize this text
and what that might say about the language model. We’ll tokenize the text,
and then print each token with a gray background color.
bert-base-uncased
'sep_token': '[SEP]'
'pad_token': '[PAD]'
'cls_token': '[CLS]'
'mask_token': '[MASK]'
Tokenized text:
With the uncased (and more popular) version of the BERT tokenizer, we
notice the following:
The newline breaks are gone, which makes the model blind to information
encoded in newlines (e.g., a chat log when each turn is in a new line)
All the text is in lower case
The word “capitalization” is encoded as two subtokens: capital ##ization.
The ## characters are used to indicate that this token is a partial token
connected to the token that precedes it. This is also a method to indicate
where the spaces are: tokens without ## before them are assumed to have a
space before them.
The emoji and Chinese characters are gone and replaced with the [UNK]
special token indicating an “unknown token”.
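The observations above can be reproduced with a small sketch of greedy longest-match-first subword splitting, the general strategy WordPiece-style tokenizers use when applying a vocabulary. The four-entry vocabulary and the `wordpiece_tokenize` helper are hypothetical; the real BERT tokenizer also normalizes text and splits punctuation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical, tiny uncased vocabulary.
toy_vocab = {"english", "and", "capital", "##ization"}

tokens = []
for word in "English and CAPITALIZATION".lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(tokens)  # ['english', 'and', 'capital', '##ization']
```

Lowercasing happens before the lookup, which is why an uncased tokenizer cannot tell "CAPITALIZATION" from "capitalization".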
bert-base-cased
Tokenized text:
[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##ION [UNK] [UN
The cased version of the BERT tokenizer differs mainly in including
uppercase tokens.
gpt2
Tokenized text:
������
12 . 0 * 50 = 600
The emoji and Chinese characters are now each represented by multiple
tokens. While we see these tokens printed as the � character, they actually
stand for different tokens. For example, the emoji is broken down into the
tokens with token IDs 8582, 236, and 113. The tokenizer is successful in
reconstructing the original character from these tokens. We can see that by
printing tokenizer.decode([8582, 236, 113]), which prints out the original
emoji.
The two tabs are represented as two tokens (token number 197 in that
vocabulary) and the four spaces are represented as three tokens (number 220)
with the final space being a part of the token for the closing quote character.
NOTE
What is the significance of whitespace characters? These are important for models that understand or
generate code. A model that uses a single token to represent four consecutive whitespace characters
can be said to be more tuned to a Python code dataset. While a model can live with representing them as
four different tokens, it does make the modeling more difficult as the model needs to keep track of the
indentation level. This is an example of where tokenization choices can help the model improve on a
certain task.
google/flan-t5-xxl
Tokenization method: SentencePiece, introduced in SentencePiece: A simple
and language independent subword tokenizer and detokenizer for Neural Text
Processing
Special tokens:
- 'unk_token': '<unk>'
- 'pad_token': '<pad>'
Tokenized text:
The FLAN-T5 family of models uses the SentencePiece method. We notice the
following:
GPT-4
Tokenization method: BPE
Special tokens:
<|endoftext|>
Fill in the middle tokens. These three tokens enable the GPT-4 capability of
generating a completion given not only the text before it but also considering
the text after it. This method is explained in more detail in the paper Efficient
Training of Language Models to Fill in the Middle. These special tokens are:
<|fim_prefix|>
<|fim_middle|>
<|fim_suffix|>
Tokenized text:
The GPT-4 tokenizer represents the four spaces as a single token. In fact,
it has a specific token for every sequence of white spaces up to a sequence
of 83 white spaces.
The Python keyword elif has its own token in GPT-4. Both this and the
previous point stem from the model’s focus on code in addition to natural
language.
The GPT-4 tokenizer uses fewer tokens to represent most words. Examples
here include ‘CAPITALIZATION’ (two tokens vs. four) and ‘tokens’
(one token vs. three).
bigcode/starcoder
Tokenization method:
Special tokens:
'<|endoftext|>'
'<fim_prefix>'
'<fim_middle>'
'<fim_suffix>'
'<fim_pad>'
When representing code, managing the context is important. One file might
make a function call to a function that is defined in a different file. So the
model needs some way of being able to identify code that is in different files
in the same code repository, while making a distinction between code in
different repos. That’s why starcoder uses special tokens for the name of the
repository and the filename:
'<filename>'
'<reponame>'
'<gh_stars>'
The tokenizer also includes a number of special tokens to perform better on
code. These include:
'<issue_start>'
'<jupyter_start>'
'<jupyter_text>'
Paper: StarCoder: may the source be with you!
Tokenized text:
facebook/galactica-1.3b
Tokenization method:
Special tokens:
<s>
<pad>
</s>
<unk>
[START_REF]
[END_REF]
Step-by-step reasoning: <work> is an interesting token that the model uses
for chain-of-thought reasoning.
Tokenized text:
The Galactica tokenizer behaves similarly to StarCoder in that it has code in
mind. It also encodes whitespace in the same way, assigning a single token
to sequences of whitespace of different lengths. It differs in that it also does
that for tabs, though. So of all the tokenizers we’ve seen so far, it’s the
only one that assigns a single token to the string made up of two tabs
('\t\t').
We can now recap our tour by looking at all these examples side by side:
bert-base-uncased
[CLS] english and capital ##ization [UNK] [UNK] show
bert-base-cased
[CLS] English and CA ##PI ##TA ##L ##I ##Z ##AT ##IO
gpt2
English and CAP ITAL IZ ATION � � � � � � show
google/flan-t5-xxl
English and CA PI TAL IZ ATION <unk> <unk> show _ to
GPT-4
English and CAPITAL IZATION � � � � � � show _t
bigcode/starcoder
English and CAPITAL IZATION � � � � � show _ tok
facebook/galactica-1.3b
English and CAP ITAL IZATION � � � � � � � sho
meta-llama/Llama-2-70b-chat-hf
<s> English and C AP IT AL IZ ATION � � � � � �
Notice how there’s a new tokenizer added at the bottom. By now, you should
be able to understand many of its properties by just glancing at this output.
This is the tokenizer for LLaMA2, the most recent of these models.
Tokenizer Properties
Tokenizer Parameters
Vocabulary size
How many tokens to keep in the tokenizer’s vocabulary? (30K, 50K are
often used vocabulary size values, but more and more we’re seeing larger
sizes like 100K)
Special tokens
What special tokens do we want the model to keep track of? We can add as
many of these as we want, especially if we want to build an LLM for special
use cases. Common choices include:
Aside from these, the LLM designer can add tokens that help better model
the domain of the problem they’re trying to focus on, as we’ve seen with
Galactica’s <work> and [START_REF] tokens.
Capitalization
In languages such as English, how do we want to deal with capitalization?
Should we convert everything to lowercase? Name capitalization often
carries useful information, but do we want to waste token vocabulary
space on all-caps versions of words? This is why some models are
released in both cased and uncased versions (like BERT-base cased and the
more popular BERT-base uncased).
Even if we select the same method and parameters, tokenizer behavior will be
different based on the dataset it was trained on (before we even start model
training). The tokenization methods mentioned previously work by
optimizing the vocabulary to represent a specific dataset. From our guided
tour we’ve seen how that has an impact on datasets like code, and
multilingual text.
For code, for example, we’ve seen that a text-focused tokenizer may tokenize
the indentation spaces like this (We’ll highlight some tokens in yellow and
green):
These tokenization choices make the model’s job easier and thus its
performance has a higher probability of improving.
The language model holds an embedding vector for each token in the
tokenizer’s vocabulary as we can see in Figure 4-6. In the beginning, these
vectors are randomly initialized like the rest of the model’s weights, but the
training process assigns them the values that enable the useful behavior
they’re trained to perform.
Figure 4-6. A language model holds an embedding vector associated with each token in its tokenizer.
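The embedding table described here can be pictured as a matrix with one row per token ID. The sketch below uses the standard library only; the vocabulary size, the embedding dimension, and the initialization scale are arbitrary values chosen for illustration.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # hypothetical, tiny sizes

# One row per token ID, randomly initialized before any training happens.
embedding_matrix = [
    [random.gauss(0.0, 0.02) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the embedding row for each token ID."""
    return [embedding_matrix[token_id] for token_id in token_ids]


vectors = embed([0, 3, 3, 5])
print(len(vectors), len(vectors[0]))  # 4 4
print(vectors[1] == vectors[2])       # True
```

The repeated token ID 3 yields the same vector both times. These raw embeddings are static: they do not depend on the tokens around them.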
Now that we’ve covered token embeddings as the input to a language model,
let’s look at how language models can create better token embeddings. This
is one of the main ways of using language models for text representation that
empowers applications like named-entity recognition or extractive text
summarization (which summarizes a long text by highlighting the most
important parts of it, instead of generating new text as a summary).
Figure 4-7. Language models produce contextualized token embeddings that improve on raw, static
token embeddings
This code downloads a pre-trained tokenizer and model, then uses them to
process the string “Hello world”. The output of the model is then saved in the
output variable. Let’s inspect that variable by first printing its dimensions (we
expect it to be a multi-dimensional array).
The model we’re using here is called DeBERTa v3, which, at the time of
writing, is one of the best-performing language models for token embeddings
while being small and highly efficient. It is described in the paper
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training
with Gradient-Disentangled Embedding Sharing.
output.shape
This prints out:
torch.Size([1, 4, 384])
We can ignore the first dimension and read this as four tokens, each one
embedded as a vector of 384 values.
But what are these four vectors? Did the tokenizer break the two words into
four tokens, or is something else happening here? We can use what we’ve
learned about tokenizers to inspect them:
[CLS]
Hello
world
[SEP]
Which shows that this particular tokenizer and model operate by adding the
[CLS] and [SEP] tokens to the beginning and end of a string.
Our language model has now processed the text input. The result of its output
is the following:
tensor([[
[-3.3060, -0.0507, -0.1098, ..., -0.1704, -0.1618, 0.6932],
[ 0.8918, 0.0740, -0.1583, ..., 0.1869, 1.4760, 0.0751],
[ 0.0871, 0.6364, -0.3050, ..., 0.4729, -0.1829, 1.0157],
[-3.1624, -0.1436, -0.0941, ..., -0.0290, -0.1265, 0.7954]
]], grad_fn=<NativeLayerNormBackward0>)
A visual like this is essential for the next chapter when we start to look at
how Transformer-based LLMs work under the hood.
Word Embeddings
Token embeddings are useful even outside of large language models.
Embeddings generated by pre-LLM methods like Word2Vec, GloVe, and
fastText still have uses in NLP and beyond. In this section, we’ll look at
how to use pre-trained Word2Vec embeddings and touch on how the method
creates word embeddings. Seeing how Word2Vec is trained will prime you
for the chapter on contrastive training. Then in the following section, we’ll
see how those embeddings can be used for recommendation systems.
Let’s look at how we can download pre-trained word embeddings using the
Gensim library:
import gensim
import gensim.downloader as api
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Download embeddings (66MB, glove, trained on wikipedia, vecto
# Other options include "word2vec-google-news-300"
# More options at https://github.com/RaRe-Technologies/gensim-d
model = api.load("glove-wiki-gigaword-50")
model.most_similar([model['king']], topn=11)
Which outputs:
[('king', 1.0000001192092896),
('prince', 0.8236179351806641),
('queen', 0.7839043140411377),
('ii', 0.7746230363845825),
('emperor', 0.7736247777938843),
('son', 0.766719400882721),
('uncle', 0.7627150416374207),
('kingdom', 0.7542161345481873),
('throne', 0.7539914846420288),
('brother', 0.7492411136627197),
('ruler', 0.7434253692626953)]
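The scores in that output are cosine similarities between embedding vectors. As a sketch of what such a ranking involves, the example below scores a few hand-made three-dimensional vectors. The vectors and the `toy_most_similar` helper are invented for illustration and are not taken from the GloVe model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, far smaller than the 50 dimensions used above.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}


def toy_most_similar(word, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = toy_vectors[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in toy_vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


print(toy_most_similar("king"))
```

Unlike the Gensim call above, which is queried with a raw vector and therefore returns 'king' itself at the top, this helper excludes the query word from the results.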
Just like LLMs, word2vec is trained on examples generated from text. Let’s
say for example, we have the text "Thou shalt not make a machine in the
likeness of a human mind" from the Dune novels by Frank Herbert. The
algorithm uses a sliding window to generate training examples. We can, for
example, have a window size of two, meaning that we consider two neighbors on
each side of a central word.
The embeddings are generated from a classification task. This task is used to
train a neural network to predict if words appear in the same context or not.
We can think of this as a neural network that takes two words and outputs 1 if
they tend to appear in the same context, and 0 if they do not.
In the first position for the sliding window, we can generate four training
examples as we can see in Figure 4-9.
Figure 4-9. A sliding window is used to generate training examples for the word2vec algorithm to later
predict if two words are neighbors or not.
In each of the produced training examples, the word in the center is used as
one input, and each of its neighbors is a distinct second input in each training
example. We expect the final trained model to be able to classify this
neighbor relationship and output 1 if the two input words it receives are
indeed neighbors.
Figure 4-10. Each generated training example shows a pair of neighboring words.
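The sliding window can be sketched in a few lines. The `skipgram_pairs` helper below is a hypothetical name, and the lowercased, whitespace-split sentence is a simplifying assumption about tokenization.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each center word with every neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j], 1))  # label 1: true neighbors
    return pairs


sentence = "thou shalt not make a machine in the likeness of a human mind".split()
pairs = skipgram_pairs(sentence, window=2)

# With a window size of two, the center word "not" yields four examples.
print([pair for pair in pairs if pair[0] == "not"])
# [('not', 'thou', 1), ('not', 'shalt', 1), ('not', 'make', 1), ('not', 'a', 1)]
```

Words near the edges of the text get fewer neighbors, since the window is clipped at the start and end.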
If, however, we have a dataset of only a target value of 1, then a model can
ace it by outputting 1 all the time. To get around this, we need to enrich our
training dataset with examples of words that are not typically neighbors.
These are called negative examples and are shown in Figure 4-11.
Figure 4-11. We need to present our models with negative examples: words that are not usually
neighbors. A better model is able to better distinguish between the positive and negative examples.
It turns out that we don’t have to be too scientific in how we choose the
negative examples. A lot of useful models result from the simple ability to
distinguish positive examples from randomly generated examples (inspired by
an important idea called Noise Contrastive Estimation and described in
“Noise-contrastive estimation: A new estimation principle for unnormalized
statistical models”). So in this case, we get random words and add them to the
dataset and indicate that they are not neighbors (and thus the model should
output 0 when it sees them).
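A minimal sketch of this random sampling might look as follows. The helper name, the number of negatives per positive, and the uniform sampling are assumptions for illustration; practical implementations usually sample according to word frequency.

```python
import random


def add_negative_samples(positive_pairs, vocabulary, num_negatives=2, seed=0):
    """Follow each positive pair with randomly drawn negative pairs."""
    rng = random.Random(seed)
    dataset = []
    for center, neighbor in positive_pairs:
        dataset.append((center, neighbor, 1))  # real neighbors: target 1
        for _ in range(num_negatives):
            # A random word is assumed not to be a neighbor: target 0.
            dataset.append((center, rng.choice(vocabulary), 0))
    return dataset


vocabulary = ["thou", "shalt", "not", "make", "a", "machine", "apothecary", "taco"]
positives = [("not", "thou"), ("not", "shalt"), ("not", "make"), ("not", "a")]

dataset = add_negative_samples(positives, vocabulary)
for example in dataset[:3]:
    print(example)
```

A random draw will occasionally pick a word that really is a neighbor. With a realistic vocabulary this is rare enough that the noise does little harm.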
With this, we’ve seen two of the main concepts of word2vec (Figure 4-12):
Skipgram, the method of selecting neighboring words, and negative sampling,
adding negative examples by random sampling from the dataset.
Figure 4-12. Skipgram and Negative Sampling are two of the main ideas behind the word2vec
algorithm and are useful in many other problems that can be formulated as token sequence problems.
We can generate millions and even billions of training examples like this
from running text. Before proceeding to train a neural network on this
dataset, we need to make a couple of tokenization decisions, which, just like
we’ve seen with LLM tokenizers, include how to deal with capitalization and
punctuation and how many tokens we want in our vocabulary.
We then create an embedding vector for each token, and randomly initialize
them, as can be seen in Figure 4-13. In practice, this is a matrix of
dimensions vocab_size x embedding_dimensions.
Figure 4-13. A vocabulary of words and their starting, random, untrained embedding vectors.
Figure 4-14. A neural network is trained to predict if two words are neighbors. It updates the
embeddings in the training process to produce the final, trained embeddings.
Based on whether its prediction was correct or not, the typical machine
learning training step updates the embeddings so that the next time the model
is presented with those two vectors, it has a better chance of being correct.
And by the end of the training process, we have better embeddings for all the
tokens in our vocabulary.
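One way to picture such a training step is a logistic prediction on the dot product of two embedding vectors, followed by a gradient update to both. This is a simplified sketch under stated assumptions (plain stochastic gradient descent, a fixed learning rate, a single pair of vectors), not the exact procedure of any particular word2vec implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(center_vec, context_vec, label, lr=0.1):
    """Run one gradient step on the logistic loss; return the prior prediction."""
    score = sum(c * o for c, o in zip(center_vec, context_vec))
    prediction = sigmoid(score)   # probability that the words are neighbors
    error = prediction - label    # gradient of the loss w.r.t. the score
    for k in range(len(center_vec)):
        c, o = center_vec[k], context_vec[k]
        center_vec[k] -= lr * error * o   # update both vectors in place
        context_vec[k] -= lr * error * c
    return prediction


rng = random.Random(0)
dim = 8  # hypothetical embedding size
center = [rng.uniform(-0.5, 0.5) for _ in range(dim)]
context = [rng.uniform(-0.5, 0.5) for _ in range(dim)]

before = train_step(center, context, label=1)
for _ in range(200):
    train_step(center, context, label=1)
after = train_step(center, context, label=1)

print(before, after)
```

Repeatedly showing the same positive pair pushes the prediction toward 1. Negative pairs (label 0) push the two vectors apart in the same way.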
This idea of a model that takes two vectors and predicts if they have a certain
relation is one of the most powerful ideas in machine learning, and time after
time has proven to work very well with language models. This is why we’re
dedicating chapter XXX to go over this concept and how it optimizes
language models for specific tasks (like sentence embeddings and retrieval).
The same idea is also central to bridging modalities like text and images
which is key to AI Image generation models. In that formulation, a model is
presented with an image and a caption, and it should predict whether that
caption describes this image or not.
In this section we’ll use the Word2vec algorithm to embed songs using
human-made music playlists. Imagine if we treated each song as we would a
word or token, and we treated each playlist like a sentence. These
embeddings can then be used to recommend similar songs which often appear
together in playlists.
The dataset we’ll use was collected by Shuo Chen from Cornell University.
The dataset contains playlists from hundreds of radio stations around the US.
Figure 4-15 demonstrates this dataset.
Figure 4-15. For song embeddings that capture song similarity we’ll use a dataset made up of a
collection of playlists, each containing a list of songs.
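To see what treating playlists as sentences means in practice, the sketch below turns two made-up playlists into neighbor pairs, the same kind of training examples generated from text earlier. Song IDs 3822 and 842 appear in this chapter; the other IDs, the playlists themselves, and the `playlist_pairs` helper are hypothetical.

```python
# Hypothetical playlists: each is an ordered list of song IDs stored as
# strings, so that each ID plays the role of a word.
playlists = [
    ["3822", "500", "842"],
    ["842", "413", "3822"],
]


def playlist_pairs(playlists, window=1):
    """Pair each song with the songs that sit near it in the same playlist."""
    pairs = []
    for playlist in playlists:
        for i, song in enumerate(playlist):
            lo = max(0, i - window)
            hi = min(len(playlist), i + window + 1)
            pairs.extend((song, playlist[j]) for j in range(lo, hi) if j != i)
    return pairs


song_pairs = playlist_pairs(playlists)
print(song_pairs[:3])  # [('3822', '500'), ('500', '3822'), ('500', '842')]
```

Pairs never cross playlist boundaries, just as word pairs would not cross from one sentence into the next.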
Let’s demonstrate the end product before we look at how it’s built. So let’s
give it a few songs and see what it recommends in response.
Let’s start by giving it Michael Jackson’s Billie Jean, the song with ID
#3822.
print_recommendations(3822)
title Billie Jean
artist Michael Jackson
Recommendations:
id title artist
That looks reasonable. Madonna, Prince, and other Michael Jackson songs
are the nearest neighbors.
Let’s step away from Pop and into Rap, and see the neighbors of 2Pac’s
California Love:
print_recommendations(842)
id title artist
song_id = 2172
# Ask the model for songs similar to song #2172
model.wv.most_similar(positive=str(song_id))
Which outputs:
[('2976', 0.9977465271949768),
('3167', 0.9977430701255798),
('3094', 0.9975950717926025),
('2640', 0.9966474175453186),
('2849', 0.9963167905807495)]
And that is the list of the songs whose embeddings are most similar to song
2172. See the Jupyter notebook for the code that links song IDs to their names
and artist names.
This results in recommendations that are all in the same heavy metal and hard
rock genre:
id title artist
Tokenizers are the first step in processing the input to an LLM: turning
text into a list of token IDs.
Some of the common tokenization schemes include breaking text down
into words, subword tokens, characters, or bytes.
A tour of real-world pre-trained tokenizers (from BERT to GPT2, GPT4,
and other models) showed us areas where some tokenizers are better (e.g.,
preserving information like capitalization, new lines, or tokens in other
languages) and other areas where tokenizers are just different from each
other (e.g., how they break down certain words).
Three of the major tokenizer design decisions are the tokenizer algorithm
(e.g., BPE, WordPiece, SentencePiece), tokenization parameters
(including vocabulary size, special tokens, and the treatment of
capitalization and different languages), and the dataset the tokenizer is
trained on.
Language models are also creators of high-quality contextualized token
embeddings that improve on raw static embeddings. Those contextualized
token embeddings are what’s used for tasks including NER, extractive text
summarization, and span classification.
Before LLMs, word embedding methods like word2vec, GloVe, and
fastText were popular. They still have some use cases within and outside
of language processing.
The Word2Vec algorithm relies on two main ideas: Skipgram and
Negative Sampling. It also uses contrastive training similar to the one
we’ll see in the contrastive training chapter.
Token embeddings are useful for creating and improving recommender
systems as we’ve seen in the music recommender we’ve built from
curated song playlists.
About the Authors
Jay Alammar is Director and Engineering Fellow at Cohere (pioneering
provider of large language models as an API). In this role, he advises and
educates enterprises and the developer community on using language models
for practical use cases. Through his popular AI/ML blog, Jay has helped
millions of researchers and engineers visually understand machine learning
tools and concepts from the basic (ending up in the documentation of
packages like NumPy and pandas) to the cutting-edge (Transformers, BERT,
GPT-3). Jay is also a co-creator of popular machine learning and natural
language processing courses on Udacity.
Maarten Grootendorst is the author and maintainer of several open source
packages that rely on
the strength of Large Language Models, such as BERTopic, PolyFuzz, and
KeyBERT. His packages are downloaded millions of times and are used by
data professionals and organizations across the world.