If you've ever wondered how AI models like ChatGPT actually "read" and "understand" text, the answer starts with something called tokens. Let's break down what tokens are and why they matter.
What Are Tokens?
A token is a chunk of text that a model reads or generates. Here's the important part: tokens aren't exactly the same as words.
They can be:

- Whole words (like "hello")
- Parts of words (like "un" and "happy" from "unhappy")
- Even punctuation marks
Each model has its own way of breaking text into tokens, using something called a tokenizer. Different models use different tokenizers, which means the same text might be split differently depending on which AI you're using.
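To make "different tokenizers split the same text differently" concrete, here's a toy sketch. Neither rule below is any real model's tokenizer; they're deliberately simple stand-ins showing that the splitting rule, not the text, decides what the tokens are:

```python
import re

def word_tokenizer(text):
    # Rule 1: split into whole words, keeping punctuation as its own token
    return re.findall(r"\w+|[^\w\s]", text)

def chunk_tokenizer(text):
    # Rule 2: a crude stand-in for subword splitting -- chop each word
    # into two-character chunks
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

text = "unhappy!"
print(word_tokenizer(text))   # ['unhappy', '!']
print(chunk_tokenizer(text))  # ['un', 'ha', 'pp', 'y!']
```

Same input, two completely different token sequences, which is exactly what happens when you feed the same text to models with different tokenizers.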
How Tokenization Works: The "Ice Cream" Example

Let's look at how the phrase "ice cream" gets tokenized. Most modern models, like OpenAI's GPT series, use a method called Byte Pair Encoding (BPE). Here's what happens:
- "ice cream" is typically split into two tokens:
  - "ice"
  - " cream" (notice the space is included)

Why does this matter? Spaces are meaningful to tokenizers. A word with a space in front of it often becomes its own token. While some very common phrases like "New York" might be stored as single tokens because they appear frequently together, "ice cream" is usually split into two separate pieces.
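The core BPE idea can be sketched in a few lines: start from individual characters and repeatedly merge adjacent pairs according to a learned merge table. The merge table below is hand-written for illustration (real tables are learned from corpus statistics), but it shows how "ice cream" ends up as two tokens, with the space attached to " cream":

```python
def bpe_encode(text, merges):
    # Start from single characters and apply learned merges in priority order
    tokens = list(text)
    for pair in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                # Merge the matching pair into a single token
                tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
            else:
                i += 1
    return tokens

# Hypothetical merge table; note the merges that fold the leading space
# into " cream", which is why that token keeps its space
merges = [("i", "c"), ("ic", "e"),
          (" ", "c"), (" c", "r"), (" cr", "e"),
          (" cre", "a"), (" crea", "m")]

print(bpe_encode("ice cream", merges))  # ['ice', ' cream']
```

Real tokenizers apply thousands of such merges, ranked by how often each pair appeared in the training corpus.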
From Text to Numbers: Token IDs
Here's where it gets interesting: AI models don't actually store text directly. Instead, they convert tokens into numbers called token IDs.
For example, using GPT's tokenizer:
- "ice" → 8578
- " cream" → 14141
So when you type "ice cream," the model sees: [8578, 14141]
These numbers are what the model actually processes internally.
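The text-to-ID step is just a dictionary lookup in both directions. Here's a minimal sketch with a tiny stand-in vocabulary; the IDs for "ice" and " cream" follow the example above, and the "hello" entry is made up:

```python
# A toy vocabulary; real GPT vocabularies hold tens of thousands of entries
vocab = {"ice": 8578, " cream": 14141, "hello": 42}
id_to_token = {v: k for k, v in vocab.items()}

def encode(tokens):
    # Map each token string to its integer ID
    return [vocab[t] for t in tokens]

def decode(ids):
    # Map IDs back to token strings and join them into text
    return "".join(id_to_token[i] for i in ids)

ids = encode(["ice", " cream"])
print(ids)          # [8578, 14141]
print(decode(ids))  # ice cream
```

Because " cream" carries its own leading space, decoding is a plain string join with no extra logic needed to restore spacing.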
Embeddings: The Mathematical Meaning of Words
Once text is converted to token IDs, the model transforms these numbers into embeddings—high-dimensional vectors that represent the semantic meaning of each token.
Think of an embedding as a way to capture what a word "means" in mathematical form. Just like you might describe ice cream by rating it on different qualities (sweetness, temperature, creaminess), embeddings represent words using hundreds or thousands of dimensions. Each dimension captures some aspect of meaning, though what each dimension represents isn't something humans can easily interpret—it's learned by the model during training.
Embedding Dimensions
Different models use different sized embeddings:
- GPT-3: 12,288 dimensions (for the largest, 175-billion-parameter variant)
- GPT-4: similar scale, though exact numbers vary by version and aren't public
- text-embedding-ada-002: 1,536 dimensions
When the token "ice" (ID: 8578) enters a model of that scale, it gets converted into a vector of thousands of numbers—12,288 in the largest GPT-3—and that vector is what the model uses for all its internal computations.
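Inside the model, this conversion is just a table lookup: an embedding matrix with one row per vocabulary entry, indexed by token ID. Here's a minimal sketch with made-up toy sizes (real models use vocabularies of tens of thousands of tokens and thousands of dimensions, and the values are learned during training rather than random):

```python
import random

vocab_size, embedding_dim = 100, 8  # toy sizes for illustration

# One row of `embedding_dim` floats per token ID; random here just to
# show the shape of the lookup
embedding_matrix = [[random.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    # Each token ID selects one row of the matrix
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([42, 7])
print(len(vectors), len(vectors[0]))  # 2 tokens, 8 dimensions each
```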
An embedding looks like a long list of decimal numbers:
[-0.0123, 0.0456, -0.0789, 0.0231, -0.0567, 0.0892, ...]
These vectors allow the model to understand relationships between words. Words with similar meanings have similar vectors, which is why AI can understand that "happy" and "joyful" are related, or that "king" and "queen" have a similar relationship to "man" and "woman."
A Note on Applications
Embeddings aren't just used inside language models during text generation. They're also crucial for vector retrieval systems—like when you search a database of documents to find the most relevant ones. By converting both your query and stored documents into embeddings, systems can use mathematical similarity measures (like cosine similarity or Euclidean distance) to find the best matches.
For example, if you search for "ice cream," documents about "gelato" might score highly because their embeddings are mathematically similar, even though they don't use the exact same words.
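The retrieval step above can be sketched end to end with cosine similarity. The three-dimensional "embeddings" here are hand-made toy values chosen so that dessert-related documents point in a similar direction (real embeddings have hundreds or thousands of dimensions and come from a model, not by hand):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the vectors divided by the product of their lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy document embeddings
docs = {
    "gelato recipes":  [0.9, 0.8, 0.1],
    "car maintenance": [0.1, 0.0, 0.9],
    "frozen desserts": [0.8, 0.9, 0.2],
}
query = [0.9, 0.9, 0.1]  # pretend this is the embedding of "ice cream"

# Rank documents by similarity to the query, best match first
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                reverse=True)
print(ranked)  # dessert documents rank above "car maintenance"
```

Even though "gelato recipes" shares no words with "ice cream," its vector points in a similar direction, so it outranks the unrelated document—which is exactly the behavior the example above describes.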