When you build an NLP pipeline—whether for sentiment analysis, chatbots, or translation—the very first step is always the same: tokenization. In plain words, tokenization dices raw text into smaller, consistent chunks that a model can count, index, and learn from.
1 What is a Token?
Think of tokens as the LEGO® bricks of language. They can be as big as a whole word or as tiny as a single character, depending on how you slice them.
Sentence: "IBM taught me tokenization."
Possible tokens: ["IBM", "taught", "me", "tokenization"]
Different models expect different brick sizes, so choosing the right tokenizer is strategic.
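To see the slicing in action, here is a tiny sketch in plain Python using the standard re module — a naive word-level split, not the output of any particular NLP library:

import re

sentence = "IBM taught me tokenization."
# naive word-level tokenization: keep alphanumeric runs, drop punctuation
tokens = re.findall(r"\w+", sentence)
print(tokens)  # ['IBM', 'taught', 'me', 'tokenization']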
2 Why Tokenization Matters
- Sentiment analysis: detect “good” vs “bad”.
- Text generation: decide what piece comes next.
- Search engines: match “running” with “run”.
Without tokenization, your neural net sees text as a long, unreadable string of bytes—hardly the recipe for comprehension.
3 The Three Classical Approaches
Method | How It Works | When to Use | Watch‑outs |
---|---|---|---|
Word‑based | Splits on whitespace & punctuation | Quick prototypes, rule‑based systems | Huge vocabulary, OOV* explosion |
Character‑based | Every character is a token | Morphologically rich languages, misspellings | Longer sequences, less semantic punch |
Sub‑word | Keeps common words whole, chops rare ones into pieces | State‑of‑the‑art transformers (BERT, GPT‑x) | More complex training & merges |
*OOV = out‑of‑vocabulary words
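To make the trade-offs concrete, here is a small plain-Python sketch contrasting the three granularities on one phrase; the sub-word split in the final comment is illustrative rather than the output of a real tokenizer:

sentence = "unbelievably good"

# word-based: short sequence, but "unbelievably" must already be in the vocabulary
word_tokens = sentence.split()             # ['unbelievably', 'good']

# character-based: tiny vocabulary, but sequences get much longer
char_tokens = list(sentence)               # ['u', 'n', 'b', ..., 'd']
print(len(word_tokens), len(char_tokens))  # 2 17

# sub-word (illustrative): the rare word is broken into reusable pieces
# ['un', '##believ', '##ably', 'good']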
4 A Closer Look at Sub‑word Algorithms
- WordPiece (BERT) — greedy merges: start from individual characters and repeatedly join the pair of symbols that most improves the likelihood of the training data.
from transformers import BertTokenizer
tok = BertTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization lovers"))
# ['token', '##ization', 'lovers']
- Unigram (XLNet, SentencePiece) — vocabulary pruning: start with a large candidate vocabulary and repeatedly drop the pieces that contribute least to the corpus likelihood until a target size is reached.
- SentencePiece — language-agnostic: trains directly on raw text and treats whitespace as just another symbol, so no language-specific pre-tokenization is needed.
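As a rough sketch of the sentencepiece package's train-then-encode workflow — the corpus file, model prefix, and vocabulary size below are placeholder choices, and the exact pieces returned depend entirely on the training data:

import sentencepiece as spm

# train a Unigram model directly on raw text (no pre-tokenization step)
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # placeholder: plain-text file, one sentence per line
    model_prefix="spm_demo",  # writes spm_demo.model and spm_demo.vocab
    vocab_size=2000,
    model_type="unigram",     # "bpe", "char", and "word" are also available
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("tokenization lovers", out_type=str))
# e.g. ['▁token', 'ization', '▁lovers'] — '▁' marks a preceding space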
5 Tokenization + Indexing in PyTorch
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentences = ["Life is short", "Tokenization is powerful"]
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

# reserve slots for the special tokens alongside the regular vocabulary
vocab = build_vocab_from_iterator(yield_tokens(sentences),
                                  specials=["<bos>", "<eos>", "<pad>", "<unk>"])
vocab.set_default_index(vocab["<unk>"])

tokens = tokenizer(sentences[0])   # ['life', 'is', 'short']
indices = vocab(tokens)            # [5, 6, 7] (example output)

# Add special tokens + padding
max_len = 6
padded = ["<bos>"] + tokens + ["<eos>"]
padded += ["<pad>"] * (max_len - len(padded))
Why it matters: models operate on integers, not strings. torchtext takes you from raw text to GPU-ready index tensors in a handful of lines.
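As a short continuation of the snippet above (reusing its vocab and padded variables), the padded tokens still need to become a tensor of indices before a model can consume them:

import torch

padded_indices = vocab(padded)               # map tokens to ints; <pad> included
input_tensor = torch.tensor(padded_indices)  # shape: (max_len,)
print(input_tensor)  # e.g. tensor([0, 5, 6, 7, 1, 2]) — indices are illustrative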
6 Special Tokens Cheat‑Sheet
Token | Purpose |
---|---|
`<bos>` | Beginning of sentence |
`<eos>` | End of sentence |
`<pad>` | Sequence padding |
`<unk>` | Unknown / rare word |
Adding them keeps batches rectangular and gives generation an unambiguous place to start and stop.
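For example, padding is what lets sentences of different lengths sit in one rectangular batch — a minimal plain-Python sketch, independent of any particular library:

batch = [["<bos>", "life", "is", "short", "<eos>"],
         ["<bos>", "tokenization", "is", "really", "powerful", "<eos>"]]

# pad every sequence to the length of the longest one
max_len = max(len(seq) for seq in batch)
batch = [seq + ["<pad>"] * (max_len - len(seq)) for seq in batch]
# every row now has the same length, so the batch can be stacked into one tensor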
7 Key Takeaways
- Tokenization is non‑negotiable—mis‑tokenize and your downstream model will stumble.
- Choose by trade‑off: word‑level (semantic clarity) vs character‑level (tiny vocab) vs sub‑word (best of both, extra complexity).
- Modern transformers ♥ sub‑word algorithms such as WordPiece, Unigram, and SentencePiece.
- Indexing turns tokens into numbers; libraries like `torchtext`, spaCy, and `transformers` automate the grunt work.
- Special tokens (`<bos>`, `<eos>`, etc.) keep sequence models from losing their place.
Reference
My study notes from the IBM Generative AI and LLMs: Architecture and Data Preparation course.