When you build an NLP pipeline—whether for sentiment analysis, chatbots, or translation—the very first step is always the same: tokenization. In plain words, tokenization dices raw text into smaller, consistent chunks that a model can count, index, and learn from.

1  What is a Token?

Think of tokens as the LEGO® bricks of language. They can be as big as a whole word or as tiny as a single character, depending on how you slice them.

Sentence: "IBM taught me tokenization."
Possible tokens: ["IBM", "taught", "me", "tokenization"]

Different models expect different brick sizes, so choosing the right tokenizer is strategic.
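
As a quick sketch of those two extremes, here is the example sentence sliced at word level and at character level using nothing but plain Python; real tokenizers add extra rules for punctuation, casing, and so on.

sentence = "IBM taught me tokenization."

# Word-level slicing: split on whitespace (the final period stays glued
# to "tokenization" with this naive approach).
word_tokens = sentence.split()
print(word_tokens)      # ['IBM', 'taught', 'me', 'tokenization.']

# Character-level slicing: every single character becomes a token.
char_tokens = list(sentence)
print(char_tokens[:6])  # ['I', 'B', 'M', ' ', 't', 'a']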

2  Why Tokenization Matters

  • Sentiment analysis: detect “good” vs “bad”.
  • Text generation: decide what piece comes next.
  • Search engines: match “running” with “run”.

Without tokenization, your neural net sees text as a long, unreadable string of bytes—hardly the recipe for comprehension.

3  The Three Classical Approaches

Method          | How It Works                                          | When to Use                                  | Watch-outs
Word-based      | Splits on whitespace & punctuation                    | Quick prototypes, rule-based systems         | Huge vocabulary, OOV* explosion
Character-based | Every character is a token                            | Morphologically rich languages, misspellings | Longer sequences, less semantic punch
Sub-word        | Keeps common words whole, chops rare ones into pieces | State-of-the-art transformers (BERT, GPT-x)  | More complex training & merges

*OOV = out‑of‑vocabulary words
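
To make the trade-offs concrete, here is one possible way to compare the word-based and character-based rows on a single sentence, using only the standard-library re module; the regex is just an illustration, and the sub-word case is covered by the BERT snippet in the next section.

import re

text = "Unbelievably, tokenizers differ!"

# Word-based: split on whitespace and peel punctuation off into its own tokens.
word_based = re.findall(r"\w+|[^\w\s]", text)
print(word_based)        # ['Unbelievably', ',', 'tokenizers', 'differ', '!']

# Character-based: every character (spaces dropped here) is a token.
char_based = [ch for ch in text if not ch.isspace()]
print(len(word_based), len(char_based))   # 5 vs. 30 -- far longer sequences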

4  A Closer Look at Sub‑word Algorithms

  1. WordPiece (BERT). Greedy merges: start from individual characters and repeatedly join the pair of symbols that most increases the likelihood of the training data.

     from transformers import BertTokenizer

     tok = BertTokenizer.from_pretrained("bert-base-uncased")
     print(tok.tokenize("tokenization lovers"))
     # ['token', '##ization', 'lovers']

  2. Unigram (XLNet, SentencePiece). Vocabulary pruning: begin with a large set of candidate pieces and drop the least useful ones until a target vocabulary size is reached.
  3. SentencePiece. Language-agnostic: it trains directly on raw text and treats spaces as tokens, so no pre-tokenization is needed (see the sketch after this list).
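
As a hedged illustration of the Unigram/SentencePiece side, the snippet below uses Hugging Face's XLNetTokenizer, which wraps a pretrained SentencePiece Unigram model; the exact pieces you get depend on that model's vocabulary.

from transformers import XLNetTokenizer

# XLNet ships with a SentencePiece Unigram model; "▁" marks a word boundary.
tok = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(tok.tokenize("tokenization lovers"))
# Something like ['▁token', 'ization', '▁lover', 's'] (depends on the vocabulary)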

5  Tokenization + Indexing in PyTorch

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sentences = ["Life is short", "Tokenization is powerful"]
tokenizer = get_tokenizer("basic_english")

def yield_tokens(data_iter):
    for text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(sentences),
                                  specials=["<unk>", "<pad>", "<bos>", "<eos>"])
vocab.set_default_index(vocab["<unk>"])

tokens = tokenizer(sentences[0])          # ['life', 'is', 'short']
indices = vocab(tokens)                   # [5, 6, 7]  (example output)

# Add special tokens + padding
max_len = 6
padded = [""] + tokens + [""]
padded += [""] * (max_len - len(padded))

Why it matters: Models operate on integers, not strings. torchtext takes you from raw text to integer indices, and from there to GPU-ready tensors, in just a few lines (see the sketch below).
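
Here is a minimal continuation of the snippet above: the padded token list is mapped to indices and wrapped in a tensor. The printed values are illustrative; the exact numbers depend on how the vocabulary was built.

import torch

# Look up each (special or regular) token's index and stack them into a tensor.
input_ids = torch.tensor(vocab(padded), dtype=torch.long)
print(input_ids)          # e.g. tensor([2, 5, 4, 6, 3, 1]) -- values vary with the vocab
print(input_ids.shape)    # torch.Size([6]) -- one slot per position up to max_len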

6  Special Tokens Cheat‑Sheet

Token  | Purpose
<bos>  | Beginning of sentence
<eos>  | End of sentence
<pad>  | Sequence padding
<unk>  | Unknown / rare word

Adding them makes batching cleaner and gives sequence models an unambiguous signal for where each sequence starts and stops.
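
For example, continuing the torchtext snippet from section 5 (and reusing the sentences, tokenizer, vocab, and max_len defined there), the two example sentences only stack into a single batch tensor once <bos>/<eos> are added and both rows are padded to the same length. A sketch:

import torch

batch = []
for text in sentences:
    toks = ["<bos>"] + tokenizer(text) + ["<eos>"]   # mark start and end
    toks += ["<pad>"] * (max_len - len(toks))        # pad to a common length
    batch.append(vocab(toks))                        # tokens -> indices

batch_tensor = torch.tensor(batch)
print(batch_tensor.shape)   # torch.Size([2, 6]) -- one row per sentence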

7  Key Takeaways

  • Tokenization is non‑negotiable—mis‑tokenize and your downstream model will stumble.
  • Choose by trade‑off: word‑level (semantic clarity) vs character‑level (tiny vocab) vs sub‑word (best of both, extra complexity).
  • Modern transformers ♥ sub‑word algorithms such as WordPiece, Unigram, and SentencePiece.
  • Indexing turns tokens into numbers; libraries like torchtext, spaCy, and transformers automate the grunt work.
  • Special tokens (<bos>, <eos>, <pad>, <unk>, etc.) keep sequence models from losing their place.

Reference

My study notes from the IBM Generative AI and LLMs: Architecture and Data Preparation course.