How BERT Works
December 17, 2025
BERT is a bidirectional transformer encoder. Let's break that down: Bidirectional means each word attends to words before AND after it—it sees the entire sentence's context at once. Transformer is the architecture using attention (dot product → softmax → weighted average). Encoder converts text into vectors (vs. a "decoder" which generates text).
I'll explain this through an example classification task. We want to classify news titles into four topics: Business, Sports, World, and SciTech. Let's do it with: "Apple stock rises"
Step 1: Tokenize (text → integer IDs)
Computers can't process words directly, so we represent each word as a number called a token. Think of your student ID—a unique number representing you. Tokens do the same for words.
BERT has a fixed vocabulary file mapping ~30,000 words to integer IDs. We look up each word:
Input: "Apple stock rises"
Output: [101,    8347,    4518,    9012,   102]
           ↑       ↑        ↑        ↑       ↑
         [CLS]   Apple    stock    rises   [SEP]

Two special tokens are added automatically: [CLS] ("Classification") is a dummy token at the start that will accumulate the sentence's meaning, and [SEP] ("Separator") marks the end of the sentence. These aren't part of the original text; BERT adds them to every input.
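If you want to see this yourself, here is a minimal sketch using the Hugging Face transformers library. The word IDs it prints come from BERT's real vocabulary, so they will differ from the illustrative numbers above; only 101 ([CLS]) and 102 ([SEP]) are guaranteed to match.

# Tokenization sketch with the Hugging Face transformers library (pip install transformers).
# The exact word-piece IDs and splits come from bert-base-uncased's vocabulary.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("Apple stock rises")
print(encoded["input_ids"])                                   # e.g. [101, ..., 102]
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # likely ['[CLS]', 'apple', 'stock', 'rises', '[SEP]']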
Step 2: Embed (integer IDs → vectors)
Goal: Turn each integer (word) into a richer representation (768 numbers instead of 1). Why 768 numbers? A single number can't capture meaning. With 768 numbers, similar words can have similar vectors, encoding relationships between words.
BERT has a table with 30,000 rows, each row is 768 numbers:
Row 0:    [0.02, -0.01, 0.05, 0.03, ... 768 numbers total]
Row 1:    [0.11, 0.08, -0.03, 0.14, ...]
...
Row 101:  [0.10, 0.20, 0.05, 0.15, ...]   ← [CLS]
Row 8347: [0.80, 0.20, 0.90, 0.40, ...]   ← "Apple"
Row 4518: [0.50, 0.50, 0.50, 0.60, ...]   ← "stock"
Row 9012: [0.30, 0.70, 0.60, 0.20, ...]   ← "rises"
Where did these numbers come from? Pre-training. BERT was trained on billions of sentences, and the numbers were adjusted until similar words had similar vectors.
After Step 2, we have (using 4 numbers for simplicity; real BERT uses 768):
[CLS] → [0.10, 0.20, 0.05, 0.15]
Apple → [0.80, 0.20, 0.90, 0.40]
stock → [0.50, 0.50, 0.50, 0.60]
rises → [0.30, 0.70, 0.60, 0.20]
[SEP] → [0.15, 0.10, 0.25, 0.05]
Right now, the [CLS] (and [SEP]) vectors are the same for every sentence; no context has been mixed in yet.
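In code, this step is nothing more than a table lookup. A toy sketch in numpy, using the made-up 4-number vectors above rather than real BERT weights:

import numpy as np

# Toy embedding table: one row per token ID (real BERT: ~30,000 rows × 768 numbers).
# The values are the illustrative ones from this walkthrough, not real BERT weights.
embedding_table = {
    101:  np.array([0.10, 0.20, 0.05, 0.15]),  # [CLS]
    8347: np.array([0.80, 0.20, 0.90, 0.40]),  # Apple
    4518: np.array([0.50, 0.50, 0.50, 0.60]),  # stock
    9012: np.array([0.30, 0.70, 0.60, 0.20]),  # rises
    102:  np.array([0.15, 0.10, 0.25, 0.05]),  # [SEP]
}

token_ids = [101, 8347, 4518, 9012, 102]
vectors = np.stack([embedding_table[t] for t in token_ids])   # shape (5, 4): one vector per token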
Step 3: Attention (vectors get blended together)
Goal: Right now, "stock" has the same vector whether it's in "Apple stock" or "chicken stock." We want context to change the vector.
How: Each word looks at every other word, computes similarity, and blends their vectors together. This is what makes BERT bidirectional: when processing "stock," it looks at "Apple" (before) AND "rises" (after) simultaneously.
(Note: Real BERT transforms vectors through learned matrices Q, K, V before computing dot products. We're showing the simplified version using raw vectors.)
Step 3a: Compute similarity (dot product)
Dot product = multiply corresponding numbers, then sum.
[CLS] = [0.10, 0.20, 0.05, 0.15]
dot([CLS], [CLS]) = 0.10×0.10 + 0.20×0.20 + 0.05×0.05 + 0.15×0.15
= 0.01 + 0.04 + 0.0025 + 0.0225 = 0.075
dot([CLS], Apple) = 0.10×0.80 + 0.20×0.20 + 0.05×0.90 + 0.15×0.40
= 0.08 + 0.04 + 0.045 + 0.06 = 0.225
dot([CLS], stock) = 0.10×0.50 + 0.20×0.50 + 0.05×0.50 + 0.15×0.60
= 0.05 + 0.10 + 0.025 + 0.09 = 0.265
dot([CLS], rises) = 0.10×0.30 + 0.20×0.70 + 0.05×0.60 + 0.15×0.20
= 0.03 + 0.14 + 0.03 + 0.03 = 0.23
dot([CLS], [SEP]) = 0.10×0.15 + 0.20×0.10 + 0.05×0.25 + 0.15×0.05
= 0.015 + 0.02 + 0.0125 + 0.0075 = 0.055
Similarity scores: [0.075, 0.225, 0.265, 0.23, 0.055]
                    [CLS]  Apple  stock  rises  [SEP]
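The same arithmetic, done all at once in numpy (continuing from the vectors array in the Step 2 sketch; remember this skips the Q/K/V matrices of real BERT):

# Similarity of [CLS] against every token, itself included: five dot products in one line.
cls_vec = vectors[0]
scores = vectors @ cls_vec
print(scores.round(3))        # ≈ [0.075, 0.225, 0.265, 0.23, 0.055]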
Step 3b: Convert to weights (Softmax)
We want weights that sum to 1 (so we can do a weighted average).
Scores: [0.075, 0.225, 0.265, 0.23, 0.055]
Step 1: exp() each number
exp(0.075) = 1.078
exp(0.225) = 1.252
exp(0.265) = 1.303
exp(0.23) = 1.259
exp(0.055) = 1.057
Result: [1.078, 1.252, 1.303, 1.259, 1.057]
Step 2: sum them
1.078 + 1.252 + 1.303 + 1.259 + 1.057 = 5.949
Step 3: divide each by sum
1.078/5.949 = 0.18
1.252/5.949 = 0.21
1.303/5.949 = 0.22
1.259/5.949 = 0.21
1.057/5.949 = 0.18
Weights: [0.18, 0.21, 0.22, 0.21, 0.18]
          [CLS] Apple stock rises [SEP]
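Those three steps collapse into two lines of numpy (again continuing the sketch):

# Softmax: exponentiate every score, then divide by the total so the weights sum to 1.
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum()
print(weights.round(2))       # ≈ [0.18, 0.21, 0.22, 0.21, 0.18]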
Step 3c: Weighted average (blend vectors)
Multiply each vector by its weight, then add them all together.
new_[CLS] = 0.18 × [0.10, 0.20, 0.05, 0.15] ([CLS])
+ 0.21 × [0.80, 0.20, 0.90, 0.40] (Apple)
+ 0.22 × [0.50, 0.50, 0.50, 0.60] (stock)
+ 0.21 × [0.30, 0.70, 0.60, 0.20] (rises)
+ 0.18 × [0.15, 0.10, 0.25, 0.05] ([SEP])
Before attention: [CLS] = [0.10, 0.20, 0.05, 0.15]
After attention:  [CLS] = [0.386, 0.353, 0.479, 0.294]

Repeat this 12 times (BERT has 12 layers), and the vectors keep refining. After layer 12, the [CLS] vector encodes the whole sentence.
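Putting Steps 3a–3c together, here is one full simplified attention pass for every token at once. This is still a sketch: real BERT first maps the vectors through learned Q, K, and V matrices, scales the scores, uses multiple attention heads, and wraps each of its 12 layers in feed-forward and residual connections.

# Simplified self-attention over the whole sentence: every token attends to every token.
scores_all = vectors @ vectors.T                        # (5, 5) matrix of all pairwise dot products
weights_all = np.exp(scores_all)
weights_all /= weights_all.sum(axis=1, keepdims=True)   # softmax across each row
new_vectors = weights_all @ vectors                     # each row becomes a weighted average
print(new_vectors[0].round(3))                          # new [CLS] ≈ [0.386, 0.353, 0.479, 0.294]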
Step 4: Classify (vector → topic)
Goal: Turn the [CLS] vector into one of 4 topics. How: Dot product with learned "topic vectors."
Final [CLS] = [0.386, 0.353, 0.479, 0.294]
Classification weights (one vector per topic; they start out random and get adjusted during training, as we'll see in Step 5):
Business_weights = [0.9, 0.1, 0.8, 0.3]
Sports_weights = [0.1, 0.9, 0.2, 0.7]
World_weights = [0.4, 0.3, 0.5, 0.4]
SciTech_weights = [0.6, 0.2, 0.7, 0.25]
Compute dot products:
Business = 0.386×0.90 + 0.353×0.10 + 0.479×0.80 + 0.294×0.30
= 0.347 + 0.035 + 0.383 + 0.088 = 0.853
Sports = 0.386×0.10 + 0.353×0.90 + 0.479×0.20 + 0.294×0.70
= 0.039 + 0.318 + 0.096 + 0.206 = 0.659
World = 0.386×0.40 + 0.353×0.30 + 0.479×0.50 + 0.294×0.40
= 0.154 + 0.106 + 0.240 + 0.118 = 0.618
SciTech = 0.386×0.60 + 0.353×0.20 + 0.479×0.70 + 0.294×0.25
= 0.232 + 0.071 + 0.335 + 0.074 = 0.712
Scores: Business=0.853, Sports=0.659, World=0.618, SciTech=0.712
Highest: Business (0.853)

PREDICTION: Business ✓
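The same computation as a sketch (the topic vectors are the made-up numbers above, not weights from a real trained model):

# Classification head: dot the final [CLS] vector with each topic's weight vector,
# then pick the topic with the highest score.
topic_weights = {
    "Business": np.array([0.9, 0.1, 0.8, 0.3]),
    "Sports":   np.array([0.1, 0.9, 0.2, 0.7]),
    "World":    np.array([0.4, 0.3, 0.5, 0.4]),
    "SciTech":  np.array([0.6, 0.2, 0.7, 0.25]),
}

cls_final = np.array([0.386, 0.353, 0.479, 0.294])
topic_scores = {topic: float(w @ cls_final) for topic, w in topic_weights.items()}
prediction = max(topic_scores, key=topic_scores.get)    # "Business"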
Step 5: Training (adjusting weights when wrong)
Goal: If the prediction was wrong, adjust the topic vectors so the model does better next time.
True label: Business
Prediction: Business
Correct! No update needed.
But what if we're wrong? Let's try another example:
Next example: "touchdown wins game"

[CLS] vector after attention: [0.12, 0.88, 0.31, ...]

Scores:
Business = dot([0.12, 0.88, 0.31, ...], Business_weights) = 0.72  ← highest
Sports   = dot([0.12, 0.88, 0.31, ...], Sports_weights)   = 0.65

Prediction: Business
True label: Sports
WRONG!

Update weights:
Sports_weights   += 0.01 × [CLS]_vector   (move toward this sentence)
Business_weights -= 0.01 × [CLS]_vector   (move away from this sentence)

New weights:
Sports_weights   = [0.1+0.0012, 0.9+0.0088, 0.2+0.0031, ...]
Business_weights = [0.9-0.0012, 0.1-0.0088, 0.8-0.0031, ...]
Now Sports_weights is slightly more aligned with "touchdown wins game" type sentences. Repeat this millions of times, and the model learns to classify accurately.
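Here's that update rule as a sketch. It's a simplified, perceptron-style step; real BERT fine-tuning computes gradients of a cross-entropy loss and updates every weight in the network, not just the topic vectors, but the intuition is the same.

LEARNING_RATE = 0.01  # the 0.01 step size used in the example above

def update_topic_weights(topic_weights, cls_vector, predicted, true_label):
    # If the prediction was wrong, nudge the true topic's weights toward this sentence's
    # [CLS] vector and the wrongly predicted topic's weights away from it.
    if predicted != true_label:
        topic_weights[true_label] += LEARNING_RATE * cls_vector
        topic_weights[predicted]  -= LEARNING_RATE * cls_vector

Calling update_topic_weights(topic_weights, cls_vector, "Business", "Sports") with the [CLS] vector for "touchdown wins game" reproduces the Sports/Business adjustment shown above.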