Skip-gram vs CBOW: Understanding Word2vec Training Examples
Introduction
When working with Word2vec embeddings, one of the most important decisions you’ll make is choosing between Skip-gram and Continuous Bag of Words (CBOW). The conventional wisdom suggests that if you have lots of data, Skip-gram tends to work best, while CBOW works better when you don’t have as much data. But is this really true? Let’s dive deep into how these models work and understand the reasoning behind this claim.
What Are Training Examples?
Before we compare the models, let’s clarify what we mean by “training examples.” A training example (or data point) is a single observation or instance used to teach a machine learning model. It’s the fundamental unit of data on which the learning process operates.
In supervised and self-supervised learning contexts like Word2vec, a training example consists of two main parts:
- Input (Feature Vector): The data the model receives to make a prediction
- Output (Label/Target): The correct answer or value the model is expected to predict
In Word2vec, these training examples are automatically generated from a large body of text (corpus) based on a small window of words called the context window.
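To make this concrete, here is a minimal Python sketch that slides a context window over one tokenized sentence and prints each center word together with its neighbors, the raw material from which both models build their training examples. Whitespace tokenization and a window size of 2 are simplifying assumptions for illustration.

```python
# Minimal sketch: sliding a context window of size 2 over a tokenized sentence.
# Whitespace tokenization is a simplifying assumption.
sentence = "The cat sat on the mat".split()
window = 2

for i, center in enumerate(sentence):
    # Words up to `window` positions to the left and right of the center word
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"center: {center:<4} context: {context}")
```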
How CBOW Generates Training Examples
Continuous Bag of Words (CBOW) takes all the context words surrounding a center word and tries to predict that single center word. For each context window, it generates one training example.
Example
Consider the sentence: “The cat sat on the mat”
With a window size of 2 and “sat” as the center word:
- Input: (The, cat, on, the) - the surrounding context words
- Output: sat - the center word to predict
- Result: 1 training example per window
By averaging the context vectors to predict the target, CBOW effectively “smooths” the context, which is beneficial for frequent words but less effective for rare ones.
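As a rough sketch (same toy sentence and window size as above), the loop below emits exactly one (context words, center word) example per window position:

```python
# Illustrative sketch: CBOW yields one (context words -> center word) example per window.
sentence = "The cat sat on the mat".split()
window = 2

cbow_examples = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    cbow_examples.append((context, center))  # input: context list, target: center word

print(cbow_examples[2])    # (['The', 'cat', 'on', 'the'], 'sat') -- the example above
print(len(cbow_examples))  # 6 window positions -> 6 training examples
```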
How Skip-gram Generates Training Examples
Skip-gram takes the opposite approach. It takes the center word and tries to predict each surrounding context word individually. For a context window of size C, it generates up to 2×C training examples from a single window (fewer at sentence boundaries).
Example
Using the same sentence: “The cat sat on the mat”
With a window size of 2 and “sat” as the center word, Skip-gram creates multiple pairs:
- (sat, The)
- (sat, cat)
- (sat, on)
- (sat, the)
Result: 4 training examples from one window - effectively multiplying your training data!
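The sketch below generates these pairs for the whole toy sentence; note how the same six window positions that gave CBOW six examples give Skip-gram eighteen pairs.

```python
# Illustrative sketch: Skip-gram yields one (center word -> context word) pair
# for every word in the window, multiplying the number of training examples.
sentence = "The cat sat on the mat".split()
window = 2

skipgram_pairs = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    skipgram_pairs.extend((center, ctx) for ctx in context)

print([pair for pair in skipgram_pairs if pair[0] == "sat"])
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
print(len(skipgram_pairs))  # 18 pairs from the same 6 window positions
```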
Comparison Table
| Model | Input (Features) | Output (Label/Target) | Training Examples per Window |
|---|---|---|---|
| CBOW | Context words surrounding the center word | The single center word | One example per window |
| Skip-gram | The single center word | A single context word | Multiple examples (one for every surrounding word) |
Performance Implications
Training Data Multiplication
This fundamental difference in training example generation is the primary reason for the performance differences between the two models:
Skip-gram:
- Creates a very large, diverse dataset of pairs
- Makes the task harder for each individual example (as it learns to predict many different context words)
- Provides better exposure for rare words
- Ultimately leads to higher-quality embeddings, especially with more data
CBOW:
- Trains faster because it has fewer steps per window
- Averages context vectors, which smooths the context
- More beneficial for frequent words
- Less effective for capturing rare word meanings
Which Model Should You Choose?
The Conventional Wisdom
The general statement about data size is partially true, though the reasoning is debated in practice:
Skip-gram is generally considered to work better with:
- Larger datasets
- Rare words or phrases
- When you need high-quality embeddings and have computational resources
CBOW is typically better for:
- Smaller datasets
- Faster training requirements
- Frequent words, where it offers slightly better accuracy
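In practice, switching between the two variants is often a single flag. Below is a minimal sketch using gensim (assuming gensim 4.x); the toy corpus and hyperparameter values are placeholders, not recommendations.

```python
# Minimal sketch: training both variants with gensim (assumes gensim 4.x).
# The toy corpus and hyperparameters are placeholders, not recommendations.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW.
skipgram_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, negative=5)
cbow_model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0, negative=5)

print(skipgram_model.wv["cat"][:5])  # first few dimensions of the Skip-gram vector
print(cbow_model.wv["cat"][:5])      # first few dimensions of the CBOW vector
```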
The Original Word2vec Paper
Interestingly, the original Word2vec paper by Mikolov et al. suggested somewhat different characteristics:
“Skip-gram: works well with small amount of the training data, represents well even rare words or phrases. CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words.”
However, many subsequent real-world applications and discussions align with the notion that Skip-gram benefits more from large datasets due to its training process.
Why Skip-gram Works Better with Large Data
The key insight is that Skip-gram creates more training examples from each context window compared to CBOW. This effectively multiplies your training data, which can be highly beneficial when you have a very large dataset.
However, this multiplication also makes the learning task harder for each individual example. With more data, the model has enough examples to learn these harder patterns effectively, resulting in better embeddings overall.
Helpful Analogy
Think of it this way:
- CBOW is like a single fill-in-the-blank question, where you use all the surrounding words together to guess the missing word
- Skip-gram is like a set of many questions, where you start from one word and guess each of its surrounding words one by one
The Role of Negative Sampling
Now that we understand how training examples are generated, let’s explore an equally important concept: negative sampling. This technique is crucial for making Word2vec training computationally feasible.
What is a Negative Sample?
A negative sample is essentially a false example that is intentionally introduced to teach the model what a correct, or “positive,” example is not.
In typical machine learning tasks, you have correct observations (positive samples) that the model should learn to identify. A negative sample is a data point or pair that the model should explicitly learn to reject or assign a low probability to.
Example in Skip-gram
Consider the sentence: “The dog chased the cat”
With a context window of 2, the positive pairs include:
- (dog, chased) → Label: 1 (True)
- (chased, dog) → Label: 1 (True)
- (chased, cat) → Label: 1 (True)
Negative samples are created by taking a center word and pairing it with a random word from the vocabulary that did not appear in its context window:
- (dog, banana) → Label: 0 (False)
- (chased, zebra) → Label: 0 (False)
The model is then trained to assign high scores to positive samples and low scores to negative samples.
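A toy sketch of assembling such labeled pairs is shown below. The small vocabulary, the choice of k = 2 negatives per positive pair, and the uniform random draw are simplifying assumptions; real Word2vec samples negatives from a frequency-based distribution, discussed further down.

```python
# Illustrative sketch: label true (center, context) pairs 1 and random pairs 0.
# Vocabulary, k = 2 negatives per positive, and uniform sampling are assumptions.
import random

random.seed(0)
vocab = ["the", "dog", "chased", "cat", "banana", "zebra", "car", "apple"]
positive_pairs = [("dog", "chased"), ("chased", "dog"), ("chased", "cat")]

k = 2  # negatives per positive pair
training_data = []
for center, context in positive_pairs:
    training_data.append((center, context, 1))  # true context pair
    candidates = [w for w in vocab if w not in (center, context)]
    for negative in random.sample(candidates, k):
        training_data.append((center, negative, 0))  # random "noise" pair

for example in training_data:
    print(example)
```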
Why Negative Sampling is Essential: The Efficiency Problem
The primary reason for using negative sampling is to drastically improve computational efficiency when dealing with large vocabularies.
In the original Word2vec model, for every training example, the model had to compute an output probability for every single word in the entire vocabulary (often hundreds of thousands of words) using a function called Softmax. This is incredibly slow.
Negative Sampling solves this by converting the problem:
- From: A massive multi-class classification task (which word is the right context word out of all possible words?)
- To: A small set of binary classification tasks (is this pair a true context pair or a random one?)
Instead of updating the weights for V (the size of the entire vocabulary) words, the model only updates the weights for:
- 1 positive word
- A small, fixed number k of negative samples (typically k = 5 to 20)
This reduces the computational complexity from O(|V|) to O(k), making it feasible to train word embeddings on massive datasets.
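The NumPy sketch below makes the saving visible for a single training pair; the vocabulary size, embedding dimension, value of k, and random toy embeddings are illustrative assumptions.

```python
# Sketch of the per-example work under negative sampling (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 100_000, 100, 5                    # vocab size, embedding dim, negatives

W_in = rng.normal(scale=0.1, size=(V, d))    # "input" (center word) embeddings
W_out = rng.normal(scale=0.1, size=(V, d))   # "output" (context word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

center_id, context_id = 42, 1337
negative_ids = rng.integers(0, V, size=k)

# Negative sampling reduces the update to 1 + k binary decisions:
pos_score = sigmoid(W_in[center_id] @ W_out[context_id])       # push toward 1
neg_scores = sigmoid(W_in[center_id] @ W_out[negative_ids].T)  # push toward 0

# Only 1 + k rows of W_out (plus one row of W_in) need gradient updates,
# instead of all V rows that a full softmax over the vocabulary would touch.
print(pos_score, neg_scores)
```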
How Negative Samples are Chosen
Negative samples are not chosen purely randomly. Instead, they’re drawn from a special probability distribution based on word frequency.
In Word2vec, a word w is selected as a negative sample with a probability proportional to its frequency raised to the power of 3/4:
P(w) ∝ Freq(w)^(3/4)
This method ensures that:
- More frequent words (like “the,” “a,” “is”) are more likely to be selected as negative samples, which makes sense because they are more common “distractors”
- The sampling is smoothed (due to the 3/4 exponent), preventing extremely frequent words from completely dominating the negative samples
This smart sampling strategy balances between selecting common words that provide useful contrasts while avoiding over-representation of the most frequent words.
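The short sketch below illustrates the effect with made-up word counts, comparing raw unigram probabilities with the 3/4-smoothed ones and drawing a few negative samples from the smoothed distribution.

```python
# Illustrative sketch of the frequency-smoothed negative-sampling distribution.
# The word counts below are made up for the example.
import numpy as np

rng = np.random.default_rng(0)
counts = {"the": 1000, "cat": 50, "sat": 40, "zebra": 2}

words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

raw = freqs / freqs.sum()        # plain unigram distribution
smoothed = freqs ** 0.75
smoothed /= smoothed.sum()       # P(w) proportional to Freq(w)^(3/4)

for word, p_raw, p_smooth in zip(words, raw, smoothed):
    print(f"{word:>6}: raw={p_raw:.3f}  smoothed={p_smooth:.3f}")

# Draw 10 negative samples from the smoothed distribution
print(rng.choice(words, size=10, p=smoothed))
```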
Conclusion
The choice between Skip-gram and CBOW ultimately depends on your specific use case:
- If you have abundant data and computational resources, Skip-gram will likely give you better quality embeddings, especially for rare words
- If you need faster training and are working with a smaller dataset focused on frequent words, CBOW may be more practical
Understanding how each model generates training examples, combined with the efficiency gains from negative sampling, helps you make an informed decision based on your data characteristics and requirements. The combination of clever training example generation and negative sampling is what makes Word2vec both effective and practical for real-world applications.