Decoding the DINOv2 framework
The Sinkhorn-Knopp algorithm is an iterative method for transforming a matrix into a doubly stochastic matrix, one in which every row and every column sums to 1. In the context of DINOv2 (and self-supervised learning more broadly), it is used to compute balanced assignments of data points to prototypes (clusters), which can be viewed as an approximate optimal-transport problem. Let's break it down and then dive into an example within the DINOv2 framework.
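To make the basic mechanism concrete, here is a minimal, self-contained PyTorch sketch (the function name and iteration count are illustrative choices, not from the DINOv2 codebase) that turns a positive square matrix into an approximately doubly stochastic one by alternating row and column normalization:

```python
import torch

def make_doubly_stochastic(M: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternate row and column normalization (Sinkhorn-Knopp) on a positive matrix."""
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)  # rescale each row to sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # rescale each column to sum to 1
    return M

M = make_doubly_stochastic(torch.rand(3, 3) + 0.1)
print(M.sum(dim=1))  # ~tensor([1., 1., 1.])
print(M.sum(dim=0))  # ~tensor([1., 1., 1.])
```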
Sinkhorn-Knopp in DINOv2
In DINOv2, we train a student network to match the predictions of a teacher network. The teacher produces a soft assignment of each data point to a set of prototypes (clusters), represented as a probability distribution over the prototypes. We want these assignments to be balanced across the entire batch, meaning that every prototype gets roughly an equal share of data points.
This is where Sinkhorn-Knopp comes in: it normalizes the teacher's predictions (or soft assignments) so that:
- Each row of the (K × B) assignment matrix sums to 1/K, ensuring that every prototype receives the same total weight.
- Each column sums to 1/B, ensuring that every data point contributes the same total weight, spread across the prototypes.
Steps of Sinkhorn-Knopp in DINOv2
Here’s a typical flow in DINOv2:
- Teacher output: The teacher network provides an output matrix of size K × B, where K is the number of prototypes (or clusters) and B is the batch size (number of data points). This matrix contains the raw similarities or scores between data points and prototypes; these scores are not yet valid, balanced probability distributions.
- Sinkhorn normalization: The Sinkhorn-Knopp algorithm takes this matrix and iteratively normalizes it:
  - Row normalization rescales each row so that every prototype's total weight is 1/K, distributing the assignments equally across all prototypes.
  - Column normalization rescales each column so that every data point's total weight is 1/B, so each data point contributes equally.
- Balanced assignment: After a few alternating iterations, the matrix satisfies both constraints (approximately), so every prototype gets a balanced share of the data. This is crucial for avoiding degenerate solutions, where some prototypes never receive any assignments.
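A minimal PyTorch sketch of this flow, loosely modeled on the SwAV/DINO-family implementations (the function name, the temperature `eps`, and the iteration count are illustrative assumptions, not the exact DINOv2 code):

```python
import torch

def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Balance a K x B teacher score matrix so rows sum to 1/K and columns to 1/B."""
    logits = scores / eps                 # eps acts as a sharpening temperature
    Q = torch.exp(logits - logits.max())  # positive entries, numerically stable
    Q = Q / Q.sum()                       # total mass = 1
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / (K * Q.sum(dim=1, keepdim=True))  # each row (prototype)    -> 1/K
        Q = Q / (B * Q.sum(dim=0, keepdim=True))  # each column (data point) -> 1/B
    return Q

Q = sinkhorn_knopp(torch.randn(3, 8))  # K = 3 prototypes, B = 8 samples
print(Q.sum(dim=1))  # ~1/3 per prototype
print(Q.sum(dim=0))  # ~1/8 per data point
```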
Example of Sinkhorn-Knopp in DINOv2
Let’s walk through an example to illustrate how Sinkhorn-Knopp operates in DINOv2:
Step 1: Teacher Output (Raw Scores)
Suppose we have K = 3 prototypes and a batch size of B = 4. The teacher network produces the following raw scores (before normalization). For readability, the matrix is shown transposed as B × K, with data points as rows and prototypes as columns:
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 2.5 | 0.5 | 1.0 |
| Data 2 | 1.0 | 2.0 | 0.5 |
| Data 3 | 0.5 | 2.5 | 1.5 |
| Data 4 | 2.0 | 0.5 | 2.0 |
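The same matrix in code (PyTorch, reused in the rest of the walkthrough):

```python
import torch

# Raw teacher scores from the table above: B x K,
# i.e. data points as rows, prototypes as columns.
scores = torch.tensor([
    [2.5, 0.5, 1.0],   # Data 1
    [1.0, 2.0, 0.5],   # Data 2
    [0.5, 2.5, 1.5],   # Data 3
    [2.0, 0.5, 2.0],   # Data 4
])
```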
These scores represent the similarities between each data point and the prototypes. However, they are not yet balanced or normalized.
Step 2: Apply Softmax (Sharpening)
Before applying Sinkhorn-Knopp, we typically apply a softmax (optionally with a temperature, which sharpens the output) to each row. This converts the raw scores into a probability distribution over the prototypes for each data point. After softmax (values rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.74 | 0.10 | 0.16 |
| Data 2 | 0.23 | 0.63 | 0.14 |
| Data 3 | 0.09 | 0.67 | 0.24 |
| Data 4 | 0.45 | 0.10 | 0.45 |
Now each row sums to 1 (since we applied softmax row-wise), but the columns are not balanced yet: with B = 4 data points and K = 3 prototypes, a balanced column should sum to B/K ≈ 1.33, while the actual column sums are roughly 1.51, 1.49, and 1.00.
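Continuing the snippet above, the softmax and the resulting sums can be checked directly:

```python
probs = scores.softmax(dim=1)   # one distribution over prototypes per data point
print(probs.round(decimals=2))  # matches the table above
print(probs.sum(dim=1))         # rows:    tensor([1., 1., 1., 1.])
print(probs.sum(dim=0))         # columns: ~[1.51, 1.49, 1.00] -- not balanced
```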
Step 3: Sinkhorn-Knopp Iteration
The Sinkhorn-Knopp algorithm alternates between normalizing rows and columns. In this B × K view:
- Row normalization: Rescale each row so that every data point's assignments sum to 1. The softmax has already done this once.
- Column normalization: Rescale each column so that every prototype receives the same total weight, B/K ≈ 1.33 here. After the first column normalization (rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.65 | 0.09 | 0.22 |
| Data 2 | 0.20 | 0.56 | 0.19 |
| Data 3 | 0.08 | 0.59 | 0.33 |
| Data 4 | 0.40 | 0.09 | 0.60 |
Each column now sums to B/K ≈ 1.33, so every prototype receives an equal share of the batch. But the row sums have drifted away from 1 (for example, Data 4's row now sums to about 1.09), which is exactly why the algorithm alternates and iterates.
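The column step is a one-liner. Continuing the snippet:

```python
B, K = probs.shape                                     # B = 4 data points, K = 3 prototypes
Q = probs / probs.sum(dim=0, keepdim=True) * (B / K)   # rescale each column to B/K
print(Q.sum(dim=0))  # ~[1.33, 1.33, 1.33]: prototypes now balanced
print(Q.sum(dim=1))  # rows drifted from 1 -> motivates the next iteration
```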
Step 4: Repeat Iterations
The algorithm continues alternating between row and column normalization. After 3 to 5 iterations, the matrix effectively converges: every row sums to 1 and every column sums to B/K.
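Putting the pieces together, the whole walkthrough fits in a short loop (still in the B × K display orientation; a K × B implementation, as in the earlier sketch, is simply the transpose):

```python
Q = scores.softmax(dim=1)  # starting point from Step 2
B, K = Q.shape
for _ in range(5):
    Q = Q / Q.sum(dim=0, keepdim=True) * (B / K)  # columns -> B/K (balanced prototypes)
    Q = Q / Q.sum(dim=1, keepdim=True)            # rows    -> 1   (valid distributions)
print(Q.round(decimals=2))  # ~ the final table below
```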
Final Balanced Matrix
After running Sinkhorn-Knopp on our example, the matrix converges to (rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.68 | 0.09 | 0.23 |
| Data 2 | 0.22 | 0.58 | 0.20 |
| Data 3 | 0.08 | 0.58 | 0.34 |
| Data 4 | 0.36 | 0.08 | 0.56 |
Now every row (each data point's assignment distribution) sums to 1, and every column (each prototype's total weight) sums to B/K ≈ 1.33, so the assignments are balanced and consistent: each data point spreads one unit of weight, and each prototype receives an equal share of the batch.
Why is Sinkhorn-Knopp Important in DINOv2?
- Balanced Assignment: The Sinkhorn-Knopp algorithm ensures that each prototype receives an equal share of the soft assignments. This prevents degenerate solutions in which some prototypes never get assigned any data, which would harm the learning process.
- Optimal Transport: Sinkhorn-Knopp computes an approximate solution to an entropy-regularized optimal transport problem; in simple terms, it finds the best way to assign data points to prototypes while keeping the assignments balanced.
- Regularization: By balancing the assignments, it encourages the student network to learn diverse and useful representations for the entire dataset, rather than focusing too much on any specific cluster or prototype.
Summary
The Sinkhorn-Knopp algorithm in DINOv2 plays a critical role in ensuring balanced cluster assignments during self-supervised learning. By iteratively normalizing the teacher's outputs, it ensures that all prototypes get equal participation, preventing imbalances and promoting better representation learning.