Decoding the DINOv2 framework
The Sinkhorn-Knopp algorithm is an iterative method for transforming a matrix into a doubly stochastic matrix, one in which every row and every column sums to 1. In the context of DINOv2 (and self-supervised learning more broadly), it is used to compute balanced assignments of data points to prototypes (clusters), which can be viewed as an approximate optimal-transport problem. Let's break it down and then dive into an example within the DINOv2 framework.
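To make the basic mechanism concrete, here is a minimal, self-contained PyTorch sketch (the function name and iteration count are illustrative choices, not from the DINOv2 codebase) that turns a positive square matrix into an approximately doubly stochastic one by alternating row and column normalization:

```python
import torch

def make_doubly_stochastic(M: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Alternate row and column normalization (Sinkhorn-Knopp) on a positive matrix."""
    for _ in range(n_iters):
        M = M / M.sum(dim=1, keepdim=True)  # rescale each row to sum to 1
        M = M / M.sum(dim=0, keepdim=True)  # rescale each column to sum to 1
    return M

M = make_doubly_stochastic(torch.rand(3, 3) + 0.1)
print(M.sum(dim=1))  # ~tensor([1., 1., 1.])
print(M.sum(dim=0))  # ~tensor([1., 1., 1.])
```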
Sinkhorn-Knopp in DINOv2
In DINOv2, we train a student network to match the predictions of a teacher network. The teacher produces a soft assignment of each data point to a set of prototypes (clusters), represented as a probability distribution over the prototypes. We want these assignments to be balanced across the entire batch, meaning that every prototype gets roughly an equal share of data points.
This is where Sinkhorn-Knopp comes in: it normalizes the teacher's predictions (or soft assignments) so that:
- Each row of the (K × B) assignment matrix sums to 1/K, ensuring that every prototype receives the same total weight.
- Each column sums to 1/B, ensuring that every data point contributes the same total weight, spread across the prototypes.
Steps of Sinkhorn-Knopp in DINOv2
Here’s a typical flow in DINOv2:
- Teacher output: The teacher network provides an output matrix of size K × B, where K is the number of prototypes (or clusters) and B is the batch size (number of data points). This matrix contains the raw similarities or scores between data points and prototypes; these scores are not yet valid, balanced probability distributions.
- Sinkhorn normalization: The Sinkhorn-Knopp algorithm takes this matrix and iteratively normalizes it:
  - Row normalization rescales each row so that every prototype's total weight is 1/K, distributing the assignments equally across all prototypes.
  - Column normalization rescales each column so that every data point's total weight is 1/B, so each data point contributes equally.
- Balanced assignment: After a few alternating iterations, the matrix satisfies both constraints (approximately), so every prototype gets a balanced share of the data. This is crucial for avoiding degenerate solutions, where some prototypes never receive any assignments.
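A minimal PyTorch sketch of this flow, loosely modeled on the SwAV/DINO-family implementations (the function name, the temperature `eps`, and the iteration count are illustrative assumptions, not the exact DINOv2 code):

```python
import torch

def sinkhorn_knopp(scores: torch.Tensor, eps: float = 0.05, n_iters: int = 3) -> torch.Tensor:
    """Balance a K x B teacher score matrix so rows sum to 1/K and columns to 1/B."""
    logits = scores / eps                 # eps acts as a sharpening temperature
    Q = torch.exp(logits - logits.max())  # positive entries, numerically stable
    Q = Q / Q.sum()                       # total mass = 1
    K, B = Q.shape
    for _ in range(n_iters):
        Q = Q / (K * Q.sum(dim=1, keepdim=True))  # each row (prototype)    -> 1/K
        Q = Q / (B * Q.sum(dim=0, keepdim=True))  # each column (data point) -> 1/B
    return Q

Q = sinkhorn_knopp(torch.randn(3, 8))  # K = 3 prototypes, B = 8 samples
print(Q.sum(dim=1))  # ~1/3 per prototype
print(Q.sum(dim=0))  # ~1/8 per data point
```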
Example of Sinkhorn-Knopp in DINOv2
Let’s walk through an example to illustrate how Sinkhorn-Knopp operates in DINOv2:
Step 1: Teacher Output (Raw Scores)
Suppose we have K = 3 prototypes and a batch size of B = 4. The teacher network produces the following raw scores (before normalization). For readability, the matrix is shown transposed as B × K, with data points as rows and prototypes as columns:
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 2.5 | 0.5 | 1.0 |
| Data 2 | 1.0 | 2.0 | 0.5 |
| Data 3 | 0.5 | 2.5 | 1.5 |
| Data 4 | 2.0 | 0.5 | 2.0 |
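The same matrix in code (PyTorch, reused in the rest of the walkthrough):

```python
import torch

# Raw teacher scores from the table above: B x K,
# i.e. data points as rows, prototypes as columns.
scores = torch.tensor([
    [2.5, 0.5, 1.0],   # Data 1
    [1.0, 2.0, 0.5],   # Data 2
    [0.5, 2.5, 1.5],   # Data 3
    [2.0, 0.5, 2.0],   # Data 4
])
```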
These scores represent the similarities between each data point and the prototypes. However, they are not yet balanced or normalized.
Step 2: Apply Softmax (Sharpening)
Before applying Sinkhorn-Knopp, we typically apply a softmax (optionally with a temperature, which sharpens the output) to each row. This converts the raw scores into a probability distribution over the prototypes for each data point. After softmax (values rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.74 | 0.10 | 0.16 |
| Data 2 | 0.23 | 0.63 | 0.14 |
| Data 3 | 0.09 | 0.67 | 0.24 |
| Data 4 | 0.45 | 0.10 | 0.45 |
Now each row sums to 1 (since we applied softmax row-wise), but the columns are not balanced yet: with B = 4 data points and K = 3 prototypes, a balanced column should sum to B/K ≈ 1.33, while the actual column sums are roughly 1.51, 1.49, and 1.00.
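Continuing the snippet above, the softmax and the resulting sums can be checked directly:

```python
probs = scores.softmax(dim=1)   # one distribution over prototypes per data point
print(probs.round(decimals=2))  # matches the table above
print(probs.sum(dim=1))         # rows:    tensor([1., 1., 1., 1.])
print(probs.sum(dim=0))         # columns: ~[1.51, 1.49, 1.00] -- not balanced
```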
Step 3: Sinkhorn-Knopp Iteration
The Sinkhorn-Knopp algorithm alternates between normalizing rows and columns. In this B × K view:
- Row normalization: Rescale each row so that every data point's assignments sum to 1. The softmax has already done this once.
- Column normalization: Rescale each column so that every prototype receives the same total weight, B/K ≈ 1.33 here. After the first column normalization (rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.65 | 0.09 | 0.22 |
| Data 2 | 0.20 | 0.56 | 0.19 |
| Data 3 | 0.08 | 0.59 | 0.33 |
| Data 4 | 0.40 | 0.09 | 0.60 |
Each column now sums to B/K ≈ 1.33, so every prototype receives an equal share of the batch. But the row sums have drifted away from 1 (for example, Data 4's row now sums to about 1.09), which is exactly why the algorithm alternates and iterates.
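The column step is a one-liner. Continuing the snippet:

```python
B, K = probs.shape                                     # B = 4 data points, K = 3 prototypes
Q = probs / probs.sum(dim=0, keepdim=True) * (B / K)   # rescale each column to B/K
print(Q.sum(dim=0))  # ~[1.33, 1.33, 1.33]: prototypes now balanced
print(Q.sum(dim=1))  # rows drifted from 1 -> motivates the next iteration
```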
Step 4: Repeat Iterations
The algorithm continues alternating between row and column normalization. After 3 to 5 iterations, the matrix effectively converges: every row sums to 1 and every column sums to B/K.
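Putting the pieces together, the whole walkthrough fits in a short loop (still in the B × K display orientation; a K × B implementation, as in the earlier sketch, is simply the transpose):

```python
Q = scores.softmax(dim=1)  # starting point from Step 2
B, K = Q.shape
for _ in range(5):
    Q = Q / Q.sum(dim=0, keepdim=True) * (B / K)  # columns -> B/K (balanced prototypes)
    Q = Q / Q.sum(dim=1, keepdim=True)            # rows    -> 1   (valid distributions)
print(Q.round(decimals=2))  # ~ the final table below
```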
Final Balanced Matrix
After running Sinkhorn-Knopp on our example, the matrix converges to (rounded to two decimals):
| Data Point | Prototype 1 | Prototype 2 | Prototype 3 |
|---|---|---|---|
| Data 1 | 0.68 | 0.09 | 0.23 |
| Data 2 | 0.22 | 0.58 | 0.20 |
| Data 3 | 0.08 | 0.58 | 0.34 |
| Data 4 | 0.36 | 0.08 | 0.56 |
Now every row (each data point's assignment distribution) sums to 1, and every column (each prototype's total weight) sums to B/K ≈ 1.33, so the assignments are balanced and consistent: each data point spreads one unit of weight, and each prototype receives an equal share of the batch.
Why is Sinkhorn-Knopp Important in DINOv2?
- Balanced Assignment: The Sinkhorn-Knopp algorithm ensures that each prototype receives an equal share of the soft assignments. This prevents degenerate solutions in which some prototypes never get assigned any data, which would harm the learning process.
- Optimal Transport: Sinkhorn-Knopp computes an approximate solution to an entropy-regularized optimal transport problem; in simple terms, it finds the best way to assign data points to prototypes while keeping the assignments balanced.
- Regularization: By balancing the assignments, it encourages the student network to learn diverse and useful representations for the entire dataset, rather than focusing too much on any specific cluster or prototype.
Summary
The Sinkhorn-Knopp algorithm in DINOv2 plays a critical role in ensuring balanced cluster assignments during self-supervised learning. By iteratively normalizing the teacher's outputs, it ensures that all prototypes get equal participation, preventing imbalances and promoting better representation learning.