3 Jan 2026

The One Cake Fallacy (Softmax vs. Sigmoid)

Mateo Lafalce

Why not just use Softmax for everything instead of Sigmoid?

The answer lies in the fundamental difference between competition and independence.

The core distinction is how these functions treat the output space. Softmax enforces competition. It normalizes the output vector $z$ such that the sum of all probabilities equals exactly 1.

Because of this constraint, if the probability of class $i$ increases, the probability of class $j$ must decrease mathematically. They are fighting for a slice of the same cake.
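
Concretely, softmax computes $p_i = e^{z_i} / \sum_j e^{z_j}$, so the $p_i$ sum to 1 by construction. A minimal NumPy sketch (the logit values are illustrative) makes the competition visible:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits))        # [0.63 0.23 0.14]
print(softmax(logits).sum())  # 1.0, always

logits[0] = 4.0               # raise one logit...
print(softmax(logits))        # ...and every other probability shrinks
```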

Sigmoid allows independence. Sigmoid squashes each individual output $z_i$ between 0 and 1 independently of the others.

Here, $\sum_i \sigma(z_i)$ can be greater than 1. The probability of class $i$ has no mathematical impact on class $j$.
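
Sigmoid is applied elementwise: $\sigma(z_i) = 1 / (1 + e^{-z_i})$. Running the same illustrative logits through it shows the difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.5])
print(sigmoid(logits))        # [0.88 0.73 0.62]
print(sigmoid(logits).sum())  # 2.23 -- the sum is free to exceed 1

logits[0] = 4.0               # raise one logit...
print(sigmoid(logits))        # ...and the other outputs are unchanged
```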

To visualize this, imagine the output layer of your network:

If you use Softmax for a multi-label problem, you confuse the network.

If the network detects a Dog with high probability, Softmax will force the probability of Grass to drop to satisfy the summation constraint. You are effectively penalizing the model for being correct about the second object.
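
As an illustration, consider hypothetical logits for an image that truly contains both a Dog and Grass. Softmax is forced to split the probability mass between them, while sigmoid can assert both at once:

```python
import numpy as np

labels = ["dog", "grass"]
logits = np.array([3.0, 3.0])  # the network is equally confident in both

softmax_p = np.exp(logits) / np.exp(logits).sum()
sigmoid_p = 1.0 / (1.0 + np.exp(-logits))

print(dict(zip(labels, np.round(softmax_p, 2))))  # {'dog': 0.5, 'grass': 0.5}
print(dict(zip(labels, np.round(sigmoid_p, 2))))  # {'dog': 0.95, 'grass': 0.95}
```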

Rule of Thumb: if exactly one class can be correct (multi-class classification), use Softmax. If several classes can be true at the same time (multi-label classification), use Sigmoid and give every class its own cake.

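In framework terms, this rule usually reduces to the choice of loss function. A minimal PyTorch sketch of the common pairing (the tensors are illustrative; both losses take raw logits and apply softmax or sigmoid internally):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)  # batch of 4 examples, 3 classes/labels

# Multi-class: exactly one correct class per example.
# CrossEntropyLoss = log-softmax + negative log-likelihood.
ce = nn.CrossEntropyLoss()
class_targets = torch.tensor([0, 2, 1, 0])  # one class index per example
loss_multiclass = ce(logits, class_targets)

# Multi-label: any subset of labels can be "on" per example.
# BCEWithLogitsLoss = per-output sigmoid + binary cross-entropy.
bce = nn.BCEWithLogitsLoss()
label_targets = torch.tensor([[1., 0., 1.],
                              [0., 1., 1.],
                              [1., 1., 0.],
                              [0., 0., 1.]])  # independent 0/1 targets
loss_multilabel = bce(logits, label_targets)
```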

This blog is open source. See an error? Go ahead and propose a change.