31 Dec 2025

Understanding Decision Trees: The White Box of Machine Learning

Mateo Lafalce - Blog

In the world of Machine Learning, the Decision Tree stands out for one specific reason: clarity. While neural networks often act as "black boxes" where the logic is hidden, a Decision Tree is a "white box" model. You can see exactly how it thinks.

At its core, a Decision Tree is a flowchart-like structure used for both classification (predicting a category) and regression (predicting a value). Imagine playing a game of 20 Questions. You ask a series of Yes/No questions to narrow down the possibilities until you arrive at an answer. That is exactly how this algorithm functions.

How It Works: The Gini Impurity

A Decision Tree doesn't just guess which questions to ask; it calculates them using specific metrics. One of the most common metrics in classification trees is the Gini Impurity (entropy is another).

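Gini measures how mixed the classes in a group of data are. If p_i is the fraction of items in the node that belong to class i, then Gini = 1 - sum(p_i^2).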

Gini = 0: Perfect purity. All items in the node belong to the same class.
Gini > 0: Impurity. The node contains a mix of different classes.

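When the tree tries to split the data, it calculates the Gini score for each of the resulting groups and averages them, weighting each group by its size. At every node the algorithm picks the question with the lowest weighted Gini score, so the data becomes more organized with every step. The choice is greedy: it is the best split for the current node, not necessarily the one that leads to the best overall tree.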
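To make the split selection concrete, here is a minimal sketch in plain Python. The helper names (`gini`, `weighted_gini`, `best_threshold`) and the toy credit-score data are made up for illustration; real libraries use faster and more careful implementations.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Average Gini of two child groups, weighted by group size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_threshold(values, labels):
    """Try the midpoint between each pair of neighbouring distinct values
    and return (threshold, weighted_gini) for the lowest-scoring split."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = weighted_gini(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: credit scores and loan outcomes.
scores = [580, 620, 650, 700, 720, 760]
outcomes = ["deny", "deny", "deny", "approve", "approve", "approve"]

print(gini(outcomes))                    # impurity before splitting
print(best_threshold(scores, outcomes))  # best question: "score < threshold?"
```

A full tree builder would repeat this search over every feature, pick the overall winner, and then recurse into each child group.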

graph TD
    %% Root Node
    Start(("Start Loan Application")) --> A{"Is FICO Score >= 670?"}

    %% Decision 1
    A -- "No (Score < 670)" --> B("DENIED: High Credit Risk"):::denied
    A -- "Yes" --> C{"Is DTI Ratio < 43%?"}

    %% Decision 2
    C -- "No (DTI Too High)" --> D("DENIED: Excessive Debt"):::denied
    C -- "Yes" --> E{"Job History > 2 Years?"}

    %% Decision 3
    E -- "No (Unstable)" --> F("DENIED: Unstable Employment"):::denied
    E -- "Yes" --> G("LOAN APPROVED"):::approved

    %% Styling
    classDef denied fill:#ffdddd,stroke:#cc0000,stroke-width:2px,color:#990000;
    classDef approved fill:#ddffdd,stroke:#00cc00,stroke-width:2px,color:#005500;
    
    style A fill:#f9f9f9,stroke:#333,stroke-width:2px
    style C fill:#f9f9f9,stroke:#333,stroke-width:2px
    style E fill:#f9f9f9,stroke:#333,stroke-width:2px
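
Because the model is a white box, the loan flowchart above can be read directly as a chain of conditionals. The sketch below hand-codes that diagram; the function and parameter names are hypothetical, and the DTI ratio is assumed to be given in percent.

```python
def decide_loan(fico_score, dti_percent, years_employed):
    """Walk the loan flowchart from the root to a leaf and return its label."""
    # Decision 1: credit score
    if fico_score < 670:
        return "DENIED: High Credit Risk"
    # Decision 2: debt-to-income ratio (assumed to be in percent)
    if dti_percent >= 43:
        return "DENIED: Excessive Debt"
    # Decision 3: job history
    if years_employed <= 2:
        return "DENIED: Unstable Employment"
    return "LOAN APPROVED"


print(decide_loan(fico_score=700, dti_percent=30, years_employed=5))
print(decide_loan(fico_score=640, dti_percent=30, years_employed=5))
```

A trained tree produces rules of the same shape, except that the thresholds are learned from data rather than written by hand.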

When to Use

Explainability is key: You need to explain the logic to stakeholders (e.g., why a loan was rejected).
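Data requires little prep: Splits depend only on how values are ordered, so trees don't need feature scaling or normalization. They work with both numerical and categorical data, although some implementations (scikit-learn, for example) expect categorical features to be encoded as numbers first.
Feature selection: You want to know which variables matter the most. Features used near the top of the tree are often among the most important, because those splits affect the most samples.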

When Not to Use

Overfitting is a concern: Left unconstrained, trees can easily become too complex, memorizing noise in the training data rather than finding patterns.
Stability matters: Decision trees are unstable; a small change in the training data can result in a completely different tree structure.
Complex relationships exist: For tasks like image recognition or natural language processing, trees are generally too weak on their own (though they are powerful when combined into Random Forests).
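
The overfitting point above is easy to see with a small experiment. This is a rough sketch, assuming NumPy and scikit-learn are installed; the data is pure random noise generated for illustration, so there is no real pattern to learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Random features with random labels: nothing here can be predicted.
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 2, size=200)
X_test = rng.normal(size=(1000, 5))
y_test = rng.integers(0, 2, size=1000)

# An unconstrained tree keeps splitting until every leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth is one common way to rein the tree in.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

deep_train_acc = deep_tree.score(X_train, y_train)
deep_test_acc = deep_tree.score(X_test, y_test)
shallow_train_acc = shallow_tree.score(X_train, y_train)

print(f"Deep tree:    train={deep_train_acc:.2f}, test={deep_test_acc:.2f}")
print(f"Shallow tree: train={shallow_train_acc:.2f}, depth={shallow_tree.get_depth()}")
```

The deep tree fits the training noise perfectly while doing no better than a coin flip on new data. Settings such as `max_depth` or `min_samples_leaf` trade some training accuracy for a simpler tree.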

Python Implementation

Here is an example using the classic Iris dataset.
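
A minimal sketch of such an example, assuming scikit-learn is installed. The 70/30 split, `max_depth=3`, and `random_state=42` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data and hold out 30% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Train a shallow tree that uses Gini impurity to choose its splits.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out data.
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")

# The "white box": print the learned rules as plain text.
print(export_text(clf, feature_names=iris.feature_names))

# Which features mattered the most?
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The printed rules read like the loan flowchart earlier in the post: a sequence of threshold questions that ends in a predicted class.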
