19 Dec 2025

I Cloned Myself with a Markov Chain

Mateo Lafalce - Blog

By feeding raw WhatsApp chat logs into a Python script, I created a text generator that mimics specific speaking styles with surprising accuracy.

How It Works

This project runs on pure probability. It implements a Markov Chain of order N (I defaulted to ). Here is the breakdown:

  1. Ingestion: The script parses standard WhatsApp .txt exports, filtering out system messages to isolate the target author's raw speech patterns.
  2. Tokenization: It cleans the text, stripping non-alphanumeric characters and converting everything to lowercase to normalize the dataset.
  3. Training: It builds a dictionary that serves as the state map. Each key is a sequence of N words, and each value is a list of every word that followed that sequence in the chat history.
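
To make the three steps concrete, here is a minimal sketch of how they could fit together. It is not the project's actual code: the function names (parse_messages, tokenize, train) are made up for illustration, and the line pattern assumes an Android-style export such as "12/19/25, 10:15 PM - Name: message". Real exports differ by platform and locale, so the regex would likely need adjusting.

```python
import re
from collections import defaultdict

# Assumed Android-style export line: "12/19/25, 10:15 PM - Name: message".
# Real exports vary by platform and locale, so adjust the pattern as needed.
LINE_RE = re.compile(
    r"^\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[APap][Mm])? - ([^:]+): (.*)$"
)


def parse_messages(lines, author):
    """Yield message bodies written by `author`, skipping everything else."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # system notices and continuation lines lack a "Name: " part
        name, body = match.groups()
        if name == author and body != "<Media omitted>":
            yield body


def tokenize(text):
    """Lowercase, drop characters that are neither alphanumeric nor whitespace, split."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def train(messages, order):
    """Map each tuple of `order` words to the list of words that followed it."""
    chain = defaultdict(list)
    for message in messages:
        tokens = tokenize(message)
        for i in range(len(tokens) - order):
            chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain
```

Calling train(parse_messages(lines, "Mateo"), order=2) on a list of export lines would return the state map for that author. The order of 2 here is only an example value. Keeping duplicates in each follower list is a deliberate choice in this sketch: it stores the observed frequencies without a separate counter.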

The Output

The generator takes a seed phrase and slides a window forward, picking each next word at random, weighted by how often it followed the current window in the chat history.
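
The sliding window can be sketched in a few lines. This is an illustrative version under my own assumptions, not the project's code: the generate function, its length and rng parameters, and the toy chain are all invented for the example. The chain is assumed to map tuples of words to lists of observed followers, with repeats kept.

```python
import random


def generate(chain, seed, length=20, rng=random):
    """Extend `seed` (a tuple of words matching the chain's order) word by word."""
    order = len(seed)
    output = list(seed)
    for _ in range(length):
        followers = chain.get(tuple(output[-order:]))
        if not followers:
            break  # dead end: this sequence was never followed by anything
        # Followers are stored with repeats, so a uniform choice over the list
        # is already weighted by how often each word appeared.
        output.append(rng.choice(followers))
    return " ".join(output)


# Toy chain of order 2, invented purely for illustration.
toy_chain = {
    ("see", "you"): ["tomorrow", "tomorrow", "later"],
    ("you", "tomorrow"): ["then"],
}
print(generate(toy_chain, ("see", "you")))
```

With this toy chain the output is either "see you tomorrow then" or "see you later", and the first is twice as likely because "tomorrow" appears twice in the follower list. Generation stops early whenever the current window has no recorded follower.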

The result is a lightweight ghost of the user. It captures their vocabulary, distinctive phrasing, and sentence rhythm remarkably well, even if the logic occasionally veers into total nonsense. It’s a fun, low-compute way to preserve someone's digital vibe.

Check out the code.

