


The pursuit of digital immortality.
Overview
Project Prometheus is about creating a digital twin – an AI that captures who I am, not just my mannerisms. The question: how close can AI get to being indistinguishable from talking to the real me?
Architecture
The 'brain' is Hermes, a language model that handles conversation. But a digital twin also needs my voice, so I'm testing text-to-speech (TTS) technologies, starting with Tortoise TTS and Coqui TTS, to replicate it.
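The overall flow is text in, speech out. A minimal sketch of that two-stage pipeline is below; the function names are hypothetical stand-ins, not the actual Hermes or TTS APIs.

```python
# Sketch of the twin's conversation pipeline: LLM -> TTS.
# `generate_reply` and `synthesize_speech` are hypothetical placeholders
# for the Hermes model and the TTS engine.

def generate_reply(prompt: str) -> str:
    """Stand-in for Hermes: produce a text reply to the user's prompt."""
    return f"(Hermes reply to: {prompt})"

def synthesize_speech(text: str) -> bytes:
    """Stand-in for the TTS engine: turn reply text into audio bytes."""
    return text.encode("utf-8")  # placeholder for real waveform data

def respond(prompt: str) -> bytes:
    """Full pipeline: user text in, cloned-voice audio out."""
    reply = generate_reply(prompt)
    return synthesize_speech(reply)

audio = respond("How's the project going?")
```

Keeping the two stages separate means the voice can be swapped (Tortoise, Coqui, or anything else) without touching the language model.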
Training
TTS models learn from audio samples, so I recorded clips with varied sounds, intonations, and expressions to capture how I speak. The recording covers the full range of letter sounds, plus my name, as a baseline for training.
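A quick way to sanity-check a recording script is to verify it covers every letter sound at least once. The snippet below does a crude letter-coverage check as a stand-in for proper phoneme coverage; the sample transcripts are illustrative, not my actual recordings.

```python
import string

def missing_letters(transcripts: list[str]) -> set[str]:
    """Return alphabet letters that never appear in the recording transcripts.

    A crude proxy for phonetic coverage: real pipelines check phonemes,
    but letter coverage catches obvious gaps in a recording script.
    """
    seen = set("".join(transcripts).lower())
    return set(string.ascii_lowercase) - seen

script = [
    "The quick brown fox jumps over the lazy dog",  # pangram: covers a-z
    "Hello, this is my voice sample.",
]
gaps = missing_letters(script)  # empty set: the pangram covers everything
```

Swapping the letter set for an ARPAbet phoneme inventory would make this a real coverage check.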
Evaluation
Finding the right TTS model took time. High-quality commercial options are closed-source and expensive. Good open-source alternatives are rare. After research, I found a few worth testing.
Open Source
Coqui TTS is an open-source library with models like Tacotron, VITS, and Glow-TTS. It supports voice cloning, multilingual speech, and custom training. It has an active community and works for both testing and production.
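As an illustration, voice cloning with Coqui's XTTS model follows the pattern below. This is a sketch assuming the `TTS` package is installed; the model name is Coqui's published XTTS v2 identifier, and the file paths are examples rather than my actual setup.

```python
# Hedged sketch: voice cloning with Coqui TTS (pip install TTS).
# Paths are examples; MODEL_NAME is Coqui's published XTTS v2 identifier.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"

def clone_to_file(text: str, speaker_wav: str, out_path: str = "clone.wav") -> str:
    """Synthesize `text` in the voice from `speaker_wav` and write a WAV file."""
    from TTS.api import TTS  # imported lazily so the sketch loads without the package
    tts = TTS(MODEL_NAME)
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,  # reference recording of the target voice
        language="en",
        file_path=out_path,
    )
    return out_path

# Example usage (downloads the model on first run):
# clone_to_file("Hello, this is my digital twin.", "my_voice_sample.wav")
```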
Flow Matching
E2 TTS and F5 TTS are flow-matching models focused on voice cloning. Their 'one-shot' cloning can replicate a voice from a single short audio sample.
Commercial
Resemble.ai is a voice synthesis platform with cloning capabilities. It can clone voices from minimal data and focuses on realistic, expressive output.
Result
After testing, E2/F5 TTS performed best for voice cloning: it captures my voice accurately from minimal input and avoids the accent artifacts common in other models.
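My comparison was by ear, but the same judgment can be framed numerically: extract speaker embeddings from a reference clip and from each model's output, then rank candidates by cosine similarity. The embeddings below are hypothetical stand-ins for what a speaker-verification model would produce.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings: a real pipeline would obtain these from a
# speaker-verification model run on the reference and each cloned clip.
reference = [0.9, 0.1, 0.3]
candidates = {
    "coqui_xtts": [0.70, 0.30, 0.20],
    "f5_tts": [0.88, 0.12, 0.31],
}

# Rank models by how close their output sits to the reference voice.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(reference, candidates[name]),
    reverse=True,
)
```

With these made-up vectors the F5 clone sits closest to the reference, mirroring the by-ear result; a real evaluation would use embeddings from an actual speaker-verification model.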
Visual
With voice synthesis working, the next step is visual representation. The goal is a 3D avatar that resembles me and can be animated—needed for a complete digital twin.
Scanning
For the 3D model, I'm using RealityScan, an app by Epic Games that creates 3D models from smartphone photos using photogrammetry.
Creating a digital human from scratch is complex, so I'm using game-development technology to build a facial mesh that supports lip-syncing and expressions.
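Lip-syncing a facial mesh typically means mapping phonemes to visemes, the mouth shapes that drive the mesh's blendshapes. The table below is a simplified, hypothetical mapping, not the one any particular engine ships with.

```python
# Simplified phoneme -> viseme mapping for driving facial-mesh blendshapes.
# Phoneme symbols follow ARPAbet; the viseme names are illustrative.
PHONEME_TO_VISEME = {
    "AA": "open",       # as in "father"
    "IY": "wide",       # as in "see"
    "UW": "round",      # as in "boot"
    "M": "closed", "B": "closed", "P": "closed",   # lips together
    "F": "teeth_lip", "V": "teeth_lip",            # teeth on lower lip
}

def visemes_for(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to the viseme keyframes the mesh should hit."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

frames = visemes_for(["M", "AA", "M", "AA"])  # e.g. the word "mama"
```

At playback time each viseme becomes a blendshape weight keyed to the audio timeline, which is what makes the mesh's mouth track the cloned voice.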
AI Video
I also tested Google's VEO3 video generation model. It creates video from text prompts and reference images—another way to generate a moving digital version. The clip below was generated directly from my reference images.