Project Prometheus is my ambitious quest to create a digital twin – an AI that doesn't just mimic me, but truly captures the essence of *me*. The core question is: how close can we get to an AI that feels indistinguishable from a conversation with the real me? It's an exploration into digitally mirroring a human personality.
So, how do we bring this digital me to life? The 'brain' is Hermes, a sophisticated language model adept at text-based conversations. But for a true digital twin, it needs *my* voice. This is where the challenge lies. I'm exploring cutting-edge Text-to-Speech (TTS) technologies, specifically Tortoise TTS and Coqui TTS, to find the perfect vocal match.
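To make the 'brain' half of this concrete, here's a rough sketch of generating a reply with a Hermes-family chat model through Hugging Face transformers. The exact checkpoint name is an assumption on my part; any Hermes chat checkpoint with a chat template should slot in the same way.

```python
# Rough sketch of the conversational 'brain': generating a reply with a
# Hermes-family chat model via Hugging Face transformers. The checkpoint
# name below is an assumption, not the project's actual configuration.
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="NousResearch/Hermes-2-Pro-Llama-3-8B",  # assumed checkpoint name
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are my digital twin. Answer the way I would."},
    {"role": "user", "content": "How is Project Prometheus going?"},
]

# With chat-style input, the pipeline returns the full conversation,
# with the newly generated assistant message appended at the end.
result = chat(messages, max_new_tokens=200)
print(result[0]["generated_text"][-1]["content"])
```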
For these TTS models to replicate *my* voice, they need to learn from it. This means feeding them high-quality audio samples – not just words, but a diverse range of sounds, intonations, and even emotional expressions. The goal is to capture the nuances of my natural speech. As a starting point, I've recorded a comprehensive audio sample, including all alphabetic sounds and my name, to benchmark the learning process. Listen to the input below.
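Whatever model ends up doing the cloning, the reference audio benefits from a little preparation first: mono, a consistent sample rate, and no long stretches of silence. Here's a minimal clean-up sketch, assuming librosa and soundfile and a 24 kHz target rate; the file names and that target rate are my assumptions, so check the chosen model's docs for its preferred format.

```python
# Sketch: clean up a raw reference recording before handing it to a TTS model.
# File names and the 24 kHz target rate are assumptions; check the target
# model's documentation for its expected sample rate.
import librosa
import soundfile as sf

TARGET_SR = 24_000  # many cloning models expect mono audio around 22-24 kHz

# Load as mono and resample in one step.
audio, _ = librosa.load("raw_reference.wav", sr=TARGET_SR, mono=True)

# Trim leading/trailing silence so the model only learns from speech.
trimmed, _ = librosa.effects.trim(audio, top_db=30)

# Peak-normalise so different takes end up at a consistent level.
trimmed = trimmed / max(abs(trimmed).max(), 1e-8)

sf.write("reference_clean.wav", trimmed, TARGET_SR)
print(f"Saved {len(trimmed) / TARGET_SR:.1f} s of cleaned reference audio")
```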
Finding the *right* TTS model was a significant undertaking. The high-quality commercial TTS space is often closed-source and expensive, making truly exceptional open-source alternatives a rare find. After extensive research, I've narrowed the field to a few promising contenders. Let's dive into how they compare.
First on the list is Coqui TTS, a prominent name in the open-source TTS landscape. It's a versatile library boasting various models like Tacotron, VITS, and Glow-TTS. Coqui excels in multilingual support, voice cloning, and speaker adaptation. It also provides robust tools for custom model training and fine-tuning, backed by a supportive community. This makes it a strong candidate for both experimental and production-level applications.
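To give a feel for what Coqui's Python API looks like, here's a minimal voice-cloning sketch using its XTTS v2 checkpoint. The model ID and file paths are assumptions on my side; the Coqui docs list what's actually available in a given release.

```python
# Sketch: zero-shot voice cloning with Coqui TTS and the XTTS v2 checkpoint.
# The model ID and file paths are assumptions; consult the Coqui TTS docs
# for the models shipped with the installed version.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Hello, this is my digital twin speaking.",
    speaker_wav="reference_clean.wav",  # the cleaned sample from earlier
    language="en",
    file_path="coqui_clone.wav",
)
```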
Next, we have E2/F5 TTS. It's a capable general-purpose TTS system in its own right, but it particularly stands out for its voice cloning, especially its 'one-shot' cloning: it can generate a remarkably accurate voice mimicry from just a brief audio sample. Let's hear how it performs.
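For context, a one-shot generation run looks roughly like the snippet below. I'm invoking the F5-TTS command-line tool here, and the CLI name and flags are assumptions from my reading of the project's README, so treat this as a sketch rather than a definitive recipe.

```python
# Sketch: one-shot cloning with F5-TTS through its command-line interface.
# The CLI name and flags are assumptions based on the F5-TTS README;
# verify them against the installed version before relying on this.
import subprocess

subprocess.run(
    [
        "f5-tts_infer-cli",
        "--model", "F5-TTS",
        "--ref_audio", "reference_clean.wav",   # the short reference clip
        "--ref_text", "A transcript of the reference clip.",
        "--gen_text", "Hello, this is my digital twin speaking.",
    ],
    check=True,
)
```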
Another notable option in the voice synthesis space is Resemble.ai. While often known for its comprehensive platform and enterprise solutions, Resemble.ai also offers powerful voice cloning technology, including rapid cloning from minimal data. Their focus is on creating highly realistic and emotionally nuanced synthetic voices. Let's evaluate its output.
After rigorous comparison, E2/F5 TTS emerges as the standout choice for realistic voice cloning in this project. Its 'one-shot' capability is particularly impressive, capturing the subtle nuances of my voice from minimal input. It also demonstrates superior performance in maintaining vocal authenticity and avoiding common accent artifacts, resulting in a cleaner and more convincing voice clone.
With the voice synthesis refined, the next frontier is visual representation. The aim is to create a digital likeness – a 3D avatar that not only resembles me but can also be animated and controlled. This visual component is crucial for achieving a truly interactive and believable digital twin.
Creating a high-fidelity digital human from scratch is complex, and truly open-source end-to-end solutions are scarce. However, the video game industry offers a wealth of tools and techniques for realistic character creation. They're pioneers in facial scanning, motion capture, and creating lifelike digital actors. This provides a promising avenue: leveraging game development technologies to build a rigged facial mesh for lip-syncing and expressions. The initial step is a detailed 3D scan of my head and shoulders, which will then be transformed into a functional facial mesh.
For capturing the initial 3D model, I'm exploring tools like RealityScan. This app, developed by Epic Games, allows for creating 3D models from a series of photos taken with a smartphone, leveraging photogrammetry techniques. This technology can provide a detailed starting point for the digital likeness.
Before diving into a full head scan, I tested RealityScan on my hand first. This little experiment taught me some important things about photogrammetry: you need to capture multiple angles in vertical slices around the whole subject for good coverage. Staying completely still is crucial, since the software needs stable reference points to stitch the images together properly. Soft ambient lighting works much better than harsh directional light, which creates shadows. And the subject should be as matte as possible, so reflections don't confuse the image boundaries. These lessons will definitely come in handy when I tackle the more complex head scan.
*Embedded 3D model: "Hand test" by edwinkassier on Sketchfab.*
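Once a scan is exported, it's worth a quick sanity check before any rigging work. Here's a small sketch using trimesh; the file name and format are assumptions, since RealityScan exports can be converted to OBJ or GLB for this kind of inspection.

```python
# Sketch: a quick sanity check on an exported scan before rigging it.
# The file name and format are assumptions; convert the RealityScan export
# to OBJ/GLB (or whatever trimesh can read) for this kind of inspection.
import trimesh

mesh = trimesh.load("hand_test.obj", force="mesh")

print(f"vertices: {len(mesh.vertices)}")
print(f"faces: {len(mesh.faces)}")
print(f"watertight: {mesh.is_watertight}")
print(f"bounding box extents: {mesh.extents}")
```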