Processing Pipeline

After you upload a video, Toast runs up to six processing steps (noise removal runs only when the audio needs it). The steps run in sequence, and you can watch the progress of each one in real time.
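
The sequential flow with per-step progress can be sketched as a small runner. This is a minimal illustration, not Toast's actual code: the step names, the `PipelineStep` shape, and the `runPipeline` function are all hypothetical.

```typescript
// Hypothetical step names mirroring the six steps described on this page.
type StepName =
  | "audio-extraction"
  | "quality-analysis"
  | "noise-removal"
  | "transcription"
  | "face-tracking"
  | "ai-analysis";

type StepStatus = "running" | "done" | "skipped";

interface PipelineStep {
  name: StepName;
  // Optional guard: when it returns false, the step is skipped.
  shouldRun?: () => boolean;
  run: () => Promise<void>;
}

// Runs the steps strictly one after another and reports every status change.
async function runPipeline(
  steps: PipelineStep[],
  onProgress: (name: StepName, status: StepStatus) => void
): Promise<void> {
  for (const step of steps) {
    if (step.shouldRun !== undefined && !step.shouldRun()) {
      onProgress(step.name, "skipped");
      continue;
    }
    onProgress(step.name, "running");
    await step.run(); // the next step starts only after this one finishes
    onProgress(step.name, "done");
  }
}
```

A conditional step such as noise removal would supply a `shouldRun` guard; every other step omits it and always runs.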

Pipeline Steps

1. Audio Extraction

Toast uses FFmpeg (running directly in your browser via WebAssembly) to extract the audio track from your video. The audio is saved as a WAV file in your browser's secure storage.

A low-resolution proxy video (480p) is also generated for smooth playback during editing.
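
As a rough illustration, the two FFmpeg jobs in this step could use argument lists like the ones below. The flags are standard FFmpeg options, but the file names, sample rate, mono downmix, and encoder settings are assumptions; the arguments Toast actually passes to its WebAssembly build are not documented here.

```typescript
// Argument list for extracting the audio track as a WAV file.
function audioExtractionArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vn",                  // drop the video stream
    "-acodec", "pcm_s16le", // 16-bit PCM, the usual WAV payload
    "-ar", "16000",         // assumed sample rate
    "-ac", "1",             // assumed mono downmix
    output,
  ];
}

// Argument list for generating a 480p proxy video.
function proxyVideoArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vf", "scale=-2:480",  // 480 px tall; width follows the aspect ratio, rounded to an even number
    "-c:v", "libx264",      // assumed codec
    "-preset", "veryfast",  // assumed speed/size trade-off
    "-crf", "28",           // assumed quality level
    output,
  ];
}
```

For example, `audioExtractionArgs("input.mp4", "audio.wav")` yields the equivalent of `ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav`.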

2. Audio Quality Analysis

Toast measures the signal-to-noise ratio (SNR) of the extracted audio to decide whether noise removal is needed.

SNR Level	Quality	Action
25+ dB	Good	No noise removal needed
Below 25 dB	Noisy	Automatic noise removal triggered
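
The threshold in the table can be expressed in a few lines. This sketch uses the standard definition of SNR in decibels, 10 · log10(signal power / noise power). How Toast separates speech from background noise before measuring is not described on this page, so the caller is assumed to supply the two sample sets; all function names are hypothetical.

```typescript
const SNR_THRESHOLD_DB = 25; // threshold from the table above

// Mean power of a block of samples: the average of the squared amplitudes.
function meanPower(samples: number[]): number {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return sum / samples.length;
}

// SNR in decibels. A noise power of 0 yields Infinity, which counts as "good".
function snrDb(signalPower: number, noisePower: number): number {
  return 10 * Math.log10(signalPower / noisePower);
}

// Maps an SNR value to the quality label used in the table.
function classify(snr: number): "good" | "noisy" {
  return snr >= SNR_THRESHOLD_DB ? "good" : "noisy";
}
```

With a signal power of 1 and a noise power of 0.01, the ratio is 100, which is 20 dB, so the audio would be classified as noisy and sent for noise removal.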

3. Noise Removal (Conditional)

If the SNR is below 25 dB, Toast sends the audio to CleanVoice for AI-powered noise removal. The cleaned audio replaces the original for all subsequent processing.

Only the audio file is transmitted — never your video.

4. Transcription

Toast uses ElevenLabs Scribe v2 for transcription, providing:

Word-level timestamps — every word is precisely timed
99-language support — automatic language detection
Speaker diarization — identifies different speakers
Filler word detection — marks "um", "uh", "like", "you know", etc.
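
To show how word-level timestamps and filler marking combine, the sketch below collects the time ranges covered by filler words. The `TranscriptWord` shape is hypothetical and is not the actual Scribe response format; it only assumes that each word carries a start time, an end time, a speaker label, and a filler flag.

```typescript
// Hypothetical word record; field names are illustrative only.
interface TranscriptWord {
  text: string;
  start: number; // seconds
  end: number;   // seconds
  speaker: string;
  isFiller: boolean;
}

interface TimeRange {
  start: number;
  end: number;
}

// Collects the ranges covered by filler words. Consecutive filler words
// (such as "you" + "know") are merged when the gap between them is at
// most maxGap seconds.
function fillerRanges(words: TranscriptWord[], maxGap: number = 0.05): TimeRange[] {
  const ranges: TimeRange[] = [];
  let prevWasFiller = false;
  for (const w of words) {
    if (w.isFiller) {
      const last = ranges[ranges.length - 1];
      if (prevWasFiller && last !== undefined && w.start - last.end <= maxGap) {
        last.end = w.end; // extend the current filler range
      } else {
        ranges.push({ start: w.start, end: w.end });
      }
    }
    prevWasFiller = w.isFiller;
  }
  return ranges;
}
```

The merge step matters for multi-word fillers: without it, "you know" would produce two separate ranges with a tiny gap between them.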

5. Face Tracking

MediaPipe FaceMesh runs in a dedicated Web Worker to detect face positions throughout the video. This data powers:

Smart reframe — automatically crop to follow the speaker
Aspect ratio conversion — keep the face centered when converting 16:9 to 9:16
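
The geometry behind keeping a face centered during aspect ratio conversion can be sketched as follows. This is an illustration under simple assumptions (a single face center per frame, in pixel coordinates), not Toast's actual reframing logic; `cropForFace` and `CropRect` are hypothetical names.

```typescript
interface CropRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Computes the largest crop with the target aspect ratio that fits inside the
// frame, centered on the face where possible and clamped to the frame bounds.
function cropForFace(
  frameWidth: number,
  frameHeight: number,
  faceCenterX: number,
  faceCenterY: number,
  targetAspect: number // width / height, e.g. 9 / 16
): CropRect {
  const frameAspect = frameWidth / frameHeight;
  let width: number;
  let height: number;
  if (targetAspect < frameAspect) {
    // Target is narrower than the frame: keep full height, trim the sides.
    height = frameHeight;
    width = frameHeight * targetAspect;
  } else {
    // Target is wider than (or equal to) the frame: keep full width.
    width = frameWidth;
    height = frameWidth / targetAspect;
  }
  const clamp = (v: number, lo: number, hi: number): number =>
    Math.min(Math.max(v, lo), hi);
  const x = clamp(faceCenterX - width / 2, 0, frameWidth - width);
  const y = clamp(faceCenterY - height / 2, 0, frameHeight - height);
  return { x, y, width, height };
}
```

For a 1920×1080 frame converted to 9:16, the crop is 607.5 px wide and 1080 px tall. A face centered at x = 960 gives a crop starting at x = 656.25, while a face near the left edge (x = 100) is clamped to x = 0 so the crop never leaves the frame.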

6. AI Analysis

Toast sends the transcript (not the video) to Claude, or to an OpenAI model as a fallback, to generate an edit manifest: a JSON document that describes:

Which segments to keep and which to cut
Where filler words and dead air are located
Suggested B-roll placements
Reasoning for each edit decision

The edit manifest becomes the source of truth for your project. You can view the AI's reasoning in the AI Reasoning Panel.
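
To make the idea of an edit manifest concrete, here is one possible shape for it. The actual schema is not shown on this page, so every field name below is hypothetical; the sketch only mirrors the four kinds of information listed above.

```typescript
// Hypothetical manifest shape; not Toast's actual schema.
type SegmentAction = "keep" | "cut";

interface ManifestSegment {
  start: number;         // seconds
  end: number;           // seconds
  action: SegmentAction;
  reason: string;        // the reasoning recorded for this edit decision
}

interface BrollSuggestion {
  at: number;            // seconds into the source video
  description: string;
}

interface EditManifest {
  segments: ManifestSegment[];
  brollSuggestions: BrollSuggestion[];
}

// Illustrative data only.
const exampleManifest: EditManifest = {
  segments: [
    { start: 0, end: 12.5, action: "keep", reason: "Introduction" },
    { start: 12.5, end: 14, action: "cut", reason: "Dead air" },
    { start: 14, end: 30, action: "keep", reason: "Main point" },
  ],
  brollSuggestions: [{ at: 16, description: "Product close-up" }],
};

// Total duration of the segments marked "keep".
function keptDuration(manifest: EditManifest): number {
  let total = 0;
  for (const seg of manifest.segments) {
    if (seg.action === "keep") total += seg.end - seg.start;
  }
  return total;
}
```

Treating the manifest as the source of truth means derived values, such as the 28.5 seconds kept in this example, are computed from it rather than stored separately.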

Processing Time

Typical processing times for a 10-minute video:

Step	Duration
Audio extraction	~10 seconds
Quality analysis	~5 seconds
Noise removal	~15 seconds (if needed)
Transcription	~30 seconds
Face tracking	~15 seconds
AI analysis	~20 seconds
Total	~1.5 minutes

Processing time scales roughly in proportion to video length. A 30-minute video typically processes in 4 to 5 minutes.
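
The figures in the table can be turned into a back-of-the-envelope estimate. This sketch assumes strictly linear scaling with video length, which simplifies the "roughly proportional" behavior described above; the function and constant names are hypothetical.

```typescript
// Per-step durations in seconds for a 10-minute video, copied from the table.
const STEP_SECONDS_PER_10_MIN = {
  audioExtraction: 10,
  qualityAnalysis: 5,
  noiseRemoval: 15,
  transcription: 30,
  faceTracking: 15,
  aiAnalysis: 20,
};

// Estimates total processing time in seconds, assuming linear scaling.
function estimateSeconds(videoMinutes: number, needsNoiseRemoval: boolean): number {
  const s = STEP_SECONDS_PER_10_MIN;
  const perTenMinutes =
    s.audioExtraction +
    s.qualityAnalysis +
    (needsNoiseRemoval ? s.noiseRemoval : 0) +
    s.transcription +
    s.faceTracking +
    s.aiAnalysis;
  return perTenMinutes * (videoMinutes / 10);
}
```

With noise removal, a 10-minute video comes to 95 seconds (about 1.5 minutes) and a 30-minute video to 285 seconds (4.75 minutes), consistent with the figures above. Without noise removal, the 10-minute estimate drops to 80 seconds.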
