Processing Pipeline

What happens after you upload a video — the six-step AI processing pipeline.

The Processing Pipeline

After you upload a video, Toast runs up to six processing steps in sequence (noise removal is skipped when the audio is already clean). You can watch the progress of each step in real time.

Pipeline Steps

1. Audio Extraction

Toast uses FFmpeg (running directly in your browser via WebAssembly) to extract the audio track from your video. The audio is saved as a WAV file in your browser's secure storage.

A low-resolution proxy video (480p) is also generated for smooth playback during editing.
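The two outputs of this step correspond to two FFmpeg invocations. The sketch below only builds the argument lists. The helper names, file names, and the 16 kHz mono audio settings are assumptions for illustration, not Toast's actual configuration; the flags themselves (`-vn`, `-acodec pcm_s16le`, `-vf scale=-2:480`) are standard FFmpeg options.

```typescript
// Hypothetical helpers: build FFmpeg argument lists for the two outputs of this step.
// File names and audio settings (16 kHz, mono) are assumptions, not Toast's real values.

function buildAudioExtractArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vn",                  // drop the video stream
    "-acodec", "pcm_s16le", // uncompressed 16-bit PCM, the usual WAV payload
    "-ar", "16000",         // assumed sample rate
    "-ac", "1",             // assumed mono downmix
    output,
  ];
}

function buildProxyArgs(input: string, output: string): string[] {
  return [
    "-i", input,
    "-vf", "scale=-2:480",  // 480 px tall; width keeps the aspect ratio, rounded to an even number
    output,
  ];
}
```

A WebAssembly build of FFmpeg would typically receive arrays like these after the input file has been written into its in-memory file system.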

2. Audio Quality Analysis

Toast measures the Signal-to-Noise Ratio (SNR) of the extracted audio, in decibels (dB), to decide whether noise removal is needed.

SNR Level      Quality   Action
25+ dB         Good      No noise removal needed
Below 25 dB    Noisy     Automatic noise removal triggered
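As a rough illustration of the threshold check, the sketch below computes SNR as 10 · log10(P_signal / P_noise) and compares it with the 25 dB cutoff. How Toast separates speech from background noise is not documented here, so the code assumes the caller already has one speech segment and one noise-only segment; all function names are hypothetical.

```typescript
// Sketch of the SNR check. Assumes the speech and noise-only segments are already
// separated; this is not necessarily how Toast estimates the noise floor.

const SNR_THRESHOLD_DB = 25; // cutoff from the table above

// Mean power of a block of PCM samples (values in [-1, 1]).
function meanPower(samples: number[]): number {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return sum / samples.length;
}

// SNR in decibels: 10 * log10(P_signal / P_noise).
function snrDb(speech: number[], noise: number[]): number {
  const pNoise = meanPower(noise);
  if (pNoise === 0) return Infinity; // a perfectly silent noise floor
  return 10 * Math.log10(meanPower(speech) / pNoise);
}

// 25 dB and above counts as good; anything lower triggers noise removal.
function needsNoiseRemoval(snr: number): boolean {
  return snr < SNR_THRESHOLD_DB;
}
```

For example, speech at full amplitude against noise at one tenth of that amplitude gives a power ratio of about 100, or roughly 20 dB, which falls below the cutoff.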

3. Noise Removal (Conditional)

If the measured SNR is below 25 dB, Toast sends the audio to CleanVoice for AI-powered noise removal. The cleaned audio replaces the original for all subsequent processing.

Only the audio file is transmitted — never your video.

4. Transcription

Toast uses ElevenLabs Scribe v2 for transcription, providing:

  • Word-level timestamps — every word is precisely timed
  • 99-language support — automatic language detection
  • Speaker diarization — identifies different speakers
  • Filler word detection — marks "um", "uh", "like", "you know", etc.
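To show how word-level timestamps make filler detection possible, here is a minimal sketch that scans a timed word list for filler phrases. The `TranscriptWord` shape, the phrase list, and the function names are assumptions for illustration; they are not the actual Scribe response schema, and the transcription service may flag fillers by other means.

```typescript
// Hypothetical word shape; field names are assumptions, not the real response schema.
interface TranscriptWord {
  text: string;
  start: number; // seconds
  end: number;   // seconds
}

interface FillerSpan {
  phrase: string;
  start: number;
  end: number;
}

// Filler phrases as token sequences, so "you know" matches two consecutive words.
// Matching "like" this naively also flags legitimate uses; a real detector needs context.
const FILLER_PHRASES: string[][] = [["you", "know"], ["um"], ["uh"], ["like"]];

// Lowercase and strip punctuation so "Um," matches "um".
function normalizeToken(token: string): string {
  return token.toLowerCase().replace(/[^a-z']/g, "");
}

function findFillers(words: TranscriptWord[]): FillerSpan[] {
  const spans: FillerSpan[] = [];
  let i = 0;
  while (i < words.length) {
    let matched = false;
    for (const phrase of FILLER_PHRASES) {
      const stop = i + phrase.length;
      if (stop > words.length) continue;
      const hit = phrase.every((tok, k) => normalizeToken(words[i + k].text) === tok);
      if (hit) {
        // The span runs from the first word's start to the last word's end.
        spans.push({ phrase: phrase.join(" "), start: words[i].start, end: words[stop - 1].end });
        i = stop;
        matched = true;
        break;
      }
    }
    if (!matched) i += 1;
  }
  return spans;
}
```

Because each span carries start and end times, a later step can cut or mute exactly that stretch of the timeline.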

5. Face Tracking

MediaPipe FaceMesh runs in a dedicated Web Worker to detect face positions throughout the video. This data powers:

  • Smart reframe — automatically crop to follow the speaker
  • Aspect ratio conversion — keep the face centered when converting 16:9 to 9:16
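The reframing idea can be sketched as a crop-window calculation: take the full frame height, derive the width from the 9:16 target ratio, center the window on the face, and clamp it to the frame. The normalized `faceCenterX` input and the function name are assumptions; this is not the MediaPipe output format, and a production reframe would also smooth the window over time to avoid jitter.

```typescript
// Sketch of a 9:16 crop that follows a face in a landscape frame.
// faceCenterX is an assumed input: the face's horizontal center, normalized to [0, 1].
interface CropRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

function verticalCrop(srcWidth: number, srcHeight: number, faceCenterX: number): CropRect {
  // Use the full height; the width follows the 9:16 target ratio.
  // Assumes the source is wider than 9:16 (true for 16:9 footage).
  const width = Math.round((srcHeight * 9) / 16);
  // Center the window on the face, then clamp so it stays inside the frame.
  const ideal = faceCenterX * srcWidth - width / 2;
  const x = Math.min(Math.max(Math.round(ideal), 0), srcWidth - width);
  return { x, y: 0, width, height: srcHeight };
}
```

For a 1920×1080 source the window is 608 pixels wide, so a face near either edge pins the crop to that edge rather than letting it run off the frame.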

6. AI Analysis

Toast sends the transcript (not the video) to Claude (or OpenAI as fallback) to generate an edit manifest — a JSON document that describes:

  • Which segments to keep and which to cut
  • Where filler words and dead air are located
  • Suggested B-roll placements
  • Reasoning for each edit decision

The edit manifest becomes the source of truth for your project. You can view the AI's reasoning in the AI Reasoning Panel.
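To make the idea of an edit manifest concrete, here is one possible shape for it, with a helper that computes how much footage survives the cuts. Every field name and value below is hypothetical; the actual schema that Toast generates is not shown on this page.

```typescript
// Hypothetical manifest shape; field names are illustrative, not Toast's real schema.
interface ManifestSegment {
  start: number;            // seconds
  end: number;              // seconds
  action: "keep" | "cut";
  reason: string;           // the reasoning behind this edit decision
}

interface EditManifest {
  segments: ManifestSegment[];
  brollSuggestions: { at: number; description: string }[];
}

// Invented example data for illustration only.
const exampleManifest: EditManifest = {
  segments: [
    { start: 0, end: 12, action: "keep", reason: "Clear introduction" },
    { start: 12, end: 14.5, action: "cut", reason: "Dead air" },
    { start: 14.5, end: 60, action: "keep", reason: "Main explanation" },
  ],
  brollSuggestions: [{ at: 20, description: "Screen recording of the feature" }],
};

// Total duration, in seconds, of the segments marked "keep".
function keptDuration(manifest: EditManifest): number {
  let total = 0;
  for (const seg of manifest.segments) {
    if (seg.action === "keep") total += seg.end - seg.start;
  }
  return total;
}
```

Storing the reason alongside each segment is what would let a panel display why an edit was made without re-running the analysis.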

Processing Time

Typical processing times for a 10-minute video:

Step                Duration
Audio extraction    ~10 seconds
Quality analysis    ~5 seconds
Noise removal       ~15 seconds (if needed)
Transcription       ~30 seconds
Face tracking       ~15 seconds
AI analysis         ~20 seconds
Total               ~1.5 minutes

Longer videos take proportionally longer. A 30-minute video typically processes in 4-5 minutes.
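The proportional scaling can be worked through with a small estimator. It uses the per-step figures listed for a 10-minute video and assumes strictly linear scaling with video length, which is a simplification; real times will vary with hardware and network conditions. The function and constant names are hypothetical.

```typescript
// Approximate per-step durations (seconds) for a 10-minute video, from the table above.
const STEP_SECONDS_PER_10_MIN: [string, number][] = [
  ["Audio extraction", 10],
  ["Quality analysis", 5],
  ["Noise removal", 15], // only when the audio is noisy
  ["Transcription", 30],
  ["Face tracking", 15],
  ["AI analysis", 20],
];

// Estimate total processing time, assuming every step scales linearly with length.
function estimateProcessingSeconds(videoMinutes: number, noisy: boolean): number {
  let base = 0;
  for (const [step, seconds] of STEP_SECONDS_PER_10_MIN) {
    if (step === "Noise removal" && !noisy) continue;
    base += seconds;
  }
  return (base * videoMinutes) / 10;
}
```

Under this assumption a 30-minute video comes to 240 seconds (4 minutes) when no noise removal is needed and 285 seconds (4.75 minutes) when it is, which matches the 4-5 minute range.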
