Processing Pipeline
What happens after you upload a video — the six-step AI processing pipeline.
After you upload a video, Toast runs six processing steps in sequence, and you can watch the progress of each step in real time. Step 3, noise removal, runs only when the audio needs it.
Pipeline Steps
1. Audio Extraction
Toast uses FFmpeg (running directly in your browser via WebAssembly) to extract the audio track from your video. The audio is saved as a WAV file in your browser's secure storage.
A low-resolution proxy video (480p) is also generated for smooth playback during editing.
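As a rough illustration of this step, the sketch below builds the kind of FFmpeg argument lists that could produce a WAV track and a 480p proxy. The flags are standard FFmpeg options, but the file names, the 16 kHz mono setting, and the helper names are assumptions made for the example, not Toast's actual configuration.

```typescript
// Hypothetical helpers: they only build argument lists and do not run FFmpeg.

// Arguments for extracting the audio track as 16-bit PCM WAV.
// The sample rate default (16 kHz) and mono output are assumptions.
function audioExtractionArgs(
  input: string,
  output: string,
  sampleRate: number = 16000
): string[] {
  return [
    "-i", input,               // input video file
    "-vn",                     // drop the video stream
    "-acodec", "pcm_s16le",    // 16-bit little-endian PCM (standard WAV payload)
    "-ar", String(sampleRate), // audio sample rate in Hz
    "-ac", "1",                // downmix to a single channel
    output,
  ];
}

// Arguments for a 480-pixel-tall proxy video.
// "scale=-2:480" keeps the aspect ratio and rounds the width to an even number.
function proxyArgs(input: string, output: string): string[] {
  return ["-i", input, "-vf", "scale=-2:480", output];
}
```

Argument lists like these could be passed to whichever FFmpeg WebAssembly wrapper runs in the page.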
2. Audio Quality Analysis
Toast measures the signal-to-noise ratio (SNR) of the extracted audio. The result determines whether the noise removal step runs.
| SNR Level | Quality | Action |
|---|---|---|
| 25+ dB | Good | No noise removal needed |
| Below 25 dB | Noisy | Automatic noise removal triggered |
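The threshold in the table can be expressed as a small check. The sketch below assumes that signal power and noise power have already been estimated separately (how Toast estimates them is not described here), and the function names are hypothetical.

```typescript
// Threshold taken from the table above.
const SNR_THRESHOLD_DB = 25;

// Mean power of a block of samples (average of squared amplitudes).
function meanPower(samples: number[]): number {
  if (samples.length === 0) return 0;
  let sum = 0;
  for (const s of samples) sum += s * s;
  return sum / samples.length;
}

// SNR in decibels: 10 * log10(signal power / noise power).
function snrDb(signalPower: number, noisePower: number): number {
  if (noisePower <= 0) return Infinity; // no measurable noise
  return 10 * Math.log10(signalPower / noisePower);
}

// True when the audio falls in the "Noisy" row of the table.
function needsNoiseRemoval(snr: number): boolean {
  return snr < SNR_THRESHOLD_DB;
}
```

For instance, a signal with 1000 times the power of the noise floor has an SNR of about 30 dB and would skip noise removal, while a ratio of 100 (20 dB) would trigger it.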
3. Noise Removal (Conditional)
If the measured SNR is below 25 dB, Toast sends the audio to CleanVoice for AI-powered noise removal. The cleaned audio replaces the original for all subsequent processing.
Only the audio file is transmitted — never your video.
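One way to picture the conditional step: once the SNR is known, the plan either includes noise removal or skips it. The step identifiers and the `planSteps` helper below are illustrative only and are not Toast's internal names.

```typescript
type Step =
  | "audio-extraction"
  | "quality-analysis"
  | "noise-removal"
  | "transcription"
  | "face-tracking"
  | "ai-analysis";

// Returns the ordered list of steps for a given measured SNR.
// Noise removal is included only when the SNR is below the threshold.
function planSteps(snrDb: number, thresholdDb: number = 25): Step[] {
  const steps: Step[] = ["audio-extraction", "quality-analysis"];
  if (snrDb < thresholdDb) {
    steps.push("noise-removal");
  }
  steps.push("transcription", "face-tracking", "ai-analysis");
  return steps;
}
```

With this sketch, `planSteps(30)` yields five steps and `planSteps(18)` yields all six, with noise removal in third position.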
4. Transcription
Toast uses ElevenLabs Scribe v2 for transcription, providing:
- Word-level timestamps — every word is precisely timed
- 99-language support — automatic language detection
- Speaker diarization — identifies different speakers
- Filler word detection — marks "um", "uh", "like", "you know", etc.
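To show how word-level timestamps and filler markers might be used together, the sketch below merges consecutive filler words into time ranges that an editor could cut. The `TimedWord` shape is a simplified, hypothetical representation and is not the transcription service's actual response format.

```typescript
// Hypothetical, simplified transcript word (times in seconds).
interface TimedWord {
  text: string;
  start: number;
  end: number;
  speaker: string;
  isFiller: boolean;
}

interface TimeRange {
  start: number;
  end: number;
}

// Merges runs of adjacent filler words (for example "you" + "know")
// into single ranges. A non-filler word always ends the current run.
function fillerRanges(words: TimedWord[]): TimeRange[] {
  const ranges: TimeRange[] = [];
  let prevWasFiller = false;
  for (const w of words) {
    if (w.isFiller) {
      const last = ranges[ranges.length - 1];
      if (prevWasFiller && last !== undefined) {
        last.end = w.end; // extend the current run
      } else {
        ranges.push({ start: w.start, end: w.end }); // start a new run
      }
    }
    prevWasFiller = w.isFiller;
  }
  return ranges;
}
```

Tracking whether the previous word was a filler, rather than comparing timestamps alone, keeps a short real word between two fillers from being swallowed into a cut.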
5. Face Tracking
MediaPipe FaceMesh runs in a dedicated Web Worker to detect face positions throughout the video. This data powers:
- Smart reframe — automatically crop to follow the speaker
- Aspect ratio conversion — keep the face centered when converting 16:9 to 9:16
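The geometry behind face-centered reframing is straightforward. The sketch below computes a 9:16 crop window inside a wider frame, centered horizontally on a face position and clamped to the frame edges. It assumes the face tracker has already produced a normalized horizontal center (0 to 1); the `portraitCrop` name and the `CropRect` shape are hypothetical.

```typescript
interface CropRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Computes a crop of the given aspect ratio (width / height) that follows
// the face horizontally without leaving the source frame.
function portraitCrop(
  srcWidth: number,
  srcHeight: number,
  faceCenterX: number, // normalized 0..1, assumed to come from face tracking
  targetAspect: number = 9 / 16
): CropRect {
  // Use the full height when the source is wider than the target aspect.
  const width = Math.min(srcWidth, srcHeight * targetAspect);
  const height = width / targetAspect;

  // Center on the face, then clamp so the crop stays inside the frame.
  const idealX = faceCenterX * srcWidth - width / 2;
  const x = Math.max(0, Math.min(idealX, srcWidth - width));
  const y = (srcHeight - height) / 2;

  return { x, y, width, height };
}
```

For a 1920x1080 source, the crop is 607.5 pixels wide and the full 1080 pixels tall. A face in the middle of the frame places the crop at x = 656.25, and a face at either edge pins the crop to that edge.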
6. AI Analysis
Toast sends the transcript (not the video) to Claude, or to an OpenAI model as a fallback, to generate an edit manifest, a JSON document that describes:
- Which segments to keep and which to cut
- Where filler words and dead air are located
- Suggested B-roll placements
- Reasoning for each edit decision
The edit manifest becomes the source of truth for your project. You can view the AI's reasoning in the AI Reasoning Panel.
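The manifest schema is not documented on this page, so the shape below is purely hypothetical. It is only meant to make the four bullet points above concrete and to show how an editor might derive the final duration from keep and cut decisions.

```typescript
// Hypothetical manifest shape; all field names are assumptions.
interface ManifestSegment {
  start: number;           // seconds
  end: number;             // seconds
  action: "keep" | "cut";
  reason: string;          // the AI's stated reasoning for this decision
}

interface BrollSuggestion {
  at: number;              // seconds
  description: string;
}

interface EditManifest {
  segments: ManifestSegment[];
  brollSuggestions: BrollSuggestion[];
}

// Total duration of the segments marked "keep".
function keptDuration(manifest: EditManifest): number {
  let total = 0;
  for (const seg of manifest.segments) {
    if (seg.action === "keep") total += seg.end - seg.start;
  }
  return total;
}

const exampleManifest: EditManifest = {
  segments: [
    { start: 0, end: 12, action: "keep", reason: "Clear introduction" },
    { start: 12, end: 14, action: "cut", reason: "Filler words" },
    { start: 14, end: 40, action: "keep", reason: "Main explanation" },
    { start: 40, end: 45, action: "cut", reason: "Dead air" },
    { start: 45, end: 60, action: "keep", reason: "Closing remarks" },
  ],
  brollSuggestions: [{ at: 20, description: "Screen recording of the feature" }],
};
```

In this example, 53 of the original 60 seconds are kept.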
Processing Time
Typical processing times for a 10-minute video:
| Step | Duration |
|---|---|
| Audio extraction | ~10 seconds |
| Quality analysis | ~5 seconds |
| Noise removal | ~15 seconds (if needed) |
| Transcription | ~30 seconds |
| Face tracking | ~15 seconds |
| AI analysis | ~20 seconds |
| Total | ~1.5 minutes (about 80 seconds without noise removal, 95 with) |
Processing time scales roughly linearly with video length: a 30-minute video typically processes in 4 to 5 minutes.
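The proportional scaling can be worked through with the table's figures. The sketch below is a back-of-the-envelope estimate that assumes strictly linear scaling from the 10-minute numbers; the constant and function names are hypothetical, and real times will vary.

```typescript
// Per-step durations in seconds for a 10-minute video, from the table above.
const STEP_SECONDS_PER_10_MIN = {
  audioExtraction: 10,
  qualityAnalysis: 5,
  noiseRemoval: 15,
  transcription: 30,
  faceTracking: 15,
  aiAnalysis: 20,
};

// Estimated total processing time in seconds, assuming linear scaling.
function estimateProcessingSeconds(
  videoMinutes: number,
  withNoiseRemoval: boolean
): number {
  const t = STEP_SECONDS_PER_10_MIN;
  const base =
    t.audioExtraction +
    t.qualityAnalysis +
    t.transcription +
    t.faceTracking +
    t.aiAnalysis;
  const per10Minutes = withNoiseRemoval ? base + t.noiseRemoval : base;
  return per10Minutes * (videoMinutes / 10);
}
```

Under these assumptions, a 30-minute video with noise removal comes to 285 seconds (4.75 minutes), and without it 240 seconds (4 minutes), which matches the 4 to 5 minute range.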