View raw

caption#

Word-timed captions with kinetic animation styles. The schema's flagship "AI captions" surface — pass an array of timestamped words (e.g., direct from Whisper) and a style name, and the runtime renders them with sub-word timing. Inherits common fields.

interface CaptionElement extends BaseElement {
  type: 'caption';
  words: CaptionWord[];
  style?: 'tiktok_bounce' | 'fade_reveal' | 'kinetic_typewriter' | 'word_pop';
  max_length?: number | 'auto';   // windowing — see below

  // Text-like styling (same semantics as `text`)
  font_family?: string;
  font_size?: number | string;
  font_weight?: number | string;
  font_style?: 'normal' | 'italic';
  fill_color?: string;
  stroke_color?: string;
  stroke_width?: number;
  text_align?: 'left' | 'center' | 'right';
  line_height?: number;
  letter_spacing?: number;
  background_color?: string;
  background_border_radius?: number;
  shadow_color?: string;
  shadow_x?: number;
  shadow_y?: number;
  shadow_blur?: number;

  // Caption-specific
  highlight_color?: string;
  highlight_background_color?: string;
}

Words#

interface CaptionWord {
  text: string;
  start: number;
  end: number;
}
FieldTypeDescription
textstringThe word as it should render.
startnumberWhen the word becomes active, in seconds relative to the caption element's time.
endnumberWhen the word stops being active.

Words don't have to be space-separated tokens — passing { text: "—" } between sentences renders a visible em-dash; punctuation can ride on the preceding word.

Windowing (max_length)#

A whole transcript on one caption would render as one unreadable block. max_length shows only part of it at a time — only the chunk active at the current moment displays (inside the box, wrapped):

ValueMeaning
numberMax letters per chunk — a chunk grows word-by-word until the next word would exceed this many characters (Creatomate-style).
'auto'Chunk by a few words (also breaking on pauses) — the sensible default for speech.
absentNo windowing — the whole transcript shows at once.

Windowing is a display rule: it doesn't change word start/end, and the kinetic style still animates within the active chunk. Transcribing in the editor sets max_length: 'auto' by default.

Style#

FieldTypeDefaultDescription
style'tiktok_bounce' | 'fade_reveal' | 'kinetic_typewriter' | 'word_pop''fade_reveal'Per-word animation behavior.

tiktok_bounce#

The active word scales up ~18% and gets the highlight color; previous words stay visible at the base fill color. Designed for vertical social video.

fade_reveal#

Words fade in as they activate and stay visible. Calmer than tiktok_bounce. Default for documentary / explainer pacing.

kinetic_typewriter#

Words pop in one at a time, no overlap. Each word stays until the next activates, then is replaced. Use this when you want one word on screen at a time.

word_pop#

The active word scales briefly (~120ms) on activation, then settles. Previous words stay at the base style. Subtler than tiktok_bounce.

Highlight colors#

FieldTypeDefaultDescription
highlight_colorstring (hex)fill_colorColor applied to the currently-active word. Defaults to the base fill_color (no color change).
highlight_background_colorstring (hex)Background plate color applied behind the currently-active word. Pair with background_border_radius for a chip effect.

Pair highlight_color with a contrast fill_color to make the active word "pop" — for example, white words with a yellow highlight.

Typography#

Same fields as text. See that page for font_family, font_size, font_weight, font_style, text_align, line_height, letter_spacing, plate, and shadow.

Example: Whisper-style captions#

{
  "type": "caption",
  "x": 960, "y": 1600,
  "track": 4,
  "time": 1,
  "duration": 8,
  "style": "tiktok_bounce",
  "font_family": "Inter",
  "font_size": 64,
  "font_weight": 800,
  "fill_color": "#ffffff",
  "highlight_color": "#facc15",
  "stroke_color": "#000000",
  "stroke_width": 4,
  "words": [
    { "text": "this",  "start": 0,    "end": 0.35 },
    { "text": "is",    "start": 0.35, "end": 0.55 },
    { "text": "how",   "start": 0.55, "end": 0.80 },
    { "text": "agents","start": 0.80, "end": 1.30 },
    { "text": "make",  "start": 1.30, "end": 1.60 },
    { "text": "video", "start": 1.60, "end": 2.20 }
  ]
}

Notes#

  • Whisper integration — Whisper's word-level timestamps map directly onto the words array. The MCP server's authoring guide includes a one-shot recipe.
  • Caption duration vs word duration — the element's duration (from common fields) bounds when the caption is on screen. Words with start / end outside the element's [0, duration] are clipped.
  • Animation interaction — caption styles handle per-word animation. Top-level animations on the caption element fire on the whole element (e.g., a fade-in at start applies before the per-word style takes over).