The Clipkit Protocol#
Version: 1.0 (CKP/1.0) Status: Draft License: Apache-2.0
This document specifies the Clipkit Protocol — the JSON-based interchange
format for describing motion-graphics videos. Documents that conform to
this protocol can be rendered by any conforming runtime: the reference
implementation @clipkit/runtime, future server-side renderers, or
third-party implementations.
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, MAY, and REQUIRED in this document are to be interpreted as described in RFC 2119.
The reference implementation lives in packages/schema
of this repository.
Table of contents#
- Introduction
- Document structure
- Coordinate system, units, and types
- The element model
- Element types
- Animation
- Time, duration, sequencing
- Asset references
- Output and rendering
- Conformance levels
- Versioning and extensions
- Implementation notes
1. Introduction#
1.1. Goals#
The Clipkit Protocol aims to be:
- JSON-native. A Clipkit document is plain JSON. Any tool that can write JSON can produce one. AI agents in particular benefit from this — they emit structured data better than they emit code.
- Renderer-agnostic. The protocol describes what a video looks like, not how to render it. A conforming runtime may use WebGPU, WebGL2, software rasterization, server-side headless browsers, or any other mechanism.
- Composable. Documents are built from a small set of primitive element types. Higher-level patterns are authoring concerns — they produce primitive elements; the protocol itself has no notion of "pattern."
- Deterministic. Given the same document and the same time, every conforming runtime MUST produce the same frame composition. Exact pixel output may differ between rasterization backends, but the scene description is unambiguous.
1.2. Non-goals#
- Visual editing format. Editors MAY store additional state alongside a Source (selection, undo history, asset binaries) but that state is outside this protocol.
- Media container. Clipkit references external media (images, video, audio) by URL or path. The protocol does not embed binary media.
- Audio mixing graph. Audio elements are positioned in time; the exact mix algorithm (gains, codecs, bitrates) is implementation-defined.
1.3. Reference implementation#
@clipkit/protocol is the canonical TypeScript + Zod implementation of
this protocol. Other implementations SHOULD treat it as authoritative on
ambiguous points until those points are resolved here.
2. Document structure#
2.1. The Source object#
A Clipkit document is a JSON object with the following shape:
{ "clipkit_version": "1.0", "output_format": "mp4", "width": 1920, "height": 1080, "duration": 30, "frame_rate": 30, "background_color": "#000000", "elements": [ /* ... */ ] }
Fields#
| Field | Type | Required? | Default | Meaning |
|---|---|---|---|---|
clipkit_version | string | SHOULD | "1.0" | The protocol version this document targets. See §11. |
output_format | string | MAY | "mp4" | One of "mp4" (video) or "gif" (animated). Clipkit is video-only. |
width | integer | SHOULD | 1920 | Composition width in pixels. MUST be positive. |
height | integer | SHOULD | 1080 | Composition height in pixels. MUST be positive. |
duration | number | "auto" | SHOULD | "auto" | Total composition duration in seconds. "auto" MUST be interpreted as the maximum end-time of any active element. |
frame_rate | number | SHOULD | 30 | Frames per second. MUST be positive. |
background_color | string | MAY | "#000000" | Color (§3.4) the frame is cleared to before any element draws. Absent → opaque black. |
motion_blur | object | MAY | — | Whole-frame motion blur by exact sub-frame supersampling. See "Motion blur" below. |
camera | object | MAY | — | Scene camera (CKP/1.0): perspective lens + movable pose (position/orientation) + sort. Absent = identity = exact 2D. See §4.4.2, §4.4.3. |
fonts | array | MAY | — | Font faces the renderer MUST register before rendering. See "Fonts" below. |
elements | array | REQUIRED | — | At least one element. See §4–5. |
Motion blur#
motion_blur is an object with two optional fields:
| Field | Type | Default | Meaning |
|---|---|---|---|
samples | integer 1–32 | 8 | Sub-frame samples rendered per output frame. 1 disables blur. |
shutter | number (0, 1] | 0.5 | Fraction of the frame interval the shutter is open. 0.5 corresponds to a 180° film shutter. |
When present with samples ≥ 2, the renderer MUST produce each output
frame as the arithmetic mean of samples full-scene renders taken at
sub-frame times centered on the frame time. For the output frame at
time t with frame rate f and composition duration D:
t_k = clamp(t + ((k + 0.5) / samples − 0.5) × shutter / f, 0, D)
for k = 0 … samples−1
Each sample is a complete render of the composition at t_k under the
normal rendering model (every animated value, transition, and effect is
evaluated at t_k). The output pixel is the per-channel arithmetic
mean of the samples sample pixels in the output color space (8-bit
sRGB, after compositing over the opaque background), rounded once,
half away from zero. Accumulation MUST be carried at a precision that
makes the result exact before that single rounding (e.g. float
accumulation of 8-bit samples).
This is deterministic: the same document produces the same pixels.
Media elements evaluate t_k through their own media-time mapping
(§5.3.2); a video whose decoder quantizes to its own frame times
contributes the same decoded frame to several samples, which is
conformant.
Motion blur applies at export/render time. Interactive previews MAY
render the unblurred scene (a single sample at t) for speed and MUST
NOT be treated as the reference output when motion_blur is present.
Fonts#
fonts is an array of font-face objects. The renderer MUST register
every face before the first frame renders, so a document carrying its
fonts is self-sufficient — text metrics don't depend on what the host
environment happens to have installed.
| Field | Type | Required? | Default | Meaning |
|---|---|---|---|---|
family | string | REQUIRED | — | The font_family name text elements reference. |
weight | number | string | MAY | "normal" | CSS font-weight. Variable fonts MAY declare a range (e.g. "100 900"). |
style | string | MAY | "normal" | "normal" or "italic". |
src | string | REQUIRED | — | URL of the font bytes: absolute, relative (resolved against the document hosting the Source), or a data: URI. |
unicode_range | string | MAY | — | CSS unicode-range (e.g. "U+0000-00FF, U+0131"). Subsetted webfonts ship one file per script under identical family/weight/style; without the range every subset matches every codepoint, and the winning file may lack the glyphs being rendered. Renderers MUST honor it when matching glyphs to faces. |
Multiple entries MAY share a family (different weights, styles, or
unicode-range subsets). Renderers MUST treat each entry as a distinct
face, exactly as CSS treats multiple @font-face rules.
2.2. Forward compatibility#
Documents MAY contain additional top-level fields not listed here. Conforming runtimes MUST ignore unknown top-level fields (passthrough). This applies to all objects in this protocol unless otherwise stated.
3. Coordinate system, units, and types#
3.1. Coordinate system#
- The origin
(0, 0)is at the top-left of the composition. - The positive x axis runs right.
- The positive y axis runs down.
- Positions, sizes, and offsets are in pixels relative to the composition width / height unless an explicit unit suffix is used (§3.3).
3.2. Anchors#
Most elements support x_anchor and y_anchor values in [0, 1].
These define which point inside the element's bounding box is positioned
at (x, y):
0= left edge (or top edge) — the default0.5= center1= right edge (or bottom edge)
By default both anchors are 0, so (x, y) is the element's top-left
corner — the CSS left/top / SVG / Canvas convention. For example,
x: 960 (anchor 0) places the element's left edge at x=960;
x: 960, x_anchor: 0.5 places its center at x=960. The anchor only
moves where the box sits — rotation and scale always pivot the element's
geometric center (§4.4), independent of the anchor.
3.3. Length values#
Wherever a position or dimension is accepted, a string with one of the following unit suffixes MAY be used:
| Unit | Meaning |
|---|---|
"100px" | pixels (same as bare number 100) |
"50%" | percentage of the property's natural reference (width → composition width; etc.) |
"10vw" | percentage of composition width |
"15vh" | percentage of composition height |
"5vmin" | percentage of the smaller composition dimension |
"5vmax" | percentage of the larger composition dimension |
A bare number is interpreted as pixels.
Divergence from CSS:
%and the viewport units resolve against the composition (canvas), never the parent element."50%"is always half the canvas, even on a child nested inside a group. There is no parent-relative percentage; use pixels for child-relative sizing inside groups.
3.4. Color values#
Colors are CSS-style strings. The reference runtime accepts:
- hex —
"#rgb","#rgba","#rrggbb","#rrggbbaa" rgb()/rgba()— comma or space separated, alpha as 0..1 or%hsl()/hsla()— same separators, optionaldegon the hue- the 148 CSS named colors (
"red","rebeccapurple", …) and"transparent"
Unrecognized strings fall back to white. (Conformance: only hex support is REQUIRED of a runtime; the rest are RECOMMENDED for CSS parity.)
Internally, all colors flow through rendering as straight-alpha sRGB, premultiplied at the final composition step. Authors do not need to think about this; it is mentioned to document the reference runtime's behavior.
3.5. Keyframe values#
Many properties accept either a static value or a Keyframe[] array
for animation. The keyframe form is:
[ { "time": 0, "value": 0, "easing": "ease-out-cubic" }, { "time": 1.0, "value": 100, "easing": "ease-out-cubic" } ]
time is in seconds, relative to the element's time (see §7).
value can be a number, a string, or [x, y] for position-like
properties. easing is OPTIONAL; see §6.4.
3.6. Expression values#
Any numeric property (transform, opacity, blur/brightness/
contrast/saturation, effect params, camera/light numbers — every
property whose type admits a number) MAY instead be given as an
expression object: a closed-form, deterministic function of the
element's own clock.
{ "y": { "expr": "540 + sin(t * PI) * 30" } } { "rotation": { "expr": "t * 90" } } { "x": { "expr": "960 + wiggle(3, 12)" } }
An expression is a pure function of element-local time and the element's own index — nothing else. This is the entire safety boundary, and it is NORMATIVE:
- It MUST NOT reference any other element (
ref(),valueAtTime) or read any runtime input (mouse, audio). Those are Tier-B and are permanently unsupported. - It MUST be deterministic: every conforming runtime MUST produce
identical results for the same expression and clock.
noise,wiggle, andrandomderive from the protocol's normative value-noise hash (bit-stable across runtimes; seed defaults to0, never to wall-clock). - Because it is a function of
t/i, it is bakeable: a runtime or tool MAY sample it to aKeyframe[]at any frame rate. A keyframe table and an expression are two encodings of the same value.
Scope — the only identifiers in scope:
| Variable | Meaning |
|---|---|
t | element-local time, seconds (0 at the element's time) |
dur | element duration, seconds |
i | element index within its generated/particle set (default 0) |
n | sibling count (default 1) |
value | the property's base value (its documented default) |
Constants: PI, TAU, E. Functions (the ONLY callable identifiers):
sin cos tan asin acos atan atan2 sinh cosh tanh
abs sign sqrt cbrt pow exp log log2 floor ceil round trunc fract
min max mod hypot
clamp(x,lo,hi) lerp(a,b,u) mix(=lerp) step(edge,x) smoothstep(e0,e1,x)
linear(x,x0,x1,y0,y1) // map x∈[x0,x1] → [y0,y1], clamped
ease(x,x0,x1,y0,y1) // same, cubic in-out
noise(x[,seed]) // value noise, ∈[-1,1]
wiggle(freq,amp[,seed]) // amp · fractal noise(t·freq)
random(seed) // deterministic [0,1) hash, time-independent
Operators: + - * / % ^ (^ = exponent, right-assoc), unary -,
comparisons < > <= >= == !=, logical && || !, and the ternary
cond ? a : b. Nothing else — no member access, assignment, indexing,
statements, or string values.
Evaluation (NORMATIVE). A conforming runtime MUST evaluate
expressions with a restricted parser/evaluator, NOT a general
code-execution facility (eval / Function). Any unknown identifier or
function, member access, assignment, or string literal is a parse error;
a parse error or a non-finite (NaN / ±∞) result MUST fall back to the
property's base value. Expressions are numeric-only in this version;
string/text expressions are reserved.
4. The element model#
Every element extends a common base:
interface BaseElement { id?: string; name?: string; type: ElementType; // discriminator layer: number; // REQUIRED, unique per container; 1..1000; LOWER draws in front (layer 1 = on top) time?: number | string; // seconds duration?: number | string | "auto" | "end"; // Transform (numbers OR length strings OR Keyframe[]) x?: number | string | Keyframe[]; y?: number | string | Keyframe[]; x_anchor?: number | string; y_anchor?: number | string; width?: number | string | Keyframe[]; height?: number | string | Keyframe[]; rotation?: number | Keyframe[]; // degrees, around center scale?: number | Keyframe[]; // Visual opacity?: number | Keyframe[]; // 0..1 (CSS convention), default 1 visible?: boolean; // false skips rendering (§4.2) blend_mode?: 'normal' | 'multiply' | 'screen' | 'add' | 'overlay' | 'hard-light' | 'soft-light'; // §4.5 blur_radius?: number | Keyframe[]; // Gaussian σ in px (§4.6) brightness?: number | Keyframe[]; // multiplier, 1 = unchanged (§4.6) contrast?: number | Keyframe[]; // multiplier, 1 = unchanged (§4.6) saturation?: number | Keyframe[]; // multiplier, 1 = unchanged (§4.6) effects?: Effect[]; // ordered stylize passes (§4.7) // Animation animations?: Animation[]; keyframe_animations?: KeyframeAnimation[]; }
4.1. The type discriminator#
type is REQUIRED and identifies which variant the object is. Conforming
runtimes MUST recognize the values defined in §5. Documents containing
unknown type values are valid documents; runtimes MAY skip such
elements with a warning rather than failing the whole document.
4.2. Draw order#
Every element owns a layer — a unique integer 1..1000 within its
container (the top-level elements, each group's elements, each group
mask's elements), like an After Effects layer. Lower layer draws
in front; layer 1 is on top. Elements draw back-to-front by depth
(z, §4.4), then by layer:
draw_key = (depth descending, i.e. farther first), then layer descending
(highest layer drawn first/behind, layer 1 drawn last/on top)
z is the single depth axis (§4.4.2): with no camera it is pure
stacking order (no perspective), with a camera it additionally
foreshortens. layer orders elements within equal depth — when depths
are equal (e.g. a 2D document where every z is 0), draw order is
exactly layer descending, so layer 1 ends up on top. layer is
required and unique per container (duplicate layers are a validation
error), so no two elements tie; the sort MUST nonetheless be stable. The
same rule applies to a group's children, locally within the group.
(There is no separate z_index field — depth ordering is unified onto
z. camera.sort: 'paint' opts a camera composition back into fixed
layer order; see §4.4.3.)
Elements with visible: false MUST NOT be rendered. Inactive elements
(§7.1) MUST NOT be rendered.
4.3. Anchors and bounding boxes#
The element's bounding box is width × height pixels, with its
anchor point at (x, y). The visual center MUST be computed
from the anchor:
center_x = x + (0.5 - x_anchor) * width
center_y = y + (0.5 - y_anchor) * height
Rotation rotates the element around its center.
4.4. Transform composition#
Every element supports, in addition to position and rotation:
| Field | Type | Default | Meaning |
|---|---|---|---|
scale | number | 1 | Uniform scale factor. |
x_scale, y_scale | number | "N%" | 1 | Per-axis scale factors; "150%" ≡ 1.5. |
x_skew, y_skew | number (degrees) | 0 | Shear, CSS skewX/skewY semantics: positive x_skew moves the bottom edge right; positive y_skew moves the right edge down. |
Since CKP/1.0, every element also supports a 3D transform:
| Field | Type | Default | Meaning |
|---|---|---|---|
z_rotation | number (degrees) | Keyframe[] | 0 | Rotation in the element's plane. Exact alias for rotation — authoring BOTH on one element MUST be rejected by validators. |
x_rotation | number (degrees) | Keyframe[] | 0 | Rotation around the element's local x axis; positive tips the top edge away from the viewer. |
y_rotation | number (degrees) | Keyframe[] | 0 | Rotation around the element's local y axis; positive turns the right edge away from the viewer. |
z | number (px) | Keyframe[] | 0 | Depth toward (+) / away from (−) the viewer. The depth axis for §4.2 draw order: higher z draws nearer the viewer (on top), with layer ordering elements within equal depth (lower layer in front). Under a camera it additionally drives perspective foreshortening; with no camera it is pure stacking. |
All are animatable. The effective axis scales are
sx = scale × x_scale, sy = scale × y_scale.
4.4.1. The local matrix (normative)#
An element's local transform is the 4×4 matrix, in pixels, with the
anchor-derived center (cx, cy) (§4.3) as the pivot:
M_local = T(cx, cy, z) · Rz(z_rotation) · Ry(y_rotation) · Rx(x_rotation)
· K(x_skew, y_skew) · S(sx, sy, 1) · T(−cx, −cy, 0)
where Rx/Ry/Rz are the standard right-handed rotation matrices in
the protocol's coordinate system (x right, y DOWN, z toward the
viewer; positive z_rotation is therefore clockwise on screen, as in
CKP/1.0), and K embeds the 2D shear [[1, tan(x_skew)], [tan(y_skew), 1]] (CSS skew(x, y) semantics) in the xy plane. With
x_rotation = y_rotation = z = 0 this reduces exactly to the CKP/1.0
composition scale → shear → rotate around the anchor-derived
center; documents without 3D fields MUST render identically under
both models.
Group transforms stack multiplicatively onto children:
M = M_parent · M_local, evaluated in the group's local space, and
opacities multiply down the group chain. Children of a group that
does NOT rasterize to a layer (no clip, mask, filter fields, or
effects — see §4.4.3) therefore live in the parent's 3D space:
nested 3D rotations compose, with no opt-in flag.
4.4.2. The camera#
A Source MAY declare one scene camera — a perspective lens (perspective,
origin_x, origin_y) plus an optional rigid pose (a position and a
look orientation) that moves the viewpoint through the scene:
camera?: { perspective: number | Keyframe[]; // focal distance in px, > 0 origin_x?: number | string; // default "50%" (canvas center) origin_y?: number | string; // default "50%" // Eye position offset from the default eye, px, about the origin. Default 0. x?: number | Keyframe[]; // +x right y?: number | Keyframe[]; // +y down z?: number | Keyframe[]; // +z = eye toward the scene (dolly in) // Eye orientation, degrees, Euler (applied Rz·Ry·Rx). Default 0. x_rotation?: number | Keyframe[]; // pitch (tilt) y_rotation?: number | Keyframe[]; // yaw (pan) z_rotation?: number | Keyframe[]; // roll sort?: 'depth' | 'paint'; // compositing order, §4.4.3. Default 'depth'. }
The lens follows CSS Transforms Module Level 2 perspective()
semantics: with d = perspective and origin (ox, oy),
P = T(ox, oy, 0) · [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 −1/d 1 ] · T(−ox, −oy, 0)
The pose is the inverse of the camera's rigid world transform, taken
about the origin so it composes with the lens. With eye position
(x, y, z) and rotation R = Rz·Ry·Rx,
V = T(ox, oy, 0) · R⁻¹ · T(−x, −y, −z) · T(−ox, −oy, 0)
and the camera matrix applied once at the root is
camera = P · V
so every element renders through P · V · M_chain · M_local, followed
by the perspective divide. Orientation is given as explicit Euler angles
rather than a look-at target: a target would derive orientation from
geometry (hidden runtime math); a "look at this point" gesture is an
authoring convenience that resolves to these angles, not a schema field.
Identity reduction (normative). With the pose at its defaults
(x=y=z=0, all rotations 0), V = I and camera = P bit-for-bit:
a document that uses only perspective/origin renders identically to a
pre-pose runtime. Smaller d = stronger foreshortening; perspective
is animatable (camera push/pull), as are all pose fields (orbit, pan,
tilt, dolly).
No camera ⇒ camera = I. 3D rotations still render (affine
foreshortening — a y-rotated card narrows but its edges stay parallel).
z has no perspective effect without a camera, but it still orders
(§4.2, §4.4.3): z is the single depth axis, so without a camera it acts
as pure stacking order. A document with no 3D fields (all z = 0) and no
camera renders as pure layer stacking — equal depth collapses to layer
order (descending, so layer 1 is on top).
4.4.3. Compositing under 3D (normative)#
- Paint order — depth (2.5D).
zis the single depth axis and it orders always: each sibling draw list (the top-level elements, and the children of each non-flattened group) is painted back-to-front by depth — the eye-spacezof the sibling's anchor (itszwith no camera;zafterP · Vunder a camera), far cards first. With no camera this is pure stacking with no perspective; with a camera it is the same ordering plus foreshortening. This is whole-card (per-element) 2.5D sorting — flat cards ordered by distance — NOT a per-pixel depth buffer. The sort is stable: equal depths break bylayerorder (descending — layer 1 drawn last/on top), so a document where everyz = 0collapses to exactlayerorder, and the same Source always yields the same pixels. A flattened group (clip / mask / filters / effects) is a single card and sorts by its own anchor depth as a unit; the §4.4.3 flattening rule means its children are already coplanar inside the flat layer and keep their in-layer order. An element's own internal quads (a particle system's particles, a text element's glyphs) are likewise NOT reached by this sort — they follow their own documented order (e.g. particles in spawn order, §5.13). Limitation (normative, not a bug): because the unit of sorting is the whole card, cards that interpenetrate, or whose anchor-depth order disagrees with their true per-pixel order, can be ordered "wrong" at some camera angles. There is no per-pixel resolution in the 2.5D model.sort: 'paint'opts a camera composition back into fixedlayerorder for authors who want explicit, camera-stable layering. There is no depth buffer in the rendering model. - Flattening at layer boundaries. A group with
clip: trueormaskrenders its children in its own flat 2D layer space exactly as in CKP/1.0, and the finished layer's quad is then transformed by the full matrix chain. 3D declared INSIDE such a subtree composes only within that flat layer's plane; 3D declared ON or ABOVE it projects the layer as a unit. (This is how a clipped UI-mock group tilts as one card.) - Effect surfaces are screen-space. Filter fields (
blur_radius,brightness,contrast,saturation,hue_rotate) andeffectsentries evaluate on the element's PROJECTED rendering: the element (or group subtree) draws with its full transform — including 3D and the camera — into a surface-sized layer, and the filter/effect chain runs on those screen-space pixels. This matches CKP/1.0, where the element's own transform is likewise baked into its effect surface before filtering, and means e.g. a glow's radius or a stroke's width is uniform in screen pixels on a tilted card. A shape's nativeshadowforeshortens with the element's plane, but itsoffset_x/offset_ytranslate in the PARENT plane — consistent with CKP/1.0, where a rotated shape's shadow offset stays screen-aligned. - Glass. Glass is legal under 3D. The pane is a true plane in the scene: pane-local coordinates come from the inverse of the pane's plane homography (the restriction of the full §4.4 matrix chain to the pane's plane — equivalent to exact per-fragment ray/plane intersection), the §4.7 optical model runs unchanged in that local frame, and refracted sample points map FORWARD through the homography onto the screen-space backdrop snapshot. See §4.7 "Glass under 3D" for the normative model and degenerate cases. With no 3D on the element or its un-flattened chain, the orthographic CKP/1.0 path applies bit-for-bit.
- Anti-aliasing. SDF edge anti-aliasing remains derivative-based and scales naturally under projection; the §1 cross-backend tolerance language applies unchanged.
4.5. Blend modes#
blend_mode selects how the element's pixels combine with the pixels
already drawn beneath it. It is element-local: it changes only this
element's compositing math and MUST NOT alter how any other element
renders. With premultiplied sources:
| Value | Color math | Character |
|---|---|---|
normal (default) | out = src + dst·(1 − src.a) | Standard over. |
multiply | out = src·dst + dst·(1 − src.a) | Darkens; white is neutral; uncovered pixels leave the destination unchanged. |
screen | out = src + dst·(1 − src) | Lightens; black is neutral. |
add | out = src + dst | Linear dodge; overlaps sum toward white (glow). |
overlay | B(cb,cs) = 2·cb·cs if cb ≤ 0.5 else 1 − 2·(1−cb)·(1−cs) | Multiply in dark backdrop areas, screen in light ones; boosts contrast. |
hard-light | B(cb,cs) = overlay with source and backdrop swapped | Like shining a harsh light through the source. |
soft-light | B(cb,cs) per W3C soft-light | A gentler hard-light (soft dodge/burn). |
The blend function B(cb, cs) operates on straight-alpha (un-
premultiplied) backdrop cb and source cs per channel; the result
composites via the general separable formula
co = αs·(1−αb)·cs + αs·αb·B(cb,cs) + (1−αs)·αb·cb, with
αo = αs + αb·(1 − αs). For normal/multiply/screen/add this
reduces to the closed forms above, expressible with fixed-function
blending. overlay/hard-light/soft-light are piecewise on the
backdrop (or source) and cannot be; a conforming runtime isolates
the element to its own layer and composites it against a snapshot of
the backdrop. The alpha channel always composites normally, so
coverage is unaffected by the mode.
On a group, blend_mode applies when the group's flattened layer is
composited — which only exists when the group is layered via
clip: true or mask (§5.8). On an unlayered group the field MUST be
ignored (children draw directly to the parent surface with their own
modes); runtimes SHOULD warn. Children inside a layered group
composite against each other inside the layer, isolated from the
backdrop — matching CSS isolation: isolate semantics.
4.6. Filters#
Four element-local filter fields, all animatable, all following CSS
filter function semantics:
| Field | Default | Meaning |
|---|---|---|
blur_radius | 0 | Gaussian blur; the value is the standard deviation σ in canvas pixels (CSS blur(σ)). |
brightness | 1 | Color multiplier: c' = c × v (CSS brightness(v)). |
contrast | 1 | Scale around mid-gray: c' = (c − 0.5) × v + 0.5 (CSS contrast(v)). |
saturation | 1 | Lerp against Rec. 709 luma: c' = mix(luma(c), c, v); 0 = grayscale (CSS saturate(v)). |
hue_rotate | 0 | Hue rotation by v DEGREES: c' = M(v) × c with the SVG feColorMatrix type="hueRotate" matrix below (CSS hue-rotate(v)). |
The hue_rotate matrix, NORMATIVE, with cosθ/sinθ of the angle:
M = [ 0.213+0.787cosθ−0.213sinθ 0.715−0.715cosθ−0.715sinθ 0.072−0.072cosθ+0.928sinθ ]
[ 0.213−0.213cosθ+0.143sinθ 0.715+0.285cosθ+0.140sinθ 0.072−0.072cosθ−0.283sinθ ]
[ 0.213−0.213cosθ−0.787sinθ 0.715−0.715cosθ+0.715sinθ 0.072+0.928cosθ+0.072sinθ ]
Blur evaluation is a NORMATIVE downsample ladder (so identical sources
produce identical pixels everywhere): bilinearly halve the image until
the residual σ / f ≤ 4 (f a power of two, max 16), apply a
25-tap Gaussian (taps at σ/f ÷ 4 spacing over ±3σ, weights
exp(−d²/2σ²) normalized by their sum) horizontally then vertically
at the reduced size, and bilinearly upsample at the consuming draw.
Sparse full-resolution taps are not an acceptable substitute — they
leave a visible σ/4-pixel grid on hard edges.
A filtered element — any type, group included, layered or not — is
rendered with its normal transform into a transparent offscreen layer,
then that layer is composited back through the filter. Filters MUST
apply in the order blur → brightness → contrast → saturation →
hue_rotate, and
the color ops MUST operate on straight (unpremultiplied) color so
translucent pixels don't skew toward the contrast midpoint. Channel
results clamp to [0, 1] before re-premultiplying; the alpha channel is
never changed by color ops.
Filters are element-local: the blur may bleed past the element's box
(like CSS filter: blur()) but never reads or alters other elements'
pixels. The element's opacity applies inside the layer and its
blend_mode applies at the filter composite, so all three features
compose. On a group, filtering flattens the subtree first — children
are filtered as one image, not individually.
4.7. Stylize effects#
effects is an ordered array of stylize passes over the element's
rendered pixels, applied AFTER the filter fields:
layer → blur → brightness → contrast → saturation → hue_rotate → effects[0] → … → effects[n]
{ "effects": [{ "type": "pixelate", "cell_size": 12 }] }
Each effect is an object discriminated by type. Effect params accept
number | Keyframe[]; keyframes evaluate against element-local time.
(Effect params are NOT addressable from keyframe_animations — there
is no property-path syntax into the array.) Like filters, effects are
element-local, work on every element type (a group flattens its
subtree first), and the element's blend_mode applies at the final
composite. Runtimes encountering an effect type they don't implement
MUST skip that effect (rendering the element without it) and SHOULD
warn.
Effects read only the element's own rendered pixels — with ONE
exclusion: glass. Glass additionally reads the element's
backdrop: the current surface's pixels at the element's position in
draw order (§4.2), i.e. everything drawn before it on the same
surface. It gets this carve-out because the effect is widely known and
in high demand, and there is no proper decomposition — refraction
needs the pixels behind the pane — and the alternative, a first-class
glass element type, is deliberately not part of this protocol. No
other effect type reads the backdrop, and backdrop-sampling blend
modes (overlay, soft-light) remain excluded (§4.5). Note the
relationship stays one-way: glass READS what is beneath it but never
alters how any other element renders.
Pixel-grid coordinates below are the element's layer pixels at output resolution; cells are aligned to the layer's origin. Color math runs on straight (unpremultiplied) color; "ink" factors scale color and alpha together (premultiplied output).
pixelate#
| Param | Default | Meaning |
|---|---|---|
cell_size | 8 (min 1) | Cell size in canvas pixels. |
Every pixel takes the color sampled at its cell's center:
out(p) = src((floor(p / cell) + 0.5) × cell).
dither#
| Param | Default | Meaning |
|---|---|---|
levels | 4 (min 2) | Quantization levels per color channel. |
Ordered dithering with the 4×4 Bayer matrix
[0 8 2 10; 12 4 14 6; 3 11 1 9; 15 7 13 5]:
t = (B[y mod 4][x mod 4] + 0.5) / 16, then per channel
c' = clamp(floor(c × (levels−1) + t) / (levels−1), 0, 1).
Alpha (coverage) is not dithered.
halftone#
| Param | Default | Meaning |
|---|---|---|
cell_size | 8 (min 2) | Dot-grid cell size in canvas pixels. |
angle | 45 | Grid rotation in degrees. |
In grid space (pixel coords rotated by angle), each cell draws a dot
at its center, colored with the source sample at that center and sized
by its luminance: luma = Rec709(c) × α, r = 0.5 × cell × √luma
(area-proportional ink). Pixel ink is
(1 − smoothstep(r−1, r+1, d)) × clamp(r, 0, 1) where d is the
grid-space distance to the dot center; outside dots the output is
transparent.
ascii#
| Param | Default | Meaning |
|---|---|---|
cell_size | 12 (min 4) | Glyph cell size in canvas pixels. |
Each cell samples its center color; luma = Rec709(c) × α selects a
glyph by i = clamp(floor(luma × 10), 0, 9) from the ten-step density
ramp space . - : = + % * @ #. The glyph tints with the cell's color;
uninked pixels are transparent. Glyph shapes are NORMATIVE — the
embedded 8×8 bitmap font below (one byte per row, MSB = leftmost
pixel), upscaled nearest-neighbor to the cell — so the effect never
depends on platform fonts:
| Glyph | Rows (hex) |
|---|---|
| space | 00 00 00 00 00 00 00 00 |
. | 00 00 00 00 00 18 18 00 |
- | 00 00 00 7E 00 00 00 00 |
: | 00 18 18 00 00 18 18 00 |
= | 00 00 7E 00 7E 00 00 00 |
+ | 00 18 18 7E 18 18 00 00 |
% | 00 C6 CC 18 30 66 C6 00 |
* | 00 66 3C FF 3C 66 00 00 |
@ | 7C C6 DE DE DE C0 78 00 |
# | 6C 6C FE 6C FE 6C 6C 00 |
glow, drop_shadow, stroke (layer styles)#
Layer styles composite BENEATH the element's own pixels (premultiplied
under-operator: out = content + style × (1 − content.α)), on any
element type.
| Effect | Params (defaults) | Math |
|---|---|---|
glow | radius 20, intensity 1, color "#FFFFFF" | silhouette alpha blurred by σ = radius (§4.6 ladder), × intensity (clamped to 1), × color. |
drop_shadow | offset_x 0, offset_y 12, blur 18, color "#000000", opacity 0.6 | silhouette alpha blurred by σ = blur, sampled at p − offset, × color × opacity. |
stroke | width 4, color "#FFFFFF" | outline band outside the silhouette: max alpha over a 16-tap ring of radius = width, × color. |
Numeric params are animatable; they chain in array order like all effects (a drop_shadow listed before a glow renders beneath it).
chroma_key, luma_key (keying)#
Keying makes pixels of the element's own rendered layer transparent
based on their color. All math operates on the STRAIGHT-alpha color
c = premultiplied.rgb / α (with c = 0 where α = 0); the resulting
coverage factor a scales the pixel's alpha (and its premultiplied
color with it).
chroma_key — params color (default "#00FF00"), tolerance
(default 0.18), softness (default 0.1), spill (default 0.5).
With k the key color and BT.709 luma `Y(x) = 0.2126·x.r + 0.7152·x.g
- 0.0722·x.b`:
CbCr(x) = ( (x.b − Y(x)) / 1.8556 , (x.r − Y(x)) / 1.5748 )
d = | CbCr(c) − CbCr(k) | — Euclidean distance
a = softness > 0 ? clamp((d − tolerance) / softness, 0, 1)
: (d > tolerance ? 1 : 0)
Spill suppression caps the key color's dominant channel (ties resolve
green → red → blue): with i that channel and j, k the others,
c.i −= spill × max(0, c.i − max(c.j, c.k)), applied to every pixel
regardless of d. Output: α' = α × a, color c × α'.
luma_key — params threshold (default 0.5), softness
(default 0.1), invert (default false):
a = softness > 0 ? clamp((Y(c) − threshold) / softness, 0, 1)
: (Y(c) > threshold ? 1 : 0)
invert → a = 1 − a
Pixels darker than threshold are removed (with invert, brighter).
tolerance / softness / threshold / spill are animatable;
color and invert are static. Keying reads only the element's own
pixels — to key a green-screen video, put the effect on the video
element (color set to the screen's actual green); a group keys its
composited children as one layer.
levels, lut (grading)#
Both operate per pixel on the straight-alpha color; alpha is never changed.
levels — params in_black 0, in_white 1, gamma 1,
out_black 0, out_white 1 (all animatable, points clamped to
[0, 1], gamma > 0). Per channel:
x = clamp((c − in_black) / (in_white − in_black), 0, 1)
y = x^(1/gamma) — gamma > 1 brightens mid-tones
out = clamp(out_black + y × (out_white − out_black), 0, 1)
lut — params source (URL of a .cube file — http(s), relative,
or data: URI), intensity 1 (animatable, 0..1). The file MUST
declare LUT_3D_SIZE N (2 ≤ N ≤ 256) with the default 0..1 domain;
data lines are r g b triples ordered red-fastest, then green, then
blue. Values clamp to [0, 1]. The graded color is the TRILINEAR
interpolation of the lattice at c × (N−1) per axis, and the output
is mix(c, graded, intensity). A LUT that fails to load or parse MUST
skip the pass with a warning (the element still renders). The lattice
is sampled at the pipeline's working precision (8-bit per channel in
the reference runtime).
fractal_noise, turbulent_displace (procedural noise)#
Both build on one NORMATIVE noise function so identical documents produce identical pixels on every runtime. All integer math is 32-bit unsigned (wrapping); lattice coordinates convert int→uint by two's complement.
pcg(v) : s = v × 747796405 + 2891336453 (mod 2³²)
w = ((s >> ((s >> 28) + 4)) XOR s) × 277803737 (mod 2³²)
→ (w >> 22) XOR w
h(c, seed) = pcg(c.x XOR pcg(c.y XOR pcg(c.z XOR pcg(seed)))) / (2³²−1)
noise(p, seed): value noise — trilinear blend of h at the 8 corners of
floor(p)'s lattice cell, weights faded per axis by
u = f³(f(6f − 15) + 10)
fbm(p, octaves, seed) = Σₒ 0.5° × noise(p × 2°, seed + o) / Σₒ 0.5°
for o = 0 … octaves−1 (lacunarity 2, gain 0.5)
fractal_noise — params scale 100 (canvas px per lattice cell),
evolution 0, offset_x/offset_y 0 (canvas px), octaves 4
(integer 1–8, static), seed 0 (integer, static; use values < 2²⁴).
The element's pixels are replaced by grayscale noise, keeping its
alpha footprint:
v = fbm((p + offset) / scale ⊕ evolution as the 3rd axis, octaves, seed)
out = (v, v, v) × α, α unchanged
Grayscale by design — chain levels to shape contrast and a lut to
color it (gray has no chroma for hue_rotate to act on). Animate
evolution for in-place churn, the offsets to scroll. On a text
element the glyphs become a noise-filled matte.
turbulent_displace — params amount 16 (max displacement,
canvas px), scale 120, evolution 0, octaves 2 (1–8, static),
seed 0 (static). Each output pixel samples the element's layer at a
noise-displaced position (bilinear, clamped to the layer):
d = ( fbm(p/scale ⊕ evolution, octaves, seed) − 0.5 ,
fbm(p/scale ⊕ evolution, octaves, seed + 7919) − 0.5 ) × 2 × amount
out(p) = layer(p + d)
scale, evolution, offsets, and amount are animatable; octaves
and seed are static.
glass#
Liquid glass — a refractive pane over the backdrop, following the
widely-adopted liquidglass optical model (reference:
github.com/ybouane/liquidglass). Glass applies to shape elements
(rectangle / ellipse), where the pane geometry is known exactly;
runtimes MUST skip the effect (with a warning) on other element types
— deriving lens geometry from rasterized alpha produces rim artifacts.
Glass samples the backdrop orthographically: since CKP/1.0, 3D
transforms (x_rotation / y_rotation / z) are invalid on a
glass-carrying element and on its un-flattened ancestor chain
(§4.4.3).
| Param | Default | Meaning |
|---|---|---|
blur_radius | 0 | Backdrop Gaussian blur σ in px. 0 = CLEAR glass (pure refraction, the default material); > 0 = FROSTED. |
refraction | 21 | Lens bend strength (≈ px of displacement; internally the reference dial = refraction / 30). The magnitude is used. |
edge_width | 40 | Bevel z-radius — how deep the lens curvature reaches. With mode: "dome" and edge_width = the shape's radius, the pane is a half-sphere magnifier. |
mode | "pill" | Lens cross-section: "pill" (biconvex — entry + exit refraction + depth-scaled centre magnification) or "dome" (flat bottom, curved top — uniform magnification toward the centre). |
edge_highlight | 0.35 | Scales the stock light rig (rim 0.22×, inner glow 0.15×, 1.5px top-biased inner stroke 0.55×, Fresnel). 0.35 reproduces the reference defaults exactly. |
dispersion | 0.05 | Chromatic aberration along the surface normal (×18 px, edge-weighted). |
shadow | 0.3 | Drop-shadow opacity, painted ONLY outside the pane's SDF (spread 10 px, vertical offset 1 px — glass never frosts its own shadow). |
backdrop_saturation | 1 | Saturation of the sampled backdrop (1 = unchanged). |
tint | none | Color drawn over the glass; alpha = strength. Not keyframable in v1. |
All numeric params are animatable. The pane's fill_color is unused
under glass; its opacity scales the pane. An ellipse evaluates as
the rounded-rect SDF with r = min(half) — a circle when square, a
stadium otherwise. Pane content (labels, icons) goes on lower layers (nearer the front).
The model, NORMATIVE, evaluated per pixel in pane-local rotated
coordinates p with half-size half, corner radius r, z-radius
zR = edge_width, and the reference dial dial = |refraction| / 30:
sdf = roundedRectSDF(p, half, r)
inside = −sdf
h(d) = √(d × (2·zR − d)) clamped to [0, zR] — half-circle bevel
∇h = central differences of h at step 2 px (the sdf is ANALYTIC:
no field facets, no measured-divergence cap, no rim "lip")
N = normalize(−∇h, 1)
depth = smoothstep(0, zR, inside)
edge = smoothstep(0.35 × min(half), 0, inside)
pill: refr = (2·∇h + ∇h·(h/zR)·0.5) × (1 − 1/1.5) × dial × 30
+ (−p / half) × dial × 4 × depth
dome: refr = −p × dial × depth × 0.35
chroma = N.xy × dispersion × 18 × (edge×0.7 + 0.3) × 2;
R/G/B sample at refr + chroma / refr / refr − chroma
col = mix(sharp, frosted, 1 − edge × 0.15)
— frosted everywhere, 15% sharp at the rim; with
blur_radius 0 both textures are identical (clear glass)
col = saturate(col, backdrop_saturation); mix toward tint;
× (1 + 0.06 × depth)
fresnel = (1 − |N.z|)⁴ × (edge_highlight / 0.35)
spec = Blinn-Phong on N: L(0.4,0.7,1)^90 + L(−0.3,−0.5,1)^50×0.3
+ diffuse L(0.1,0.3,1)^6×0.1 + L(0,0.9,0.4)^120×0.6
(specular dial; reference default 0)
stroke = 1.5px inner band × (0.4 + 0.6·topBias) × edgeHL × 0.55
rim/glow = edge × edgeHL × 0.22 / smoothstep(5,0,inside) × edgeHL × 0.15
env = (N.y×0.5 + 0.5) × fresnel × 0.08
fin = col + spec + rim + glow + stroke + env, then
mix(fin, white, fresnel × 0.2)
out = premultiply(fin, aaMask × opacity)
shadow (sdf > 0 only, offset down by offY):
d = max(sdfShadow − 1, 0); s = spread
α = (e^(−d²/s²) × 0.65
+ e^(−0.08·d / max(0.04·s, 0.01)) × 0.35) × shadow
Glass under 3D (CKP/1.0, NORMATIVE). When the pane carries 3D fields or sits under an un-flattened non-affine chain (§4.4), the model above runs unchanged in the pane's LOCAL frame; only the coordinate hand-off changes:
- Let
Hbe the pane's plane homography — the 3×3 restriction of the full §4.4 matrix chain (camera included) to the pane's plane, origin at the pane's center, y down. Per fragment, pane-localp = (H⁻¹ · (px, 1)).xy / w; a non-positivewlies past the plane's horizon and outputs nothing. This is exactly the camera-ray / pane-plane intersection, in projective form. - Refracted and dispersed sample points (
p + refr ± chroma) map FORWARD throughHto surface px (clamped to the surface) and sample the same screen-space backdrop snapshot. Glass refracts the COMPOSITED IMAGE at the screen plane — it does not re-render the scene from the refracted direction. - The light rig, bevel field and SDF are defined in the pane's local frame and tilt WITH the pane (highlights track the surface, not the screen).
- Degenerate case: a singular
H(an edge-on pane with no perspective anywhere in its chain) is invisible — runtimes MUST draw nothing. Under perspective an off-axis edge-on pane projects to a thin wedge; that wedge is correct rendering, not an error. - The shadow term evaluates in pane-local coordinates like everything else, so the shadow projects with the pane.
With no 3D in play the orthographic path applies unchanged — documents valid in CKP/1.0 render bit-identically.
4.8. Lighting and materials (CKP/1.0, NORMATIVE)#
A Source MAY declare lights and an environment; an element MAY carry
a material. Lighting is opt-in and additive: a document with no
lights and no element material renders bit-identically to one
without these fields. Only elements with a material are shaded.
Shading model (PBR). When an element has a material and the Source
has lights, the element renders its content normally — those pixels are
the albedo — and the runtime shades each fragment with one model,
identical in 2D and 3D:
out = albedo · (ambient + Σ_dir diffuse(N,L)·Lc) // Lambert
+ Σ_dir specular_GGX(N,L,V,roughness)·F(V,H,F0)·Lc // Cook-Torrance
+ (kd·albedo·envAvg + envc·Fr) · reflectivity // environment (IBL)
+ emissive·albedo
where envc = mix(env(R), envAvg, roughness) is the environment sampled
along the reflection ray R = reflect(−V,N) and blurred toward its
average by roughness; Fr is the roughness-aware Schlick–Fresnel at the
view angle; kd = (1−Fr)·(1−metalness) is the dielectric diffuse weight;
and envAvg is the environment's mean color (its diffuse irradiance).
- N is the element's world-space face normal (from its §4.4 3D
orientation), perturbed per-fragment by
material.normal_mapwhen present (a tangent-space map sampled in the element's UV space, scaled bynormal_scale; the tangent/bitangent come from the quad's U/V axes). - V is the view vector: per-fragment
normalize(eye − fragmentWorld)under a camera, or(0,0,1)with no camera. This is the only term the camera affects — so specular and reflections sweep as the camera moves (3D); with no camera you animate the lights instead (2D). The math is the same. - F0 (Fresnel base reflectance) is
0.04for dielectric, the albedo for metal, interpolated bymetalness. Fresnel uses Schlick. - roughness widens the GGX specular lobe and blurs the environment reflection; reflectivity is an art dial over the environment term.
Lights (source.lights):
{ type: 'ambient', color?, intensity? }— uniform fill.{ type: 'directional', azimuth?, elevation?, color?, intensity? }— a parallel light;azimuth/elevation(degrees) give its direction. All scalar fields animatable.
Environment (source.environment) is what reflective surfaces
sample along the reflection vector. Phase 1: { type: 'gradient', stops }
— a gradient "sky" indexed by the reflection ray's vertical component
(offset 0 = looking down, 1 = up), so a surface mirrors the sky when it
tilts up and the ground when it tilts down, and the reflection shifts as
the surface (or camera) moves. Up to 4 stops. The environment also
contributes a diffuse irradiance (its average color) to dielectric
surfaces. The other type is { type: 'image', src } — an
equirectangular (2:1 lat-long) image the surface mirrors along the
reflection vector (real photographic reflections); roughness blurs it
toward the image's average color. Both share one IBL path.
Bloom (source.bloom, Phase 2) is a whole-frame post-process:
pixels brighter than threshold (luma, soft knee) are blurred by
radius and added back × intensity, so bright regions — specular
highlights, emissive surfaces, bright media — bleed light across element
boundaries. It is brightness-driven: the amount each region blooms
comes from its own brightness, not a per-element knob (use the
per-element glow effect for deliberate, art-directed halos). Opt-in;
absent ⇒ no bloom (byte-identical). All fields animatable.
Material (element.material): roughness (0 mirror‑tight .. 1
matte), metalness (0 dielectric .. 1 metal), reflectivity (env
strength), emissive (self-illumination toward the unlit pixels),
normal_map (tangent-space normal map URL — flat texel = #8080ff) and
normal_scale (perturbation strength). Scalars animatable. Absent ⇒
unlit.
Determinism / cost. No material ⇒ the element takes the exact
unlit path (no shading pass), so unlit content is byte-identical and
pays nothing. The model is local illumination (no shadows, no global
illumination) — flat 2.5D surfaces lit per-fragment.
5. Element types#
CKP/1.0 defines nine element types. Each section below specifies the fields, semantics, and required behavior for one type.
5.1. shape#
A shape draws geometry in one of two representations, selected by whether
it carries paths:
- Primitive —
shape: "rectangle" | "ellipse"with optional rounded corners, gradient fills, and stroke. Rendered as a resolution-independent SDF.fill_color,stroke_color,stroke_width, andborder_radiusare animatable viakeyframe_animations(e.g.property: "border_radius"). - Path — arbitrary vector geometry via
paths: keyframeabledmorphing, per-sub-path fill/stroke, and stroke trim/draw-on. Rasterized, so resolution is bound byview_box. Specified in §5.6.
When paths is present the primitive fields are ignored.
interface ShapeElement extends BaseElement { type: "shape"; // Primitive form (ignored when `paths` is present): shape?: "rectangle" | "ellipse"; // default "rectangle" fill_color?: string; // hex, default "#ffffff" gradient?: LinearGradient | RadialGradient; // overrides fill_color stroke_color?: string; stroke_width?: number; border_radius?: number; // PIXELS — see §5.1.2 // Path form (§5.6): view_box?: [number, number, number, number]; // default [0, 0, 100, 100] gradients?: PathGradient[]; paths?: PathDef[]; // ≥1 when present }
5.1.1. Gradients#
interface LinearGradient { type: "linear"; angle?: number; // degrees, CSS linear-gradient() convention: // 0 = to top, clockwise. 90 = to right, // 180 = to bottom (default), 270 = to left. stops: GradientStop[]; // 2..4 stops } interface RadialGradient { type: "radial"; cx?: number; // 0..1 of bounding box, default 0.5 cy?: number; radius?: number; // 0..1, default 0.5 stops: GradientStop[]; } interface GradientStop { offset: number; // 0..1 color: string; // hex }
Implementations MUST support at least 4 stops. The gradient direction for linear gradients uses CSS-style angle conventions.
5.1.2. Corner radius#
border_radius is in PIXELS, not a normalized 0..1 value. Runtimes MUST
clamp the value to half the shorter of width/height so that values
exceeding the shape produce a pill or circle rather than visual artifacts.
Conforming runtimes MUST render corner arcs as true quarter-circles, not stretched ellipses. This means SDF-based renderers MUST perform corner math in pixel space, not in normalized UV space.
5.2. text#
Single-line text rendered with the platform font stack.
interface TextElement extends BaseElement { type: "text"; text?: string; // static text spans?: TextSpan[]; // inline-styled runs (§5.2.4) font_family?: string; // CSS font stack font_size?: number | string; // "auto" fits to width font_weight?: number | string; // "400", "bold", etc. font_style?: "normal" | "italic"; fill_color?: string; stroke_color?: string; stroke_width?: number; text_align?: "left" | "center" | "right"; letter_spacing?: number; background_color?: string; // solid bg, SHRINK-WRAPPED to glyphs background_border_radius?: number; // corner radius (px) for the bg background_padding?: number | [number, number]; // bg padding px (or [x,y]) text_shadow?: TextShadow | TextShadow[]; // per-glyph shadow(s) mask?: TextMask; // reveal mask } interface TextShadow { color: string; // §3.4 offset_x?: number; // px, text-local frame. Default 0 offset_y?: number; blur?: number; // Gaussian softness px (0 = crisp). Default 0 opacity?: number; // 0..1, multiplies color alpha. Default 1 }
text_shadow is a per-glyph drop shadow (CSS text-shadow): each glyph
casts its own, so it tracks per-letter animation and overlapping glyphs —
unlike the silhouette drop_shadow effect (§4.7), which shadows the
flattened text as one shape. Pass an array for stacked shadows, painted
back-to-front (list farthest → nearest for a clean 3D extrusion). Shadows
are drawn behind every glyph (a two-pass render) so they never get clipped
by neighbouring letters. Works the same on caption. (Reach for the
drop_shadow effect instead when you want one soft shadow of the whole
text silhouette.)
background_color draws a solid background behind the text as one band
per line, each shrink-wrapped to that line's glyphs (line width × the
font ascent/descent box) — NOT the element's width/height box — so
centered or ragged multi-line text gets per-line pills, and it tracks
wrapping and font_size: "auto". background_border_radius rounds each
band; background_padding (a number, or [x, y]) insets it outward. It
rotates, scales, and skews with the element. For a per-run highlight band
use a span background (§5.2.4); for a drop shadow use a drop_shadow
effect (§4.7). The same fields work on caption (one band around the
caption phrase).
5.2.1. text content#
text (or spans, §5.2.4) provides the content. If spans is present it
takes precedence over text. If neither is present the element renders
nothing (and runtimes MAY skip the element entirely).
5.2.2. font_size: "auto"#
When font_size is the string "auto", the runtime MUST compute a
font size such that the rendered text fits inside the element's width.
The exact algorithm is implementation-defined but MUST be deterministic
for the same text + font + width inputs.
5.2.3. Text mask#
interface TextMask { type: "linear-wipe"; angle?: number; // degrees, default -45 progress?: number | Keyframe[]; // 0..1 softness?: number; // 0..1, default 0.3 }
When present, the text is rendered into an offscreen surface and
multiplied by a linear-gradient alpha mask. progress controls the
reveal position; softness controls the wipe edge width.
5.2.4. Spans#
spans carries inline-styled runs. When present it takes precedence
over text. Runs lay out left-to-right; a span whose
text is exactly "\n" is a hard line break. Each span inherits the
element's font_family / font_size / font_weight / fill_color /
letter_spacing unless it overrides them.
interface TextSpan { text: string; font_family?: string; font_size?: number | string; font_weight?: number | string; font_style?: "normal" | "italic"; fill_color?: string; letter_spacing?: number; // px tracking; inherits element's background_color?: string; // flat full-line-box band background?: TextSpanBackground; // stylized band; overrides above nowrap?: boolean; // atomic for word-wrap }
letter_spacing (element and span level) is pixels of tracking added
after EVERY character, including the last — Chrome's model, so boxes
measured in a browser reproduce exactly. nowrap marks the span atomic
for word-wrap (CSS white-space: nowrap semantics): the runtime never
breaks inside it. background draws a band behind the span's glyphs
(color, plus optional height_ratio, inset_y_ratio, padding_x,
skew_x, border_radius — see the schema for exact semantics).
5.3. image and video#
interface ImageElement extends BaseElement { type: "image"; source: string; // URL or path fit?: "cover" | "contain" | "fill" | "none"; // default "cover" crop_x?: number; // source crop, normalized 0..1 (see §5.3.1) crop_y?: number; crop_width?: number; crop_height?: number; } interface VideoElement extends BaseElement { type: "video"; source: string; fit?: "cover" | "contain" | "fill" | "none"; // default "cover" crop_x?: number; // source crop, normalized 0..1 (see §5.3.1) crop_y?: number; crop_width?: number; crop_height?: number; volume?: number; // 0..100, default 100 playback_rate?: number; // media seconds per timeline second, default 1 trim_start?: number; // seconds into the media, default 0 trim_duration?: number; // playable media window after trim_start loop?: boolean; }
Asset reference rules are defined in §8.
5.3.1. Object fit and source crop#
fit follows CSS object-fit against the element box:
| Value | Behavior |
|---|---|
cover (default) | scale media to fill the box; crop the overflow, centered |
contain | scale media to fit inside the box; letterbox |
fill | stretch media to the box exactly |
none | natural media size, centered, cropped to the box |
Source crop. crop_x / crop_y / crop_width / crop_height select a normalized
sub-rectangle of the source media (each in 0..1, origin top-left)
that is shown in place of the whole source. The default is the whole
source — crop_x = 0, crop_y = 0, crop_width = 1, crop_height = 1 — and
omitting the fields is identical to that identity crop.
Crop applies BEFORE fit: the cropped sub-rectangle becomes the
effective media, and fit then maps it into the element box exactly as
in §5.3.1. The element box (its x / y / width / height) is
unchanged — crop only chooses which part of the source fills it, so
crop composes orthogonally with transform, border_radius, filters, and
3D. A runtime MUST clamp the rect to the unit square (crop_width to
1 − crop_x, crop_height to 1 − crop_y); a zero-area crop is treated
as the identity.
Each component MAY be keyframed (§6.3). Animating the crop origin pans across the source and animating its size zooms — a Ken Burns move with no change to the element's layout.
5.3.2. Media time mapping#
The playable trim window is:
window_start = max(0, trim_start)
window_length = min(trim_duration ?? ∞, media_duration − window_start)
The media time sampled at composition time t is:
consumed = max(0, t − element.time) × playback_rate
media_t = window_start + (consumed mod window_length) if loop
= window_start + min(consumed, window_length − ε) otherwise
i.e. loop wraps WITHIN the trim window; without loop the last frame
holds. playback_rate MUST be a static number (keyframed rates are not
defined in CKP/1.0).
Time remapping. A video MAY carry time_remap — a keyframe array
(§6.3 semantics: destination-keyframe easing, element-local time) whose
VALUES are media times in seconds. When present, it REPLACES the
mapping above entirely (trim_start, trim_duration,
playback_rate, and loop are ignored):
media_t = clamp(interpolate(time_remap, t − element.time), 0, media_duration − ε)
Speed ramps are steep segments, freeze frames are flat ones, and
reverse playback follows decreasing values; easings shape the ramp.
Decoders quantize media_t to the media's own frames, which is
conformant (§2.1 motion-blur note applies the same way).
Varispeed audio. Audio under a warped clock — a video's embedded
track under time_remap, or any sound element inside a time-remapped
group (§5.8.3) — plays VARISPEED, tape-style: at each instant the
audio advances through the media at rate(t) = d(media_t)/dt, and
pitch shifts with the rate (2× plays an octave up, slow-motion plays
low). Flat segments (rate 0) are silent; decreasing segments play the
media REVERSED at |rate|. The reference implementation samples the
effective media-time function at 10 ms through the full warp chain,
splits it into monotonic runs, and schedules each run with a rate
curve (reversed runs play a sample-reversed copy of the buffer).
Pitch-preserving time-stretch is NOT defined in CKP/1.0. Fade
envelopes (audio_fade_in/audio_fade_out) are not applied under
warps in v1.
5.3.3. Video audio#
If the media container carries an audio track, conforming Level 3
runtimes MUST play/mix its FIRST audio track using the same timing as
§5.3.2, with gain volume / 100. playback_rate resamples the audio
(pitch shifts accordingly; time-stretch is not defined in CKP/1.0).
Videos without an audio track are silent — not an error.
5.3.4. Audio fades#
audio_fade_in / audio_fade_out (on both audio and video
elements, seconds, default 0) shape the gain over the element's
TIMELINE window [0, L]:
g(τ) = volume/100 × min(1, τ / fade_in) × min(1, (L − τ) / fade_out)
(each factor is 1 when its fade is 0). The envelope is piecewise linear between its corner points; runtimes MUST reproduce it within normal gain-ramp accuracy, including when playback starts mid-fade (seek).
5.4. audio#
interface AudioElement extends BaseElement { type: "audio"; source: string; volume?: number | Keyframe[]; // 0..100 trim_start?: number; trim_duration?: number; loop?: boolean; }
Audio elements have no visual representation. They contribute to the
mixed audio track produced by export-conformant runtimes (§10.3).
Preview-only runtimes MAY play them via HTMLAudioElement or similar.
5.5. caption#
Word-timed captions with optional kinetic styling.
interface CaptionElement extends BaseElement { type: "caption"; words: { text: string; start: number; end: number }[]; // start/end relative to element.time style?: "tiktok_bounce" | "fade_reveal" | "kinetic_typewriter" | "word_pop"; // Windowing — how much of the transcript shows at once (§5.5.1). A whole // transcript otherwise renders as one block; with `max_length` set, the words // are split into chunks and only the chunk active at the current time shows. max_length?: number | "auto"; // number = max LETTERS per chunk; // "auto" = a few words per chunk; // absent = show all at once. // Text-like styling font_family?: string; font_size?: number | string; // "auto" fits joined words to width font_weight?: number | string; fill_color?: string; highlight_color?: string; highlight_background_color?: string; text_align?: "left" | "center" | "right"; }
words[*].start and words[*].end are seconds RELATIVE to the
element's time. The exact kinetic behavior for each style is
defined by the reference implementation; deviations MAY occur in third-
party renderers but the timing MUST match.
5.5.1. Windowing (max_length)#
A full transcript on one element would render as a single unreadable block.
max_length splits words into CHUNKS; at any time only the chunk active then
is shown (within the element's box, wrapped). Chunking:
- number — grow a chunk word-by-word until adding the next word would exceed this many LETTERS (characters), then start a new chunk.
"auto"— chunk by a few words (and break on pauses) — the speech default.- absent — no windowing; the whole transcript shows at once.
The active chunk at element-local time t is the last chunk whose first word has
started (so a chunk lingers through silent gaps until the next begins). Word
kinetics (style) apply WITHIN the active chunk. Word start/end are
unchanged — max_length is a display rule, not a re-timing.
5.6. shape paths (vector geometry)#
A shape (§5.1) that carries paths renders as arbitrary vector geometry —
a restricted SVG-path subset (this absorbs the former standalone svg
element). Conforming runtimes MUST support viewBox-scaled paths with linear
gradients, clip-to-path, stroke-dashoffset progress, and per-path opacity.
The path-form fields live on ShapeElement (§5.1): view_box, gradients,
and paths. Their element types:
interface PathGradient { id: string; type: "linear"; x1: number; y1: number; x2: number; y2: number; // viewBox coords stops: GradientStop[]; } interface PathDef { d: string | Keyframe[]; // SVG path data, or d-string keyframes (§5.6.2) fill?: string; // hex or "url(#gradient-id)" stroke?: string; stroke_width?: number; stroke_progress?: number | Keyframe[]; // 0..1 trim_start?: number | Keyframe[]; // §5.6.1 trim_end?: number | Keyframe[]; trim_offset?: number | Keyframe[]; clip_path?: string; // another path that clips this one stroke_linecap?: "butt" | "round" | "square"; stroke_linejoin?: "miter" | "round" | "bevel"; opacity?: number; // 0..1 }
stroke_progress MUST drive the standard stroke-dasharray /
stroke-dashoffset reveal — a progress of 0 shows no stroke; 1
shows the entire stroke. Implementations MUST measure path length
deterministically (the reference implementation uses
SVGPathElement.getTotalLength()).
5.6.1. Trim paths#
Each path MAY carry a trim window: only the stroke between
trim_start and trim_end — fractions of the path's TOTAL LENGTH,
0..1 — is drawn. trim_offset rotates the window around the path,
WRAPPING at the ends (an offset of 1 is a full lap), so an animated
offset is the classic traveling-dash "snake" and an animated
trim_end is the draw-on reveal. All three are animatable;
stroke_progress remains as sugar for [0, progress] and is ignored
when any trim field is present. Fill is unaffected — trimming applies
to the STROKE only.
Reference evaluation: with window width w = clamp(trim_end, 0, 1) − clamp(trim_start, 0, 1) (nothing draws when w ≤ 0; the full stroke
when w ≥ 1) and wrapped anchor a = (trim_start + trim_offset) mod 1,
the stroke uses a dash pattern of [w·L, L − w·L] with dash offset
−a·L, where L is the path's total length — the pattern's period
equals L, so windows crossing the path's start wrap exactly.
5.6.2. Path morphing#
A path's d MAY be a keyframe array whose values are d-strings.
Between two keyframes the path MORPHS when the pair is COMPATIBLE —
identical command-letter sequences, equal numeric-argument counts, and
no arc commands (A/a, whose boolean flags cannot interpolate):
every numeric argument interpolates with the destination keyframe's
easing. An INCOMPATIBLE pair SNAPS: the source value holds until the
destination keyframe's time. No path normalization is performed — the
protocol stays literal; authors export morph targets with matching
command structure (the standard practice).
5.7. particles#
A deterministic particle system with two modes: ballistic emission and target-point convergence.
interface ParticlesElement extends BaseElement { type: "particles"; // Common size?: number; // pixels, default 12 size_variation?: number; // 0..1, default 0.4 particle_shape?: "square" | "circle"; color?: string | string[]; // array randomizes per particle rotation_speed?: number; // deg/s lifetime?: number; // seconds per particle, default 1.5 fade_at?: number; // 0..1 fraction of lifetime where fade begins, default 0.7 // Ballistic emission rate?: number; // particles per second velocity?: number; // initial speed px/s spread?: number; // cone in degrees, default 360 direction?: number; // 0=right, 90=down, -90=up gravity?: number; // px/s², positive=down // Depth (CKP/1.0, §5.7.3) z_velocity?: number; // px/s along the plane normal, default 0 z_spread?: number; // uniform vz range width px/s, default 0 // Burst (used by both modes) burst?: boolean; burst_count?: number; // Convergence (set target_points to enter convergence mode) target_points?: [number, number][]; // canvas-space targets convergence_easing?: EasingFunction; scatter_radius?: number; // disk radius around emitter }
5.7.1. Determinism#
Every particle's position, rotation, size, and color MUST be a pure
function of (element.id, particle_index, age). This means a runtime
that seeks to time T MUST produce the same composition as a runtime
that played continuously to T. The reference implementation seeds a
PRNG with a hash of element.id and the particle index; third-party
runtimes MAY use any algorithm that produces the same output as the
reference.
5.7.2. Convergence mode#
When target_points is present and non-empty:
- Each particle
nis assigned the targettarget_points[n % length]. - Each particle's start position is randomly placed within a disk of
radius
scatter_radius(default =max(canvas_width, canvas_height)) centered on(x, y). - The particle's position is
lerp(start, target, easing(age/lifetime))usingconvergence_easing(default"ease-out-quart").
5.7.3. Depth (CKP/1.0)#
Per particle, vz = z_velocity + (r − 0.5) × z_spread with r the
particle's uniform random draw, and its depth offset is vz × age px
along the emitter plane's normal (+z toward the viewer, §4.4). The
offset applies in BOTH modes (it is orthogonal to the in-plane
position) and is part of the §5.7.1 determinism contract. There is no
z gravity — gravity stays in-plane y. Like the z field (§4.4.2),
depth has no visual effect without perspective in the chain, and
particles draw in spawn order, never depth-sorted among themselves —
the §4.4.3 camera sort orders whole elements, not a particle system's
internal quads. With both fields absent or 0 the simulation is
exactly the 2D one.
5.8. group#
interface GroupElement extends BaseElement { type: "group"; elements: Element[]; clip?: boolean; // default false mask?: { mode: "alpha" | "alpha-inverted" | "luma" | "luma-inverted"; elements: Element[]; }; }
A positioned container. Children's x/y are coordinates in the
group's LOCAL space, origin at the group's top-left box corner; a
child's time is relative to the group's start; child layer and z
order locally (§4.2). The group's transform (§4.4) and
opacity stack multiplicatively onto children. Percentage/viewport units
inside a group still resolve against the COMPOSITION canvas.
5.8.1. Clipping#
With clip: true (requires explicit width and height), children
render into an offscreen layer the size of the group's box; pixels
outside the box are discarded (CSS overflow: hidden). The group's
own transform and opacity apply to the composited layer as a whole —
opacity therefore applies ONCE to the flattened layer (overlapping
semi-transparent children do not double-blend).
border_radius (px) rounds the clip box: children are masked to a
rounded rectangle, matching a rounded card that clips its content (CSS
overflow: hidden + border-radius). It is clamped to half the
smaller box dimension and is ignored on an unclipped group. (Rounded
clipping currently applies to the plain clip path; a mask group
ignores border_radius since the mask layer already defines coverage.)
5.8.2. Masks#
The mask belongs to the group it masks — declared on the masked
element, never inferred from siblings or layer adjacency. Mask
elements render into a second box-sized layer using the same local
coordinate space and timing rules as children, and may animate.
The content layer composites through the mask layer per pixel:
factor = mask.alpha (alpha)
= 1 − mask.alpha (alpha-inverted)
= luminance(mask.rgb) (luma; Rec. 709 weights 0.2126/0.7152/0.0722,
computed on premultiplied values)
= 1 − luminance(mask.rgb) (luma-inverted)
output = content × factor
mask requires explicit width/height and implies clipping (both
layers are box-sized).
5.8.3. Group time remapping#
A group MAY carry time_remap — a keyframe array (§6.3 semantics)
whose VALUES are warped local times in seconds. The group's SUBTREE
runs on the warped clock:
local = t − group_start
warped = max(0, interpolate(time_remap, local))
Children evaluate exactly as if the group's local time were warped:
their time windows, animations, keyframes, transitions, and nested
media all read the warped clock. Nested remapped groups compose (each
warps its parent's clock in turn). The group's OWN animated properties
(opacity, rotation, scale, position) read REAL time — the container
moves on the composition's clock; only its contents are warped.
Flat segments freeze the subtree, steep segments speed-ramp it, and decreasing values run it backwards. Nested video decodes the frame at its warped media time (through §5.3.2's mapping). Audio inside a remapped subtree follows the varispeed rule (§5.3.2).
5.9. No nested-composition element#
CKP deliberately has NO composition (pre-comp) element. Both things a
pre-comp bundles are covered by orthogonal features on plain elements:
- Reuse is an authoring-time concern: template functions expand into
plain elements before the Source is serialized (see
@clipkit/patternsfor the first-party library). The wire format stays fully decomposed — a runtime never resolves references or instantiates templates. - Nested timing is
time_remapon a plaingroup(§5.8.3).
A Source containing type: "composition" is invalid under CKP/1.0.
6. Animation#
Every animatable property may be driven in three ways: a static value, a named-preset animation, or a keyframe animation. These compose with the precedence:
keyframe_animation > named_animation > static_value
That is, if both a keyframe animation and a named animation target the same property at the same time, the keyframe wins.
6.1. Static values#
The value as written in JSON. No interpolation; the value is used as-is for the entire duration the element is active.
6.2. Named animations#
interface Animation { type: AnimationType; duration?: number; // seconds; defaults in §6.2.1 easing?: Easing; // default "ease-out" unless noted time?: "start" | "end" | number; // start, relative to element.time // Parameters read by specific types (ignored otherwise): frequency?: number; // Hz — shake (8), wiggle (2), text-wave (1.5) rotation?: number; // degrees — spin (360), wiggle amplitude (8), // text-flip start angle (90) distance?: number; // px — pan/shift (200), shake (24), // text-slide (40), text-fly (140), text-wave (12) direction?: "left" | "right" | "up" | "down"; // pan/shift/text-slide/text-fly scale?: number; // squash depth 0..1 (0.3) split?: "letter" | "word"; // text-* unit granularity (§6.5) stagger?: number; // text-* seconds between units (§6.5) axis?: "x" | "y" | "z"; // text-flip rotation axis (§6.5, CKP/1.0) }
time: "start" (or absent) resolves to local time 0; "end" to
element_duration − duration; a number is local seconds.
Normative tween recipes (deltas apply to the listed property; relative adds to the static value, absolute replaces it during the window):
| Type | Property | From → To | Mode | Notes |
|---|---|---|---|---|
fade-in / fade-out | opacity | 0→1 / 1→0 | absolute | |
slide-left-in | x | −200→0 | relative | starts left, moves right into place |
slide-right-in | x | +200→0 | relative | |
slide-up-in | y | +200→0 | relative | starts below, rises |
slide-down-in | y | −200→0 | relative | |
slide-*-out | x/y | 0→±200 | relative | motion direction matches the name |
scale-in / scale-out | scale | 0→1 / 1→0 | absolute | |
rotate-in / rotate-out | rotation | −90→0 / 0→+90 | relative | |
bounce-in / bounce-out | scale | 0→1 / 1→0 | absolute | default easing ease-out-back / ease-in-back |
spin | rotation | 0→rotation | relative | default easing linear |
shake | x | oscillation, amplitude distance→0 | relative | sin(2π·frequency·t) × eased envelope |
wiggle | rotation | oscillation, constant amplitude rotation | relative | same formula |
squash | y_scale, x_scale | 1→1−scale→1; 1→1+0.6·scale→1 | absolute | two half-duration phases; in ease-in-quad, out ease-out-back |
pan | x or y | −distance/2→+distance/2 along direction | relative | drifts through rest position; default easing linear |
shift | x or y | 0→distance along direction | relative | fill-forward: the end value holds for the rest of the element's life |
drift | x, y | smooth random walk, amplitude distance (30) | relative | offsets = (noise1d(frequency·t, seed) − 0.5) × 2 × distance per axis (y uses seed + 7919); frequency default 0.5, seed default 0 |
breathe | scale | oscillation, amplitude scale (0.05) | relative | scale × sin(2π·frequency·t), frequency default 0.4 |
orbit | x, y | circle of radius distance (40) | relative | x += r·sin(2πft + π/2), y += ±r·sin(2πft) (direction: "left" flips y = counter-clockwise); frequency default 0.5 rev/s |
text-* | per-unit | — | — | §6.5 |
drift's noise is NORMATIVE — the 1D form of §4.7's lattice noise:
noise1d(x, seed) is the quintic-faded linear interpolation between
h(⌊x⌋) and h(⌊x⌋+1) where h(i) = pcg(i XOR pcg(seed)) / (2³²−1)
and pcg is §4.7's hash. Same seed → identical motion everywhere.
6.2.1. Duration defaults#
duration defaults to 0.5, EXCEPT spin, shake, wiggle, pan,
drift, breathe, orbit and text-wave, which default to the
element's full duration when both time and duration are omitted
(they read as "for the element's life"). Outside its window a tween stops contributing (the property
returns to its static value), except shift's documented fill-forward.
6.3. Keyframe animations#
interface KeyframeAnimation { property: string; // name of the BaseElement field loop?: boolean | "ping-pong"; // repeat the pattern (see below) keyframes: Keyframe[]; // monotonically increasing times easing?: EasingFunction; // default per-keyframe }
For times before the first keyframe, the value is clamped to the first
keyframe's value. After the last keyframe, clamped to the last value.
Between keyframes i and i+1, the value is interpolated using the
easing on keyframe i+1 (or the animation's easing if not specified
per-keyframe). A single-keyframe track is a constant.
6.3.1. Color keyframes#
When EVERY keyframe value in a track parses as a color (§3.4 — #…,
rgb(…), rgba(…)), the track is a color track: values interpolate
componentwise in straight-alpha RGB space (alpha included), using the
same easing rules. Color tracks are honored on fill_color (shape and
plain text) and stroke_color (shape). Unparseable colors fall back to
opaque white anywhere colors are parsed.
With loop, local time folds before interpolation: true wraps —
t' = t mod span — and "ping-pong" reflects — t' = span − |((t mod 2·span)) − span| — where span is the LAST keyframe's time. The
pattern repeats for the element's whole life; without loop, time
past the last keyframe holds the final value (unchanged default).
Looping applies to scalar, color, and position keyframe animations.
6.4. Easing functions#
CKP/1.0 defines 36 named easing functions plus two parametric forms.
Mathematical definitions are in the reference implementation
(packages/runtime/src/animation/easings.ts); the polynomial/sine/expo/
circ/back families follow easings.net.
linear
ease, ease-in, ease-out, ease-in-out
ease-in-cubic, ease-out-cubic, ease-in-out-cubic
ease-in-quad, ease-out-quad, ease-in-out-quad
ease-in-quart, ease-out-quart, ease-in-out-quart
ease-in-quint, ease-out-quint, ease-in-out-quint
ease-in-sine, ease-out-sine, ease-in-out-sine
ease-in-expo, ease-out-expo, ease-in-out-expo
ease-in-circ, ease-out-circ, ease-in-out-circ
ease-in-back, ease-out-back, ease-in-out-back
elastic-in, elastic-out, elastic-in-out (decaying sinusoidal overshoot)
bounce-in, bounce-out, bounce-in-out (piecewise-parabolic ball drop)
spring (damped harmonic oscillator: mass=1, damping=10, stiffness=100;
~5% overshoot then settles. Remotion's signature feel.)
Parametric forms (string-valued):
cubic-bezier(x1, y1, x2, y2)— CSS timing-function semantics;x1/x2MUST be clamped to[0, 1],y1/y2are unbounded.steps(n)—nequidistant steps, jump-at-end (CSSsteps(n, end)).
Unknown easing names MUST fall back to linear (never error). Output
MUST match the reference within ±0.001 at any input value in [0, 1].
6.5. Per-unit text animations#
The text-* animation types apply to text elements only (ignored on
other element types; caption elements have their own kinetic system).
The text splits into units and each unit runs the same animation,
offset in time by stagger seconds per unit index.
Unit indexing. Letter index counts drawn glyphs (whitespace
excluded); word index counts whitespace-separated runs. Both run
continuously across spans and line breaks. Unit u starts at
time + u × stagger.
Defaults. split: "word" for text-appear/text-slide/
text-fly, "letter" for text-typewriter/text-wave/text-flip.
stagger: 0.09 for word splits, 0.035 for letter splits.
Per-unit duration default 0.5.
| Type | Per-unit effect | Defaults |
|---|---|---|
text-appear | opacity 0→1 over duration | easing ease-out-cubic |
text-slide | opacity 0→1 + displaced distance px opposite direction, settling at rest | distance 40, direction up, easing ease-out-cubic |
text-fly | as text-slide, farther | distance 140, easing ease-out-back |
text-typewriter | opacity steps 0→1 at the unit's start time | no fade |
text-wave | y offset distance × sin(2π·frequency·t − 0.6·u) | distance 12, frequency 1.5; full-length default (§6.2.1) |
text-flip | opacity 0→1 + 3D rotation rotation × (1−eased) degrees about axis through the unit's center, settling flat (CKP/1.0) | rotation 90, axis "x", easing ease-out-cubic |
Per-unit effects fold into the glyph's tint (opacity) and an element-local offset applied BEFORE the element transform (§4.4), so kinetic type composes with scale/skew/rotation.
text-flip semantics (CKP/1.0). The rotation pivot is the unit's
rest-layout center — the glyph cell's center for letter splits, the
word's glyph bounding-box center for word splits — translated by the
unit's current per-unit offset, so a unit sliding and flipping stays
rigid. A word split rotates the word as ONE slab (its glyphs never
splay). When multiple text-flip animations target an element, word
rotations compose OUTSIDE letter rotations. Axis "x" flips up
(rotation about the horizontal axis), "y" swings in, "z" spins
in-plane. Like all 3D (§4.4.2), the depth component is orthographic
without a camera; the foreshortening is cos θ exactly. An active
text-flip puts the text element on the full-matrix path; elements
without one are unaffected (§4.4.3 cost rules).
CKP/1.0 defines text animations as entrances only: time: "end" on a
text-* animation MUST be ignored.
6.6. Transitions (non-feature)#
CKP/1.0 deliberately defines NO transition primitive — every transition decomposes into existing primitives, and the document is exactly what renders:
- Crossfades, pushes, zoom swaps — two overlapping elements with paired animations (AGENTS.md §"Transitions").
- Wipes (circular, linear, stripe, soft-edged) — the incoming slide inside a masked group (§5.8.2) whose mask elements animate: a growing ellipse, a sweeping rectangle, a luma gradient band (AGENTS.md §"Wipes").
An earlier draft reserved a first-class two-layer transition object for wipes; group masks made it unnecessary and it is no longer planned.
6.7. Spatial motion paths#
A keyframe_animations entry with property: "position" moves the
element along a path; it overrides the element's x and y (and
any scalar x/y keyframe animations). Keyframe values are
[x, y] pairs in canvas pixels — or [x, y, z] triples for a 3D
path (z in pixels, +z toward the viewer, §4.4). A 3D path
additionally overrides the element's z (and any scalar z
keyframe animations); a 2D path leaves z untouched. All keyframes
of one position path MUST agree in dimensionality — mixing [x, y]
and [x, y, z] in one animation is a validation error (no silent
z = 0 promotion).
{ "property": "position", "auto_orient": true, "keyframes": [ { "time": 0, "value": [200, 800], "out_tangent": [240, -300] }, { "time": 2, "value": [1700, 300], "in_tangent": [-200, -120], "easing": "ease-in-out" } ] }
Each consecutive keyframe pair is one CUBIC BEZIER segment with control points
P0 = a.value P3 = b.value
P1 = P0 + a.out_tangent P2 = P3 + b.in_tangent
where an omitted tangent defaults to the straight-line third-point
(P1 = P0 + (P3−P0)/3, P2 = P3 − (P3−P0)/3) — a path with no
handles is exact polyline motion. On a 3D path, tangents are
[dx, dy] or [dx, dy, dz]; a 2-component tangent's dz defaults
to the straight-line third-point in z (the omitted-handle rule,
applied per axis). 3-component tangents on a 2D path are a
validation error.
Travel is ARC-LENGTH parameterized (NORMATIVE): the destination
keyframe's easing maps segment-local time to a fraction u of the
segment's length; the bezier parameter is found on a 64-chord
cumulative-length table (curve sampled at t = i/64, i = 0…64;
linear interpolation between chords). With linear easing the element
travels at constant speed however the handles stretch the curve's
parameterization. On a 3D path the chord lengths are measured on the
3D curve — constant speed means constant speed through depth too.
Before the first keyframe the element holds the first point; after
the last, the last point.
auto_orient: true adds the path's travel direction —
atan2(dy, dx) of the bezier derivative at the sampled parameter, in
degrees — to the element's own rotation. Orientation is STRICTLY
IN-PLANE: on a 3D path the tangent's xy projection is used and dz
is ignored — auto_orient never derives x_rotation or y_rotation
(a path does not tilt the element's plane). A zero xy derivative
(z-only travel or coincident control points) falls back to the
segment chord's xy projection. Position values are numbers (pixels);
length strings are not valid inside path keyframes.
Like the z field itself (§4.4.2), path z has no visual effect
without perspective somewhere in the chain — under no camera the
element renders at the path's xy projection.
7. Time, duration, sequencing#
7.1. Element activity windows#
An element is active at composition time t if:
start <= t <= start + duration
where:
start = element.time(default0).duration = element.durationif numeric, else the composition's remaining time if"auto"or"end".
Inactive elements MUST NOT be rendered.
7.2. Local time#
Many properties (keyframes, named-animation timing, particle simulation)
operate in local time — seconds elapsed since the element
became active. Local time 0 corresponds to composition time
element.time.
For animation evaluation, local time MUST be clamped to the element's duration:
local_t = min(t − element.time, element_duration)
Rationale: the activity check (§7.1) computes the element's end as
time + duration while local time computes t − time; at exact frame
boundaries these two float roundings can disagree by ~1 ulp, leaving an
element active at a local time fractionally PAST its duration — which
would skip every end-anchored animation for one frame and flash the
static value. The clamp makes the boundary frame well-defined.
7.3. The composition duration#
If Source.duration is "auto", the composition's effective duration
is the maximum end time across all elements. Otherwise, it is the
declared value.
Runtimes MUST produce frames from composition time 0 to duration
inclusive of 0, exclusive of duration. At 30 fps and duration: 5
this produces 150 frames at times 0, 1/30, 2/30, ..., 149/30.
8. Asset references#
Asset-bearing elements (image, video, audio) carry a source
string identifying the asset. The protocol does not embed binaries.
8.1. Allowed schemes#
Conforming runtimes MUST attempt to resolve:
https://URLshttp://URLs (MAY require user opt-in for mixed content)
Conforming runtimes MAY additionally support:
file://URLs (local file references)- Absolute paths (
/path/to/file) - Relative paths (
./asset.png) — resolved against an implementation-defined base data:URIs
Runtimes MUST fail with a clear error when a source cannot be
resolved. They MUST NOT silently substitute a placeholder.
8.2. Preloading#
Runtimes SHOULD provide a preload step that resolves all asset
references before rendering begins. This is REQUIRED for export
conformance (§10.3) — exports cannot tolerate runtime asset failures.
9. Output and rendering#
9.1. Determinism#
Given identical (Source, time) inputs, every conforming runtime MUST
produce visually equivalent frames. "Visually equivalent" means:
- Same element positions, sizes, rotations, opacities to within ±0.5 px.
- Same animation values to within ±0.001 for normalized properties.
- Same particle state (positions, alphas, sizes) to within ±0.5 px / ±0.001 alpha.
- Color output MAY differ by up to ~1/255 per channel due to rasterization backends. Pixel-exact equivalence is NOT REQUIRED.
9.2. Frame timing#
A runtime asked to produce frame f of a composition with frame_rate
FPS MUST render the scene state at composition time:
t = f / frame_rate
f = 0 is the first frame.
9.3. Output formats#
| Format | Behavior |
|---|---|
mp4 | Encoded video. H.264 baseline + AAC audio is the recommended baseline. Higher profiles MAY be used. |
gif | Animated GIF. Audio is silently dropped. |
Runtimes MAY support output formats beyond these.
10. Conformance levels#
CKP defines three nested conformance levels. An implementation MAY claim any level it actually supports.
10.1. Level 1 — Validation#
The implementation MUST be able to parse a Source JSON object and report whether it is a valid CKP/1.0 document. Specifically:
- Accept any valid document per §2–8.
- Reject documents with invalid
typediscriminators on required fields, missing REQUIRED fields, or out-of-range values. - Tolerate unknown additional fields (forward compatibility, §2.2).
The @clipkit/protocol package provides Level 1 validation as a
reference.
10.2. Level 2 — Rendering#
The implementation MUST also be able to produce frame images from a valid Source. Specifically:
- All element types (§5) MUST render.
- All animations (§6) MUST animate.
- Output MUST satisfy the determinism requirements (§9.1).
A Level 2 implementation MUST NOT silently skip element types unless the document is at a higher protocol version than the implementation supports.
10.3. Level 3 — Export#
The implementation MUST also be able to produce encoded output: typically MP4 with mixed audio.
- All Level 2 requirements.
- Audio elements (§5.4) MUST be mixed into the output track.
- Output duration MUST match
Source.durationprecisely (frame-accurate).
The reference runtime @clipkit/runtime is a Level 3 implementation: it
sums all sources to the master bus, then applies a fixed peak limiter
(transparent below 0 dBFS) so a hot mix is contained rather than
hard-clipped. The same limiter runs in preview, so what you hear matches
the render. The exact mix algorithm otherwise remains
implementation-defined (§1).
11. Versioning and extensions#
11.1. The clipkit_version field#
Documents SHOULD declare their protocol version:
{ "clipkit_version": "1.0", "elements": [...] }
Absence is interpreted as "1.0" for the lifetime of CKP/1.x.
11.2. Version compatibility#
Versions follow MAJOR.MINOR (semver-style without patch):
- Same MINOR: runtimes MUST render documents at the same minor version with no warning.
- Higher MINOR: runtimes MUST attempt to render. They SHOULD warn that unknown fields may be ignored.
- Higher MAJOR: runtimes MUST refuse to render and report the version mismatch. Major versions indicate breaking changes.
- Lower MAJOR: runtimes MUST render if they implement the older major version. Backward compatibility within a major line is permanent.
11.3. Adding fields#
Any minor version MAY add fields to existing element types. Unknown fields in older runtimes pass through harmlessly per §2.2.
11.4. Adding element types#
Any minor version MAY add new element types. Runtimes that do not
recognize a new type MAY skip the element with a warning.
11.5. Breaking changes#
Removing fields, changing field types, or changing the meaning of an existing field requires a major-version bump.
11.6. Extension namespace#
Implementations and tools MAY include vendor-specific fields prefixed
with x_. The protocol reserves bare names; x_ names are
implementation-defined and ignored by other implementations. Names
the protocol itself defines (e.g. x_scale, x_skew, x_rotation)
are bare protocol names, not extensions, regardless of prefix.
11.7. Version history#
| Version | Additions |
|---|---|
| 1.0 | Initial protocol. |
| 1.1 | 3D transform model (§4.4): x_rotation / y_rotation / z_rotation (alias of rotation) / z on every element; Source-level camera; paint-order + flattening compositing rules (§4.4.3); glass under 3D via the pane-plane homography (§4.7); 3D motion paths ([x, y, z] position keyframes, §6.7); text-flip per-unit 3D reveals (§6.5); particle depth (z_velocity / z_spread, §5.7.3). Additive — documents without these fields render identically to 1.0. |
12. Implementation notes (non-normative)#
These notes are advisory. They reflect lessons from the reference implementation and MAY help third-party implementers avoid pitfalls.
12.1. Premultiplied alpha#
All blending in the reference runtime uses premultiplied alpha:
textures are uploaded premultiplied, shaders output premultiplied
values, and the canvas swap-chain is configured with
alphaMode: "premultiplied". This avoids the "dark halo" artifacts
that appear with straight-alpha blending. Third-party runtimes are
free to use any blending convention internally as long as final output
matches §9.1.
12.2. Corner radii on non-square rectangles#
If you implement rounded rectangles with a signed-distance function
in shaders, the SDF MUST operate in pixel space, not in normalized UV
space. UV space is anisotropic for non-square rectangles, so doing the
corner math in UV stretches the arc into an ellipse. Pass the
rectangle's (width, height) to the shader and convert
uv * size before the SDF.
12.3. Font atlases#
The reference runtime generates a glyph atlas per (family, size, weight). It covers ASCII (0x20–0x7E) only; characters outside this range are silently dropped. Third-party implementations MAY support larger glyph ranges; documents SHOULD avoid relying on non-ASCII text in v1.0 unless the target runtime is known to support it.
12.4. Particle PRNG#
The reference runtime uses a mulberry32 PRNG seeded by
FNV-1a(element.id) + n * 0x9e3779b9. Third-party runtimes are NOT
required to match this exact PRNG, only the outputs implied by §5.7.1.
Appendix A: Reference implementation map#
| Spec section | Reference implementation |
|---|---|
| §2 Source | packages/schema/src/types.ts (Source) + zod.ts (sourceSchema) |
| §3 Units | packages/runtime/src/compositor/unit.ts |
| §4 Element model | packages/schema/src/types.ts (BaseElement) |
| §5.1 shape | packages/runtime/src/compositor/element-renderers/shape.ts |
| §5.2 text | packages/runtime/src/compositor/element-renderers/text.ts |
| §5.3–5.4 video/image/audio | packages/runtime/src/{compositor/element-renderers,audio}/ |
| §5.5 caption | packages/runtime/src/compositor/element-renderers/caption.ts |
| §5.6 shape paths | packages/runtime/src/svg/svg-renderer.ts |
| §5.7 particles | packages/runtime/src/compositor/element-renderers/particles.ts |
| §6 Animation | packages/runtime/src/animation/ |
| §10 Conformance | packages/schema/src/validate.ts |
Appendix B: Document history#
- 2026-05-28 — CKP/1.0 draft. Initial publication alongside the reference runtime.