IO.net Task Instructions

Overview

This task involves submitting high-quality video clips with accurate, physically-grounded captions to help train advanced AI video generation models. Your submissions directly impact the quality of next-generation AI video capabilities.

Technical Specifications

Resolution:
- Landscape: 1280×720 or 1920×1080 (16:9 aspect ratio)
- Portrait: 720×1280 or 1080×1920 (9:16 aspect ratio)
Format: H.264 codec in MP4 container
Frame Rate: Minimum 30fps
Length: 5-10 seconds recommended (no hard limit if quality and consistency are maintained)
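As a rough pre-submission check, the specifications above can be expressed as a small Python function. This is a minimal sketch, not an official validator: the function name `check_specs` is hypothetical, the `stream` dictionary is assumed to mirror one video stream entry from `ffprobe -of json -show_streams` (keys `codec_name`, `width`, `height`, `r_frame_rate`), and the file extension is used only as a weak stand-in for the container check.

```python
from fractions import Fraction

# Resolutions listed in the specifications, as (width, height).
ALLOWED_RESOLUTIONS = {(1280, 720), (1920, 1080), (720, 1280), (1080, 1920)}
MIN_FPS = 30


def check_specs(filename: str, stream: dict) -> list[str]:
    """Return a list of problems; an empty list means these checks passed.

    Assumption: `stream` has the shape of an ffprobe video stream entry.
    """
    problems = []

    # Weak proxy for the container: the extension, not the actual file contents.
    if not filename.lower().endswith(".mp4"):
        problems.append("file extension is not .mp4")

    if stream.get("codec_name") != "h264":
        problems.append(f"codec is {stream.get('codec_name')!r}, expected 'h264'")

    size = (stream.get("width"), stream.get("height"))
    if size not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution {size} is not one of the listed sizes")

    # ffprobe reports frame rate as a fraction string such as "30/1".
    try:
        fps = Fraction(stream.get("r_frame_rate", "0/1"))
    except (ValueError, ZeroDivisionError):
        fps = Fraction(0)
    if fps < MIN_FPS:
        problems.append(f"frame rate {float(fps):.2f} fps is below {MIN_FPS}")

    return problems


ok = {"codec_name": "h264", "width": 1920, "height": 1080, "r_frame_rate": "30/1"}
print(check_specs("clip.mp4", ok))  # []
```

Note that a strict comparison treats `30000/1001` (about 29.97 fps) as below the minimum; the instructions do not say whether that rate is accepted, so the sketch takes the conservative reading.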

Content Requirements

Shot Consistency (Critical)

Each clip must contain only one continuous shot from a single camera instance
The camera can move naturally (pan, tilt, zoom), but the clip must be one uninterrupted recording
Do not combine multiple shots in one clip; cuts create "teleporting camera" effects
Submit multiple shots of the same scene as separate clips
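One way to screen a clip for accidental cuts before submitting is to compare consecutive frames and flag abrupt changes. The sketch below is an illustration only and is not the method used by reviewers: `find_cut_candidates` and its `threshold` value are hypothetical, and frames are assumed to be already decoded into equal-length sequences of grayscale values (0-255).

```python
def find_cut_candidates(frames, threshold=40.0):
    """Return indices i where frame i differs sharply from frame i - 1.

    Assumptions: each frame is a flat sequence of grayscale values, all frames
    have the same length, and `threshold` is an arbitrary cutoff on the mean
    absolute pixel difference (not a value from the task instructions).
    """
    cuts = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # Mean absolute difference between corresponding pixels.
        mad = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mad > threshold:
            cuts.append(i)
    return cuts


steady = [[100, 102, 98, 101]] * 3      # three near-identical frames
after_cut = [[10, 240, 15, 230]] * 2    # a very different scene
print(find_cut_candidates(steady + after_cut))  # [3]
```

Fast pans, flashes, or exposure changes can also exceed the threshold, so a flagged index is a prompt to re-watch that moment rather than proof of a cut.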

Quality Standards

Natural color grading with consistent white balance
Audio properly synchronized (if present) and free of clipping
Clear, focused subject matter without heavy compression
Stable lighting with no flickering or exposure jumps
Smooth playback without missing frames or jump cuts

Caption Writing Guidelines

Format Requirements

Each caption must be a single natural-language description that flows as coherent sentences or a paragraph. Do not write captions as separate attributes or bullet points.

Physical Grounding (Essential)

✅ GOOD - Physically Grounded:

"A person cuts metal with a plasma torch, creating bright sparks that fall to the concrete floor"
"The woman's purple scarf flows behind her as she pushes through the crowded subway platform"

❌ BAD - Not Physically Grounded:

"With tools like this, all materials will be easy to carve according to your needs"
"This technique ensures perfect results every time"

Rules for Physical Grounding:

Describe only what is visually observable in the video
Avoid predictions, capabilities, or abstract concepts
Every word should relate to concrete visual elements
No claims about effectiveness, ease, or outcomes beyond what's shown
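The rules above can be partly approximated with a keyword screen that flags wording typical of predictions or effectiveness claims. This is a rough heuristic under stated assumptions: the pattern list and the name `flag_ungrounded` are hypothetical, the patterns were chosen only from the good and bad examples in this section, and a clean result does not mean a caption is physically grounded.

```python
import re

# Hypothetical patterns for wording that tends to describe claims, not visuals.
UNGROUNDED_PATTERNS = {
    "future claim": r"\bwill\b",
    "guarantee": r"\b(?:ensures?|guarantees?)\b",
    "ease claim": r"\b(?:easy|easily|effortless(?:ly)?)\b",
    "generalization": r"\bevery time\b|\balways\b",
    "addresses viewer": r"\byou(?:r|rs)?\b",
}


def flag_ungrounded(caption: str) -> list[str]:
    """Return the labels of all patterns found in the caption."""
    text = caption.lower()
    return [label for label, pattern in UNGROUNDED_PATTERNS.items()
            if re.search(pattern, text)]


good = ("A person cuts metal with a plasma torch, creating bright sparks "
        "that fall to the concrete floor")
bad = "This technique ensures perfect results every time"

print(flag_ungrounded(good))  # []
print(flag_ungrounded(bad))   # ['guarantee', 'generalization']
```

A screen like this only catches obvious phrasing; the caption still needs to be read against the video to confirm that every element is visible.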

Caption Structure Examples

“a video of iron being carved with sharp cutting tools, the cuts are perfect and precise in a metal working shop, indoor lighting, extra details about what's in the background of the video, the camera slowly zooms.”

“an anime of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. the camera is stationary. above them a sign says "Ochanomizu Station" in English and Japanese lettering.”

“an iphone video of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. She holds the camera as she walks and the camera follows while facing her.”

“A 3D animation of a girl running away from a kneeling woman in a kimono. a beautiful grassy meadow and sky make up their environment.”

Required Elements in Captions

Video Style/Format: Specify if it's footage, 3D animation, anime, professional cinematography, etc.
Subject Actions: Describe what people/objects are doing using active verbs
Environment Details: Include setting, lighting conditions, background elements
Camera Behavior: Mention if camera moves, remains stationary, or follows subjects
Visual Specifics: Colors, textures, readable text, distinctive objects

Caption Variety for Training

Mix detail levels across your submissions:

Verbose: Include rich environmental details, specific materials, lighting descriptions
Moderate: Focus on main action with key environmental context
Simple: Concise descriptions of primary visual elements

This variety helps AI models generalize better across different prompting styles.
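To see whether a batch of captions actually mixes detail levels, one simple approach is to bucket captions by word count and tally the result. The sketch below is only an illustration: the instructions define no word-count boundaries, so `simple_max` and `moderate_max` are hypothetical thresholds, and word count is a crude proxy for how much detail a caption carries.

```python
from collections import Counter


def detail_level(caption: str, simple_max: int = 15, moderate_max: int = 40) -> str:
    """Bucket a caption by word count (thresholds are illustrative assumptions)."""
    words = len(caption.split())
    if words <= simple_max:
        return "simple"
    if words <= moderate_max:
        return "moderate"
    return "verbose"


def level_mix(captions) -> Counter:
    """Count how many captions fall into each detail level."""
    return Counter(detail_level(c) for c in captions)


print(detail_level("A dog runs across a grassy field"))  # simple
```

If one level dominates the tally for a batch, that suggests rewriting some captions at a different length before submitting.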

Content Diversity Requirements

Submit varied content across:

Demographics: Different ages, ethnicities, clothing styles
Environments: Indoor/outdoor, urban/rural, different lighting conditions
Subjects: People, animals, objects, nature, technology
Camera Styles: Handheld, tripod, drone, close-up, wide shots
Video Styles: Professional, amateur, animation, different aspect ratios

Evaluation Criteria

Caption Accuracy (≥95% Required)

Your descriptions must precisely match the video content. Mismatched captions result in automatic rejection.

Physical Grounding Check

Every caption element must correspond to something visually present in the video. Abstract concepts or unobservable claims will be flagged.

Content Variety Score

Multiple similar submissions (same person, location, or scenario) reduce data quality and may be rejected.

Technical Compliance

All videos must meet resolution, format, and consistency requirements.

Submission Process

Edit your video so that it is a single continuous shot and meets the technical requirements
Write your caption following physical grounding and natural language guidelines
Review both video and caption for accuracy and compliance
Submit for verification and await approval

Common Rejection Reasons

Multiple shots combined in one clip
Captions describing things not visible in the video
Abstract or predictive language in descriptions
Poor video quality or technical non-compliance
Repetitive content without sufficient variety

Success Tips

Watch your video multiple times before writing the caption
Read your caption while watching to ensure perfect alignment
Use specific, concrete visual language
Vary your submission types to maximize training value
Focus on what the AI model needs to "see" to recreate similar content


Last updated 17 days ago