IO.net Task Instructions
Overview
This task involves submitting high-quality video clips with accurate, physically grounded captions to help train AI video generation models. The quality of your submissions directly affects the quality of the models trained on them.
Technical Specifications
Resolution:
Landscape: 1280×720 or 1920×1080 (16:9 aspect ratio)
Portrait: 720×1280 or 1080×1920 (9:16 aspect ratio)
Format: H.264 codec in MP4 container
Frame Rate: Minimum 30fps
Length: 5-10 seconds recommended (no hard limit as long as quality and consistency are maintained)
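The specifications above can be checked mechanically before submission. The following is a minimal Python sketch, not an official IO.net tool. The function name check_video_specs and its parameters are hypothetical; the values are assumed to come from a probe tool such as ffprobe, which reports codec_name, width, height, and r_frame_rate (as a fraction string like "30/1") for a video stream. The file extension is used as a rough proxy for the MP4 container. The sketch reads "Minimum 30fps" strictly, so 30000/1001 (about 29.97 fps) is flagged; these instructions do not say whether reviewers accept that rate.

```python
from fractions import Fraction

# Allowed (width, height) pairs taken from the Technical Specifications.
ALLOWED_SIZES = {(1280, 720), (1920, 1080), (720, 1280), (1080, 1920)}
MIN_FPS = 30


def check_video_specs(filename, codec, width, height, frame_rate):
    """Return a list of problems; an empty list means the clip looks compliant.

    frame_rate is a fraction string such as "30/1" or "30000/1001".
    All parameter names are illustrative assumptions, not a platform API.
    """
    problems = []
    # Assumption: the .mp4 extension stands in for a real container check.
    if not filename.lower().endswith(".mp4"):
        problems.append(f"{filename}: expected an .mp4 file")
    if codec.lower() != "h264":
        problems.append(f"codec is {codec!r}, expected 'h264'")
    if (width, height) not in ALLOWED_SIZES:
        problems.append(f"resolution {width}x{height} is not an allowed size")
    # Fraction parses "num/den" strings exactly, avoiding float rounding.
    if Fraction(frame_rate) < MIN_FPS:
        problems.append(f"frame rate {frame_rate} is below {MIN_FPS} fps")
    return problems
```

Calling check_video_specs("clip.mp4", "h264", 1920, 1080, "30/1") returns an empty list, while a 640×480 clip at 24 fps would return one message per failed requirement. The check covers only the listed specifications; shot consistency and visual quality still need a manual review.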
Content Requirements
Shot Consistency (Critical)
Each clip must contain only one continuous shot from a single camera instance
The camera may move naturally (pan, tilt, zoom), but the clip must be one uninterrupted recording
Do not combine multiple shots in one clip; cuts between shots create "teleporting camera" effects
Submit multiple shots of the same scene as separate clips
Quality Standards
Natural color grading with consistent white balance
Audio properly synchronized (if present) and free of clipping
Clear, focused subject matter without heavy compression
Stable lighting with no flickering or exposure jumps
Smooth playback without missing frames or jump cuts
Caption Writing Guidelines
Format Requirements
Captions must be single, natural language descriptions that flow as coherent sentences or paragraphs. Do not use separate attributes or bullet points.
Physical Grounding (Essential)
✅ GOOD - Physically Grounded:
"A person cuts metal with a plasma torch, creating bright sparks that fall to the concrete floor"
"The woman's purple scarf flows behind her as she pushes through the crowded subway platform"
❌ BAD - Not Physically Grounded:
"With tools like this, all materials will be easy to carve according to your needs"
"This technique ensures perfect results every time"
Rules for Physical Grounding:
Describe only what is visually observable in the video
Avoid predictions, capabilities, or abstract concepts
Every word should relate to concrete visual elements
No claims about effectiveness, ease, or outcomes beyond what's shown
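As a rough aid for self-review, the rules above can be approximated with a phrase check. The following is a minimal Python sketch under stated assumptions: the pattern list is hypothetical, derived only from the two BAD examples above, and is not the filter used during evaluation. A match is a prompt to reread the caption, and an empty result does not prove the caption is physically grounded.

```python
import re

# Hypothetical patterns drawn from the BAD examples; not an official list.
UNGROUNDED_PATTERNS = [
    r"\bwill\b",                     # predictions about the future
    r"\bensures?\b",                 # claims about guaranteed outcomes
    r"\bevery time\b",               # generalizations beyond the clip
    r"\beasy to\b",                  # claims about ease
    r"\baccording to your needs\b",  # addressing the viewer, not the video
]


def flag_ungrounded(caption):
    """Return the patterns found in the caption, in list order."""
    lowered = caption.lower()
    return [p for p in UNGROUNDED_PATTERNS if re.search(p, lowered)]
```

Both GOOD examples above pass this check with no matches, and both BAD examples are flagged. Word-level patterns like these will miss many ungrounded captions and can flag grounded ones (for example, a visible sign containing the word "will"), so the result is only a hint for a manual review.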
Caption Structure Examples
“a video of iron being carved with sharp cutting tools, the cuts are perfect and precise in a metal working shop, indoor lighting, extra details about what's in the background of the video, the camera slowly zooms.”
“an anime of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. the camera is stationary. above them a sign says "Ochanomizu Station" in English and Japanese lettering.”
“an iPhone video of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. She holds the camera as she walks and the camera follows while facing her.”
“A 3D animation of a girl running away from a kneeling woman in a kimono. a beautiful grassy meadow and sky make up their environment.”
Required Elements in Captions
Video Style/Format: Specify if it's footage, 3D animation, anime, professional cinematography, etc.
Subject Actions: Describe what people/objects are doing using active verbs
Environment Details: Include setting, lighting conditions, background elements
Camera Behavior: Mention if camera moves, remains stationary, or follows subjects
Visual Specifics: Colors, textures, readable text, distinctive objects
Caption Variety for Training
Mix detail levels across your submissions:
Verbose: Include rich environmental details, specific materials, lighting descriptions
Moderate: Focus on main action with key environmental context
Simple: Concise descriptions of primary visual elements
This variety helps AI models generalize better across different prompting styles.
Content Diversity Requirements
Submit varied content across:
Demographics: Different ages, ethnicities, clothing styles
Environments: Indoor/outdoor, urban/rural, different lighting conditions
Subjects: People, animals, objects, nature, technology
Camera Styles: Handheld, tripod, drone, close-up, wide shots
Video Styles: Professional, amateur, animation, different aspect ratios
Evaluation Criteria
Caption Accuracy (≥95% Required)
Your descriptions must precisely match the video content. Mismatched captions result in automatic rejection.
Physical Grounding Check
Every caption element must correspond to something visually present in the video. Abstract concepts or unobservable claims will be flagged.
Content Variety Score
Multiple similar submissions (same person, location, or scenario) reduce data quality and may be rejected.
Technical Compliance
All videos must meet resolution, format, and consistency requirements.
Submission Process
Edit your video so that it is a single continuous shot and meets the technical specifications
Write your caption following physical grounding and natural language guidelines
Review both video and caption for accuracy and compliance
Submit for verification and await approval
Common Rejection Reasons
Multiple shots combined in one clip
Captions describing things not visible in the video
Abstract or predictive language in descriptions
Poor video quality or technical non-compliance
Repetitive content without sufficient variety
Success Tips
Watch your video multiple times before writing the caption
Read your caption while watching to ensure perfect alignment
Use specific, concrete visual language
Vary your submission types to maximize training value
Focus on what the AI model needs to "see" to recreate similar content