IO.net Task Instructions

Overview

This task involves submitting high-quality video clips with accurate, physically-grounded captions to help train advanced AI video generation models. Your submissions directly impact the quality of next-generation AI video capabilities.

Technical Specifications

  • Resolution:

    • Landscape: 1280×720 or 1920×1080 (16:9 aspect ratio)

    • Portrait: 720×1280 or 1080×1920 (9:16 aspect ratio)

  • Format: H.264 codec in MP4 container

  • Frame Rate: Minimum 30 fps

  • Length: 5-10 seconds recommended (no hard limit as long as quality and shot consistency are maintained)
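
As a pre-submission aid, the specifications above can be sketched as a small Python check. This is a minimal sketch, not the platform's actual validator: the `meta` dictionary, its keys, and the function name `check_clip` are assumptions made for illustration. It also reads "minimum 30 fps" literally, so 29.97 fps footage is reported as an error.

```python
from fractions import Fraction

# Frame sizes listed in the specification above, as (width, height).
ALLOWED_SIZES = {(1280, 720), (1920, 1080), (720, 1280), (1080, 1920)}
MIN_FPS = 30
RECOMMENDED_SECONDS = (5.0, 10.0)  # a recommendation, not a hard limit


def check_clip(meta):
    """Return (errors, warnings) for a clip metadata dict.

    `meta` is a hypothetical structure with the keys: codec, container,
    width, height, fps (a string such as "30" or "30000/1001"), duration.
    """
    errors, warnings = [], []

    if (meta["width"], meta["height"]) not in ALLOWED_SIZES:
        errors.append(f"unsupported resolution {meta['width']}x{meta['height']}")
    if meta["codec"].lower() != "h264":
        errors.append(f"codec must be H.264, got {meta['codec']}")
    if meta["container"].lower() != "mp4":
        errors.append(f"container must be MP4, got {meta['container']}")

    # Fraction parses both "30" and rational strings like "30000/1001".
    fps = Fraction(meta["fps"])
    if fps < MIN_FPS:
        errors.append(f"frame rate {float(fps):.2f} fps is below {MIN_FPS} fps")

    # Length outside 5-10 s is only a warning, since there is no hard limit.
    low, high = RECOMMENDED_SECONDS
    if not low <= meta["duration"] <= high:
        warnings.append(f"duration {meta['duration']:.1f} s is outside {low:.0f}-{high:.0f} s")

    return errors, warnings


clip = {"codec": "h264", "container": "mp4", "width": 1920,
        "height": 1080, "fps": "30", "duration": 8.0}
print(check_clip(clip))
```

The metadata itself would have to come from a media inspection tool; that step is left out of the sketch.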

Content Requirements

Shot Consistency (Critical)

  • Each clip must contain only one continuous shot from a single camera instance

  • Camera can move naturally (pan, tilt, zoom) but must represent uninterrupted recording

  • Do not combine multiple shots in one clip; cuts between shots create "teleporting camera effects"

  • Submit multiple shots of the same scene as separate clips
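
One way to screen a clip for accidental cuts before submitting is to compare consecutive frames and look for abrupt changes. The sketch below is a rough heuristic under stated assumptions, not the method used during review: it assumes each frame has already been reduced to a luminance histogram, and the function name `find_hard_cuts` and the threshold of 0.5 are illustrative choices. Fast pans or sudden lighting changes can trigger false positives.

```python
def find_hard_cuts(histograms, threshold=0.5):
    """Return indices of frames that differ sharply from the previous frame.

    `histograms` is a list of per-frame luminance histograms (equal-length
    lists of pixel counts). The difference between consecutive frames is a
    normalized L1 distance: 0.0 for identical histograms, 1.0 for
    histograms with no overlap.
    """
    cuts = []
    for i in range(1, len(histograms)):
        prev, cur = histograms[i - 1], histograms[i]
        total = sum(prev) + sum(cur)
        if total == 0:
            continue  # two empty histograms carry no information
        distance = sum(abs(a - b) for a, b in zip(prev, cur)) / total
        if distance > threshold:
            cuts.append(i)
    return cuts


# Four toy frames: dark, dark, bright, bright. The jump happens at index 2.
frames = [[100, 0, 0, 0], [90, 10, 0, 0], [0, 0, 10, 90], [0, 0, 20, 80]]
print(find_hard_cuts(frames))
```

A clip that yields any index here is worth re-watching; if it does contain a cut, split it and submit each shot separately.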

Quality Standards

  • Natural color grading with consistent white balance

  • Audio, if present, properly synchronized and free of clipping

  • Clear, focused subject matter without heavy compression

  • Stable lighting with no flickering or exposure jumps

  • Smooth playback without missing frames or jump cuts

Caption Writing Guidelines

Format Requirements

Each caption must be a single natural-language description written as coherent sentences or a short paragraph. Do not split the description into separate attributes or bullet points.

Physical Grounding (Essential)

GOOD - Physically Grounded:

  • "A person cuts metal with a plasma torch, creating bright sparks that fall to the concrete floor"

  • "The woman's purple scarf flows behind her as she pushes through the crowded subway platform"

BAD - Not Physically Grounded:

  • "With tools like this, all materials will be easy to carve according to your needs"

  • "This technique ensures perfect results every time"

Rules for Physical Grounding:

  • Describe only what is visually observable in the video

  • Avoid predictions, capabilities, or abstract concepts

  • Every word should relate to concrete visual elements

  • No claims about effectiveness, ease, or outcomes beyond what's shown
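
A simple keyword screen can catch the most obvious ungrounded wording before a caption is submitted. The sketch below is only an illustration: the pattern list is a hypothetical, non-exhaustive set derived from the BAD examples above, and the function name `flag_ungrounded` is an assumption. It cannot confirm that a caption matches the video, so it complements a careful viewing rather than replacing it.

```python
import re

# Hypothetical, non-exhaustive patterns for predictive or evaluative wording.
UNGROUNDED_PATTERNS = [
    r"\bwill\b",
    r"\bensures?\b",
    r"\bguarantees?\b",
    r"\beas(?:y|ily)\b",
    r"\bperfect results\b",
    r"\bevery time\b",
    r"\baccording to your needs\b",
]


def flag_ungrounded(caption):
    """Return the phrases in `caption` that match an ungrounded pattern."""
    hits = []
    for pattern in UNGROUNDED_PATTERNS:
        for match in re.finditer(pattern, caption, re.IGNORECASE):
            hits.append(match.group(0).lower())
    return hits


good = ("A person cuts metal with a plasma torch, creating bright sparks "
        "that fall to the concrete floor")
bad = "This technique ensures perfect results every time"
print(flag_ungrounded(good))
print(flag_ungrounded(bad))
```

An empty result means only that none of the listed phrases appear; every remaining word still has to correspond to something visible in the clip.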

Caption Structure Examples

“a video of iron being carved with sharp cutting tools, the cuts are perfect and precise in a metal working shop, indoor lighting, extra details about what's in the background of the video, the camera slowly zooms.”

“an anime of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. the camera is stationary. above them a sign says "Ochanomizu Station" in English and Japanese lettering.”

“an iphone video of a woman wearing a purple scarf who is trying to run through a crowd of people. she bumps into several people while making it through. She holds the camera as she walks and the camera follows while facing her.”

“A 3D animation of a girl running away from a kneeling woman in a kimono. a beautiful grassy meadow and sky make up their environment.”

Required Elements in Captions

  1. Video Style/Format: Specify if it's footage, 3D animation, anime, professional cinematography, etc.

  2. Subject Actions: Describe what people/objects are doing using active verbs

  3. Environment Details: Include setting, lighting conditions, background elements

  4. Camera Behavior: Mention if camera moves, remains stationary, or follows subjects

  5. Visual Specifics: Colors, textures, readable text, distinctive objects
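
Two of the required elements, video style and camera behavior, tend to use a small vocabulary and can be checked mechanically. The following sketch assumes hypothetical keyword lists and a hypothetical function name, `missing_elements`; subject actions, environment details, and visual specifics are too open-ended for a keyword check and still need a human read.

```python
import re

# Hypothetical keyword lists; extend them to match your own captions.
STYLE_PATTERN = re.compile(
    r"\b(?:video|footage|animation|anime|cinematography)\b", re.IGNORECASE)
CAMERA_PATTERN = re.compile(
    r"\b(?:camera|pans?|tilts?|zooms?|stationary|handheld)\b", re.IGNORECASE)


def missing_elements(caption):
    """Return the names of the checked elements that the caption lacks."""
    missing = []
    if not STYLE_PATTERN.search(caption):
        missing.append("video style/format")
    if not CAMERA_PATTERN.search(caption):
        missing.append("camera behavior")
    return missing


caption = ("an anime of a woman wearing a purple scarf who is trying to run "
           "through a crowd of people. the camera is stationary.")
print(missing_elements(caption))
```

Word boundaries keep "pan" from matching inside unrelated words such as "company", at the cost of missing inflections that are not listed (for example "panning").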

Caption Variety for Training

Mix detail levels across your submissions:

  • Verbose: Include rich environmental details, specific materials, lighting descriptions

  • Moderate: Focus on main action with key environmental context

  • Simple: Concise descriptions of primary visual elements

This variety helps AI models generalize better across different prompting styles.

Content Diversity Requirements

Submit varied content across:

  • Demographics: Different ages, ethnicities, clothing styles

  • Environments: Indoor/outdoor, urban/rural, different lighting conditions

  • Subjects: People, animals, objects, nature, technology

  • Camera Styles: Handheld, tripod, drone, close-up, wide shots

  • Video Styles: Professional, amateur, animation, different aspect ratios

Evaluation Criteria

Caption Accuracy (≥95% Required)

Your descriptions must precisely match the video content. Mismatched captions result in automatic rejection.

Physical Grounding Check

Every caption element must correspond to something visually present in the video. Abstract concepts or unobservable claims will be flagged.

Content Variety Score

Multiple similar submissions (same person, location, or scenario) reduce data quality and may be rejected.

Technical Compliance

All videos must meet resolution, format, and consistency requirements.

Submission Process

  1. Edit your video to ensure single-shot consistency and technical requirements

  2. Write your caption following physical grounding and natural language guidelines

  3. Review both video and caption for accuracy and compliance

  4. Submit for verification and await approval

Common Rejection Reasons

  • Multiple shots combined in one clip

  • Captions describing things not visible in the video

  • Abstract or predictive language in descriptions

  • Poor video quality or technical non-compliance

  • Repetitive content without sufficient variety

Success Tips

  • Watch your video multiple times before writing the caption

  • Read your caption while watching to ensure perfect alignment

  • Use specific, concrete visual language

  • Vary your submission types to maximize training value

  • Focus on what the AI model needs to "see" to recreate similar content
