
Wan 2.6 API Guide: Text-to-Video, Image-to-Video, and Reference Video for Production

Jessie
COO
December 18, 2025
9 min read
Wan 2.6 is Alibaba Cloud's "Tongyi Wanxiang" video generation lineup aimed at programmable, pipeline-friendly generation—not just one-off demos. Compared to earlier Wan versions, the API story is clearer: multi-shot storytelling, audio support, and three distinct entry points depending on whether you start from text, a first-frame image, or a reference video.

This guide is written for CTOs and engineers shipping generative video into real systems: async orchestration, budget guardrails, reliability patterns, and integration strategy (including a clean EvoLink.ai path at the end).


1. Wan 2.6 Model Family: Choose the Right Endpoint

| Feature | T2V (wan2.6-t2v) | I2V (wan2.6-i2v) | R2V (wan2.6-r2v) |
| --- | --- | --- | --- |
| Use Case | No visual asset yet (ideation, storyboard drafts, synthetic B-roll) | Must anchor the first frame (product shots, character key art, brand consistency) | Need character consistency from a reference clip (appearance + voice timbre) |
| Resolution | 720P / 1080P | 720P / 1080P | 720P / 1080P |
| Duration | 5 / 10 / 15 seconds | 5 / 10 / 15 seconds | 5 / 10 seconds |
| Output Format | 30fps, MP4 (H.264) | 30fps, MP4 (H.264) | 30fps, MP4 (H.264) |
| Audio | Auto voiceover or custom audio file | Auto voiceover or custom audio file | Voice generated via prompt; can reference the input video's voice timbre |
| Multi-shot | Supported | Supported | Supported |
Rule of thumb
  • Start with T2V for concept exploration.
  • Switch to I2V when you have a "source-of-truth" frame you must respect.
  • Use R2V when you need identity continuity across shots/scenes.

2. The Production Workflow: Async Tasks (Not Real-Time)

Wan 2.6 video generation is asynchronous. Expect "create task → poll task result" as the standard integration model, with typical completion time in the minutes range depending on queue load.

Key operational details:

  • You must send the async header: X-DashScope-Async: enable (DashScope HTTP mode).
  • You receive a task_id and poll status until it succeeds/fails.
  • task_id is valid for 24 hours (store it immediately; do not "re-submit" to recover).
Engineering pattern (recommended; a sketch follows this list)
  • Submit task from an API worker
  • Persist task_id + request hash + user/job metadata
  • Poll with exponential backoff (or a scheduler/queue)
  • On success, persist the returned video_url and download/replicate it (URLs are often time-limited by providers)
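
A minimal sketch of that pattern in Python with the requests library. The endpoint paths, status values, and response fields (task_status, video_url) follow the DashScope async pattern described above, but confirm the exact values against the official docs before relying on them:

import os
import time
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]
BASE = "https://dashscope.aliyuncs.com/api/v1"

def submit_t2v(prompt: str) -> str:
    resp = requests.post(
        f"{BASE}/services/aigc/video-generation/video-synthesis",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "X-DashScope-Async": "enable",  # required: async task mode
        },
        json={"model": "wan2.6-t2v", "input": {"prompt": prompt}},
        timeout=30,
    )
    resp.raise_for_status()
    task_id = resp.json()["output"]["task_id"]
    # Persist task_id (plus request hash / job metadata) immediately:
    # it is only valid for 24 hours.
    return task_id

def poll_task(task_id: str, max_wait: float = 900.0) -> str:
    delay, waited = 5.0, 0.0
    while waited < max_wait:
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 60.0)  # exponential backoff, capped at 60s
        resp = requests.get(
            f"{BASE}/tasks/{task_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        output = resp.json()["output"]
        status = output["task_status"]
        if status == "SUCCEEDED":
            return output["video_url"]  # download/replicate: the URL is time-limited
        if status in ("FAILED", "CANCELED"):
            raise RuntimeError(f"task {task_id} ended as {status}")
    raise TimeoutError(f"task {task_id} still running after {max_wait:.0f}s")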

3. Multi-Shot Storytelling: What Actually Changes in Wan 2.6

Alibaba explicitly calls out multi-shot narrative as a Wan 2.6-only capability for both T2V and I2V.

How to enable it (T2V example)

In Wan 2.6 DashScope T2V, multi-shot is enabled by setting shot_type: "multi". The official example pairs it with prompt_extend: true.
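
For illustration, the relevant slice of a request body could look like the following sketch; shot_type and prompt_extend come from the official example, while the input/parameters envelope follows the usual DashScope request shape:

# Illustrative T2V body with multi-shot enabled (a sketch, not the full schema).
payload = {
    "model": "wan2.6-t2v",
    "input": {
        "prompt": (
            "Shot 1: wide shot, a courier cycling through a rainy market street. "
            "Shot 2: close-up of the courier checking an address on a phone. "
            "Shot 3: cut to the courier handing over a package, smiling."
        )
    },
    "parameters": {
        "shot_type": "multi",    # enables multi-shot narrative
        "prompt_extend": True,   # paired with shot_type in the official example
    },
}

Note that the prompt reads like a shot list and keeps the subject ("the courier") consistent across shots, in line with the guidance below.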

Practical prompt guidance for multi-shot:

  • Write your prompt like a short "shot list"
  • Keep the main subject description consistent across shots
  • Specify shot transitions ("cut to", "wide shot", "close-up") only if needed; otherwise let the model auto-segment

How it works in Wan 2.6 R2V (character references)

Wan 2.6 R2V introduces a stricter control mechanism: you reference characters using tokens like character1, character2, etc., and map them to the input reference videos by array order. Each reference video should contain a single role/object identity.
Wan 2.6 R2V demo 1
Wan 2.6 R2V demo 2
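
A sketch of that order mapping, borrowing the payload shape from the EvoLink example later in this guide (URLs are placeholders):

# character1 resolves to video_urls[0], character2 to video_urls[1].
# Each reference clip should contain exactly one role/object identity.
payload = {
    "model": "wan2.6-reference-video",
    "prompt": (
        "character1 hands a package to character2 at a sunlit doorway, "
        "then both wave to camera; multi-shot narrative."
    ),
    "video_urls": [
        "https://your-cdn.example.com/character1_reference.mp4",  # -> character1
        "https://your-cdn.example.com/character2_reference.mp4",  # -> character2
    ],
}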

4. Audio: What You Can Safely Rely On

Wan 2.6 supports audio in different ways depending on the endpoint:

T2V / I2V

  • Audio support includes auto voiceover or passing a custom audio file URL to achieve audio-visual sync.
  • When providing a custom audio file, the platform documents practical constraints (format/size) and that audio may be truncated/left silent if it doesn't match the requested duration.

R2V

  • Audio is generated via prompt, and can reference the input video's voice timbre (useful when you want continuity of voice feel).
What not to promise in your product docs

Unless you have validated it end-to-end, avoid claiming "lip-sync" or "phoneme-accurate mouth matching." The official docs describe audio generation and audio-visual sync, but don't guarantee lip-level alignment.


5. Cost Model: Know Your Per-Second Pricing Up Front

Wan 2.6 is billed primarily by seconds × resolution tier, and pricing differs by region (China Mainland vs. Singapore "International").

T2V pricing (Alibaba Cloud / Bailian)

  • wan2.6-t2v: 0.6 RMB/sec (720P), 1 RMB/sec (1080P)

I2V pricing (first-frame)

  • wan2.6-i2v: 0.6 RMB/sec (720P), 1 RMB/sec (1080P)

Wan 2.6 R2V pricing (reference video)

Wan 2.6 R2V bills both input and output video seconds, and explicitly notes:
  • Failures are not billed
  • Input video billing duration is capped (documented as "not exceeding 5 seconds" for billing)
Pricing: wan2.6-r2v: 0.6 RMB/sec input + 0.6 RMB/sec output (720P); 1 RMB/sec input + 1 RMB/sec output (1080P)
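
As a sanity check, here is a back-of-the-envelope estimator for the R2V math (a sketch; the rates are the published figures above, and the 5-second input cap comes from the billing note):

RATES = {"720P": 0.6, "1080P": 1.0}  # RMB per second

def estimate_r2v_cost(input_seconds: float, output_seconds: float,
                      resolution: str = "720P") -> float:
    rate = RATES[resolution]
    billable_input = min(input_seconds, 5.0)  # input billing capped at 5 seconds
    return rate * billable_input + rate * output_seconds

# Example: a 12s reference clip producing a 10s output at 1080P:
# 5 * 1.0 + 10 * 1.0 = 15.0 RMB
print(estimate_r2v_cost(12, 10, "1080P"))
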
Cost-control defaults (strongly recommended)
  • Dev/test default: 720P + shortest duration your UX allows
  • Add server-side caps: max duration, max resolution, max jobs/user/day
  • Require reference-video validation before R2V submission (format/size/duration) to reduce waste, as in the sketch below
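
A minimal pre-flight check for that validation step, assuming the reference-video limits documented for the EvoLink endpoint in section 7 (mp4/mov, ≤100MB, 2–30s) and using ffprobe as one convenient way to read duration:

import json
import subprocess
from pathlib import Path

MAX_BYTES = 100 * 1024 * 1024          # <=100MB
ALLOWED_SUFFIXES = {".mp4", ".mov"}    # mp4 / mov only

def validate_reference_video(path: str) -> None:
    p = Path(path)
    if p.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{p.name}: format must be mp4 or mov")
    if p.stat().st_size > MAX_BYTES:
        raise ValueError(f"{p.name}: file exceeds 100MB")
    probe = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "json", str(p)],
        capture_output=True, text=True, check=True,
    )
    duration = float(json.loads(probe.stdout)["format"]["duration"])
    if not 2.0 <= duration <= 30.0:
        raise ValueError(f"{p.name}: duration {duration:.1f}s is outside 2-30s")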

Wan 2.6 pricing comparison

6. Wan 2.6 Reliability Friction You'll Actually Hit

Region binding

Beijing and Singapore have independent API keys and request endpoints; mixing them can cause auth failures.

SDK gaps (I2V)

Alibaba's own docs note that wan2.6-i2v is not supported via SDK at the time of writing (HTTP-only workflow).

URLs and assets

Across workflows, you'll be passing media via URLs (HTTP/HTTPS), and you may need an upload step to produce temporary URLs for local files.


7. Using Wan 2.6 Through EvoLink.ai (Unified API + Clean Task Model)

If you don't want your application code tightly coupled to one provider's request/response quirks, EvoLink gives you a single endpoint for Wan 2.6 video generation:
  • POST https://api.evolink.ai/v1/videos/generations
  • Wan 2.6 models (examples):
    • wan2.6-text-to-video
    • wan2.6-reference-video
  • Asynchronous processing with task IDs, and generated video links valid for 24 hours (save promptly).
A text-to-video example:

curl --request POST \
  --url https://api.evolink.ai/v1/videos/generations \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "wan2.6-text-to-video",
    "prompt": "A cinematic multi-shot sequence of a runner crossing a neon-lit city bridge at night, rain reflections, dramatic camera cuts, realistic motion."
  }'

And a reference-video example (character tokens map to video_urls by array order):

curl --request POST \
  --url https://api.evolink.ai/v1/videos/generations \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "wan2.6-reference-video",
    "prompt": "character1 walks into a bright cafe, orders a drink, then turns and smiles to camera; multi-shot narrative.",
    "video_urls": [
      "https://your-cdn.example.com/reference_character.mp4"
    ]
  }'

This endpoint accepts up to 3 reference videos and documents requirements like format (mp4/mov), file size (≤100MB), and duration range (2–30s).


8. Ship Wan 2.6 Faster

If you're building production video features—UGC creation tools, marketing automation, product visualization, or storyline generation—the hard part isn't "can the model generate video?" The hard part is operationalizing it: task orchestration, spend control, and evolving model/provider choices over time.

EvoLink.ai is built for that reality:
  • One API surface for Wan 2.6 (and other video models as you expand your stack)
  • A clean async task pattern you can standardize in your backend
  • A practical path to reduce integration churn when providers update parameters or add new endpoints

9. FAQ (Production Notes)

1) What durations does Wan 2.6 support for each mode?

  • Text-to-Video (wan2.6-t2v): 5 / 10 / 15 seconds
  • Image-to-Video (wan2.6-i2v): 5 / 10 / 15 seconds
  • Reference Video (wan2.6-r2v): 5 / 10 seconds

2) Can I bring my own audio? What are the constraints?

Yes—T2V and I2V accept audio_url. The docs specify:
  • Formats: wav / mp3
  • Duration: 3–30 seconds
  • Size: ≤ 15MB
  • If audio is longer than the requested video duration, it is truncated; if shorter, the remaining video is silent.

3) How do I force silent output (no auto audio)?

Set audio: false. It only applies when you do not pass audio_url; if both are present, audio_url takes priority over audio.
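
Illustrative payloads for the three audio behaviors (the audio and audio_url parameter names are from the docs; their exact nesting in the request body may differ per endpoint):

# 1) Default: auto voiceover (neither audio nor audio_url set).
auto_voiceover = {"model": "wan2.6-t2v", "prompt": "A narrated factory tour."}

# 2) Custom audio: wav/mp3, 3-30s, <=15MB; truncated if longer than the
#    video, silent tail if shorter.
custom_audio = {
    "model": "wan2.6-t2v",
    "prompt": "A barista narrates latte art while the camera orbits the cup.",
    "audio_url": "https://your-cdn.example.com/voiceover.mp3",
}

# 3) Forced silence: only honored when audio_url is absent
#    (audio_url takes priority over audio).
forced_silent = {
    "model": "wan2.6-t2v",
    "prompt": "Silent macro shot of coffee beans falling in slow motion.",
    "audio": False,
}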

4) What are safe prompt length limits?

Alibaba Cloud's T2V API documents a prompt limit of 1500 characters for wan2.6-t2v and a negative_prompt limit of 500 characters. EvoLink's Wan 2.6 T2V endpoint also documents prompt limited to 1500 characters.
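
A trivial client-side guard for those limits (a sketch):

# Reject over-length prompts client-side instead of burning a failed request.
MAX_PROMPT = 1500          # documented wan2.6-t2v prompt limit
MAX_NEGATIVE_PROMPT = 500  # documented negative_prompt limit

def check_prompt_lengths(prompt: str, negative_prompt: str = "") -> None:
    if len(prompt) > MAX_PROMPT:
        raise ValueError(f"prompt is {len(prompt)} chars (max {MAX_PROMPT})")
    if len(negative_prompt) > MAX_NEGATIVE_PROMPT:
        raise ValueError(
            f"negative_prompt is {len(negative_prompt)} chars "
            f"(max {MAX_NEGATIVE_PROMPT})"
        )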
