Tempolor 3.5 — Research

Diffusion Architecture · 2025.1

Tempolor 3.5 is built on a diffusion model and continuous music-audio representations, and it generates high-fidelity 44.1kHz stereo music. With an inference real-time factor (RTF) below 0.1, it generates a full song in under a tenth of the song's playing time, an industry-leading speed.

44.1kHz Stereo · Diffusion Architecture · ControlNet Melody Control · Inpaint / Repaint · RTF < 0.1

Overview

Tempolor 3.5 is built on a diffusion model and continuous music-audio representations, and it generates high-quality 44.1kHz stereo music. Technically, this version moves from a paradigm that emphasizes semantic-skeleton modeling to a high-fidelity paradigm centered on continuous acoustic representations. The shift markedly improves sound texture, spatial layering, dynamics, and the overall naturalness and completeness of the listening experience.

Controllable-generation capabilities such as ControlNet and Inpaint / Repaint make Tempolor 3.5 more editable and controllable in scenarios such as melody control, lyric editing and local repainting, laying a technical foundation for finer-grained music creation and interactive editing.

Inference is efficient: the real-time factor (RTF) is below 0.1, so full-song generation runs at an industry-leading speed.

Performance

Tempolor 3.5 has a clear advantage in fine listening detail and atmospheric expression. The model renders reverb tails, dynamic swells, textural layering and spatial depth more naturally, which makes results more immersive in emotion-driven content such as lyrical, ethereal, suspenseful and epic pieces.

Compared with the previous two generations, 3.5 aims not only at "writing it right" but also at "sounding good," which makes it better suited to film scoring, atmospheric music and brand mood music, scenarios that place higher demands on listening completeness.

Vocals are a particular strength: the model performs well on singing technique, timbre, fidelity and lyric clarity.

* Data as of May 2025

Real-Time Factor (RTF), lower is faster

Model              RTF
Yue                (no value shown)
Udio V1.5          1.48
Suno v4            0.84
Mureka v5.5        0.27
DiffRhythm v1.0    0.1
AceStep v1.0       0.063
Tempolor V3.0      0.02

Inference time for 120s of audio, in seconds

Model              Time (s)
Yue                1200
Udio V1.5          177
Suno v4            100
Mureka v5.5        (no value shown)
DiffRhythm v1.0    (no value shown)
AceStep v1.0       3.84
Tempolor V3.0      2.5

Tempolor V3.0 Speed

Industry-leading commercial music-generation model

Tempolor V3.0 RTF 0.02

Generates 2 minutes of music in 2.5 seconds

* Based on NVIDIA RTX 4090
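The RTF and timing figures above are linked by the usual definition of real-time factor: generation time divided by the duration of the generated audio. The page does not spell this formula out, so the following Python sketch treats it as an assumption. The function names are illustrative and are not part of any Tempolor API.

```python
def real_time_factor(inference_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio generated.

    Assumed (standard) definition; values below 1 mean faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return inference_seconds / audio_seconds


def inference_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: seconds needed to generate audio_seconds of audio."""
    return rtf * audio_seconds


# Figures quoted above: 120 s of audio generated in 2.5 s.
tempolor_rtf = real_time_factor(2.5, 120.0)
print(f"RTF = {tempolor_rtf:.3f}")  # 0.021, which rounds to the quoted 0.02

# Upper bound implied by an RTF of 0.1 for a 120 s track.
print(f"{inference_time(0.1, 120.0):.1f} s")  # 12.0 s
```

Under this definition, the "RTF < 0.1" claim means a two-minute track should take less than 12 seconds to generate on the same hardware.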

Demo

15th National Games AI Theme Song · Official Theme Song

Cinematic · Anthem

36Kr WISE AI Theme Song · Conference Theme Song

Cinematic · Anthem

Boonie Bears · Trailer

Cinematic · Trailer

The Spinning Washing Machine · Digital-Human MV

Electronic · MV

Wind and Moon of My Hometown

Chinese Folk · Ballad

街角のソノリティ

J-Pop · City Pop

Wind Over the Hills

Folk · Acoustic

Hearing the Galaxy

Cinematic · Ambient