Meta has released AudioCraft, a new set of AI tools that the tech giant claims can generate “high-quality, realistic audio and music from text,” for example producing a music sequence from the text prompt “electronic Jamaican reggae DJ set.”
“Imagine a professional musician being able to explore new compositions without having to play a single note on an instrument,” Meta says in a blog post about AudioCraft. “Or a small business owner adding a soundtrack to their latest video ad on Instagram with ease.”
AudioCraft consists of three models: MusicGen (for music), AudioGen (for sound effects) and EnCodec (a neural audio codec whose decoder turns the other models’ token output back into waveforms). MusicGen was trained on roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed specifically for this purpose, according to the tech giant. “Music tracks are more complex than environmental sounds, and generating coherent samples on the long-term structure is especially important when creating novel musical pieces,” the company says.
“With even more controls, we think MusicGen can turn into a new type of instrument — just like synthesizers when they first appeared,” the company said in the blog post.
Meta shared a clip of what music generated by MusicGen sounds like. In addition to the reggae riff, the examples include “Movie scene in a desert with percussion,” “’80s electronic with drum beats,” “Jazz instrumental, medium tempo, spirited piano” and “Mellow hip-hop, vinyl scratching, deep bass.”
Meanwhile, Meta said that AudioGen was trained on “public sound effects” and can generate environmental sounds and sound effects like a dog barking, cars honking or footsteps on a wooden floor. The company also released what it said is an improved version of the EnCodec decoder, “which allows higher-quality music generation with fewer artifacts.”
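Those sound-effect categories map directly onto AudioGen’s text conditioning. As a rough sketch of what driving the model looks like, assuming the audiocraft Python package and the pretrained checkpoint name documented in Meta’s facebookresearch/audiocraft repository:

```python
# A minimal sketch of text-to-sound-effect generation with AudioGen,
# assuming the audiocraft package and the 'facebook/audiogen-medium'
# checkpoint documented in the facebookresearch/audiocraft repository.
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # seconds of audio per sample

# One generated waveform per text description in the batch.
descriptions = ['dog barking', 'cars honking', 'footsteps on a wooden floor']
wavs = model.generate(descriptions)

# Save each sample as a WAV file with loudness normalization.
for idx, wav in enumerate(wavs):
    audio_write(f'sfx_{idx}', wav.cpu(), model.sample_rate, strategy='loudness')
```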
The company is releasing the AudioCraft models as open-source code, explaining that the goal is to give “researchers and practitioners access so they can train their own models with their own datasets for the first time, and help advance the field of AI-generated audio and music.”
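For practitioners, that access looks roughly like the following. This is a minimal sketch, assuming the audiocraft package and the 'facebook/musicgen-small' checkpoint from the same repository; the prompt is one of the examples Meta shared:

```python
# A minimal sketch of text-to-music generation with MusicGen, assuming
# the audiocraft package and the 'facebook/musicgen-small' checkpoint
# from the facebookresearch/audiocraft repository.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # seconds of music to generate

# Condition generation on a text prompt from Meta's examples.
wavs = model.generate(['electronic Jamaican reggae DJ set'])

# Write the result to reggae_sample.wav with loudness normalization.
audio_write('reggae_sample', wavs[0].cpu(), model.sample_rate, strategy='loudness')
```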
Meta acknowledged that the datasets used to train the AudioCraft models lack diversity; in particular, the music dataset used “contains a larger portion of Western-style music” and is limited to audio-text pairs with text and metadata written in English. “By sharing the code for AudioCraft, we hope other researchers can more easily test new approaches to limit or eliminate potential bias in and misuse of generative models,” the company said.