Twelve Labs secures $12 million for AI that understands video context
For Jae Lee, a data scientist by training, it never made sense that video, which has become a huge part of our lives with the rise of platforms such as TikTok, Vimeo and YouTube, was so difficult to search because of the technical barriers to understanding context. Searching through the titles, descriptions and tags of videos has always been easy enough, requiring nothing more than a basic algorithm. But searching inside videos for specific moments and scenes has long been beyond the capabilities of technology, especially when those moments and scenes aren't labeled in an obvious way.
To solve this problem, Lee and friends from the tech industry built a cloud service for searching and understanding videos. It became Twelve Labs, which has gone on to raise $17 million in venture capital, $12 million of which came from a seed extension round that closed today. Radical Ventures led the extension with participation from Index Ventures, WndrCo, Spring Ventures, Weights & Biases CEO Lukas Biewald and others, Lee told australiabusinessblog.com in an email.
“Twelve Labs’ vision is to help developers build programs that can see, listen and understand the world the way we do, by providing them with the most powerful video understanding infrastructure,” said Lee.

A demo of the capabilities of the Twelve Labs platform. Image Credits: Twelve Labs
Twelve Labs, which is currently in closed beta, is using AI to try to extract "rich information" from videos, such as movement and actions, objects and people, sound, on-screen text, and speech, and to identify the relationships between them. The platform converts these different elements into mathematical representations called "vectors" and forms "temporal connections" between frames, enabling applications such as video scene search.
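The mechanics resemble standard embedding-based retrieval. As a rough illustration only, and not Twelve Labs' actual implementation, the sketch below shows how video segments represented as vectors could be ranked against a query embedding using cosine similarity; the segment IDs, vector dimensions and random data are placeholder assumptions.

```python
import numpy as np

# Hypothetical data: each row is the embedding vector for one indexed video segment.
# In practice these vectors would come from a multimodal model; here they are random.
rng = np.random.default_rng(0)
segment_vectors = rng.random((1000, 512))             # 1,000 segments, 512-dim vectors
segment_ids = [f"video_042/segment_{i}" for i in range(1000)]

def search(query_vector, top_k=5):
    """Rank indexed segments by cosine similarity to a query embedding."""
    norms = np.linalg.norm(segment_vectors, axis=1) * np.linalg.norm(query_vector)
    scores = segment_vectors @ query_vector / norms
    best = np.argsort(scores)[::-1][:top_k]
    return [(segment_ids[i], float(scores[i])) for i in best]

# A text query such as "person chopping vegetables" would first be embedded into the
# same vector space by the model, then passed to search() to find matching moments.
print(search(rng.random(512)))
```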
"As part of realizing the company's vision to help developers create intelligent video applications, the Twelve Labs team is building 'foundation models' for multimodal video understanding," said Lee. "Developers can access these models through a suite of APIs, which can be used not only for semantic search but also for other tasks such as 'chapterizing' long videos, generating summaries, and answering video questions."
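The API itself is still in closed beta and isn't detailed in this article, but a semantic search request might look something like the hypothetical sketch below; the endpoint URL, authentication header, field names and search options are all illustrative assumptions, not the actual interface.

```python
import requests

API_URL = "https://api.example.com/v1"   # placeholder; not the actual Twelve Labs endpoint
headers = {"x-api-key": "YOUR_API_KEY"}  # assumed auth scheme

# Hypothetical semantic search: find moments matching a natural-language query across
# the visual, audio and on-screen-text modalities of a previously indexed video library.
response = requests.post(
    f"{API_URL}/search",
    headers=headers,
    json={
        "index_id": "my-video-index",                       # assumed index identifier
        "query": "goal celebration in the rain",
        "search_options": ["visual", "conversation", "text_in_video"],
    },
    timeout=30,
)

for hit in response.json().get("data", []):
    print(hit["video_id"], hit["start"], hit["end"], hit["score"])
```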
Google takes a similar approach to understanding videos with its MUM AI system, which the company uses to drive video recommendations on Google Search and YouTube by selecting topics in videos (for example, "acrylic materials") based on the audio, text and visual content. But while the technology is similar, Twelve Labs is one of the first vendors to bring it to market; Google has chosen to keep MUM internal, declining to make it available through a public API.
That said, Google, Microsoft and Amazon all offer services (Google Cloud Video AI, Azure Video Indexer and Amazon Rekognition, respectively) that recognize objects, places and actions in videos and extract rich, frame-level metadata. There's also Reminiz, a French computer vision startup that claims to be able to index any type of video and add tags to both recorded and live-streamed content. But Lee argues that Twelve Labs is sufficiently differentiated, in part because its platform allows customers to tailor the AI to specific categories of video content.

Mockup of an API for refining the model to work better with salad-related content. Image Credits: Twelve Labs
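The mockup hints at what that domain tailoring could look like in practice. Purely as a sketch, with the endpoint, parameter names and workflow being assumptions rather than the product's actual API, tuning the model toward a narrow category might resemble:

```python
import requests

API_URL = "https://api.example.com/v1"   # placeholder; not the actual Twelve Labs endpoint
headers = {"x-api-key": "YOUR_API_KEY"}  # assumed auth scheme

# Hypothetical domain-tuning call: nudge the base model toward a narrow category
# (here, salad/cooking content) with example clips and domain-specific labels.
response = requests.post(
    f"{API_URL}/models/tune",
    headers=headers,
    json={
        "base_model": "video-understanding-base",           # assumed base model name
        "example_clips": ["prep_001.mp4", "prep_002.mp4"],  # assumed clip references
        "labels": ["chopping lettuce", "mixing dressing", "plating a salad"],
    },
    timeout=30,
)
print(response.json())
```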
"What we found is that narrow AI products built to detect specific problems show high accuracy in their ideal scenarios in a controlled environment, but don't scale well to messy real-world data," said Lee. "They operate more like a rule-based system and therefore lack the ability to generalize when deviations occur. We also see this as a limitation arising from a lack of understanding of context. Understanding context is what gives people the unique ability to make generalizations across seemingly different real-world situations, and this is where Twelve Labs comes into its own."
In addition to search, Lee says Twelve Labs’ technology can boost things like ad insertion and content moderation, for example, intelligently figuring out which videos with knives are violent versus instructional. It can also be used for media analysis and real-time feedback, he says, as well as automatically generating highlights from videos.
Just over a year after its inception (March 2021), Twelve Labs has paying customers — Lee wouldn’t reveal exactly how many — and a multi-year contract with Oracle to train AI models using Oracle’s cloud infrastructure. Looking ahead, the startup plans to invest in building out its technology and growing its team. (Lee declined to disclose the current size of Twelve Labs’ workforce, but LinkedIn data shows it’s about 18 people.)
"Despite the tremendous value that can be achieved with large models, most companies don't want to train, operate and maintain these models themselves. With the Twelve Labs platform, any organization can leverage powerful video understanding capabilities with just a few intuitive API calls," said Lee. "The future direction of AI innovation is heading straight toward multimodal video understanding, and Twelve Labs is well positioned to push the boundaries even further in 2023."