Sunday Rundown #78: Audio-Video & Wonder Lizard
Sunday Bonus #38: My speech-to-image tool that turns scene descriptions into pictures
Happy Sunday, friends!
Welcome back to the weekly look at generative AI that covers the following:
Sunday Rundown (free): this week’s AI news + a fun AI fail.
Sunday Bonus (paid): a goodie for my paid subscribers.
Let’s get to it.
🗞️ AI news
Here are this week’s AI developments.
👩‍💻 AI releases
New stuff you can try right now:
Alibaba released a reasoning model called QwQ-32B-Preview to rival OpenAI’s o1-preview, just one week after DeepSeek did the same with R1 Lite. (Try the demo here.)
Anthropic has been busy again:
It introduced the Model Context Protocol (MCP) - “a universal, open standard for connecting AI systems with data sources.”
You can now create custom “Styles” in Claude by uploading samples of your writing that it can mimic (you can also pick from a few basic style presets).
ElevenLabs launched GenFM, a text-to-podcast tool similar to “Audio Overviews” in NotebookLM, but it lets you select different voices and languages.
H Company launched Runner H, an AI agent that the company claims outperforms competing agents on real-world tasks while handling a broader range of them.
Hume connected its voice interface with Anthropic’s “Computer Use,” letting you control your computer using spoken instructions.
Lightricks open-sourced its LTXV video model capable of fast, high-quality video generation. (Try the Hugging Face demo.)
Luma expanded its Dream Machine video model into a full-fledged “creative platform” with new features and an iOS app. (Try it here.)
Stability AI has enabled ControlNet tools for its latest Stable Diffusion 3.5 Large model. (Here’s more about ControlNet.)
🔬 AI research
Cool stuff you might get to try one day:
Amazon is reportedly working on an AI model code-named “Olympus” that can understand complex scenes in images or videos.
NVIDIA showcased Fugatto, an impressive sound model that accepts text and audio inputs and can create any combination of sounds, music, and voices.
Runway is gradually rolling out its text-to-image tool “Frames,” which gives creators precise control over style and visual direction.
📖 AI resources
Helpful AI tools and stuff that teaches you about AI:
“7 examples of Gemini’s multimodal capabilities” - real-world cases compiled by Google.
“GenChess” [tool] - a fun Google Labs space that lets you create new virtual chess sets based on any object or theme and then play a game with them.
🔀 AI random
Other notable AI stories of the week:
Early testers briefly leaked access to OpenAI’s much-awaited Sora video model as a protest against what they saw as art-washing and unfair treatment by the company.
🤦‍♂️ AI fail of the week
I mean, I did ask for a “caricature,” but this is very much not it.
💰 Sunday Bonus #38: Turn a vague, spoken scene description into an image
I love messing around with AI image tools.
In fact, that’s what got me to start this newsletter in the first place.
I’m also a huge proponent of less-is-more image prompting, as seen here:
But many people are still hesitant to try prompting image models.
Maybe they only have a vague idea of what they want. Or they’re not sure how to put that idea into words and which terms to use. Or they can’t find a way to condense it into a short, precise image prompt.
So I went ahead and built a free-to-run tool that works like this (a rough code sketch of the pipeline follows the list):
You turn on your mic and ramble on about the scene you’re thinking of. (Don’t worry about repeating yourself, being too wordy, vague, etc.)
The tool converts that audio input into a clean, precise image prompt.
It then turns that short prompt into a widescreen (16:9) image using the latest and greatest FLUX 1.1 Pro [Ultra] model.
You can also upload a pre-recorded scene description instead of recording it directly.
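For the technically curious, here’s a minimal sketch of what a pipeline like this can look like. To be clear, this is not the tool’s actual code: the newsletter only confirms that FLUX 1.1 Pro [Ultra] generates the final 16:9 image. I’m assuming Whisper for transcription, a small GPT model to condense the rambling transcript into a tight prompt, and a Replicate call for the image step; the model names and parameters are illustrative.

```python
# Rough sketch of a speech-to-image pipeline (assumed services, not the tool's actual code).
from openai import OpenAI
import replicate

client = OpenAI()  # expects OPENAI_API_KEY in the environment; Replicate reads REPLICATE_API_TOKEN


def scene_audio_to_image(audio_path: str):
    # 1. Transcribe the spoken (or uploaded) scene description.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Condense the wordy, repetitive transcript into one short, precise image prompt.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[
            {
                "role": "system",
                "content": "Rewrite the user's scene description as one concise, "
                           "vivid image-generation prompt. Return only the prompt.",
            },
            {"role": "user", "content": transcript.text},
        ],
    )
    prompt = completion.choices[0].message.content.strip()

    # 3. Generate a widescreen (16:9) image with FLUX 1.1 Pro [Ultra] via Replicate.
    output = replicate.run(
        "black-forest-labs/flux-1.1-pro-ultra",
        input={"prompt": prompt, "aspect_ratio": "16:9"},
    )
    # Depending on the Replicate client version, this is a URL string or a file-like output object.
    return output


if __name__ == "__main__":
    print(scene_audio_to_image("scene_description.wav"))
```

The actual tool wraps the same three steps behind a simple interface, so you never have to touch a prompt box yourself.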
Check it out: