Multimodal AI in 2025

It's more than just text these days...


Since the release of ChatGPT, the capabilities of Large Language Models have evolved rapidly. What started with text-only prompts now includes audio, video, and even computer interfaces.

Multimodal input is commonplace among LLM providers now, but I'm particularly impressed with what Google is doing. Check out https://aistudio.google.com for some great examples of how this works.

Using the “Stream Realtime” function, you can share your screen and ask questions about the site you are using. Think about how this could help users learn new applications and reduce support time. Give it a try with your favorite SaaS platform!

Or you can share your camera with the site and ask questions about what it is seeing in real time.
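Under the hood, these multimodal requests pair plain text with binary media in a single payload. As a rough illustration, here is a minimal sketch of assembling a question plus a camera frame in the part-based JSON shape that Gemini's `generateContent` REST endpoint accepts; the fake image bytes and the helper function are my own illustrative assumptions, not code from AI Studio.

```python
import base64
import json


def build_multimodal_request(question: str, image_bytes: bytes,
                             mime_type: str = "image/jpeg") -> dict:
    """Assemble a generateContent-style payload that pairs a text question
    with an inline image. Binary data is base64-encoded for the JSON body."""
    return {
        "contents": [{
            "parts": [
                {"text": question},
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ]
        }]
    }


# Pretend these bytes came from a camera frame or a screenshot.
frame = b"\xff\xd8\xff\xe0fake-jpeg-bytes"
payload = build_multimodal_request(
    "What plant is this, and how do I care for it?", frame)
print(json.dumps(payload)[:60])
```

The same part list extends naturally: add more `{"text": ...}` or `{"inline_data": ...}` entries to mix several images, screenshots, and questions in one turn.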

The camera functionality is particularly groundbreaking for real-world problem solving. Imagine pointing your phone at a complex piece of machinery and getting step-by-step maintenance instructions, or showing your garden to get instant plant identification and care tips. This isn't just about convenience; it is changing how we interact with our environment.

What's even more fascinating is how these multimodal models maintain context across different types of input. They can follow a conversation that weaves between showing objects to the camera, sharing screens, and asking questions aloud. For instance, you could share your screen with a spreadsheet, ask about specific calculations, then point your camera at some physical receipts and have the AI help you reconcile everything seamlessly.

The implications for education and training are just as impactful. Traditional video tutorials and manuals are static, but multimodal AI adapts to your specific situation. A student struggling with geometry can show their homework through the camera, get personalized explanations, then share their screen to see interactive visualizations. An employee learning new software can get contextual help exactly when and where they need it.

Ultimately, this technology democratizes expertise. Not everyone has access to personal tutors, technical experts, or specialized consultants. Multimodal AI brings that knowledge to anyone with a smartphone or computer. Whether you're a small business owner trying to optimize your website, a DIY enthusiast tackling home repairs, or a student learning at your own pace, this technology acts as your personal expert companion.

As these systems continue to evolve, we'll likely see even deeper integration with specialized knowledge domains. The possibilities are endless, and we're just scratching the surface of what multimodal AI can achieve.

What kinds of multimodal AI applications would your business benefit from?
