Google has unveiled its latest AI breakthrough: a technology that can help locate misplaced items such as glasses. Demonstrating its prowess in “multimodal” understanding, Google showcased an AI that can interpret images, video, sound, and spoken language, answering questions about what a phone camera sees. The unveiling follows closely on the heels of OpenAI’s GPT-4o launch, which wowed audiences with its capacity to read human expressions through a phone camera and hold fluent spoken conversation.
In a competitive display reminiscent of an “anything you can do, I can do better” dynamic, Google highlighted its AI’s capacity to match, if not surpass, its rivals. The company had teased the potential of its on-device systems just before OpenAI’s announcement.
At Google I/O, the company’s annual developer conference, Google introduced Gemini Nano, an AI assistant built into its Pixel phones, and the Gemini App, both featuring multimodal capabilities. It also previewed a scam-alert feature for Gemini Nano that can detect and warn users of potential scams during phone calls without sending any call data off the device.
Sir Demis Hassabis, head of Google DeepMind, reiterated Google’s longstanding commitment to multimodal AI, emphasizing its models’ native ability to process images, video, and sound together. Project Astra, showcased during the event, exemplified this capability by answering spoken questions about what a phone camera was seeing. In one demonstration, the virtual assistant accurately located a pair of glasses based on what the camera had captured.
Google also announced AI enhancements across its products, including AI-generated summaries in search results, AI-powered search in Google Photos, and new systems for generating images, video, and music. Upcoming features such as email summarization in Gmail and a prototype virtual “team-mate” able to multitask in online meetings hinted at the company’s broader AI ambitions.