VisionClaw: Turning Ray-Ban Meta Glasses into an Autonomous Super-Agent
How a developer’s open-source experiment turned Meta’s Ray-Ban glasses into an always-on, multimodal AI agent, and what it reveals about the future of human-computer interaction.
The Prototype
In early 2026, developer Xiaoan Sean Liu released a project that demonstrates where personal technology is heading. VisionClaw is an open-source integration that combines three existing technologies (Meta Ray-Ban smart glasses, Google’s Gemini Live API, and the OpenClaw agent framework) into a wearable system that sees what you see, hears what you hear, and takes actions on your behalf.
The implementation is intentionally simple. A user puts on Ray-Ban Meta glasses, taps the temple, and speaks naturally. The glasses capture video alongside continuous audio, streaming both to Google’s Gemini Live API via WebSocket. Gemini processes this multimodal input, simultaneously analysing visual scenes and understanding spoken language, then responds with natural speech and, crucially, triggers function calls that execute real-world tasks through OpenClaw.
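The client side of this flow can be sketched as message assembly: each tick, the bridge packs a compressed video frame and a slice of microphone audio into one WebSocket message. The message shape below is an illustrative assumption, not Gemini's actual wire format:

```python
import base64
import json

def build_realtime_chunk(jpeg_frame: bytes, pcm_audio: bytes) -> str:
    """Pack one video frame and one audio slice into a single
    WebSocket text message (illustrative format, not Gemini's)."""
    payload = {
        "realtime_input": [
            {"mime_type": "image/jpeg",
             "data": base64.b64encode(jpeg_frame).decode("ascii")},
            {"mime_type": "audio/pcm;rate=16000",
             "data": base64.b64encode(pcm_audio).decode("ascii")},
        ]
    }
    return json.dumps(payload)

msg = build_realtime_chunk(b"\xff\xd8fake-jpeg", b"\x00\x01fake-pcm")
print(json.loads(msg)["realtime_input"][0]["mime_type"])  # image/jpeg
```

In the real system a loop would emit such chunks continuously while a second coroutine reads model responses off the same socket.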
The result is an AI assistant that operates without screens or keyboards. Ask “What am I looking at?” while facing a café, and the system describes the scene. Say “Add milk to my shopping list,” and OpenClaw routes the request to your configured task manager. The interaction happens hands-free, through eyewear indistinguishable from ordinary glasses.
Technical Architecture
VisionClaw’s architecture consists of four layers working in concert:
Sensory Input (Meta Ray-Ban Glasses)
The glasses capture video and continuous audio. Capture is deliberately bandwidth-conscious, prioritising battery efficiency over fluid video. For static scene analysis (reading menus, identifying products, recognising landmarks), this proves sufficient.
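One simple way to realise that bandwidth/battery trade-off is to cap the outgoing frame rate and drop everything in between. This throttler is a generic sketch, not VisionClaw's actual code:

```python
class FrameThrottler:
    """Forward at most `max_fps` frames per second to the uplink,
    dropping the rest (sketch of a bandwidth cap)."""

    def __init__(self, max_fps: float):
        self.min_interval = 1.0 / max_fps
        self.last_sent = float("-inf")

    def should_send(self, now: float) -> bool:
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False

throttle = FrameThrottler(max_fps=1.0)  # one frame per second
timestamps = [0.0, 0.3, 0.7, 1.0, 1.2, 2.5]
sent = [t for t in timestamps if throttle.should_send(t)]
print(sent)  # [0.0, 1.0, 2.5]
```

Audio, by contrast, would stream continuously: speech is cheap in bandwidth terms and carries the conversational thread.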
Transport Layer (iOS/Android Bridge)
The VisionClaw app, built using Meta’s Wearables Device Access Toolkit, compresses video and establishes a WebSocket connection to Google’s servers. This runs on the user’s smartphone, which handles network connectivity and API authentication.
Cognitive Processing (Gemini Live API)
Google’s Gemini Live API processes multimodal input through a stateful, secure WebSocket (WSS) connection. Unlike traditional voice assistants that convert speech to text before processing, Gemini Live uses native audio streaming, preserving prosody and enabling natural interruption.
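Because the session is stateful and the user can barge in mid-utterance, the client must react to server events rather than simply play audio as it arrives. The event names below are illustrative assumptions, not Gemini's actual event schema:

```python
from collections import deque

playback_buffer: deque = deque()

def handle_server_event(event: dict) -> None:
    """React to events from a live session (hypothetical event names)."""
    if event.get("type") == "audio":
        playback_buffer.append(event["data"])  # queue model speech for playback
    elif event.get("type") == "interrupted":
        playback_buffer.clear()                # user barged in: stop talking

handle_server_event({"type": "audio", "data": b"chunk-1"})
handle_server_event({"type": "audio", "data": b"chunk-2"})
handle_server_event({"type": "interrupted"})
print(len(playback_buffer))  # 0
```

This is what "natural interruption" means operationally: the client discards queued speech the moment the server signals that the user has started talking over the model.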
Action Execution (OpenClaw)
Created by Peter Steinberger (founder of PSPDFKit), OpenClaw serves as the execution layer. When Gemini determines an action is required (sending a message, searching the web, controlling smart home devices), it calls OpenClaw’s gateway, which interfaces with community-contributed skills spanning messaging (WhatsApp, Telegram, Signal, iMessage), calendar management, web search, and system automation.
This separation of concerns is deliberate. Gemini handles perception and reasoning; OpenClaw handles agency. Without OpenClaw, the system would be a conversational oracle. With it, the system becomes a capable assistant that modifies your digital environment.
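This division of labour can be sketched as a thin dispatcher: the model emits a named function call, and the bridge routes it to a local skill handler. The skill names and argument shapes below are hypothetical, not OpenClaw's actual API:

```python
def add_to_shopping_list(item: str) -> str:
    # Placeholder for a real task-manager integration.
    return f"added '{item}' to shopping list"

def send_message(channel: str, to: str, text: str) -> str:
    # Placeholder for a real messaging integration (WhatsApp, Signal, ...).
    return f"sent via {channel} to {to}: {text}"

# Registry mapping tool-call names emitted by the model to local skills.
SKILLS = {
    "add_to_shopping_list": add_to_shopping_list,
    "send_message": send_message,
}

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted function call to the matching skill."""
    handler = SKILLS.get(tool_call["name"])
    if handler is None:
        return f"unknown skill: {tool_call['name']}"
    return handler(**tool_call["args"])

result = dispatch({"name": "add_to_shopping_list", "args": {"item": "milk"}})
print(result)  # added 'milk' to shopping list
```

Keeping the registry on the execution side is what makes the system extensible: adding a community skill means adding an entry, not retraining or reconfiguring the model.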
Capabilities and Limitations
What it does well:
- Scene understanding: Identifies objects, reads text, recognises landmarks, and describes environments in real-time
- Task execution: Adds items to shopping lists, sends messages, schedules calendar events, controls smart home devices
- Information retrieval: Performs web searches and synthesises answers based on visual context
- Accessibility: Includes “iPhone mode” allowing users to test the full pipeline using their phone’s rear camera before purchasing glasses
Security and Governance
VisionClaw and OpenClaw exist in a regulatory grey area with significant security concerns.
Security audits have identified vulnerabilities in OpenClaw. Peter Steinberger, OpenClaw’s creator, acknowledged these risks, describing the project as experimental. He has since implemented security improvements, but emphasises that the project is “not enterprise-ready.”
For VisionClaw specifically, the risks compound: you are streaming visual data to Google’s servers, audio to Gemini, and action commands to a locally-hosted agent with broad system access.
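A common mitigation for an agent with broad system access is to gate the execution layer behind an explicit allowlist, so the model can only trigger actions the user has pre-approved. This is a generic sketch, not a feature of OpenClaw:

```python
# Actions the user has explicitly approved (hypothetical names).
ALLOWED_ACTIONS = {"web_search", "add_to_shopping_list"}

def guarded_dispatch(action: str, execute) -> str:
    """Refuse any model-requested action not on the user's allowlist."""
    if action not in ALLOWED_ACTIONS:
        return f"blocked: '{action}' is not on the allowlist"
    return execute()

print(guarded_dispatch("web_search", lambda: "searching..."))
print(guarded_dispatch("delete_files", lambda: "deleting..."))  # blocked
```

An allowlist does not solve prompt injection or data-exfiltration risks, but it narrows the blast radius when a model is tricked into requesting a destructive action.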
Market Context: Beyond the Prototype
VisionClaw arrives as smart glasses transition from niche gadget to mainstream consumer product. Meta has sold significant numbers of Ray-Ban Meta units, making them a notable entry in the AI glasses market.
This adoption curve suggests that ambient interfaces like VisionClaw may follow the smartphone trajectory: initially dismissed as unnecessary, then rapidly normalised. The technology is not the barrier; social acceptance and security governance are.
Competitors are converging on similar architectures. Google’s Android XR platform offers AI-powered navigation and translation through glasses prototypes. The hardware is commoditising; the differentiator is software integration.
The Significance
VisionClaw matters not as a product (it is explicitly a “community experiment” by Sean Liu, unaffiliated with Meta, Google, or OpenClaw), but as a proof of concept. It demonstrates that the components for world-aware AI are already available, combinable by individual developers, and deployable on consumer hardware costing under $400 (glasses) plus API fees.
The project illustrates three converging trends:
- Multimodal as default: AI is moving beyond text to process vision, audio, and action simultaneously
- Ambient intelligence: Technology that waits in the background, available without pulling focus from the physical world
- Agentic systems: AI that doesn’t just respond but executes, maintaining state across sessions and platforms
The architecture (wearable input, cloud cognition, local execution) represents a plausible future for personal technology.
In mid-February 2026, the trajectory of agentic AI shifted when OpenAI hired Peter Steinberger following the viral success of OpenClaw. The move underscores an industry turn toward agentic interfaces and suggests that the approach demonstrated by VisionClaw is shaping how humans and computers will interact. While the OpenClaw codebase remains open source, its creator’s move to a major lab signals that wearable assistants are graduating from independent experiments into the mainstream of digital life.