Towards interaction with real-world objects
Xiaoan Liu
Advisor: Sharleen Smith
Semantic Reality turns physical objects into conversational partners, layering AI-generated tips, relationships and fine-detail call-outs directly onto whatever you look at, point to or grab.

Project Description
Semantic Reality is an AR-plus-AI toolkit that lets everyday objects “talk back.”
Put on a Vision Pro headset, look at the things around you, and the system instantly recognises each item, surfaces practical questions you might ask (“How do I pair this speaker?”), highlights the ports you point at, and even draws animated lines that explain how two devices connect. Because the system tracks your hands, gaze and speech together, the experience feels more like a conversation with the physical world than a menu-driven app.

In our pilot study, people completed assembly, comparison and translation tasks faster and reported feeling more confident because the system showed how objects relate instead of treating them one by one.

Under the hood, we fuse real-time object tracking in Unity with a multimodal large language model, giving Semantic Reality a continuously updated “scene graph” of what you’re holding, touching or discussing. The result is an always-on, spatially anchored guide that helps you understand, connect and tinker, whether you’re wiring an Arduino, decoding a food label or just figuring out which cable goes where.
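The following is a minimal Python sketch of what such a scene graph could look like, mirroring the detail / object / environment levels listed under Technical Details below. Every class, field and function name here is a hypothetical illustration rather than the project’s actual implementation, which runs in Unity/C# against the PolySpatial APIs.

```python
# Illustrative sketch only: these names and structures are assumptions made for
# explanation, not the project’s actual Unity/C# data model.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Detail:
    """Micro-level call-out, e.g. a port or label on an object."""
    name: str                       # e.g. "USB-C port"
    position: tuple                 # anchor point in object-local coordinates
    note: Optional[str] = None      # MLLM-generated tip rendered as an overlay


@dataclass
class TrackedObject:
    """Object-level node, kept up to date by the real-time tracker."""
    object_id: str
    label: str                      # open-vocabulary label from the MLLM
    pose: tuple                     # world-space position and rotation
    details: list[Detail] = field(default_factory=list)
    held: bool = False              # set from hand-tracking
    gazed: bool = False             # set from gaze-tracking


@dataclass
class SceneGraph:
    """Environment-level container plus cross-object relations."""
    objects: dict[str, TrackedObject] = field(default_factory=dict)
    # Relations such as ("speaker-01", "pairs-with", "phone-02")
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def update(self, tracked, grasped_ids, gaze_target, mllm_relations):
        """Fold one frame of tracking data and any new MLLM output into the graph."""
        for obj in tracked:
            node = self.objects.setdefault(obj.object_id, obj)
            node.pose = obj.pose
            node.held = obj.object_id in grasped_ids
            node.gazed = obj.object_id == gaze_target
        if mllm_relations is not None:
            # MLLM responses arrive asynchronously; merge them when they land.
            self.relations = mllm_relations
```

In a sketch like this, keeping per-node held and gazed flags alongside the relation list is what would let overlays react to the object you are currently touching or looking at, and to how it connects to its neighbours, at the 10 Hz sync rate listed below.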
Technical Details
- Hardware: Apple Vision Pro (RGB stereo passthrough)
- Engine: Unity + PolySpatial; real-time mesh & hand-tracking APIs
- AI stack: Gemini 2.0 Flash MLLM (cloud) for open-vocabulary vision, OCR and dialogue; local speech-to-text
- Data model: persistent 3-level scene graph (detail / object / environment) synchronised at 10 Hz
- Interaction: gaze-and-pinch, ray, direct touch, voice; adaptive overlays generated via JSON prompts (see the sketch after this list)
- Prototype latency: ~300 ms from recognition to overlay render over a Wi-Fi 6 network
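As a rough illustration of the JSON-prompt idea above, the sketch below asks Gemini 2.0 Flash for a structured list of overlay call-outs. It assumes the google-generativeai Python SDK purely for readability; the prototype itself issues the request from Unity, and the prompt wording, schema and function names are hypothetical.

```python
# Illustrative sketch only: the prototype calls the model from Unity; this uses
# the google-generativeai Python SDK, and the prompt, schema and names below
# are assumptions made for explanation.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

# Hypothetical overlay schema the model is asked to fill in.
SCHEMA = (
    '[{"object_id": str, "anchor": "detail" | "object" | "environment", '
    '"label": str, "suggested_question": str}]'
)

def generate_overlays(frame_jpeg: bytes, scene_graph: dict) -> list[dict]:
    """Ask the MLLM for spatially anchored call-outs and parse its JSON reply."""
    prompt = (
        "You are an AR assistant. Given the attached passthrough frame and the "
        f"scene graph below, return a JSON list of call-outs matching {SCHEMA}.\n"
        f"Scene graph: {json.dumps(scene_graph)}"
    )
    response = model.generate_content(
        [prompt, {"mime_type": "image/jpeg", "data": frame_jpeg}],
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)
```

Constraining the reply to JSON keeps the overlays machine-readable, so the Unity side can anchor each returned call-out to the matching tracked object without brittle text parsing.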
Research/Context
Earlier “Augmented Object Intelligence” systems could label a single item, but fell short when tasks spanned ports, parts and multiple devices. Building on XR-Objects and Reality Summary, we explore three granularities—micro-detail, whole-object and multi-object—to support real-world workflows like cooking, DIY electronics and accessibility translation. Our exploratory user study (12 participants) showed higher helpfulness scores and no extra cognitive load compared with a single-object baseline, pointing toward more transparent, learn-as-you-go Physical AI.