
Inspiration
The paradox of modern computing: I communicate naturally with humans through voice, but interact with AI through typing. In May 2024, Google introduced Agentic Vision with Gemini 3 Flash—a vision of AI that doesn't just see the world, but acts upon it.
I built I.R.I.S. to realize this vision: an interface that doesn't wait for a prompt, but reacts to attention. By combining real-time gaze signals with the Gemini 3 Flash Agentic SDK, I've bridged the gap between visual context and autonomous action.

Web Preview
The final product is a native Swift app for macOS. This interactive simulation is a simplified representation to illustrate the concept.
What it does
I.R.I.S. (Intent Resolution and Inference System) is a reactive agent that transforms your macOS environment into an attention-aware workspace. It uses your gaze as a high-bandwidth signal of intent, allowing you to interact with any UI element simply by looking and speaking.
- Agentic Vision — continuous multimodal analysis of your screen to proactively offer Skills (Refactoring, Summarization, Bug Fixing) before you even ask
- Precision Action via TARS — overcomes the "coordinate hallucination" of LLMs using a dedicated action server for pixel-perfect clicks, scrolls, and typing
- Contextual Awareness — automatically detects chat apps or IDEs to tailor reasoning and suggestions to your current task
Look at an issue → Say "Fix this" → I.R.I.S. analyzes via Gemini 3 Flash Agentic SDK → The Agentic Loop plans the resolution → TARS executes the action.
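The pipeline above can be sketched in Rust (the language of the gaze core). This is a toy model under stated assumptions: every name here (`Intent`, `plan`, `execute`) is a hypothetical stand-in, not the real SDK or TARS surface.

```rust
// Hypothetical sketch of the "look -> speak -> plan -> act" pipeline.
// All types and functions are illustrative stand-ins.

struct Intent {
    target: (u32, u32), // gaze coordinates at the moment of speech
    command: String,    // the spoken utterance
}

fn capture_intent(gaze: (u32, u32), utterance: &str) -> Intent {
    Intent { target: gaze, command: utterance.to_string() }
}

/// Stand-in for the Agentic Loop: turns an intent into a plan.
fn plan(intent: &Intent) -> String {
    format!("{} at ({}, {})", intent.command, intent.target.0, intent.target.1)
}

/// Stand-in for TARS execution of a planned step.
fn execute(plan: &str) -> String {
    format!("TARS executed: {}", plan)
}

fn main() {
    let intent = capture_intent((640, 360), "Fix this");
    println!("{}", execute(&plan(&intent)));
}
```

The point of the shape, not the details: gaze and speech are fused into one intent before any model call, so the planner always receives a spatial anchor.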
How I built it
A multi-layered system designed for low-latency feedback and high-precision execution:
- Gemini 3 Flash Agentic SDK — the core brain powering the Agentic Loop with native function calling and tool use for autonomous OS interaction
- TARS Action Server — translates natural language into concrete screen actions (Click, Drag, Type, Hotkey) with pixel-perfect reliability
- Agility KSTK (Knowledge Stack) — foundational gaze and vision models including KSTK/LBF weights for high-speed facial landmark detection and gaze estimation
- IRIS Gaze (Rust & Swift) — high-performance bridge to the iris-gaze-rs library for real-time tracking
- IRIS Vision — local OCR and Accessibility API integration for semantic screen mapping
- IRIS Media — custom audio/video pipeline for real-time "Live" multimodal sessions
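As a rough illustration of the TARS layer, here is a sketch of the four action verbs and one possible wire encoding. The enum shape and the text protocol are assumptions for illustration, not the shipped implementation.

```rust
// Hypothetical action vocabulary for a TARS-style action server.

#[derive(Debug)]
enum TarsAction {
    Click { x: u32, y: u32 },
    Drag { from: (u32, u32), to: (u32, u32) },
    Type(String),
    Hotkey(Vec<String>),
}

/// Encode an action as a simple command line a local action server could parse.
fn encode(action: &TarsAction) -> String {
    match action {
        TarsAction::Click { x, y } => format!("click {} {}", x, y),
        TarsAction::Drag { from, to } => {
            format!("drag {} {} {} {}", from.0, from.1, to.0, to.1)
        }
        TarsAction::Type(text) => format!("type {}", text),
        TarsAction::Hotkey(keys) => format!("hotkey {}", keys.join("+")),
    }
}

fn main() {
    let actions = [
        TarsAction::Click { x: 120, y: 48 },
        TarsAction::Hotkey(vec!["cmd".to_string(), "s".to_string()]),
    ];
    for a in &actions {
        println!("{}", encode(a)); // "click 120 48", then "hotkey cmd+s"
    }
}
```

Keeping the vocabulary this small is the design point: the model only ever emits one of four verbs, and everything pixel-level stays on the action-server side.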
Challenges I ran into
Standard LLMs struggle to map "Look at that button" to exact screen coordinates. I solved this by architecting TARS, which treats the screen as a navigable environment rather than just a static image.
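One way to picture that design, as a sketch only: the model names a UI element, and a locally built screen map (from OCR and the Accessibility API) supplies the pixels. The map contents and function names below are invented for illustration.

```rust
use std::collections::HashMap;

// Grounding sketch: coordinates come from a local semantic map, never
// from the LLM itself. Labels and rectangles here are made up.

struct Rect { x: u32, y: u32, w: u32, h: u32 }

/// Click target: the center of an element's bounding box.
fn center(r: &Rect) -> (u32, u32) {
    (r.x + r.w / 2, r.y + r.h / 2)
}

/// Resolve an element label (the model's output) to click coordinates.
fn resolve(label: &str, screen_map: &HashMap<String, Rect>) -> Option<(u32, u32)> {
    screen_map.get(label).map(center)
}

fn main() {
    let mut screen_map = HashMap::new();
    screen_map.insert("Fix button".to_string(), Rect { x: 100, y: 200, w: 80, h: 24 });

    // The model says "Fix button"; the local map turns it into exact pixels.
    println!("{:?}", resolve("Fix button", &screen_map)); // Some((140, 212))
}
```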
Maintaining 60 FPS gaze tracking while streaming multimodal data required strict modular separation. By offloading gaze estimation to a Rust-based core and using the Agentic SDK for asynchronous tool execution, I kept the interaction loop tight and responsive.
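The producer/consumer split described above can be sketched with a plain channel: the tracker only ever pushes samples and never waits on the streaming side. A toy model, not the real pipeline; the sample count is arbitrary.

```rust
use std::sync::mpsc;
use std::thread;

/// Toy model of the decoupling: a fast producer thread (standing in for
/// the Rust gaze core) pushes samples into a channel; a slower consumer
/// (standing in for the multimodal stream) drains them independently.
fn pump_samples(n: usize) -> usize {
    let (tx, rx) = mpsc::channel::<(f32, f32)>();

    let producer = thread::spawn(move || {
        for i in 0..n {
            // An unbounded std channel never blocks the sender, so the
            // tracking loop's frame budget is unaffected by the consumer.
            tx.send((i as f32, 0.0)).unwrap();
        }
    });

    producer.join().unwrap();
    rx.iter().count() // consumer drains at its own pace
}

fn main() {
    println!("streamed {} gaze samples without blocking the tracker", pump_samples(60));
}
```

A production version would likely use a bounded channel and drop stale samples instead of queueing them, but the structural idea is the same.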
Making "Proactive" mode helpful without being intrusive was another challenge. I implemented cooldowns and chat-app detection so I.R.I.S. only nudges you when it detects a genuine interaction opportunity.
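A minimal sketch of such a cooldown gate, with an assumed 30-second window rather than the real value:

```rust
use std::time::{Duration, Instant};

// Hypothetical cooldown gate for proactive suggestions.

struct ProactiveGate {
    cooldown: Duration,
    last_nudge: Option<Instant>,
}

impl ProactiveGate {
    fn new(cooldown: Duration) -> Self {
        Self { cooldown, last_nudge: None }
    }

    /// Allow a nudge only if the cooldown has elapsed since the last one.
    fn try_nudge(&mut self, now: Instant) -> bool {
        match self.last_nudge {
            Some(t) if now.duration_since(t) < self.cooldown => false,
            _ => {
                self.last_nudge = Some(now);
                true
            }
        }
    }
}

fn main() {
    let mut gate = ProactiveGate::new(Duration::from_secs(30)); // assumed window
    let t0 = Instant::now();
    println!("{}", gate.try_nudge(t0));                           // true: first nudge
    println!("{}", gate.try_nudge(t0 + Duration::from_secs(5)));  // false: cooling down
    println!("{}", gate.try_nudge(t0 + Duration::from_secs(31))); // true: cooldown elapsed
}
```

In the real system this gate would sit behind the chat-app detector, so both conditions must pass before a suggestion surfaces.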
Accomplishments that I'm proud of
- Agentic Vision Realization — building one of the first macOS implementations of the Gemini 3 Flash agentic loop
- TARS Integration — moving from "AI that talks" to "AI that clicks" with pixel-perfect accuracy
- KSTK Foundations — leveraging advanced LBF models for stable gaze tracking even in challenging lighting
What I learned
- Gaze is the Ultimate Filter — by knowing where a user looks, I can prune 90% of the "noise" on a screen, making LLM tool-calling significantly more accurate
- The Hybrid Approach — real-world agents need a mix of cloud reasoning (Agentic SDK) and local precision (TARS/KSTK) to be useful
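The gaze-as-filter idea can be sketched as a simple radius filter over screen elements before any context reaches the model. The radius and element list below are made-up examples.

```rust
// Sketch of gaze-based pruning: keep only UI elements near the gaze
// point when building the model's context.

struct Element {
    label: String,
    x: f32,
    y: f32,
}

/// Keep elements within `radius` pixels of the gaze point.
fn prune_by_gaze(elements: &[Element], gaze: (f32, f32), radius: f32) -> Vec<&Element> {
    elements
        .iter()
        .filter(|e| {
            let (dx, dy) = (e.x - gaze.0, e.y - gaze.1);
            (dx * dx + dy * dy).sqrt() <= radius
        })
        .collect()
}

fn main() {
    let elements = vec![
        Element { label: "Fix button".to_string(), x: 105.0, y: 205.0 },
        Element { label: "Menu bar".to_string(), x: 900.0, y: 10.0 },
    ];
    let near = prune_by_gaze(&elements, (100.0, 200.0), 50.0);
    for e in &near {
        println!("candidate: {}", e.label); // only "Fix button" survives
    }
}
```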
What's next for I.R.I.S.
The next phase of I.R.I.S. is the transition to a single, recursive capability: the Meta-Skill. Instead of shipping a fixed menu of features, I.R.I.S. will have only one core directive, the Skill to Learn and Create New Skills. Faced with an unfamiliar task, it will:
- Analyze the user's intent and visual workflow
- Synthesize a new Skill definition (logic, instructions, and TARS-action patterns)
- Persist that skill into the registry for future use
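The three steps above can be sketched as data: a synthesized skill persisted into a registry for later reuse. The `Skill` fields and registry API are assumptions for illustration only.

```rust
use std::collections::HashMap;

// Hypothetical shape of a synthesized skill and its registry.

#[derive(Clone, Debug, PartialEq)]
struct Skill {
    name: String,
    instructions: String,      // reasoning instructions for the agent
    tars_pattern: Vec<String>, // TARS action patterns the skill replays
}

#[derive(Default)]
struct SkillRegistry {
    skills: HashMap<String, Skill>,
}

impl SkillRegistry {
    /// Persist a newly synthesized skill for future sessions.
    fn persist(&mut self, skill: Skill) {
        self.skills.insert(skill.name.clone(), skill);
    }

    fn lookup(&self, name: &str) -> Option<&Skill> {
        self.skills.get(name)
    }
}

fn main() {
    let mut registry = SkillRegistry::default();
    registry.persist(Skill {
        name: "summarize-selection".to_string(),
        instructions: "Summarize the text under the user's gaze".to_string(),
        tars_pattern: vec!["click gaze".to_string(), "hotkey cmd+c".to_string()],
    });
    println!("known skill: {}", registry.lookup("summarize-selection").is_some());
}
```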
I.R.I.S. will become an open-ended system that grows alongside the user without further hand-built features, moving from a predefined tool to a self-evolving interface. From gaze to intention to infinite action.