I.R.I.S. Banner

Inspiration

The paradox of modern computing: I communicate naturally with humans through voice, but interact with AI through typing. In May 2024, Google introduced Agentic Vision with Gemini 3 Flash — a vision of AI that doesn't just see the world, but acts upon it.

I built I.R.I.S. to realize this vision: an interface that doesn't wait for a prompt, but reacts to attention. By combining real-time gaze signals with the Gemini 3 Flash Agentic SDK, I've bridged the gap between visual context and autonomous action.

Evolution of Human-Computer Interaction — from keyboard era, to mouse & GUI, to future multimodal interfaces

Web Preview: The final product is a native Swift app for macOS. The interactive web preview is a simplified simulation to illustrate the concept.

What it does

Intent Resolution & Inference

I.R.I.S. (Intent Resolution and Inference System) is a reactive agent that transforms your macOS environment into an attention-aware workspace. It uses your gaze as a high-bandwidth signal of intent, allowing you to interact with any UI element simply by looking and speaking.

  • Agentic Vision — continuous multimodal analysis of your screen to proactively offer Skills (Refactoring, Summarization, Bug Fixing) before you even ask
  • Precision Action via TARS — overcomes the "coordinate hallucination" of LLMs by using a dedicated action server for pixel-perfect clicks, scrolls, and typing
  • Contextual Awareness — automatically detects chat apps or IDEs to tailor reasoning and suggestions to your current task

The Flow

Look at an issue → Say "Fix this" → I.R.I.S. analyzes via Gemini 3 Flash Agentic SDK → The Agentic Loop plans the resolution → TARS executes the action.
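
To make the flow concrete, here is a minimal Swift sketch of how one pass of that loop could be wired. The types, protocols, and the handleUtterance function are illustrative stand-ins, not the actual I.R.I.S. or Gemini SDK APIs.

```swift
import Foundation

// Hypothetical types standing in for the real pipeline components.
struct GazePoint { let x: Double; let y: Double }
struct ScreenContext { let screenshotPNG: Data; let gaze: GazePoint; let transcript: String }
enum AgentAction { case click(x: Double, y: Double); case type(String); case hotkey([String]) }

// Wraps the call into the Gemini Agentic SDK (interface assumed).
protocol ReasoningClient {
    func plan(for context: ScreenContext) async throws -> [AgentAction]
}

// Wraps the TARS action server (interface assumed).
protocol ActionServer {
    func execute(_ action: AgentAction) async throws
}

/// One pass of the loop: capture context at the gaze point, plan, then act.
func handleUtterance(_ transcript: String,
                     gaze: GazePoint,
                     capture: () -> Data,
                     brain: ReasoningClient,
                     tars: ActionServer) async throws {
    let context = ScreenContext(screenshotPNG: capture(), gaze: gaze, transcript: transcript)
    let steps = try await brain.plan(for: context)   // the model decides what to do
    for step in steps {                              // TARS carries each step out on screen
        try await tars.execute(step)
    }
}
```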

How I built it

The Intelligence Stack

A multi-layered system designed for low-latency feedback and high-precision execution (a sketch of the TARS action layer follows the list):

  • Gemini 3 Flash Agentic SDK — the core brain powering the Agentic Loop with native function calling and tool use for autonomous OS interaction
  • TARS Action Server — translates natural language into concrete screen actions (Click, Drag, Type, Hotkey) with pixel-perfect reliability
  • Agility KSTK (Knowledge Stack) — foundational gaze and vision models including KSTK/LBF weights for high-speed facial landmark detection and gaze estimation
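
As a rough illustration of the action layer, the sketch below models the four primitives as an enum and posts a click with Quartz events. The TARSAction type and helper are assumptions for illustration; only the CGEvent calls are standard macOS API.

```swift
import CoreGraphics

// The four primitives mentioned above, modeled as a simple enum (names assumed).
enum TARSAction {
    case click(CGPoint)
    case drag(from: CGPoint, to: CGPoint)
    case type(String)
    case hotkey(keyCode: CGKeyCode, flags: CGEventFlags)
}

/// Posts a left click at an absolute screen coordinate via Quartz events.
/// This is the kind of OS-level primitive a TARS-style action server sits on.
func postClick(at point: CGPoint) {
    let down = CGEvent(mouseEventSource: nil, mouseType: .leftMouseDown,
                       mouseCursorPosition: point, mouseButton: .left)
    let up = CGEvent(mouseEventSource: nil, mouseType: .leftMouseUp,
                     mouseCursorPosition: point, mouseButton: .left)
    down?.post(tap: .cghidEventTap)
    up?.post(tap: .cghidEventTap)
}
```
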
The Framework Architecture

  • IRIS Gaze (Rust & Swift) — high-performance bridge to the iris-gaze-rs library for real-time tracking
  • IRIS Vision — local OCR and Accessibility API integration for semantic screen mapping (see the Accessibility sketch after this list)
  • IRIS Media — custom audio/video pipeline for real-time "Live" multimodal sessions
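
For the semantic screen mapping piece, here is a minimal sketch of the Accessibility lookup such a map can build on. The helper names are mine; the AX* calls are the standard macOS Accessibility API and require the Accessibility permission.

```swift
import ApplicationServices
import CoreGraphics

/// Returns the UI element under a screen point, the building block of a semantic screen map.
/// Requires the app to be granted Accessibility permission in System Settings.
func uiElement(at point: CGPoint) -> AXUIElement? {
    let systemWide = AXUIElementCreateSystemWide()
    var element: AXUIElement?
    let err = AXUIElementCopyElementAtPosition(systemWide, Float(point.x), Float(point.y), &element)
    return err == .success ? element : nil
}

/// Reads a string attribute such as kAXRoleAttribute or kAXTitleAttribute.
func stringAttribute(_ name: String, of element: AXUIElement) -> String? {
    var value: CFTypeRef?
    guard AXUIElementCopyAttributeValue(element, name as CFString, &value) == .success else { return nil }
    return value as? String
}

// Example: describe whatever sits under the current gaze point.
// if let el = uiElement(at: CGPoint(x: 640, y: 400)) {
//     print(stringAttribute(kAXRoleAttribute, of: el) ?? "?",
//           stringAttribute(kAXTitleAttribute, of: el) ?? "")
// }
```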

Challenges I ran into

The Precision Gap

Standard LLMs struggle to map "Look at that button" to exact screen coordinates. I solved this by architecting TARS, which treats the screen as a navigable environment rather than just a static image.
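
A hedged sketch of the idea: the model names an element from a locally built screen map, and the click point is derived from that element's measured frame rather than from the model's output. The ScreenElement shape and its fields are assumptions for illustration.

```swift
import CoreGraphics

// A node in the locally built screen map (fields assumed for illustration).
struct ScreenElement {
    let id: String          // stable identifier handed to the model
    let role: String        // e.g. "AXButton"
    let title: String       // e.g. "Commit"
    let frame: CGRect       // pixel-accurate bounds from the Accessibility API
}

/// The model never emits raw coordinates; it names an element, and the click
/// point is derived locally from that element's measured frame.
func clickPoint(forElementID id: String, in map: [ScreenElement]) -> CGPoint? {
    guard let element = map.first(where: { $0.id == id }) else { return nil }
    return CGPoint(x: element.frame.midX, y: element.frame.midY)
}
```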

Latency & Fluidity

Maintaining 60 FPS gaze tracking while streaming multimodal data required strict modular separation. By offloading gaze estimation to a Rust-based core and using the Agentic SDK for asynchronous tool execution, I kept the interaction loop tight and responsive.
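
A simplified sketch of that separation, assuming a polling closure backed by the Rust core: gaze sampling runs on its own high-priority queue at roughly 60 Hz, while tool calls run as independent tasks so a slow model or network round trip never drops gaze frames.

```swift
import Foundation
import CoreGraphics

/// Runs gaze polling on a dedicated high-priority queue at ~60 Hz so neither the UI
/// nor in-flight agentic tool calls can block it. The poll closure stands in for
/// the call into the Rust gaze core.
final class GazeLoop {
    private let queue = DispatchQueue(label: "iris.gaze", qos: .userInteractive)
    private var timer: DispatchSourceTimer?

    func start(poll: @escaping () -> CGPoint?, onSample: @escaping (CGPoint) -> Void) {
        let t = DispatchSource.makeTimerSource(queue: queue)
        t.schedule(deadline: .now(), repeating: .milliseconds(16))   // ~60 FPS
        t.setEventHandler {
            if let p = poll() { onSample(p) }    // publish the latest gaze estimate
        }
        t.resume()
        timer = t
    }

    func stop() {
        timer?.cancel()
        timer = nil
    }
}

// Tool execution runs independently, e.g. Task { try await tars.execute(action) },
// so the interaction loop stays tight while the agent works.
```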

State Management

Handling "Proactive" mode without being intrusive. I implemented cooldowns and chat-app detection so I.R.I.S. only nudges you when it detects a genuine interaction opportunity.

Accomplishments that I'm proud of

  • Agentic Vision Realization — building one of the first macOS implementations of the Gemini 3 Flash agentic loop
  • TARS Integration — moving from "AI that talks" to "AI that clicks" with pixel-perfect accuracy
  • KSTK Foundations — leveraging advanced LBF models for stable gaze tracking even in challenging lighting

What I learned

  • Gaze is the Ultimate Filter — by knowing where a user looks, I can prune 90% of the "noise" on a screen, making LLM tool-calling significantly more accurate (a pruning sketch follows this list)
  • The Hybrid Approach — real-world agents need a mix of cloud reasoning (Agentic SDK) and local precision (TARS/KSTK) to be useful
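
As a toy version of that pruning step (the Candidate shape and radius are assumptions), only elements near the gaze point survive, nearest first:

```swift
import Foundation
import CoreGraphics

// Candidate UI elements from the screen map (shape assumed for illustration).
struct Candidate { let title: String; let frame: CGRect }

/// Keeps only elements within `radius` points of the gaze, nearest first, so the
/// model reasons over a handful of items instead of the whole screen.
func prune(_ elements: [Candidate], around gaze: CGPoint, radius: Double = 200) -> [Candidate] {
    elements
        .map { el -> (Candidate, Double) in
            let center = CGPoint(x: el.frame.midX, y: el.frame.midY)
            return (el, hypot(Double(center.x - gaze.x), Double(center.y - gaze.y)))
        }
        .filter { $0.1 <= radius }
        .sorted { $0.1 < $1.1 }
        .map { $0.0 }
}
```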

What's next for I.R.I.S.

The "Infinite Skill" Paradigm

The next phase of I.R.I.S. is the transition to a single, recursive capability: The Meta-Skill. Instead of shipping a fixed menu of features, I.R.I.S. will have only one core directive — The Skill to Learn and Create New Skills. When invoked, it will:

  • Analyze the user's intent and visual workflow
  • Synthesize a new Skill definition (logic, instructions, and TARS-action patterns)
  • Persist that skill into the registry for future use (a minimal sketch of such a record follows this list)
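
A minimal sketch of what a persisted Skill record and registry could look like; the field names and JSON layout are assumptions, not the actual I.R.I.S. schema.

```swift
import Foundation

// A minimal, assumed shape for a synthesized Skill: what it is, when it applies,
// and the TARS-action pattern it replays. Field names are illustrative.
struct Skill: Codable {
    let name: String
    let trigger: String            // natural-language description of the intent it serves
    let instructions: String       // reasoning prompt handed to the model when invoked
    let actionPattern: [String]    // serialized TARS steps, e.g. "click:SendButton"
}

/// Persists learned skills to a JSON registry so they survive restarts.
struct SkillRegistry {
    let url: URL

    func load() -> [Skill] {
        guard let data = try? Data(contentsOf: url) else { return [] }
        return (try? JSONDecoder().decode([Skill].self, from: data)) ?? []
    }

    func persist(_ skill: Skill) throws {
        var all = load()
        all.append(skill)
        try JSONEncoder().encode(all).write(to: url, options: .atomic)
    }
}
```
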
The Vision

I.R.I.S. will become an infinite system that grows alongside the user without any further development, moving from a predefined tool to a self-evolving interface. From gaze to intention to infinite action.