
From Pointing to Delegation


Gestural and Multimodal Interfaces, Revisited

Gestural interfaces have been declared “the future of computing” many times over the last half century. And yet, outside of touchscreens and a handful of specialized domains, they have stubbornly resisted becoming the dominant way we interact with machines.

This is often framed as a story of hype cycles and technical immaturity. There is some truth to this framing and some lessons for those of us interested in innovation. But that is a topic for a different post.

A more faithful reading of history is that the core ideas arrived early, matured slowly, and only recently found systems capable of taking them seriously – systems designed to interpret intent rather than execute commands.


Gesture as Command: An Early and Persistent Strand

One of the earliest strands in interactive systems treated gesture as a symbolic command language.

You can trace this lineage back to Ivan Sutherland’s Sketchpad in the early 1960s — a system that already hinted at direct manipulation, pen input, and gestural interaction. Through the 1970s and 1980s, this strand expanded via data gloves, pen-based systems, and early vision-based hand recognition.

The underlying model was straightforward:

Perform a gesture → trigger an action.

Point to select.
Wave to rotate.
Draw a shape to invoke a command.
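Under the hood, this model amounts to a fixed lookup from recognized symbols to actions. A minimal sketch (in Python, with hypothetical gesture labels and action names) makes the shape of the problem visible:

# Gesture-as-command: a fixed symbol table. Labels and actions are illustrative.
GESTURE_COMMANDS = {
    "point": "select_object",
    "wave": "rotate_object",
    "circle": "invoke_command",
    # Growing the interface means growing this table, plus a recognizer for each entry.
}

def dispatch(recognized_gesture: str) -> str:
    # An unknown or misrecognized gesture has no graceful fallback.
    return GESTURE_COMMANDS.get(recognized_gesture, "no_op")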

Technically, this work was impressive and foundational. Conceptually, it ran into a scaling wall: each new feature demanded a new symbol, which made these systems brittle. Every new capability required:

  • A new gesture
  • A new mapping
  • A new thing for users to remember

Gesture vocabularies didn’t scale, and recognition errors multiplied as systems grew more expressive.

Importantly, this strand never disappeared — it continues today in areas like sign-language recognition, robotics, and accessibility. But as a general interaction paradigm, gesture-as-command proved fragile.


Multimodal Interaction: Gesture and Speech Together

Running alongside gesture-as-command — and in many ways correcting it — was a second strand: multimodal interaction.

A canonical milestone here is Richard Bolt’s “Put That There” system from 1980. The insight was deceptively simple and enduring:

Gesture and speech are not alternatives — they are complementary.

Speech carries intent (“move,” “delete,” “compare”).
Gesture carries reference (“this,” “that,” “there”).

Throughout the 1980s, 1990s, and 2000s, research communities explored temporal alignment, mutual disambiguation, and fusion strategies in approaches ranging from pen + speech interfaces to early gaze + gesture systems.

Most multimodal systems were still designed as deterministic pipelines:

(speech + gesture) → command
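A minimal sketch of such a pipeline (in Python, with made-up recognizer outputs and a made-up confidence threshold) looks something like this: speech supplies the verb, gesture supplies the referent, and anything the system cannot resolve is simply rejected:

# Deterministic "Put That There"-style fusion. Field names and the
# confidence threshold are illustrative, not from any particular system.
def fuse(speech: dict, gesture: dict) -> dict:
    # speech carries intent, e.g. {"verb": "move", "confidence": 0.91}
    # gesture carries reference, e.g. {"target": "object_42", "where": (120, 310)}
    if speech["confidence"] < 0.8 or gesture.get("target") is None:
        raise ValueError("ambiguous input: no command issued")
    # (speech + gesture) -> command, with no room for interpretation
    return {"command": speech["verb"],
            "target": gesture["target"],
            "location": gesture.get("where")}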

Much of the work we at HP Labs contributed during this period — across pen, touch, and gesture systems — sat squarely in this lineage. For example, we examined how gestures paired with speech can improve task performance and usability, and designed novel multimodal commands where gestures (e.g., pinch-to-zoom) were qualified by speech (“two times”) [IndiaHCI11].

This same period saw our work move into the practical problems of multimodal integration, robustness, and UI discoverability and coherence. While the sensing stack remained a work in progress, the research question shifted to “how does this fit into a usable system?”

In this context, we studied the learnability and usage of pose-based freehand gestures for lean-back media consumption tasks (ICHI12), gestural and multimodal interactions for virtual classrooms (APCHI13), and multiuser multimodal gestures for the living room (ICMI12), to name a few projects.

In general, these systems worked well when users were precise and failed ungracefully when they were not.

What is sometimes forgotten is how fragile the sensing stack was at the time. Early 3D cameras used to interpret in-air gestures were low resolution, noisy, expensive, and poor at handling occlusion. Microphone arrays used for speech were similarly limited; they were narrow-band, sensitive to reverberation, and poor at separating speakers from noise.

On top of this, computer vision and speech recognition algorithms were brittle, heavily feature-engineered, and difficult to generalize beyond controlled lab settings.

Under these constraints, treating gesture as a symbolic command was almost unavoidable — it was the only way to keep error rates tolerable.


When Research Met Reality: Consumer Gesture Systems

The late 2000s and early 2010s marked a turning point in scale. Systems like Microsoft Kinect brought full-body gesture recognition into millions of living rooms. For a moment, it felt like the future had arrived.

And then reality intervened.

  1. Fatigue & ergonomics
    Mid-air interaction causes measurable arm fatigue; “gorilla arm” isn’t a meme—it’s a well-studied phenomenon, and it tends to reappear whenever you try to make mid-air gesturing the default UI.
  2. Reliability in the real world
    Hands occlude each other, lighting changes, sleeves/gloves happen, cameras move, backgrounds clutter. To keep error rates tolerable, systems often limit gesture vocabularies, which in turn limits expressive power.
  3. Discoverability & standardization
    Users can’t “see” what gestures exist, and bespoke gesture sets don’t transfer across apps.
  4. Social acceptability
    People don’t love waving their hands in public unless there’s a clear payoff (XR at home is different from a subway platform).
  5. Competition from simpler modalities
    Remote controls, controllers, touch, mouse/trackpad, and voice often beat in-air gestures on speed, precision, privacy, and low effort—especially for “2D productivity” tasks.

Kinect was not fundamentally a failure of computer vision or machine learning. It was a lesson in what gesture is good at — and what it is not.


The Counterexample: Touch Gestures

While mid-air gesture struggled, touch gestures exploded.

Pinch-to-zoom, swipe, scroll, rotate — these became some of the most successful interaction techniques in computing history. The difference is revealing.

Touch gestures are:

  • Situated
  • Performed in contact with the object of interest
  • Constrained by physical and visual context

Touch succeeded not because it was expressive, but because the world itself disambiguated intent.

This point turns out to be crucial.

It also helped that reliably sensing (multi)touch was a far easier problem than sensing in-air gestures or speech. The fact that touch gestures were inspired by hand movements used to manipulate real-world objects made them easy to learn and recall, and real-time visual feedback from the surface being manipulated made them easy to use.


Spatial Computing and the Return of Embodiment

Spatial computing paradigms such as AR, VR, and MR did not invent embodied interaction, but made it practical.

In fact, research on 3D interaction, distal pointing, and embodied reference goes back decades (including work our team contributed to on pointing and target acquisition in virtual 3D environments (ICMI12)).

What changed in the last decade was infrastructure and sensing: reliable hand and gaze tracking, persistent scene graphs, and integrated gaze, gesture, and speech. Modern XR platforms now explicitly design for such pairings.

For example, on the Apple Vision Pro, “Look + pinch + say” is a first-class interaction model:

  • The headset uses eye tracking to determine where you’re looking, which acts much like a pointer or cursor in traditional interfaces.
  • Once you’ve looked at something, a pinch gesture (bringing thumb and forefinger together) acts as the selection or activation input — akin to clicking or tapping.
  • Finally, you can also perform actions or launch commands by speaking (e.g., “Siri, open this app”), which works alongside gaze and gesture.

On the Meta Quest as well, hand tracking and voice commands increasingly replace controllers for casual interaction.
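Schematically (and only schematically: the objects and method calls below are hypothetical placeholders, not the visionOS or Quest APIs), the interaction loop pairs a continuous deictic channel with discrete confirmation and spoken commands:

# A schematic "look + pinch + say" loop. Every type and call here is a
# placeholder for illustration; this is not a real platform API.
def interaction_loop(tracker, speech, scene):
    while True:
        # Gaze acts as the pointer: resolve what the user is looking at.
        target = scene.hit_test(tracker.gaze_ray())
        # Pinch acts as the click or tap on the gazed-at object.
        if target and tracker.pinch_detected():
            target.activate()
        # Speech names the action to apply to the same referent.
        utterance = speech.poll()
        if target and utterance:
            target.apply_command(utterance)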

Such multimodal gestures work in this context for a few reasons:

  • Hand gestures and voice feel socially acceptable in immersive/private contexts
  • 3D objects benefit from deictic reference
  • The environment allows for strong constraints, good feedback loops, and clear affordances.

At the end of the day, however, these systems still treat gestures as precise commands to be accurately recognized and executed.


What Comes Next: Agentic Systems That Enable Situated Intent

Humans routinely express intent that is:

  • Underspecified
  • Context-dependent
  • Delegative rather than directive

“Handle this.”
“Fix that later.”
“Keep an eye on this.”

These utterances make sense only because they are situated — grounded in a shared environment, moment, and activity.

This leads to a useful definition:

Situated intent is an expression of goals whose meaning is jointly determined by language, embodied signals (gesture, gaze), environmental state, and temporal context — rather than by language alone.

In natural human communication, gesture’s real power was never symbolic control; it played a key role in situating intent.

Agentic systems finally give situated intent somewhere to land. The system’s job shifts from executing commands correctly to interpreting situated intent (including seeking clarification and collaborating with the user across multiple input and output modalities), planning, and following through.

(gesture + speech + context)
→ intent hypothesis
→ planning
→ execution
→ monitoring

For example:

  • “Handle this” + point → agent figures out what handling means
  • “Fix this later” + glance → agent creates task, prioritizes, schedules
  • “Make this better” + gesture over document → agent interprets improvement goals

A typical pattern for an agentic interface may be:

  • Gesture and/or gaze defines scope
  • Speech defines goal
  • The system decides how
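One way to picture this division of labor is as an interpretation loop rather than a dispatch table. The sketch below is illustrative only: the agent object and its methods are hypothetical placeholders, not a real framework or API.

# A sketch of an agentic interpretation loop. The agent interface and its
# methods (interpret, clarify, plan, execute, monitor) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class SituatedInput:
    speech: str            # e.g. "fix this later"
    gesture_scope: str     # e.g. an object ID resolved from pointing or gaze
    context: dict = field(default_factory=dict)  # environment state, activity, time

def handle(agent, observation: SituatedInput):
    # 1. Form an intent hypothesis instead of demanding an exact command.
    intent = agent.interpret(observation)
    # 2. If the hypothesis is too uncertain, collaborate with the user.
    if intent.confidence < 0.5:
        intent = agent.clarify(intent, observation)
    # 3. Plan, execute, and keep monitoring for follow-through.
    plan = agent.plan(intent)
    result = agent.execute(plan)
    agent.monitor(result, until=intent.success_criteria)
    return result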

This is a fundamental shift:

  • From commands to delegation
  • From precision to interpretation
  • From control to collaboration

An important piece of this is that these systems may be designed to deal with abstract representations of multimodal input (such as embeddings), rather than requiring accurate translation of the input into symbols or words.
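As an illustration (arbitrary dimensions, no particular encoder or model in mind), the fusion step can then operate on vectors rather than on symbols:

# A sketch of fusing modality embeddings instead of translating each
# modality into symbols first. Dimensions and encoders are illustrative.
import numpy as np

def fuse_embeddings(speech_vec: np.ndarray,
                    gesture_vec: np.ndarray,
                    context_vec: np.ndarray) -> np.ndarray:
    # Concatenate per-modality embeddings into one representation of
    # situated intent that a downstream planner can condition on directly.
    return np.concatenate([speech_vec, gesture_vec, context_vec])

fused = fuse_embeddings(np.random.rand(768),   # speech encoder output
                        np.random.rand(256),   # hand/gaze encoder output
                        np.random.rand(128))   # scene/context encoder output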


Looking Back (and Forward)

Seen historically, gestural and multimodal interfaces were never a series of failed ideas. They were ideas waiting for the right kind of system on the other side.

  • Gesture alone struggled as a language
  • Multimodal interaction grounded meaning
  • Touch succeeded through situatedness
  • Spatial computing restored embodiment
  • Agents make delegation the primary act

With agentic systems, we finally have systems that can reason, plan, and take responsibility for underspecified intent.

That, more than any new sensor or model, is what finally makes gesture and speech feel at home.

-SriG

Acknowledgements: My trusted AI agent helped with some of the research and structuring of this post.
