Katha Shah — Writing Essay No. 002 · March 2025
AI · Accessibility · Deep Learning · USC

Deep Learning Meets
Accessibility: Building
Sign Language AI

Blindness cuts us off from things, but deafness cuts us off from people. — Helen Keller, American deaf-blind educator

That quote sat on the first slide of our presentation. We put it there because it said in one sentence what we were trying to solve in an entire research project — and because it reminded us, every time we looked at it, exactly who we were building for.

It was 2020. A deep learning class at the University of Southern California. Four of us — a small team who shared the belief that the most interesting problems in AI weren't the ones everyone was already racing toward. We wanted to build something for people who were being left behind by the technology boom.

We chose Sign Language Production. Not recognition — that had been explored. Production. The reverse direction: taking a word in English and automatically generating the sign language for it. Word in. 3D skeleton out.

The Gap Nobody Had Filled

The research landscape had a strange asymmetry. Sign Language Recognition — teaching a computer to understand signing from video — had received real attention. Production, the reverse direction, was almost entirely unexplored at the end-to-end level.

What that means practically: technology could help hearing people understand someone who was deaf. But it couldn't bridge the gap the other way. A truly accessible system needs to go both directions. We decided to try building one.

Why It Was Hard

Sign language is not a simplified spoken language. It uses every part of the body simultaneously — upper body motion, hand shape and trajectory, and facial expressions. Miss any channel and the meaning changes or disappears entirely.

That multi-channel complexity is what makes it genuinely difficult for a neural network. You're not generating a sequence of words. You're generating a continuous stream of 3D skeletal poses — joint coordinates across the torso, arms, hands, and face — all moving together in a way a real human signer would recognise as correct.

No one had built a fully end-to-end system that could do this. We were building into a gap with no established baseline, no prior model to build on, and a lot of open questions about whether it was even solvable.

Word to Skeleton

The core idea: give the system a word — say, "thank you" — and it outputs a stream of 3D skeletal pose frames showing exactly how to sign it in ASL. Upper body, arms, hands, face — all encoded as 3D joint coordinates, frame by frame. String the frames together and you have a fluid, animated sign.
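Concretely, the output is nothing exotic, just an array of coordinates. A minimal sketch of the shape (the frame and joint counts below are illustrative, not the model's actual numbers):

```python
import numpy as np

# Illustrative output for one generated sign: T frames of J tracked
# joints, each an (x, y, z) coordinate. The counts are made up for
# this sketch; the real skeleton covered torso, arms, hands, and face.
T, J = 30, 50
sign = np.zeros((T, J, 3))      # (frame, joint, xyz)

frame_0 = sign[0]               # one full-body pose, shape (J, 3)
joint_path = sign[:, 4]         # one joint's trajectory over time
                                # (joint index 4 is arbitrary here)
```

String the T frames together at video rate and you have the animated sign.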

3D Sign Language Production — Full Pipeline

1. Input: an English word, e.g. "thank you"
2. Training data: the WLASL dataset, a video corpus of 2,000 common ASL words
3. Preprocessing: OpenPose converts video to 2D skeletons of the face, torso, and hands
4. 3D conversion: Z-estimation, an iterative backprop approach, lifts the 2D skeletons to 3D
5. Model: a Progressive Transformer generator trained against a conditional discriminator
6. Output: a 3D pose sequence, N frames of joint coordinates forming an animated sign

Turning Video Into Data

Before training anything, we had to turn raw sign language videos into something a neural network could learn from. We used the WLASL dataset — the largest Word-Level ASL video corpus available, covering 2,000 common words signed by multiple signers.

Preprocessing had two stages. First, OpenPose extracted 2D skeleton keypoints from every video frame — detecting face, torso, and hand positions as 2D coordinate data. Then a Z-estimation algorithm — an iterative backpropagation-based approach — estimated the depth axis, lifting those 2D skeletons into full 3D space.

That 3D conversion mattered enormously. A flat 2D skeleton loses depth information that changes the meaning of a sign. Getting the z-axis right was one of the most technically demanding parts of the entire project.
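The lifting step can be sketched as a small optimisation: fix the 2D coordinates, treat each joint's depth as a free variable, and descend on a loss that penalises 3D bone lengths for deviating from a skeletal prior. Everything below is an illustrative toy under that assumption; the function name, the single bone-length loss, and the hyperparameters are ours, and the project's actual Z-estimation was richer than this.

```python
import numpy as np

def lift_to_3d(points_2d, bones, bone_lengths, steps=3000, lr=0.01, seed=0):
    """Toy iterative 2D -> 3D lift.

    points_2d    : (J, 2) OpenPose-style keypoints for one frame
    bones        : list of (i, j) joint-index pairs
    bone_lengths : target 3D length per bone (the skeletal prior)

    Gradient-descends on sum over bones of (||p_i - p_j||_3D - L)^2
    with respect to the depths only. Returns (J, 3) points.
    """
    rng = np.random.default_rng(seed)
    z = 0.1 * rng.standard_normal(len(points_2d))   # break the z = 0 saddle
    for _ in range(steps):
        grad = np.zeros_like(z)
        for (i, j), L in zip(bones, bone_lengths):
            dx, dy = points_2d[i] - points_2d[j]
            dz = z[i] - z[j]
            d = np.sqrt(dx * dx + dy * dy + dz * dz) + 1e-8
            g = 2.0 * (d - L) * dz / d              # d(residual^2)/dz_i
            grad[i] += g
            grad[j] -= g
        z -= lr * grad
    return np.hstack([points_2d, z[:, None]])
```

With one bone of 2D length 1 and a target 3D length of 2, the recovered depth gap settles near √3, the only value consistent with both constraints.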

Figure 1: OpenPose detects face, torso, and hand keypoints (top); Z-estimation converts the 2D skeletons to 3D (bottom). USC, 2020–21.

The Architecture — Two Components

We built a GAN-based system: a Generator and a Conditional Adversarial Discriminator working in tension with each other.

The Generator used a Progressive Transformer — an encoder-decoder architecture with Counter Embedding and Continuous Embedding layers — to produce sign pose sequences frame by frame. The Counter Embedding tracked position progressively; the Continuous Embedding conditioned each new frame on the previous pose.
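In spirit, the decoder's input at each frame is just those two signals combined: a continuous embedding of the previous pose and a counter embedding of how far through the sign we are. A minimal sketch, where the projection shapes and the simple additive combination are our illustration rather than the model's exact layers:

```python
import numpy as np

def decoder_input(prev_pose, t, total_frames, W_pose, W_counter):
    """Frame-t decoder input, Progressive-Transformer style.

    prev_pose : (P,) flattened joint coordinates of frame t-1
                (the Continuous Embedding's input)
    t / total_frames : normalised progress through the sign
                (the Counter Embedding's input)
    W_pose    : (D, P) pose-projection weights (illustrative)
    W_counter : (D, 1) counter-projection weights (illustrative)
    Returns a (D,) vector for the decoder to attend over.
    """
    counter = np.array([t / total_frames])
    return W_pose @ prev_pose + W_counter @ counter
```

At inference the model runs autoregressively: frame t-1's predicted pose is fed back in to produce frame t, with the counter telling the decoder how close the sign is to ending.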

The Discriminator was conditioned on the input word — meaning it judged not just whether a pose sequence looked realistic, but whether it was the correct sign for that specific word. The training signal combined MSE loss from the Generator with adversarial loss from the Discriminator.
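That combined signal is a one-liner in spirit. A hedged sketch of the generator's objective, where the non-saturating adversarial form and the weighting constant are our illustration and the project's exact formulation may have differed:

```python
import numpy as np

def generator_loss(pred, target, d_score_fake, lam=0.001):
    """Regression + adversarial training signal for the generator.

    pred, target : (T, P) generated and ground-truth pose sequences
    d_score_fake : discriminator's probability, given the input word,
                   that the generated sequence is the real sign
    lam          : balance between the two terms (illustrative value)
    """
    mse = np.mean((pred - target) ** 2)      # frame-wise pose regression
    adv = -np.log(d_score_fake + 1e-8)       # reward fooling the conditioned D
    return mse + lam * adv
```

The adversarial term shrinks as the discriminator is more convinced, so the generator is pulled toward sequences that are both geometrically close to the ground truth and semantically right for the word.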

GAN Architecture — Generator vs Discriminator

Generator — Progressive Transformer:
- Input Embedding: text/gloss encoding of the word
- Counter Embedding: progressive frame-position tracking
- Continuous Embedding: conditioning on the previous pose
- Masked multi-head attention + feed-forward blocks, stacked N layers deep
- Produces the sign pose sequence ŷ₀ … ŷᵤ (N 3D skeleton frames), trained with MSE loss against the ground truth y₀ … yᵤ (real signer frames)

Discriminator — conditioned on the word:
- Sees a pose sequence, generated or ground truth, together with the input word
- Makes a real/fake judgement: is this the correct sign for this word?
- Supplies the adversarial loss that flows back to the generator

The discriminator being conditioned on the specific word — not just pose quality — is what forces the generator toward semantic correctness, not just visual realism.

Three Experiments

We ran three experiments to understand what was actually contributing to performance. The first used the Progressive Transformer alone — no adversarial component, just the sequence generator trained with MSE loss. The second added the conditional GAN discriminator. The third added noise injection during training to improve robustness.

Comparing all three gave us real insight. The GAN component made the most meaningful difference to output quality. The noise experiment pushed the metrics further, giving us our best results.

The Numbers

We evaluated using BLEU-1 — borrowed from machine translation, measuring how closely generated sequences match ground truth — and Dynamic Time Warping (DTW), measuring geometric similarity between generated and real pose sequences across time.
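BLEU needs no reimplementation here, but DTW is worth seeing: a small dynamic program that finds the cheapest temporal alignment between two sequences, so a sign performed slightly faster or slower still scores as close. This is the textbook distance form (lower means closer); evaluation pipelines typically normalise or rescale it, so raw values from this sketch are not directly comparable to reported scores.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Textbook DTW between two pose sequences.

    seq_a : (Ta, P) and seq_b : (Tb, P), rows of flattened joint
    coordinates. Returns the minimal accumulated frame-to-frame
    Euclidean cost over all monotonic alignments.
    """
    Ta, Tb = len(seq_a), len(seq_b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            # extend the cheapest of: match, insertion, deletion
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[Ta, Tb]
```

A sequence aligned against a time-stretched copy of itself costs nothing, which is exactly why DTW suits motion comparison better than frame-by-frame MSE.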

Our best configuration achieved a BLEU-1 of 31.4 and a DTW score of 78.69. There was no prior baseline to beat, because no one had solved the problem before; what the numbers established was that the problem was solvable at all. The door was open.

| Configuration | Description | BLEU-1 ↑ | DTW ↑ |
|---|---|---|---|
| Transformer only | Progressive Transformer, MSE loss, no adversarial component | Baseline | Baseline |
| + GAN | Conditional adversarial discriminator added, word-conditioned | Improved | Improved |
| + Noise · Best | Noise injection during training for robustness | 31.4 | 78.69 |
We weren't just training a model. We were asking a neural network to learn the grammar of a language spoken entirely through the body.

The Work That Came After

Our model was preliminary — we were honest about that in the presentation. We had proved the concept and established a first baseline for a problem nobody had solved. But the research didn't stop when the class ended.

My teammates carried it forward. They extended the architecture, improved the training pipeline, and pushed the work further than we'd taken it together in the class. That's how good research works — one team opens the door, the next team walks further through it. I'm glad we were the ones who found it first.

Why This Kind of Work Matters

There are roughly 70 million deaf people in the world. Most navigate a world built almost entirely for people who hear. The technology industry has poured extraordinary resources into problems already well-served — and comparatively little into problems that affect people without economic power to demand attention.

Building this project changed how I think about what engineering is for. The memory of building something for people who genuinely needed it — not just wanted a faster checkout — stays with me in a different way than any other work I've done.

On the team

This project was built by four graduate students at USC as part of our deep learning coursework — every architectural decision, every training run, every late-night debugging session was shared. My teammates continued extending this research beyond the class project. The work, and the credit, belongs to all of us.

What It Left Behind

Helen Keller's words have stayed with me since I first put them on that slide. Deafness is invisible to most of us. We don't notice the barrier because we've never needed to cross it.

I build right, for the right reasons, every time. That principle didn't come from nowhere. It came from a USC classroom, a team of four, and a quote about deafness cutting people off from each other — and the conviction that it didn't have to.

The 3D Sign Language Production model was built 2020–21 at USC. Research extended by teammates post-course. WLASL dataset courtesy of the USC Information Sciences Institute. Architecture based on Progressive Transformers for Sign Language Production (Saunders et al.).

A Note on Accessibility in AI

AI has the power to lower barriers that most people don't even know exist — but only if the people building it choose to look. The most important software we can build is software that lets more people participate in the world.

I'm not suggesting every engineer has to pivot to accessibility research. But I do think we should be honest about where our attention goes and why. The hardest problems are not always the most visible ones. And sometimes the most meaningful thing you can build is the thing nobody else thought to build first.