THE RESEARCH BEHIND THE DATA

We've analyzed every major paper on robot learning data. Here's what actually matters for training production-ready models, and how we apply it.

THE CORE THESIS

Real-World Data Wins. Simulation Alone Doesn't.

Every major success in robot learning combines multiple real-world data types. Simulation-only approaches consistently plateau with a 30%+ gap to deployment.

Approach	Typical Performance	Source
Single data type (teleop only)	~65% task success	Open X-Embodiment (2024)
Teleop + motion capture	~78% task success	Physical Intelligence (2024)
Multi-source (3+ types)	~85%+ task success	Generalist AI (2024)

Key Insight

Adding motion capture to teleop data improves success rates by 15-25%. Adding UMI gripper data adds another 10-20% for manipulation tasks. Real-world RL data provides the final edge for robust deployment.

What This Means For You

If you're collecting only one type of real-world data, you're leaving 20-35% of performance on the table. The question isn't whether to use multi-source data. It's how to collect it efficiently.
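To make those numbers concrete, here is a minimal sketch that computes the gains implied by the table above. The success rates are the table's approximate figures (the multi-source row is a lower bound, "~85%+"). Reading the quoted ranges as relative improvements rather than percentage points is an assumption on our part, and the names `success` and `gains` are illustrative only.

```python
# Approximate task-success figures copied from the table above, in percent.
# The multi-source value is a floor ("~85%+"), so its gains are lower bounds.
success = {
    "teleop_only": 65.0,
    "teleop_plus_mocap": 78.0,
    "multi_source": 85.0,
}


def gains(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


mocap_abs, mocap_rel = gains(success["teleop_only"], success["teleop_plus_mocap"])
multi_abs, multi_rel = gains(success["teleop_only"], success["multi_source"])

print(f"teleop -> teleop + mocap: +{mocap_abs:.0f} points, {mocap_rel:.1f}% relative")
print(f"teleop -> multi-source:   +{multi_abs:.0f} points, {multi_rel:.1f}% relative")
```

On these rounded figures, adding motion capture is worth 13 points (20% relative), and going from teleop-only to multi-source is worth at least 20 points (about 31% relative). Both land inside the ranges quoted above when those ranges are read as relative gains.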

THE ACTION TOKEN HIERARCHY

Not All Data Is Equal

The 2025 VLA survey identifies distinct action token types. They form a five-level hierarchy, from abundant-but-commoditized to scarce-but-essential.

Abundant, Low Value → Scarce, High Value

Level	Data Sources	Availability	Value
Level 1: Language Description	Web text, documentation	Billions of examples	Commoditized - Everyone has this
Level 2: Code	GitHub, documentation	Millions of examples	Commoditized
Level 3: Goal State	Video prediction, hindsight relabeling	100Ks from video/simulation	Medium value - Simulation helps
Level 4: Affordance	Simulation, annotated video	100Ks available	Medium-high value
Level 5: Trajectory & Raw Action	Teleoperation demonstrations on real hardware	The bottleneck	Highest value - Cannot be synthesized

THIS IS WHAT WE COLLECT

The upper levels of this hierarchy are commoditized. Internet-scale language data, code repositories, and video datasets are accessible to every VLA research team.

Competitive differentiation emerges at the bottom level: trajectory and raw action data. This is the critical bottleneck. It cannot be scraped from the internet or reliably synthesized in simulation. It requires physical robots, skilled human operators, and diverse real-world environments.

Sentientx specializes exclusively in this high-value data: the data that transforms VLA models into deployment-ready agents.

QUALITY FACTORS

The Research on Data Quality

More data isn't always better data. Here's what the literature says about quality factors, and how we operationalize them.

Diversity > Volume

"Increasing dataset diversity yields larger improvements than increasing dataset size by the same factor."

- Scaling Robot Learning (Brohan et al., 2023)

Our Approach:

We design collection protocols around environmental and task diversity, not raw demonstration count. Different backgrounds, lighting conditions, object variations, operator approaches.
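One way to keep a collection campaign honest about diversity is to track coverage per axis instead of a raw episode count. The sketch below is illustrative only: the `Episode` fields and the helpers `coverage` and `most_repeated` are hypothetical names chosen to mirror the axes listed above, not a description of any production tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical metadata fields, one per diversity axis named above.
    background: str
    lighting: str
    obj: str
    operator: str


def coverage(episodes: list[Episode]) -> dict[str, int]:
    """Count the distinct values seen along each diversity axis."""
    axes = ("background", "lighting", "obj", "operator")
    return {axis: len({getattr(ep, axis) for ep in episodes}) for axis in axes}


def most_repeated(episodes: list[Episode], top: int = 1):
    """Return the most over-represented full conditions (all four axes together)."""
    return Counter(episodes).most_common(top)


episodes = [
    Episode("kitchen", "daylight", "mug", "op_a"),
    Episode("kitchen", "daylight", "mug", "op_a"),
    Episode("kitchen", "lamp", "bowl", "op_b"),
    Episode("workbench", "daylight", "mug", "op_b"),
]

print("episodes:", len(episodes))
print("distinct values per axis:", coverage(episodes))
print("most repeated condition:", most_repeated(episodes))
```

A report like this makes it visible when new demonstrations add volume without adding a new background, lighting condition, object, or operator.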

Camera Placement Matters

"Egocentric camera angle, height, and field-of-view significantly impact downstream model performance."

- EgoMimic (Kareer et al., 2024)

Our Approach:

Standardized camera rigs with documented specs. Consistent placement across all collection sessions. Multi-view synchronization (wrist, third-person, egocentric).
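Multi-view synchronization usually comes down to matching frames across streams by timestamp. The sketch below shows one common approach, nearest-timestamp matching with a tolerance. It assumes every stream carries sorted, non-empty timestamps on a shared clock; the stream names, the 10 ms tolerance, and the functions `nearest_index` and `align` are all illustrative assumptions rather than a documented pipeline.

```python
from bisect import bisect_left


def nearest_index(timestamps: list[float], t: float) -> int:
    """Index of the timestamp closest to t. Assumes a sorted, non-empty list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(reference: list[float], others: dict[str, list[float]], tolerance_s: float):
    """Match each reference frame to the nearest frame in every other stream.

    Returns (reference_index, {stream_name: frame_index}) tuples. A reference
    frame is dropped if any stream has no frame within tolerance_s of it.
    """
    matched = []
    for ref_idx, t in enumerate(reference):
        picks = {}
        for name, stamps in others.items():
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > tolerance_s:
                break  # this stream has a gap here; skip the reference frame
            picks[name] = j
        else:
            matched.append((ref_idx, picks))
    return matched


# Timestamps in seconds; the third-person stream has a gap near t = 0.100.
wrist = [0.000, 0.033, 0.066, 0.100]
streams = {
    "third_person": [0.002, 0.035, 0.069, 0.140],
    "egocentric": [0.001, 0.030, 0.064, 0.098],
}
pairs = align(wrist, streams, tolerance_s=0.010)
print(pairs)
```

In this example the first three wrist frames are matched in all views, and the last one is dropped because the third-person stream has no frame within tolerance.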

Annotation Priorities

"3D hand pose and natural language narrations provide highest ROI for annotation effort."

- Dobb-E (Shafiullah et al., 2023)

Our Approach:

We prioritize these annotation types. Hindsight language labels. Subtask segmentation. Failure and recovery markers.
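To show how these three annotation types can sit together on one episode, here is a minimal sketch of a record with a basic consistency check. The schema is hypothetical: `Segment`, `EpisodeAnnotation`, the `kind` values, and the field names are assumptions for illustration, not a published format.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start_s: float
    end_s: float
    label: str              # e.g. a subtask name or a short failure note
    kind: str = "subtask"   # assumed values: "subtask", "failure", "recovery"


@dataclass
class EpisodeAnnotation:
    episode_id: str
    hindsight_label: str    # language description written after the episode
    segments: list[Segment] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means these checks pass."""
        problems = []
        for seg in self.segments:
            if seg.end_s <= seg.start_s:
                problems.append(f"empty or reversed segment: {seg.label}")
        # Subtask segments should not overlap one another in time.
        subtasks = sorted(
            (s for s in self.segments if s.kind == "subtask"),
            key=lambda s: s.start_s,
        )
        for a, b in zip(subtasks, subtasks[1:]):
            if b.start_s < a.end_s:
                problems.append(f"overlapping subtasks: {a.label} / {b.label}")
        return problems


annotation = EpisodeAnnotation(
    episode_id="ep_0001",
    hindsight_label="picked up the mug and placed it on the shelf",
    segments=[
        Segment(0.0, 2.5, "reach for mug"),
        Segment(2.5, 4.0, "grasp mug"),
        Segment(3.1, 3.6, "grasp slipped", kind="failure"),
        Segment(3.6, 4.0, "regrasp", kind="recovery"),
        Segment(4.0, 7.2, "place on shelf"),
    ],
)
print(annotation.validate())
```

Failure and recovery markers are allowed to overlap subtasks here, since a slip and regrasp happen inside the subtask they interrupt.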

Domain Expertise > Generic Operators

"Task familiarity correlates with demonstration quality more strongly than teleoperation experience."

- Internal Sentientx analysis

Our Approach:

Recruit operators based on target domain. A shipyard worker collecting shipbuilding data. A warehouse associate collecting pick-and-place data.

THE SIM-TO-REAL GAP

Why Simulation Isn't Enough

If simulation alone were enough, everyone would rely on it. Here's why they can't.

Dimension	Simulation	Real-World (Sentientx)
Physics	Approximated (rigid body, simplified contact)	Ground truth physical interactions
Lighting	Rendered, predictable	Natural variation (time of day, shadows, reflections)
Textures	Limited asset libraries	Unlimited real-world materials
Objects	3D model required	Any object, no preprocessing
Sensor noise	Synthetic, Gaussian	Real sensor characteristics and failures
Contact dynamics	Simplified friction models	Actual material deformation, slippage
Edge cases	Must be programmed	Naturally occurring
Domain gap	Requires fine-tuning on real data anyway	Direct deployment ready

"While simulation provides scalability, the sim-to-real gap remains a fundamental challenge. Models trained purely on synthetic data demonstrate performance around 70% success rate - compared to policies trained on real-world data."

- VLA Survey

The remaining 30% is the difference between a demo and a product.

Simulation is useful for pretraining and augmentation. But the final mile - the data that actually enables deployment - must come from the real world. There is no shortcut.

INDUSTRY VALIDATION

How Leading Companies Use Multi-Source Data

Physical Intelligence

Combines internet video pre-training with teleoperation fine-tuning. Uses diverse environments and operator pools. Raised $400M+ on the strength of their data strategy.

Figure AI

Simulation pre-training + real-world teleoperation + continuous on-robot learning. Multi-modal data pipeline from day one.

1X Technologies

Egocentric video at scale for pre-training. Teleoperation for task-specific skills. Motion capture for locomotion.

No leading company uses a single data type.

The question is always: what mix, in what proportion, for what tasks?

OUR SOURCES

Where We Learn

Premier Venues

IEEE Transactions on Robotics
ICRA
CoRL
RSS
NeurIPS
arXiv robotics and ML sections

Research Groups

UC Berkeley BAIR
Stanford AI Lab
CMU Robotics Institute
Google DeepMind Robotics
Toyota Research Institute

Internal Analysis

Continuous literature review
Ablation studies on our collection protocols
Client feedback loops on data quality

The Sentientx Weekly

Our Friday newsletter distilling the research that matters.

Ready to Build Your Data Strategy?

We'll help you design a multi-source data pipeline tailored to your robot and deployment timeline.

Get a Proposal