THE RESEARCH BEHIND THE DATA

The Research Behind the Data

We've analyzed every major paper on robot learning data. Here's what actually matters for training production-ready models, and how we apply it.

THE CORE THESIS

Real-World Data Wins. Simulation Doesn't.

Every major success in robot learning combines multiple real-world data types. Simulation-only approaches consistently plateau with a 30%+ gap to deployment.

Approach                        | Typical Performance | Source
Single data type (teleop only)  | ~65% task success   | Open X-Embodiment (2024)
Teleop + motion capture         | ~78% task success   | Physical Intelligence (2024)
Multi-source (3+ types)         | ~85%+ task success  | Generalist AI (2024)

Key Insight

Adding motion capture to teleop data improves success rates by 15-25%. Adding UMI gripper data adds another 10-20% for manipulation tasks. Real-world RL data provides the final edge for robust deployment.
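
As a rough check, the relative gains implied by the table can be computed directly. This is a minimal sketch that uses only the approximate success rates from the table; the function name is hypothetical, and 85% is treated as the lower bound of the "~85%+" row.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before


# Approximate task success rates taken from the table above.
teleop_only = 0.65
teleop_plus_mocap = 0.78
multi_source = 0.85  # lower bound of the "~85%+" row

mocap_gain = relative_gain(teleop_only, teleop_plus_mocap)
multi_gain = relative_gain(teleop_plus_mocap, multi_source)

print(f"teleop -> teleop + mocap: {mocap_gain:.0%} relative")
print(f"teleop + mocap -> multi-source: at least {multi_gain:.0%} relative")
```

Note that a relative gain is not the same as a gain in percentage points: moving from 65% to 78% is 13 points, but about 20% relative to the starting rate.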

What This Means For You

If you're collecting only one type of real-world data, you're leaving 20-35% performance on the table. The question isn't whether to use multi-source data. It's how to collect it efficiently.

THE ACTION TOKEN HIERARCHY

Not All Data Is Equal

The 2025 VLA survey identifies several types of action tokens. Grouped into the five levels below, they form a hierarchy from abundant-but-commoditized to scarce-but-essential.

From abundant, low value (Level 1) to scarce, high value (Level 5):

Level 1: Language Description

Web text, documentation

Billions of examples

Commoditized - Everyone has this

Level 2: Code

GitHub, documentation

Millions of examples

Commoditized

Level 3: Goal State

Video prediction, hindsight relabeling

100Ks from video/simulation

Medium value - Simulation helps

Level 4: Affordance

Simulation, annotated video

100Ks available

Medium-high value

Level 5: Trajectory & Raw Action

Teleoperation demonstrations on real hardware

The bottleneck

Highest value - Cannot be synthesized

THIS IS WHAT WE COLLECT

The upper levels of this hierarchy are commoditized. Internet-scale language data, code repositories, and video datasets are accessible to every VLA research team.

Competitive differentiation emerges at the bottom level: trajectory and raw action data. This is the critical bottleneck. It cannot be scraped from the internet or reliably synthesized in simulation. It requires physical robots, skilled human operators, and diverse real-world environments.

Sentientx specializes exclusively in this high-value data: the data that transforms VLA models into deployment-ready agents.

QUALITY FACTORS

The Research on Data Quality

More data isn't always better data. Here's what the literature says about quality factors, and how we operationalize them.

Diversity > Volume

"Increasing dataset diversity yields larger improvements than increasing dataset size by the same factor."

- Scaling Robot Learning (Brohan et al., 2023)

Our Approach:

We design collection protocols around environmental and task diversity, not raw demonstration count. We vary backgrounds, lighting conditions, objects, and operator approaches.
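
One way to track diversity rather than raw count is to measure how evenly episodes are spread along each axis. The sketch below is illustrative only: the metadata keys (`background`, `lighting`, `operator`) and the choice of Shannon entropy as the measure are assumptions, not a description of our internal tooling.

```python
import math
from collections import Counter
from typing import Iterable, Mapping


def axis_entropy(values: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the values observed along one axis."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def diversity_report(
    episodes: list[Mapping[str, str]], axes: list[str]
) -> dict[str, float]:
    """Entropy per axis; higher means episodes are spread more evenly."""
    return {axis: axis_entropy(ep[axis] for ep in episodes) for axis in axes}


# Hypothetical episode metadata.
episodes = [
    {"background": "kitchen", "lighting": "daylight", "operator": "op_a"},
    {"background": "kitchen", "lighting": "daylight", "operator": "op_b"},
    {"background": "workshop", "lighting": "daylight", "operator": "op_c"},
    {"background": "workshop", "lighting": "daylight", "operator": "op_d"},
]

report = diversity_report(episodes, ["background", "lighting", "operator"])
for axis, bits in report.items():
    print(f"{axis}: {bits:.2f} bits")
```

In this toy set, lighting scores zero because every episode was shot in daylight, which flags it as the axis to vary next, regardless of how many episodes have been collected.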

Camera Placement Matters

"Egocentric camera angle, height, and field-of-view significantly impact downstream model performance."

- EgoMimic (Kareer et al., 2024)

Our Approach:

Standardized camera rigs with documented specs. Consistent placement across all collection sessions. Multi-view synchronization (wrist, third-person, egocentric).
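
Multi-view synchronization can be sketched as nearest-timestamp matching against a reference stream. The example below is a simplified illustration under stated assumptions: each view provides sorted, non-empty frame timestamps in seconds on a shared clock, and the view names, tolerance, and function names are hypothetical.

```python
from bisect import bisect_left
from typing import Optional


def nearest_index(timestamps: list[float], t: float) -> int:
    """Index of the timestamp closest to t in a sorted, non-empty list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbor is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_views(
    reference: list[float],
    others: dict[str, list[float]],
    tolerance_s: float,
) -> list[Optional[dict[str, int]]]:
    """For each reference frame, find the nearest frame index in every other view.

    A reference frame maps to None if any view has no frame within tolerance.
    """
    aligned: list[Optional[dict[str, int]]] = []
    for t in reference:
        match: dict[str, int] = {}
        for name, stamps in others.items():
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > tolerance_s:
                aligned.append(None)
                break
            match[name] = j
        else:
            aligned.append(match)
    return aligned


# Hypothetical timestamps: wrist camera as reference, two other views.
wrist = [0.00, 0.10, 0.20]
views = {
    "third_person": [0.01, 0.11, 0.35],  # last frame arrives late
    "egocentric": [0.00, 0.09, 0.21],
}
aligned = align_views(wrist, views, tolerance_s=0.02)
print(aligned)
```

Frames that cannot be matched within tolerance are reported as None rather than silently paired with a stale frame, so dropped or delayed frames stay visible downstream.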

Annotation Priorities

"3D hand pose and natural language narrations provide highest ROI for annotation effort."

- Dobb-E (Shafiullah et al., 2023)

Our Approach:

We prioritize these annotation types, and add hindsight language labels, subtask segmentation, and failure and recovery markers.
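
To make these annotation types concrete, here is one possible shape for a per-episode annotation record. It is a sketch only: the class names, field names, and validation rules are hypothetical and do not describe a fixed Sentientx schema.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    label: str      # e.g. "grasp"
    start_s: float  # seconds from episode start
    end_s: float


@dataclass
class Marker:
    kind: str       # "failure" or "recovery"
    time_s: float


@dataclass
class EpisodeAnnotation:
    episode_id: str
    hindsight_label: str  # language label written after watching the episode
    subtasks: list[Subtask] = field(default_factory=list)
    markers: list[Marker] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return readable issues; an empty list means the record looks consistent."""
        issues = []
        if not self.hindsight_label.strip():
            issues.append("missing hindsight label")
        for seg in self.subtasks:
            if seg.end_s <= seg.start_s:
                issues.append(f"subtask '{seg.label}' has non-positive duration")
        ordered = sorted(self.subtasks, key=lambda s: s.start_s)
        for prev, nxt in zip(ordered, ordered[1:]):
            if nxt.start_s < prev.end_s:
                issues.append(f"subtasks '{prev.label}' and '{nxt.label}' overlap")
        for m in self.markers:
            if m.kind not in ("failure", "recovery"):
                issues.append(f"unknown marker kind '{m.kind}'")
        return issues


example = EpisodeAnnotation(
    episode_id="ep-0001",
    hindsight_label="Picked up the mug and placed it on the shelf after one regrasp.",
    subtasks=[
        Subtask("reach", 0.0, 1.2),
        Subtask("grasp", 1.2, 2.0),
        Subtask("place", 2.0, 4.5),
    ],
    markers=[Marker("failure", 1.5), Marker("recovery", 1.8)],
)
print(example.problems())
```

Keeping failure and recovery markers as first-class fields, rather than discarding imperfect episodes, preserves the recovery behavior in the dataset.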

Domain Expertise > Generic Operators

"Task familiarity correlates with demonstration quality more strongly than teleoperation experience."

- Internal Sentientx analysis

Our Approach:

We recruit operators from the target domain: a shipyard worker collects shipbuilding data, and a warehouse associate collects pick-and-place data.

THE SIM-TO-REAL GAP

Why Simulation Isn't Enough

If simulation worked, everyone would use it. Here's why they can't.

Dimension        | Simulation                                      | Real-World (Sentientx)
Physics          | Approximated (rigid body, simplified contact)   | Ground truth physical interactions
Lighting         | Rendered, predictable                           | Natural variation (time of day, shadows, reflections)
Textures         | Limited asset libraries                         | Unlimited real-world materials
Objects          | 3D model required                               | Any object, no preprocessing
Sensor noise     | Synthetic, Gaussian                             | Real sensor characteristics and failures
Contact dynamics | Simplified friction models                      | Actual material deformation, slippage
Edge cases       | Must be programmed                              | Naturally occurring
Domain gap       | Requires fine-tuning on real data anyway        | Direct deployment ready

"While simulation provides scalability, the sim-to-real gap remains a fundamental challenge. Models trained purely on synthetic data demonstrate performance around 70% success rate - compared to policies trained on real-world data."

- VLA Survey

The remaining 30% is the difference between a demo and a product.

Simulation is useful for pretraining and augmentation. But the final mile - the data that actually enables deployment - must come from the real world. There is no shortcut.

INDUSTRY VALIDATION

How Leading Companies Use Multi-Source Data

Physical Intelligence

Combines internet video pre-training with teleoperation fine-tuning. Uses diverse environments and operator pools. Raised $400M+ on the strength of their data strategy.

Figure AI

Simulation pre-training + real-world teleoperation + continuous on-robot learning. Multi-modal data pipeline from day one.

1X Technologies

Egocentric video at scale for pre-training. Teleoperation for task-specific skills. Motion capture for locomotion.

No leading company uses a single data type.

The question is always: what mix, in what proportion, for what tasks?
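
A data mix can be expressed as sampling weights per source. The sketch below shows the idea under stated assumptions: the source names follow the data types discussed earlier, but the proportions are purely hypothetical placeholders, not a recommended recipe, and the function name is illustrative.

```python
import random
from collections import Counter


def sample_sources(mix: dict[str, float], n: int, seed: int = 0) -> list[str]:
    """Draw n source names in proportion to the (unnormalized) mix weights."""
    if not mix or any(w < 0 for w in mix.values()) or sum(mix.values()) <= 0:
        raise ValueError("mix needs non-negative weights with a positive total")
    rng = random.Random(seed)  # seeded so a batch plan is reproducible
    names = list(mix)
    return rng.choices(names, weights=[mix[k] for k in names], k=n)


# Hypothetical proportions for a manipulation task; tune per task and robot.
manipulation_mix = {
    "teleop": 0.5,
    "motion_capture": 0.2,
    "umi_gripper": 0.3,
    "real_world_rl": 0.0,  # not yet collected for this task
}

batch = sample_sources(manipulation_mix, n=1000)
print(Counter(batch))
```

Writing the mix down as explicit weights makes it something that can be ablated per task, rather than an accident of whichever source produced the most episodes.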

OUR SOURCES

Where We Learn

Premier Venues

  • IEEE Transactions on Robotics
  • ICRA
  • CoRL
  • RSS
  • NeurIPS
  • arXiv robotics and ML sections

Research Groups

  • UC Berkeley BAIR
  • Stanford AI Lab
  • CMU Robotics Institute
  • Google DeepMind Robotics
  • Toyota Research Institute

Internal Analysis

  • Continuous literature review
  • Ablation studies on our collection protocols
  • Client feedback loops on data quality

The Sentientx Weekly

Our Friday newsletter distilling the research that matters.

Ready to Build Your Data Strategy?

We'll help you design a multi-source data pipeline tailored to your robot and deployment timeline.
