THE RESEARCH BEHIND THE DATA
The Research Behind the Data
We've analyzed every major paper on robot learning data. Here's what actually matters for training production-ready models, and how we apply it.
THE CORE THESIS
Real-World Data Wins. Simulated Doesn't.
Every major success in robot learning combines multiple real-world data types. Simulation-only approaches consistently plateau with a 30%+ gap to deployment.
| Approach | Typical Performance | Source |
|---|---|---|
| Single data type (teleop only) | ~65% task success | Open X-Embodiment (2024) |
| Teleop + motion capture | ~78% task success | Physical Intelligence (2024) |
| Multi-source (3+ types) | ~85%+ task success | Generalist AI (2024) |
Key Insight
Adding motion capture to teleop data improves success rates by 15-25%. Adding UMI gripper data adds another 10-20% for manipulation tasks. Real-world RL data provides the final edge for robust deployment.
What This Means For You
If you're collecting only one type of real-world data, you're leaving 20-35% performance on the table. The question isn't whether to use multi-source data. It's how to collect it efficiently.
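To make the size of that gap concrete, here is a minimal sketch that works through the arithmetic using the approximate figures from the table above. The dictionary keys and the `gains` helper are illustrative names, and the "~85%+" entry is treated as exactly 85% for the calculation.

```python
# Approximate task-success figures copied from the table above.
# The table marks all of them as rough ("~"), so treat the results as rough too.
success = {
    "teleop_only": 0.65,
    "teleop_plus_mocap": 0.78,
    "multi_source": 0.85,  # the table says "~85%+"; 85% is used as a floor
}


def gains(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (improved - baseline) * 100
    relative_percent = (improved - baseline) / baseline * 100
    return round(absolute_points, 1), round(relative_percent, 1)


mocap_gain = gains(success["teleop_only"], success["teleop_plus_mocap"])
full_gain = gains(success["teleop_only"], success["multi_source"])
print("teleop -> teleop + mocap:", mocap_gain)
print("teleop -> multi-source:  ", full_gain)
```

Separating percentage points from relative improvement matters when comparing figures like these, because the same change looks larger in relative terms.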
THE ACTION TOKEN HIERARCHY
Not All Data Is Equal
The 2025 VLA survey identifies several types of action tokens. They form a hierarchy from abundant-but-commoditized to scarce-but-essential.
| Level | Action Token Type | Typical Sources | Availability | Value |
|---|---|---|---|---|
| 1 | Language Description | Web text, documentation | Billions of examples | Commoditized (everyone has this) |
| 2 | Code | GitHub, documentation | Millions of examples | Commoditized |
| 3 | Goal State | Video prediction, hindsight relabeling | 100Ks from video/simulation | Medium value (simulation helps) |
| 4 | Affordance | Simulation, annotated video | 100Ks available | Medium-high value |
| 5 | Trajectory & Raw Action | Teleoperation demonstrations on real hardware | The bottleneck | Highest value (cannot be synthesized) |
THIS IS WHAT WE COLLECT
The lower-numbered levels of this hierarchy are commoditized: internet-scale language data, code repositories, and video datasets are accessible to every VLA research team.
Competitive differentiation emerges at Level 5: trajectory and raw action data. This is the critical bottleneck. It cannot be scraped from the internet or reliably synthesized in simulation. It requires physical robots, skilled human operators, and diverse real-world environments.
Sentientx specializes exclusively in this high-value data: the data that transforms VLA models into deployment-ready agents.
QUALITY FACTORS
The Research on Data Quality
More data isn't always better data. Here's what the literature says about quality factors, and how we operationalize them.
Diversity > Volume
"Increasing dataset diversity yields larger improvements than increasing dataset size by the same factor."
- Scaling Robot Learning (Brohan et al., 2023)
Our Approach:
We design collection protocols around environmental and task diversity, not raw demonstration count. Different backgrounds, lighting conditions, object variations, operator approaches.
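As a rough illustration of planning around diversity rather than raw count, the sketch below enumerates a grid of collection conditions and samples a fixed number of distinct combinations. The axis names, their values, and the `plan_sessions` helper are all hypothetical; a real protocol would define its own.

```python
import itertools
import random

# Hypothetical variation axes; these values are placeholders, not a real protocol.
AXES = {
    "background": ["plain_wall", "cluttered_shelf", "workbench"],
    "lighting": ["daylight", "overhead_led", "dim"],
    "object_variant": ["small_box", "soft_bag", "metal_part"],
    "operator": ["operator_a", "operator_b"],
}


def plan_sessions(axes: dict[str, list[str]], budget: int, seed: int = 0) -> list[dict[str, str]]:
    """Sample `budget` distinct condition combinations uniformly from the full grid."""
    names = list(axes)
    # Every combination of one value per axis (3 * 3 * 3 * 2 = 54 here).
    grid = [dict(zip(names, combo)) for combo in itertools.product(*axes.values())]
    rng = random.Random(seed)
    rng.shuffle(grid)
    return grid[:budget]


sessions = plan_sessions(AXES, budget=12)
for session in sessions[:3]:
    print(session)
```

Sampling distinct combinations means each session in the budget adds a new set of conditions, instead of repeating the same setup many times.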
Camera Placement Matters
"Egocentric camera angle, height, and field-of-view significantly impact downstream model performance."
- EgoMimic (Kareer et al., 2024)
Our Approach:
Standardized camera rigs with documented specs. Consistent placement across all collection sessions. Multi-view synchronization (wrist, third-person, egocentric).
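One common way to synchronize multiple views is nearest-timestamp matching against a reference stream. The sketch below assumes each camera provides a sorted, non-empty list of frame timestamps in seconds; the function names, the 20 ms tolerance, and the sample timestamps are illustrative assumptions, not a description of any specific rig.

```python
from bisect import bisect_left


def nearest_index(timestamps: list[float], t: float) -> int:
    """Index of the value in sorted, non-empty `timestamps` closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if (after - t) < (t - before) else i - 1


def align_views(reference: list[float], others: dict[str, list[float]], tolerance_s: float = 0.02):
    """For each reference frame, find the closest frame in every other view.

    A reference frame is dropped if any view has no frame within `tolerance_s`.
    """
    aligned = []
    for ref_idx, t in enumerate(reference):
        match = {}
        for name, stamps in others.items():
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > tolerance_s:
                break  # this view has a gap here; skip the reference frame
            match[name] = j
        else:
            aligned.append((ref_idx, match))
    return aligned


# Made-up timestamps: the egocentric stream is missing a frame near 0.066 s.
wrist = [0.000, 0.033, 0.066, 0.100]
views = {
    "third_person": [0.001, 0.034, 0.070, 0.099],
    "egocentric": [0.002, 0.036, 0.101],
}
pairs = align_views(wrist, views)
print(pairs)
```

Dropping frames that lack a close match in every view keeps the aligned set consistent, at the cost of discarding some data when a camera skips frames.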
Annotation Priorities
"3D hand pose and natural language narrations provide highest ROI for annotation effort."
- Dobb-E (Shafiullah et al., 2023)
Our Approach:
We prioritize these annotation types. Hindsight language labels. Subtask segmentation. Failure and recovery markers.
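The annotation types listed above could be represented with a small per-episode record like the one sketched here. The class names, field names, and validation rules are hypothetical and only meant to show how hindsight labels, subtask segments, and failure/recovery markers might sit together.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubtaskSegment:
    label: str        # e.g. "grasp"
    start_frame: int  # inclusive
    end_frame: int    # exclusive


@dataclass
class EventMarker:
    kind: str         # "failure" or "recovery"
    frame: int
    note: Optional[str] = None


@dataclass
class EpisodeAnnotation:
    episode_id: str
    hindsight_label: str  # language description written after the episode
    segments: List[SubtaskSegment] = field(default_factory=list)
    events: List[EventMarker] = field(default_factory=list)

    def validate(self, num_frames: int) -> List[str]:
        """Return a list of problems; empty means the record is consistent.

        Assumes segments are listed in temporal order.
        """
        problems = []
        previous_end = 0
        for seg in self.segments:
            if not (0 <= seg.start_frame < seg.end_frame <= num_frames):
                problems.append(f"segment '{seg.label}' has an invalid frame range")
            if seg.start_frame < previous_end:
                problems.append(f"segment '{seg.label}' overlaps the previous one")
            previous_end = max(previous_end, seg.end_frame)
        for ev in self.events:
            if ev.kind not in ("failure", "recovery"):
                problems.append(f"unknown event kind '{ev.kind}'")
            if not (0 <= ev.frame < num_frames):
                problems.append(f"event at frame {ev.frame} is out of range")
        return problems


episode = EpisodeAnnotation(
    episode_id="demo-0001",
    hindsight_label="pick up the box and place it on the shelf",
    segments=[
        SubtaskSegment("reach", 0, 40),
        SubtaskSegment("grasp", 40, 75),
        SubtaskSegment("place", 75, 120),
    ],
    events=[EventMarker("failure", 52, "box slipped"), EventMarker("recovery", 60)],
)
issues = episode.validate(num_frames=120)
print(issues)
```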
Domain Expertise > Generic Operators
"Task familiarity correlates with demonstration quality more strongly than teleoperation experience."
- Internal Sentientx analysis
Our Approach:
We recruit operators from the target domain: a shipyard worker collects shipbuilding data, and a warehouse associate collects pick-and-place data.
THE SIM-TO-REAL GAP
Why Simulation Isn't Enough
If simulation worked, everyone would use it. Here's why they can't.
| Dimension | Simulation | Real-World (Sentientx) |
|---|---|---|
| Physics | Approximated (rigid body, simplified contact) | Ground truth physical interactions |
| Lighting | Rendered, predictable | Natural variation (time of day, shadows, reflections) |
| Textures | Limited asset libraries | Unlimited real-world materials |
| Objects | 3D model required | Any object, no preprocessing |
| Sensor noise | Synthetic, Gaussian | Real sensor characteristics and failures |
| Contact dynamics | Simplified friction models | Actual material deformation, slippage |
| Edge cases | Must be programmed | Naturally occurring |
| Domain gap | Requires fine-tuning on real data anyway | Ready for direct deployment |
"While simulation provides scalability, the sim-to-real gap remains a fundamental challenge. Models trained purely on synthetic data demonstrate performance around 70% success rate - compared to policies trained on real-world data."
- VLA Survey
The remaining 30% is the difference between a demo and a product.
Simulation is useful for pretraining and augmentation. But the final mile, the data that actually enables deployment, must come from the real world. There is no shortcut.
INDUSTRY VALIDATION
How Leading Companies Use Multi-Source Data
Physical Intelligence
Combines internet video pre-training with teleoperation fine-tuning. Uses diverse environments and operator pools. Raised $400M+ on the strength of their data strategy.
Figure AI
Simulation pre-training + real-world teleoperation + continuous on-robot learning. Multi-modal data pipeline from day one.
1X Technologies
Egocentric video at scale for pre-training. Teleoperation for task-specific skills. Motion capture for locomotion.
No leading company uses a single data type.
The question is always: what mix, in what proportion, for what tasks?
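A data mix is often expressed as per-task sampling weights over data sources. The sketch below shows one way that could look; the task names, source names, and weights in `MIX` are invented for illustration and are not recommended proportions or any company's actual recipe.

```python
import random
from collections import Counter

# Hypothetical mixing weights per task family; the numbers are illustrative only.
MIX = {
    "manipulation": {"teleop": 0.5, "gripper_demos": 0.3, "egocentric_video": 0.2},
    "locomotion": {"motion_capture": 0.6, "teleop": 0.2, "egocentric_video": 0.2},
}


def sample_sources(task: str, batch_size: int, seed: int = 0) -> Counter:
    """Draw the data source for each item in a batch using the task's weights."""
    weights = MIX[task]
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError(f"weights for {task!r} must sum to 1")
    rng = random.Random(seed)
    draws = rng.choices(list(weights), weights=list(weights.values()), k=batch_size)
    return Counter(draws)


batch = sample_sources("manipulation", batch_size=1000)
print(batch)
```

Keeping the weights in one table per task makes the mix an explicit, reviewable choice that can be changed without touching the sampling code.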
OUR SOURCES
Where We Learn
Premier Venues
- IEEE Transactions on Robotics
- ICRA
- CoRL
- RSS
- NeurIPS
- arXiv robotics and ML sections
Research Groups
- UC Berkeley BAIR
- Stanford AI Lab
- CMU Robotics Institute
- Google DeepMind Robotics
- Toyota Research Institute
Internal Analysis
- Continuous literature review
- Ablation studies on our collection protocols
- Client feedback loops on data quality
Ready to Build Your Data Strategy?
We'll help you design a multi-source data pipeline tailored to your robot and deployment timeline.