Perception-ready data factories

Synthetic Data Generation

Use Omniverse Replicator to generate labeled, domain-randomized datasets for perception, inspection, robotics, and industrial vision systems.

Key Result
12M+
Synthetic scenarios generated at program scale
Phase 1

Scene & Asset Preparation

Phase 1 builds the 3D content library that Replicator will randomize. We audit the target perception task — object detection, defect classification, pose estimation — and identify the visual diversity requirements: object categories, material variations, environmental contexts, and lighting conditions. 3D assets are created or sourced from existing CAD models, applying physically based (PBR) materials with measured BRDF properties so that rendered images match real-world appearance under varied illumination.

Environment stages represent deployment contexts — warehouse shelves, conveyor lines, outdoor loading docks — with modular construction allowing rapid composition of novel layouts. We author USD-based scene templates that separate background, foreground objects, and lighting rigs into independent layers, enabling Replicator to randomize each axis independently. Material libraries include parametric wear-and-tear variations — scratches, dust, oil stains — that reflect the in-service conditions the perception model must handle.

Asset validation checks polygon budgets, UV-mapping quality, and material response under extreme lighting to prevent rendering artifacts. Deliverables include the validated asset library, scene templates, material catalogs, and a content-gap analysis mapping asset coverage against the deployment domain's visual diversity.
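The content-gap analysis can be sketched as a simple coverage check: the asset library is compared against the axis combinations the deployment domain requires. The axis names and asset records below are hypothetical placeholders, a minimal sketch of the bookkeeping rather than a production tool.

```python
from itertools import product

# Hypothetical diversity axes required by the deployment domain.
required_axes = {
    "category": {"bolt", "bracket", "gasket"},
    "material": {"steel", "rubber"},
    "condition": {"clean", "scratched", "oily"},
}

# Hypothetical asset library: each asset covers one value per axis.
asset_library = [
    {"category": "bolt", "material": "steel", "condition": "clean"},
    {"category": "bolt", "material": "steel", "condition": "scratched"},
    {"category": "bracket", "material": "steel", "condition": "clean"},
    {"category": "gasket", "material": "rubber", "condition": "oily"},
]

def content_gaps(required, assets):
    """Return required axis combinations with no covering asset."""
    covered = {tuple(a[axis] for axis in required) for a in assets}
    all_combos = set(product(*required.values()))
    return sorted(all_combos - covered)

gaps = content_gaps(required_axes, asset_library)
print(f"{len(gaps)} uncovered combinations, e.g. {gaps[0]}")
```

Each uncovered tuple becomes a candidate line item in the content-gap report, prioritized by how often that combination appears in the deployment domain.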

Omniverse · Replicator · OpenUSD
Phase 2

Domain Randomization Configuration

Phase 2 defines the randomization strategy that transforms static scenes into a diverse training distribution. Lighting randomization varies HDR environment maps, area-light positions, color temperatures (2700 K–6500 K), and intensity ranges to simulate indoor fluorescent, outdoor daylight, and mixed-lighting conditions. Texture randomization applies distractors — random patterns on background surfaces — to prevent the model from overfitting to environment textures.

Object pose randomization samples from task-relevant distributions: parts scattered on a conveyor follow gravity-settled placements, while shelf-stocked items follow grid patterns with jitter. Camera randomization varies intrinsics (focal length, sensor size), extrinsics (mounting height, tilt), and post-processing (exposure, white balance, motion blur) to match the range of cameras the model will encounter in deployment.

We implement stratified sampling to ensure uniform coverage across randomization axes, avoiding the mode collapse that naive random sampling can produce. Randomization schedules are version-controlled and linked to experiment IDs, enabling reproducibility. Deliverables include randomization configuration files, axis-coverage visualizations, a sampling-strategy document, and validation renders demonstrating the visual diversity achieved across lighting, pose, texture, and camera parameters.
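Stratified sampling over the randomization axes can be sketched in plain Python: each axis range is divided into bins, every bin combination is visited once per cycle, and a continuous value is jittered inside its bin so no stratum is left unsampled. The axis names, ranges, and bin counts below are illustrative, not the program's actual configuration.

```python
import random
from itertools import product

# Illustrative randomization axes as (low, high, bin_count).
axes = {
    "color_temp_k": (2700.0, 6500.0, 4),   # lighting color temperature
    "intensity":    (100.0, 1000.0, 3),    # light intensity
    "camera_tilt":  (-15.0, 15.0, 3),      # camera tilt, degrees
}

def stratified_samples(axes, rng):
    """Yield one sample per bin combination, jittered within each bin."""
    bin_ids = product(*(range(n) for (_, _, n) in axes.values()))
    for combo in bin_ids:
        sample = {}
        for (name, (lo, hi, n)), b in zip(axes.items(), combo):
            width = (hi - lo) / n
            sample[name] = lo + (b + rng.random()) * width
        yield sample

rng = random.Random(42)
samples = list(stratified_samples(axes, rng))
print(len(samples))  # 4 * 3 * 3 = 36 samples, one per stratum
```

Because every stratum contributes exactly one sample per cycle, coverage across the axes is uniform by construction, which is the property naive independent sampling cannot guarantee at small sample counts.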

Replicator · Python · Randomization Engine
Phase 3

Automated Dataset Generation

Phase 3 executes large-scale rendering and labeling. Replicator orchestrates multi-GPU rendering farms to produce datasets of tens of thousands to millions of images, each accompanied by pixel-perfect annotations — bounding boxes, instance segmentation masks, depth maps, surface normals, and 6-DOF object poses — generated without human labeling effort. Multi-sensor rendering produces synchronized camera, lidar, and radar outputs when the downstream model fuses heterogeneous inputs.

We implement dataset validation pipelines that check label consistency (no zero-area bounding boxes, no overlapping instance IDs), class balance (flagging underrepresented categories for additional rendering passes), and visual quality (detecting rendering artifacts like z-fighting or texture seams). Statistical analysis compares the synthetic distribution against available real-world data along feature dimensions — object scale, aspect ratio, occlusion level — to identify and fill coverage gaps.

Output formats are configurable — COCO JSON, KITTI, TFRecord, custom schemas — and datasets are versioned in object storage with full provenance linking each image to its scene template, randomization seed, and rendering parameters. Deliverables include the validated dataset, distribution-analysis reports, a label-quality dashboard, and rendering pipeline scripts ready for incremental dataset expansion.
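The label-consistency checks can be sketched against COCO-style annotation records: flag zero-area boxes, duplicate annotation IDs, and underrepresented classes. The annotation values, category names, and the `min_class_share` threshold are illustrative assumptions, not the pipeline's real schema.

```python
from collections import Counter

# Illustrative COCO-style annotations: bbox is [x, y, width, height].
annotations = [
    {"id": 1, "image_id": 1, "category": "scratch", "bbox": [10, 10, 40, 20]},
    {"id": 2, "image_id": 1, "category": "dent",    "bbox": [50, 60, 0, 15]},   # zero-area
    {"id": 3, "image_id": 2, "category": "scratch", "bbox": [5, 5, 30, 30]},
    {"id": 3, "image_id": 2, "category": "scratch", "bbox": [80, 5, 12, 12]},   # duplicate id
]

def validate_labels(anns, min_class_share=0.25):
    """Flag zero-area boxes, duplicate annotation ids, and class imbalance."""
    issues = []
    issues += [f"zero-area bbox in annotation {a['id']}"
               for a in anns if a["bbox"][2] <= 0 or a["bbox"][3] <= 0]
    id_counts = Counter(a["id"] for a in anns)
    issues += [f"duplicate annotation id {i}"
               for i, c in sorted(id_counts.items()) if c > 1]
    class_counts = Counter(a["category"] for a in anns)
    for cls, c in sorted(class_counts.items()):
        if c / len(anns) < min_class_share:
            issues.append(f"underrepresented class '{cls}'")
    return issues

report = validate_labels(annotations)
```

In production, classes flagged as underrepresented would feed back into Phase 2's randomization configuration to schedule additional rendering passes.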

Replicator · Multi-GPU · Auto-Labeling
Phase 4

Model Training Integration

The final phase connects synthetic datasets to model training and measures real-world impact. We configure TAO Toolkit transfer-learning pipelines that fine-tune pre-trained backbones (ResNet, EfficientNet, YOLO, DINO) on synthetic data, applying progressive domain-adaptation techniques — style transfer, feature alignment, curriculum mixing — that blend synthetic and real samples to maximize deployment accuracy. Training experiments sweep synthetic-to-real mixing ratios (100% synthetic → 80/20 → 50/50) to identify the optimal blend for the target task.

Model evaluation uses a held-out real-world test set with metrics aligned to operational KPIs — mAP at deployment-relevant IoU thresholds, per-class recall for safety-critical categories, inference latency on target hardware. We implement feedback loops where model failure modes — missed detections, false positives on novel objects — trigger targeted asset creation and additional rendering passes, closing the data flywheel.

TensorRT optimization produces deployment-ready engines profiled on Jetson or T4 inference hardware. Deliverables include trained model checkpoints, training experiment reports, synthetic-vs-real analysis, TensorRT engine packages, and a data-flywheel run-book that teams use to continuously improve model performance through iterative synthetic data refinement.
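The mixing-ratio sweep can be sketched as dataset-composition logic: for each target synthetic fraction, draw a blended index list from the two corpora. The corpus sizes, blend size, and ratios here are illustrative; in real training these index lists would feed a data loader rather than be counted directly.

```python
import random

def mix_datasets(synthetic, real, synthetic_fraction, size, rng):
    """Draw a blended training set with the requested synthetic share."""
    n_syn = round(size * synthetic_fraction)
    picks = [("syn", i) for i in rng.choices(range(len(synthetic)), k=n_syn)]
    picks += [("real", i) for i in rng.choices(range(len(real)), k=size - n_syn)]
    rng.shuffle(picks)
    return picks

# Illustrative corpora: 10k synthetic renders, 500 real images.
synthetic_ids = list(range(10_000))
real_ids = list(range(500))
rng = random.Random(0)

sweep = {}
for frac in (1.0, 0.8, 0.5):  # 100% synthetic, 80/20, 50/50
    blend = mix_datasets(synthetic_ids, real_ids, frac, size=1_000, rng=rng)
    sweep[frac] = sum(1 for src, _ in blend if src == "syn")

print(sweep)  # synthetic sample count per mixing ratio
```

One model is then trained per blend, and the held-out real-world test set decides which ratio ships.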

TAO Toolkit · TensorRT · Transfer Learning

Related Technology

Replicator · Isaac Sim · Omniverse · OpenUSD
DATASETS · SIM DATA · MODELS · CERTIFIED
Reference Architecture

Robot Training Pipeline

End-to-end closed-loop from CAD import through synthetic training to real-world deployment.

Selected Component

Synthetic Data (Replicator): domain-randomized datasets for perception and manipulation.

Program Focus

Synthetic data delivers the most value when it is engineered around business-critical edge cases — the underrepresented defect types, rare object orientations, and adverse lighting conditions that real-world collection cannot economically cover. Shailka-Robotics builds Omniverse Replicator pipelines that produce exactly the labeled, domain-randomized datasets customers need to close perception model gaps.

Each pipeline is architected as a configurable data factory: USD-based scene templates define the environment geometry and object placement distributions, while Replicator randomizers control lighting, materials, camera pose, and distractor placement across renders. Auto-labeling generates pixel-perfect bounding boxes, semantic segmentation masks, depth maps, and keypoint annotations without manual annotation labor. The result is a repeatable production system, not a one-time dataset dump.

Where this service differentiates is the feedback loop. Model performance metrics drive targeted scene modifications — if a defect detector struggles with specular surfaces under fluorescent lighting, the pipeline generates thousands of those specific combinations. NVIDIA TAO Toolkit then fine-tunes pretrained models on the synthetic corpus, and validation against held-out real data quantifies the improvement before deployment.
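That feedback loop can be sketched as a mapping from logged failure modes to targeted render requests. The failure records, condition fields, and the `frames_per_failure` budget are hypothetical, standing in for the real metrics pipeline and Replicator configuration.

```python
from collections import Counter

# Hypothetical failure log from validation against real-world data.
failures = [
    {"mode": "missed_detection", "surface": "specular", "lighting": "fluorescent"},
    {"mode": "missed_detection", "surface": "specular", "lighting": "fluorescent"},
    {"mode": "false_positive",   "surface": "matte",    "lighting": "daylight"},
    {"mode": "missed_detection", "surface": "specular", "lighting": "daylight"},
]

def render_requests(failures, frames_per_failure=2_000, min_count=2):
    """Turn recurring failure conditions into targeted rendering passes."""
    counts = Counter((f["surface"], f["lighting"]) for f in failures)
    return [
        {"surface": surface, "lighting": lighting,
         "frames": count * frames_per_failure}
        for (surface, lighting), count in counts.most_common()
        if count >= min_count
    ]

requests = render_requests(failures)
```

Each request then parameterizes a scene template and randomizer run, so the next dataset version over-samples exactly the conditions the model got wrong.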

Delivery Methodology

  1. Label Schema & Coverage Analysis — Define target classes, annotation types, and coverage gaps based on current model failure analysis.
  2. Scene Template Engineering — Build USD scene templates with parametric object placement, material libraries, and environment variations.
  3. Domain Randomization Design — Configure Replicator randomizers for lighting, pose, texture, occlusion, and camera intrinsics tied to real-world distributions.
  4. Render & Annotation Pipeline — Execute batch rendering with auto-labeling; validate annotation quality against ground truth samples.
  5. Model Training & Validation Loop — Fine-tune models using TAO Toolkit on synthetic data; measure accuracy uplift on real validation sets.

Technology Stack

  • Omniverse Replicator — synthetic data generation with programmable domain randomization
  • OpenUSD — scene templates, asset composition, and variation management
  • TAO Toolkit — transfer learning and fine-tuning on synthetic datasets
  • NVIDIA Isaac Sim — robotic workcell scene generation with physics-accurate object interactions
  • NVIDIA Omniverse — rendering backbone with RTX ray tracing for photorealistic output
  • NVIDIA Triton Inference Server — model serving for validation and production deployment

Expected Outcomes

  • 12M+ labeled synthetic images generated per program-scale engagement
  • 5–15% accuracy improvement on underrepresented edge cases after synthetic data augmentation
  • 90% reduction in manual annotation cost through Replicator auto-labeling
  • 10x faster dataset iteration cycles compared to real-world data collection campaigns
  • Configurable data factory that teams operate independently for ongoing model improvement