Humanoid Robot AI
The humanoid embodiment offers a familiar form factor for Human-Robot Interaction (HRI) and fits into a world built for humans. Artificial Intelligence (AI) for humanoids spans conversation, spatial awareness, dexterous manipulation and locomotion. In Video 1, voice recognition for action allocation is implemented with OpenAI Whisper Speech-To-Text (STT), which transcribes the audio. In Video 2, Voice-To-Voice AI is implemented with OpenAI Whisper STT, a Llama 3.2 1B Large Language Model (LLM) that crafts a response, and Piper Text-To-Speech (TTS) that generates the audio. The AI models run locally on an NVIDIA Jetson AGX Thor.
Video 1. Voice action allocation.
Video 2. Voice-To-Voice conversation.
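The two voice flows described above (transcript to action, and audio to transcript to reply to audio) can be outlined as a minimal sketch. The class `VoicePipeline`, the function `allocate_action`, the `ACTIONS` table and the stand-in stages are all hypothetical names introduced for illustration; in a real system the three callables would presumably wrap Whisper, Llama 3.2 1B and Piper, whose actual APIs are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class VoicePipeline:
    """Chains three stages: audio -> transcript -> reply text -> audio."""
    stt: Callable[[bytes], str]   # assumed wrapper around an STT model
    llm: Callable[[str], str]     # assumed wrapper around an LLM
    tts: Callable[[str], bytes]   # assumed wrapper around a TTS model

    def respond(self, audio: bytes) -> bytes:
        transcript = self.stt(audio)
        reply = self.llm(transcript)
        return self.tts(reply)


# Hypothetical keyword-to-action table for voice action allocation.
ACTIONS: Dict[str, str] = {"wave": "wave_arm", "shake": "shake_hand"}


def allocate_action(transcript: str) -> Optional[str]:
    """Return the first action whose keyword appears in the transcript."""
    words = transcript.lower().replace(",", " ").replace(".", " ").split()
    for keyword, action in ACTIONS.items():
        if keyword in words:
            return action
    return None


# Stand-in stages so the sketch runs without any model installed.
pipeline = VoicePipeline(
    stt=lambda audio: audio.decode("utf-8"),
    llm=lambda text: "You said: " + text,
    tts=lambda text: text.encode("utf-8"),
)
```

Keeping the stages as plain callables means each model can be swapped or mocked independently, which is one possible design and not necessarily the one used on the robot.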
Speech recognition enables natural HRI for action allocation and conversation, drawing on the knowledge parameterized in the LLM. The arm trajectories for standard greeting gestures are pre-programmed. Arm navigation that depends on vision and semantic reasoning requires dynamic trajectory computation. To achieve this, in Video 3 objects are detected in camera RGB images with the You Only Look Once (YOLO) Deep Neural Network (DNN), projected from pixel to 3D space with an aligned depth image, and tracked in 3D. In Video 4, this real-time tracked position is used to compute arm joint positions with Model Predictive Control (MPC). MPC generates accurate trajectories from a physics model, but at a high latency that is feasible in real time with off-board computing. For real-time, on-board arm navigation to a dynamic target, 20,000 MPC trajectories generated in simulation to random target positions in the arm’s workspace are used to train a simulation-to-reality (sim-to-real) Diffusion Transformer (DiT), which reduces latency 20x. In Video 5, combining the arm navigation DiT with the YOLO, STT, LLM and TTS models lets the robot hold a conversation grounded in visual semantic reasoning.
Video 3. Object detection, localization and tracking.
Video 4. Arm navigation to object with MPC.
Video 5. Voice arm navigation with DiT trained on MPC trajectories.
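The pixel-to-3D projection step mentioned above can be illustrated with the standard pinhole camera model, under the assumption that the depth image is aligned to the RGB image and lens distortion is negligible. The intrinsics values, the `Intrinsics` type and the function names are hypothetical; the bounding box stands in for a detector output such as a YOLO box in pixel coordinates.

```python
from typing import NamedTuple, Tuple


class Intrinsics(NamedTuple):
    """Pinhole camera intrinsics in pixels (illustrative values only)."""
    fx: float  # focal length along x
    fy: float  # focal length along y
    cx: float  # principal point x
    cy: float  # principal point y


def bbox_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Center pixel (u, v) of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def deproject(u: float, v: float, depth_m: float,
              k: Intrinsics) -> Tuple[float, float, float]:
    """Map pixel (u, v) with depth in meters to a 3D point in the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return x, y, depth_m


# Example: a detection centered on the principal point, 1.5 m away.
camera = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
u, v = bbox_center((300.0, 220.0, 340.0, 260.0))
point = deproject(u, v, 1.5, camera)
```

The resulting point is expressed in the camera frame; using it as an arm target would additionally require a camera-to-robot-base transform, which is omitted here.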
Obstacle avoidance and mobile navigation are fundamentally similar across all Autonomous Mobile Robots (AMRs). In Video 6, map-free navigation learned with sim-to-real, proportionally scaled, car-racing-parameterized Deep Reinforcement Learning (DRL) is transferred across embodiments with 3D spatial augmentation of observations. The navigation DNN distills high-dimensional pre-processed spatial observations into digitized, information-dense, saturated features in the first hidden layer, and computes mobility controls with intrinsic exploration in the second.
Video 6. Map-Free navigation with cross-embodiment car racing parameterized DRL.
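As a rough illustration of the two-hidden-layer structure described above, the sketch below runs a forward pass in which the first layer saturates toward values near -1 or +1 and the second maps those features to bounded controls. The tanh activations, the layer sizes and the weights are assumptions chosen for the example, not the trained network, and the exploration mechanism is not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Affine layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def policy(obs: Vector, w1: Matrix, b1: Vector,
           w2: Matrix, b2: Vector) -> Tuple[Vector, Vector]:
    """Return (first-layer features, controls) for one observation."""
    # Large pre-activations push tanh toward -1 or +1 (saturation).
    features = [math.tanh(z) for z in dense(obs, w1, b1)]
    # Second layer turns the features into controls bounded in [-1, 1].
    controls = [math.tanh(z) for z in dense(features, w2, b2)]
    return features, controls


# Illustrative weights only: two observations, two features, one control.
features, controls = policy(
    obs=[1.0, -1.0],
    w1=[[10.0, 0.0], [0.0, 10.0]], b1=[0.0, 0.0],
    w2=[[0.5, 0.5]], b2=[0.0],
)
```

With these weights the two features sit close to +1 and -1, which is the near-binary behavior the word "saturated" suggests under this reading.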