Skill-Aware Robotic Task Execution with Executable-Semantic Descriptors and CLIP-LLM Planning


Abstract

In dynamic unstructured environments, achieving safe, scalable, and interpretable robotic skill execution remains a core bottleneck toward general-purpose embodied intelligence. Existing vision-language-action (VLA) paradigms typically employ end-to-end large models that map high-level task semantics directly to low-level action sequences. Their black-box nature limits interpretability, reduces skill reuse, and fails to provide deterministic guarantees in safety-critical scenarios. To overcome these limitations, we propose the Skill-Aware Robotic Task Execution (S-RTE) framework, which establishes a skill perception-reasoning-execution loop for gray-box, safety-aware control. At its core, we introduce the Skill-Aware Executable-Semantic Descriptor (S-ESD), a unified representation that bridges low-level control APIs with high-level semantic descriptions, enabling structured skill retrieval and composition by large language models (LLMs). Building upon this representation, we design the Skill Perception Network (S-ESD-CLIP), which leverages vision-language embedding alignment to perceive the executability and progress status (not started / in progress / completed) of candidate skills in real time, effectively pruning unsafe or infeasible skills. Finally, a gray-box planner (CLIP-LLM) integrates perceived feasibility cues as constraints into the reasoning process, producing interpretable “skill-condition” plans that support dynamic task decomposition and closed-loop execution. Extensive real-world experiments demonstrate that S-RTE significantly improves safety, robustness, and skill scalability while maintaining real-time inference on a single NVIDIA Jetson. The proposed framework offers a promising pathway toward deployable and interpretable embodied intelligence.

S-RTE Framework Overview

Overview of the S-RTE framework. (a) Skill-Aware Executable-Semantic Descriptor (S-ESD): Each skill is formalized as a structured five-tuple ⟨ID, Description, Model, State, Tag⟩, providing a unified representation of semantic abstraction and low-level control. Skills are acquired via imitation learning (IL) or end-to-end training and stored in a modular skill library for retrieval and reuse. (b) S-ESD with CLIP (S-ESD-CLIP): A lightweight multimodal perception module leveraging CLIP-based vision-language alignment evaluates the environmental executability of candidate skills in real time and updates stage-aware execution states (“not started,” “in progress,” “completed”), enabling dynamic pruning of infeasible actions. (c) CLIP-Guided Large Language Model Planner (S-CLIP-LLM): High-level task planning incorporates the filtered skill set from S-ESD-CLIP as prior constraints for the LLM, generating sequential task plans with interpretable reasoning. The bottom-right illustration shows two composite tasks—Task 1 (“place bottle → basket”) and Task 2 (“place cookie → plate”)—demonstrating multi-skill sequential execution under closed-loop supervision.
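
To make the five-tuple concrete, below is a minimal Python sketch of how an S-ESD entry and the modular skill library might be represented. The field names follow the figure and the `SkillState` values mirror the three execution stages; the `skill_library` contents and the `model` stand-in are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class SkillState(Enum):
    """Stage-aware execution states updated online by S-ESD-CLIP."""
    NOT_STARTED = "not started"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"


@dataclass
class SESD:
    """Skill-Aware Executable-Semantic Descriptor: the five-tuple
    <ID, Description, Model, State, Tag> from the overview figure."""
    skill_id: str                     # ID: unique key for retrieval
    description: str                  # Description: natural-language semantics exposed to the LLM
    model: Callable[[], None]         # Model: low-level control policy/API (e.g., a trained IL policy)
    state: SkillState = SkillState.NOT_STARTED     # State: current execution stage
    tags: List[str] = field(default_factory=list)  # Tag: metadata for filtering and composition


# Hypothetical modular skill library, keyed by skill ID for retrieval and reuse.
skill_library = {
    "grasp_bottle": SESD(
        skill_id="grasp_bottle",
        description="grasp the bottle on the table",
        model=lambda: None,  # stand-in for a trained IL / end-to-end policy
        tags=["manipulation", "grasp"],
    ),
}
```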



Skill Perception Evaluation

In this section, we evaluate the S-ESD-CLIP model's skill perception using a bottle-grasping task as the reference skill. The model predicts skill feasibility and execution stage under three experimental conditions: standard, cluttered, and occluded environments.

S-ESD-CLIP Framework Overview

Overview of the S-ESD-CLIP framework. (a) Pre-training of S-ESD-CLIP: Textual skill descriptions and visual observations are independently encoded by text and image encoders, and their similarity is computed through the CLIP model followed by softmax normalization. (b) S-ESD-CLIP for skill selection: The module integrates current observations with the skill library to dynamically generate a subset of executable skills that guide downstream task execution. This mechanism ensures that only contextually feasible skills are selected, thereby enhancing the reliability of high-level planning and execution.
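
Below is a minimal sketch of this scoring-and-pruning step, using the Hugging Face `transformers` CLIP API as a stand-in for the pre-trained S-ESD-CLIP model. The stage-prompt templates, the pruning rule, and the 0.5 threshold are assumptions for illustration; `filter_executable` reuses the `SESD` descriptors sketched earlier.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in backbone; the fine-tuned S-ESD-CLIP weights would be loaded instead.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

STAGES = ["not started", "in progress", "completed"]

def score_skill_stage(image: Image.Image, skill_description: str) -> dict:
    """Score one candidate skill's execution stage against the current
    observation via CLIP image-text similarity followed by softmax."""
    # Illustrative stage-prompt templates (an assumption, not the paper's exact prompts).
    prompts = [f"{skill_description}, {stage}" for stage in STAGES]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, 3): one image vs. three prompts
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(STAGES, probs.tolist()))

def filter_executable(image: Image.Image, library: dict, threshold: float = 0.5) -> dict:
    """Dynamic pruning: keep only skills the current observation still supports.
    `library` maps skill IDs to the SESD descriptors sketched earlier."""
    executable = {}
    for sid, skill in library.items():
        scores = score_skill_stage(image, skill.description)
        if scores["completed"] < threshold:  # assumed pruning rule
            executable[sid] = scores
    return executable
```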

Standard environment

Cluttered environment

Occluded environment


LLM-guided hierarchical task planning

In this section, we present S-CLIP-LLM, a CLIP-guided Large Language Model Planner that integrates the dynamically filtered skill library from S-ESD-CLIP with LLM reasoning. This enables context-aware, goal-driven long-horizon task planning through semantically valid skill sequencing.

S-CLIP-LLM Framework Overview

Overview of the S-CLIP-LLM framework. (a) Real-time skill perception with S-ESD-CLIP: The module dynamically filters the skill library based on current observations, aligning visual context with textual skill representations to identify executable skills in real time. (b) LLM-guided hierarchical task planning: The CLIP-Guided Large Language Model Planner integrates the filtered skill set with LLM reasoning to generate semantically valid and physically feasible skill sequences. By considering skill execution states (“not started,” “in progress,” “completed”), it enables interpretable, safe, and context-aware long-horizon task execution.
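
A minimal sketch of how the filtered skill set and its stage estimates might be composed into a constrained planning prompt follows. The JSON plan schema and the rule text are illustrative assumptions; the LLM is abstracted as a plain prompt-to-text callable so that any chat model can be plugged in.

```python
import json
from typing import Callable, Dict, List

def build_planner_prompt(task: str, executable: Dict[str, dict]) -> str:
    """Compose the CLIP-filtered skill set and per-skill stage estimates
    into a constrained planning prompt."""
    skill_lines = "\n".join(
        f"- {sid}: stage probabilities {scores}" for sid, scores in executable.items()
    )
    return (
        f"Task: {task}\n"
        f"Executable skills (pre-filtered by S-ESD-CLIP):\n{skill_lines}\n"
        "Rules: use only the listed skills; skip any skill already completed.\n"
        'Reply with JSON only: {"plan": [{"skill": "...", "condition": "..."}]}'
    )

def plan(task: str, executable: Dict[str, dict], llm: Callable[[str], str]) -> List[dict]:
    """Query the LLM and parse the interpretable skill-condition plan."""
    reply = llm(build_planner_prompt(task, executable))
    return json.loads(reply)["plan"]
```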

Perception + LLM Hierarchical Planning


ACT vs. ACT+LLM vs. Ours

From the failure cases of traditional ACT and ACT+LLM, it becomes evident that real-time skill perception is essential for reliable task execution. Our perception-integrated planner combines visual understanding with language reasoning to detect failures, adapt on the fly, and achieve consistent long-horizon performance.
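
Composing the sketches above, the closed-loop behavior can be outlined as follows. The completion threshold, the re-planning policy, and `observe` (assumed to return the current camera frame as a PIL image) are illustrative assumptions; the actual supervisor may differ in detail.

```python
def run_closed_loop(task, library, observe, llm, max_steps=100):
    """Closed-loop supervision: re-perceive before every step, skip stages
    that are already complete, and re-plan when a skill becomes infeasible."""
    plan_queue = None
    for _ in range(max_steps):
        image = observe()
        executable = filter_executable(image, library)
        if plan_queue is None:
            plan_queue = plan(task, executable, llm)   # (re)plan under current constraints
        if not plan_queue:
            return True                                # all stages completed
        step = plan_queue[0]
        if step["skill"] not in library:
            plan_queue = None                          # stale or invalid plan: re-plan
            continue
        scores = score_skill_stage(image, library[step["skill"]].description)
        if scores["completed"] > 0.5:                  # assumed completion threshold
            plan_queue.pop(0)                          # stage already done: skip it
        elif step["skill"] in executable:
            library[step["skill"]].model()             # execute the low-level policy
        else:
            plan_queue = None                          # infeasible now: pause and re-plan
    return False
```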

Tier 1 – Single Skill

In this baseline task, the robot places a bottle in the basket, testing fundamental manipulation and motion control. The results serve as a calibration reference before progressing to more complex multi-skill or disturbed scenarios.

Demonstration: placing a bottle in the basket.

Tier 2 – Sequential Composition

In this tier, the system must execute a two-step sequence: place the bottle in the basket, then place the cookie on the plate. This stage tests the planner’s ability to maintain state awareness and perform long-horizon skill coordination, ensuring smooth and correct transitions between atomic skills.

Demonstration: placing the bottle in the basket, then the cookie on the plate.

Tier 3 – Unexpected Condition

This tier examines the controller’s real-time adaptability and safety awareness under dynamic environmental disturbances. The task repeats the two-step sequence—placing the bottle in the basket and the cookie on the plate—but introduces unexpected perturbations such as a missing target object, manual scene restoration, or partial prior skill completion. A robust system should detect infeasible conditions and respond safely by pausing, re-evaluating, or re-planning instead of continuing an invalid trajectory.

Baseline Failures (ACT / ACT+LLM)

Representative failure cases highlighting the limitations of traditional ACT and ACT+LLM under dynamic task conditions.

Case 1 – Missing Object

The system fails to detect that the target object has been removed.

Case 2 – Incorrect Skill Sequence

ACT+LLM skips placing the bottle after detecting the cookie is already on the plate, leading to failure.

Case 3 – Failure Under Occlusion

The system fails to adapt when the environment introduces occlusions, continuing execution without proper recovery.

Our Successful Executions (S-RTE)

Representative cases demonstrating the robustness, adaptability, and perception-guided planning of our S-RTE framework.

Case 1 – Scene Manually Restored

S-RTE adapts as an experimenter repositions objects, completing the task successfully.

Case 2 – Correct Skill Sequencing

S-RTE properly maintains skill states, placing the bottle and cookie in the correct order.

Case 3 – Adaptation Under Occlusion

S-RTE successfully adapts when objects are occluded, completing the task without failure.

Additional Experiments

In this task, the robot must first open the drawer, then place the cookie inside.

Multiple successful demonstrations

Baseline Failures (ACT / ACT+LLM)

Representative failure cases highlighting the limitations of traditional ACT and ACT+LLM under dynamic task conditions.

Case 1 – Missing Object

The system fails to detect that the target object has been removed.

Case 2 – Indiscernible Skill Termination

The system is unable to determine whether a skill has terminated normally.

Case 3 – Failure to Skip Completed Stages in Long-Horizon Tasks

In long-horizon tasks, if an intermediate stage is inadvertently completed by external interference, the system cannot automatically bypass it and executes the stage anyway.

Improvements of S-RTE over ACT / ACT+LLM

Empirical comparisons showing that S-RTE effectively addresses the baseline failure cases, enabling reliable stage progression and robust execution in dynamic tasks.

Robustness Improvements Through S-RTE


Citation

Please cite our work if you use the code or reference our findings in your research.