Analysis of the IsaacLab-based rl_training code structure and its reward functions for the DeepRobotics Lite3

Git repository
Overall the code is a manager-based IsaacLab RL framework that implements velocity-command reinforcement learning for the Lite3 and the M20 (a wheeled quadruped). The main file structure closely follows IsaacLab's.

├── logs.txt // terminal output from runs
├── docs // README images
├── logs // training checkpoints, videos, etc.
├── outputs // hydra logs and parameter YAMLs
│   └── 2025-09-03
├── scripts
│   ├── reinforcement_learning // standard rsl_rl library
│   │   ├── rl_utils.py
│   │   └── rsl_rl
│   │       ├── train.py // parses arguments, then passes task and agent through @hydra_task_config to obtain agent_cfg and env_cfg
│   │       ├── play.py
│   │       └── cli_args.py // command-line argument handling
│   └── tools // small standalone utilities
│       ├── check_robot.py
│       ├── clean_trash.py
│       ├── convert_mjcf.py
│       ├── convert_urdf.py
│       └── list_envs.py
└── source // main code
    └── rl_training
        ├── config // IsaacLab extension configuration
        ├── data // model assets
        ├── pyproject.toml // pip build-system configuration
        ├── rl_training // the main RL code
        └── setup.py // pip package installation

Expanding rl_training further, the main files are as follows.

rl_training
├── assets
│   ├── deeprobotics.py // ArticulationCfg for the wheeled and legged robots
│   └── utils
│       └── usd_converter.py
├── tasks
│   ├── manager_based // IsaacLab manager configuration
│   │   ├── locomotion // locomotion learning
│   │   │   └── velocity // velocity-based locomotion learning
│   │   │       ├── config // env cfgs
│   │   │       │   ├── quadruped // env cfgs for the quadruped
│   │   │       │   │   ├── deeprobotics_lite3
│   │   │       │   │   │   ├── __init__.py // gym.registry setup, mapping the task name on the command line to its configs
│   │   │       │   │   │   ├── agents
│   │   │       │   │   │   │   └── rsl_rl_ppo_cfg.py // RL agent parameters
│   │   │       │   │   │   ├── flat_env_cfg.py // flat-terrain env cfg, based on rough
│   │   │       │   │   │   └── rough_env_cfg.py // the main env cfg, inherits velocity_env_cfg
│   │   │       │   └── wheeled // env cfgs for the wheeled robot, analogous to quadruped
│   │   │       │       └── deeprobotics_m20
│   │   │       │           ├── agents
│   │   │       │           │   └── rsl_rl_ppo_cfg.py
│   │   │       │           ├── flat_env_cfg.py
│   │   │       │           └── rough_env_cfg.py
│   │   │       ├── mdp // functions used by each MDP module
│   │   │       │   ├── commands.py
│   │   │       │   ├── curriculums.py
│   │   │       │   ├── events.py
│   │   │       │   ├── observations.py
│   │   │       │   └── **rewards.py** // reward functions for the basic behaviors
│   │   │       └── **velocity_env_cfg.py** // the main env cfg file, configuring each manager module

IsaacLab's manager-based RL configuration lives mainly in the env cfg, which covers the basic scene, observation, action, reward, termination, and event modules, plus extras such as command and curriculum.
LocomotionVelocityRoughEnvCfg in velocity_env_cfg.py is the main env cfg; flat_env_cfg and rough_env_cfg both inherit from it. It contains MySceneCfg, ActionCfg, ObservationCfg, CommandsCfg, RewardsCfg, TerminationsCfg, EventCfg, and CurriculumCfg.
The most important part of the env cfg is the reward, so it is worth going through every reward term in rewards.py.
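Each reward function discussed below is wired into RewardsCfg as a weighted term. Below is a minimal hedged sketch of what one such entry looks like, assuming IsaacLab's standard import paths (`RewardTermCfg`, `configclass`); the actual entries and values in the repo's velocity_env_cfg.py may differ:

```python
import math

from isaaclab.managers import RewardTermCfg as RewTerm
from isaaclab.utils import configclass

from . import mdp  # the mdp package holding rewards.py


@configclass
class RewardsCfg:
    # each term binds an mdp function to a scalar weight and keyword parameters
    track_lin_vel_xy_exp = RewTerm(
        func=mdp.track_lin_vel_xy_exp,
        weight=3.0,  # matches the #Flat=3 annotation below
        params={"command_name": "base_velocity", "std": math.sqrt(0.25)},
    )
```

The reward manager then evaluates every term per step and sums them after applying the weights.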

Judging from the README's notes on play, the current training objective is for the robot dog to move at commanded velocities: linear velocity in any direction within its own xy plane, plus rotation about the z axis (yaw), with real-time acceleration and deceleration.
All reward terms are listed below; those used in Flat training are annotated with #Flat, and likewise for Rough.

track_lin_vel_xy_exp #Flat=3

def track_lin_vel_xy_exp(
    env: ManagerBasedRLEnv, std: float, command_name: str, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Reward tracking of linear velocity commands (xy axes) using exponential kernel."""
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    # compute the error
    lin_vel_error = torch.sum(
        torch.square(env.command_manager.get_command(command_name)[:, :2] - asset.data.root_lin_vel_b[:, :2]),
        dim=1,
    )
    reward = torch.exp(-lin_vel_error / std**2)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

The core factor of this reward is the difference between the commanded xy-plane velocity and the current xy-plane velocity, which an exponential kernel turns into a nonlinearly, negatively correlated reward: the reward is 1 at zero error and approaches 0 as the error grows. std acts as a tuning knob: at the same error level, a smaller std gives a lower reward.
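To see the kernel's shape numerically, here is a small pure-Python check of the per-environment value torch.exp(-lin_vel_error / std**2):

```python
import math

def exp_tracking_reward(error_sq: float, std: float) -> float:
    """Exponential tracking kernel: 1 at zero error, decaying toward 0."""
    return math.exp(-error_sq / std**2)

print(exp_tracking_reward(0.0, 0.25))            # 1.0: perfect tracking
print(round(exp_tracking_reward(0.1, 0.5), 3))   # 0.67
print(round(exp_tracking_reward(0.1, 0.25), 3))  # 0.202: same error, smaller std, lower reward
```

The last two lines show the std effect described above: the identical squared error of 0.1 earns noticeably less reward under the tighter kernel.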

track_ang_vel_z_exp #Flat=1.5

def track_ang_vel_z_exp(
    env: ManagerBasedRLEnv, std: float, command_name: str, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Reward tracking of angular velocity commands (yaw) using exponential kernel."""
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    # compute the error
    ang_vel_error = torch.square(env.command_manager.get_command(command_name)[:, 2] - asset.data.root_ang_vel_b[:, 2])
    reward = torch.exp(-ang_vel_error / std**2)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

This reward is essentially the same as the previous one, but tracks the angular velocity about z instead of the linear velocity in xy.

track_lin_vel_xy_yaw_frame_exp

def track_lin_vel_xy_yaw_frame_exp(
    env, std: float, command_name: str, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Reward tracking of linear velocity commands (xy axes) in the gravity aligned robot frame using exponential kernel."""
    # extract the used quantities (to enable type-hinting)
    asset = env.scene[asset_cfg.name]
    vel_yaw = quat_apply_inverse(yaw_quat(asset.data.root_quat_w), asset.data.root_lin_vel_w[:, :3])
    lin_vel_error = torch.sum(
        torch.square(env.command_manager.get_command(command_name)[:, :2] - vel_yaw[:, :2]), dim=1
    )
    reward = torch.exp(-lin_vel_error / std**2)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

This reward addresses the fact that on sloped terrain the robot's own xy-plane velocity does not match the velocity in the true horizontal plane. yaw_quat() builds a quaternion containing only the robot's yaw, which defines the yaw frame, and quat_apply_inverse() rotates the robot's world-frame velocity into that frame. (quat_apply(), in contrast, maps from the local frame to the world frame.)
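In 2D the yaw-frame transform reduces to a planar rotation by -yaw; the sketch below (a hypothetical helper, not from the repo) shows what quat_apply_inverse(yaw_quat(q), v) achieves for the planar components:

```python
import math

def world_to_yaw_frame(vx_w: float, vy_w: float, yaw: float) -> tuple[float, float]:
    """Rotate a world-frame planar velocity into the robot's yaw-aligned frame.

    Only the heading (yaw) is removed, so roll/pitch on a slope do not
    distort the measured planar velocity.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    return c * vx_w + s * vy_w, -s * vx_w + c * vy_w

# robot heading rotated 90 degrees left, moving "north" in the world frame:
# in its own yaw frame that is pure forward velocity
vx_b, vy_b = world_to_yaw_frame(0.0, 1.0, math.pi / 2)
print(round(vx_b, 6), round(vy_b, 6))
```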

track_ang_vel_z_world_exp

def track_ang_vel_z_world_exp(
    env, command_name: str, std: float, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Reward tracking of angular velocity commands (yaw) in world frame using exponential kernel."""
    # extract the used quantities (to enable type-hinting)
    asset = env.scene[asset_cfg.name]
    ang_vel_error = torch.square(env.command_manager.get_command(command_name)[:, 2] - asset.data.root_ang_vel_w[:, 2])
    reward = torch.exp(-ang_vel_error / std**2)
    reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

This reward computes the z-axis angular-velocity error in the world frame, which removes the influence of slopes; it suits scenarios that need global heading control.

joint_power #Flat=-2e-05

def joint_power(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Reward joint_power"""
    # extract the used quantities (to enable type-hinting)
    asset: Articulation = env.scene[asset_cfg.name]
    # compute the reward
    reward = torch.sum(
        torch.abs(asset.data.joint_vel[:, asset_cfg.joint_ids] * asset.data.applied_torque[:, asset_cfg.joint_ids]),
        dim=1,
    )
    return reward

This reward sums the power (angular velocity × torque) over all joints selected by asset_cfg.joint_ids, penalizing large instantaneous power.

stand_still_without_cmd #Flat=-2

def stand_still_without_cmd(
    env: ManagerBasedRLEnv,
    command_name: str,
    command_threshold: float,
    asset_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
) -> torch.Tensor:
    """Penalize joint positions that deviate from the default one when no command."""
    # extract the used quantities (to enable type-hinting)
    asset: Articulation = env.scene[asset_cfg.name]
    # compute out of limits constraints
    diff_angle = asset.data.joint_pos[:, asset_cfg.joint_ids] - asset.data.default_joint_pos[:, asset_cfg.joint_ids]
    reward = torch.sum(torch.abs(diff_angle), dim=1)
    reward *= torch.linalg.norm(env.command_manager.get_command(command_name), dim=1) < command_threshold  # active only when the command norm is below the threshold (no effective command)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

When a valid velocity command exists this term is 0; otherwise it equals the summed error between every joint position and its default (standing) position. Its purpose is to keep the robot steady and suppress jitter when there is no active command.

joint_pos_penalty #Flat=-0.1

def joint_pos_penalty(
    env: ManagerBasedRLEnv,
    command_name: str,
    asset_cfg: SceneEntityCfg,
    stand_still_scale: float,
    velocity_threshold: float,
    command_threshold: float,
) -> torch.Tensor:
    """Penalize joint position error from default on the articulation."""
    # extract the used quantities (to enable type-hinting)
    asset: Articulation = env.scene[asset_cfg.name]
    cmd = torch.linalg.norm(env.command_manager.get_command(command_name), dim=1)  # command norm; larger means a more definite velocity command
    body_vel = torch.linalg.norm(asset.data.root_lin_vel_b[:, :2], dim=1)  # robot speed in the xy plane
    running_reward = torch.linalg.norm(
        (asset.data.joint_pos[:, asset_cfg.joint_ids] - asset.data.default_joint_pos[:, asset_cfg.joint_ids]), dim=1
    )  # how far the joints deviate from their default positions
    reward = torch.where(
        torch.logical_or(cmd > command_threshold, body_vel > velocity_threshold),  # whenever a task is commanded or the robot is moving
        running_reward,  # use the base penalty
        stand_still_scale * running_reward,  # otherwise (standing still) amplify the penalty
    )
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

This reward penalizes joint deviation from the default pose at different strengths depending on whether a task is commanded and whether the robot is moving: the penalty is relaxed while commanded or moving (preserving agility) and amplified while idle (preserving stability). See the comments for details.
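Per environment, the torch.where gate reduces to a scalar rule; this simplified sketch (illustrative thresholds and scale, not the repo's values) makes the two branches explicit:

```python
def joint_pos_penalty_scalar(
    joint_err: float,
    cmd_norm: float,
    body_vel: float,
    command_threshold: float,
    velocity_threshold: float,
    stand_still_scale: float,
) -> float:
    """Scalar version of the batched torch.where gate above."""
    if cmd_norm > command_threshold or body_vel > velocity_threshold:
        return joint_err                   # commanded or moving: base penalty
    return stand_still_scale * joint_err   # idle: amplified penalty

# same joint deviation, penalized 4x harder when idle (example numbers)
print(joint_pos_penalty_scalar(0.2, 0.0, 0.01, 0.1, 0.1, 4.0))  # 0.8
print(joint_pos_penalty_scalar(0.2, 0.5, 0.01, 0.1, 0.1, 4.0))  # 0.2
```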

wheel_vel_penalty

GaitReward

joint_mirror #Flat=-0.05

def joint_mirror(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg, mirror_joints: list[list[str]]) -> torch.Tensor:
    # extract the used quantities (to enable type-hinting)
    asset: Articulation = env.scene[asset_cfg.name]
    # build the joint-index cache on first use to speed up lookups
    if not hasattr(env, "joint_mirror_joints_cache") or env.joint_mirror_joints_cache is None:
        # Cache joint positions for all pairs
        env.joint_mirror_joints_cache = [
            [asset.find_joints(joint_name) for joint_name in joint_pair] for joint_pair in mirror_joints
        ]
    reward = torch.zeros(env.num_envs, device=env.device)
    # iterate over all joint pairs, accumulating the symmetry error of each mirrored pair
    for joint_pair in env.joint_mirror_joints_cache:
        # Calculate the difference for each pair and add to the total reward
        diff = torch.sum(
            torch.square(asset.data.joint_pos[:, joint_pair[0][0]] - asset.data.joint_pos[:, joint_pair[1][0]]),
            dim=-1,
        )
        reward += diff
    # if there are mirror pairs (len(mirror_joints) > 0), normalize the penalty by the number of pairs; otherwise set it to 0
    reward *= 1 / len(mirror_joints) if len(mirror_joints) > 0 else 0
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

This term penalizes position differences between mirrored joint pairs, pushing the joints to stay mirror-symmetric. See the comments for details.

action_mirror

action_sync

feet_air_time_positive_biped

def feet_air_time_positive_biped(env, command_name: str, threshold: float, sensor_cfg: SceneEntityCfg) -> torch.Tensor:
    """Reward long steps taken by the feet for bipeds.

    This function rewards the agent for taking steps up to a specified threshold and also keep one foot at
    a time in the air.

    If the commands are small (i.e. the agent is not supposed to take a step), then the reward is zero.
    """
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    # compute the reward
    air_time = contact_sensor.data.current_air_time[:, sensor_cfg.body_ids]
    contact_time = contact_sensor.data.current_contact_time[:, sensor_cfg.body_ids]
    in_contact = contact_time > 0.0
    in_mode_time = torch.where(in_contact, contact_time, air_time)
    single_stance = torch.sum(in_contact.int(), dim=1) == 1
    reward = torch.min(torch.where(single_stance.unsqueeze(-1), in_mode_time, 0.0), dim=1)[0]
    reward = torch.clamp(reward, max=threshold)
    # no reward for zero command
    reward *= torch.norm(env.command_manager.get_command(command_name)[:, :2], dim=1) > 0.1
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Rewards a bipedal robot for taking longer steps while keeping one foot at a time in the air.

feet_air_time_variance_penalty

Penalizes the variance of the feet's air times.
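The body of this function is not listed in this post; the following is only a minimal sketch of what a variance penalty over per-foot air times could look like (the repo's exact formula may differ, e.g. in command masking and normalization):

```python
def air_time_variance(air_times: list[float]) -> float:
    """Population variance of each foot's last air time:
    uneven step timing (limping) raises the penalty."""
    n = len(air_times)
    mean = sum(air_times) / n
    return sum((t - mean) ** 2 for t in air_times) / n

print(air_time_variance([0.3, 0.3, 0.3, 0.3]))           # 0.0: perfectly even gait
print(round(air_time_variance([0.5, 0.1, 0.5, 0.1]), 3)) # 0.04: uneven gait
```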

feet_contact

def feet_contact(
    env: ManagerBasedRLEnv, command_name: str, expect_contact_num: int, sensor_cfg: SceneEntityCfg
) -> torch.Tensor:
    """Reward feet contact"""
    # extract the used quantities (to enable type-hinting)
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    # compute the reward
    contact = contact_sensor.compute_first_contact(env.step_dt)[:, sensor_cfg.body_ids]
    contact_num = torch.sum(contact, dim=1)
    reward = (contact_num != expect_contact_num).float()  # 1 when the contact count differs from the expected number
    # no reward for zero command
    reward *= torch.linalg.norm(env.command_manager.get_command(command_name), dim=1) > 0.5
    # zero out the penalty once the robot has fallen
    reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes the number of feet in contact differing from the expected count; see the comments. Mainly useful for gait control and posture stability.

feet_contact_without_cmd #Flat=0.1

def feet_contact_without_cmd(env: ManagerBasedRLEnv, command_name: str, sensor_cfg: SceneEntityCfg) -> torch.Tensor:
    """Reward feet contact"""
    # extract the used quantities (to enable type-hinting)
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    # compute the reward
    contact = contact_sensor.compute_first_contact(env.step_dt)[:, sensor_cfg.body_ids]
    # print(contact, "contact")
    reward = torch.sum(contact, dim=-1).float()
    # print(reward, "reward after sum")
    reward *= torch.linalg.norm(env.command_manager.get_command(command_name), dim=1) < 0.5
    # print(env.command_manager.get_command(command_name), "env.command_manager.get_command(command_name)")
    # print(reward, "reward after multiply")
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

When there is no command, rewards the number of feet in contact with the ground.

feet_stumble #Flat=-0.1

def feet_stumble(env: ManagerBasedRLEnv, sensor_cfg: SceneEntityCfg) -> torch.Tensor:
    # extract the used quantities (to enable type-hinting)
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    forces_z = torch.abs(contact_sensor.data.net_forces_w[:, sensor_cfg.body_ids, 2])
    forces_xy = torch.linalg.norm(contact_sensor.data.net_forces_w[:, sensor_cfg.body_ids, :2], dim=2)
    # Penalize feet hitting vertical surfaces
    reward = torch.any(forces_xy > 4 * forces_z, dim=1).float()
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes stumbling, triggered when the horizontal contact force exceeds four times the vertical force.
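Reduced to a single foot, the trigger condition looks like this (same 4:1 ratio as the code above; the helper name is illustrative):

```python
import math

def is_stumble(fx: float, fy: float, fz: float, ratio: float = 4.0) -> bool:
    """A foot 'stumbles' when the tangential contact force dominates the normal
    force, i.e. it is pushing against a vertical surface rather than standing
    on the ground."""
    return math.hypot(fx, fy) > ratio * abs(fz)

print(is_stumble(fx=20.0, fy=0.0, fz=10.0))  # False: mostly vertical load
print(is_stumble(fx=50.0, fy=0.0, fz=10.0))  # True: foot jammed against a step
```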

feet_distance_y_exp

def feet_distance_y_exp(
    env: ManagerBasedRLEnv, stance_width: float, std: float, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    asset: RigidObject = env.scene[asset_cfg.name]
    # convert world-frame foot positions to positions relative to the base
    cur_footsteps_translated = asset.data.body_link_pos_w[:, asset_cfg.body_ids, :] - asset.data.root_link_pos_w[
        :, :
    ].unsqueeze(1)
    # rotate into the body frame using the base orientation to get each foot's coordinates
    n_feet = len(asset_cfg.body_ids)
    footsteps_in_body_frame = torch.zeros(env.num_envs, n_feet, 3, device=env.device)
    for i in range(n_feet):
        footsteps_in_body_frame[:, i, :] = math_utils.quat_apply(
            math_utils.quat_conjugate(asset.data.root_link_quat_w), cur_footsteps_translated[:, i, :]
        )
    side_sign = torch.tensor(
        [1.0 if i % 2 == 0 else -1.0 for i in range(n_feet)],
        device=env.device,
    )
    # desired lateral (y) position for each foot
    stance_width_tensor = stance_width * torch.ones([env.num_envs, 1], device=env.device)
    desired_ys = stance_width_tensor / 2 * side_sign.unsqueeze(0)
    # squared deviation from the desired positions
    stance_diff = torch.square(desired_ys - footsteps_in_body_frame[:, :, 1])
    # sum over feet and apply the exponential kernel
    reward = torch.exp(-torch.sum(stance_diff, dim=1) / (std**2))
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Rewards how closely the feet's lateral (y-axis) stance positions match the expected stance_width.
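The side_sign / desired_ys construction assigns alternating left/right targets; as a scalar sketch (helper name is illustrative):

```python
def desired_foot_y(stance_width: float, n_feet: int = 4) -> list[float]:
    """Even-indexed feet sit on the left (+y), odd-indexed on the right (-y),
    each half a stance width from the body centerline (mirrors side_sign above)."""
    return [stance_width / 2 * (1.0 if i % 2 == 0 else -1.0) for i in range(n_feet)]

print(desired_foot_y(0.3))  # [0.15, -0.15, 0.15, -0.15]
```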

feet_distance_xy_exp

Similar to the above, but also includes the position deviation along the x axis.

feet_height

def feet_height(
    env: ManagerBasedRLEnv,
    command_name: str,
    asset_cfg: SceneEntityCfg,
    target_height: float,
    tanh_mult: float,
) -> torch.Tensor:
    """Reward the swinging feet for clearing a specified height off the ground"""
    asset: RigidObject = env.scene[asset_cfg.name]
    foot_z_target_error = torch.square(asset.data.body_pos_w[:, asset_cfg.body_ids, 2] - target_height)
    # foot_velocity_tanh = torch.tanh(
    #     tanh_mult * torch.linalg.norm(asset.data.body_lin_vel_w[:, asset_cfg.body_ids, :2], dim=2)
    # )
    # reward = torch.sum(foot_z_target_error * foot_velocity_tanh, dim=1)
    reward = torch.sum(foot_z_target_error, dim=1)
    # print(foot_z_target_error, "foot_z_target_error")
    # no reward for zero command
    reward *= torch.linalg.norm(env.command_manager.get_command(command_name), dim=1) > 0.2
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes the deviation of foot height from the target height.

feet_height_body #Flat=-5.0

def feet_height_body(
    env: ManagerBasedRLEnv,
    command_name: str,
    asset_cfg: SceneEntityCfg,
    target_height: float,
    tanh_mult: float,
) -> torch.Tensor:
    """Reward the swinging feet for clearing a specified height off the ground"""
    asset: RigidObject = env.scene[asset_cfg.name]
    cur_footpos_translated = asset.data.body_pos_w[:, asset_cfg.body_ids, :] - asset.data.root_pos_w[:, :].unsqueeze(1)
    footpos_in_body_frame = torch.zeros(env.num_envs, len(asset_cfg.body_ids), 3, device=env.device)
    cur_footvel_translated = asset.data.body_lin_vel_w[:, asset_cfg.body_ids, :] - asset.data.root_lin_vel_w[
        :, :
    ].unsqueeze(1)
    footvel_in_body_frame = torch.zeros(env.num_envs, len(asset_cfg.body_ids), 3, device=env.device)
    for i in range(len(asset_cfg.body_ids)):
        # foot position in the body frame (gives the z height)
        footpos_in_body_frame[:, i, :] = math_utils.quat_apply_inverse(
            asset.data.root_quat_w, cur_footpos_translated[:, i, :]
        )
        # foot velocity in the body frame (gives the xy speed)
        footvel_in_body_frame[:, i, :] = math_utils.quat_apply_inverse(
            asset.data.root_quat_w, cur_footvel_translated[:, i, :]
        )
    # squared height error
    foot_z_target_error = torch.square(footpos_in_body_frame[:, :, 2] - target_height).view(env.num_envs, -1)
    # velocity-based weight
    foot_velocity_tanh = torch.tanh(tanh_mult * torch.norm(footvel_in_body_frame[:, :, :2], dim=2))
    reward = torch.sum(foot_z_target_error * foot_velocity_tanh, dim=1)
    reward *= torch.linalg.norm(env.command_manager.get_command(command_name), dim=1) > 0.5
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Similar to the above, but it measures height in the body frame and adds a velocity weight: the faster the foot swings, the larger the height-deviation penalty, which helps prevent collisions.
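The velocity weighting can be seen in isolation: tanh maps swing speed into [0, 1), so the height error of a nearly stationary (stance) foot is almost ignored. A scalar sketch with an assumed tanh_mult and illustrative body-frame numbers:

```python
import math

def weighted_height_error(foot_z: float, target_height: float,
                          foot_speed_xy: float, tanh_mult: float = 2.0) -> float:
    """Squared height error scaled by a tanh of the foot's planar speed."""
    return (foot_z - target_height) ** 2 * math.tanh(tanh_mult * foot_speed_xy)

# same height error, different swing speeds
print(round(weighted_height_error(-0.30, -0.20, 0.05), 4))  # 0.001: slow foot, small penalty
print(round(weighted_height_error(-0.30, -0.20, 1.0), 4))   # 0.0096: fast foot, near-full penalty
```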

feet_slide #Flat=-0.1

def feet_slide(
    env: ManagerBasedRLEnv, sensor_cfg: SceneEntityCfg, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")
) -> torch.Tensor:
    """Penalize feet sliding.

    This function penalizes the agent for sliding its feet on the ground. The reward is computed as the
    norm of the linear velocity of the feet multiplied by a binary contact sensor. This ensures that the
    agent is penalized only when the feet are in contact with the ground.
    """
    # Penalize feet sliding
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    contacts = contact_sensor.data.net_forces_w_history[:, :, sensor_cfg.body_ids, :].norm(dim=-1).max(dim=1)[0] > 1.0  # contact force > 1 counts as contact
    asset: RigidObject = env.scene[asset_cfg.name]

    # feet_vel = asset.data.body_lin_vel_w[:, asset_cfg.body_ids, :2]
    # reward = torch.sum(feet_vel.norm(dim=-1) * contacts, dim=1)
    # step 1: foot velocity relative to the root, in the world frame
    cur_footvel_translated = asset.data.body_lin_vel_w[:, asset_cfg.body_ids, :] - asset.data.root_lin_vel_w[
        :, :
    ].unsqueeze(1)
    # step 2: allocate storage for body-frame foot velocities
    footvel_in_body_frame = torch.zeros(env.num_envs, len(asset_cfg.body_ids), 3, device=env.device)
    # step 3: rotate each foot's relative velocity from the world frame into the body frame
    for i in range(len(asset_cfg.body_ids)):
        footvel_in_body_frame[:, i, :] = math_utils.quat_apply_inverse(
            asset.data.root_quat_w, cur_footvel_translated[:, i, :]
        )
    # lateral (xy) sliding speed (norm) of each foot
    foot_leteral_vel = torch.sqrt(torch.sum(torch.square(footvel_in_body_frame[:, :, :2]), dim=2)).view(
        env.num_envs, -1
    )
    reward = torch.sum(foot_leteral_vel * contacts, dim=1)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes the feet sliding against the ground.

upward #Flat=0.15

def upward(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Reward the base staying upright (body z axis anti-aligned with gravity)."""
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    # projected_gravity_b[:, 2] measures how aligned the body z axis is with gravity: -1 when fully upright, approaching 1 when fully fallen over
    reward = torch.square(1 - asset.data.projected_gravity_b[:, 2])
    return reward

Rewards keeping the body z axis aligned against gravity (i.e., staying upright).
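projected_gravity_b[:, 2] is the body-frame z component of the unit gravity direction, i.e. -cos(tilt) for a tilt angle measured from vertical, so the reward peaks at 4 when upright and falls to 0 when upside down:

```python
import math

def upward_reward(tilt_rad: float) -> float:
    """(1 - g_z)^2 with g_z = projected_gravity_b[:, 2] = -cos(tilt)."""
    g_z = -math.cos(tilt_rad)
    return (1.0 - g_z) ** 2

print(upward_reward(0.0))                    # 4.0: fully upright, maximum reward
print(round(upward_reward(math.pi / 4), 3))  # tilted 45 degrees
print(upward_reward(math.pi))                # 0.0: upside down, no reward
```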

base_height_l2 #Flat=-10

def base_height_l2(
    env: ManagerBasedRLEnv,
    target_height: float,
    asset_cfg: SceneEntityCfg = SceneEntityCfg("robot"),
    sensor_cfg: SceneEntityCfg | None = None,
) -> torch.Tensor:
    """Penalize asset height from its target using L2 squared kernel.

    Note:
        For flat terrain, target height is in the world frame. For rough terrain,
        sensor readings can adjust the target height to account for the terrain.
    """
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    if sensor_cfg is not None:
        sensor: RayCaster = env.scene[sensor_cfg.name]
        # Adjust the target height using the sensor data
        ray_hits = sensor.data.ray_hits_w[..., 2]
        if torch.isnan(ray_hits).any() or torch.isinf(ray_hits).any() or torch.max(torch.abs(ray_hits)) > 1e6:
            adjusted_target_height = asset.data.root_link_pos_w[:, 2]
        else:
            adjusted_target_height = target_height + torch.mean(ray_hits, dim=1)
    else:
        # Use the provided target height directly for flat terrain
        adjusted_target_height = target_height
    # Compute the L2 squared penalty
    reward = torch.square(asset.data.root_pos_w[:, 2] - adjusted_target_height)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes the deviation of the base height from the target height.

lin_vel_z_l2 #Flat=-2

def lin_vel_z_l2(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Penalize z-axis base linear velocity using L2 squared kernel."""
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    reward = torch.square(asset.data.root_lin_vel_b[:, 2])
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes base motion along the z axis.

ang_vel_xy_l2 #Flat=-0.05

def ang_vel_xy_l2(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Penalize xy-axis base angular velocity using L2 squared kernel."""
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    reward = torch.sum(torch.square(asset.data.root_ang_vel_b[:, :2]), dim=1)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes base rotation about the x and y axes.

undesired_contacts

def undesired_contacts(env: ManagerBasedRLEnv, threshold: float, sensor_cfg: SceneEntityCfg) -> torch.Tensor:
    """Penalize undesired contacts as the number of violations that are above a threshold."""
    # extract the used quantities (to enable type-hinting)
    contact_sensor: ContactSensor = env.scene.sensors[sensor_cfg.name]
    # check if contact force is above threshold
    net_contact_forces = contact_sensor.data.net_forces_w_history
    is_contact = torch.max(torch.norm(net_contact_forces[:, :, sensor_cfg.body_ids], dim=-1), dim=1)[0] > threshold
    # sum over contacts for each environment
    reward = torch.sum(is_contact, dim=1).float()
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

flat_orientation_l2

def flat_orientation_l2(env: ManagerBasedRLEnv, asset_cfg: SceneEntityCfg = SceneEntityCfg("robot")) -> torch.Tensor:
    """Penalize non-flat base orientation using L2 squared kernel.

    This is computed by penalizing the xy-components of the projected gravity vector.
    """
    # extract the used quantities (to enable type-hinting)
    asset: RigidObject = env.scene[asset_cfg.name]
    reward = torch.sum(torch.square(asset.data.projected_gravity_b[:, :2]), dim=1)
    # reward *= torch.clamp(-env.scene["robot"].data.projected_gravity_b[:, 2], 0, 0.7) / 0.7
    return reward

Penalizes a non-level base orientation, using the xy components of the projected gravity vector to measure how level the base is.


All of the reward terms above appear in rewards.py; a few more are used directly from isaaclab.envs.mdp:

joint_torques_l2  # -2.5e-05
joint_acc_l2      # -1e-07
joint_pos_limits  # -5.0
action_rate_l2    # -0.01
contact_forces    # -0.00015
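However they are defined, all terms are combined the same way: the reward manager scales each raw value by its configured weight (negative weights turn raw costs into penalties) and sums them per step. A minimal sketch using two of the weights listed above with illustrative raw values (IsaacLab additionally scales each term by the step dt, omitted here):

```python
def total_reward(term_values: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of raw reward terms, one scalar per step."""
    return sum(weights[name] * value for name, value in term_values.items())

weights = {"joint_torques_l2": -2.5e-05, "action_rate_l2": -0.01}
raw = {"joint_torques_l2": 400.0, "action_rate_l2": 0.5}  # illustrative raw values
print(round(total_reward(raw, weights), 6))  # -0.015
```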