jaxdem.rl.environments.swarm_navigator#

Multi-agent 2-D swarm navigation with potential-based rewards.

Classes

SwarmNavigator(state, system, env_params, ...)

Multi-agent 2-D swarm navigation with potential-based rewards.

class jaxdem.rl.environments.swarm_navigator.SwarmNavigator(state: State, system: System, env_params: dict[str, Any], n_lidar_rays: int, k_objectives: int)#

Bases: Environment

Multi-agent 2-D swarm navigation with potential-based rewards.

Each agent controls a force vector applied directly to a sphere inside a reflective box. Viscous drag -friction * vel is added every step. Objectives are shared among all agents; each agent dynamically tracks its k nearest objectives. The potential-based shaping signal is computed independently for each of the k objectives and summed. Occupancy is determined via strict symmetry breaking: only the closest agent to each objective within the activation threshold may claim it.

Reward

\[R_i = w_s\,\sum_{j \in \text{top-}k} (e^{-2d_{ij}} - e^{-2d_{ij}^{\mathrm{prev}}}) + w_g\,\mathbf{1}[d_i < f \cdot r_i] - w_c\,\left\|\sum_j l_j\,\hat{r}_j\right\| - w_w\,\|a_i\|^2 + w_v\,\mathbf{1}[\text{all }k\text{ occupied}] - \bar{r}_i\]

where \(\bar{r}_i\) is an EMA baseline updated with factor \(\alpha\). All weights are constructor parameters stored in env_params.
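The shaping and baseline terms above can be sketched in a few lines. The exponential form and the weights `w_s = 2.0`, `alpha = 0.07` follow the constructor defaults below; the exact EMA update convention is an assumption, not the library's code.

```python
import numpy as np

# Sketch of the potential-based shaping term and EMA baseline from the
# reward above. The EMA update form is assumed; w_s and alpha match the
# Create() defaults (shaping_weight=2.0, alpha_r_bar=0.07).
w_s, alpha = 2.0, 0.07

def shaping(d, d_prev):
    """Shaping signal summed over the k tracked objectives."""
    return w_s * np.sum(np.exp(-2.0 * d) - np.exp(-2.0 * d_prev))

def ema_update(r_bar, r):
    """Differential baseline r_bar updated with factor alpha (assumed form)."""
    return (1.0 - alpha) * r_bar + alpha * r

d_prev = np.array([1.0, 0.5])   # distances at the previous step
d      = np.array([0.8, 0.5])   # the agent moved closer to objective 0
r = shaping(d, d_prev)          # positive: net progress toward objectives
r_bar = ema_update(0.0, r)
```

Because the shaping is a difference of potentials, it is positive only when the agent makes net progress toward its tracked objectives between consecutive steps.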

Notes

The observation vector per agent is:

Feature                               Size
------------------------------------  -------------------
Velocity                              dim
LiDAR proximity                       n_lidar_rays
LiDAR radial relative velocity        n_lidar_rays
LiDAR objective proximity             n_lidar_rays
Unit direction to top k objectives    k_objectives * dim
Clamped displacement to top k         k_objectives * dim
Occupancy status of top k             k_objectives

n_lidar_rays: int#

Number of angular bins for the agent-to-agent LiDAR sensor.

k_objectives: int#

Number of closest objectives tracked per agent.

classmethod Create(N: int = 64, min_box_size: float = 1.0, max_box_size: float = 1.0, box_padding: float = 20.0, max_steps: int = 5760, friction: float = 0.2, shaping_weight: float = 2.0, goal_weight: float = 0.001, crowding_weight: float = 0.005, work_weight: float = 0.0005, vacancy_weight: float = 0.005, goal_radius_factor: float = 1.0, alpha_r_bar: float = 0.07, lidar_range: float = 0.4, n_lidar_rays: int = 8, k_objectives: int = 5) SwarmNavigator[source]#

Create a swarm navigator environment.

Parameters:
  • N (int) – Number of agents.

  • min_box_size (float) – Minimum side length of the random square domain sampled at each reset().

  • max_box_size (float) – Maximum side length of the random square domain sampled at each reset().

  • box_padding (float) – Extra padding around the domain in multiples of the particle radius.

  • max_steps (int) – Episode length in physics steps.

  • friction (float) – Viscous drag coefficient applied as -friction * vel.

  • shaping_weight (float) – Multiplier \(w_s\) on the potential-based shaping signal summed over the k nearest objectives.

  • goal_weight (float) – Bonus \(w_g\) for uniquely claiming a target.

  • crowding_weight (float) – Penalty \(w_c\) per unit of LiDAR crowding vector norm.

  • work_weight (float) – Weight \(w_w\) of the quadratic action penalty \(\|a\|^2\).

  • vacancy_weight (float) – Reward \(w_v\) granted when all k nearest objectives are occupied.

  • goal_radius_factor (float) – Multiplicative factor \(f\) applied to the particle radius to define the goal activation threshold \(d < f \cdot r\).

  • alpha_r_bar (float) – EMA smoothing factor \(\alpha\) for the differential reward baseline \(\bar{r}\).

  • lidar_range (float) – Maximum detection range for the LiDAR sensor.

  • n_lidar_rays (int) – Number of angular LiDAR bins spanning \([-\pi, \pi)\).

  • k_objectives (int) – Number of closest objectives tracked per agent.

Returns:

A freshly constructed environment (call reset() before use).

Return type:

SwarmNavigator
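The agent-to-agent LiDAR bins angles over \([-\pi, \pi)\) as described above. A hedged sketch of how neighbour offsets could map to per-bin proximity readings; `lidar_proximity` is an illustrative helper, not the library's implementation:

```python
import numpy as np

# Illustrative LiDAR binning: n_rays angular bins over [-pi, pi),
# proximity in [0, 1] with 1 = touching and 0 = at or beyond range.
# Values match the Create() defaults (n_lidar_rays=8, lidar_range=0.4).
n_rays, lidar_range = 8, 0.4

def lidar_proximity(rel):
    """rel: (M, 2) offsets from one agent to its M neighbours."""
    theta = np.arctan2(rel[:, 1], rel[:, 0])                     # angle in [-pi, pi)
    bins = ((theta + np.pi) / (2.0 * np.pi) * n_rays).astype(int) % n_rays
    dist = np.linalg.norm(rel, axis=1)
    prox = np.clip(1.0 - dist / lidar_range, 0.0, 1.0)
    out = np.zeros(n_rays)
    np.maximum.at(out, bins, prox)                               # keep closest hit per bin
    return out

rays = lidar_proximity(np.array([[0.2, 0.0]]))  # one neighbour straight ahead (+x)
```

With eight bins, a neighbour along +x (angle 0) falls into bin 4, and a neighbour at half the LiDAR range reads a proximity of 0.5.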

static reset(env: SwarmNavigator, key: ArrayLike) Environment[source]#

Reset the environment to a random initial configuration.

Parameters:
  • env (Environment) – The environment instance to reset.

  • key (ArrayLike) – PRNG key used to sample the domain, positions, objectives, and initial velocities.

Returns:

The environment with a fresh episode state.

Return type:

Environment
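Reset consumes a standard JAX PRNG key. A minimal sketch of the kind of sampling it performs, with illustrative values (box side between 1.0 and 2.0, 64 agents in 2-D, small initial velocities); the exact sampling logic is the library's, not shown here:

```python
import jax

# Split one key into independent streams for the domain, positions,
# and velocities, as is idiomatic in JAX.
key = jax.random.PRNGKey(0)
k_box, k_pos, k_vel = jax.random.split(key, 3)

box = jax.random.uniform(k_box, (), minval=1.0, maxval=2.0)     # side length
pos = jax.random.uniform(k_pos, (64, 2), minval=0.0, maxval=box)
vel = 0.01 * jax.random.normal(k_vel, (64, 2))                  # small initial velocities
```

Splitting the key ensures each sampled quantity is statistically independent while the whole reset stays reproducible from a single seed.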

static step(env: SwarmNavigator, action: Array) Environment[source]#

Advance the environment by one physics step.

Applies force actions with viscous drag. After integration the method updates all sensor caches and computes the reward with a differential baseline. The shaping signal is summed over the k nearest objectives.

Parameters:
  • env (Environment) – Current environment.

  • action (jax.Array) – Force actions for every agent, shape (N * dim,).

Returns:

Updated environment after physics integration, sensor updates, and reward computation.

Return type:

Environment
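The per-step update can be sketched as force integration with the viscous drag term `-friction * vel`. Semi-implicit Euler and `dt = 0.01` are assumptions here; the actual integrator comes from the System object:

```python
import numpy as np

# Drag-damped integration sketch, assuming semi-implicit Euler.
# friction matches the Create() default (0.2); dt is illustrative.
friction, dt = 0.2, 0.01

def step_once(pos, vel, action):
    """One integration step; action is a per-agent force, shape (N, dim)."""
    acc = action - friction * vel   # viscous drag -friction * vel
    vel = vel + dt * acc            # semi-implicit Euler (assumed)
    pos = pos + dt * vel
    return pos, vel

pos = np.zeros((4, 2))
vel = np.ones((4, 2))
pos, vel = step_once(pos, vel, np.zeros((4, 2)))  # no force: drag slows agents
```

With zero applied force, the drag term strictly shrinks every velocity component each step, so unactuated agents coast to a stop.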

static observation(env: SwarmNavigator) Array[source]#

Build the per-agent observation vector from cached sensors.

All state-dependent components are pre-computed in step() and reset(). This method only concatenates cached arrays.

Returns:

Observation matrix of shape (N, obs_dim). See the class docstring for the feature layout.

Return type:

jax.Array

static reward(env: SwarmNavigator) Array[source]#

Return the reward cached by step().

Returns:

Reward vector of shape (N,).

Return type:

jax.Array

static done(env: SwarmNavigator) Array[source]#

Return True once the episode step count exceeds max_steps.

property action_space_size: int[source]#

Number of scalar actions per agent (equal to dim).

property action_space_shape: tuple[int][source]#

Shape of a single agent’s action ((dim,)).

property observation_space_size: int[source]#

Dimensionality of a single agent’s observation vector.