jaxdem.rl.environments.swarm_navigator#
Multi-agent 2-D swarm navigation with potential-based rewards.
Classes

SwarmNavigator    Multi-agent 2-D swarm navigation with potential-based rewards.
- class jaxdem.rl.environments.swarm_navigator.SwarmNavigator(state: State, system: System, env_params: dict[str, Any], n_lidar_rays: int, k_objectives: int)#
Bases: Environment

Multi-agent 2-D swarm navigation with potential-based rewards.
Each agent controls a force vector applied directly to a sphere inside a reflective box. Viscous drag (-friction * vel) is added every step. Objectives are shared among all agents; each agent dynamically tracks its k nearest objectives. The potential-based shaping signal is computed independently for each of the k objectives and summed. Occupancy is determined via strict symmetry breaking: only the closest agent to each objective within the activation threshold may claim it.

Reward

\[R_i = w_s\,\sum_{j \in \text{top-}k} (e^{-2d_{ij}} - e^{-2d_{ij}^{\mathrm{prev}}}) + w_g\,\mathbf{1}[d_i < f \cdot r_i] - w_c\,\left\|\sum_j l_j\,\hat{r}_j\right\| - w_w\,\|a_i\|^2 + w_v\,\mathbf{1}[\text{all }k\text{ occupied}] - \bar{r}_i\]

where \(\bar{r}_i\) is an EMA baseline updated with factor \(\alpha\). All weights are constructor parameters stored in env_params.

Notes
The observation vector per agent is:

Feature                                 Size
Velocity                                dim
LiDAR proximity                         n_lidar_rays
LiDAR radial relative velocity          n_lidar_rays
LiDAR objective proximity               n_lidar_rays
Unit direction to top k objectives      k_objectives * dim
Clamped displacement to top k           k_objectives * dim
Occupancy status of top k               k_objectives

- n_lidar_rays: int#
Number of angular bins for the agent-to-agent LiDAR sensor.
- k_objectives: int#
Number of closest objectives tracked per agent.
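The shaping sum in the reward above can be sketched in plain Python (a hypothetical standalone helper, not part of the jaxdem API; the distances from one agent to its top-k objectives are assumed precomputed):

```python
import math

def shaping_term(d, d_prev, w_s=2.0):
    """Potential-based shaping summed over the k nearest objectives.

    d, d_prev: distances from one agent to its top-k objectives at the
    current and previous step. The potential exp(-2*d) grows as the agent
    approaches an objective, so moving closer yields a positive difference.
    """
    return w_s * sum(math.exp(-2.0 * dj) - math.exp(-2.0 * dp)
                     for dj, dp in zip(d, d_prev))

# Moving closer to every tracked objective gives a positive signal:
print(shaping_term([0.5, 1.0], [0.6, 1.2]) > 0.0)  # True
```

Because the signal is a difference of potentials, an agent that does not move receives exactly zero shaping reward.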
- classmethod Create(N: int = 64, min_box_size: float = 1.0, max_box_size: float = 1.0, box_padding: float = 20.0, max_steps: int = 5760, friction: float = 0.2, shaping_weight: float = 2.0, goal_weight: float = 0.001, crowding_weight: float = 0.005, work_weight: float = 0.0005, vacancy_weight: float = 0.005, goal_radius_factor: float = 1.0, alpha_r_bar: float = 0.07, lidar_range: float = 0.4, n_lidar_rays: int = 8, k_objectives: int = 5) → SwarmNavigator[source]#
Create a swarm navigator environment.
- Parameters:
N (int) – Number of agents.
min_box_size (float) – Lower bound for the random square domain side length sampled at each reset().
max_box_size (float) – Upper bound for the random square domain side length sampled at each reset().
box_padding (float) – Extra padding around the domain in multiples of the particle radius.
max_steps (int) – Episode length in physics steps.
friction (float) – Viscous drag coefficient applied as -friction * vel.
shaping_weight (float) – Multiplier \(w_s\) on the potential-based shaping signal summed over the k nearest objectives.
goal_weight (float) – Bonus \(w_g\) for uniquely claiming a target.
crowding_weight (float) – Penalty \(w_c\) per unit of LiDAR crowding vector norm.
work_weight (float) – Weight \(w_w\) of the quadratic action penalty \(\|a\|^2\).
vacancy_weight (float) – Reward \(w_v\) granted when all k nearest objectives are occupied.
goal_radius_factor (float) – Multiplicative factor \(f\) applied to the particle radius to define the goal activation threshold \(d < f \cdot r\).
alpha_r_bar (float) – EMA smoothing factor \(\alpha\) for the differential reward baseline \(\bar{r}\).
lidar_range (float) – Maximum detection range for the LiDAR sensor.
n_lidar_rays (int) – Number of angular LiDAR bins spanning \([-\pi, \pi)\).
k_objectives (int) – Number of closest objectives tracked per agent.
- Returns:
A freshly constructed environment (call reset() before use).
- Return type:
SwarmNavigator
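The differential baseline driven by alpha_r_bar can be sketched as a plain EMA update (a hypothetical helper; whether the baseline is updated before or after the subtraction inside step() is an assumption here):

```python
def update_baseline(r_bar, r, alpha=0.07):
    """EMA baseline for the differential reward.

    r_bar <- (1 - alpha) * r_bar + alpha * r, and the reward reported to
    the learner is the raw reward minus the updated baseline.
    """
    r_bar = (1.0 - alpha) * r_bar + alpha * r
    return r_bar, r - r_bar
```

Subtracting a slowly-moving average centers the reward, so a persistent constant reward contributes nothing once the baseline has caught up.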
- static reset(env: SwarmNavigator, key: Array | ndarray | bool | number | int | float | complex | TypedNdArray) → Environment[source]#
Reset the environment to a random initial configuration.
- Parameters:
env (Environment) – The environment instance to reset.
key (ArrayLike) – PRNG key used to sample the domain, positions, objectives, and initial velocities.
- Returns:
The environment with a fresh episode state.
- Return type:
Environment
- static step(env: SwarmNavigator, action: Array) → Environment[source]#
Advance the environment by one physics step.
Applies force actions with viscous drag. After integration the method updates all sensor caches and computes the reward with a differential baseline. The shaping signal is summed over the k nearest objectives.
- Parameters:
env (Environment) – Current environment.
action (jax.Array) – Force actions for every agent, shape (N * dim,).
- Returns:
Updated environment after physics integration, sensor updates, and reward computation.
- Return type:
Environment
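The force-plus-drag update can be illustrated with an explicit Euler step (the integrator, time step, and mass handling used by jaxdem are assumptions here; this only shows how -friction * vel enters the dynamics):

```python
def euler_drag_step(pos, vel, force, friction=0.2, dt=0.01, mass=1.0):
    """One explicit Euler step with viscous drag -friction * vel.

    The agent's action enters as a force; drag opposes the current
    velocity, so an unforced particle slows down every step.
    """
    acc = (force - friction * vel) / mass
    vel = vel + dt * acc
    pos = pos + dt * vel
    return pos, vel

# With zero force, drag alone decays the speed below its initial value:
_, v = euler_drag_step(0.0, 1.0, 0.0)
```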
- static observation(env: SwarmNavigator) → Array[source]#
Build the per-agent observation vector from cached sensors.
All state-dependent components are pre-computed in step() and reset(). This method only concatenates cached arrays.
- Returns:
Observation matrix of shape (N, obs_dim). See the class docstring for the feature layout.
- Return type:
jax.Array
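The feature table in the class docstring implies a per-agent observation size that can be sanity-checked (assuming dim = 2 for this 2-D environment; the helper below is illustrative, not part of the API):

```python
def obs_dim(dim=2, n_lidar_rays=8, k_objectives=5):
    """Per-agent observation size from the documented feature layout."""
    return (dim                       # velocity
            + 3 * n_lidar_rays        # proximity, radial rel. velocity, objective proximity
            + 2 * k_objectives * dim  # unit direction + clamped displacement to top k
            + k_objectives)           # occupancy status of top k

print(obs_dim())  # 51 with the Create() defaults
```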
- static reward(env: SwarmNavigator) → Array[source]#
Return the reward cached by step().
- Returns:
Reward vector of shape (N,).
- Return type:
jax.Array
- static done(env: SwarmNavigator) → Array[source]#
Return True when the episode has exceeded max_steps.
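The termination rule reduces to a step-count comparison (a hypothetical standalone check; the exact comparison operator used internally is an assumption):

```python
def episode_done(step_count, max_steps=5760):
    """True once the episode has reached the max_steps budget."""
    return step_count >= max_steps
```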