Dynamic Assortment

Select which K items to offer at each step to maximize revenue, while customer preferences evolve dynamically with purchase history (hype and saturation effects).

using DecisionFocusedLearningBenchmarks
using Plots

b = DynamicAssortmentBenchmark()

DynamicAssortmentBenchmark{false, Flux.Chain{Tuple{Flux.Dense{typeof(identity), Matrix{Float64}, Vector{Float64}}, typeof(vec)}}}(Chain(Dense(5 => 1), vec), 20, 2, 4, 80)

Observable input

Generate one environment and roll it out with the expert baseline policy (policies.expert) to collect a sample trajectory. At each step the agent observes item prices, hype levels, saturation, and purchase history:

policies = generate_baseline_policies(b)
env = generate_environments(b, 1)[1]
_, trajectory = evaluate_policy!(policies.expert, env)

(665.8147598999699, DataSample{@NamedTuple{instance::Tuple{Matrix{Float64}, Vector{Int64}}}, @NamedTuple{reward::Float64, step::Int64}, Matrix{Float64}, BitVector, Nothing}[DataSample(x=[0.276248 0.493051 … 0.359861 0.745574; 0.118557 0.160504 … 0.460234 0.59337; … ; 0.0 0.0 … 0.0 0.0; 0.0125 0.0125 … 0.0125 0.0125], y=Bool[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], instance=([2.76248 4.93051 … 3.59861 7.45574; 1.18557 1.60504 … 4.60234 5.9337; … ; 1.0837 8.79223 … 3.59 7.07633; 6.34842 2.31458 … 6.30755 6.59859], Int64[]), reward=9.49059, step=1), DataSample(x=[0.276248 0.493051 … 0.359861 0.745574; 0.118557 0.160504 … 0.460234 0.59337; … ; 0.0 0.0 … 0.0 0.0; 0.025 0.025 … 0.025 0.025], y=Bool[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], instance=([2.76248 4.93051 … 3.59861 7.45574; 1.18557 1.60504 … 4.60234 5.9337; … ; 1.0837 8.79223 … 3.59 7.07633; 6.34842 2.31458 … 6.30755 6.59859], [5]), reward=9.49059, step=2)  …  DataSample(x=[0.276248 0.493051 … 0.359861 0.745574; 0.118557 0.160504 … 0.460234 0.59337; … ; 0.0 0.0 … 0.0 0.0; 1.0 1.0 … 1.0 1.0], y=Bool[0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], instance=([2.76248 4.93051 … 3.59861 7.45574; 1.18557 1.60504 … 4.60234 5.9337; … ; 1.0837 8.79223 … 3.59 7.07633; 6.34842 2.31458 … 6.30755 6.59859], [5, 5, 5, 5, 21, 5, 5, 5, 5, 5  …  5, 10, 10, 5, 21, 10, 10, 10, 10, 11]), reward=9.49059, step=80)])

The observable state at step 1, showing the item prices (which stay fixed across steps):

plot_context(b, trajectory[1])

A training sample

Each step of the trajectory is a DataSample holding (x, θ, y) together with the state and reward:

x: (d+8) × N feature matrix per step (prices, hype, saturation, history, time)
θ: predicted utility score per item (not stored in these expert trajectories, which is why it is absent from the printout above)
y: offered assortment at this step (BitVector of length N, true = offered)
instance: full state tuple (features matrix, purchase history)
reward: price of the purchased item (0 if no purchase)
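To make these fields concrete, here is a plain-Julia sketch that decodes two values copied from the step-1 sample printed above: the offered item indices, recovered from y with findall, and the step index, recovered from the last row of x. In the printout that row grows by 0.0125 = 1/80 per step, so the sketch assumes it stores t/T with T = 80; this is a reading of the printed values, not a documented guarantee.

```julia
# Values copied from the step-1 sample printed above.
y = Bool[0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
time_feature = 0.0125              # last row of x at step 1

offered = findall(y)               # indices of the offered items
K = count(y)                       # assortment size

# Assumption: the last feature row stores t / T with horizon T = 80.
T = 80
t = round(Int, time_feature * T)

println("offered items: ", offered, " (K = ", K, "), step t = ", t)
```

The offered set [4, 5, 8, 11] contains item 5, the item recorded in the purchase history at step 2.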

One step with the offered assortment highlighted (green = offered):

plot_sample(b, trajectory[1])

A few steps side by side (prices are fixed; assortment composition changes over time):

plot_trajectory(b, trajectory[1:min(4, length(trajectory))])

DFL pipeline components

The DFL agent chains two components: a neural network predicting utility scores per item:

model = generate_statistical_model(b)     # MLP: state features → predicted utility per item

Chain(
  Dense(10 => 5),                       # 55 parameters
  Dense(5 => 1),                        # 6 parameters
  vec,
)                   # Total: 4 arrays, 61 parameters, 452 bytes.

and a maximizer offering the K items with the highest predicted utilities:

maximizer = generate_maximizer(b)         # top-K selection by predicted utility

DecisionFocusedLearningBenchmarks.Utils.TopKMaximizer(4)

At each step, the model maps the current state (prices, hype, saturation, history) to a utility score per item. The maximizer selects the K items with the highest scores.
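The two stages can be mimicked in plain Julia. The sketch below is not the package's implementation: the random weights and the helper names score_items and topk_assortment are hypothetical. It illustrates two points: a stack of affine layers applied to a (d+8) × N matrix scores every item column with the same weights, and top-K selection reduces to a partial sort of the scores.

```julia
using Random

# Hypothetical stand-in for Chain(Dense(10 => 5), Dense(5 => 1), vec):
# two affine layers with identity activation and randomly drawn weights.
rng = MersenneTwister(42)
W1, b1 = randn(rng, 5, 10), randn(rng, 5)
W2, b2 = randn(rng, 1, 5), randn(rng, 1)

# Each column of X is one item, so every item is scored with the same weights.
score_items(X) = vec(W2 * (W1 * X .+ b1) .+ b2)

# Top-K selection: mark the K highest-scoring items in a BitVector.
function topk_assortment(θ::AbstractVector{<:Real}, K::Integer)
    1 <= K <= length(θ) || throw(ArgumentError("K must lie in 1:length(θ)"))
    y = falses(length(θ))
    y[partialsortperm(θ, 1:K; rev=true)] .= true
    return y
end

N, K = 20, 4
X = rand(rng, 10, N)          # (d + 8) × N features with d = 2
θ = score_items(X)            # one score per item
y = topk_assortment(θ, K)     # offered assortment
println(findall(y))
```

Because the weights are shared across columns, the same model works for any number of items N; only the feature dimension d+8 is fixed by the first layer.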

In the Dynamic Assortment problem, a retailer has $N$ items and must select $K$ to offer at each time step. Customer preferences evolve based on purchase history through hype (recent purchases increase demand) and saturation (repeated purchases slightly decrease demand).

Mathematical Formulation

State $s_t = (p, f, h_t, \sigma_t, t, \mathcal{H}_t)$ where:

$p$: fixed item prices
$f$: static item features
$h_t, \sigma_t$: current hype and saturation levels
$t$: current time step
$\mathcal{H}_t$: purchase history (last 5 purchases)
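One way to hold this state in code is a small mutable struct. The type AssortmentState and the helper record_purchase! below are hypothetical, and the initial hype and saturation values of 1 are placeholders; the five-purchase window follows the formulation above. (The instance tuples printed earlier store the full purchase history instead.)

```julia
# Hypothetical container mirroring s_t = (p, f, h_t, σ_t, t, H_t).
mutable struct AssortmentState
    prices::Vector{Float64}       # p: fixed item prices
    features::Matrix{Float64}     # f: static item features (d × N)
    hype::Vector{Float64}         # h_t
    saturation::Vector{Float64}   # σ_t
    t::Int                        # current time step
    history::Vector{Int}          # H_t: most recent purchases, oldest first
end

const HISTORY_LENGTH = 5

# Record a purchase while keeping only the last HISTORY_LENGTH items.
function record_purchase!(s::AssortmentState, item::Int)
    push!(s.history, item)
    length(s.history) > HISTORY_LENGTH && popfirst!(s.history)
    return s
end

N, d = 20, 2
s = AssortmentState(rand(N), rand(d, N), ones(N), ones(N), 1, Int[])
for item in (5, 5, 5, 5, 21, 5, 5)   # first seven purchases in the printed trajectory
    record_purchase!(s, item)
end
println(s.history)
```

Feeding in the first seven purchases of the printed trajectory leaves [5, 5, 21, 5, 5] in the window.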

Action: $a_t \subseteq \{1,\ldots,N\}$ with $|a_t| = K$
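The action space grows combinatorially. With the sizes used here (N = 20 items, matching the length of y, and K = 4, matching TopKMaximizer(4)), a quick count shows how many assortments are available at every step, which suggests why scoring items individually and taking the top K is attractive compared with evaluating assortments one by one.

```julia
N, K = 20, 4                 # sizes read off the printouts above
n_actions = binomial(N, K)   # number of size-K subsets of N items
println(n_actions)           # 4845
```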

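Customer choice (multinomial logit over the offered items $i \in a_t$, with a no-purchase option of utility 0 that contributes the $+1$ in the denominator):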

\[\mathbb{P}(i \mid a_t, s_t) = \frac{\exp(\theta_i(s_t))}{\sum_{j \in a_t} \exp(\theta_j(s_t)) + 1}\]
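The formula translates directly into code. In this sketch (the function name mnl_probabilities is hypothetical), the no-purchase option enters as the constant 1 in the denominator, so the purchase probabilities of the offered items sum to less than one and the remainder is the probability of leaving without buying.

```julia
# Hypothetical helper: MNL choice probabilities for the offered item indices a.
function mnl_probabilities(θ::AbstractVector{<:Real}, a::AbstractVector{<:Integer})
    w = exp.(θ[a])                  # exp(θ_i) for each offered item
    denom = sum(w) + 1              # + exp(0) for the no-purchase option
    return w ./ denom, 1 / denom    # (P(i | a) for i in a, P(no purchase))
end

θ = zeros(20)                       # equal utilities for all N = 20 items
a = [4, 5, 8, 11]                   # an assortment of K = 4 items
p, p_none = mnl_probabilities(θ, a)
println(p, " ", p_none)
```

With equal zero utilities and K = 4 offered items, each offered item and the no-purchase option get probability 1/5.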

Transition dynamics:

Hype: $h_{t+1}^{(i)} = h_t^{(i)} \times m^{(i)}$, where the multiplier $m^{(i)}$ reflects recent purchases of item $i$
Saturation: multiplied by 1.01 for the purchased item

Reward: $r(s_t, a_t) = p_{i^\star}$, where $i^\star$ is the item the customer purchases (the reward is 0 if there is no purchase)
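A single environment step can be simulated from these rules. The sketch below samples the customer's choice from the MNL model, pays the price of the purchased item (0 if none), and multiplies that item's saturation by 1.01. It is a hypothetical illustration, not the benchmark's simulator: the function names are invented for the example, and the hype update is left out because the exact form of the multiplier $m^{(i)}$ is not given here.

```julia
using Random

# Hypothetical helper: sample a customer choice from the MNL model.
# Returns the purchased item index, or `nothing` for no purchase.
function sample_choice(rng::AbstractRNG, θ, a)
    w = exp.(θ[a])
    u = rand(rng) * (sum(w) + 1)    # the extra 1 is the no-purchase weight
    acc = 0.0
    for (k, i) in enumerate(a)
        acc += w[k]
        u < acc && return i         # u fell inside item i's share
    end
    return nothing                  # u fell inside the no-purchase share
end

# Hypothetical helper: one step of the dynamics described above.
# Hype is not updated because its multiplier is unspecified here.
function simulate_step!(rng::AbstractRNG, θ, a, prices, saturation)
    i = sample_choice(rng, θ, a)
    i === nothing && return (nothing, 0.0)   # no purchase: reward 0
    saturation[i] *= 1.01                    # saturation of the purchased item
    return (i, prices[i])                    # reward = price of purchased item
end

rng = MersenneTwister(1)
θ = zeros(6)
prices = [10.0, 8.0, 6.0, 4.0, 2.0, 1.0]
saturation = ones(6)
a = [1, 2, 3, 4]
item, reward = simulate_step!(rng, θ, a, prices, saturation)
println("purchased: ", item, ", reward: ", reward)
```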

Objective, over a horizon of $T$ steps (the trajectory above runs for $T = 80$):

\[\max_\pi \; \mathbb{E}\!\left[\sum_{t=1}^T r(s_t, \pi(s_t))\right]\]
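Because expectation is linear, the objective splits into per-step expected rewards, and under the MNL model the expected reward of an assortment is $\sum_{i \in a_t} p_i \, \mathbb{P}(i \mid a_t, s_t)$. The sketch below computes this quantity and, for a toy instance, finds the assortment that maximizes it by brute force. This myopic rule is only an illustrative baseline with hypothetical helper names: it ignores how today's offer shifts future hype and saturation, which the $T$-step objective does account for.

```julia
# Hypothetical helper: expected immediate revenue of assortment a under the
# MNL model, i.e. the sum over i in a of p_i * P(i | a).
function expected_revenue(θ, prices, a)
    w = exp.(θ[a])
    return sum(prices[a] .* w) / (sum(w) + 1)
end

# Hypothetical myopic baseline: enumerate all size-K subsets of N items with a
# bitmask (only practical for small N) and keep the best expected revenue.
function best_myopic_assortment(θ, prices, K)
    N = length(θ)
    best, best_val = Int[], -Inf
    for mask in 0:(2^N - 1)
        count_ones(mask) == K || continue
        a = [i for i in 1:N if (mask >> (i - 1)) & 1 == 1]
        v = expected_revenue(θ, prices, a)
        if v > best_val
            best, best_val = a, v
        end
    end
    return best, best_val
end

prices = [10.0, 8.0, 6.0, 4.0]
θ = zeros(4)
a, v = best_myopic_assortment(θ, prices, 2)
println(a, " ", v)
```

With equal utilities the myopic optimum is simply the two most expensive items, [1, 2], with expected revenue (10 + 8) / 3 = 6.0; with unequal utilities, a cheaper but more attractive item can displace an expensive one.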

Key Components