Dynamic Assortment

Select which K items to offer at each step to maximize revenue, while customer preferences evolve with purchase history through hype and saturation effects.

using DecisionFocusedLearningBenchmarks
using Plots

b = DynamicAssortmentBenchmark()
DynamicAssortmentBenchmark{false, Flux.Chain{Tuple{Flux.Dense{typeof(identity), Matrix{Float64}, Vector{Float64}}, typeof(vec)}}}(Chain(Dense(5 => 1), vec), 20, 2, 4, 80)

Observable input

Generate one environment and roll it out with the expert baseline policy to collect a sample trajectory. At each step the agent observes item prices, hype levels, saturation, and purchase history:

policies = generate_baseline_policies(b)
env = generate_environments(b, 1)[1]
_, trajectory = evaluate_policy!(policies.expert, env)
(476.2144745856152, DataSample{@NamedTuple{instance::Tuple{Matrix{Float64}, Vector{Int64}}}, @NamedTuple{reward::Float64, step::Int64}, Matrix{Float64}, BitVector, Nothing}[DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.0125 0.0125 … 0.0125 0.0125], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], Int64[]), reward=6.07343, step=1), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.025 0.025 … 0.025 0.025], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17]), reward=6.07343, step=2), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.0375 0.0375 … 0.0375 0.0375], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17]), reward=6.07343, step=3), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.05 0.05 … 0.05 0.05], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17]), reward=6.07343, step=4), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.0625 0.0625 … 0.0625 0.0625], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 
0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17]), reward=6.07343, step=5), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.075 0.075 … 0.075 0.075], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17]), reward=6.07343, step=6), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.0875 0.0875 … 0.0875 0.0875], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17]), reward=6.07343, step=7), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.1 0.1 … 0.1 0.1], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17]), reward=6.07343, step=8), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 0.0; 0.1125 0.1125 … 0.1125 0.1125], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17]), reward=6.07343, step=9), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448377 … 0.418176 0.327177; … ; 0.0 0.0 … 0.0 
0.0; 0.125 0.125 … 0.125 0.125], y=Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48377 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17]), reward=6.07343, step=10)  …  DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.8875 0.8875 … 0.8875 0.8875], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 17, 17, 2, 17, 17, 17, 17]), reward=6.07343, step=71), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.9 0.9 … 0.9 0.9], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 17, 2, 17, 17, 17, 17, 17]), reward=6.07343, step=72), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.9125 0.9125 … 0.9125 0.9125], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 2, 17, 17, 17, 17, 17, 17]), reward=0.0, step=73), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.925 0.925 … 0.925 0.925], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], 
instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 2, 17, 17, 17, 17, 17, 17, 21]), reward=6.07343, step=74), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.9375 0.9375 … 0.9375 0.9375], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 2, 17, 17, 17, 17, 17, 17, 21, 17]), reward=6.07343, step=75), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.95 0.95 … 0.95 0.95], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  2, 17, 17, 17, 17, 17, 17, 21, 17, 17]), reward=6.07343, step=76), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.9625 0.9625 … 0.9625 0.9625], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 17, 17, 17, 21, 17, 17, 17]), reward=6.07343, step=77), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.975 0.975 … 0.975 0.975], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 
4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 17, 17, 21, 17, 17, 17, 17]), reward=6.07343, step=78), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 0.9875 0.9875 … 0.9875 0.9875], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 17, 21, 17, 17, 17, 17, 17]), reward=6.07343, step=79), DataSample(x=[0.382239 0.577244 … 0.891936 0.558106; 0.386732 0.448266 … 0.418176 0.327177; … ; 0.0 0.00743763 … 0.0 0.0; 1.0 1.0 … 1.0 1.0], y=Bool[0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], instance=([3.82239 5.77244 … 8.91936 5.58106; 3.86732 4.48266 … 4.18176 3.27177; … ; 8.42504 7.15157 … 4.58344 3.19582; 9.31262 7.86952 … 5.33052 4.48871], [17, 17, 17, 17, 17, 17, 17, 17, 17, 17  …  17, 17, 17, 21, 17, 17, 17, 17, 17, 17]), reward=6.07343, step=80)])

The observable state at step 1, showing the item prices (which are fixed across steps):

plot_context(b, trajectory[1])

A training sample

Each step in a trajectory is a DataSample: a labeled tuple (x, θ, y) plus the state, reward, and step index:

  • x: (d+8) × N feature matrix per step (prices, hype, saturation, history, time)
  • θ: predicted utility score per item (not shown in the baseline rollout printed above)
  • y: offered assortment at this step (BitVector of length N, true = offered)
  • instance: full state tuple (features matrix, purchase history)
  • reward: price of the purchased item (0 if no purchase)
  • step: index of the time step within the episode
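As a small illustration of how to read the label y, the snippet below copies the BitVector printed for the first step of the trajectory above and recovers the indices of the offered items. It uses only Base Julia; the variable names are illustrative.

```julia
# y copied from the first DataSample printed in the rollout output above.
y = Bool[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0]

offered = findall(y)        # indices of the items with y[i] == true
n_offered = count(y)        # should equal the assortment size K
```

With the default K = 4, exactly four entries are set; item 17, which appears in the printed purchase history, is among the offered items.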

One step with the offered assortment highlighted (green = offered):

plot_sample(b, trajectory[1])

A few steps side by side (prices are fixed; assortment composition changes over time):

plot_trajectory(b, trajectory[1:min(4, length(trajectory))])

DFL pipeline components

The DFL agent chains two components. The first is a neural network that predicts one utility score per item:

model = generate_statistical_model(b)     # MLP: state features → predicted utility per item
Chain(
  Dense(10 => 5),                       # 55 parameters
  Dense(5 => 1),                        # 6 parameters
  vec,
)                   # Total: 4 arrays, 61 parameters, 452 bytes.

The second component is a maximizer that offers the K items with the highest predicted utilities:

maximizer = generate_maximizer(b)         # top-K selection by predicted utility
DecisionFocusedLearningBenchmarks.Utils.TopKMaximizer(4)

At each step, the model maps the current state (prices, hype, saturation, history) to a utility score per item. The maximizer selects the K items with the highest scores.
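The selection step can be sketched in plain Julia. The helper `top_k_mask` below is hypothetical and is not the package's TopKMaximizer; it only illustrates the idea of turning a score vector into a BitVector with K entries set.

```julia
# Hypothetical helper (an assumption, not the package's TopKMaximizer):
# mark the K items with the highest predicted utility.
function top_k_mask(θ::AbstractVector{<:Real}, K::Integer)
    1 <= K <= length(θ) || throw(ArgumentError("K must be between 1 and length(θ)"))
    mask = falses(length(θ))                          # BitVector, one entry per item
    mask[partialsortperm(θ, 1:K; rev=true)] .= true   # indices of the K largest scores
    return mask
end

θ = [0.2, 1.5, -0.3, 0.9, 1.1]
y = top_k_mask(θ, 2)    # items 2 and 5 have the two highest scores
```

`partialsortperm` avoids sorting the whole vector, which is all that is needed when K is much smaller than N.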


Problem Description

Overview

In the Dynamic Assortment problem, a retailer has $N$ items and must select $K$ to offer at each time step. Customer preferences evolve based on purchase history through hype (recent purchases increase demand) and saturation (repeated purchases slightly decrease demand).

Mathematical Formulation

State $s_t = (p, f, h_t, \sigma_t, t, \mathcal{H}_t)$ where:

  • $p$: fixed item prices
  • $f$: static item features
  • $h_t, \sigma_t$: current hype and saturation levels
  • $t$: current time step
  • $\mathcal{H}_t$: purchase history (last 5 purchases)

Action: $a_t \subseteq \{1,\ldots,N\}$ with $|a_t| = K$

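Customer choice (multinomial logit with a no-purchase option, which contributes the $+1$ in the denominator), for each offered item $i \in a_t$: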

\[\mathbb{P}(i \mid a_t, s_t) = \frac{\exp(\theta_i(s_t))}{\sum_{j \in a_t} \exp(\theta_j(s_t)) + 1}\]
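A minimal sketch of this formula in plain Julia, assuming the utilities θ are already known. The function name `mnl_probabilities` is illustrative and does not come from the package.

```julia
# Choice probabilities for the offered items, plus the no-purchase probability.
# `θ` holds one utility per catalog item; `assortment` lists the offered indices.
function mnl_probabilities(θ::AbstractVector{<:Real}, assortment::AbstractVector{<:Integer})
    weights = exp.(θ[assortment])         # exp(θ_j) for each offered item j
    denom = sum(weights) + 1              # the +1 is the no-purchase option, exp(0)
    return weights ./ denom, 1 / denom    # (purchase probabilities, no-purchase probability)
end

θ = [0.0, log(2.0), 1.0, -1.0]
p_buy, p_none = mnl_probabilities(θ, [1, 2])   # roughly [0.25, 0.5] and 0.25
```

The purchase probabilities and the no-purchase probability always sum to one, and items outside the assortment have probability zero.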

Transition dynamics:

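  • Hype: $h_{t+1}^{(i)} = h_t^{(i)} \times m^{(i)}$, where the multiplier $m^{(i)}$ reflects recent purchases of item $i$
  • Saturation: multiplied by 1.01 for the purchased item, i.e. $\sigma_{t+1}^{(i)} = 1.01\,\sigma_t^{(i)}$ if item $i$ was purchased and $\sigma_{t+1}^{(i)} = \sigma_t^{(i)}$ otherwise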
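The two update rules can be sketched as follows. This is an illustration under assumptions: the hype multipliers `m` are taken as an input because the text does not specify how they are computed from the purchase history, and `step_dynamics` is a hypothetical name.

```julia
# Hypothetical transition sketch (not the package's implementation).
# `m` holds the per-item hype multipliers (assumed given); `purchased` is the index
# of the bought item, or `nothing` when the customer leaves without buying.
function step_dynamics(hype, saturation, m, purchased::Union{Integer,Nothing})
    new_hype = hype .* m                     # h_{t+1} = h_t × m, elementwise
    new_saturation = copy(saturation)
    if purchased !== nothing
        new_saturation[purchased] *= 1.01    # only the purchased item saturates
    end
    return new_hype, new_saturation
end

h, σ = step_dynamics([1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.02, 1.0], 2)
```

Returning new arrays rather than mutating the inputs keeps the previous state available, which is convenient when features depend on the change from the previous step.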

Reward: $r(s_t, a_t) = p_{i^\star}$, where $i^\star \in a_t$ is the item purchased at step $t$ (the reward is 0 if no purchase is made)
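Since the purchased item is random, it can help to write out the expected one-step reward implied by the multinomial logit model. This follows directly from the choice probabilities above by weighting each price with its purchase probability:

\[\mathbb{E}\left[r(s_t, a_t)\right] = \sum_{i \in a_t} p_i \, \mathbb{P}(i \mid a_t, s_t) = \frac{\sum_{i \in a_t} p_i \exp(\theta_i(s_t))}{\sum_{j \in a_t} \exp(\theta_j(s_t)) + 1}\]

For example, offering two items with prices 4 and 6 and utilities $\theta = 0$ for both gives each a purchase probability of $1/3$, so the expected reward is $(4 + 6)/3 \approx 3.33$.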

Objective:

\[\max_\pi \; \mathbb{E}\!\left[\sum_{t=1}^T r(s_t, \pi(s_t))\right]\]
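To make the objective concrete, the sketch below estimates the expected episode revenue by Monte Carlo for a policy that offers the same assortment at every step. It is a simplification: utilities are held fixed, so the hype and saturation dynamics are ignored, and all names are illustrative rather than part of the package.

```julia
using Random

# Sample one customer decision: returns the purchased item index, or 0 for no purchase.
function sample_purchase(rng, θ, assortment)
    weights = exp.(θ[assortment])
    u = rand(rng) * (sum(weights) + 1)     # the +1 is the no-purchase weight
    acc = 0.0
    for (k, w) in enumerate(weights)
        acc += w
        u <= acc && return assortment[k]
    end
    return 0
end

# Total revenue of one episode when the same assortment is offered at every step.
# Assumption: θ stays constant, i.e. no hype or saturation effects.
function episode_return(rng, prices, θ, assortment, T)
    total = 0.0
    for _ in 1:T
        i = sample_purchase(rng, θ, assortment)
        total += i == 0 ? 0.0 : prices[i]
    end
    return total
end

rng = MersenneTwister(0)
prices = [4.0, 6.0, 5.0]
θ = [0.5, 0.1, 0.3]
returns = [episode_return(rng, prices, θ, [1, 2], 80) for _ in 1:200]
estimate = sum(returns) / length(returns)   # Monte Carlo estimate of the objective
```

In the actual benchmark the utilities change with the state, so a policy has to be evaluated by rolling out the environment, as done with evaluate_policy! earlier on this page.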

Key Components

DynamicAssortmentBenchmark

| Parameter | Description | Default |
|-----------|-------------|---------|
| N | Number of items in catalog | 20 |
| d | Static feature dimension per item | 2 |
| K | Assortment size | 4 |
| max_steps | Steps per episode | 80 |
| exogenous | Whether dynamics are exogenous | false |

State Observation

Agents observe a $(d+8) \times N$ normalized feature matrix per step containing: current prices, hype, saturation, static features, change in hype/saturation from previous step and from initial state, and normalized time step.

Baseline Policies

| Policy | Description |
|--------|-------------|
| Expert | Brute-force enumeration of all $\binom{N}{K}$ subsets; optimal but slow |
| Greedy | Selects the $K$ items with highest prices |
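The difference between the two baselines can be illustrated with a small self-contained sketch. It assumes that the enumeration scores each subset by its one-step expected revenue under the multinomial logit model with known utilities; the source does not state the exact criterion the Expert policy uses, and all function names here are hypothetical.

```julia
# Expected one-step revenue of an assortment under the MNL choice model.
function expected_revenue(prices, θ, assortment)
    weights = exp.(θ[assortment])
    return sum(prices[assortment] .* weights) / (sum(weights) + 1)
end

# Enumerate all K-subsets of 1:N in lexicographic order, without external packages.
function k_subsets(N::Integer, K::Integer)
    subsets = Vector{Vector{Int}}()
    current = Int[]
    function extend(start)
        if length(current) == K
            push!(subsets, copy(current))
            return
        end
        for i in start:N
            push!(current, i)
            extend(i + 1)
            pop!(current)
        end
    end
    extend(1)
    return subsets
end

# Greedy baseline: the K most expensive items, ignoring customer preferences.
greedy_assortment(prices, K) = sort(partialsortperm(prices, 1:K; rev=true))

# Brute force (assumed criterion): the subset with the highest expected revenue.
function brute_force_assortment(prices, θ, K)
    candidates = k_subsets(length(prices), K)
    scores = [expected_revenue(prices, θ, a) for a in candidates]
    return candidates[argmax(scores)]
end

prices = [9.0, 8.0, 3.0, 2.0]
θ = [-3.0, -3.0, 2.0, 1.0]     # the expensive items are rarely chosen
greedy = greedy_assortment(prices, 2)
best = brute_force_assortment(prices, θ, 2)
```

In this toy instance the greedy choice offers the two expensive items that customers rarely buy, while the enumeration pairs one expensive item with a popular cheap one. With the default N = 20 and K = 4 there are 4845 subsets to score at every step, which is why the table calls this approach slow.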

DFL Policy

\[\xrightarrow[\text{State}]{s_t} \fbox{Neural network $\varphi_w$} \xrightarrow[\text{Utilities}]{\theta \in \mathbb{R}^N} \fbox{Top-K} \xrightarrow[\text{Assortment}]{a_t}\]

Model: Chain(Dense(d+8 → 5), Dense(5 → 1), vec), which predicts one utility score per item from the current state features.

Maximizer: TopKMaximizer(K), which selects the top $K$ items by predicted utility.
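The shapes involved in the model can be checked with a plain-Julia stand-in for the chain. This sketch does not use Flux: the weights are random placeholders rather than the package's initialization, and identity activations are assumed because the printed layers show none.

```julia
# Plain-Julia stand-in for Chain(Dense(d+8 => 5), Dense(5 => 1), vec).
# Assumptions: identity activations, random placeholder weights.
d, N = 2, 20
n_features = d + 8                     # 10 rows in the feature matrix

W1, b1 = randn(5, n_features), zeros(5)
W2, b2 = randn(1, 5), zeros(1)

x = rand(n_features, N)                # one column of features per item
hidden = W1 * x .+ b1                  # 5 × N
θ = vec(W2 * hidden .+ b2)             # N utility scores, one per item

n_params = length(W1) + length(b1) + length(W2) + length(b2)
```

Because each layer acts on columns, the same weights score every item, and the parameter count (55 + 6 = 61) matches the model summary printed earlier on this page.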


This page was generated using Literate.jl.