Four Armed Bandit Task
Background
Multi-armed bandit tasks are behavioral paradigms designed to study how humans make decisions under uncertainty. They get their name from a hypothetical gambler deciding which lever to pull on a row of slot machines, colloquially known as "one-armed bandits". The core purpose of these tasks is to measure the explore-exploit trade-off: the everyday dilemma between sticking with a known, rewarding option (exploitation) and trying new or uncertain options in the hope of finding something better (exploration).
In a typical experiment, participants are seated before a computer and presented with two or more options (e.g., slot machines, cards, or doors). Each option has an initially unknown probability of yielding a reward (such as points or money). Participants are instructed to maximize their total reward over a fixed number of trials. They must sample the different options to find out which one pays best and then choose that option repeatedly to earn the most points, while occasionally rechecking the other options, because the payoffs are designed to fluctuate over the course of the task.
Nathaniel Daw and colleagues published a landmark study in 2006 that paired a "four-armed bandit" task with fMRI. This work effectively established the standard paradigm for using bandit tasks to map how the human brain manages the explore-exploit trade-off. They found that exploration and exploitation are supported by distinct neural networks: exploitation involves the ventromedial prefrontal cortex and striatum, while exploration is driven by the frontopolar cortex. Behaviorally, their results indicate that people's decision-making under uncertainty involves switching between exploratory and exploitative choices. People tended to explore by taking a random gamble on an alternative that had previously looked "decent", rather than by explicitly targeting the least-known option simply to reduce uncertainty, a strategy often employed in machine learning.
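The contrast between the two exploration strategies can be sketched in a few lines of Python. This is an illustrative sketch only: the softmax rule is one common way to formalize "random gambles on decent-looking alternatives", and the function names, the temperature, and the bonus weight are assumptions chosen for illustration, not values taken from the study.

```python
import math

def softmax_probs(values, temperature=10.0):
    """Choice probabilities that favor high estimated values but still
    give decent-looking alternatives a chance (value-based random exploration).
    The temperature is a hypothetical parameter: higher means more random."""
    top = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def uncertainty_bonus_choice(values, uncertainties, bonus=2.0):
    """Directed exploration: add a bonus proportional to how little is known
    about each option, then pick the best score (hypothetical bonus weight)."""
    scores = [v + bonus * u for v, u in zip(values, uncertainties)]
    return scores.index(max(scores))

# Estimated payoffs and uncertainty for four slots (made-up numbers)
estimated_values = [60, 55, 30, 20]
uncertainties = [2, 3, 5, 25]

probs = softmax_probs(estimated_values)
directed_pick = uncertainty_bonus_choice(estimated_values, uncertainties)
```

With these made-up numbers, the softmax rule gives the second-best slot a much higher choice probability than the poorly known fourth slot, whereas the uncertainty-bonus rule selects the fourth slot precisely because so little is known about it.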
Task Procedure
Each trial in the Four Armed Bandit Game presents four colorful 'slots'. Behind the scenes, each slot is assigned a different starting profit value (e.g., 20, 40, 60, 80). These values are updated throughout the game by a mathematical formula that makes the current payoff values "walk" randomly (with possible built-in jumps) but includes a decay factor that, over time, pulls the values back toward a central value (50, the same for all slots) if they drift too far away. The actual payoff points for each slot are then sampled from a random distribution (truncated between 1 and 100) centered on that slot's current 'drifted' profit value.
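A minimal sketch of such a decaying random walk is shown below. The decay factor, the two noise standard deviations, and the function name are assumptions chosen for illustration; the optional jumps are omitted, and simple clipping to the range 1 to 100 stands in for sampling from a truncated distribution. The task's actual parameters may differ.

```python
import random

def simulate_payoffs(n_trials, start_means=(20, 40, 60, 80), center=50.0,
                     decay=0.98, drift_sd=3.0, payoff_sd=4.0, seed=None):
    """Return (mean_history, payoff_history) for a drifting four-slot bandit.

    decay, drift_sd and payoff_sd are hypothetical values, not the task's own.
    """
    rng = random.Random(seed)
    means = list(start_means)
    mean_history, payoff_history = [], []
    for _ in range(n_trials):
        # Payoff shown to the participant: noisy sample around the current
        # mean, rounded and clipped to 1..100 (clipping approximates truncation)
        payoffs = [min(100, max(1, round(rng.gauss(m, payoff_sd))))
                   for m in means]
        mean_history.append(list(means))
        payoff_history.append(payoffs)
        # Decaying random walk: shrink toward the center, then add random drift
        means = [decay * m + (1 - decay) * center + rng.gauss(0, drift_sd)
                 for m in means]
    return mean_history, payoff_history

# Example: one round of 150 trials with a fixed seed for reproducibility
mean_history, payoff_history = simulate_payoffs(150, seed=1)
```

Because each update multiplies the distance from 50 by the decay factor before adding noise, a slot that has wandered far from the center is pulled back more strongly than one that is already close to it.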
Participants are instructed to play the four slots and to maximize their profit during the game. They are further warned that the payoffs of the slots fluctuate from trial to trial and that they have 1.5 s to make each choice. If they do not choose in time, they lose the opportunity to add to their points on that trial. After a slot is selected, it plays an animation (2 s) and reveals the payoff (1 s). If no slot is selected in time, a red X is presented for about 4 s. Each trial concludes with a white screen, presented for 1 s.
After playing 5 practice trials to get familiar with the setup, participants play two rounds of the game with 150 trials each.
What it Measures
Multi-armed bandit tasks measure adaptive decision-making under uncertainty.
Psychological domains
- Decision-making: Response to potential rewards and losses over time
- Risk-taking: Preference for high-reward/high-risk or low-reward/low-risk options
- Delay Discounting: Foregoing immediate rewards for better long-term outcomes
- Executive Control: The ability of the prefrontal cortex to override automatic, reward-seeking behavior in order to execute an exploratory choice
Main Performance Metrics
- Total: Absolute and relative final payout; measure of 'Reward Maximization'
- Proportion of HighestPayOff: Proportion of trials on which the highest-payoff option was selected; measure of 'Optimal Choice Making'
- Exploration Rate: Proportion of choices in which participants selected a new option (relative to all choices made)
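The metrics above could be computed from trial-level data roughly as follows. This is a sketch under stated assumptions: the function name and input format are hypothetical, a "new option" is interpreted as a choice that differs from the previous choice made, missed trials are excluded from the proportions, and only the absolute total is computed because the reference for the relative payout is not specified here.

```python
def summarize_choices(choices, payoffs_received, best_options):
    """Summarize one participant's game (hypothetical helper).

    choices: chosen slot index per trial, or None if no choice was made in time
    payoffs_received: points earned per trial (0 on missed trials)
    best_options: index of the highest-payoff slot on each trial
    """
    # Keep only trials on which a choice was actually made
    valid = [(c, b) for c, b in zip(choices, best_options) if c is not None]
    n_valid = len(valid)
    made = [c for c, _ in valid]

    total = sum(payoffs_received)
    n_best = sum(1 for c, b in valid if c == b)
    # Assumption: a "new option" is one that differs from the previous choice
    n_switches = sum(1 for prev, cur in zip(made, made[1:]) if cur != prev)

    return {
        "total": total,
        "prop_highest_payoff": n_best / n_valid if n_valid else 0.0,
        "exploration_rate": n_switches / n_valid if n_valid else 0.0,
    }

# Example with six trials, one of them missed (made-up data)
summary = summarize_choices(
    choices=[0, 0, 1, None, 1, 2],
    payoffs_received=[50, 55, 40, 0, 60, 70],
    best_options=[0, 1, 1, 1, 1, 2],
)
# summary["total"] is 275; the best slot was chosen on 4 of 5 valid trials,
# and the choice changed on 2 of 5 valid trials
```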
Psychiatric Conditions
Performance on multi-armed bandit tasks tends to differ in patients with the following psychiatric conditions.
- Substance Use Disorders
- Schizophrenia
- Major Depressive Disorder
- Obsessive-Compulsive Disorder (OCD)
- Attention Deficit Hyperactivity Disorder (ADHD)
Test Variations
A decision-making game in which participants trade off pursuing known resources against exploring new ones, as described in Daw et al. (2006).
The Four-Armed Bandit Task with keyboard input
References
Daw, N.D., O'Doherty, J.P., Dayan, P., Seymour, B., & Dolan, R.J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441(7095), 876–879.