Two-Armed Bandit Task
Background
The Two-Armed Bandit Task by W. Bradley Knox and colleagues (2012) is a simplified variant of the Multi-Armed Bandit paradigm, designed to examine how humans plan ahead when rewards are constantly changing. The authors deliberately used two options instead of four (as in the Four-Armed Bandit Task), arguing that the 4-arm environment may be too complex for participants, which prevents researchers from adequately studying human exploratory behavior under uncertainty.
In their simplified environment, participants only had to choose between two options: option A and option B. The payoffs were set so that one option was always worth 10 points more than the other. On each trial, the payoff ranking flipped with a fixed probability (e.g. 7.5%): 20 points were added to the lower-paying option, which thereby 'leapfrogged' the previously higher-paying option.
The results of the study support Knox et al.'s claim that if an environment is too complex (like Daw's 4-armed task), humans seem to default to somewhat random, value-sensitive guessing to save mental effort. If the environment is simple enough to manage mentally, however, people actively track uncertainty and direct their exploration toward reducing it. Even so, people do not engage in optimal long-term planning; on each trial they simply choose the option they believe has the highest immediate payoff. In contrast, a mathematically programmed 'Ideal' Player would give up known higher values more often in order to eliminate uncertainty, which could yield higher payouts later on.
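The payoff rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `simulate_leapfrog`, the starting payoffs of 10 and 20 points, and the choice to apply a jump before the trial's payoffs are recorded are all assumptions made for illustration. Only the 10-point gap, the 20-point jump, and the 7.5% example probability come from the description.

```python
import random

def simulate_leapfrog(n_trials, p_jump=0.075, start=(10, 20), seed=None):
    """Return a list of (payoff_a, payoff_b) tuples, one per trial.

    Assumptions (not from the source): the starting payoffs and the
    convention that a jump is applied before the trial is recorded.
    """
    rng = random.Random(seed)
    payoffs = list(start)
    history = []
    for _ in range(n_trials):
        if rng.random() < p_jump:
            # The lower-paying option gains 20 points and overtakes the other.
            lower = 0 if payoffs[0] < payoffs[1] else 1
            payoffs[lower] += 20
        history.append(tuple(payoffs))
    return history

if __name__ == "__main__":
    schedule = simulate_leapfrog(300, seed=1)
    jumps = sum(1 for prev, cur in zip(schedule, schedule[1:]) if prev != cur)
    print("payoffs on final trial:", schedule[-1], "| jumps observed:", jumps)
```

Because every jump adds exactly 20 points to the option that trails by 10, the gap between the two options stays at 10 points on every trial; only the identity of the better option changes.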
Task Procedure
The Two-Armed Bandit Task is divided into two phases: (1) the Passive Observation Phase (POP) and (2) the Active Game (AG). During the POP, participants simply watch the game for 300 trials while the computer makes the choices. Whenever the payoffs switch (probability of 7.5% per trial), an alert is presented on screen. Beginning with trial 200, participants are asked to estimate how many payoff reversals they expect to observe during the next set of 100 trials. During the AG phase, participants actively play the game for 300 trials. Payoff switches are no longer announced to the participants. Participants select option A or option B via keyboard presses. If no response is made within 1500 ms, the trial is repeated.
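As a rough benchmark for the reversal estimate that participants give during the POP, the expected number of switches in a block can be worked out from the per-trial probability. The sketch below assumes that switches occur independently on each trial, so that the count follows a binomial distribution; the function name `expected_reversals` is hypothetical.

```python
import math

def expected_reversals(n_trials, p_jump=0.075):
    """Mean and standard deviation of the number of payoff switches,
    assuming an independent switch chance on every trial (binomial)."""
    mean = n_trials * p_jump
    sd = math.sqrt(n_trials * p_jump * (1 - p_jump))
    return mean, sd

if __name__ == "__main__":
    mean, sd = expected_reversals(100)
    print(f"expected switches in 100 trials: {mean:.1f} (SD {sd:.1f})")
```

Under this assumption, a block of 100 trials contains 100 x 0.075 = 7.5 switches on average, with a standard deviation of roughly 2.6.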
Psychological domains
- Decision-making: Response to potential rewards and losses over time
- Risk-taking: Preference for high-reward/high-risk or low-reward/low-risk options
- Delay Discounting: Foregoing immediate rewards for better long-term outcomes
- Executive Control: The ability of the prefrontal cortex to override automatic, reward-seeking behavior in order to execute an exploratory choice
Main Performance Metrics
- Total: Absolute and relative final payout; measure of 'Reward Maximization'
- Proportion of HighestPayOff: Proportion of trials on which the option with the highest payoff was selected; measure of 'Optimal Choice Making'
- Exploration Rate: Proportion of trials on which participants selected a new option (relative to all choices made)
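The three metrics above could be computed from trial-level data along the following lines. This is a sketch under stated assumptions, not the task's actual scoring code: the function name `performance_metrics` and the data layout are hypothetical, the relative payout is assumed to be the earned total divided by the maximum attainable total, and a "new option" is assumed to mean a choice that differs from the choice on the previous trial.

```python
def performance_metrics(choices, payoffs):
    """Compute summary metrics for one participant.

    choices: list of 0 (option A) or 1 (option B), one per trial.
    payoffs: list of (payoff_a, payoff_b) tuples, one per trial.
    """
    if not choices or len(choices) != len(payoffs):
        raise ValueError("choices and payoffs must be non-empty and equal in length")

    n = len(choices)
    earned = sum(p[c] for c, p in zip(choices, payoffs))
    # Assumption: "relative" payout means relative to always picking the best option.
    best_possible = sum(max(p) for p in payoffs)
    n_best = sum(1 for c, p in zip(choices, payoffs) if p[c] == max(p))
    # Assumption: a "new option" is a switch away from the previous trial's choice.
    n_switch = sum(1 for prev, cur in zip(choices, choices[1:]) if prev != cur)

    return {
        "total": earned,
        "relative_total": earned / best_possible,
        "prop_highest_payoff": n_best / n,
        "exploration_rate": n_switch / n,
    }

if __name__ == "__main__":
    demo_choices = [0, 1, 1, 0]
    demo_payoffs = [(10, 20), (10, 20), (30, 20), (30, 20)]
    print(performance_metrics(demo_choices, demo_payoffs))
```

Other operationalizations of exploration are possible, for example counting choices of the option the participant last observed to pay less; the switch-based definition is used here only because it follows most directly from the wording of the metric.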
Psychiatric Conditions
Performance on Armed Bandit Tasks tends to differ in patients with the following psychiatric conditions.
- Substance Use Disorders
- Schizophrenia
- Major Depressive Disorder
- Obsessive-Compulsive Disorder (OCD)
- Attention Deficit Hyperactivity Disorder (ADHD)
A decision-making game in which participants trade off pursuing one known resource against exploring one new resource, as described in Knox et al. (2012).
References
Knox, W.B., Otto, A.R., Stone, P., & Love, B.C. (2012). The nature of belief-directed exploratory choice in human decision-making. Frontiers in Psychology, 2, Article 298, 1-12.