
Bag of Tricks: Choosing actions via the matching law

Started by November 04, 2003 05:29 PM
i'm reposting this from comp.ai.games 'cause i like to see my name on lots of forums. It's an attempt to come up with a standard format for documenting various techniques i use in game AI.

Problem Area: action selection
     AI Type: decision making, learning (secondary), 
              personality (secondary)
Detail Level: mid-level
   Technique: matching law
 Assumptions: Options are relatively equal
Example Uses: sports: choosing a shot type
              FPS: choosing a weapon
              RPG: choosing a spell
              RTS: choosing a build unit type
 
Explanation
-----------
In 1962, psychologist Herrnstein discovered that, when an animal is presented with two options, the relative rate of responding (choosing) of a given option is equal to the relative rate of reinforcement (success) of that option. This was called the matching law and written as RA/(RA+RB) = rA/(rA+rB). In 1974, psychologist Baum refined the formula by adding the notions of bias and sensitivity, resulting in the new equation RA/RB = b(rA/rB)^s. The matching law has been repeatedly upheld in human and non-human animal studies. In 2000, Vollmer and Bourret studied the shot selection of 26 college basketball players. They found that how often a player attempted a three-point shot or a two-point shot was proportional to the relative rates of reinforcement (success rates) for each. If a player successfully completed 10% more two-point shots than three-point shots, the player would shoot 10% more two-point shots than three-point shots.

Variables
---------

RA/(RA+RB) = b(rA/(rA+rB))^s = matching law
RA         = Rate of response for option A.
             How often option A is chosen.
             This is a counter
RA/(RA+RB) = Relative rate of response for option A.
             The percentage of time option A is chosen from
             options A&B
rA         = Rate of reinforcement for option A.
             The percentage of time choosing option A has led
             to a good result
rA/(rA+rB) = Relative rate of reinforcement for option A.
             The percentage of overall successes that have
             come from choosing option A
b          = Response bias.
             A preference for a given option or outcome.
               b>1 means the option is preferred
s          = Sensitivity.
             How much better an option must be before the agent
             will switch to it.
               s<1 means an option must be clearly better than
               the current one to be worth switching to.
             A major cause of sensitivity is switching cost.
               In an FPS, the switching cost would be the time
               spent unarmed while switching to the new weapon
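
Here's a minimal Python sketch of the bookkeeping described above. All of the names (the Option class, choiceWeights, chooseOption) are made up for illustration; the core is just b*r^s per option, normalized so the weights sum to 1, which after normalization is the same as the relative-rate form of the equation above.

  import random

  class Option:
      def __init__(self, name, bias=1.0, sensitivity=1.0):
          self.name = name
          self.bias = bias                # b: designer-set preference, b>1 means preferred
          self.sensitivity = sensitivity  # s: s<1 means reluctant to switch
          self.numAttempted = 0
          self.numAttemptsSuccessful = 0

      def reinforcementRate(self):
          # r = successes / attempts; assume 0.5 when there is no history yet
          if self.numAttempted == 0:
              return 0.5
          return self.numAttemptsSuccessful / self.numAttempted

  def choiceWeights(options):
      # weight each option by b * r^s, then normalize so the weights sum to 1
      raw = [o.bias * (o.reinforcementRate() ** o.sensitivity) for o in options]
      total = sum(raw)
      if total == 0:
          return [1.0 / len(options)] * len(options)   # no successes anywhere: choose uniformly
      return [w / total for w in raw]

  def chooseOption(options):
      # pick an option with probability equal to its normalized weight
      return random.choices(options, weights=choiceWeights(options))[0]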
    
Game Example
------------
A wizard is 30 meters away from a group of orcs. He has three third-level spells that are appropriate to use - flamestrike, iceblade and stonestorm. Question: which spell should the wizard cast? Assume that the wizard has successfully hit his enemies 3/10 times with flamestrike, 2/4 times with iceblade and 6/7 times with stonestorm.

  r(flamestrike) = 3/10 = 0.3  (30%)
  r(iceblade)    = 2/4  = 0.5  (50%)
  r(stonestorm)  = 6/7  = 0.86 (86%)

  relative r(flamestrike) = .3/(.3+.5+.86)  = .3/1.66  = 0.18 (18%)
  relative r(iceblade)    = .5/(.3+.5+.86)  = .5/1.66  = 0.30 (30%)
  relative r(stonestorm)  = .86/(.3+.5+.86) = .86/1.66 = 0.52 (52%)
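
Using the hypothetical choiceWeights() helper from the sketch above, the same relative rates fall straight out of the counters:

  flamestrike = Option("flamestrike")
  flamestrike.numAttempted, flamestrike.numAttemptsSuccessful = 10, 3
  iceblade = Option("iceblade")
  iceblade.numAttempted, iceblade.numAttemptsSuccessful = 4, 2
  stonestorm = Option("stonestorm")
  stonestorm.numAttempted, stonestorm.numAttemptsSuccessful = 7, 6

  print(choiceWeights([flamestrike, iceblade, stonestorm]))
  # roughly [0.18, 0.30, 0.52]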
    
So the wizard would cast stonestorm 52% of the time, iceblade 30% of the time and flamestrike 18% of the time.

Personality
-----------
To see how this affects the personality of the character, assume we have an NPC named Pyro the Flame Wizard. The NPC has no special advantages with fire, he just likes how pretty it is. Pyro's bias for fire-based spells is 1.25 (25% bias).

  relative r(flamestrike) = 1.25(.3/(.3+.5+.86)) = 1.25(.3/1.66) = 0.23 (23%)
    
Note that if the other options (iceblade and stonestorm) do not have their biases downgraded accordingly, the percentages (.23, .30, .52) will add up to 105%, not 100%. Given that this is, by definition, impossible, all numbers need to be normalized. We do this by dividing each result by the overage (1.05):

  relative r(flamestrike) = 0.23/1.05 = 0.22 (22%)
  relative r(iceblade)    = 0.30/1.05 = 0.29 (29%)
  relative r(stonestorm)  = 0.52/1.05 = 0.50 (50%)
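
Here's the same bias-then-renormalize step as a sketch, again with the hypothetical helpers from above; because choiceWeights() divides by the sum of the biased weights, the renormalization happens automatically:

  flamestrike.bias = 1.25   # Pyro just likes fire
  print(choiceWeights([flamestrike, iceblade, stonestorm]))
  # roughly [0.22, 0.29, 0.50]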
    
The hand-worked numbers above add up to 101% because i did a lot of rounding, but if you don't round they should add up to 100%. Note that, even with the bias, Pyro the Flame Wizard still only casts flamestrike 22% of the time. This is because he'd be suicidal or a fool to cast it more than that given his poor track record with the spell.

Learning
--------
Action selection is driven by two counters, numAttempted and numAttemptsSuccessful. Because of that, the agents will constantly adapt their actions based on feedback and their history.

Limitations
-----------
- Learning via the matching law is slow. It does not smoothly accommodate rapid changes in skills, such as happens in games when leveling up or acquiring a skill-enhancing weapon
- As stated, it does not accommodate learning by observation (ie, switching to a sniper rifle because the winner of the last four rounds used the sniper rifle)
- Modifications must be made to accommodate context-specific decisions (ie, tracking numAttempts at a per-enemy or per-enemyType level rather than at a global level)

Notes
-----
- Bias and sensitivity are not learned, they are hard coded by the designer
- Sensitivity should be situation specific. Sensitivity to switching weapons (armor, etc.) should be normal (~1) in safe spots and should be lower when surrounded by enemies. The sensitivity of the current item should reflect the fact that there is no switching cost (ie, s=1)
I'll explain a slightly different method that can be used in even more situations.

Suppose an agent has options o_1, o_2, ..., o_n. We somehow estimate the expected utility of each one, u_1, u_2, ..., u_n. Then the probability of the agent taking option o_i is proportional to exp(A*u_i), where A is some constant. Of course, after normalization, this means that the probability is
exp(A*u_i)/(exp(A*u_1)+exp(A*u_2)+...+exp(A*u_n))
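
A sketch of that rule (the function name and example utilities are mine, Python to match the sketches above):

  import math

  def softmaxChoiceProbs(utilities, A=1.0):
      # P(o_i) = exp(A*u_i) / sum_j exp(A*u_j)
      # subtracting the max utility first keeps exp() from overflowing
      # and does not change the result
      m = max(utilities)
      weights = [math.exp(A * (u - m)) for u in utilities]
      total = sum(weights)
      return [w / total for w in weights]

  # A = 0 gives a uniform random choice; a large A almost always picks the best option
  print(softmaxChoiceProbs([0.3, 0.5, 0.86], A=5.0))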

For instance, this has been used to model chess playing styles (http://citeseer.nj.nec.com/jansen00inductive.html).

The constant A plays more or less the role of the sensitivity in your method. Of course, introducing bias to this method is as easy as artificially adding some value to the expected utility of an option.

I wouldn't use the sensitivity to prevent frequent switches from one weapon to another. I would incorporate that as a bias in favour of the current weapon. The sensitivity should be an indicator of how sure the agent is that his estimates are accurate.

Regarding the speed at which the agent adapts to changing conditions, you could always weight your experience in the past using a negative exponential weight. The simplest way of achieving this is using an update rule like
new_estimate_of_probability_of_success = (1-epsilon)*old_estimate + epsilon*current_observation

A small epsilon will learn slowly and a large epsilon will learn fast. Of course, learning too fast means you end up using only the very last data points to guide your actions, which may not be a good idea.
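
A sketch of that update rule (names are mine):

  def updateSuccessEstimate(oldEstimate, observation, epsilon=0.1):
      # observation is 1.0 for a success, 0.0 for a failure
      # small epsilon = slow, stable learning; large epsilon = fast, twitchy learning
      return (1.0 - epsilon) * oldEstimate + epsilon * observation

  estimate = 0.5
  for hit in [1.0, 1.0, 0.0, 1.0]:   # recent results with some weapon
      estimate = updateSuccessEstimate(estimate, hit)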

You could accommodate learning by observation by including other agents' experiences in your estimates of the utility of the options. Usually you would want to weight those experiences down, because there might be differences between the other agent's skills and your own. But it can all be introduced as a more sophisticated estimation.
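
One possible way to fold that in, sketched with made-up names (the trust factor is the down-weighting mentioned above):

  def combinedEstimate(ownRate, ownAttempts, observedRate, observedAttempts, trust=0.5):
      # pool our own results with another agent's, counting theirs at a discount
      weightedObserved = trust * observedAttempts
      total = ownAttempts + weightedObserved
      if total == 0:
          return 0.5
      return (ownRate * ownAttempts + observedRate * weightedObserved) / total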

