When to make decisions when utilities are continuously updating...
Hi everyone,
I'm sorry for the length of my post; I've tried to be as specific as possible!
I'm working on a system for an agent that needs to balance several needs. My focus is not on balancing the needs optimally, but on being able to tweak the parameters to get whatever frequencies of the different behaviours appear most appropriate.
The current system has a set of normalised 'state' values that continuously update with time and may be affected by agent actions. The states are in turn used to calculate an additional set of 'motivation' values (these might correspond one-to-one with states, combine several of them, or be mapped non-linearly). The system enumerates a set of viable plans and estimates each plan's effect on each of the state values. This effect is then evaluated with respect to each motivation type to produce a motivation-specific utility; these utilities are then summed, weighted by the value of the motivation itself, into one global utility.
Having so many different values at the different stages is probably conceptually equivalent to one combined utility function; however, I've been trying to keep the individual values transparent and debuggable, and the system works well (i.e. it functions, and is easily tweakable) in the form described above.
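To make that concrete, here is a minimal sketch of the pipeline in Python (all of the names, mappings and numbers are made-up stand-ins, not my actual values):

```python
def motivations(state):
    # Map normalised state values to motivation values; mappings may be
    # one-to-one, combinations of several states, or non-linear.
    return {
        "eat":  state["hunger"] ** 2,                            # non-linear
        "rest": 0.7 * state["fatigue"] + 0.3 * state["stress"],  # combined
    }

def plan_utility(plan, state):
    # Predict the plan's effect on each state value (clamped to [0, 1]).
    predicted = {k: min(1.0, max(0.0, v + plan["effects"].get(k, 0.0)))
                 for k, v in state.items()}
    before, after = motivations(state), motivations(predicted)
    # Motivation-specific utility = how much the plan reduces that
    # motivation; the global utility is the sum of these, each weighted
    # by the motivation's current value.
    return sum(before[name] * (before[name] - after[name]) for name in before)

def select_plan(plans, state):
    return max(plans, key=lambda p: plan_utility(p, state))

state = {"hunger": 0.8, "fatigue": 0.4, "stress": 0.2}
plans = [{"name": "find_food", "effects": {"hunger": -0.5, "fatigue": +0.1}},
         {"name": "sleep",     "effects": {"fatigue": -0.6}}]
print(select_plan(plans, state)["name"])  # -> find_food
```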
My problem is that, once selected, plans execute to completion, and the completion conditions have to be determined somewhat arbitrarily. I want to change the system so that plans can be interrupted, either by higher-priority motivations or because it is no longer beneficial to continue. I have been re-evaluating the plan selection at every update (ignoring performance for now), and the behaviour very quickly finds a cycle: it addresses one motivation until another is higher priority, swaps to that one until the previous is higher priority again, swaps back, and so on.
I obviously do not want the cyclic behaviour; however, I do want interruptions to be possible - in particular, to address new information, i.e. 'startling' events. I don't want to simply 'latch' the behaviour until it is complete, as there may be no concrete finishing point.
I have tried to implement a system of activation and inhibition, where a need is activated by one state value and only inhibited when some other, indirectly linked state value crosses a threshold. Even then it is still possible to fall into a rapid cycle, because repeated plan selection quickly converges on the balance point where the inhibition of one motivation is matched against the activation of another.
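For illustration, the activation/inhibition scheme looks roughly like this (all names and thresholds are placeholders):

```python
ACTIVATION = {"eat": ("hunger",  0.6)}   # (driving state, threshold)
INHIBITION = {"eat": ("satiety", 0.8)}   # (indirectly linked state, threshold)

def update_active(name, state, currently_active):
    # A motivation switches on when its driving state crosses the
    # activation threshold; once active, only the inhibitory state
    # crossing its threshold can switch it off.
    act_key, act_thr = ACTIVATION[name]
    inh_key, inh_thr = INHIBITION[name]
    if not currently_active:
        return state[act_key] >= act_thr
    return state[inh_key] < inh_thr
```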
I've searched a lot for research / documentation on this kind of problem. Even though it feels like quite a natural problem that must come up a lot, I cannot find anything that deals specifically with plans selected via a utility function *as well as* their update and interruption later due to continuous changes in utility.
It feels like there should be a fairly straightforward, common sense solution to this - thanks in advance for any suggestions!
The design you described initially (before the paragraph that starts with "My problem is that...") makes perfect sense to me. Couldn't you simply have a penalty in utility for aborting a plan? That would give your plan selection some hysteresis without disturbing your general scheme too much.
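Roughly like this (a sketch only; plan_utility() stands in for whatever your existing global-utility calculation is, and the penalty constant is a tuning knob):

```python
SWITCH_PENALTY = 0.15  # how reluctant the agent is to abort; tune to taste

def plan_utility(plan, state):
    return plan["base_utility"]  # stand-in for the existing calculation

def select_with_hysteresis(plans, state, current_plan):
    # Every plan other than the one currently executing pays the abort
    # penalty, so a challenger has to beat the incumbent by a margin.
    def score(plan):
        u = plan_utility(plan, state)
        if current_plan is not None and plan is not current_plan:
            u -= SWITCH_PENALTY
        return u
    return max(plans, key=score)
```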
In addition to alvaro's simple and elegant suggestion, a less "in-system" way of handling cyclic behavior would be to keep a history of past actions (or plans, etc.) and use a pattern-matching predictor[1] to penalize actions based on how likely they are to occur. If a cycle forms, the predictor should quickly learn to predict it, and will thus assign the actions in the cycle a high probability, which translates into a higher penalty. A simple formula would be to multiply the utility of any action by (1.0 - probability).
[1] I can never remember the name of the simplest algorithm I know for pattern-based prediction - it's a lot like the "longest substring match" look-up used in compression algorithms like LZ77. Basically, you start with N=1, look at the last N characters of the buffer, and try to find that sequence elsewhere in the buffer. If you find matches, you can see what character comes next and assign it an appropriate probability based on the number of matches. Then you try with N=2, find matches and calculate probabilities, and keep increasing N until no matches are found, at which point you use the statistics from the previous N. Essentially, you get an n-order model without having to fix it at any particular order. You can then do fancy things like weighting matches near the end of the history buffer (more recent patterns) higher so that new patterns are discovered more quickly, etc.
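A rough sketch of that predictor, assuming the history is a list of plan names or action IDs (the recency weighting is left out):

```python
from collections import Counter

def predict_next(history):
    # Variable-order match predictor: grow the context length N until
    # the last-N-symbols suffix no longer occurs earlier in the history,
    # then use the statistics from the longest order that still matched.
    best, n = None, 1
    while n <= len(history) - 1:
        context = history[-n:]
        counts = Counter()
        for i in range(len(history) - n):
            if history[i:i + n] == context:
                counts[history[i + n]] += 1  # symbol that followed the match
        if not counts:
            break  # no matches at this order; keep the previous one
        best, n = counts, n + 1
    if best is None:
        return {}
    total = sum(best.values())
    return {symbol: c / total for symbol, c in best.items()}

# Penalise predictable actions, per the formula above:
#   utility *= 1.0 - predict_next(history).get(action, 0.0)
history = ["eat", "rest", "eat", "rest", "eat"]
print(predict_next(history))  # the cycle makes "rest" highly probable
```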
"Walk not the trodden path, for it has borne it's burden." -John, Flying Monk
You were on the right track when you were talking about building in weights for things that are almost done. That way you don't strobe as much. Another method for doing this that handles minor digressions better is including a computation for the utility of time. That is, changing from Noble Task A to Noble Task B might not be a good idea, but from Noble Task A to Tiny Task C is not a big deal... you can pick up on A again when you are done.
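As a sketch of the time idea (all fields and numbers hypothetical), charge each candidate for how long it would delay the current task, so short digressions stay cheap while switches between big tasks are expensive:

```python
TIME_WEIGHT = 0.05  # hypothetical: utility lost per second of delay

def plan_utility(plan, state):
    return plan["base_utility"]  # stand-in for the existing calculation

def digression_score(candidate, current, state):
    u = plan_utility(candidate, state)
    if current is not None and candidate is not current:
        # A short Tiny Task C pays a small time charge and stays viable;
        # a full switch to Noble Task B pays for its whole duration.
        u -= TIME_WEIGHT * candidate["duration"]
    return u
```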
In general, you are looking at the concepts of partial order planning and re-planning, however. Look up those terms and you should get more ideas.
Dave Mark - President and Lead Designer of Intrinsic Algorithm LLC
Professional consultant on game AI, mathematical modeling, simulation modeling
Co-founder and 10 year advisor of the GDC AI Summit
Author of the book, Behavioral Mathematics for Game AI
Blogs I write:
IA News - What's happening at IA | IA on AI - AI news and notes | Post-Play'em - Observations on AI of games I play
"Reducing the world to mathematical equations!"
Quote: Original post by PeteZ
You need to add an inertia factor to the evaluation calculations, so that the further along you are in the task (closer to / increasing probability of achieving the goal/payoff), the higher the 'stay on target' factor gets. The progress made towards the goal and the expenditure of resources (including time) need to count somehow. Opportunity situations should be reacted to if their payoff is easily achieved (and possibly without losing much progress on the original task). If the situation changes significantly and the original solution's probability of success falls sufficiently, then it can warrant discarding the effort/resources spent, if the payoff is no longer 'worth' further expenditures at the higher cost.
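A sketch of such an inertia factor (the progress measure and the bonus curve are placeholders to tune):

```python
def plan_utility(plan, state):
    return plan["base_utility"]  # stand-in for the existing calculation

def effective_utility(plan, state, current_plan, progress):
    # progress in [0, 1]: how close the current plan is to its payoff.
    u = plan_utility(plan, state)
    if plan is current_plan:
        # 'Stay on target' factor: grows with progress, so the more
        # effort is sunk, the bigger the lead a rival plan needs.
        # Quadratic, so there is little inertia early on and a lot late.
        u *= 1.0 + 0.5 * progress ** 2
    return u
```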
It often happens, if the evaluations are done right for such a changeable environment, that solutions to goals are best chosen with an emphasis on versatility -- where the expenditures made lead to positions/situations in which payoffs have a high chance of near-term achievement and there are opportunities for multiple payoff options. Different strategies can be 'preferred' by different objects (e.g. low risk / low payoff / high frequency vs. high risk / high payoff / low frequency), especially in an environment where other objects compete for the same goals.
Ratings are Opinion, not Fact
Hi guys, thanks to all of you for your responses :)
Alvaro: a utility penalty would probably help reduce the cyclic behaviour; however, my focus is very much on being able to tweak the balance of different behaviours while also grounding it in a 'common sense' logic - i.e. if something is not happening, there should be a visible chain of cause and effect explaining why. My worry with a utility penalty is that it introduces further numerical complexity that would need to be balanced for each combination of competing motivations.
Extrarius: I can see how pattern prediction might work over many iterations; however, for my application I want each single action to be understandable and sensible, as well as grounded in some logic about the state of the world, so I don't think it suits my purpose.
InnocuousFox: Partial order planning appears a little more complex than what I need - in my particular case I want my agents to be simple and reactive (with the player responsible for their long-term goals); they don't need to form plans that require reasoning about cause and effect. However, looking into this did make me realise that what I've been missing is an explicit consideration of future concerns.
wodinoneeye: yep, I think you're right about the 'inertia factor' in the decision calculations, and that this shouldn't be a simple bias but something qualitative about how the decision is made that makes it self-reinforcing. So in a relatively ambiguous environment, any progress made towards a decision already taken helps future evaluations converge on that option, as opposed to just eliminating the criteria that led to it being selected in the first place (as it currently does).
Bearing all this in mind, the next approach I'll look into is this: instead of trying to minimise the weighted sum of motivations at any given moment, there needs to be some idea of minimising the area under the graph of motivations against time, taking the initial costs of plans into account.
In this way, each step towards a particular goal reduces that plan's remaining initial cost, making it more favourable; and once the agent is engaged in the part of the plan that actually reduces the motivation, it will still be preferable to keep reducing it, even when other motivations rise above it, because the cost of switching from this plan to another outweighs the cost of completing the current plan before switching.
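In idealised form, what I want to minimise looks something like this (just a sketch; simulate() is stubbed because predicting future state is the hard part, and motivations() stands in for the state-to-motivation mapping from my original post):

```python
def motivations(state):
    # Stand-in for the state -> motivation mapping described earlier.
    return {"eat": state["hunger"], "rest": state["fatigue"]}

DT, HORIZON = 0.5, 10.0  # hypothetical timestep and look-ahead window

def integrated_cost(plan, state, simulate):
    # simulate(plan, state, t) -> predicted state at time t, including
    # the plan's initial cost phase before it starts paying off.
    cost, t = 0.0, 0.0
    while t < HORIZON:
        future = simulate(plan, state, t)
        # Accumulate the area under the summed motivation curve.
        cost += sum(motivations(future).values()) * DT
        t += DT
    return cost  # smaller area = better plan

def select_plan_over_time(plans, state, simulate):
    return min(plans, key=lambda p: integrated_cost(p, state, simulate))
```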
I don't expect this to be easy, though; in particular, I don't expect to evaluate this future integral directly, but rather to attack it heuristically, since the motivations, along with their activations and inhibitions, are themselves heuristics for maintaining ideal state values. In the end I might be making this overly numerically complex after all :S - but for now it seems worth pursuing this route.
Thanks again for all your feedback!