Type of determinism of an environment - artificial-intelligence

While studying the environment in which an agent has to work, I have to describe its determinism.
An environment is deterministic if the next state is perfectly predictable given knowledge of the previous state and the agent's action. Otherwise it is non-deterministic.
My question is: what if I know that the next state is one of a few states I can calculate, but I can't tell exactly which one? Is it still non-deterministic, or is it a different type of determinism?

If you know which possible resulting states there are, and you know the associated probabilities, the environment is still said to be non-deterministic (this case is often called stochastic). This assumes that there actually are probabilities involved (for instance, rolling a die and transitioning to a state depending on the outcome of that roll).
If you don't know which possible resulting states there are, and/or you don't know the associated probabilities, it is also said to be non-deterministic. For example, this is the case if you are rolling an unfair die, where not every outcome has an equal probability, and you do not know in which way it is unfair (is the "1" more likely, or the "6"?).
If there aren't any probabilities involved in the state transitions, but you simply don't know the rules that determine which state you transition to, the environment itself is still deterministic. For example, this can be the case in a game where you need to explore the map to figure out where an opponent is located. In this case you say you have incomplete information.
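To make the distinction concrete, here is a minimal Python sketch (with made-up state names and probabilities) contrasting a deterministic transition function with a stochastic one whose outcome probabilities are known:

    import random

    # Deterministic: the next state is fully determined by (state, action).
    # The state names and the tiny transition table are made up for illustration.
    def deterministic_step(state, action):
        transition_table = {
            ("s0", "right"): "s1",
            ("s1", "right"): "s2",
        }
        return transition_table[(state, action)]

    # Stochastic (non-deterministic with known probabilities): the next state
    # is drawn from a known distribution over a few candidate states.
    def stochastic_step(state, action):
        candidates = ["s1", "s2", "s3"]
        probabilities = [0.6, 0.3, 0.1]  # known transition probabilities
        return random.choices(candidates, weights=probabilities, k=1)[0]

In the third case above, a function like deterministic_step still exists in the environment; the agent just does not know its transition table.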

Related

Reinforcement Learning in a Dynamic Environment with a large state-action space

I have a 500*500 grid with 7 different penalty values. I need to make an RL agent whose action space contains 11 actions (Left, Right, Up, Down, the 4 diagonal directions, Speed Up, Speed Down and Normal Speed). How can I solve this problem?
The chosen action is performed with probability 0.8; otherwise a random action is selected. Also, the penalty values can change dynamically.
Take a look at this chapter by Sutton: incompleteideas.net/sutton/book/ebook/node15.html, especially the experiments in the later sections. Your problem is similar to the N-armed bandit in that each of the arms returns a normal distribution of reward. While that chapter mostly focuses on exploration, the same ideas apply.
Another way to look at it: if a state really returns a normal distribution of penalties, you will need to explore the domain sufficiently to estimate the mean of each (state, action) tuple. The mean in these cases is Q*, which will give you the optimal policy.
As a follow-up, if the state space is too large or continuous, it may be worth looking into generalization with a function approximator. While the same convergence rules apply, there are cases where function approximation runs into issues. I would say that is beyond the scope of this discussion, though.
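As a rough sketch of the sample-average idea above (estimating the mean reward of each state-action pair and acting greedily most of the time), the following Python fragment may help; the epsilon value and the flat action encoding are placeholders rather than a full solution to the 500*500 grid problem:

    import random
    from collections import defaultdict

    Q = defaultdict(float)     # running estimate of the mean reward per (state, action)
    N = defaultdict(int)       # visit count per (state, action)
    ACTIONS = list(range(11))  # placeholder encoding of the 11 actions

    def choose_action(state, epsilon=0.1):
        # Epsilon-greedy: explore occasionally, otherwise act on current estimates.
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(state, a)])

    def update(state, action, reward):
        # Incremental sample average; with enough visits this converges to the
        # mean reward of the (state, action) pair (Q* in the stationary case).
        N[(state, action)] += 1
        Q[(state, action)] += (reward - Q[(state, action)]) / N[(state, action)]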

Why do we weight recent rewards higher in non-stationary reinforcement learning?

The book "Reinforcement Learning: An Introduction" by Sutton and Barto mentions the following about non-stationary RL problems:
"we often encounter reinforcement learning problems that are effectively nonstationary. In such cases, it makes sense to weight recent rewards more heavily than long-past ones." (see https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node20.html)
I am not absolutely convinced by this. For example, an explorer agent whose task is to find an exit from a maze might actually lose because it made a wrong choice in the distant past. Could you please explain, in simple terms, why it makes sense to weight more recent rewards higher?
If the problem is non-stationary, then past experience is increasingly out of date and should be given lower weight. That way, if the explorer made a mistake in the distant past, the mistake is overwritten by more recent experience.
The text explicitly refers to nonstationary problems. In such problems, the MDP characteristics change. For example, the environment can change and therefore the transition matrix or the reward function might be different. In this case, a reward collected in the past might not be significant anymore.
In your example, the MDP is stationary, because the maze never changes, so your statement is correct. If (for example) the exit of the maze changed according to some law which you do not know, then it would make sense to weight recent rewards more (for example, if the reward is the Manhattan distance from the agent's position to the exit).
In general, dealing with nonstationary MDPs is very complex, because usually you don't know how the characteristics change (in the example above, you don't know how the exit location changes). Conversely, if you know the law determining how the environment changes, you should include it in the MDP model.
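One standard recipe for this situation, discussed in the same part of Sutton and Barto, is a constant step-size update, which implicitly weights recent rewards exponentially more than old ones. A minimal sketch (the step size and the toy reward sequence are just illustrative):

    def recency_weighted_update(q, reward, alpha=0.1):
        # Constant step-size update: Q <- Q + alpha * (R - Q).
        # Each older reward is down-weighted by a factor of (1 - alpha) per step,
        # so recent rewards dominate the estimate when the problem drifts.
        return q + alpha * (reward - q)

    # Example: the underlying mean reward shifts halfway through; the estimate
    # tracks the new level instead of averaging over the entire history.
    q = 0.0
    for r in [1, 1, 1, 1, 5, 5, 5, 5]:
        q = recency_weighted_update(q, r)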

Quantification over actions

I am looking for examples that show how to quantify over actions (and perhaps fluents?) in the situation calculus (Reiter 2001).
I understand the difference between actions, fluents and situations, but why do they need to be represented in second-order logic? Why not use first-order logic? Can you please explain?
Only a few of the formulas used to encode a dynamic world require Second Order Logic (SOL). In particular,
the [properties of the] Initial State
the Action Preconditions
the Action Effects
the Successor State axioms (used to circumvent the "Frame problem")
can all be expressed with First Order Logic (FOL).
With some domains it may be convenient (and more concise) to use SOL for the above items, but AFAIK it is always possible, for a finite domain, to convert such SOL into FOL in the context of the items listed above, and hence SOL is not necessary (again, for the above items).
Typically the need for SOL in the Situation Calculus only comes from some of the "Foundational axioms", such as the axiom used to perform induction on situations.
Furthermore, depending on the particular application, one may not need the SOL-based Foundational axiom(s), and hence the whole world can be described exclusively in terms of FOL.
I'm not a specialist in the field, but I think that in many cases, for the purposes of [logical] filtering, planning, and temporal projection, we can do away with the need for induction and hence rely on FOL only.
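For concreteness, here is a textbook-style successor-state axiom written in plain FOL (in LaTeX notation); the fluents Broken/Fragile and the actions drop/repair are illustrative rather than quoted from Reiter. Note that quantifying over the action a and the situation s is ordinary first-order quantification, because actions and situations are just terms:

    \forall x \,\forall a \,\forall s \;\big[\, \mathit{Broken}(x, \mathit{do}(a, s)) \equiv
        (a = \mathit{drop}(x) \land \mathit{Fragile}(x, s))
        \lor (\mathit{Broken}(x, s) \land a \neq \mathit{repair}(x)) \,\big]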

Best way to automate testing of AI algorithms?

I'm wondering how people test artificial intelligence algorithms in an automated fashion.
One example would be the Turing Test: say there were a number of submissions for a contest. Is there any conceivable way to score candidates in an automated fashion, other than just having humans test them out?
I've also seen some data sets (obscured images of numbers/letters, groups of photos, etc.) that can be fed in and learned over time. What good resources are out there for this?
One challenge I see: you don't want an algorithm that tailors itself to the test data over time, since you are trying to see how well it does in the general case. Are there any techniques to ensure it doesn't do this, such as giving it a random test each time, or averaging its results over a bunch of random tests?
Basically, given a bunch of algorithms, I want some automated process to feed them data and see how well they "learned" it or can predict new things they haven't seen yet.
This is a complex topic: good AI algorithms are generally the ones that generalize well to "unseen" data. The simplest method is to have two datasets: a training set and an evaluation set used for measuring performance. But generally you want to "tune" your algorithm, so you may want three datasets: one for learning, one for tuning, and one for evaluation. What counts as tuning depends on your algorithm, but a typical example is a model with a few hyper-parameters (for example, parameters of your Bayesian prior, under the Bayesian view of learning) that you would like to tune on a separate dataset. The learning procedure would already have set values for them (or maybe you hardcoded their values), but having enough data lets you tune them separately.
As for making those separate datasets, there are many ways to do so, for example by dividing the available data into subsets used for different purposes. There is a tradeoff to be made, because you want as much data as possible for training, but you also want enough data for evaluation (assuming you are in the design phase of your new algorithm/product).
A standard, systematic way to do this with a known dataset is cross-validation.
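As a sketch of the idea (not tied to any particular library), k-fold cross-validation can be as simple as the following; fit and evaluate in the usage comment are hypothetical stand-ins for whatever training and scoring routines you use:

    import random

    def k_fold_splits(data, k=5, seed=0):
        # Shuffle a copy of the data once, then carve it into k roughly equal folds.
        data = list(data)
        random.Random(seed).shuffle(data)
        folds = [data[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, test

    # Hypothetical usage: average a score over the k held-out folds.
    # scores = [evaluate(fit(train), test) for train, test in k_fold_splits(dataset)]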
Generally when it comes to this sort of thing you have two datasets: one large "training set" which you use to build and tune the algorithm, and a separate smaller "probe set" that you use to evaluate its performance.
#Anon has the right of things: training sets and what I'll call validation sets. That noted, the bits and pieces I see about developments in this field point to two things:
Bayesian classifiers: there's probably something like this filtering your email. In short, you train the algorithm to make a probabilistic decision about whether a particular item belongs to a group or not (e.g. spam vs. ham); see the toy sketch after this list.
Multiple classifiers: this is the approach the winning group in the Netflix challenge took, where the point is not to optimize one particular algorithm (e.g. Bayesian methods, genetic programming, neural networks, etc.) but to combine several of them to get a better result.
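To make the first item a little more concrete, here is a toy word-count Naive Bayes scorer; it is a rough sketch, and the spam/ham framing and function names are purely illustrative:

    import math
    from collections import Counter

    def train_naive_bayes(labeled_docs):
        # labeled_docs: iterable of (label, list_of_words), e.g. ("spam", ["win", "cash"]).
        word_counts = {}
        class_counts = Counter()
        for label, words in labeled_docs:
            class_counts[label] += 1
            word_counts.setdefault(label, Counter()).update(words)
        return word_counts, class_counts

    def classify(words, word_counts, class_counts):
        # Pick the class with the highest log-posterior, using add-one smoothing.
        vocab = {w for counts in word_counts.values() for w in counts}
        best_label, best_score = None, float("-inf")
        for label, n_docs in class_counts.items():
            total = sum(word_counts[label].values())
            score = math.log(n_docs / sum(class_counts.values()))
            for w in words:
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label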
As for data sets, Weka has several available. I haven't explored other libraries for data sets, but mloss.org appears to be a good resource. Finally, data.gov offers a lot of sets that provide some interesting opportunities.
Training sets and test sets are very common for K-means and other clustering algorithms, but to build something that is artificially intelligent without supervised learning (which is what having a training set implies), you are building a "brain", so to speak, based on, for example:
In chess: all possible future states reachable from the current game state.
In most AI learning (reinforcement learning) you have a problem where the "agent" is trained by playing the game over and over. Basically, you ascribe a value to every state, and then you assign an expected value to each possible action in a state.
So say you have S states and A actions per state (although you might have more possible moves in one state and fewer in another); then you want to figure out the most valuable states to be in, and the most valuable actions to take.
In order to figure out the values of states and their corresponding actions, you have to iterate through the game. Probabilistically, a certain sequence of states will lead to victory or defeat, and basically you learn which states lead to failure; these are "bad" states. You also learn which ones are more likely to lead to victory; these are "good" states. Each gets a mathematical value associated with it, usually an expected reward.
Reward from second-last state to a winning state: +10
Reward if entering a losing state: -10
Negative rewards then propagate backwards: from the losing state to the state that preceded it, then to the state before that, and so on.
Eventually, you have a mapping of expected reward based on which state you're in and which action you take, and you can find the "optimal" sequence of steps to take. This is often referred to as an optimal policy.
Conversely, the ordinary courses of action that you step through while deriving the optimal policy are simply called policies; with Q-learning you are always following some "policy".
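Here is a minimal tabular Q-learning sketch of the process described above, assuming hypothetical env_reset and env_step functions supplied by your game or environment; it shows how terminal rewards such as the +10 / -10 above leak backwards into the value estimates:

    import random
    from collections import defaultdict

    Q = defaultdict(float)  # Q[(state, action)] -> expected-return estimate
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    def q_learning_episode(env_reset, env_step, actions):
        # env_reset() -> start state; env_step(state, action) -> (next_state, reward, done).
        # Both are assumed to be provided by your environment.
        state, done = env_reset(), False
        while not done:
            if random.random() < EPSILON:
                action = random.choice(actions)                      # explore
            else:
                action = max(actions, key=lambda a: Q[(state, a)])   # exploit
            next_state, reward, done = env_step(state, action)
            # The bootstrapped target is how terminal rewards (e.g. +10 / -10)
            # flow backwards into the values of the states that led there.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state = next_state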
Usually, deciding how to assign the reward is the interesting part. Suppose I reward you for each state transition that does not lead to failure. Then the value of walking through all the states until termination is however many increments I made, i.e. however many state transitions there were.
If certain states carry extremely low value, loss is easy to avoid, because almost all bad states are avoided.
However, you don't want to discourage the discovery of new, potentially more efficient paths that deviate from the one route you already know works. You want to reward and punish the agent in such a way that it keeps "winning" or "keeping the pole balanced" for as long as possible, but you don't want it stuck at a local maximum or minimum because failure is so painful that no new, unexplored routes are ever tried. (There are many approaches in addition to this one.)
So when you ask "how do you test AI algorithms?", the best part is that the testing itself is how many of these "algorithms" are constructed. The algorithm is designed to test a certain course of action (policy). It's much more complicated than
"turn left every half mile"
it's more like
"turn left every half mile if I have turned right 3 times and then turned left 2 times and had a quarter in my left pocket to pay fare... etc etc"
It's very precise.
So the testing is usually actually how the A.I. is being programmed. Most models are just probabilistic representations of what is probably good and probably bad. Calculating every possible state is easier for computers (we thought!) because they can focus on one task for very long periods of time, and how much they remember is exactly how much RAM you have. However, we learn by affecting neurons in a probabilistic manner, which is why the memristor is such a great discovery: it behaves much like a neuron.
You should look at neural networks; they are mind-blowing. The first time I read about making a "brain" out of a matrix of fake neuron synaptic connections, a brain that can "remember", it basically rocked my universe.
A.I. research is mostly probabilistic because we don't know how to make "thinking"; we just know how to imitate our own inner learning process of try, try again.

In game programming, how can I test whether a heuristic used is consistent or not?

I have thought of some heuristics for a big (higher-dimensional) tic-tac-toe game. How do I check which of them are actually consistent?
What is meant by consistency, anyway?
Heuristics produce some sort of cost value for a given state. Consistency in this context means that the estimate for a state is less than or equal to the cost of moving to the next state plus the estimate for that new state. If this weren't true, it would imply that, if the heuristic were accurate, transitioning from one state to the next could incur negative cost, which is typically impossible or incorrect.
This is intuitive to show in pathfinding, where you expect every step along the path to take some time, so the estimate at one step can drop by at most the cost of that step as you move to the next. It's probably a bit more complex for tic-tac-toe, since you have to decide somewhat arbitrarily what constitutes a 'cost' in your system. If your heuristic can go both up and down as a result of playing a move, e.g. because you encode good moves with positive numbers and bad moves with negative numbers, then your heuristic cannot be consistent.
However, the lack of a consistent heuristic is not always a problem. You may not be guaranteed to reach an optimal solution without one, but it may still speed up the search compared to a brute-force state search.
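If you can enumerate the states and legal moves of your game, one pragmatic way to check consistency is simply to verify the inequality h(s) <= cost(s, s') + h(s') over every transition. A rough sketch, assuming hypothetical successors and step_cost helpers written for your own game representation:

    def is_consistent(h, states, successors, step_cost):
        # h(s): heuristic estimate for state s; successors(s): legal next states;
        # step_cost(s, s2): actual cost of the move. All three are assumed helpers.
        for s in states:
            for s2 in successors(s):
                if h(s) > step_cost(s, s2) + h(s2):
                    return False  # this transition violates h(s) <= c(s, s2) + h(s2)
        return True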
EDITED: This answer confused admissibility and consistency. I have corrected it to refer to admissibility, but the original question was about consistency, and this answer does not fully answer the question.
You could do it analytically, by distinguishing all the different cases and thereby proving that your heuristic is indeed admissible.
For informed search, a heuristic is admissible with respect to a search problem (say, the search for the best move in a game) if and only if it underestimates the 'distance' to a suitable goal state.
EXAMPLE: Search for the shortest route to a target city via a network of highways between cities. Here, one could use the Euclidean distance as a heuristic: the length of a straight line to the goal is always shorter than or equal to the best possible route.
Admissibility is required by algorithms like A*, which then guarantee optimality (i.e. they will find the best 'route' to a goal state if one exists).
I would recommend looking the topic up in an AI textbook.
