Non-continuous values as inputs to recurrent neural network? - artificial-intelligence

I am hoping to use a recurrent neural network to perform time series predictions.
Some of my inputs are between 0 and 1, while others can be > 1.
My goal is to analyze stock market trends (please do not lecture me on this...this is only an example). Here are some of my inputs:
Price change % from opening (>= 0)
Slope (decrease/increase) of stock in last n minutes (>= 0).
Whether stock is likely at one of its peaks for the day (0 to 1).
As you can see, some of the inputs are not between 0 and 1.
Can a recurrent network accept numeric input values that are not bounded to [0, 1] (i.e. values > 1)? If so, can I just pass those values in directly as inputs?
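Yes, an RNN will accept any real-valued inputs, but gradient-based training usually behaves better when all features share a similar scale. A minimal sketch of min-max scaling each feature into [0, 1] (the numbers are made up; the three columns only stand in for features like the ones listed above):

```python
import numpy as np

# Hypothetical feature matrix: rows = time steps, columns = features
# (e.g. price change %, slope, peak likelihood). Values are illustrative.
X = np.array([
    [2.5, 1.8, 0.9],
    [0.3, 4.2, 0.1],
    [1.1, 0.0, 0.5],
])

# Min-max scale each feature column into [0, 1] so that unbounded
# inputs (e.g. a slope > 1) land in the same range as the bounded ones.
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)
```

In practice the min and max would be computed on the training set only and reused for new data.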

Related

Random cut forest score distribution

I am using SageMaker Random Cut Forest. It seems that setting the hyperparameters num_samples_per_tree and num_trees too high (so that num_samples_per_tree * num_trees exceeds the number of records) leads to a very strange score distribution.
Example 1 - number of records in training set approx. 200,000, num_samples_per_tree = 2048, num_trees = 256
Example 2 - number of records in training set approx. 200,000, num_samples_per_tree = 1028, num_trees = 100
Samples should be drawn using reservoir sampling, so I do not see any reason why this happens. Is there any relationship between the number of records and these hyperparameters?

Monte Carlo Tree Search for card games like Belot and Bridge, and so on

I've been trying to apply MCTS to card games. Basically, I need a formula, or a modification of the UCB formula, that works best when selecting which node to proceed to.
The problem is that these card games are not win/loss games; each node has a score distribution, like 158:102 for example. We have 2 teams, so it is basically a 2-player game. The games I'm testing are constant-sum games (number of tricks, or some score derived from the tricks taken, and so on).
Let's say the maximum sum of team A's and team B's scores is 260 at each leaf. I search for the best move from the root, and the first one I try gives an average of 250 after 10 tries. I have 3 more possible moves that have never been tested. Because 250 is so close to the maximum score, the regret factor for testing another move is very high. So what formula, ideally one that can be mathematically proven optimal, tells you which move to choose when you have:
Xm - average score for move m
Nm - number of tries for move m
MAX - maximum score that can be made
MIN - minimum score that can be made
Obviously the more you try the same move, the more you want to try other moves; but the closer you are to the maximum score, the less you want to try others. What is the best mathematical way to choose a move based on these factors Xm, Nm, MAX and MIN?
Your problem is essentially an exploration problem, and the issue is that with the plain Upper Confidence Bound (UCB) the amount of exploration cannot be tuned directly. This can be solved by adding an exploration constant.
The Upper Confidence Bound (UCB) is calculated as follows:
UCB(s, a) = V(s, a) + sqrt(2 * ln n(s) / n(s, a))
with V being the value function (expected score) which you are trying to optimize, s the state you are in (the cards in the hands), and a the action (playing a card, for example). n(s) is the number of times state s has been visited in the Monte Carlo simulations, and n(s, a) the same for the combination of s and action a.
The left part, V(s, a), is used to exploit knowledge of the previously obtained scores, while the right part adds a value to encourage exploration. However, there is no way to increase or decrease this exploration term, and that is what the Upper Confidence Bounds for Trees (UCT) variant provides:
UCT(s, a) = V(s, a) + 2 * Cp * sqrt(2 * ln n(s) / n(s, a))
Here Cp > 0 is the exploration constant, which can be used to tune the exploration. It was shown that
Cp = 1 / sqrt(2)
satisfies Hoeffding's inequality if the rewards (scores) are between 0 and 1 (in [0, 1]), so for your game you would first rescale the scores into [0, 1] using MIN and MAX.
Silver & Veness propose Cp = Rhi - Rlo, with Rhi being the highest value returned using Cp = 0, and Rlo the lowest value observed during the roll-outs (i.e. when actions are chosen randomly because no value estimate exists yet).
Reference:
Cameron Browne, Edward J. Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton.
A Survey of Monte Carlo Tree Search Methods.
IEEE Trans. Comp. Intell. AI Games, 4(1):1–43, 2012.
Silver, D., & Veness, J. (2010). Monte-Carlo planning in large POMDPs. Advances in Neural Information Processing Systems, 1–9.

How come random weight initialization is better than just using 0 as weights in an ANN?

In a trained neural net the weight distribution falls close around zero. So it makes sense to me to initialize all weights to zero. However, there are methods such as random assignment from -1 to 1 and Nguyen-Widrow that outperform zero initialization. How come these random methods are better than just using zero?
Activation & learning:
In addition to what cr0ss said, in a normal MLP (for example) the activation of layer n+1 is the dot product of the output of layer n and the weights between layers n and n+1, so you basically get this equation for the activation a of neuron i in layer n:
a_i = sum_j (w_ij * o_j) + b_i
where w_ij is the weight of the connection from neuron j (in the parent layer n-1) to the current neuron i (in layer n), o_j is the output of neuron j (parent layer), and b_i is the bias of neuron i in the current layer.
It is easy to see that initializing the weights to zero would practically "deactivate" them: every product of a weight with a parent-layer output would be zero, so in the first learning steps your input data would not be recognized; it would be neglected entirely.
So in the first epochs, learning would rely only on the information supplied by the bias terms.
This obviously makes learning much more challenging for the network and heavily increases the number of epochs needed.
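A tiny numpy demonstration of this point (the layer sizes are made up; the claim illustrated is only that zero weights make the activation independent of the input):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.normal(size=4)              # two different input vectors
x2 = rng.normal(size=4)

W_zero = np.zeros((3, 4))            # zero-initialized weights
W_rand = rng.uniform(-1, 1, (3, 4))  # random init in [-1, 1]
b = np.zeros(3)

# With zero weights, both inputs give the same activation (just the bias),
# so the network cannot tell its inputs apart:
a1 = W_zero @ x1 + b
a2 = W_zero @ x2 + b

# With random weights, the activations differ, so the input is "seen":
r1 = W_rand @ x1 + b
r2 = W_rand @ x2 + b
```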
Initialization should be optimized for your problem:
Initializing your weights with random floats drawn from -1 <= w <= 1 is the most typical choice, because overall (if you do not analyze the problem / domain you are working on) it guarantees that some weights are relatively good right from the start. Moreover, with a fixed initialization all neurons start out identical and co-adapt to each other, while random initialization breaks this symmetry and ensures better learning.
However, -1 <= w <= 1 is not optimal for every problem. For example, biological neural networks do not have negative outputs, so weights should be positive if you try to imitate biological networks. Furthermore, in image processing for instance, most neurons either have a fairly high output or send nearly nothing. Considering this, it is often a good idea to initialize weights between something like 0.2 <= w <= 1; sometimes even 0.5 <= w <= 2 has shown good results (e.g. for dark images).
So the number of epochs needed to learn a problem properly depends not only on the layers, their connectivity, the transfer functions, the learning rules and so on, but also on the initialization of your weights.
You should try several configurations. In most situations you can work out which choices are adequate (such as higher, positive weights for processing dark images).
Reading the Nguyen article, I'd say it is because when you assign the weights from -1 to 1, you are already defining a "direction" for each weight, and learning then only has to work out whether that direction is correct and what the magnitude should be.
If you assign all the weights zero (in an MLP neural network), there is no direction to start from; zero is a neutral number.
Therefore, if you assign a small non-zero value to each node's weight, the network will learn faster.
Read the "Picking initial weights to speed training" section of the article. It states:
First, the elements of Wi are assigned values from a uniform random distribution between -1 and 1 so that its direction is random. Next, we adjust the magnitude of the weight vectors Wi, so that each hidden node is linear over only a small interval.
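The quoted procedure can be sketched roughly as follows (a Nguyen-Widrow-style initialization; the 0.7 scale factor follows the commonly cited formulation, and a specific library's implementation may differ in details):

```python
import numpy as np

def nguyen_widrow_init(n_in, n_hidden, rng=None):
    """Nguyen-Widrow-style initialization for one hidden layer.

    Weights start uniform in [-1, 1] (random direction), then each
    hidden unit's weight vector is rescaled so the unit is linear
    over only a small interval of the input space.
    """
    rng = rng or np.random.default_rng()
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))
    beta = 0.7 * n_hidden ** (1.0 / n_in)       # common scale factor
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return beta * W / norms                     # each row now has norm beta

W = nguyen_widrow_init(n_in=3, n_hidden=5, rng=np.random.default_rng(42))
```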
Hope it helps.

Is there a supervised learning algorithm that takes tags as input, and produces a probability as output?

Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database", however "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
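The "n-gram tags" idea can be sketched as a small feature-expansion helper (the naming scheme is illustrative); the expanded features can then be fed to naive Bayes or any other classifier, which lets a linear model capture tag interactions like the java/database example above:

```python
from itertools import combinations

def tag_features(tags):
    """Expand a question's tag set into unigram and pair ("2-gram")
    features, so that tag interactions become explicit features."""
    tags = sorted(tags)
    feats = list(tags)
    feats += ["+".join(p) for p in combinations(tags, 2)]
    return feats

# e.g. ["java", "database"] -> ["database", "java", "database+java"]
```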
You can try several methods (linear regression, SVM, neural networks). The input vector should cover all possible tags, with each tag representing one dimension.
Each record in the training set then has to be transformed into such an input vector according to its tags. For example, say you have different combinations of 4 tags in your training set (php, ruby, ms, sql) and you define the input vector [php, ruby, ms, sql]. The following 3 records would be transformed into these binary input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
In case you use linear regression, you use the formula
y = k * X
where y represents the answer (upvote / no upvote), X is the binary input vector, and k is the weight vector fitted from the known training pairs.
How to calculate the weights for linear regression you can read here, but the point is to create binary input vectors whose size equals the number of all tags (or is larger, if you take some other variables into account), and then for each record to set the entry for each tag: 0 if it is not included, 1 otherwise.
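A sketch of this pipeline with scikit-learn (the toy history below is made up). Note that since the question asks for an actual probability rather than an arbitrary score, logistic regression is the usual drop-in replacement for plain linear regression here, as its output is already in [0, 1]:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical history: tag sets and whether the question was upvoted.
history = [
    ({"php": 1, "sql": 1}, 1),
    ({"ruby": 1}, 0),
    ({"ms": 1, "sql": 1}, 1),
    ({"ruby": 1, "sql": 1}, 0),
]
X_dicts, y = zip(*history)

vec = DictVectorizer()               # builds the [ms, php, ruby, sql] vector
X = vec.fit_transform(X_dicts)

# Logistic regression outputs a probability, unlike plain linear
# regression, whose predictions can fall outside [0, 1].
clf = LogisticRegression().fit(X, y)
p = clf.predict_proba(vec.transform([{"php": 1, "sql": 1}]))[0, 1]
```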

Determining which inputs to weigh in an evolutionary algorithm

I once wrote a Tetris AI that played Tetris quite well. The algorithm I used (described in this paper) is a two-step process.
In the first step, the programmer decides to track inputs that are "interesting" to the problem. In Tetris we might be interested in tracking how many gaps there are in a row because minimizing gaps could help place future pieces more easily. Another might be the average column height because it may be a bad idea to take risks if you're about to lose.
The second step is determining weights associated with each input. This is the part where I used a genetic algorithm. Any learning algorithm will do here, as long as the weights are adjusted over time based on the results. The idea is to let the computer decide how the input relates to the solution.
Using these inputs and their weights we can determine the value of taking any action. For example, if putting the straight line shape all the way in the right column will eliminate the gaps of 4 different rows, then this action could get a very high score if its weight is high. Likewise, laying it flat on top might actually cause gaps and so that action gets a low score.
I've always wondered if there's a way to apply a learning algorithm to the first step, where we find "interesting" potential inputs. It seems possible to write an algorithm where the computer first learns what inputs might be useful, then applies learning to weigh those inputs. Has anything been done like this before? Is it already being used in any AI applications?
In neural networks, you can select 'interesting' potential inputs by finding the ones that have the strongest correlation, positive or negative, with the classifications you're training for. I imagine you can do similarly in other contexts.
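That correlation-based screening might be sketched like this (the data is synthetic; only the ranking step is shown, with column 0 built to drive the target and the others being noise):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical candidate inputs (columns) and a target signal:
# column 0 drives the target, columns 1-2 are pure noise.
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Absolute Pearson correlation of each candidate input with the target;
# the strongest correlate is the most "interesting" input.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(3)])
best = int(np.argmax(corr))
```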
I think I might approach the problem you're describing by feeding more primitive data to a learning algorithm. For instance, a Tetris game state may be described by the list of occupied cells, and a string of bits describing this information would be a suitable input for that stage of the learning algorithm. Actually training on that is still challenging, though: how do you know whether the results are useful? I suppose you could roll the whole algorithm into a single blob, where the algorithm is fed the successive states of play and its output is just the block placements, with higher-scoring algorithms selected for future generations.
Another choice might be to use a large corpus of plays from other sources, such as recorded games from human players or a hand-crafted AI, and to select the algorithms whose outputs correlate strongly with some interesting fact from the future play, such as the score earned over the next 10 moves.
Yes, there is a way.
If you have M candidate features there are 2^M subsets, so there is a lot to look at.
I would do the following:
For each subset S:
    run your code to optimize the weights W
    save S and the corresponding W
Then for each S-W pair you can run G games and save the score L of each one. Now you have a table like this:
feature1  feature2  feature3  featureM  subset_code  game_number  scoreL
1         0         1         1         S1           1            10500
1         0         1         1         S1           2            6230
...
0         1         1         0         S2           G + 1        30120
0         1         1         0         S2           G + 2        25900
Now you can run some component-selection algorithm (PCA, for example) on this table and decide which features are worth keeping to explain scoreL.
A tip: when running the code to optimize W, seed the random number generator so that each different 'evolving brain' is tested against the same piece sequence.
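The subset loop above might be sketched like this (everything here is illustrative: the feature names, the stand-in evaluate function, and the fixed weights that would really come from the genetic-algorithm step):

```python
import random
from itertools import combinations

FEATURES = ["gaps", "avg_height", "bumpiness", "holes"]  # illustrative names

def evaluate(subset, weights, seed):
    """Stand-in for 'play G games with this feature subset and weights'.
    A real implementation would run the Tetris AI; here we just return a
    deterministic pseudo-score so the loop structure stays visible. The
    seed keeps the piece sequence fixed across subsets, per the tip above."""
    rng = random.Random(f"{seed}:{subset}")
    return rng.randrange(1000, 30000)

results = []
for m in range(1, len(FEATURES) + 1):          # all 2^M - 1 non-empty subsets
    for subset in combinations(FEATURES, m):
        weights = [1.0] * len(subset)          # would come from the GA step
        score = evaluate(subset, weights, seed=42)
        results.append((subset, score))

best_subset, best_score = max(results, key=lambda r: r[1])
```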
I hope this helps!
