Which activation function should I use for this Neural Network? - artificial-intelligence

We are developing a neural network for the game of Checkers. In our training data,
0 represents blank cell,
1 represents white piece,
-1 represents white king,
2 represents black piece and
-2 represents black king
So, what we need is an activation function within the range [-2, 2]. Which activation function should we use? Please give your suggestions.

I don't see a reason why you couldn't use the sigmoid function.
The sigmoid function takes values in (0, 1); the endpoints are only approached asymptotically.
To fit your needs, multiply it by 4 (multiplying a function by a constant changes its amplitude, giving the range (0, 4)) and then subtract 2 (adding or subtracting a constant shifts the function along the y-axis, giving the range (-2, 2)).
So the function would look like this:
S(t) = 4* ( 1 / (1 + e ^ (-t)) ) - 2
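As a quick check of the formula above, here is a minimal Python sketch. The function name `scaled_sigmoid` and the `low`/`high` parameters are illustrative choices, not part of any library.

```python
import math

def scaled_sigmoid(t, low=-2.0, high=2.0):
    """Logistic sigmoid stretched and shifted to the open interval (low, high)."""
    return (high - low) / (1.0 + math.exp(-t)) + low

# With the defaults this is S(t) = 4 / (1 + e^(-t)) - 2.
for t in (-10.0, 0.0, 10.0):
    print(t, scaled_sigmoid(t))
```

With the default bounds this is the same curve as 2 * tanh(t / 2), so a tanh unit scaled by 2 would serve equally well.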

Your state encoding is not optimal. Usually neural networks work better with 1-of-c encoding for categories. Then, it is easy to use sigmoid units. Just take the argmax of the 5 outputs to determine the state.
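A minimal sketch of what 1-of-c encoding could look like for the five cell states in the question. The ordering in `CELL_CODES` and the helper names are assumptions made for illustration.

```python
import numpy as np

# Cell codes as used in the question; the class order is an arbitrary choice.
CELL_CODES = [0, 1, -1, 2, -2]

def one_hot(code):
    """Encode one cell state as a vector with a single 1."""
    vec = np.zeros(len(CELL_CODES))
    vec[CELL_CODES.index(code)] = 1.0
    return vec

def decode(outputs):
    """Map five sigmoid outputs back to a cell code via argmax."""
    return CELL_CODES[int(np.argmax(outputs))]

print(one_hot(-1))                             # white king
print(decode([0.1, 0.2, 0.05, 0.6, 0.05]))     # strongest output wins
```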

Yeah! The sigmoid function works well since, once scaled, it fits the range you have specified. I also use it for a pattern recognition problem I am developing. A linear activation also tends to work well.

Related

AForge.NET - Backpropagation learning always returns values [-1;1]

I have a problem with backpropagation learning using AForge.NET - Neuro Learning - Backpropagation. I am trying to implement a neural network as in the samples (Approximation). My setup is this:
1. input vector {1,2,3,...,19,20}
2. output vector {1,2,3,...,19,20} (it's linear function)
3. ActivationNetwork network = new ActivationNetwork(new BipolarSigmoidFunction(2), 1, 20, 1);
4. Then about 10k times - teacher.RunEpoch(input, output);
When learning is complete, my network.Compute() returns values in [-1;1]. Why?
The sample normalises the vectors (x -> [-1; 1] and y -> [-0.85; 0.85]), and when I do that everything is fine, but it is only a sample that I am using to learn how neural networks work. The actual problem I want to implement is more complex (it has more than 40 input neurons).
Can anyone help me?
I have not worked with AForge yet, but the BipolarSigmoidFunction is most probably tanh, i.e. the output is within [-1, 1]. Your targets 1 to 20 lie outside that range, so the network cannot reach them. This activation is usually used for classification or sometimes for bounded regression. In your case you can either scale the data or use a linear activation function (e.g. identity, g(a) = a).
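The "scale the data" option can be sketched as follows, in Python rather than C# and independent of AForge. The helper `make_scaler` is hypothetical; the target interval [-0.85, 0.85] is the one mentioned in the question.

```python
import numpy as np

def make_scaler(values, low, high):
    """Return (forward, inverse) linear maps between the data range and [low, high]."""
    v_min, v_max = float(np.min(values)), float(np.max(values))
    span = v_max - v_min

    def forward(v):
        return low + (np.asarray(v, dtype=float) - v_min) * (high - low) / span

    def inverse(s):
        return v_min + (np.asarray(s, dtype=float) - low) * span / (high - low)

    return forward, inverse

y = np.arange(1, 21)                       # targets 1..20 from the question
to_net, from_net = make_scaler(y, -0.85, 0.85)
scaled = to_net(y)                         # train the network on these values
restored = from_net(scaled)                # apply to network outputs afterwards
print(scaled.min(), scaled.max())
```

The inverse map is applied to whatever the network computes, so predictions come back in the original 1 to 20 units.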

Does Convolution Neural Network need normalized input?

I have trained a Convolutional Neural Network and compared two normalizations.
I found that simply subtracting the mean and dividing by the standard deviation works better than scaling into [0, 1]; it seems the input values do not need to lie in [0, 1], even with a sigmoid function.
Could anybody explain this?
If you're talking about a NN using logistic regression, then you are correct that a suitable sigmoid function (or logistic function in this context) will give you a [0, 1] range from your original inputs.
However, the logistic function works best when the inputs are in a small range on either side of zero - so, for example, your input to the logistic function might be [-3, +3].
By rescaling your data to [0, 1] first, you would flatten out any underlying distribution and move all of your data to the positive side of zero, which is not what the logistic function expects. So you will get a worse result than by normalising (i.e. subtract mean and divide by standard deviation, as you said) because that normalisation step takes account of the variance in the original distribution and makes sure that the mean is zero so you get both positive and negative data input to the logistic function.
In your question, you said "comparing two normalisations" - I think you are misunderstanding what "normalisation" means and actually comparing normalisation with rescaling, which is different.
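To make the difference concrete, here is a small NumPy sketch that applies both transforms to the same made-up sample (the values in `x` are illustrative only).

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up inputs

# Rescaling: squeeze everything into [0, 1]; all values end up non-negative.
rescaled = (x - x.min()) / (x.max() - x.min())

# Normalising (standardising): zero mean, unit standard deviation,
# so the inputs fall on both sides of zero.
standardised = (x - x.mean()) / x.std()

print(rescaled)
print(standardised)
```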

Does a Neural Network with Sigmoid Activation use Thresholds?

I'm a tad confused here. I just started on the subject of Neural Networks, and the first one I constructed used the step activation with thresholds on each neuron. Now I want to implement the sigmoid activation, but it seems that this type of activation doesn't use thresholds, only weights between the neurons. Yet the information I find about it mentions thresholds, and I can't find where they should go in the activation function.
Are thresholds used in a sigmoid activation function in neural networks?
There is no discrete jump as in step activation. The threshold could be considered to be the point where the sigmoid function is 0.5. Some sigmoid functions will have this at 0, while some will have it set to a different 'threshold'.
The step function may be thought of as a version of the sigmoid function that has the steepness set to infinity. There is an obvious threshold in this case, and for less steep sigmoid functions, the threshold could be considered to be where the function's value is 0.5, or the point of maximum steepness.
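A small sketch of this relationship, assuming a sigmoid parameterised by a `steepness` and a `threshold` (both names are illustrative). In a typical neuron the threshold shows up as a bias term added to the weighted input, with bias = -threshold.

```python
import math

def sigmoid(x, steepness=1.0, threshold=0.0):
    """Sigmoid whose value is 0.5 exactly at x == threshold."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - threshold)))

def step(x, threshold=0.0):
    """Hard step activation with the same threshold."""
    return 1.0 if x >= threshold else 0.0

# As the steepness grows, the sigmoid approaches the step function.
for k in (1.0, 10.0, 1000.0):
    print(k, sigmoid(1.9, k, threshold=2.0), sigmoid(2.1, k, threshold=2.0))
```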
The sigmoid function's value is in the range [0;1] and 0.5 is taken as the threshold: if h(theta) < 0.5 we assume its value is 0, and if h(theta) >= 0.5 we assume it is 1.
Thresholds are used only on the output layer of the network, and only when classifying. So, if you're trying to classify between 4 classes, then the output layer has 4 nodes y = [y1,y2,y3,y4], and you'll use this threshold to assign each y[i] a 1 or a 0.
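For illustration, thresholding a 4-node output layer at 0.5 might look like this; the output values are made up.

```python
import numpy as np

# Made-up sigmoid outputs of a 4-node output layer.
outputs = np.array([0.91, 0.08, 0.47, 0.50])

# Values >= 0.5 become 1, everything else becomes 0.
labels = (outputs >= 0.5).astype(int)
print(labels)
```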
It doesn't need to. The sigmoid curve itself can partially act as a threshold.

1D Number Array Clustering

So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions yet most people suggested using k-means to cluster points, like scipy, which is quite confusing to use for a beginner like me. Also I think that k-means is more suitable for two or more dimensional clustering right? Are there any ways to partition an array of N numbers to many partitions/clustering depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as expected.
Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much better behaved. In 1D, you have local minima; but in 2D you may have saddle points and such "maybe" splitting points. See this Wikipedia illustration of a saddle point for how such a point may or may not be appropriate for splitting clusters.
See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is a log-likelihood of the density).
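The split-at-density-minima idea can be sketched with plain NumPy as below. This is not the code from the linked answer: the function `kde_split`, the Gaussian kernel, the fixed `bandwidth` and the grid size are all assumptions, and the result depends strongly on the bandwidth chosen.

```python
import numpy as np

def kde_split(values, bandwidth, grid_size=1000):
    """Split sorted 1D data at local minima of a Gaussian kernel density."""
    data = np.sort(np.asarray(values, dtype=float))
    grid = np.linspace(data[0], data[-1], grid_size)
    # Unnormalised density: sum of one Gaussian bump per data point.
    z = (grid[:, None] - data[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    # Interior grid points lower than the left neighbour and not higher
    # than the right neighbour are treated as cut points.
    inner = density[1:-1]
    is_min = (inner < density[:-2]) & (inner <= density[2:])
    cuts = grid[1:-1][is_min]
    # Translate cut positions into split indices of the sorted data.
    idx = np.searchsorted(data, cuts)
    return [part.tolist() for part in np.split(data, idx)]

clusters = kde_split([1, 1, 2, 3, 10, 11, 13, 67, 71], bandwidth=3.0)
print(clusters)
```

A smaller bandwidth produces more, finer clusters and a larger one merges neighbouring groups, so the bandwidth plays the role that k plays in k-means.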
This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5]
clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    if point <= curr_point + eps:
        # Close enough to the previous point: same cluster
        curr_cluster.append(point)
    else:
        # Gap larger than eps: close the cluster and start a new one
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)
The above example clusters points into a group, such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1d data allows you to solve the problem directly, instead of using the bigger guns like DBSCAN.
The above algorithm is 10-100x faster for some small datasets with <1000 elements I tested.
You may look at discretization algorithms. The 1D discretization problem is very similar to what you are asking. These algorithms decide cut-off points according to frequency, binning strategy, etc.
Weka uses the following algorithms in its discretization process:
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning
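As a rough Python illustration of what simple binning means, the sketch below uses equal-width bins. It does not call Weka; the helper `equal_width_bins` and the choice of 3 bins are assumptions.

```python
import numpy as np

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equally wide intervals."""
    data = np.asarray(values, dtype=float)
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    # Use only the interior edges so the maximum lands in the last bin.
    labels = np.digitize(data, edges[1:-1])
    return labels, edges

labels, edges = equal_width_bins([1, 1, 2, 3, 10, 11, 13, 67, 71], 3)
print(edges)
print(labels)
```

On this data the first bin swallows both [1, 1, 2, 3] and [10, 11, 13] and the middle bin stays empty, which is the kind of result that makes rigid range partitioning unsatisfying for the question's example.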
CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example usage:
pip install ckwrap
import numpy as np
import ckwrap
nums = np.array([1, 1, 2, 3, 10, 11, 13, 67, 71])
km = ckwrap.ckmeans(nums, 3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]
# Group the values by their cluster label
buckets = [[], [], []]
for i in range(len(nums)):
    buckets[km.labels[i]].append(int(nums[i]))
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
I expect the authors intended you to make use of the ndarray functionality rather than create a list of lists.
Other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.
Late response, just for the record: you can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and is O(n^2), where n is the number of observations. The implementation is in C++ and there is a wrapper in R.
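The idea behind an optimal 1D k-means can be sketched with a plain dynamic program over the sorted values. This is a naive O(k n^2) illustration written for this page, not the Ckmeans.1d.dp implementation; the function name `optimal_1d_kmeans` is made up.

```python
def optimal_1d_kmeans(values, k):
    """Split sorted values into k contiguous groups with minimal total squared error."""
    xs = sorted(values)
    n = len(xs)
    # Prefix sums let us compute the squared error of xs[i:j] in O(1).
    ps, ps2 = [0.0], [0.0]
    for x in xs:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)

    def sse(i, j):
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best error for the first j values using m groups.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + sse(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i
    # Walk the back-pointers to recover the groups.
    clusters, j = [], n
    for m in range(k, 0, -1):
        i = back[m][j]
        clusters.append(xs[i:j])
        j = i
    return clusters[::-1]

clusters = optimal_1d_kmeans([1, 1, 2, 3, 10, 11, 13, 67, 71], 3)
print(clusters)
```

Because clusters of sorted 1D data are always contiguous runs, the search only has to choose k - 1 cut positions, which is what makes an exact solution feasible here.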
This is the code for Has QUIT--Anony-Mousse's answer to "Clustering values by their proximity in python (machine learning?)":
When you have 1-dimensional data, sort it, and look for the largest gaps
I only added the condition that the gaps need to be relatively large.
import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]

Artificial neural networks

I want to know whether Artificial Neural Networks can be applied to discrete-valued inputs. I know they can be applied to continuous-valued inputs, but can they be applied to discrete-valued ones? Also, will they perform well on discrete-valued inputs?
Yes, artificial neural networks may be applied to data featuring discrete-value input variables. In the most commonly used neural network architectures (which are numeric), discrete inputs are typically represented by a series of dummy variables, just as in statistical regression. Also, as with regression, the number of dummy variables needed is one less than the number of distinct values. There are other methods, but this is the most straightforward.
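A minimal sketch of dummy coding with one column fewer than the number of distinct values. The helper `dummy_code` and the colour categories are made up for illustration; the first category acts as the reference level and is encoded as all zeros.

```python
import numpy as np

def dummy_code(values, categories):
    """Encode values with len(categories) - 1 dummy columns."""
    columns = categories[1:]  # the first category is the reference level
    return np.array([[1.0 if v == c else 0.0 for c in columns] for v in values])

encoded = dummy_code(["red", "green", "blue", "green"], ["red", "green", "blue"])
print(encoded)
```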
Well, good question let me say!
First of all let me answer directly yes to your question!
Answering it means considering a few aspects of the use and implementation of the network itself.
So let me explain:
The easiest way is to normalize the input as usual (this is the first rule of thumb with NNs),
then let the neural network compute the task. Once you have your output, invert the normalization to get the output back in the original range, still continuous; to get discrete values back, just take the integer part of your output. It is easy, it works and is fine, DONE! A good result just depends on the topology you design for your network.
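The normalise, invert, truncate round trip might look like the following sketch. The range 0 to 4, the helper names and the stand-in network outputs are all assumptions.

```python
import numpy as np

LOW, HIGH = 0.0, 4.0  # assumed range of the original discrete values

def normalise(v):
    """Map original values into [0, 1] before feeding the network."""
    return (np.asarray(v, dtype=float) - LOW) / (HIGH - LOW)

def denormalise(s):
    """Map network outputs from [0, 1] back to the original range."""
    return np.asarray(s, dtype=float) * (HIGH - LOW) + LOW

raw_outputs = np.array([0.02, 0.49, 0.76, 1.0])  # stand-in for network outputs
continuous = denormalise(raw_outputs)
discrete = np.trunc(continuous).astype(int)      # keep the integer part
print(continuous)
print(discrete)
```

Taking the integer part maps 1.96 to 1; rounding to the nearest integer (for example with `np.rint`) is an alternative that would give 2 instead.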
As a plus, you could consider using a "step" transfer function instead of "tan-sigmoid" between layers, to strengthen and mimic a sort of digitization by forcing the output to be just 0 or 1. But you should then also reconsider the initial normalization, as well as the use of well-tuned thresholds.
NB: this latter trick is not really necessary but could give some secondary benefits; maybe test it in a second stage of your development and look at the differences.
PS: Just let me suggest something that could apply to your issue: if you want to be smart, consider using some fuzzy logic in your learning algorithm ;-)
Cheers!
I'm late on this question, but this may help someone.
Say you have a categorical output variable, for example 3 different categories (0, 1 and 2),
outputs
0
2
1
2
1
0
then becomes
1, 0, 0
0, 0, 1
0, 1, 0
0, 0, 1
0, 1, 0
1, 0, 0
A possible NN output result is
0.2, 0.3, 0.5 (winner is categ 2)
0.05, 0.9, 0.05 (winner is categ 1)
...
Your NN will then have 3 output nodes in this case, so take the max value.
To improve this, use cross-entropy as the error measure and a softmax activation on the output layer, so that the outputs sum to 1.
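A short NumPy sketch of a softmax output layer with a cross-entropy error, using made-up pre-activation values (`logits`) for two samples and three categories.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the max keeps exp() numerically stable."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, one_hot_targets):
    """Mean cross-entropy between predicted probabilities and 1-of-c targets."""
    return float(-np.mean(np.sum(one_hot_targets * np.log(probs), axis=-1)))

logits = np.array([[0.1, 0.5, 1.0],    # made-up output-layer inputs
                   [0.0, 3.0, 0.0]])
targets = np.array([[0, 0, 1],         # category 2
                    [0, 1, 0]])        # category 1

probs = softmax(logits)
winners = probs.argmax(axis=1)         # take the max value per sample
loss = cross_entropy(probs, targets)
print(probs, winners, loss)
```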
The purpose of a neural network is to approximate complicated functions by interpolating samples. As such, they tend to be a poor fit for discrete data, unless that data can be expressed by thresholding a continuous function. Depending on your problem, there are likely to be much more effective learning methods.
