I am trying to implement Neuro-Evolution of Augmenting Topologies in C#. I am running into a problem with recurrent connections. I understand that, for a recurrent connection, the output is basically temporally displaced.
http://i.imgur.com/FQYjCLZ.png
In the linked image, I show a pretty simple neural network with 2 inputs, 3 hidden nodes, and one output. Without an activation function or transfer function, I think it would be evaluated as:
n3[t] = ((i1[t]*a + n6[t-1]*e)*d + i2[t]*b*c) * f
However I am having a hard time figuring out how to identify the fact that the link e is a recurrent connection. The paper that I read about NEAT showed how the minimal solutions of the XOR problem and the dual pole no velocity problem both had recurrent connections.
It seems rather straightforward if you have a fixed topology, because you can analyze the topology yourself and identify which connections you need to time delay.
How exactly would you identify these connections?
I had a similar problem when I started implementing this paper. I don't know what your network looks like at the moment, so I'll explain what I did.
My network starts out with input and output layers only. To create connections and neurons I implemented a kind of DNA (in my case this is an array of instructions like 'connect neuron nr. 2 with neuron nr. 5 and set the weight to 0.4'). Every neuron in my network has a "layerNumber" which tells me where the neuron sits in the network. This layerNumber is fixed for every input and output neuron: for input neurons I used Double.minvalue and for output neurons I used Double.maxvalue.
This is the basic setup. From now on just follow these rules when modifying the network:
Whenever you want to create a connection, make sure the 'from' neuron has a layerNumber < Double.maxValue
Whenever you want to create a connection, make sure that the 'to' neuron has a bigger layerNumber than the 'from' neuron.
Whenever a connection is split up into 2 connections and a neuron between them, set the neuron's layerNumber to NeuronFrom.layerNumber*0.5 + NeuronTo.layerNumber*0.5
This is important: you can't add them and simply divide by 2, because the sum can exceed Double.maxValue. Floating-point overflow yields infinity (it does not wrap around to a negative number), so the resulting layerNumber would be useless.
If you follow all the rules you should always have forward connections only, and no recurrent ones. If you want recurrent connections, you can create them by just swapping 'from' and 'to' while creating a new connection.
Pro tricks:
Use only one ArrayList of Neurons.
Make the DNA use the IDs of neurons to find them, but create a 'Connection' class which has the Neuron objects as attributes.
When filtering your connections/neurons use ArrayList.stream().filter()
When later propagating through the network you can just sort your neurons by layerNumber, set the input values and go through all neurons using a for() loop. Just calculate each neuron's output value and transfer it to every neuron which has a connection where 'from' is the current neuron.
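The layerNumber bookkeeping and the sorted propagation described above can be sketched roughly as follows. This is only a minimal Python illustration, not the original implementation: the names Neuron, split_connection and propagate are made up here, the two layer constants stand in for the most negative and most positive finite doubles, connections are plain (from_id, to_id, weight) tuples, and no activation function is applied.

```python
import sys
from dataclasses import dataclass

# Assumed stand-ins for the input and output layer numbers.
INPUT_LAYER = -sys.float_info.max
OUTPUT_LAYER = sys.float_info.max

@dataclass
class Neuron:
    id: int
    layer_number: float
    value: float = 0.0

def split_connection(src, dst, new_id):
    """Create the neuron that sits between src and dst."""
    # Halve each term before adding so the sum cannot overflow.
    return Neuron(new_id, src.layer_number * 0.5 + dst.layer_number * 0.5)

def propagate(neurons, connections, input_values):
    """Evaluate a forward-only network; connections are (from_id, to_id, weight)."""
    by_id = {n.id: n for n in neurons}
    for n in neurons:
        n.value = input_values.get(n.id, 0.0)
    # Sorting by layer number guarantees a neuron is complete before it is read.
    for n in sorted(neurons, key=lambda n: n.layer_number):
        for src, dst, weight in connections:
            if src == n.id:
                by_id[dst].value += n.value * weight
    return {n.id: n.value for n in neurons if n.layer_number == OUTPUT_LAYER}

in1, in2 = Neuron(1, INPUT_LAYER), Neuron(2, INPUT_LAYER)
out = Neuron(3, OUTPUT_LAYER)
hidden = split_connection(in1, out, new_id=4)  # splits the connection 1 -> 3
network = [in1, in2, out, hidden]
links = [(1, 4, 0.5), (4, 3, 2.0), (2, 3, 1.0)]
result = propagate(network, links, {1: 1.0, 2: 3.0})
print(result)
```

A real implementation would apply an activation function to each neuron's sum before passing it on.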
Hope it's not too complicated...
Okay, so instead of telling you to just not have recurrent connections, I'm actually going to tell you how to identify them.
The first thing you need to know is that recurrent connections are calculated after all other connections and neurons. So which connection is recurrent and which is not depends on the order of calculation of your NN.
Also, the first time you feed data into the system, we'll just assume that every recurrent connection delivers zero; otherwise some or all neurons can't be calculated.
Let's say we have this neural network:
[image: Neural Network]
We divide this network into 3 layers (even though conceptually it has 4 layers):
Input Layer [1, 2]
Hidden Layer [5, 6, 7]
Output Layer [3, 4]
First rule: All outputs from the output layer are recurrent connections.
Second rule: All outputs from the input layer may be calculated first.
We create two arrays. One containing the order of calculation of all neurons and connections and one containing all the (potentially) recurrent connections.
Right now these arrays look somewhat like this:
Order of calculation: [1->5, 2->7]
Recurrent: []
Now we begin by looking at the output layer. Can we calculate neuron 3? No, because 6 is missing. Can we calculate 6? No, because 5 is missing. And so on. It looks somewhat like this:
3, 6, 5, 7
The problem is that we are now stuck in a loop. So we introduce a temporary array storing all the neuron IDs that we have already visited:
[3, 6, 5, 7]
Now we ask: Can we calculate 7? No, because 6 is missing. But we already visited 6...
[3, 6, 5, 7] <- 6
Third rule is: When you visit a neuron that has already been visited before, set the connection that you followed to this neuron as a recurrent connection.
Now your arrays look like this:
Order of calculation: [1->5, 2->7]
Recurrent: [6->7]
Now you finish the process and, at the end, join the order of calculation array with your recurrent array so that the recurrent array follows the other array.
It looks somewhat like this:
[1->5, 2->7, 7, 7->4, 7->5, 5, 5->6, 6, 6->3, 3, 4, 6->7]
Let's assume we have [x->y, y], where x->y is the calculation of x*weight(x->y) and y is the calculation of Sum(of inputs to y), so in this case Sum(x->y) or just x->y.
There are still some problems to solve here. For example: what if the only input of a neuron is a recurrent connection? But I guess you'll be able to solve this problem on your own...
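The three rules can be sketched as a depth-first walk backwards from the output neurons. This is only my reading of the procedure, written in Python for brevity: the function name split_recurrent and the (from, to) tuple format are made up for illustration, and "already visited" is interpreted as "currently on the path being explored".

```python
def split_recurrent(connections, inputs, outputs):
    """Return (calculation order, recurrent connections) for (from, to) pairs."""
    incoming = {}
    for src, dst in connections:
        incoming.setdefault(dst, []).append(src)

    order = []            # connections and neurons in calculation order
    recurrent = []        # connections that must read last step's value
    done = set(inputs)    # rule 2: inputs are available from the start
    on_path = set()       # the temporary "already visited" array

    def visit(node):
        if node in done:
            return
        on_path.add(node)
        for src in incoming.get(node, []):
            # Rule 1: anything leaving the output layer is recurrent.
            # Rule 3: reaching a neuron that is already on the path closes a loop.
            if src in outputs or src in on_path:
                recurrent.append((src, node))
                continue
            visit(src)
            order.append((src, node))
        on_path.discard(node)
        done.add(node)
        order.append(node)

    for out in outputs:
        visit(out)
    return order, recurrent

# The example network from this answer.
links = [(1, 5), (2, 7), (7, 4), (7, 5), (5, 6), (6, 3), (6, 7)]
order, recurrent = split_recurrent(links, inputs={1, 2}, outputs=[3, 4])
print(order + recurrent)
```

The exact position of individual entries can differ from the hand-written list above, but every neuron still appears after all of its non-recurrent inputs, and 6->7 comes last.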
I am currently learning pattern recognition. I have a 7-year background in programming, so I think like a programmer.
The documentation on ANNs tells me nothing about what order everything is processed in, or at least does not make it very clear. This is annoying, as I don't know how to code the formulas.
I found a nice gif which I hope is correct. Can someone please give me a step-by-step process of artificial neural network back propagation with, for example, 2 inputs, 1 hidden layer with 3 nodes, and 2 outputs, using the sigmoid?
Here is the gif.
As Emile said, you go layer by layer from input to output, and then you propagate the error backwards, again layer by layer.
From what you have said, I expect that you are trying to make an "object oriented" implementation where every neuron is an object. But that is neither the fastest nor the easiest way. The most usual implementation uses matrix operations, where
every layer is described by a single matrix (every row contains the weights of one neuron plus its threshold)
This MATLAB code should do the trick:
output_hidden = logsig( hidden_layer * [inputs ; 1] );
inputs is a column vector of inputs to the layer
hidden_layer is the matrix of weights, plus one column which holds the thresholds of the hidden layer
output_hidden is again a column vector, holding the outputs of all neurons in the layer; it can be used as the input to the next layer
logsig is a function which applies the sigmoid transform to each member of a vector
[inputs ; 1] creates a new vector with a 1 appended to the end of the column vector inputs; it is there because you need a "virtual input" for the thresholds to be multiplied with.
If you think about it, you will see that the matrix multiplication does exactly the summation over all inputs multiplied by their weights, and that it doesn't matter in what order the neurons within a layer are processed. To implement it in any other language, just find yourself a good linear-algebra library. Implementing back-propagation is a bit trickier, and you will need to do some matrix transpositions (i.e. flipping a matrix along its diagonal).
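For readers without MATLAB, here is a rough NumPy sketch of the same forward pass for the 2-3-2 network from the question, plus one plain gradient-descent back-propagation step. It is an illustration under my own assumptions rather than a reference implementation: squared error, no momentum, random initial weights, and the names forward and train_step are made up here.

```python
import numpy as np

def logsig(x):
    """Element-wise sigmoid, like MATLAB's logsig."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(hidden_layer, output_layer, inputs):
    # Appending 1.0 gives the "virtual input" that multiplies the thresholds.
    output_hidden = logsig(hidden_layer @ np.append(inputs, 1.0))
    output_final = logsig(output_layer @ np.append(output_hidden, 1.0))
    return output_hidden, output_final

def train_step(hidden_layer, output_layer, inputs, target, rate=0.5):
    """One back-propagation step; updates both matrices in place."""
    h, y = forward(hidden_layer, output_layer, inputs)
    # Error signal at the outputs (squared error, sigmoid derivative y*(1-y)).
    delta_out = (y - target) * y * (1.0 - y)
    # Propagate backwards through the transposed weights (threshold column dropped).
    delta_hid = (output_layer[:, :-1].T @ delta_out) * h * (1.0 - h)
    output_layer -= rate * np.outer(delta_out, np.append(h, 1.0))
    hidden_layer -= rate * np.outer(delta_hid, np.append(inputs, 1.0))
    return 0.5 * np.sum((y - target) ** 2)

rng = np.random.default_rng(0)
hidden_layer = rng.normal(size=(3, 3))  # 3 neurons x (2 inputs + threshold)
output_layer = rng.normal(size=(2, 4))  # 2 neurons x (3 hidden + threshold)

x = np.array([0.5, -0.2])
t = np.array([1.0, 0.0])
errors = [train_step(hidden_layer, output_layer, x, t) for _ in range(200)]
print(errors[0], errors[-1])
```

Note that the hidden-layer error is computed before the output weights are changed; doing it the other way round is a common source of subtle bugs.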
As you can see in the gif, processing is per layer. As there are no connections within a layer, the processing order within a layer does not matter. Using the ANN (classifying) is done from input layer through hidden layers to the output layer. Training (using backpropagation) is done from output layer back to input layer.
So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions yet most people suggested using k-means to cluster points, like scipy, which is quite confusing to use for a beginner like me. Also I think that k-means is more suitable for two or more dimensional clustering right? Are there any ways to partition an array of N numbers to many partitions/clustering depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as expected.
Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much more well behaved. In 1D, you have local minima; but in 2D you may have saddle points and such "maybe" splitting points. See this Wikipedia illustration of a saddle point, as how such a point may or may not be appropriate for splitting clusters.
See this answer for an example of how to do this in Python (green markers are the cluster modes; red markers are points where the data is cut; the y axis is a log-likelihood of the density):
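As a rough illustration of the idea (not the code from the linked answer), the sketch below evaluates a Gaussian kernel density on a grid with plain NumPy and cuts the sorted data at the local minima. The function name kde_split and the hand-picked bandwidth of 3.0 are assumptions for this example; the result depends heavily on the bandwidth, so it has to be chosen to suit the data.

```python
import numpy as np

def kde_split(values, bandwidth, grid_size=1000):
    """Split 1-D data at the local minima of a Gaussian kernel density."""
    data = np.sort(np.asarray(values, dtype=float))
    grid = np.linspace(data[0], data[-1], grid_size)
    # Unnormalized density: the constant factor does not move the minima.
    z = (grid[:, None] - data[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    # A grid point is a local minimum if it lies below its left neighbour
    # and not above its right neighbour.
    mid = density[1:-1]
    is_min = (mid < density[:-2]) & (mid <= density[2:])
    cuts = grid[1:-1][is_min]
    groups = np.split(data, np.searchsorted(data, cuts))
    return [g.tolist() for g in groups if g.size]

clusters = kde_split([1, 1, 2, 3, 10, 11, 13, 67, 71], bandwidth=3.0)
print(clusters)
```

A smaller bandwidth produces more, tighter clusters; a larger one merges neighbouring groups.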
This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5 ]
clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
    if point <= curr_point + eps:
        curr_cluster.append(point)
    else:
        clusters.append(curr_cluster)
        curr_cluster = [point]
    curr_point = point
clusters.append(curr_cluster)
print(clusters)
The above example clusters points into groups such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1d data allows you to solve the problem directly, instead of using the bigger guns like DBSCAN.
In my tests on some small datasets with <1000 elements, the above algorithm was 10-100x faster than DBSCAN.
You may look into discretization algorithms. The 1D discretization problem is very similar to what you are asking. These algorithms decide cut-off points according to frequency, binning strategy, etc.
Weka uses the following algorithms in its discretization process:
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononenko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning
CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example Usage
pip install ckwrap
import numpy as np
import ckwrap

nums = np.array([1,1,2,3,10,11,13,67,71])
km = ckwrap.ckmeans(nums, 3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]
buckets = [[], [], []]
for i in range(len(nums)):
    # int() keeps the printed list free of NumPy scalar wrappers
    buckets[km.labels[i]].append(int(nums[i]))
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
I expect the authors intended you to make use of the ndarray functionality rather than create a list of lists.
other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.
Late response, just for the record: you can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and it is O(n^2), where n is the number of observations. The implementation is in C++ and there is a wrapper in R.
The code for Has QUIT--Anony-Mousse's answer to Clustering values by their proximity in python (machine learning?)
When you have 1-dimensional data, sort it, and look for the largest gaps
I only added that gaps need to be relatively large
import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
    print(f'{last}:{x + 1} {lst[last:x + 1]}')
    last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]
Let's say you have 3 inputs: A, B, C. Can an artificial neural network (not necessarily feed forward) learn this pattern?
if C > k
    output is A
else
    output is B
Are there certain types of networks which can learn this, or which are well suited for this type of problem?
Yes, that's a relatively easy pattern for a feedforward neural network to learn.
I think you will need at least 3 layers, assuming sigmoid functions:
1st layer can test C>k (and possibly also scale A and B down into the linear range of the sigmoid function)
2nd layer can calculate A/0 and 0/B conditional on the 1st layer
3rd (output) layer can perform a weighted sum to give A/B (you may need to make this layer linear rather than sigmoid depending on the scale of values you want)
Having said that, if you genuinely know the structure of your problem and what kind of calculation you want to perform, then neural networks are unlikely to be the most effective solution: they are better in situations where you don't know much about the exact calculations required to model the functions / relationships.
If the inputs can be only zeros and ones, then this is the network:
Each neuron has a Heaviside step function as an activation function. The neurons y0 and z have bias = 0.5; the neuron y1 has a bias = 1.5. The weights are shown above the corresponding connections. When s = 0, the output z = d0. When s = 1, the output z = d1.
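Since the picture with the weights is not reproduced here, the following Python sketch fills them in with one set of values that makes the description work; treat the weights (+1 and -1) as my assumption, while the biases 0.5 and 1.5 come from the text above and are used as firing thresholds.

```python
def step(x):
    """Heaviside step function: 1 for positive input, otherwise 0."""
    return 1 if x > 0 else 0

def mux(d0, d1, s):
    # Assumed weights: d0 -> y0 is +1, s -> y0 is -1 (s = 1 blocks d0).
    y0 = step(d0 - s - 0.5)
    # Assumed weights: d1 -> y1 is +1, s -> y1 is +1 (needs both to fire).
    y1 = step(d1 + s - 1.5)
    # z fires if either hidden neuron fires.
    return step(y0 + y1 - 0.5)

for s in (0, 1):
    for d0 in (0, 1):
        for d1 in (0, 1):
            print(f"s={s} d0={d0} d1={d1} -> z={mux(d0, d1, s)}")
```

This is the classic 2-to-1 multiplexer: y0 computes "d0 AND NOT s", y1 computes "d1 AND s", and z is their OR.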
If the inputs are continuous, then Sigmoid, tanh or ReLU can be used as the activation functions of the neurons, and the network can be trained with the back-propagation algorithm.
For my Advanced Algorithms and Data Structures class, my professor asked us to pick any topic that interested us. He also told us to research it and to try and implement a solution in it. I chose Neural Networks because it's something that I've wanted to learn for a long time.
I've been able to implement an AND, OR, and XOR using a neural network whose neurons use a step function for the activator. After that I tried to implement a back-propagating neural network that learns to recognize the XOR operator (using a sigmoid function as the activator). I was able to get this to work 90% of the time by using a 3-3-1 network (1 bias at the input and hidden layer, with weights initialized randomly). At other times it seems to get stuck in what I think is a local minimum, but I am not sure (I've asked questions on this before and people have told me that there shouldn't be a local minimum).
The 90% of the time it was working, I was consistently presenting my inputs in this order: [0, 0], [0, 1], [1, 0], [1, 1] with the expected output set to [0, 1, 1, 0]. When I present the values in the same order consistently, the network eventually learns the pattern. It actually doesn't matter in what order I send it in, as long as it is the exact same order for each epoch.
I then implemented a randomization of the training set, and so this time the order of inputs is sufficiently randomized. I've noticed now that my neural network gets stuck and the errors are decreasing, but at a very small rate (which is getting smaller at each epoch). After a while, the errors start oscillating around a value (so the error stops decreasing).
I'm a novice at this topic and everything I know so far is self-taught (reading tutorials, papers, etc.). Why does the order of presentation of inputs change the behavior of my network? Is it because the change in error is consistent from one input to the next (because the ordering is consistent), which makes it easy for the network to learn?
What can I do to fix this? I'm going over my backpropagation algorithm to make sure I've implemented it right; currently it is implemented with a learning rate and a momentum. I'm considering looking at other enhancements like an adaptive learning-rate. However, the XOR network is often portrayed as a very simple network and so I'm thinking that I shouldn't need to use a sophisticated backpropagation algorithm.
The order in which you present the observations (input vectors) comprising your training set to the network only matters in one respect: a randomized arrangement of the observations with respect to the response variable is strongly preferred over an ordered arrangement.
For instance, suppose you have 150 observations comprising your training set, and for each the response variable is one of three class labels (class I, II, or III), such that observations 1-50 are in class I, 51-100 in class II, and 101-150 in class III. What you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 observations in class I, then all 50 in class II, then all 50 in class III.
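To make the 150-observation example concrete, here is a small Python sketch of reshuffling the presentation order at the start of every epoch. The helper name shuffled_epochs and the fixed seed are illustrative choices only; in a real training loop you would shuffle the input vectors and their labels together.

```python
import random

def shuffled_epochs(samples, epochs, seed=0):
    """Yield the training set in a fresh random order for each epoch."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(samples)  # copy, so the original ordering is untouched
        rng.shuffle(order)
        yield order

# 150 observations sorted by class, exactly the ordering to avoid.
labels = ["I"] * 50 + ["II"] * 50 + ["III"] * 50
first_epoch = next(shuffled_epochs(labels, epochs=1))
print(first_epoch[:10])
```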
What happened while training your classifier? Well, initially you were presenting the four observations to your network with the responses unordered: [0, 1, 1, 0].
I wonder what the ordering of the input vectors was in those instances in which your network failed to converge. If it was [1, 1, 0, 0] or [0, 0, 1, 1], this is consistent with the well-documented empirical rule I mentioned above.
On the other hand, I have to wonder whether this rule even applies in your case. The reason is that you have so few training instances that even if the order is [1, 1, 0, 0], training over multiple epochs (which I am sure you must be doing) will mean that this ordering looks more "randomized" than the exemplar I mentioned above (i.e., [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0] is how the network would be presented with the training data over three epochs).
Some suggestions to diagnose the problem:
As I mentioned above, look at the ordering of your input vectors in the non-convergence cases: are they sorted by response variable?
In the non-convergence cases, look at your weight matrices (I assume you have two of them). Look for any values that are very large (e.g., 100x the others, or 100x the value it was initialized with). Large weights can cause overflow.
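A quick way to run the second check is sketched below with NumPy. The helper name suspicious_weights is hypothetical, and comparing against the largest initial magnitude (rather than each weight's own initial value) is my simplification, chosen to avoid dividing by initial values that happen to be close to zero.

```python
import numpy as np

def suspicious_weights(weights, initial, factor=100.0):
    """Return the positions of weights that grew past factor x the initial scale."""
    limit = factor * np.max(np.abs(initial))
    return np.argwhere(np.abs(weights) > limit)

initial = np.array([[0.1, -0.2], [0.3, 0.05]])
trained = np.array([[0.5, -45.0], [1.0, 0.2]])
flagged = suspicious_weights(trained, initial)
print(flagged.tolist())
```

Each row of the result is the (row, column) position of one runaway weight in the matrix.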