I am using LSTM as the hidden layer function in a time series prediction network. Is input normalization necessary? If it is, is data = data / sum(data) the correct normalization?
Should the output also be normalized with the inputs?
Is input normalization necessary?
No, but it might make your network converge faster. Use this calculation to scale your values to [0,1]:
data = (data - min(data)) / (max(data) - min(data))
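A minimal numpy sketch of that scaling; the guard against a constant series is my addition, since the formula above assumes max(data) > min(data):

```python
import numpy as np

def min_max_scale(data):
    """Scale a series to [0, 1] using the formula above."""
    lo, hi = data.min(), data.max()
    if hi == lo:  # constant series: avoid dividing by zero
        return np.zeros_like(data, dtype=float)
    return (data - lo) / (hi - lo)

series = np.array([3.0, 7.0, 5.0, 9.0])
print(min_max_scale(series))  # [0.         0.66666667 0.33333333 1.        ]
```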
Should the output also be normalized with the inputs?
No, I can't think of a reason why you would ever want to do that.
I am developing (for my senior project) a dumbbell that is able to classify and record different exercises. The device has to be able to classify a range of these exercises based on the data given from an IMU (Inertial Measurement Unit). I have acceleration, gyroscope, compass, pitch, yaw, and roll data.
I am leaning towards using an Artificial Neural Network in order to do this, but am open to other suggestions as well. Ultimately I want to pass in the IMU data into the network and have it tell me what kind of exercise it is (Bicep curl, incline fly etc...).
If I use an ANN, what kind should I use (recurrent or not) and how should I implement it? I am not sure how to get the network to recognize an exercise when I am passing it a continuous stream of data. I was thinking about constantly performing an FFT on a portion of the inputs and sending a set number of frequency magnitudes into the network, but am not sure if that will work either. Any suggestions/comments?
Your first task should be to collect some data from the dumbbell. There are many, many different schemes that could be used to classify the data, but until you have some sample data to work with, it is hard to predict exactly what will work best.
If you get 5 different people to do all of the exercises and look at the resulting data yourself (e.g. plot the different parts of the data collected), can you distinguish which exercise is which? This may give you hints about what pre-processing you might want to perform on the data before sending it to a classifier.
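As an illustration, a minimal matplotlib sketch for eyeballing the channels of one recording; the file name and column names are hypothetical, so substitute whatever your IMU logger produces:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical CSV with one row per IMU sample.
recording = pd.read_csv("bicep_curl_01.csv")  # columns: ax, ay, az, pitch, roll, yaw

fig, axes = plt.subplots(2, 1, sharex=True)
recording[["ax", "ay", "az"]].plot(ax=axes[0], title="Accelerometer")
recording[["pitch", "roll", "yaw"]].plot(ax=axes[1], title="Orientation")
plt.xlabel("sample index")
plt.show()
```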
First, create a large training set and train the network on it, labelling each sample with the exercise that was actually performed.
You might use averages of the data as well: besides the actual movement, feed in the same movement averaged over 2, 5, and 10 second windows as extra input nodes.
While exercising, the trained network can then be fed the averaged data too (i.e. the last x samples divided by x); this gives a more stable result, as otherwise the network's output can become erratic.
Note that the training set might then require averaged data as well, so you will need a large training set. A sketch of such moving-average features appears below.
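A minimal numpy sketch of those moving-average features; the 50 Hz sample rate is an assumption for illustration:

```python
import numpy as np

SAMPLE_RATE_HZ = 50  # assumed IMU sampling rate

def moving_average(signal, seconds):
    """Mean of the surrounding `seconds` worth of samples at each time step."""
    window = int(seconds * SAMPLE_RATE_HZ)
    kernel = np.ones(window) / window
    # 'same' keeps the output aligned with the input; edges are partial averages.
    return np.convolve(signal, kernel, mode="same")

accel_x = np.random.randn(1000)  # stand-in for one accelerometer channel
features = np.column_stack([
    accel_x,
    moving_average(accel_x, 2),
    moving_average(accel_x, 5),
    moving_average(accel_x, 10),
])  # one row of input-node values per time step
```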
My professor asked my class to make a neural network to try to predict if a breast cancer is benign or malignant. To do this I'm using the Breast Cancer Wisconsin (Diagnostic) Data Set.
As a tip, my professor said that not all 30 attributes need to be used as inputs (there are 32 columns, but the first 2 are the ID and the diagnosis). What I want to ask is: how am I supposed to take those 30 inputs (which would create 100+ weights, depending on how many neurons I use) and reduce them to a smaller number?
I've already found how to "prune" a neural net, but I don't think that's what I want. I'm not trying to eliminate unnecessary neurons, but to shrink the input itself.
PS: Sorry for any English errors; it's not my native language.
That is a question that is under active research right now. It is called feature selection, and there are already several techniques. One is Principal Component Analysis (PCA), which reduces the dimensionality of your dataset by keeping the components that retain the most variance. Another thing you can do is check whether there are highly correlated variables: if two inputs are highly correlated, they may carry almost the same information, so one of them can be removed without hurting the performance of your classifier much. A third technique you could use is deep learning, which tries to learn the features that will later be used to feed your trainer. More about deep learning and PCA can be found here: http://deeplearning.stanford.edu/wiki/index.php/Main_Page
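A minimal scikit-learn sketch of the PCA approach; it loads the same Wisconsin Diagnostic dataset via sklearn.datasets.load_breast_cancer, and the 95% variance threshold is an arbitrary illustration value:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 569 samples x 30 features

# PCA is variance-based, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)  # e.g. (569, 10): far fewer inputs for the network
```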
This problem is called feature selection. It is mostly the same for neural networks as for other classifiers. You could prune your dataset while retaining the most variance using PCA. To go further, you could use a greedy approach and evaluate your features one by one by training and testing your network with each feature excluded in turn.
There is a technique for feature selection using just neural networks.
Split your dataset into three groups:
Training data used for supervised training
Validation data used to verify that the neural network is able to generalize
Accuracy-testing data used to test which of the features are required
The steps:
Train a network on your training and validation set, just like you would normally do.
Test the accuracy of the network with the third dataset.
Locate the variable which yields the smallest drop in the accuracy test above when dropped (dropped meaning always feeding a zero as that input signal)
Retrain your network with the new selection of features
Keep doing this until either the network fails to train or there is just one variable left.
Here is a paper on the technique
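A minimal sketch of that zero-out ablation loop. Here train_network and test_accuracy are hypothetical placeholders for your own training and evaluation code, and the stopping rule is simplified to a minimum feature count:

```python
import numpy as np

def ablation_feature_selection(X_train, y_train, X_val, y_val, X_test, y_test,
                               train_network, test_accuracy, min_features=1):
    """Greedily drop the feature whose zeroing hurts test accuracy the least."""
    active = list(range(X_train.shape[1]))  # indices of features still in use
    while len(active) > min_features:
        # Step 1-2: train on the current features, measure baseline accuracy.
        net = train_network(X_train[:, active], y_train, X_val[:, active], y_val)
        base = test_accuracy(net, X_test[:, active], y_test)
        # Step 3: zero out each feature in turn and record the accuracy drop.
        drops = []
        for i in range(len(active)):
            X_zeroed = X_test[:, active].copy()
            X_zeroed[:, i] = 0.0  # "dropped" = always feed zero on this input
            drops.append(base - test_accuracy(net, X_zeroed, y_test))
        # Steps 4-5: remove the least useful feature and repeat.
        weakest = int(np.argmin(drops))
        del active[weakest]
    return active  # indices of the surviving features
```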
I am having a problem training a neural network with sparse input data to solve a supervised regression problem. When I perform mean normalization (subtract the mean, then divide by the standard deviation) on the input data, I get a lot of NaN values. I am wondering if anyone has experience dealing with this kind of problem. What is the correct way to scale sparse input data?
thanks
Joe
Sounds like your data is so sparse that the standard deviation is occasionally zero.
Test for that, and if so, don't divide your input by it (stdev normalization is not necessary in that case anyway).
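A minimal numpy sketch of that guard, assuming the features are the columns of a 2-D array:

```python
import numpy as np

def safe_standardize(X):
    """Mean-normalize each column, but skip the divide where std is zero."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns: subtracting the mean already yields 0
    return (X - mean) / std

X = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]])  # second column is constant
print(safe_standardize(X))  # no NaNs; the constant column becomes all zeros
```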
I currently have a lot of data that will be used to train a prediction neural network (gigabytes of weather data for major airports around the US). I have data for almost every day, but some airports have missing values in their data. For example, an airport might not have existed before 1995, so I have no data before then for that specific location. Also, some are missing whole years (one might span from 1990 to 2011, missing 2003).
What can I do to train with these missing values without misleading my neural network? I thought about filling the empty data with 0s or -1s, but I feel like this would cause the network to predict these values for some outputs.
I'm not an expert, but surely this would depend on the type of neural network you have?
The whole point of neural networks is that they can deal with noisy or incomplete information.
I agree, though, that filling the empty data with 0s and -1s can't be a good thing.
Perhaps you could give some info on your neural network?
I use NNs a lot for forecasting, and I can tell you that you can simply leave those "holes" in your data. NNs learn relationships within the observed data, so a missing period doesn't matter; if you set the empty data to a constant value instead, you will feed your training algorithm misleading information. NNs don't need "continuous" data. In fact, it's good practice to shuffle the data set before training so that the backpropagation phase runs on non-contiguous samples.
Well, a type of neural network called an autoencoder is suitable for your task. Autoencoders can be used to reconstruct their input: an autoencoder is trained to learn the underlying data manifold/distribution. They are mostly used for signal reconstruction tasks such as image and sound, but you could use one to fill in the missing features.
There is also another technique, known as matrix factorization, which is used in many recommendation systems. People use matrix factorization to fill huge matrices that have a lot of missing values. For instance, suppose there are 1 million movies on IMDb. Almost no one has watched even 1/10 of those movies in their life, but each user has rated some movies. The matrix is N by M, where N is the number of users and M the number of movies. Matrix factorization is among the techniques used to fill in the missing values and suggest movies to users based on their previous ratings of other movies.
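A minimal sketch of matrix factorization by gradient descent on only the observed entries; the rank, learning rate, regularization, and epoch count are arbitrary illustration values:

```python
import numpy as np

def factorize(R, rank=2, lr=0.01, reg=0.02, epochs=2000):
    """Fill missing entries (NaN) of R by fitting R ~ U @ V.T on observed cells."""
    n, m = R.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n, rank))
    V = rng.normal(scale=0.1, size=(m, rank))
    observed = np.argwhere(~np.isnan(R))  # only these cells drive the fit
    for _ in range(epochs):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * U[i] - reg * V[j])
    return U @ V.T  # every cell filled, including the former NaNs

R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 5.0]])
print(np.round(factorize(R), 2))
```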
I have a couple of questions about how to code the backpropagation algorithm of neural networks:
The topology of my network is an input layer, a hidden layer, and an output layer. Both the hidden layer and the output layer use sigmoid functions.
1. First of all, should I use a bias? Where should I connect the bias in my network? Should I put one bias unit per layer, in both the hidden layer and the output layer? What about the input layer?
2. In this link, they define the last delta as the input - output, and they backpropagate the deltas as can be seen in the figure. They hold a table to put all the deltas in before actually propagating the errors in a feedforward fashion. Is this a departure from the standard backpropagation algorithm?
3. Should I decrease the learning factor over time?
4. In case anyone knows, is Resilient Propagation an online or batch learning technique?
Thanks
edit: One more thing. In the following picture, d f1(e)/de, assuming I'm using the sigmoid function, is f1(e) * [1 - f1(e)], right?
1. It varies. Personally, I don't see much of a reason for a bias, but I haven't studied NNs enough to make a valid case for or against one. I'd try it both ways and test the results.
2. That's correct. Backpropagation involves calculating the deltas first, and then propagating them across the network.
3. Yes, the learning factor should be decreased over time. However, with BP you can hit local, incorrect plateaus, so sometimes around the 500th iteration it makes sense to reset the learning factor to the initial rate.
4. I can't answer that; I've never heard anything about Resilient Propagation.
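To make points 1 and 2 concrete, here is a minimal numpy sketch of one backpropagation step for the input-hidden-output topology described, with a delta table filled in before any weight is touched and bias units on the hidden and output layers; the layer sizes and learning rate are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_in, n_hid, n_out, lr = 3, 4, 2, 0.5

# Weight matrices include one extra row for the bias unit (constant input of 1).
W1 = rng.normal(scale=0.5, size=(n_in + 1, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid + 1, n_out))

x = np.array([0.2, -0.7, 0.5])
target = np.array([1.0, 0.0])

# Forward pass (append 1 for the bias at each layer).
h = sigmoid(np.append(x, 1.0) @ W1)
y = sigmoid(np.append(h, 1.0) @ W2)

# Delta "table": compute all deltas first, then update the weights.
# f'(e) = f(e) * (1 - f(e)) for the sigmoid, as asked in the edit.
delta_out = (target - y) * y * (1.0 - y)
delta_hid = (W2[:-1] @ delta_out) * h * (1.0 - h)  # bias row carries no error back

W2 += lr * np.outer(np.append(h, 1.0), delta_out)
W1 += lr * np.outer(np.append(x, 1.0), delta_hid)
```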
Your question needs to be specified a bit more thoroughly... What is your need? Generalization or memorization? Are you anticipating a complex pattern matching data set, or a continuous-domain input-output relationship? Here are my $0.02:
1. I would suggest you leave a bias neuron in, just in case you need it. If the NN deems it unnecessary, training should drive its weights to negligible values. It connects to every neuron in the layer ahead, but no neuron in the preceding layer connects to it.
2. The equation looks like standard backprop as far as I can tell.
3. It is hard to generalize whether your learning rate needs to be decreased over time. The behaviour is highly data-dependent. The smaller your learning rate, the more stable your training will be. However, it can be painfully slow, especially if you're running it in a scripting language, as I did once upon a time.
4. Resilient backprop (RProp in MATLAB) should handle both online and batch training modes.
I'd just like to add that you might want to consider alternative activation functions if possible. The sigmoid function doesn't always give the best results...
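For reference, two common alternatives and their derivatives, as drop-in replacements for sigmoid in the sketch above (swapping them in also means replacing the y * (1 - y) and h * (1 - h) terms with the matching derivative); which activation works best is data-dependent:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1.0 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    return (x > 0).astype(float)
```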