What is the 'order' of a perceptron - artificial-intelligence

A few simple marks for those who know the answer.
I'm doing revision for exams at the moment and one of the past questions is:
What is meant by the order of a perceptron?
I can't find any information about this in my lecture notes, and even Google seems at a loss.
My guess is that the order is the number of layers in a neural network, but this doesn't seem quite right.

If you want to evaluate the order, or cardinality, of a multilayer NN, count only the hidden (inner) layers; the input and output layers are not considered part of the cardinality of the network topology.
For example, a NN with 2 hidden layers has order = 2.
The funny thing is that, most of the time, more than one hidden layer is of no use for either performance or training.

Order of approximation in the learning algorithm. See orders of optimization algorithms.

Related

How to eliminate "unnecessary" values to a neural network?

My professor asked my class to make a neural network to try to predict if a breast cancer is benign or malignant. To do this I'm using the Breast Cancer Wisconsin (Diagnostic) Data Set.
As a tip, my professor said that not all 30 attributes need to be used as inputs (there are 32 columns, but the first 2 are the ID and the diagnosis). What I want to ask is: how am I supposed to take those 30 inputs (which would create 100+ weights, depending on how many neurons I use) and reduce them to a smaller number?
I've already found how to "prune" a neural net, but I don't think that's what I want. I'm not trying to eliminate unnecessary neurons, but to shrink the input itself.
PS: Sorry for any english errors, it's not my native language.
This is an area of active research. It is called feature selection, and several techniques already exist. One is Principal Components Analysis (PCA), which reduces the dimensionality of your dataset by keeping the directions (linear combinations of the original features) that retain the most variance. Another thing you can do is check for highly correlated variables. If two inputs are highly correlated, they probably carry almost the same information, so one of them can be removed without hurting the performance of your classifier much. A third option is deep learning, which tries to learn the features that will later be fed to your classifier. More info about deep learning and PCA can be found here: http://deeplearning.stanford.edu/wiki/index.php/Main_Page
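To make the correlation check concrete, here is a minimal sketch in plain Python. The feature names, the toy values and the 0.95 cut-off are all assumptions made up for illustration; they are not taken from the Wisconsin data set.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.95):
    """Return the names of the columns to keep.

    `columns` maps a feature name to its list of values. A column is dropped
    when its absolute correlation with an already-kept column exceeds
    `threshold` (an arbitrary, illustrative cut-off).
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data with hypothetical feature names: "perimeter" is almost a linear
# function of "radius", while "texture" is only weakly related to it.
features = {
    "radius": [1.0, 2.0, 3.0, 4.0, 5.0],
    "perimeter": [6.3, 12.5, 18.9, 25.1, 31.4],
    "texture": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_correlated(features))  # ['radius', 'texture']
```

Which of two correlated columns survives depends only on column order here; a real pipeline would probably want a better tie-break.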
This problem is called feature selection. It is mostly the same for neural networks as for other classifiers. You could prune your dataset while retaining the most variance using PCA. To go further, you could use a greedy approach and evaluate your features one by one by training and testing your network with each feature excluded in turn.
There is a technique for feature selection using just neural networks
Split your dataset into three groups:
Training data used for supervised training
Validation data used to verify that the neural network is able to generalize
Accuracy testing used to test which of the features are required
The steps:
Train a network on your training and validation set, just like you would normally do.
Test the accuracy of the network with the third dataset.
Locate the variable which yields the smallest drop in the accuracy test above when dropped (dropped meaning always feeding a zero as the input signal).
Retrain your network with the new selection of features
Keep doing this until either the network fails to train or there is just one variable left.
Here is a paper on the technique
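The loop described in the steps above could be sketched roughly like this. To keep it short and runnable, a nearest-centroid classifier stands in for the neural network, and "fails to train" is approximated by validation accuracy falling below `min_accuracy`; both are assumptions of this sketch, as are all function names and the toy data.

```python
def train_centroids(rows, labels, active):
    """Stand-in for network training: per-class mean of the active inputs."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i in active:
            acc[i] += row[i]
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, row, active):
    """Predict the class whose centroid is closest on the active inputs."""
    def dist(centroid):
        return sum((row[i] - centroid[i]) ** 2 for i in active)
    return min(model, key=lambda label: dist(model[label]))

def accuracy(model, rows, labels, active):
    hits = sum(predict(model, r, active) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def zeroed(row, index):
    """Copy of `row` with one input forced to zero (the 'dropped' signal)."""
    return [0.0 if i == index else v for i, v in enumerate(row)]

def backward_eliminate(train, valid, select, min_accuracy=0.9):
    (train_x, train_y), (valid_x, valid_y), (sel_x, sel_y) = train, valid, select
    active = list(range(len(train_x[0])))
    while len(active) > 1:
        model = train_centroids(train_x, train_y, active)
        # Zero each input in turn and measure accuracy on the third data set.
        scores = {}
        for i in active:
            rows = [zeroed(r, i) for r in sel_x]
            scores[i] = accuracy(model, rows, sel_y, active)
        # The weakest input is the one whose zeroing costs the least accuracy.
        weakest = max(active, key=lambda i: scores[i])
        candidate = [i for i in active if i != weakest]
        # Retrain without it; stop if the retrained model no longer validates.
        retrained = train_centroids(train_x, train_y, candidate)
        if accuracy(retrained, valid_x, valid_y, candidate) < min_accuracy:
            break
        active = candidate
    return active

# Toy data: input 0 separates the classes, inputs 1 and 2 are noise.
train = ([[1.0, 0.5, 0.2], [1.2, 0.4, 0.3], [5.0, 0.5, 0.3], [5.2, 0.6, 0.2]],
         [0, 0, 1, 1])
valid = ([[0.8, 0.6, 0.1], [4.8, 0.4, 0.1]], [0, 1])
select = ([[1.1, 0.5, 0.3], [5.1, 0.5, 0.2]], [0, 1])
print(backward_eliminate(train, valid, select))  # [0]
```

With a real network, `train_centroids` and `predict` would be replaced by your own training and inference routines; the elimination loop itself stays the same.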

Missing values for the data to be used in a Neural Network model for prediction

I currently have a lot of data that will be used to train a prediction neural network (gigabytes of weather data for major airports around the US). I have data for almost every day, but some airports have missing values in their data. For example, an airport might not have existed before 1995, so I have no data before then for that specific location. Also, some are missing whole years (one might span from 1990 to 2011, missing 2003).
What can I do to train with these missing values without misguiding my neural network? I thought about filling the empty data with 0s or -1s, but I feel like this would cause the network to predict these values for some outputs.
I'm not an expert, but surely this would depend on the type of neural network you have?
The whole point of neural networks is they can deal with missing information and so forth.
I agree though, setting empty data with 1's and 0's can't be a good thing.
Perhaps you could give some info on your neural network?
I use NNs a lot for forecasting, and I can tell you that you can simply leave those "holes" in your data. NNs learn relationships within the observed data, so if a specific period is missing it doesn't matter. If you set the empty data to a constant value, you will give your training algorithm misleading information. NNs don't need "continuous" data; in fact, it's good practice to shuffle the data set before training so that backpropagation runs on non-contiguous samples.
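As a rough illustration of leaving the holes out and shuffling, here is a small sketch. The record layout, the airport codes and the values are invented for the example, and it assumes missing readings are stored as `None`.

```python
import random

def usable_samples(records, required_fields):
    """Keep only the records in which every required field was observed."""
    return [r for r in records
            if all(r.get(f) is not None for f in required_fields)]

# Made-up toy records; "AAA" and "BBB" are placeholder airport codes.
records = [
    {"airport": "AAA", "year": 2002, "temp": 11.5, "wind": 4.0},
    {"airport": "AAA", "year": 2003, "temp": None, "wind": None},  # missing year
    {"airport": "AAA", "year": 2004, "temp": 12.1, "wind": 3.2},
    {"airport": "BBB", "year": 2004, "temp": 25.0, "wind": 6.1},
]

samples = usable_samples(records, ["temp", "wind"])
# Shuffle so training does not see the samples in chronological order.
# The fixed seed is only there to make the sketch repeatable.
random.Random(0).shuffle(samples)
print(len(samples))  # 3
```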
A type of neural network called an autoencoder is suitable for this. Autoencoders can be used to reconstruct the input: an autoencoder is trained to learn the underlying data manifold/distribution. They are mostly used for signal reconstruction tasks such as image and sound, but you could also use them to fill in the missing features.
There is also another technique, known as "matrix factorization", which is used in many recommendation systems. People use matrix factorization techniques to fill huge matrices with a lot of missing values. For instance, suppose there are 1 million movies on IMDb. Almost no one has watched even 1/10 of those movies in her life, but she has rated some of them. The matrix is N by M, where N is the number of users and M the number of movies. Matrix factorization is among the techniques used to fill in the missing values and suggest movies to users based on their previous ratings of other movies.
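For a feel of how matrix factorization fills a gap, here is a deliberately tiny sketch: plain stochastic gradient descent, no regularisation, pure Python. The function names, the hyperparameters and the 3 x 2 toy matrix are assumptions for illustration, not a recipe for gigabytes of weather data.

```python
import random

def factorize(ratings, rank=1, steps=5000, lr=0.01, seed=0):
    """Fit ratings[i][j] ~ dot(P[i], Q[j]) using only the observed entries.

    `ratings` is a list of rows; None marks a missing value.
    """
    rng = random.Random(seed)
    n, m = len(ratings), len(ratings[0])
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(m)]
    observed = [(i, j, ratings[i][j]) for i in range(n) for j in range(m)
                if ratings[i][j] is not None]
    for _ in range(steps):
        for i, j, value in observed:
            err = value - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                P[i][k] += lr * err * q  # gradient step on the row factor
                Q[j][k] += lr * err * p  # gradient step on the column factor
    return P, Q

def predict_rating(P, Q, i, j):
    """Reconstructed value for cell (i, j), observed or not."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

# Every row is a multiple of [1, 2]; the last cell is unknown.
ratings = [
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, None],
]
P, Q = factorize(ratings)
print(round(predict_rating(P, Q, 2, 1), 2))  # should come out close to 6
```

The missing cell is recovered only because the toy matrix really is low-rank; on real data you would add regularisation and check the fit on held-out entries.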

Pruning Deductions in Expert Systems

In a rule system, or any reasoning system that deduces facts via forward-chaining inference rules, how would you prune "unnecessary" branches? I'm not sure what the formal terminology is, but I'm just trying to understand how people are able to limit their train-of-thought when reasoning over problems, whereas all semantic reasoners I've seen appear unable to do this.
For example, in John McCarthy's paper An Example for Natural Language Understanding and the AI Problems It Raises, he describes potential problems in getting a program to intelligently answer questions about a news article in the New York Times. In section 4, "The Need For Nonmonotonic Reasoning", he discusses the use of Occam's Razor to restrict the inclusion of facts when reasoning about the story. The sample story he uses is one about robbers who victimize a furniture store owner.
If a program were asked to form a "minimal completion" of the story in predicate calculus, it might need to include facts not directly mentioned in the original story. However, it would also need some way of knowing when to limit its chain of deduction, so as not to include irrelevant details. For example, it might want to include the exact number of police involved in the case, which the article omits, but it won't want to include the fact that each police officer has a mother.
Good Question.
From your question, I think what you refer to as 'pruning' is a model-building step performed ex ante, i.e., to limit the inputs available to the algorithm to build the model. The term 'pruning', when used in Machine Learning, refers to something different: an ex post step, performed after model construction, that operates upon the model itself and not on the available inputs. (There could be a second meaning of the term 'pruning' in the ML domain, but I'm not aware of it.) In other words, pruning is indeed literally a technique to "limit its chain of deduction", as you put it, but it does so ex post, by excision of components of a complete (working) model, and not by limiting the inputs used to create that model.
On the other hand, isolating or limiting the inputs available for model construction--which is what I think you might have had in mind--is indeed a key Machine Learning theme; it's clearly a factor responsible for the superior performance of many of the more recent ML algorithms. Examples are Support Vector Machines (the insight that underlies SVM is construction of the maximum-margin hyperplane from only a small subset of the data, i.e., the 'support vectors') and Multivariate Adaptive Regression Splines, or MARS (a regression technique in which no attempt is made to fit the data by "drawing a single continuous curve through it"; instead, discrete sections of the data are fit, one by one, using a bounded linear equation for each portion, i.e., the 'splines', so the prerequisite step of optimal partitioning of the data is obviously the crux of this algorithm).
What problem is solved by pruning?
At least w/r/t the specific ML algorithms I have actually coded and used--Decision Trees, MARS, and Neural Networks--pruning is performed on an initially over-fit model (a model that fits the training data so closely that it is unable to generalize, i.e., accurately predict new instances). In each instance, pruning involves removing marginal nodes (DT, NN) or terms in the regression equation (MARS) one by one.
Second, why is pruning necessary/desirable?
Isn't it better to just accurately set the convergence/splitting criteria? That won't always help. Pruning works from "the bottom up", whereas the model is constructed from the top down, so tuning the model (to achieve the same benefit as pruning) eliminates not just one or more decision nodes but also the child nodes below them (like trimming a tree closer to the trunk). So eliminating a marginal node might also eliminate one or more strong nodes subordinate to that marginal node--but the modeler would never know that, because his/her tuning prevented further node creation at that marginal node. Pruning works from the other direction--from the most subordinate (lowest-level) child nodes upward in the direction of the root node.

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weighs point 2 higher, and this would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had would be to just calculate the compatibility of every pair of users and select the highest-rated users as the neighbours for each user. The downside is that as the number of users goes up, this process could take a very long time. For just 1000 users, it needs 1000C2 (0.5 * 1000 * 999 = 499,500) calls to the compatibility function, which could also be very heavy on the server.
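Just to show how quickly the all-pairs count grows, here is a tiny helper (the function name is made up for this example):

```python
def pairwise_calls(n_users):
    """Number of unordered user pairs, i.e. compatibility calls needed."""
    return n_users * (n_users - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(n, pairwise_calls(n))
# 1000 499500
# 10000 49995000
# 100000 4999950000
```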
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they give those items high ratings frequently. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
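A minimal sketch of the category roll-up might look like this. It assumes the aggregate score is simply the sum of a user's ratings per category (so both frequency and strength count) and compares users by cosine similarity; the users, items and ratings are made up.

```python
import math
from collections import defaultdict

def category_profile(ratings, item_category):
    """Sum one user's ratings per category.

    `ratings` maps item -> rating; `item_category` maps item -> category.
    """
    profile = defaultdict(float)
    for item, rating in ratings.items():
        profile[item_category[item]] += rating
    return dict(profile)

def cosine(a, b):
    """Cosine similarity between two sparse category profiles."""
    dot = sum(a[c] * b[c] for c in a if c in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(user, profiles):
    """The other user whose category profile is most similar."""
    others = (u for u in profiles if u != user)
    return max(others, key=lambda u: cosine(profiles[user], profiles[u]))

item_category = {"song1": "jazz", "song2": "jazz",
                 "song3": "rock", "song4": "classical"}
user_ratings = {
    "alice": {"song1": 5, "song2": 4, "song3": 1},
    "bob": {"song1": 4, "song2": 5},
    "carol": {"song3": 5, "song4": 4},
}
profiles = {u: category_profile(r, item_category)
            for u, r in user_ratings.items()}
print(nearest("alice", profiles))  # bob
```

Each profile has one entry per category rather than per item, which is where the speed-up comes from.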
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
The question now is how to choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
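Here is a small sketch of the dissimilarity-space embedding with randomly chosen prototypes. The one-dimensional "taste" values and the `compatibility` function are hypothetical stand-ins for your own function; the call counter is only there to show that the cost is n * k rather than n * (n - 1) / 2.

```python
import random

def embed(users, dissimilarity, n_prototypes, seed=0):
    """Represent each user by its distances to randomly chosen prototypes."""
    rng = random.Random(seed)
    prototypes = rng.sample(users, n_prototypes)
    vectors = {u: [dissimilarity(u, p) for p in prototypes] for u in users}
    return prototypes, vectors

# Hypothetical data: each user has a single numeric "taste" value.
tastes = {f"user{i}": i % 10 for i in range(50)}
counter = {"calls": 0}

def compatibility(a, b):
    """Made-up compatibility score in (0, 1]; higher means more similar."""
    counter["calls"] += 1
    return 1.0 / (1.0 + abs(tastes[a] - tastes[b]))

def dissimilarity(a, b):
    """Reciprocal of the compatibility, as suggested above."""
    return 1.0 / compatibility(a, b)

prototypes, vectors = embed(list(tastes), dissimilarity, n_prototypes=5)
all_pairs = len(tastes) * (len(tastes) - 1) // 2
print(counter["calls"], "calls instead of", all_pairs)  # 250 calls instead of 1225
```

The resulting `vectors` are ordinary points in a 5-dimensional space, so they can be handed to K-means or any other standard clustering routine.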
This looks like a 'classification problem'. There are many solutions and approaches.
To start exploration check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I link to, display the net as two-dimensional, little is involved in extending the algorithm to a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that will define similarity and establishing distances between the possible enumerated values; for example, classical and acoustic are close together, while death metal and reggae are quite distant (at least in my opinion).
By the way in order to find good dividing variables the best algorithm is a decision tree. The nodes closer to the root will be the most important variables to establish 'closeness'.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighborhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be computed statically and then updated periodically (e.g. hourly or daily) to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
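A batch job along those lines could be sketched as follows. The `similarity` function and the toy taste values are placeholders for your own compatibility function and data; the table would be rebuilt on whatever schedule suits you and looked up at query time.

```python
import heapq

def build_neighbour_table(users, similarity, k=10):
    """Precompute the k most similar users for every user (offline batch)."""
    table = {}
    for u in users:
        others = (v for v in users if v != u)
        table[u] = heapq.nlargest(k, others, key=lambda v: similarity(u, v))
    return table

# Hypothetical data: one numeric "taste" value per user.
tastes = {"a": 1, "b": 2, "c": 10, "d": 11}

def similarity(u, v):
    """Placeholder score: closer taste values are more similar."""
    return -abs(tastes[u] - tastes[v])

table = build_neighbour_table(list(tastes), similarity, k=1)
print(table["a"], table["c"])  # ['b'] ['d']
```

At runtime the lookup is a plain dictionary access, so the expensive pairwise work never happens while serving a request.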
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.

Measuring the performance of classification algorithm

I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, it might get stuck in an overfitting problem--the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples, and I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of classification algorithms.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
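In code, the procedure could look roughly like this. The threshold "classifier" and the toy data are stand-ins so that the sketch runs on its own; `train_fn` and `predict_fn` are hypothetical hooks where your Bayes or Markov classifier would go.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_x = [samples[i] for i in range(len(samples)) if i not in held_out]
        train_y = [labels[i] for i in range(len(samples)) if i not in held_out]
        model = train_fn(train_x, train_y)  # never sees the held-out fold
        hits = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

def train_threshold(xs, ys):
    """Stand-in classifier: threshold halfway between the two class means."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# Toy data: two well-separated classes of 50 samples each.
samples = list(range(50)) + list(range(100, 150))
labels = [0] * 50 + [1] * 50
score = cross_validate(samples, labels, train_threshold, predict_threshold)
print(score)  # 1.0
```

The point is that every sample is used for testing exactly once, and never by a model that was trained on it.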
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka, which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.

Resources