Relation between the centering matrix and multidimensional scaling

I am trying to understand the steps of multidimensional scaling. The method is based on the centering matrix, but I do not understand its exact role.

Multidimensional scaling computes a set of coordinates in a series of increasing dimensions so you can see which dimensions account for the major steps in variance reduction (the data's effective dimensionality) and which are better considered randomness; randomness tends to be spread across all dimensions, while a few dimensions account for the structure in the data. Centering takes each person's data and puts it on a common footing by removing its mean; full standardization goes further and brings each person to a common variance and standard deviation. How much of this makes sense for your data-generating process is a matter of some delicacy and has had extensive discussion, particularly with regard to variance. Variance uniformity can be imposed by person or, in the case of multiple measures, by measure. The article on centering by the mean is at https://en.wikipedia.org/wiki/Centering_matrix and the one on MDS is at https://en.wikipedia.org/wiki/Multidimensional_scaling
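To make the centering matrix's role concrete, here is a minimal NumPy sketch of classical MDS (my own illustration, not part of the original answer). The centering matrix J = I - (1/n) 11^T is applied on both sides of the squared-distance matrix, which turns pairwise distances into a centered inner-product matrix; the top eigenvectors of that matrix give the low-dimensional coordinates.

    import numpy as np

    def classical_mds(D, k=2):
        """Classical MDS: D is an (n, n) matrix of pairwise distances."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n      # the centering matrix
        B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
        eigvals, eigvecs = np.linalg.eigh(B)     # ascending eigenvalues
        idx = np.argsort(eigvals)[::-1][:k]      # keep the k largest
        scale = np.sqrt(np.clip(eigvals[idx], 0, None))
        return eigvecs[:, idx] * scale           # (n, k) coordinates

    # Toy usage: four points on a line; MDS with k=1 recovers them (up to sign).
    X = np.array([[0.0], [1.0], [3.0], [6.0]])
    D = np.abs(X - X.T)
    print(classical_mds(D, k=1))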

Related

Deep Learning: Dataset containing images of varying dimensions and orientations

This is a repost of the question asked in ai.stackexchange. Since there is not much traction in that forum, I thought I might try my chances here.
I have a dataset of images of varying dimensions of a certain object. A few images of the object are also in varying orientations. The objective is to learn the features of the object (using Autoencoders).
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image, or should I strictly consider a dataset containing images of uniform dimensions? What are the necessary criteria for a dataset to be eligible for training a deep network in general?
The idea is, I want to avoid pre-processing my dataset by normalizing it via scaling, re-orienting operations etc. I would like my network to account for the variability in dimensions and orientations. Please point me to resources for the same.
EDIT:
As an example, consider a dataset consisting of images of bananas. They are of varying sizes, say, 265x525 px, 1200x1200 px, 165x520 px etc. 90% of the images display the banana in one orthogonal orientation (say, front view) and the rest display the banana in varying orientations (say, isometric views).
Almost always people will resize all their images to the same size before sending them to the CNN. Unless you're up for a real challenge this is probably what you should do.
That said, it is possible to build a single CNN that takes input images of varying dimensions. There are a number of ways you might try to do this, and I'm not aware of any published science analyzing these different choices. The key is that the set of learned parameters needs to be shared between the different input sizes. While convolutions can be applied at different image sizes, ultimately everything gets converted to a single vector to make predictions with, and the size of that vector will depend on the geometries of the inputs, convolutions and pooling layers. You'd probably want to dynamically change the pooling layers based on the input geometry and leave the convolutions the same, since the convolutional layers have parameters and pooling usually doesn't. So on bigger images you pool more aggressively.
Practically, you'd want to group similarly (ideally identically) sized images into minibatches for efficient processing. This is common for LSTM-type models, where the technique is commonly called "bucketing". See for example http://mxnet.io/how_to/bucketing.html for a description of how to do this efficiently.
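As a rough sketch of the bucketing idea in plain Python (the helper below is hypothetical, not taken from the mxnet tutorial): group images by their exact (height, width), then cut minibatches within each bucket so every batch has a uniform shape.

    from collections import defaultdict

    def bucketed_minibatches(images, batch_size):
        """Yield minibatches in which all images share the same (height, width)."""
        buckets = defaultdict(list)
        for img in images:
            buckets[img.shape[:2]].append(img)   # key on (height, width)
        for same_size in buckets.values():
            for i in range(0, len(same_size), batch_size):
                yield same_size[i:i + batch_size]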
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image
The usual way to deal with different images is the following:
You take one or multiple crops of the image to make width = height. If you take multiple crops, you pass all of them through the network and average the results.
You scale the crop(s) to the size which is necessary for the network.
However, there is also Global Average Pooling (e.g. Keras docs).
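For instance, a minimal Keras sketch (my own; it assumes 3-channel input and 10 output classes) that accepts any image height and width by ending the convolutional stack with global average pooling:

    from tensorflow.keras import layers, models

    # None in the spatial dimensions means "any size"; GlobalAveragePooling2D
    # collapses whatever spatial grid the convolutions produce into a
    # fixed-length vector, so the Dense layer always sees the same input size.
    model = models.Sequential([
        layers.Input(shape=(None, None, 3)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),
    ])
    model.summary()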
What are the necessary criteria for a dataset to be eligible for training a deep network in general?
That is a difficult question to answer because (1) there are many different approaches in deep learning and the field is quite young, and (2) I'm pretty sure there is no quantitative answer right now.
Here are two rules of thumb:
You should have at least 50 examples per class
The more parameters your model has, the more data you need
Learning curves and validation curves help to estimate the effect of more training data.
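For instance, a quick sketch with scikit-learn's learning_curve (the digits dataset and logistic regression here are just placeholders):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = load_digits(return_X_y=True)
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=2000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    # If the validation score is still rising at the largest training size,
    # more data is likely to help; a persistent gap between the two curves
    # points to overfitting.
    print(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1))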

Feature selection for sparse and unbalanced high dimensional data

I have a highly unbalanced data with very scarce positive labels. The data is very high dimensional. On top of that my features are also very sparse.
So what would be the best way to do feature selection in this case? A correlation measure, whether rank-based like Spearman or linear like Pearson, will not be a good choice, because most of my labels as well as my features are zeros, so a feature can appear highly correlated even though it is not actually significant.
Any suggestions?
SVMs work well for classifying sparse data. By examining the learned weights (for example, those of a linear kernel) you can identify the features that were more important than others and use those for your feature selection.
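As one concrete sketch (mine; the answer does not prescribe a particular library), an L1-penalized linear SVM in scikit-learn can do the selection directly: the penalty drives the weights of unhelpful features to zero, and class_weight compensates for the scarce positive labels.

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.svm import LinearSVC

    # Toy imbalanced data: 2000 samples, 500 features, ~5% positives,
    # stored sparsely to mimic the situation in the question.
    X, y = make_classification(n_samples=2000, n_features=500, n_informative=20,
                               weights=[0.95, 0.05], random_state=0)
    X = csr_matrix(X * (np.abs(X) > 1.0))    # zero out most entries

    svm = LinearSVC(penalty="l1", dual=False, C=0.5,
                    class_weight="balanced", max_iter=5000)
    selector = SelectFromModel(svm)
    X_reduced = selector.fit_transform(X, y)
    print(X.shape, "->", X_reduced.shape)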

Most appropriate AI for parameter weighting?

I have data of this form:
[(v1, A1, B1), (v2, A2, B2), (v3, A3, B3), ...]
The vs correspond to the data elements and the As and Bs to numerical values characterizing the vs.
A human looking at this data can see which tuple seems the best "match" according to the A and B values. I want a form of AI that I could train by picking one of these tuples as the best, and that would adjust the weights given to A and B.
Basically, each tuple represents an approximation to a value. A represents an error and B represents the complexity of each approximation. I want some compromise between error and complexity by assigning them different weightings. I want to run several trials with approximations to different values, and choose the one I think looks the best, and have the AI adjust the weightings correspondingly.
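One minimal way to formalize this (my own sketch; the weights and names are hypothetical) is a weighted score over A and B, with the weights adjusted until the automatic pick matches the tuple you would choose by eye:

    def best_tuple(tuples, w_error=1.0, w_complexity=0.1):
        """tuples is a list of (v, A, B); return the v with the lowest weighted score."""
        return min(tuples, key=lambda t: w_error * t[1] + w_complexity * t[2])[0]

    candidates = [("v1", 0.20, 3), ("v2", 0.05, 12), ("v3", 0.08, 5)]
    print(best_tuple(candidates))   # tweak the weights until the picks match yours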
What you described is also known as a model selection problem, something often encountered in machine learning and statistics. You basically have some models that fit your data by some measure of goodness (typically measured as error or log likelihood) and those models have some complexity measure (typically the number of parameters in the model). You want to pick the best fitting model and penalize its complexity because that can be a sign of overfitting.
Typically, the degree to which overfitting can affect you is driven by the size of your data. But there are some measures that explicitly allow you to trade off model fitness and complexity (a small numerical sketch follows the list):
Akaike information criterion
Bayesian information criterion
Regularization
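As a small numerical sketch (mine, not part of the answer): AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where L is the maximized likelihood, k the number of parameters and n the number of data points; lower values are better.

    import math

    def aic(log_likelihood, k):
        """Akaike information criterion: 2k - 2 ln L."""
        return 2 * k - 2 * log_likelihood

    def bic(log_likelihood, k, n):
        """Bayesian information criterion: k ln(n) - 2 ln L."""
        return k * math.log(n) - 2 * log_likelihood

    # Two hypothetical approximations of the same data (n = 100 points):
    # the first fits better (higher log-likelihood) but uses more parameters.
    print(aic(-120.0, k=8), aic(-130.0, k=3))
    print(bic(-120.0, k=8, n=100), bic(-130.0, k=3, n=100))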
Choosing a model based on your data as above can bias the model choice toward the data. Thus, this is typically done using a validation set and then evaluated on a test set.
I don't know if your approach of having an algorithm solve this problem is a good one. Typically it depends on your data and some degree of intuition. The meta-machine-learning technique you described probably won't be too reliable, in my opinion. Better to start with some more principled and simpler ideas first.

2D Game: Fast(est) way to find x closest entities for another entity - huge amount of entities, highly dynamic

I'm working on a 2D game that has a huge amount of dynamic entities.
For fun's sake, let's call them soldiers, and let's say there are 50000 of them (which I just randomly thought up, it might be much more or much less :)).
All these soldiers are moving every frame according to rules - think boids / flocking / steering behaviour.
For each soldier, to update its movement I need the X soldiers that are closest to the one I'm processing.
What would be the best spatial hierarchy to store them in so as to facilitate calculations like this without too much overhead?
(All entities are updated/moved every frame, so it has to handle dynamic entities very well)
The simplest approach is to use a grid. It has several advantages:
simple
fast
easy to add and remove objects
easy to change the grid to a finer detail if you are still doing too many distance checks
Also, make sure you don't take a square root for every distance check. Since you are only comparing the distances, you can compare the squared distances instead.
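A minimal sketch of the grid idea (the cell size, function names and 3x3 neighbourhood below are my own choices; entities are plain (x, y) points): hash each soldier into a cell, gather candidates from its cell and the eight surrounding ones, and compare squared distances only.

    from collections import defaultdict

    CELL = 32.0   # cell size in world units; tune it to the typical query radius

    def build_grid(points):
        """Hash each point index into its (cx, cy) cell."""
        grid = defaultdict(list)
        for i, (x, y) in enumerate(points):
            grid[(int(x // CELL), int(y // CELL))].append(i)
        return grid

    def k_nearest(points, grid, i, k):
        """Indices of the k points closest to points[i], checking only the
        cell of i and its 8 neighbours (widen the search if you need more)."""
        x, y = points[i]
        cx, cy = int(x // CELL), int(y // CELL)
        candidates = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), []):
                    if j != i:
                        px, py = points[j]
                        candidates.append(((px - x) ** 2 + (py - y) ** 2, j))
        candidates.sort()                     # squared distances, no square root
        return [j for _, j in candidates[:k]]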
For broad-phase collision detection, a spatial index like a quad-tree (since it's 2D) or a grid will do. I've linked to Metanet Software's tutorial before; it outlines a grid-based scheme. Of course, your game doesn't even need to use grids so extensively. Just store each actor in a hidden grid and collide it with objects in the same and neighboring cells.
The whole point of selecting a good spatial hierarchy is to be able to quickly select only the objects that need testing.
(Once you've found out that small subset, the square-root is probably not going to hurt that much anymore)
I'm also interested in what the best / most optimal method is for 2d spatial indexing of highly dynamic objects.
Quadtree / kd-tree seem nice, but they're not too good with dynamic insertions.
KD-trees can become unbalanced.
Just a random thought: if your entities are points, a doubly insertion-sorted structure (by X and Y) in combination with a binary search might be something to try?

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weighs point 2 higher, and this would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had would be to just calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside of this is that as the number of users goes up, this process could take a very long time. For just 1000 users, it needs C(1000, 2) = 0.5 * 1000 * 999 = 499,500 calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they give those items high ratings frequently. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
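A hypothetical sketch of that roll-up (the data layout is mine, not the answerer's): aggregate each user's ratings per category, then compare those small per-category profiles instead of full rating histories.

    from collections import defaultdict

    def category_profiles(ratings, item_category):
        """ratings: (user, item, score) triples; item_category: item -> genre.
        Summing the scores folds both frequency and strength into one number."""
        profiles = defaultdict(lambda: defaultdict(float))
        for user, item, score in ratings:
            profiles[user][item_category[item]] += score
        return profiles

    # To find neighbours, compare these per-category profiles (e.g. cosine
    # similarity over categories) instead of the full item-level ratings.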
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
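For example, a sketch with scikit-learn's SpectralClustering (the random similarity matrix below merely stands in for your precomputed compatibility scores):

    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    S = rng.random((50, 50))        # stand-in for the pairwise compatibility matrix
    S = (S + S.T) / 2               # must be symmetric, larger = more similar
    np.fill_diagonal(S, 1.0)

    labels = SpectralClustering(n_clusters=5, affinity="precomputed").fit_predict(S)
    # Users that share a label are each other's 'neighbours'.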
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement by a factor of about 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement by a factor of 2,500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
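Here is a rough sketch of that pipeline (the compatibility function is a toy stand-in, and the choice of 200 prototypes and 10 clusters is arbitrary):

    import numpy as np
    from sklearn.cluster import KMeans

    def compatibility(a, b):
        return 1.0 / (1.0 + abs(a - b))       # toy stand-in for the site's function

    users = np.arange(1000)                   # toy "users"
    rng = np.random.default_rng(0)
    prototypes = rng.choice(users, size=200, replace=False)

    # Each user becomes a 200-dimensional point: its dissimilarity
    # (reciprocal compatibility) to every prototype.
    features = np.array([[1.0 / compatibility(u, p) for p in prototypes]
                         for u in users])

    labels = KMeans(n_clusters=10, n_init=10).fit_predict(features)
    # Neighbours of a user = the other users in its cluster.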
This seems to be a 'classification' problem. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I linked you to, display the net as two-dimensional, there is little involved in extending the algorithm to a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, since similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that will define similarity and establishing distances between the possible enumerated values; for example, classical and acoustic are close together while death metal and reggae are quite distant (at least in my opinion).
By the way, in order to find good dividing variables, the best algorithm is a decision tree: the nodes closer to the root will be the most important variables for establishing 'closeness'.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighborhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be statically computed and then lazily updated, e.g. hourly or daily, to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.
