I have been searching the web, but I have not found any datasets of huge linear systems.
Do you know of any website where I can get one, say of size $100000 \times 100000$ or maybe a little bigger?
Perhaps the University of Florida Sparse Matrix Collection at http://www.cise.ufl.edu/research/sparse/matrices/ has what you are looking for.
The other answers to the question "Where can one obtain good data sets/test problems for testing algorithms/routines?" on the Computational Science SE may also be useful for you.
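If it helps, here is a minimal sketch of loading one of those matrices with SciPy and solving a system with it, assuming you have downloaded it in MatrixMarket (.mtx) format; the file name is a placeholder:

    import numpy as np
    from scipy.io import mmread
    from scipy.sparse.linalg import spsolve

    # "some_matrix.mtx" is a hypothetical file downloaded from the collection.
    A = mmread("some_matrix.mtx").tocsr()
    b = np.ones(A.shape[0])   # arbitrary right-hand side
    x = spsolve(A, b)         # solve the sparse linear system Ax = b
    print(x[:5])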
I have to develop a personality/job-suitability online test for an HR department. Basically, users will answer questions on a scale of 0-10, for example, and after say 50 questions, I want to translate that into a rating on 5 different personality/job-suitability characteristics.
I don't have any real data to start with, so first: is it even worth using a recommendation engine like MyMediaLite (on GitHub)? How many samples would I need to train it to decent performance?
I previously built a training-course recommender by simply doing a hand-weighted sum, where each question increased the weight of several courses related to that question. It was an expert system, built like a feed-forward neural network, where I personally tuned all the weights based on my knowledge of the questions and the courses' content.
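For illustration, here is a toy sketch of that hand-weighted-sum idea; the questions, courses, and weights below are all made up:

    import numpy as np

    answers = np.array([7, 2, 9])        # three questions, each scored 0-10
    weights = np.array([                 # hand-tuned question-to-course weights
        [0.8, 0.0, 0.2],                 # question 1 mostly favours course A
        [0.1, 0.9, 0.0],                 # question 2 favours course B
        [0.0, 0.3, 0.7],                 # question 3 favours course C
    ])
    scores = answers @ weights           # one weighted sum per course
    for course, s in zip("ABC", scores):
        print(f"course {course}: {s:.1f}")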
This time around I would like to use a recommender system, but I'm wondering how many times I would have to take the 50-question test and assign the results manually. Would 100 examples do? That could be feasible. 1000 would take too long. How can I know ahead of time?
Though it may not be what you want to hear, it is not possible to give a definite number in advance. You should focus on the learning curve as you add new samples.
You can process the samples by hand and by the engine in parallel, and compare the results given by both. Once some measurement (e.g. the precision and recall of the engine's results) reaches your expectation, you have enough samples.
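For example, here is a minimal learning-curve sketch with scikit-learn; the logistic regression model and random data are only placeholders for your own engine and hand-labelled tests:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    rng = np.random.default_rng(0)
    X = rng.random((200, 50))            # 200 hypothetical 50-question tests
    y = rng.integers(0, 5, 200)          # hand-assigned suitability classes

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:4d} samples -> cross-validated accuracy {score:.2f}")
    # Once the curve flattens, more hand-labelled samples add little.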
Hope this helps!
I have a set of 50 million text snippets and I would like to create some clusters out of them. The dimensionality might be somewhere between 60k and 100k. The average text snippet length is 16 words. As you can imagine, the frequency matrix would be pretty sparse. I am looking for a software package / library / SDK that would allow me to find those clusters. I tried CLUTO in the past, but this seems a very heavy task for CLUTO. From my research online I found that BIRCH is an algorithm that can handle such problems, but unfortunately I couldn't find any BIRCH implementation online (I only found a couple of ad-hoc implementations, like assignment projects, that lacked any sort of documentation whatsoever). Any suggestions?
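For a sense of the sparsity I mean, here is a toy bag-of-words sketch (scikit-learn; the snippets are made up):

    from sklearn.feature_extraction.text import CountVectorizer

    snippets = ["sparse matrix of word counts",
                "short text snippets cluster well",
                "word counts stay mostly zero"]
    X = CountVectorizer().fit_transform(snippets)  # SciPy CSR sparse matrix
    print(X.shape, f"{X.nnz} non-zeros out of {X.shape[0] * X.shape[1]} cells")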
You may be interested in checking out the Streaming EM-tree algorithm, which uses the TopSig representation. Both of these are from my Ph.D. thesis on the topic of large-scale document clustering.
We recently clustered 733 million documents on a single 16-core machine (http://ktree.sf.net). It took about 2.5 days to index the documents and 15 hours to cluster them.
The Streaming EM-tree algorithm can be found at https://github.com/cmdevries/LMW-tree. It works with binary document vectors produced by TopSig, which can be found at http://topsig.googlecode.com.
I wrote a blog post about a similar approach earlier at http://chris.de-vries.id.au/2013/07/large-scale-document-clustering.html. However, the EM-tree scales better for parallel execution and also produces better quality clusters.
If you have any questions please feel free to contact me at chris#de-vries.id.au.
My professor made this implementation of the BIRCH algorithm in Java. It is easy to read, with some inline comments.
Try it with a graph partitioning algorithm. It may make clustering your high-dimensional data feasible.
I suppose you're actually looking for something like all-pairs similarity search.
This will give you pairs of similar records up to a desired threshold. You can then use a bit of graph theory to extract clusters: consider each pair an edge. Extracting connected components will give you something like single-linkage clustering, while finding cliques will give you complete-linkage clusters.
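For instance, here is a minimal sketch of the connected-components step in Python (plain union-find; the pairs are assumed to come from your all-pairs search):

    def cluster_pairs(pairs, n):
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x
        for a, b in pairs:                     # each similar pair is an edge
            parent[find(a)] = find(b)
        clusters = {}
        for i in range(n):
            clusters.setdefault(find(i), []).append(i)
        return list(clusters.values())

    # e.g. five records where (0,1), (1,2) and (3,4) passed the threshold:
    print(cluster_pairs([(0, 1), (1, 2), (3, 4)], 5))
    # -> [[0, 1, 2], [3, 4]]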
I just found an implementation of BIRCH in C++.
I have a table of daily closing stock prices and commodity prices such as gold, oil, etc. I want to find which stocks move closely with another stock or a commodity.
Where do I start with this type of analysis? I know Java, SQL, Python, Perl, and a little bit of R.
Willing to buy and learn new tools like Matlab if necessary.
Any guidance will be highly appreciated.
This is not a homework question.
Thanks.
The technique you are looking for is called cointegration. Language is not important at all when computing the cointegration of two time series, so use whatever you are comfortable with.
I disagree with other responses that computation is not a problem. It is a huge problem to compute potentially billions of cointegration coefficients between different time series, and using a highly optimized library is critical. However, this article on cointegration testing in R should get you started.
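If you would rather start in Python (which you mention knowing), here is a minimal sketch of an Engle-Granger cointegration test with statsmodels; the two series are synthetic stand-ins for your price columns:

    import numpy as np
    from statsmodels.tsa.stattools import coint

    rng = np.random.default_rng(0)
    common = np.cumsum(rng.normal(size=500))              # shared random walk
    gold = common + rng.normal(scale=0.5, size=500)       # hypothetical gold prices
    miner = 2 * common + rng.normal(scale=0.5, size=500)  # hypothetical mining stock

    t_stat, p_value, _ = coint(gold, miner)
    print(f"p-value: {p_value:.4f}")  # a small p-value suggests cointegration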
Also check out quant.stackexchange.com for more info on quantitative finance.
Try these:
http://www.sectorspdr.com/correlation/
http://www.etfscreen.com/corr.php
http://correlate.googlelabs.com/faq
https://quant.stackexchange.com/questions/1027/correlation-and-cointegration-similarities-differences-relationships/1038#1038
"Where do I start to do this type of analysis?"
If I were you, I'd start by searching Google Scholar for the word "comovement". Not everything that turns up is directly relevant, but there's quite a lot of stuff that is relevant.
By looking through the papers and googling some more, you should get a clearer picture of what types of statistical methods to learn.
I agree with Ben Bolker that computational tools are not the main issue at this point.
I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (probably Bayesian or Markovian; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and then use those very same samples to measure fitness, I might run into an overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples; I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating classifier performance.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
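A minimal sketch with scikit-learn (the naive Bayes classifier and toy data below are stand-ins for your own):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=100, n_features=20, random_state=0)

    # cv=10 splits the data into 10 folds; each fold is the test set once.
    scores = cross_val_score(GaussianNB(), X, y, cv=10)
    print(f"mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")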
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers; for this I used Weka, which has an API and a load of tools that let you easily test the performance of lots of different classifiers.
I need to implement some kind of metric space search in Postgres(*) (PL/pgSQL or PL/Python). So, I'm looking for good sources (or papers) with a very clear and crisp explanation of the machinery behind these ideas, in such a way that I can implement it myself.
I would prefer clarity over efficiency.
(*) The need for that is described better here.
Especially for geographical data, look at PostGIS first to see if you need to implement anything. If you do, start with the papers listed in the Wikipedia entry on GiST.
Looking at your link, it seems your metric space is strings with some sort of edit distance as the metric. A nice but oldish overview of some solutions is given by Navarro, Baeza-Yates, Sutinen, and Tarhio, IEEE Data Engineering Bulletin, 2001; the related papers on Citeseer could also be useful. Locality Sensitive Hashing is a newer technique that might be useful, but a lot of the papers are heavy on math.
BK-Trees are useful for indexing and searching anything that obeys the triangle inequality, metric spaces included. The canonical example is searching for strings within a given edit distance of a target. I wrote an article about that here.
Unfortunately, there's no built-in support for this in Postgres. You could implement it yourself using GiST, but obviously that'll be a lot of work. I can't think of any way to implement it without writing your own indexes, short of storing the tree in a table, which obviously isn't going to be very efficient.
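To make the idea concrete, here is a minimal BK-tree sketch in Python; the tiny recursive edit distance is written for clarity rather than speed:

    def levenshtein(a, b):
        if not a: return len(b)
        if not b: return len(a)
        cost = 0 if a[-1] == b[-1] else 1
        return min(levenshtein(a[:-1], b) + 1,
                   levenshtein(a, b[:-1]) + 1,
                   levenshtein(a[:-1], b[:-1]) + cost)

    class BKTree:
        def __init__(self, dist):
            self.dist, self.root = dist, None  # node = (word, {distance: child})

        def add(self, word):
            if self.root is None:
                self.root = (word, {})
                return
            node = self.root
            while True:
                d = self.dist(word, node[0])
                if d in node[1]:
                    node = node[1][d]
                else:
                    node[1][d] = (word, {})
                    return

        def search(self, word, radius, node=None):
            node = node or self.root
            d = self.dist(word, node[0])
            results = [node[0]] if d <= radius else []
            # Triangle inequality: only children whose edge distance lies
            # within [d - radius, d + radius] can contain matches.
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    results += self.search(word, radius, child)
            return results

    tree = BKTree(levenshtein)
    for w in ["book", "books", "cake", "boo", "cape"]:
        tree.add(w)
    print(tree.search("bok", 1))  # -> words within edit distance 1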
You can try http://sisap.org where many modern metric indexes are listed, including BK-trees. You can find code in C to try different alternatives.
Some search techniques that might help you are hill climbing, neural network training, genetic algorithms, and particle swarm optimization.
You will also need to define a distance metric over your metric space. Have you done so? (And out of curiosity, what is it, if you have?)