Stack to handle large computation and data explosion - sql-server

I recently started working with a team that has been building a solution that involves parallel calculations and data explosion.
The input to the system is provided in a set of Excel files. Say there are 5 sets of data A, B, C, D and E; the calculated output is a multiple of A, B, C, D and E. This output also grows over the years, i.e. if the data is spread across 5 years, the output for year 1 is the smallest and the output for year 5 is the largest (~3 billion rows).
We currently use Microsoft SQL Server to store the input, Microsoft Orleans for the computation, and Hadoop to store the calculated output. Some concerns I have here are that what we are doing seems to be the opposite of MapReduce, and that we have limited big data skills on the team.
I wanted to see if someone has experience working on similar systems and what kind of solution stack was used.
Thanks

Related

Six degrees of separation interview problem

I was asked an interesting question in an interview recently.
You have 1 million users
Each user has 1 thousand friends
Your system should efficiently answer the "Do I know him?" question for each pair of users. A user "knows" another one if they are connected through 6 levels of friends.
E.g. A is a friend of B, B is a friend of C, C is a friend of D, D is a friend of E, E is a friend of F. So we can say that A knows F.
Obviously you can't solve this problem efficiently using BFS or another standard traversal technique. The question is: how do you store this data structure in a DB, and how do you perform this search quickly?
What's wrong with BFS?
Execute three steps of BFS from the first node, marking reachable users with flag 1. This requires on the order of 10^9 steps (1,000^3).
Execute three steps of BFS from the second node, marking reachable users with flag 2. If we ever hit flag 1, bingo.
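A minimal sketch of that meet-in-the-middle idea in Python, assuming the friendship graph is kept as an adjacency dict mapping each user id to the set of their friends' ids (the dict layout and the function names are illustrative, not from the original post):

def bfs_levels(graph, start, depth):
    """Set of users reachable from start in at most depth hops."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {f for u in frontier for f in graph.get(u, ())} - seen
        seen |= frontier
    return seen

def knows(graph, a, b, half=3):
    """True if a and b are within 2 * half (here 6) friendship hops of each other."""
    reach_a = bfs_levels(graph, a, half)   # roughly 1000^3 = 10^9 edge visits in the worst case
    reach_b = bfs_levels(graph, b, half)
    return bool(reach_a & reach_b)         # the two 3-hop "balls" overlap iff dist(a, b) <= 6

# toy example: the chain A-B-C-D-E-F from the question, so A "knows" F
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
         "D": {"C", "E"}, "E": {"D", "F"}, "F": {"E"}}
print(knows(graph, "A", "F"))   # True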
What about storing the data as a 1 million x 1 million matrix A where A[i][j] is the minimum number of steps needed to get from user i to user j? Then you can query it almost instantly. Updates, however, are more costly.

SPSS creating a loop for a multiple regression over several variables

For my master's thesis I have to use SPSS to analyse my data. I thought I wouldn't have to deal with very difficult statistical issues, which is still true regarding the concepts of my analysis. BUT the problem is that, in order to create my dependent variable, I need to use the syntax editor/programming in general, and I have no experience in this area at all. I hope you can help me in the process of creating my syntax.
I have approximately 900 companies in total, with 6 yearly observations each. For all of these companies I need the predicted values of the following company-specific regression:
Y = β1*X1 + β2*X2 + β3*X3 + error
(I know the β will very likely not be significant, but this is nothing to worry about in my thesis; it will be mentioned in the limitations though.)
So far my data are ordered in the following way:
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. to
1
1
2
2 etc.
OK, let's say I have rearranged the data: what I need now is for SPSS to compute the company-specific β and return the output in one column (the predicted values, i.e. those β multiplied by the specific X in each row). So I guess what I need is a loop that runs a multiple linear regression over 6 rows for each of the 939 companies, am I right?
As I said, I have no experience at all, so every hint is valuable to me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three (or four, if you also have a constant term) coefficients to estimate, the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled, at least in part.
You can use SPLIT FILE to estimate regressions specific to each company; example below. Note that one would likely want to consider other panel data models and assess whether there is autocorrelation in the residuals. (This is IMO still a useful approach for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables, much of the time is spent rendering the output). If you need other statistics, either omit the OMS that suppresses the tables or tweak it to show only the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 to 1000.
  COMPUTE #R1 = RV.NORMAL(10,2).
  COMPUTE #R2 = RV.NORMAL(-3,1).
  COMPUTE #R3 = RV.NORMAL(0,5).
  LOOP Year = 2003 to 2008.
    COMPUTE Company = #Comp.
    COMPUTE Rand1 = #R1.
    COMPUTE Rand2 = #R2.
    COMPUTE Rand3 = #R3.
    END CASE.
  END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT file to estimate
*the regression.
SORT CASES BY Company Year.
*Declare new set and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
  /SELECT TABLES
  /IF COMMANDS = ['Regression']
  /DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
  /DEPENDENT y
  /METHOD=ENTER x1 x2 x3
  /SAVE PRED (CompSpePred)
  /OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.

clustering a SUPER large data set [closed]

I am working on a project as part of my class curriculum. It's a project for Advanced Database Management Systems, and it goes like this:
1) Download a large number of images (1,000,000) --> Done
2) Cluster them according to their visual similarity
a) Find the histogram of each image --> Done
b) Now group (cluster) the images according to their visual similarity.
Now, I am having a problem with part 2b. Here is what I did:
A) I found the histogram of each image using MATLAB and have represented it as a 1D vector (16 x 16 x 16), so there are 4096 values in a single vector.
B) I generated an ARFF file. It has the following format: there are 1,000,000 histograms (one per image, thus 1,000,000 rows in the file) and 4097 values in each row (image_name + 4096 double values to represent the histogram).
C) The file size is 34 GB. THE BIG QUESTION: HOW THE HECK DO I CLUSTER THIS FILE???
I tried using WEKA and other online tools, but they all hang. Weka gets stuck and says "Reading a file".
I have 8 GB of RAM on my desktop. I don't have access to any compute cluster as such. I tried googling but couldn't find anything helpful about clustering large datasets. How do I cluster these entries?
This is what I thought:
Approach One:
Should I do it in batches of 50,000 or so? For example, cluster the first 50,000 entries and find as many clusters as possible; call them k1, k2, k3, ... kn.
Then pick the next 50,000 and allot them to one of these clusters, and so on? Will this be an accurate representation of all the images, given that the clustering is done only on the basis of the first 50,000 images?
Approach Two:
Do the above process using a random 50,000 entries?
Anyone have any input?
Thanks!
EDIT 1:
Any clustering algorithm can be used.
Weka isn't your best tool for this. I found ELKI to be much more powerful (and faster) when it comes to clustering. The largest data sets I've run are ~3 million objects in 128 dimensions.
However, note that at this size and dimensionality, your main concern should be result quality.
If you run e.g. k-means, the result will essentially be random, because you are using 4096 histogram bins (far too many, in particular with squared Euclidean distance).
To get good results, you need to step back and think some more.
What makes two images similar? How can you measure similarity? Verify your similarity measure first.
Which algorithm can use this notion of similarity? Verify the algorithm on a small data set first.
How can the algorithm be scaled up using indexing or parallelism?
In my experience, color histograms worked best in the range of 8 bins for hue x 3 bins for saturation x 3 bins for brightness. Beyond that, the binning is too fine-grained, and it destroys your similarity measure.
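For illustration, here is a sketch of that coarser binning, assuming the image is already available as an (H, W, 3) array of HSV values in [0, 1] (e.g. converted with matplotlib's rgb_to_hsv); numpy and the function name are my own choices, not from the answer:

import numpy as np

def coarse_hsv_histogram(hsv, bins=(8, 3, 3)):
    """Normalized 8 x 3 x 3 color histogram of an (H, W, 3) HSV image with values in [0, 1]."""
    pixels = hsv.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=bins, range=[(0, 1)] * 3)
    hist = hist.ravel()            # 72 values instead of 4096
    return hist / hist.sum()       # normalize so images of different sizes are comparable

# shape check with a random stand-in image
fake_hsv = np.random.rand(64, 64, 3)
print(coarse_hsv_histogram(fake_hsv).shape)   # (72,)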
If you run k-means, you gain absolutely nothing by adding more data. It searches for statistical means, and adding more data won't find a different mean, just a few more digits of precision. So you may just as well use a sample of 10k or 100k pictures, and you will get virtually the same results.
Running it several times on independent sets of pictures yields different clusters which are difficult to merge, so two similar images can end up in different clusters. I would instead run the clustering algorithm on one random set of images (as large as possible) and use those cluster definitions to sort all the other images.
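A sketch of that sample-then-assign strategy in Python; scikit-learn's MiniBatchKMeans is used here purely as a stand-in for whatever clustering tool you settle on (ELKI, as recommended above, or anything else), and the array sizes are illustrative:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# stand-in for the real 1,000,000 x histogram-length feature matrix
histograms = np.random.rand(100_000, 72)

# 1. cluster a random sample that comfortably fits in memory
rng = np.random.default_rng(0)
sample = histograms[rng.choice(len(histograms), size=10_000, replace=False)]
km = MiniBatchKMeans(n_clusters=50, random_state=0).fit(sample)

# 2. assign every image to the nearest of those fixed cluster centers
labels = km.predict(histograms)   # this step can also be done in chunks read from disk
print(np.bincount(labels)[:10])   # rough cluster sizes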
Alternative: reduce the complexity of your data, e.g. to a histogram of 1024 double values.

Variation Amongst Arrays

I have several items (topics), each featuring several sub-items, as outlined below...
Application
microsoft word
excel
visual studio
DB
mysql
mssql
I want to compare several of these groups and give a score to each topic based on how many sub-items are in their respective topic compared to how many are in the other topics, ideally on a scale of 1-10. This is just conceptual, no specific language. I would want to compare arrays; I just don't know how to compare every single array with all the others and come up with a score after the fact. Thank you.
This boils down to computing set intersections. Most modern languages provide data types to help with this: Python has dictionaries and sets, C++ has STL maps and sets, etc. I would avoid any manual computation of the overlap, as the provided data types are much more efficient at it. Each topic can be thought of as a set of sub-items, and the intersection of two such sets tells you how many sub-items they have in common (i.e. are in each topic).
If you want to find scores between each pair of topics, and you have n topics, you will be computing n(n - 1) / 2 scores. Just be aware that as the number of topics increases, the number of computed scores rises quickly.
As for computing a score, you first find the intersection of set A and set B. This intersection can either:
contain all items (meaning A and B are exactly the same set): score of 10;
contain all of one set (meaning B contains all of A, or vice versa): a score depending on how many items are unique to the larger set;
or contain fewer items than the smaller of A and B.
So a simple computation might be
(intersection.length / max(A.length, B.length)) * 10
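A quick sketch of that scoring in Python (the question is language-agnostic, so the choice of Python, plain sets and the helper name score are mine), using the two topics listed in the question:

from itertools import combinations

topics = {
    "Application": {"microsoft word", "excel", "visual studio"},
    "DB": {"mysql", "mssql"},
}

def score(a, b):
    """Overlap of two sub-item sets, scaled to 0-10."""
    common = a & b                                 # sub-items present in both topics
    return len(common) / max(len(a), len(b)) * 10

# n topics -> n * (n - 1) / 2 pairwise scores
for (name1, set1), (name2, set2) in combinations(topics.items(), 2):
    print(name1, "vs", name2, "->", round(score(set1, set2), 1))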

Bioinformatics databases with negative results?

Bioinformatics databases such as BioGRID collect a lot of interaction results for proteins and genes in different species from all sorts of publications and experiments, but such collations suffer from testing biases, since not all combinations are tested and some are tested more than once. Shouldn't they also collect all negative results? Is there such a resource that systematically collects both positive and negative interactions from high-throughput and low-throughput experiments?
These might help:
http://www.jnrbm.com/info/about/
http://www.jnr-eeb.org/index.php/jnr
and as far as I know, databases of non-hitters or non-binding drug-like compounds exist.
You should look for the 'Negatome', a database of non-interacting protein pairs.
Smialowski P, Pagel P, Wong P, Brauner B, Dunger I, Fobo G, Frishman G, Montrone C, Rattei T, Frishman D, Ruepp A. The Negatome database: a reference set of non-interacting protein pairs. Nucleic Acids Res. 2010 Jan;38(Database issue):D540-4. Epub 2009 Nov 17. PubMed PMID: 19920129; PubMed Central PMCID: PMC2808923. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19920129
1) High-throughput screens that are published in peer-reviewed journals often have such data. Cessarini has published negative results regarding domain/peptide interactions.
2) You can contact databases like MINT/Reactome/etc. and mention that you want the negative results where they are available. Many such organizations are required by mandate to share any such data with you, even if it's not on their site.
3) A good resource on this subject is here http://www.nature.com/nmeth/journal/v4/n5/full/nmeth0507-377.html
We have been working on an open-source protein interactions meta-database & prediction server (which does include data from BioGRID, among other sources) that deals with both negative and positive data, as you have asked for...
MAYETdb does the following:
Classifies protein interactions as either "interacting" or "not interacting".
Includes data from a variety of experimental set-ups (Y2H, TAP-MS, and more) and species (yeast, human, C. elegans), including both literature-mined and database-derived data, e.g. BioGRID.
It also yields false-positive and false-negative errors of those classifications.
A random forest machine learning system makes predictions for previously untested interactions by learning from a wide variety of protein features, and works at rather high accuracy (~92% AUC).
It is not yet running on a server but the source code is available and heavily commented if you are curiously impatient: https://bitbucket.org/dknightg/ppidb/src
Please ask if you have any queries :)
