I am writing my Bachelor's thesis on Market Basket Analysis and need a data set for a worked example. Can anyone recommend one?
Ideally the data would be reasonably large, say around 1000 rows or more, and would contain the names of the purchased items rather than just numeric codes.
Any help would be appreciated!
For others looking for a market basket dataset, I found this one on Kaggle interesting:
https://www.kaggle.com/puneetbhaya/online-retail
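For anyone who wants to see the shape of the analysis before picking a dataset, here is a minimal sketch of the support and confidence counts that market basket analysis builds on. The baskets are invented toy data, not taken from the Kaggle file, and the function name pair_rules is hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return {(lhs, rhs): (support, confidence)} for single-item rules lhs -> rhs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicate items within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = count(a and b) / count(a)
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Toy baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "tea"],
]
rules = pair_rules(baskets, min_support=0.5)
for (lhs, rhs), (support, confidence) in sorted(rules.items()):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

On a real file you would first group the rows into one basket per invoice; a dedicated library would also handle itemsets larger than pairs.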
I am working on a final project where I have to use data analysis tools, ANNs, etc. My idea is that an agent will recognize the movie genres from a poster. The data set I have contains 28 separate genres, but I want to work with only the most important ones, say 10 of them. Is there any way I can use data analysis tools such as PCA/SVD, LDA, etc. to do this?
I tried one-hot encoding all the different genres and applying PCA to them, but a lot of people said it doesn't give a meaningful answer. I am new to all of this, so any help would be appreciated.
I am trying to build a doc2vec model, using gensim + sklearn, to perform sentiment analysis on short texts such as comments, tweets, reviews, etc.
I downloaded the Amazon product review data set, a Twitter sentiment analysis data set, and the IMDB movie review data set.
Then I combined these into 3 categories: positive, negative, and neutral.
Next I trained a gensim doc2vec model on the above data so I could obtain the input vectors for the classifying neural net.
I then used sklearn's LinearRegression model to predict on my test data, which is about 10% of each of the above three data sets.
Unfortunately, the results were not as good as I expected. Most of the tutorials out there seem to focus on only one specific task, such as 'classify Amazon reviews only' or 'Twitter sentiments only'; I couldn't find anything more general purpose.
Can someone share their thoughts on this?
How good did you expect the results to be, and how good were the results you actually achieved?
Combining the three datasets may not improve overall sentiment-detection ability, if the signifiers of sentiment vary in those different domains. (Maybe, 'positive' tweets are very different in wording than product-reviews or movie-reviews. Tweets of just a few to a few dozen words are often quite different than reviews of hundreds of words.) Have you tried each separately to ensure the combination is helping?
Is your performance in line with other online reports of using roughly the same pipeline (Doc2Vec + LinearRegression) on roughly the same dataset(s), or wildly different? That will be a clue as to whether you're doing something wrong, or just have too-high expectations.
For example, the doc2vec-IMDB.ipynb notebook bundled with gensim tries to replicate an experiment from the original 'Paragraph Vector' paper, doing sentiment-detection on an IMDB dataset. (I'm not sure if that's the same dataset as you're using.) Are your results in the same general range as that notebook achieves?
Without seeing your code, and details of your corpus-handling & parameter choices, there could be all sorts of things wrong. Many online examples have nonsense choices. But maybe your expectations are just off.
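As a starting point for the per-dataset check suggested above, here is a small sketch that breaks accuracy down by source dataset. It only splits the evaluation; to compare against models trained on a single source, you would feed the same function the predictions of each such model. The function name and the sample records are made up for illustration and are not real results.

```python
from collections import defaultdict


def accuracy_by_source(records):
    """records: iterable of (source, true_label, predicted_label) tuples.

    Returns accuracy per source plus an overall figure under the key "all"
    (so no source should itself be named "all").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, truth, predicted in records:
        for key in (source, "all"):
            total[key] += 1
            if truth == predicted:
                correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Placeholder predictions, invented for illustration only.
records = [
    ("amazon", "positive", "positive"),
    ("amazon", "negative", "negative"),
    ("twitter", "positive", "negative"),
    ("twitter", "neutral", "neutral"),
    ("imdb", "negative", "negative"),
    ("imdb", "positive", "positive"),
]
scores = accuracy_by_source(records)
for source, score in sorted(scores.items()):
    print(f"{source}: {score:.2f}")
```

If one source scores far below the others, that is a hint the domains differ enough that the combined model is not helping there.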
I am asked to give a lecture on clustering algorithms for an audience that is not very technical. With that in mind, I wanted to do a simple exercise where I will ask the audience to identify groups from a dataset. However, I cannot find good datasets that could be usable for this purpose.
Is there a dataset of customers and some products they have bought that I can use for this purpose? Or any other dataset that might look suitable!
I can suggest a simple geolocation database, for example all cities in Germany. I think you can find one for free. Or you could look at the NASA sky data, which would be nice to cluster too.
Here is the Ta-Feng dataset, containing 4 months of transactions. I got it from Prof. Chun Nan himself. It is now stored in my Dropbox folder: https://www.dropbox.com/s/tsd5zd8a7afmzs7/D11-02.ZIP?dl=0 The first line of each file shows the column names in Chinese. In English, they are:
Date; Membership Card ID; Product Category; Product Code; Quantity; Total Transaction Amount (in TWD)
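A rough sketch of how such rows could be grouped into per-customer, per-day baskets for a group-finding exercise. It assumes the files really have the six columns listed above, in that order, separated by semicolons with one header row; none of that has been checked against the actual files, and the sample rows below are invented.

```python
import csv
import io
from collections import defaultdict


def load_baskets(lines, delimiter=";"):
    """Group product codes by (date, membership card ID).

    Assumes column order: date, card ID, category, product code, quantity, amount.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    next(reader, None)  # skip the header row (column names)
    baskets = defaultdict(set)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        date, card_id, _category, product_code = (field.strip() for field in row[:4])
        baskets[(date, card_id)].add(product_code)
    return dict(baskets)


# Invented sample rows standing in for one of the files.
sample = io.StringIO(
    "date;card;category;product;quantity;amount\n"
    "2000-11-01;A1;100;P1;1;30\n"
    "2000-11-01;A1;110;P2;2;80\n"
    "2000-11-02;B7;100;P1;1;30\n"
)
baskets = load_baskets(sample)
for key, products in sorted(baskets.items()):
    print(key, sorted(products))
```

For a real file, pass an open file object instead of the in-memory sample, with whatever encoding and delimiter the file actually uses.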
I have a table of daily closing stock prices and commodity prices such as Gold, Oil, etc. I want to find what stocks move closely with another stock or a commodity.
Where do I start to do this type of analysis? I know Java, SQL, Python, Perl, and a little bit of R.
I am willing to buy and learn new tools like Matlab if necessary.
Any guidance will be highly appreciated.
This is not a homework question.
Thanks..
The technique you are looking for is called cointegration. Language is not important at all when computing cointegration of two time series so use whatever you are comfortable with.
I disagree with other responses that computation is not a problem. It is a huge problem to be able to compute potentially billions of cointegration coefficients between different time series. Using a highly optimized library is critical. However, this article on cointegration testing in R should get you started.
Also check out quant.stackexchange.com for more info on quantitative finance.
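To give a feel for what a cointegration check involves, here is a bare-bones sketch of the two-step Engle-Granger idea in plain Python: regress one series on the other, then compute a Dickey-Fuller style t-statistic on the residuals. It is an illustration on synthetic random walks, not a production test: it uses no lag terms and does not look up critical values, so for real work a library routine (for example coint in statsmodels.tsa.stattools) is the safer choice. All function names here are made up.

```python
import math
import random


def ols(x, y):
    """Least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def engle_granger_stat(x, y):
    """Return (hedge ratio, Dickey-Fuller t-statistic of the regression residuals)."""
    intercept, slope = ols(x, y)
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    lagged = resid[:-1]
    diffs = [b - a for a, b in zip(resid[:-1], resid[1:])]
    # Regress the residual changes on the lagged residuals (no intercept).
    sum_sq = sum(l * l for l in lagged)
    rho = sum(l * d for l, d in zip(lagged, diffs)) / sum_sq
    errors = [d - rho * l for l, d in zip(lagged, diffs)]
    sigma2 = sum(e * e for e in errors) / (len(errors) - 1)
    return slope, rho / math.sqrt(sigma2 / sum_sq)


def random_walk(rng, n):
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk


# Synthetic data: y tracks x (cointegrated by construction), z is unrelated.
rng = random.Random(0)
x = random_walk(rng, 500)
y = [1 + 2 * xi + rng.gauss(0, 0.5) for xi in x]
z = random_walk(rng, 500)

slope_y, stat_y = engle_granger_stat(x, y)
slope_z, stat_z = engle_granger_stat(x, z)
print(f"x vs y: hedge ratio={slope_y:.2f}, statistic={stat_y:.1f}")
print(f"x vs z: hedge ratio={slope_z:.2f}, statistic={stat_z:.1f}")
```

A strongly negative statistic suggests the residual spread keeps reverting, which is what cointegration means; the statistic has to be judged against Engle-Granger critical values, not the usual t-table.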
Try this:
http://www.sectorspdr.com/correlation/
http://www.etfscreen.com/corr.php
http://correlate.googlelabs.com/faq
https://quant.stackexchange.com/questions/1027/correlation-and-cointegration-similarities-differences-relationships/1038#1038
Where do I start to do this type of analysis
If I were you, I'd start by searching Google Scholar for the word "comovement". Not everything that turns up is directly relevant, but there's quite a lot of stuff that is relevant.
By looking through the papers and googling some more, you should get a clearer picture of what types of statistical methods to learn.
I agree with Ben Bolker that computational tools are not the main issue at this point.
I need an example of a database model that can be attached to a database to track data quality. The best form of answer would, at the very least, be DDL that's executable in MySQL; DDL for another RDBMS is okay too, and I'll just post another question asking for the code to be ported.
A good explanation would be a huge plus.
Questions, comments, feedback, etc. -- just comment, thanks!!
The biggest problem is identifying meaningful measures of quality. That's so highly application-dependent, I doubt that anybody will be able to help you very much. (At least not without a lot more information--perhaps more than you're allowed to give.)
But let's say your application records observations of birds by individuals. (I'm just throwing this together off the top of my head. Read it for the gist, and expect the details to crumble under scrutiny.) Under average field conditions,
some species are hard for even a beginner to get wrong
some species are hard for an expert to get right
a specific individual's ability varies irregularly over time (good days, bad days)
individuals usually become more skilled over time
you might be highly skilled at identifying hawks, and totally suck at identifying gulls
individuals are prone to suggestion (who they're with makes a difference in their reliability)
So, to take a shot at assessing the quality of an identification, you might try to record a lot of information besides the observation "3 red-tailed hawks at Cape May on 05-Feb-2011 at 4:30 pm". You might try to record
weather
lighting
temperature (some birders suck in the cold)
hours afield (some birders suck after 3 hours, or after 20 cold minutes)
names of others present
average difficulty of correctly identifying red-tailed hawks
probability that this individual could correctly identify red-tails under these field conditions
alcohol intake
Although this might be "meta" to field birders, to the database designer it's just data. And you'd design the tables just like you'd design them for any other application. (That's what I did, anyway.)
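To make the "it's just data" point concrete, here is a rough sketch of what such tables might look like. It is written as Python driving an in-memory SQLite database so that it runs anywhere; the DDL is not MySQL-specific and the types would need adjusting there. Every table and column name is made up for illustration, and the weather, temperature, and score values in the sample rows are invented.

```python
import sqlite3

# Hypothetical schema: quality-related context is stored as ordinary columns.
SCHEMA = """
CREATE TABLE observer (
    observer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE species (
    species_id    INTEGER PRIMARY KEY,
    common_name   TEXT NOT NULL UNIQUE,
    -- average difficulty of a correct identification: 0 (easy) to 1 (hard)
    id_difficulty REAL NOT NULL CHECK (id_difficulty BETWEEN 0 AND 1)
);
CREATE TABLE outing (
    outing_id     INTEGER PRIMARY KEY,
    observer_id   INTEGER NOT NULL REFERENCES observer(observer_id),
    location      TEXT NOT NULL,
    started_at    TEXT NOT NULL,
    weather       TEXT,
    lighting      TEXT,
    temperature_c REAL,
    alcohol_units REAL
);
CREATE TABLE outing_companion (
    outing_id   INTEGER NOT NULL REFERENCES outing(outing_id),
    observer_id INTEGER NOT NULL REFERENCES observer(observer_id),
    PRIMARY KEY (outing_id, observer_id)
);
CREATE TABLE observation (
    observation_id   INTEGER PRIMARY KEY,
    outing_id        INTEGER NOT NULL REFERENCES outing(outing_id),
    species_id       INTEGER NOT NULL REFERENCES species(species_id),
    observed_at      TEXT NOT NULL,
    bird_count       INTEGER NOT NULL CHECK (bird_count > 0),
    hours_afield     REAL,
    -- estimated probability that this observer got it right in these conditions
    est_prob_correct REAL CHECK (est_prob_correct BETWEEN 0 AND 1)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

# The sighting from the text: 3 red-tailed hawks at Cape May, 05-Feb-2011, 4:30 pm.
conn.execute("INSERT INTO observer (observer_id, name) VALUES (1, 'Example Birder')")
conn.execute(
    "INSERT INTO species (species_id, common_name, id_difficulty) "
    "VALUES (1, 'Red-tailed Hawk', 0.2)"
)
conn.execute(
    "INSERT INTO outing (outing_id, observer_id, location, started_at, "
    "weather, lighting, temperature_c) "
    "VALUES (1, 1, 'Cape May', '2011-02-05 13:30', 'clear', 'good', -2.0)"
)
conn.execute(
    "INSERT INTO observation (outing_id, species_id, observed_at, bird_count, "
    "hours_afield, est_prob_correct) VALUES (1, 1, '2011-02-05 16:30', 3, 3.0, 0.9)"
)

row = conn.execute(
    "SELECT s.common_name, o.bird_count, t.location "
    "FROM observation o JOIN species s USING (species_id) "
    "JOIN outing t USING (outing_id)"
).fetchone()
tables = {r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
print(row)
```

Conditions that apply to a whole trip sit on the outing table, while anything specific to one sighting sits on the observation itself. That split is a design choice for this sketch, not the only reasonable one.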