What data analysis tools can I apply to categorical variables such as movie genres in order to reduce dimensionality?

So I am working on a final project where I have to use data analysis tools, ANNs, etc. My idea is that an agent will recognize movie genres from a poster, but the dataset I have contains 28 separate genres, and I want to work with only the 10 most important of them, for example. Is there any way I can use data analysis tools such as PCA/SVD, LDA, etc. to do this?
I tried one-hot encoding all the different genres and applying PCA to them, but a lot of people said it doesn't give a meaningful answer. I am new to all of this, so any help would be appreciated.
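One simpler alternative to PCA for this kind of multi-label categorical data is frequency-based selection: count how often each genre appears, keep the k most frequent, and one-hot encode against that reduced set. A minimal sketch with made-up toy data (use k=10 on the real dataset):

```python
from collections import Counter

# Toy multi-label data: each movie has a list of genres (made-up examples).
movies = [
    ["Drama", "Romance"],
    ["Drama", "Thriller"],
    ["Comedy"],
    ["Drama", "Comedy"],
    ["Horror", "Thriller"],
]

# Count how often each genre appears across the dataset.
counts = Counter(g for genres in movies for g in genres)

# Keep only the k most frequent genres (k=3 here for the toy data).
k = 3
top_genres = [g for g, _ in counts.most_common(k)]

# One-hot encode each movie against the reduced genre set.
encoded = [[int(g in genres) for g in top_genres] for genres in movies]
print(top_genres)
print(encoded)
```

Movies whose genres all fall outside the top k end up as all-zero rows; depending on the task, those can be dropped or collected into an "other" category.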

Related

Data set for Doc2Vec general sentiment analysis

I am trying to build a doc2vec model, using gensim + sklearn, to perform sentiment analysis on short sentences like comments, tweets, reviews, etc.
I downloaded the Amazon product review dataset, a Twitter sentiment analysis dataset, and the IMDB movie review dataset.
I then combined these into 3 categories: positive, negative, and neutral.
Next I trained a gensim doc2vec model on the above data so I can obtain the input vectors for the classifying neural net.
Then I used the sklearn LinearRegression model to predict on my test data, which is about 10% from each of the above three datasets.
Unfortunately the results were not as good as I expected. Most of the tutorials out there seem to focus only on one specific task, 'classify Amazon reviews only' or 'Twitter sentiments only'; I couldn't manage to find anything that is more general purpose.
Can someone share their thoughts on this?
How good did you expect the results to be, and how good were the results you actually achieved?
Combining the three datasets may not improve overall sentiment-detection ability, if the signifiers of sentiment vary in those different domains. (Maybe, 'positive' tweets are very different in wording than product-reviews or movie-reviews. Tweets of just a few to a few dozen words are often quite different than reviews of hundreds of words.) Have you tried each separately to ensure the combination is helping?
Is your performance in line with other online reports of using roughly the same pipeline (Doc2Vec + LinearRegression) on roughly the same dataset(s), or wildly different? That will be a clue as to whether you're doing something wrong, or just have too-high expectations.
For example, the doc2vec-IMDB.ipynb notebook bundled with gensim tries to replicate an experiment from the original 'Paragraph Vector' paper, doing sentiment-detection on an IMDB dataset. (I'm not sure if that's the same dataset as you're using.) Are your results in the same general range as that notebook achieves?
Without seeing your code, and details of your corpus-handling & parameter choices, there could be all sorts of things wrong. Many online examples have nonsense choices. But maybe your expectations are just off.

Thought on creating a searchable database

I guess what I need is two things. First, a way to input data into an Excel-like application or a form builder, then a way to search those entries. For example, for a CAR PART: put car Part A into Field 1, the next field (Field 2) would be car type, followed by make and model. The fields would need to be made into a form consisting of preset inputs such as (Title/Type) and (Variable Categories), so a drop-down menu, icons, or checkboxes would help narrow down the list of results. What pieces need to be in place to build/use a lightweight database/application like this that allows inputting new information and then later searching for variables? Also, is there any application that does this already, a programming language to learn, or an estimated cost and requirements to have it built?
First, there might be something off the shelf that does this already, and there are applications like this. Microsoft's Access would be a good place to start to see if it would fit your needs -- you can build forms and store data without much programming effort. As time goes on, you can scale up to a SQL Server.
It's not clear to me if your data is relational or not, and it might not matter much at first (any database will likely handle your queries to start). I originally thought your data was not relational, but re-reading your post, I'm not so sure now.
If that doesn't work, or you want more flexibility, then I'd start looking at NoSQL as an option. Some good choices include Mongo and RavenDB (there are many others).
You can program it yourself with just about any major language; some provide more or less built-in functionality depending on how tightly they integrate with the database.
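To make the idea concrete, here is a minimal sketch of the car-part example using SQLite (Python's built-in sqlite3 module). The table columns and sample rows are invented; a form front-end would drive the INSERT, and the drop-down selections would drive the WHERE clause:

```python
import sqlite3

# In-memory database for the sketch; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parts (
        title TEXT,   -- the car part name (Field 1)
        type  TEXT,   -- preset category (drop-down in the form)
        make  TEXT,
        model TEXT
    )
""")

# Inserting entries (what the form's submit action would do).
rows = [
    ("Alternator", "Electrical", "Ford", "Focus"),
    ("Brake Pad", "Brakes", "Ford", "Focus"),
    ("Alternator", "Electrical", "Honda", "Civic"),
]
conn.executemany("INSERT INTO parts VALUES (?, ?, ?, ?)", rows)

# Searching by the preset categories (what the drop-downs would drive).
results = conn.execute(
    "SELECT title FROM parts WHERE make = ? AND model = ? ORDER BY title",
    ("Ford", "Focus"),
).fetchall()
print(results)  # [('Alternator',), ('Brake Pad',)]
```

The same schema and queries carry over almost unchanged to Access or SQL Server if the data outgrows a single file.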

Where to find already filled in databases

I am new to using external databases, and I was wondering if there might be a place with many "common" databases?
I know databases can be very product-specific, but I'm talking about things like:
Car brand + model
Food + nutritional values
...
(Common stuff)
Database structure varies with requirements, so you may not find exactly what you need. However, if you just need sample databases, try a Google search.
Make sure you don't use any of those for real projects.

Making Solr understand English

I'm trying to set up Solr so that it understands English. For example, I've indexed our company website (www.biginfolabs.com), but it could be any other website or our own data.
If I enter English-like queries, I should get a one-word answer, just as Google does; example queries:
Where is India located?
Who is the father of Obama?
Workaround:
Integrated UIMA and Mahout with Solr (person-name and city-name extraction is done).
I read the book "Taming Text" and implemented https://github.com/tamingtext/book, but I did not get what I want.
Can anyone please tell me how to move further? It can be anything; our team is ready to do it.
This task is called Named Entity Recognition (NER). You can look up this tutorial to see how they use Solr for extracting atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc., and then learning a model to answer queries.
Also have a look at Stanford NLP for more ideas on algorithms that you can use.
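Real NER systems such as the Stanford NLP tools use trained statistical models, but the core idea of extracting atomic elements into predefined categories can be illustrated with a toy dictionary (gazetteer) lookup. The word lists below are made up for the sketch:

```python
# Toy gazetteer-based entity extraction: a deliberately simplified stand-in
# for NER, where category membership comes from a fixed dictionary rather
# than a trained model.
GAZETTEER = {
    "india": "LOCATION",
    "obama": "PERSON",
    "delhi": "LOCATION",
}

def extract_entities(text):
    """Return (token, category) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.lower().replace("?", "").replace(".", "").split():
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
    return entities

print(extract_entities("Where is India located?"))
print(extract_entities("Who is the father of Obama?"))
```

Once entities like these are extracted at index time, they can be stored in dedicated Solr fields, which is what makes category-aware queries possible.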

drink calculator for flash

My charity organisation has asked me to produce an alcohol-units calculator just like the one illustrated on the website http://www.drinkaware.co.uk/tips-and-tools/drink-diary/.
I am going to use Flash with ActionScript. My question, as I am new to ActionScript, is:
Will I need a database to link the alcoholic beverages to in the drop-down menus?
If so, how is this done, or is there an easier way to achieve it?
Any tutorial links or advice would be very much appreciated.
Thanks
Peter
You may not necessarily need a database, as there are a number of ways this could be done. If you're interested in keeping the data outside the Flash file, you could use a database.
Connecting Flash to an external database can get pretty complex, though, and often there are better options, particularly when the project is a fairly small one like this.
I think XML is a nice option for your data in this case. You could store all the drink data in an XML file and then write AS code to populate the combo boxes with data from the XML.
This will help with creating the drop-down menus (ComboBoxes): http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/fl/controls/ComboBox.html
And here's some more info on using XML to store your data:
http://help.adobe.com/en_US/as3/dev/WS5b3ccc516d4fbf351e63e3d118a9b90204-7e6a.html
http://help.adobe.com/en_US/as3/dev/WS5b3ccc516d4fbf351e63e3d118a9b90204-7e71.html
It could all be coded into the Flash piece also, without using a database or XML, but that's generally not considered good practice since it makes updating the data somewhat more difficult.
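The XML layout suggested above might look something like this. The drink names and unit values are invented, and the parsing is shown in Python (xml.etree) rather than AS3 just to illustrate the data shape; in Flash the equivalent loop would populate the ComboBox:

```python
import xml.etree.ElementTree as ET

# A possible layout for the drink data (names and unit values are made up).
DRINKS_XML = """
<drinks>
    <drink name="Lager (pint)" units="2.3"/>
    <drink name="Red wine (175ml)" units="2.1"/>
    <drink name="Whisky (25ml)" units="1.0"/>
</drinks>
"""

root = ET.fromstring(DRINKS_XML)

# Build (label, units) pairs; in AS3 these would become ComboBox entries.
options = [(d.get("name"), float(d.get("units"))) for d in root.findall("drink")]
for name, units in options:
    print(f"{name}: {units} units")
```

Keeping the data in a separate XML file like this means the charity can update drink values without touching the Flash source.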

Resources