Working on an NLP project and would really benefit from any expert help.
I'm looking to narrow down my options and select the most appropriate analysis methods and techniques for a project I'm working on. My question is about what I should do given the data I have. Any help (for a newbie) is very much appreciated.
My data: open-text, short-string responses to a survey question. I have about 20 surveys; each has a high number of respondents (3K+), although relatively few answer this particular question (typically ~50 per survey). The responses are short (typically a one-line/one-sentence response), but across the 20 surveys that adds up to a reasonable corpus to work with.
Here's what I was planning (high level): preprocess and clean the data, run some descriptives on the text data itself (bag-of-words counts, word frequency, maybe tf-idf, word clouds), then attempt some topic modelling and maybe sentiment analysis.
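As a taste of the descriptives step, tf-idf is simple enough to compute directly before reaching for a library. Here's a minimal sketch in plain Python (the function name and the log-based idf variant are choices made for this example):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists -> one {term: score} dict per document."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

docs = [["cheap", "fast", "cheap"], ["cheap", "slow"]]
scores = tf_idf(docs)
```

A term like "cheap" that appears in every response gets an idf of zero, which is exactly the behaviour you want when filtering out survey boilerplate.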
My main questions as I work my way through this massive learning process:
Would this type of data set warrant any particular Topic Modelling or Sentiment Analysis techniques?
Are there any obvious or less obvious limitations or considerations I should keep in mind, as a result of the data I've got?
Are there any clear step-by-step guides you can recommend? (I've been dipping in and out of a lot of courses and reading, but any similar experiences or examples would be invaluable.)
I appreciate this is a bit text heavy and asking a lot, but any help and support would be really fantastic.
I'm a little bit late to the party, but in terms of topic modeling, the best starting point is LDA. There are a bunch of implementations of it (the best is MALLET), and it is relatively easy to understand. There are a bunch of topic models designed for short texts such as open-ended survey responses, including some that I helped design. Our models can be found in the python package GDTM. Take a look at NLDA, which is designed for short texts, and Guided Topic Model (GTM), which is also designed for short texts, but which allows you to provide seed topics if you already know some important topics. Have fun :)
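If it helps to see what LDA is doing under the hood, here is a toy collapsed Gibbs sampler in plain Python. This is illustrative only (use MALLET or gensim for real work), and the function name and hyperparameter defaults are my own:

```python
import random
from collections import defaultdict

def toy_lda(docs, n_topics=2, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs. Toy scale only."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]  # topic of each token
    ndk = [[0] * n_topics for _ in docs]                # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]   # topic-word counts
    nk = [0] * n_topics                                 # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                             # unassign current topic
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                             # resample and reassign
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return nkw  # inspect top words per topic from these counts

docs = [["cat", "dog", "cat", "pet"], ["stock", "market", "stock", "trade"]] * 3
topic_words = toy_lda(docs)
```

The whole model is just those three count tables plus the resampling step, which is why LDA is a good first topic model to learn.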
I am replying as a self-starter in NLP like yourself, so I have come across similar considerations in my projects too. First of all, the corpus you have sounds adequate for the analyses you intend to carry out; however, the best test is really to apply a topic model and see what kind of results you get.
For topic modelling, I find using Gensim quite handy and comprehensive:
https://nicharuc.github.io/topic_modeling/#topic=10&lambda=1&term= - provides a more comfortable intro to LDA, including ways to evaluate the results, e.g. coherence values and sensitivity analysis.
For simplicity, Top2Vec provides an easy way of building an embedding-based topic model: https://github.com/ddangelov/Top2Vec - do note that the corpus needs to be large enough for it to produce anything (let alone something useful).
Hope this helps and good luck!
I'm building a website that will rely on heavy computation to make guesses and suggestions about objects (considering the user's preferences and those of users with similar profiles). Right now I'm using MongoDB for my projects, but I suppose I'll have to go back to SQL for this one.
Unfortunately my knowledge of the subject is high-school level. I know there are a lot of relational databases, and I was wondering which would be most appropriate for this kind of heavily dynamic cluster analysis. I would also really appreciate suggestions for possible readings (ideally free and online, but I won't mind reading a book; just maybe not a 1k-page one, if possible).
Thanks for your help, extremely appreciated.
Recommendations are typically a graph-like problem, so you should also consider looking into graph databases, e.g. Neo4j.
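Even before choosing a database, the neighbourhood-style recommendation logic itself is easy to prototype in memory. Here's a minimal sketch using Jaccard similarity over user preference sets (all names and data are illustrative):

```python
from collections import Counter

def recommend(prefs, user, k=2):
    """prefs: {user: set(items)}. Suggest unseen items from the k most similar users."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    ranked = sorted(((jaccard(prefs[user], prefs[u]), u)
                     for u in prefs if u != user), reverse=True)
    seen, weighted = prefs[user], Counter()
    for sim, u in ranked[:k]:
        if sim == 0:
            continue  # ignore users with nothing in common
        for item in prefs[u] - seen:
            weighted[item] += sim   # weight each suggestion by neighbour similarity
    return [item for item, _ in weighted.most_common()]

prefs = {"ann": {"x", "y"}, "bob": {"x", "y", "z"}, "cal": {"q"}}
```

Once you know this is the shape of the query ("find similar users, then their items"), it becomes clearer whether a relational join, a graph traversal, or a precomputed similarity table fits your scale best.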
I am writing a bot for an RTS game.
I am using fuzzy logic to evaluate the current position (mine and the enemies') and to issue commands.
I have a couple of fuzzy variables: military_buildings, civilian_building, army_power, enemy_power and distance. I also have a couple of fuzzy linguistic values like VERY_GOOD, GOOD, NORMAL, BAD, VERY_BAD.
My next task is to make the bots learn, so they don't all behave the same way. Any advice or ideas on how to solve this?
One idea is to use a GA for tuning parameters (but I don't know the players' ratings, so I can't tell whether the bot beat a weak player or lost to a strong one).
Does anyone have experience with similar problems? (I can change the implementation and replace the fuzzy logic if there is an easier way for the bots to learn from experience.)
Have a look at reinforcement learning. Here are a quick preview and a book that can help you.
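To give a flavour of what reinforcement learning looks like in code, here is a minimal tabular Q-learning sketch on a toy "walk right to win" chain world (the environment and all parameters are invented for illustration, not taken from any RTS):

```python
import random

def q_learn(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a toy chain: action 1 = step right, action 0 = step left.
    Reaching the rightmost state ends the episode with reward 1."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            if rng.random() < eps:                      # epsilon-greedy exploration
                a = rng.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learn()
policy = [0 if q0 > q1 else 1 for q0, q1 in Q[:-1]]    # greedy action per state
```

In a bot, states would be discretized game situations (your fuzzy variables are a natural starting point) and the reward would come from match outcomes, so each bot's own experience drives its behaviour apart from the others'.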
Based on your description, this is what I'd use :)
The idea of using GAs to tune the parameters to Fuzzy Linguistic Variables is a good one (I wish I thought of it!); the fuzzy logic gives you a nice continuous response curve while the GA will search through a large solution space. I think it's definitely a strategy worth pursuing; you should write up your results.
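A minimal GA for tuning real-valued parameters (such as the breakpoints of fuzzy membership functions) can be sketched in a few lines. The fitness function below is a stand-in, since in the real bot fitness would have to come from game outcomes:

```python
import random

def evolve(fitness, n_params, pop_size=30, gens=40, sigma=0.1, seed=0):
    """Minimal real-valued GA: keep the best half, refill via crossover + mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(gens):
        elite = sorted(pop, key=fitness, reverse=True)[:pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(n_params)               # one-point crossover
            child = [g + rng.gauss(0, sigma)            # gaussian mutation
                     for g in a[:cut] + b[cut:]]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# stand-in fitness: push a single membership-function midpoint towards 0.7
best = evolve(lambda genome: -abs(genome[0] - 0.7), n_params=1)
```

The unknown-opponent-strength problem mentioned in the question mostly affects the fitness function; one common workaround is to evaluate candidates against each other (self-play tournaments) rather than against externally rated players.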
If I were you, I would look at the annual AIIDE Starcraft Competition; it is sponsored in part by AAAI, so there are some really high-quality approaches to this problem, particularly if you are concerned with higher-level reasoning like resource management (Starcraft Competition Site). Also, the competitors' source code is all available open source, so if you want to check out some other techniques I recommend it. FYI, most of the top competitors for this type of problem have historically used some variant of a probabilistic state machine (Paper on Probabilistic FSMs), so this may make a good test bed for parameter tuning. This is also the approach that some of the top Game AI middleware uses, like XAIT.
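A probabilistic state machine of that kind is simple to sketch; the states and transition probabilities below are invented for illustration, not taken from any competitor:

```python
import random

# illustrative high-level bot behaviours and made-up transition weights
TRANSITIONS = {
    "scout":  [("scout", 0.3), ("expand", 0.4), ("attack", 0.3)],
    "expand": [("expand", 0.5), ("attack", 0.5)],
    "attack": [("scout", 0.2), ("attack", 0.8)],
}

def step(state, rng):
    """Pick the next state according to the current state's transition weights."""
    nxt, weights = zip(*TRANSITIONS[state])
    return rng.choices(nxt, weights=weights)[0]

rng = random.Random(0)
state, history = "scout", ["scout"]
for _ in range(10):
    state = step(state, rng)
    history.append(state)
```

Tuning those transition weights per bot (e.g. with a GA) is exactly the kind of parameter search that keeps individual bots from all behaving the same way.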
I need help choosing a project to work on for my master's graduation. The project must involve AI/machine learning or business intelligence, but suggestions outside these topics are OK too. Please help me.
One of the most rapidly growing areas in AI today is computer vision. There are many practical needs where the results of your master's thesis could be helpful. You could try researching something like emotion detection, eye-tracking, etc.
An appropriate thesis for an MS in CS at any good university could survey the current state of research in this field and compare different approaches and algorithms. As a practical part, it is also a lot of fun when your program recognizes your mood properly :)
Netflix
If you want to work on non-trivial datasets (not Google-sized, but not trivial either, and with a real application) with an objective measure of success, why not work on the Netflix challenge (the first one)? You can get all the data for free, there are many papers on it, and there is a pretty good way to compare your results against other people's (since everyone used exactly the same dataset, and it was not so easy to "cheat", contrary to what happens quite often in the academic literature). While not trivial in size, it can be tackled with a single computer (assuming it is recent enough), and depending on the type of algorithms you use, you can implement them in a language other than C/C++, at least for prototyping (for example, I got decent results doing everything in Python).
Bonus point, it passes the "family" test: easy to tell your parents what you are working on, which is always a pain in my experience :)
Music-related tasks
A bit more original: something that is both cool, not trivial but not too complicated in data handling is anything around music, like music genre recognition (classical / electronic / jazz / etc...). You would need to know about signal processing as well, though - I would not advise it if you cannot get easy access to professors who know about the topic.
I can use the same answer I used on a previous, similar question:
Russ Greiner has a great list of project topics for his machine learning course, so that's a great place to start.
Both GAs and ANNs are learners/classifiers. So I ask you the question, what is an interesting "thing" to learn? Maybe it's:
Detecting cancer
Predicting the outcome between two sports teams
Filtering spam
Detecting faces
Reading text (OCR)
Playing a game
The sky is the limit, really!
Since it has a business tie-in: given some input set, determine probable business fraud from the input (something the SEC seems challenged in doing). We now have several examples (Madoff and others). Or a system to estimate investment risk (there are lots of such systems, apparently, but were any accurate in the case of Lehman, for example?).
A starting point might be the Chen book Genetic Algorithms and Genetic Programming in Computational Finance.
Here's an AAAI writeup of an award to the National Association of Securities Dealers for a system that monitors NASDAQ insider trading.
Many great answers have been posted already, but I wanted to add my 2 cents. There is one hot topic that big companies everywhere are investing lots of resources in, and it is still a very challenging area with lots of potential: automated detection of fake news.
This is even more relevant nowadays, when most of us connect through social media and there's a huge crisis looming.
Fake news, content removal, source reliability... The problem is huge and very exciting. As I said, it is challenging because it can be approached from many perspectives (from analysing images to detect fakes using adversarial networks, to detecting fake written news based on text content (NLP), or using graph theory to trace sources), and the possibilities for a research project are endless.
I suggest you read some general articles (e.g. this or this) or have a look at research articles from the last couple of years (a quick Google search will turn up a lot of related material).
I wish I had the opportunity to start a project on this topic over again. I think it's going to be of the utmost relevance in the next few years.
I'm working on a project at the moment where it would be really useful to be able to detect when a certain topic/idea is mentioned in a body of text. For instance, if the text contained:
Maybe if you tell me a little more about who Mr Jones is, that would help. It would also be useful if I could have a description of his appearance, or even better a photograph?
It'd be great to be able to detect that the person has asked for a photograph of Mr Jones. I could take a really naïve approach and just look for the word "photo" or "photograph", but this would obviously be no good if they wrote something like:
Please, never send me a photo of Mr Jones.
Does anyone know where to start with this? Is it even possible?
I've looked into things like nltk, but I've yet to find an example of someone doing something similar and am still not entirely sure what this kind of analysis is called. Any help that can get me off the ground would be great.
Thanks!
The best thing out there that might be useful to you is automatic sentiment analysis. This is used, for example, to judge whether, say, a customer review is positive or negative. I cannot give you direct pointers to available tools, but this is what you are looking for.
I must say, though, that this is a current hot topic in natural language processing and I’ve seen a number of papers at conferences. It’s definitely quite a complex matter and if you’re starting from scratch, it might take quite some time before you get the results that you want.
NLTK is not a bad framework for parsing natural language, but beware that this is not a simple matter. Doing stuff like this is really research-level programming.
A good thing that makes it much easier is if you have a very limited domain - say your application focuses on information about famous writers, then you can avoid some complexities of natural language like certain types of ambiguities.
Where to start? Good question. I don't know of any tutorials on the topic (and I presume you tried the Google option) but I'd imagine that iTunes U would have a course on the topic. If not I can post a link to a course I've done that mentions the subject and wasn't completely horrible: http://www.inf.ed.ac.uk/teaching/courses/inf2a/lecturematerials/index.html#lecture01
The problem you are tackling is very challenging.
I would start by first identifying the entities in the text (a problem referred to as Named Entity Recognition; google it), and then I would try to identify concepts.
If you want to roughly identify what the text is about, I suggest you start by using WordNet and, according to the words and their places in the hierarchy, identify the concepts involved.
If you want to produce a system that shows real intelligence, then you should start researching resources such as Cyc (OpenCyc), which will allow you to convert the sentences into FOL (first-order logic) sentences.
This is the hardcore AI approach to solving your problem. For a simple chat bot, it would be easier to rely on simple statistical methods.
Good luck!
Long time listener, first time caller.
I'm a full time SE during the day and a full time data mining student at night. I've taken the courses, and heard what our professors think. Now, I come to you - the stackoverflowers, to bring out the real truth.
What is your favorite data mining algorithm and why? Are there any special techniques you've used that have helped you to be successful in the past?
Thanks!
Most of my professional experience involved last-minute feature additions like, "Hey, we should add a recommendation system to this e-commerce site." The solution was usually a quick and dirty nearest-neighbor search: brute force, Euclidean distance, doomed to fail if the site ever became popular. But hey, premature optimization and all that...
I do enjoy the idea that data mining can be elegant and wonderful. I've followed the Netflix Prize and played with its dataset. In particular, I like the fact that imagination and experimentation have played such a large part in developing the top ten entries:
Acmehill blog
Acmehill New York Times article
Just a guy in a garage blog
Just a guy in a garage Wired article
So mostly, like a lot of software dev, I think the best algorithm is an open mind and some creativity.
There are a lot of data mining algorithms for different tasks, so I found it a little bit hard to choose.
I would say that my favorite data mining algorithm is Apriori, because it has inspired hundreds of other algorithms and has several applications. The Apriori algorithm in itself is quite simple, but it laid the basis for many other algorithms (FPGrowth, PrefixSpan, etc.) that use the so-called "Apriori property".
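Since the Apriori algorithm in itself is quite simple, here is a compact sketch of the level-wise search with the "Apriori property" pruning (minimum support is given as an absolute transaction count for simplicity):

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return every itemset appearing in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    frequent = {}
    current = [frozenset([i]) for i in items]
    k = 1
    while current:
        level = {s: c for s in current if (c := support(s)) >= min_support}
        frequent.update(level)
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Apriori property: prune any candidate with an infrequent k-subset
        current = [c for c in candidates
                   if all(frozenset(sub) in level for sub in combinations(c, k))]
        k += 1
    return frequent

freq = apriori([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]], min_support=2)
```

The pruning step is the "Apriori property" in action: {a, b, c} is only counted because all three of its pairs survived the previous level, and here it is then discarded for lacking support.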
If you can be more specific about the task your data mining algorithm should perform (classification, clustering, association rule mining, etc.), we can surely help you further.