Long time listener, first time caller.
I'm a full-time SE during the day and a full-time data mining student at night. I've taken the courses and heard what my professors think. Now I come to you, the stackoverflowers, to bring out the real truth.
What is your favorite data mining algorithm and why? Are there any special techniques you've used that have helped you to be successful in the past?
Thanks!
Most of my professional experience involved last-minute feature additions like, "Hey, we should add a recommendation system to this e-commerce site." The solution was usually a quick-and-dirty nearest neighbor search: brute force, Euclidean distance, doomed to fail if the site ever became popular. But hey, premature optimization and all that...
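For the curious, the whole quick-and-dirty approach fits in a few lines. A minimal sketch with NumPy (the purchase matrix and names are invented for illustration):

    import numpy as np

    def recommend(user_vector, purchase_matrix, k=5):
        # Brute-force nearest neighbours: compare one user's purchase vector
        # against every other user's by Euclidean distance. O(users * items)
        # per query, which is exactly why it falls over once the site grows.
        dists = np.linalg.norm(purchase_matrix - user_vector, axis=1)
        return np.argsort(dists)[:k]  # indices of the k most similar users
                                      # (the user themselves comes back first)

    # toy usage: 4 users x 3 products (1 = bought)
    purchases = np.array([[1, 0, 1],
                          [1, 1, 0],
                          [0, 0, 1],
                          [1, 0, 0]], dtype=float)
    print(recommend(purchases[0], purchases, k=2))

You would then recommend whatever the neighbours bought that the target user has not.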
I do enjoy the idea that data mining can be elegant and wonderful. I've followed the Netflix Prize and played with its dataset. In particular, I like the fact that imagination and experimentation have played such a large part in developing the top ten entries:
Acmehill blog
Acmehill New York Times article
Just a guy in a garage blog
Just a guy in a garage Wired article
So mostly, like a lot of software dev, I think the best algorithm is an open mind and some creativity.
There are a lot of data mining algorithms for different tasks, so I found it a little hard to choose.
I would say that my favorite data mining algorithm is Apriori, because it has inspired hundreds of other algorithms and has several applications. The Apriori algorithm itself is quite simple, but it has laid the basis for many other algorithms (FP-Growth, PrefixSpan, etc.) that use the so-called "Apriori property": every subset of a frequent itemset must itself be frequent.
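To make the idea concrete, here is a deliberately naive Apriori sketch in pure Python (the basket data is made up; real implementations are far more careful about candidate generation):

    from itertools import combinations

    def apriori(transactions, min_support):
        # Grow frequent itemsets level by level, pruning with the Apriori
        # property: every subset of a frequent itemset must be frequent.
        transactions = [frozenset(t) for t in transactions]

        def support(itemset):
            return sum(itemset <= t for t in transactions) / len(transactions)

        level = {frozenset([i]) for t in transactions for i in t}
        level = {s for s in level if support(s) >= min_support}
        frequent, k = {}, 1
        while level:
            frequent.update({s: support(s) for s in level})
            # join k-itemsets into (k+1)-candidates...
            candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
            # ...and keep only those whose k-subsets are all frequent
            level = {c for c in candidates
                     if all(frozenset(sub) in frequent for sub in combinations(c, k))
                     and support(c) >= min_support}
            k += 1
        return frequent

    baskets = [{'bread', 'milk'}, {'bread', 'beer'},
               {'bread', 'milk', 'beer'}, {'milk'}]
    print(apriori(baskets, min_support=0.5))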
If you can be more specific about the task the data mining algorithm should perform (classification, clustering, association rule detection, etc.), we can surely help you more.
Working on an NLP project and would really benefit from any expert help.
I'm looking to narrow down my options and select the most appropriate analysis methods and techniques for a project I'm working on. My question concerns what I should do with the data I have. Any help (for a newbie) is very appreciated.
My data: open-text, short-string responses to a survey question. I have multiple surveys; each has a high number of respondents (3K+), although relatively few respond to this particular question (typically 50 per survey). The responses are short (typically a one-line/one-sentence response), but I have about 20 surveys, so a reasonable corpus to work with.
Here's what I was planning (high level): preprocess and clean the data, run some descriptive statistics on the text itself (bag-of-words, word frequency, maybe tf-idf, word clouds), then attempt some Topic Modelling and maybe Sentiment Analysis.
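For illustration, I imagined the descriptive step looking roughly like this with scikit-learn (the responses below are invented placeholders for real survey answers):

    from collections import Counter
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    responses = ["More flexible working hours please",
                 "The working hours are too rigid",
                 "Better coffee in the office"]

    # bag-of-words / word frequency
    bow = CountVectorizer(stop_words='english')
    counts = bow.fit_transform(responses)
    freq = Counter(dict(zip(bow.get_feature_names_out(), counts.sum(axis=0).A1)))
    print(freq.most_common(5))

    # tf-idf down-weights terms that appear in nearly every response
    tfidf = TfidfVectorizer(stop_words='english')
    weights = tfidf.fit_transform(responses)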
My main questions as I work my way through this massive learning process:
Would this type of data set warrant any particular Topic Modelling or Sentiment Analysis techniques?
Are there any obvious or less obvious limitations or considerations I should keep in mind, as a result of the data I've got?
Are there any clear step-by-step guides you can recommend? (I've been dipping in and out of a lot of courses and reading, but any similar experiences or examples would be invaluable.)
I appreciate this is a bit text heavy and asking a lot, but any help and support would be really fantastic.
I'm a little bit late to the party, but in terms of topic modeling, the best starting point is LDA. There are a bunch of implementations of it (the best is MALLET), and it is relatively easy to understand. There are a bunch of topic models designed for short texts such as open-ended survey responses, including some that I helped design. Our models can be found in the python package GDTM. Take a look at NLDA, which is designed for short texts, and Guided Topic Model (GTM), which is also designed for short texts, but which allows you to provide seed topics if you already know some important topics. Have fun :)
I am replying as a self-starter in NLP like yourself, so I have come across similar considerations in my projects too. First of all, the corpus you have sounds adequate for the analyses you intend to carry out; however, the best test is really to apply a topic model and see what kind of results you get.
For topic modelling, I find using Gensim quite handy and comprehensive:
https://nicharuc.github.io/topic_modeling/#topic=10&lambda=1&term= - provides a comfortable intro to LDA, including ways to evaluate the results, e.g. coherence values and sensitivity analysis.
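A bare-bones Gensim run looks something like this (the tokenised docs are stand-ins for cleaned survey responses; num_topics is something you would tune against coherence):

    from gensim import corpora
    from gensim.models import LdaModel
    from gensim.models.coherencemodel import CoherenceModel

    docs = [["flexible", "working", "hours"],
            ["working", "hours", "rigid"],
            ["coffee", "office"]]  # pre-tokenised, cleaned responses

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=2, random_state=42, passes=10)

    # c_v coherence is one common score for comparing topic counts
    coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                               coherence='c_v').get_coherence()
    print(lda.print_topics(), coherence)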
For simplicity, Top2Vec provides an easy way of embedding a topic model: https://github.com/ddangelov/Top2Vec - do note that the corpus needs to be large enough for it to produce anything (let alone something useful).
Hope this helps and good luck!
I'm not sure if this is the right place to ask this, but here goes.
I have been a programmer for about 12 years now, with experience in PHP, Java, C#, VB.NET and ASP. I have always been rather intrigued by Artificial Intelligence. I think it is really the ultimate challenge for any developer.
I have written many simple scripts to play games, but nothing compared to what I want to do next. I want to write an AI program that will play an MMORTSG (Massively Multiplayer Online Real-Time Strategy Game). I have been searching through many AI techniques, but none seem to tackle the problems that I know I will face:
Problems I can foresee:
The game has no "win situation"; instead, the best strategy is the one that produces the greatest growth compared to that of other players. Growth is determined by three factors: economy, military and research.
Parts of the game state are unpredictable. Other players can attack me at random.
The game is time-based and actions take time, e.g. building a new building may take several hours. While that building is being built, no other buildings can be built.
All the AI systems I have researched require some sort of "winning function" to test whether the AI has reached an end state, whereas in my situation it would more likely be something like "I have X, Y, Z options; the best one is X".
PS: Sample code would be awesome. Even pseudocode would be great.
I've seen a few applications of artificial intelligence in gaming, but most of them were for FPS, MMORPG and RTS games. The genre you appear to be describing sounds similar to 'Clash of Clans', where research, military and economy grow, and random attacks occur, over an endless period of time.
It seems that a model would be used at key points in the game (a building is finished, research becomes available, the castle is full) to make strategic decisions for progression. Perhaps a genetic algorithm could be applied at key moments to determine a suitable sequence of future steps. A modular neural network could be defined to decide which growth factor to pursue, but training such a network may be difficult because the game rules can change over time (through previously unknown resources, research options, military and even game updates). Scripts are quite common in the MMORPG genre as well, but defining the manual rules could also be difficult without knowing all of the available options. The fact is that there are so many ways your challenge can be addressed that it is difficult to give a clear-cut answer, let alone code or pseudocode.
Looking briefly over the problem, it appears that the contributing factors to the problem would be current economic state, current military state, current research state, time lost if saving for next upgrade, time required to build next upgrade, cost of upgrade as well as other unknown factors.
Given that the problem has no clear winning objectives, I guess it is a matter of maintaining a healthy balance between the three growth factors. But how does one define the balance? Is research more important? Should you always have money, or just save enough for the next planned upgrade? Should military be as large as possible?
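That said, one simple way to make "I have X, Y, Z options, the best one is X" concrete is a hand-tuned utility function over the candidate actions. A rough sketch, with entirely invented weights and growth numbers:

    # Score each available action by its weighted growth per hour,
    # so slow buildings are penalised, then pick the best.
    WEIGHTS = {'economy': 0.4, 'military': 0.3, 'research': 0.3}

    def utility(action):
        gain = sum(WEIGHTS[f] * action['growth'].get(f, 0) for f in WEIGHTS)
        return gain / max(action['build_hours'], 1)

    options = [
        {'name': 'farm',     'build_hours': 2, 'growth': {'economy': 10}},
        {'name': 'barracks', 'build_hours': 6, 'growth': {'military': 25}},
        {'name': 'lab',      'build_hours': 4, 'growth': {'research': 18}},
    ]

    print(max(options, key=utility)['name'])  # -> 'farm' with these numbers

The weights are exactly the kind of thing a genetic algorithm could tune over many simulated games.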
The challenge you are placing before yourself is quite ambitious, but I would recommend taking on smaller challenges first if you are not yet familiar with the models that AI has to offer. There are quite a number of AI resources for gaming applications available to inspire your model (including ziggystar's examples noted above).
I'm starting to study machine learning and bayesian inference applied to computer vision and affective computing.
If I understand right, there is a big debate between classical AI, ontology and semantic web researchers on one side, and the machine learning and Bayesian crowd on the other.
I think it is usually referred to as strong AI vs. weak AI, related also to philosophical issues like functional psychology (the brain as a black box) and cognitive psychology (theory of mind, mirror neurons), but this is not the point in a programming forum like this.
I'd like to understand the differences between the two points of view. Ideally, answers will reference examples and academic papers where one approach gets good results and the other fails. I am also interested in the historical trends: why approaches fell out of favour and newer approaches rose up. For example, I know that exact Bayesian inference is computationally intractable (an NP-hard problem in general), and that's why for a long time probabilistic models were not favoured in the information technology world. However, they have begun to rise up in econometrics.
I think you have got several ideas mixed up together. It's true that there is a distinction that gets drawn between rule-based and probabilistic approaches to 'AI' tasks, however it has nothing to do with strong or weak AI, very little to do with psychology and it's not nearly as clear cut as being a battle between two opposing sides. Also, I think saying Bayesian inference was not used in computer science because inference is NP complete in general is a bit misleading. That result often doesn't matter that much in practice and most machine learning algorithms don't do real Bayesian inference anyway.
Having said all that, the history of Natural Language Processing went from rule-based systems in the 80s and early 90s to machine learning systems up to the present day. Look at the history of the MUC conferences to see the early approaches to the information extraction task. Compare that with the current state of the art in named entity recognition and parsing (the ACL wiki is a good source for this), which is all based on machine learning methods.
As far as specific references go, I doubt you'll find anyone writing an academic paper that says 'statistical systems are better than rule-based systems', because it's often very hard to make a definite statement like that. A quick Google for 'statistical vs. rule based' yields papers like this, which looks at machine translation and recommends using both approaches according to their strengths and weaknesses. I think you'll find that this is pretty typical of academic papers. The only thing I've read that really takes a stand on the issue is 'The Unreasonable Effectiveness of Data', which is a good read.
As for the "rule-based" vs. "probabilistic" thing, you can go for the classic book by Judea Pearl, "Probabilistic Reasoning in Intelligent Systems". Pearl writes with a strong bias towards what he calls "intensional systems", which are basically the counterpart to rule-based approaches. I think this book is what set off the whole probabilistic thing in AI (you can also argue the time was due, but then it was THE book of its time).
I think machine learning is a different story (though it is nearer to probabilistic AI than to logic).
I need help choosing a project to work on for my master's graduation. The project must involve AI/machine learning or business intelligence, but if there is any suggestion outside these topics, that is OK too. Please help me.
One of the most rapidly growing areas in AI today is computer vision. There are many practical needs where the results of your master's thesis could be helpful. You could try researching something like emotion detection, eye tracking, etc.
An appropriate piece of work for an MS in CS at any good university could highlight the current status of research in this field and compare different approaches and algorithms. As a practical part, it is also a lot of fun when your program recognizes your mood properly :)
Netflix
If you want to work on a non-trivial dataset (not Google-sized, but not trivial either, and with a real application) with an objective measure of success, why not work on the Netflix challenge (the first one)? You can get all the data for free, there are many papers on it, and you have a pretty good way to compare your results against other people's (since everyone used exactly the same dataset, and it was not so easy to "cheat", contrary to what happens quite often in the academic literature). While not trivial in size, you can work on it with only one computer (assuming it is recent enough), and depending on the type of algorithms you are using, you can implement them in a language which is not C/C++, at least for prototyping (for example, I could get decent results doing things entirely in Python).
Bonus point, it passes the "family" test: easy to tell your parents what you are working on, which is always a pain in my experience :)
Music-related tasks
A bit more original: something that is cool and non-trivial, but not too complicated in data handling, is anything around music, like music genre recognition (classical / electronic / jazz / etc.). You would need to know about signal processing as well, though; I would not advise it if you cannot get easy access to professors who know about the topic.
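For a taste of the classic approach: summarise each clip with spectral features and feed a standard classifier. A rough sketch (the file names are invented; GTZAN is a commonly used public dataset for this task):

    import librosa
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def mfcc_features(path):
        # Summarise a clip as the mean and std of its MFCCs --
        # crude, but a classic starting point for genre recognition.
        y, sr = librosa.load(path, duration=30)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # hypothetical labelled clips standing in for a real training set
    X = np.array([mfcc_features(p) for p in ["jazz1.wav", "classical1.wav"]])
    y = ["jazz", "classical"]
    clf = RandomForestClassifier().fit(X, y)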
I can use the same answer I used on a previous, similar question:
Russ Greiner has a great list of project topics for his machine learning course, so that's a great place to start.
Both GAs and ANNs are learners/classifiers. So I ask you the question, what is an interesting "thing" to learn? Maybe it's:
Detecting cancer
Predicting the outcome between two sports teams
Filtering spam
Detecting faces
Reading text (OCR)
Playing a game
The sky is the limit, really!
Since it has a business tie-in: given some input set, determine probable business fraud from the input (something the SEC seems challenged in doing). We now have several examples (Madoff and others). Or a system to estimate investment risk (there are apparently lots of such systems, but were any accurate in the case of Lehman, for example?).
A starting point might be the Chen book Genetic Algorithms and Genetic Programming in Computational Finance.
Here's an AAAI writeup of an award to the National Association of Securities Dealers for a system that monitors NASDAQ insider trading.
Many great answers have been posted already, but I wanted to add my 2 cents. There is one hot topic into which big companies all around are investing lots of resources, and it is still a very challenging topic with lots of potential: automated detection of fake news.
This is even more relevant nowadays, when most of us are connecting through social media and there's a huge crisis looming over.
Fake news, content removal, source reliability... the problem is huge and very exciting. As I said, it is challenging because it can be approached from many perspectives (from analysing images to detect fakes using adversarial networks, to detecting fake written news based on text content (NLP), or using graph theory to find sources), and the possibilities for a research project are endless.
I suggest you read some general articles (e.g. this or this) or have a look at research articles from the last couple of years (a quick Google search will turn up a lot of related material).
I wish I had the opportunity to start a project based on this topic. I think it's going to be of the utmost relevance in the next few years.
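Even a crude text-only baseline makes the NLP angle concrete. A hedged sketch with scikit-learn (the headlines and labels are invented; real work would use a corpus such as LIAR or FakeNewsNet):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["Scientists confirm water is wet",
             "Celebrity spotted riding a dragon over the city"]
    labels = [0, 1]  # 0 = real, 1 = fake (toy stand-ins)

    # tf-idf over word unigrams/bigrams feeding a linear classifier
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    model.fit(texts, labels)
    print(model.predict(["Dragon sightings increase downtown"]))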
There are many papers about ranged-combat artificial intelligence, like Killzone's (see this paper) or Halo's. But I have not been able to find much about fighting AI, except for this work, which uses neural networks to learn how to fight; that is not exactly what I'm looking for.
Western AI in games is heavily focused on FPS, it seems! Does anyone know which techniques are used to implement a decent fighting AI? Hierarchical finite state machines? Decision trees? They could end up being pretty predictable.
In our research labs, we are using AI planning technology for games. AI planning is used by NASA to build semi-autonomous robots. Planning can produce less predictable behavior than state machines, but planning is a highly complex problem; that is, solving planning problems has a huge computational cost.
AI planning is an old but interesting field. In gaming particularly, people have only recently started using planning to run their engines. The expressiveness is still limited in the current implementations, but in theory it is limited "only by our imagination".
Russell and Norvig devote four chapters to AI planning in their book on Artificial Intelligence. Other related terms you might be interested in are Markov decision processes and Bayesian networks. These topics are also given good coverage in that book.
If you are looking for a ready-made engine that is easy to start using, I guess AI planning would be gross overkill. I don't know of any AI planning engine for games, but we are developing one. If you are interested in the long term, we can talk separately about it.
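For intuition only, here is a toy forward-search planner in the STRIPS style; the build-order domain and action names are invented, and a real game planner needs far more machinery (durations, costs, heuristics):

    from collections import deque

    # Each action: (preconditions, add effects, delete effects).
    ACTIONS = {
        'build_farm':     (frozenset(),             frozenset({'food'}),     frozenset()),
        'build_barracks': (frozenset({'food'}),     frozenset({'barracks'}), frozenset()),
        'train_soldier':  (frozenset({'barracks'}), frozenset({'army'}),     frozenset()),
    }

    def plan(start, goal):
        # Breadth-first search over world states; returns a shortest plan.
        frontier = deque([(frozenset(start), [])])
        seen = {frozenset(start)}
        while frontier:
            state, steps = frontier.popleft()
            if goal <= state:
                return steps
            for name, (pre, add, delete) in ACTIONS.items():
                if pre <= state:
                    nxt = (state - delete) | add
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, steps + [name]))
        return None

    print(plan(set(), {'army'}))
    # -> ['build_farm', 'build_barracks', 'train_soldier']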
You seem to already know the techniques for planning and executing. Another thing you need to do is predict the opponent's next move and maximize the expected reward of your response. I wrote blog articles about this: http://www.masterbaboon.com/2009/05/my-ai-reads-your-mind-and-kicks-your-ass-part-2/ and http://www.masterbaboon.com/2009/09/my-ai-reads-your-mind-extensions-part-3/ . The game I consider is very simple, but I think the main ideas from Bayesian decision theory might be useful for your project.
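The core of that idea fits in a few lines. A sketch with an invented move set and payoff table:

    from collections import Counter

    # Estimate the opponent's next move from their history, then choose
    # the response with the highest expected reward under that belief.
    history = ['punch', 'punch', 'kick', 'punch']  # observed opponent moves
    PAYOFF = {('block_high', 'punch'):  1, ('block_low', 'punch'): -1,
              ('block_high', 'kick'):  -1, ('block_low', 'kick'):   1}

    counts = Counter(history)
    belief = {move: n / sum(counts.values()) for move, n in counts.items()}

    def expected_reward(my_move):
        return sum(p * PAYOFF[(my_move, opp)] for opp, p in belief.items())

    best = max(['block_high', 'block_low'], key=expected_reward)
    print(best)  # 'block_high': punches are three times as likely as kicks

A real fighter would smooth the belief (e.g. with a Dirichlet prior) and condition on game state, but the decision rule is the same.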
I have reverse engineered the routines related to the AI subsystem within the Street Fighter II series of games. It does not incorporate any of the techniques mentioned above. It is entirely reactive and involves no planning, learning or goals. Interestingly, there is no "technique weight" system that you mention, either. They don't use global weights for decisions to decide the frequency of attack versus block, for example. When taking apart the routines related to how "difficulty" is made to seem to increase, I did expect to find something like that. Alas, it comes down to a number of smaller decisions that could potentially affect those ratios in an emergent way.
Another route to consider is the so-called Ghost AI, as described here & here. As the name suggests, you basically extract rules from actual gameplay; the first paper does it offline, and the second extends the methodology to online real-time learning.
Also check out the author's webpage; there are a number of other interesting papers on fighting games:
http://www.ice.ci.ritsumei.ac.jp/~ftgaic/index-R.html
It's old, but there are some good examples there.