A webapp called StatSheet got funded today (August 4th, 2010)
http://techcrunch.com/2010/08/04/former-crunchies-finalist-statsheet-recieves-1-3-million-in-series-a/
They are doing 'automated journalism' - using computers to generate human-looking reports of sports games from the statistics
http://www.guardian.co.uk/media/pda/2010/mar/30/digital-media-algorithms-reporting-journalism
Does anyone have any insight into what approach/algorithms are being used to do this, and how it might be replicated?
The details for projects like this are a little sparse, but it looks like the baseball summarizer Stats Monkey consists of:
Statistical model: They build a model of how baseball games typically unfold, most likely by looking at how certain variables (e.g. runs, at bats, etc.) change during the course of a game or differ from what you'd expect to see going into the game (e.g. a no-name team scores more runs than a highly-favored team). How well a given game fits (or doesn't fit) this model gives them an idea of what might be interesting about that game (e.g. key plays or players).
Text generation: Given a library of pre-written narrative arcs (e.g. back-and-forth game, come-from-behind victory, etc.) they use the "interesting information" from the model of the game to construct a summary of the game. I'm not sure, but it looks like they use a decision tree -- conditioned on the information from the model -- to select one of these arcs.
Miscellaneous glue: This isn't mentioned in their writeup, but I'd imagine there are a fair number of hard-coded rules that "glue" the main narrative arcs into a single, cohesive story.
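None of the components above are documented in detail, but the "decision tree over narrative arcs" idea can be sketched roughly like this. All arc names, features, thresholds, and templates below are invented for illustration; nothing here is taken from Stats Monkey itself:

```python
# Hypothetical sketch: a tiny decision tree selects a narrative arc from
# simple features of the game model, then fills in a canned template.

def pick_arc(game):
    """Select a narrative arc from invented features of the game model."""
    if game["lead_changes"] >= 4:
        return "back-and-forth battle"
    if game["max_deficit_of_winner"] >= 5:
        return "come-from-behind victory"
    if game["final_margin"] >= 8:
        return "blowout"
    return "routine win"

def summarize(game):
    """Render the selected arc with a pre-written template."""
    templates = {
        "come-from-behind victory":
            "{winner} erased a {max_deficit_of_winner}-run deficit to beat {loser} {score}.",
        "back-and-forth battle":
            "{winner} outlasted {loser} {score} in a game with {lead_changes} lead changes.",
        "blowout":
            "{winner} cruised past {loser} {score}.",
        "routine win":
            "{winner} beat {loser} {score}.",
    }
    return templates[pick_arc(game)].format(**game)

game = {"winner": "Cubs", "loser": "Mets", "score": "7-5",
        "lead_changes": 1, "max_deficit_of_winner": 5, "final_margin": 2}
print(summarize(game))  # → Cubs erased a 5-run deficit to beat Mets 7-5.
```

A real system would presumably derive features like `max_deficit_of_winner` from play-by-play statistics and have far richer templates, but the select-then-fill structure is the point.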
The authors of Stats Monkey have done a fair amount of research in related areas, like website summarization and automatic content aggregation and generation. Here are a few papers that might be interesting:
Nathan Nichols and Kristian Hammond. “Machine-Generated Multimedia Content.” Proceedings of the Second International Conference on Advances in Computer-Human Interactions, 2009.
Nathan Nichols, Lisa Gandy, and Kristian Hammond. "From Generating to Mining: Automatically Scripting Conversation Using Existing Online Sources." The Proceedings of the Third International Conference on Weblogs and Social Media, 2009.
J. Liu and L. Birnbaum. 2008. "LocalSavvy: Aggregating Local Points of View about News Issues". WWW 2008 Workshop on Location on the Web.
Currently we are working in the Laboratory domain.
The Laboratory domain embraces many profiles, and each profile consists of many actors.
The LAB TF mentions LAB-n several times.
For example:
LAB-1~5 (5)
LAB-21~23 (3)
LAB-26~31 (6)
LAB-51 (1)
LAB-61~62 (2)
What are they actually?
Machine, Device, Actor or anything else?
What are they used for?
I don't know where the official dictionary-style definition is, and I failed to find one on the www.ihe.net website, but my reading is:
they are actually transaction identifiers.
Short sequences of letters and digits uniquely identifying the "process/scenario/sequence of events/unit of change/use case/.." you are talking about.
A patient comes to the doctor, and a nurse takes a blood sample and sends it to the laboratory to diagnose what kind of bad specimens live in there. After some time of analysis and cultivation the results are known; they travel back to the doctor, who then picks a diagnosis and proposes a treatment.
This flow of events can be described as a sequence of transactions. The details may vary, but the concepts are approximately the same regardless of country, town, or hospital: uniquely identified "steps" where the actors are typically mapped to (or supported by) communicating software components. The more compatible the "steps" (transactions) are, the easier it is to integrate equipment, people, and software coming from different cultures or vendors. IHE attempts to identify those patterns and give them names: transaction identifiers.
For instance, LAB-26 describes (in my interpretation) what happens when the analyzer device (called "(LD)Pre/PostAnalyzer") detects a specimen and needs to notify the automatic test scheduler (a software component called "Automation Manager") that the result is known and can be processed further, e.g. a laboratory worker can take the sample tubes out of the machine and insert another set, and the laboratory doctor can schedule another (refining) set of tests for this sample.
See also:
chapter "2.2 The generic IHE Transaction Model" in your document
I'm facing ranking problems using Solr and I'm stuck.
Given an e-commerce site, for the query "ipad" I obtain:
ipad case for ipad 2
ipad case
ipad connection kit
ipad 32gb wifi
This is a problem, since we want to rank the main products (the products themselves) first, while tf/idf ranks the accessories first because of descriptions like "ipad case compatible with ipad, ipad2, ipad3, ipad retina, ipad mini, etc".
Furthermore, the categories give us no way of determining whether an item is an accessory or a product.
I wonder if automatic classification would help. Any other solution that improves this ranking (such as Named Entity Recognition) would also be appreciated.
Could you provide tagged data?
If you have >50k items, a Naive Bayes classifier with a bigram language model trained on the product names will catch almost all accessories with ~99% accuracy. I guess you could train such a Naive Bayes with Mahout; however, product names contain a fairly limited set of bigrams, so nowadays this can be trained easily and quickly even on a smartphone.
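To make the bigram Naive Bayes idea concrete, here is a from-scratch toy (not Mahout) with invented training data; real product feeds would of course be far larger:

```python
# Toy Naive Bayes over word unigrams + bigrams of the product name.
# Training examples are invented for illustration.
from collections import Counter
import math

def features(name):
    """Word unigrams plus word bigrams of a product name."""
    toks = name.lower().split()
    return toks + [" ".join(p) for p in zip(toks, toks[1:])]

def train(examples):
    counts = {"product": Counter(), "accessory": Counter()}
    for name, label in examples:
        counts[label].update(features(name))
    vocab = {f for c in counts.values() for f in c}
    return counts, vocab

def classify(name, counts, vocab):
    """Pick the label with the highest Laplace-smoothed log-likelihood.
    (Class priors are equal here, so they are omitted.)"""
    best, best_score = None, float("-inf")
    for label, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[f] + 1) / (total + len(vocab)))
                    for f in features(name))
        if score > best_score:
            best, best_score = label, score
    return best

examples = [
    ("ipad 32gb wifi", "product"), ("ipad 64gb wifi 3g", "product"),
    ("iphone 16gb", "product"), ("ipad case for ipad 2", "accessory"),
    ("ipad smart cover", "accessory"), ("iphone power adapter", "accessory"),
]
counts, vocab = train(examples)
print(classify("ipad case", counts, vocab))   # → accessory
print(classify("ipad 16gb", counts, vocab))   # → product
```

Note that even with "ipad" appearing in both classes, the bigram "ipad case" and the unigram "case" pull the first query toward the accessory class.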
This is a typical Mechanical Turk task; it shouldn't be that expensive to tag a few items. However, if you insist on a semi-supervised algorithm, I found iterative similarity aggregation pretty useful.
The main idea is that you provide a few seed tokens like "case" or "power adapter", and it iteratively finds new tokens that are indicators of the same class (in your case, accessories) because they appear in the same contexts.
Here is the paper, and I have also written a blog post that sums up the idea in plain language. The paper also mentions the same "let the user find the right item" paradigm that Sean proposed, so the two can be used in conjunction.
Oh, and if you need advice on machine learning with Lucene & Solr, I can recommend my friend Tommaso Teofili's talk at ApacheCon Europe this year. You can find the slides on SlideShare; there is also a YouTube video of the talk out there, just search for it ;)
TF/IDF is just going to rank based on the words in the query versus the words in the title, as you have found. It sounds like that is not the right definition of a "good result" for you: you want to favor products over accessories.
Of course you can simply attach heuristics to patch the problem. For example, treat the title as a set of words rather than a multiset, so that "iPad" appearing several times makes no difference. Or boost the score of items that you know are products. These rules aren't learning per se, but they are simple, directly reflect your business knowledge, and will probably have some positive effect.
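In Solr itself the set-of-words idea roughly corresponds to indexing the title field with omitTermFreqAndPositions="true" (so term frequency stops mattering), and the product boost can be a boost query such as bq=type:product^2 with the dismax parser. The dedup variant can also be applied at feed time; a tiny sketch:

```python
# Feed-time dedup: keep each word's first occurrence so repeated mentions
# of "ipad" in an accessory title no longer inflate its term frequency.

def dedupe_title(title):
    """Collapse repeated words in a title, preserving order."""
    seen, out = set(), []
    for w in title.lower().split():
        if w not in seen:
            seen.add(w)
            out.append(w)
    return " ".join(out)

print(dedupe_title("ipad case for ipad 2"))  # → ipad case for 2
```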
If you want to learn here, you probably need to use the single best source of knowledge about what the best results are: your users. You know what they click in response to each query, so you can learn a term-item model that associates search terms with the items clicked. You can frame that as several types of problem; a latent-factor recommender model, for instance, could work well here.
Have a look at Ted's slides on how to use a recommender as a "search engine": http://www.slideshare.net/tdunning/search-as-recommendation
I'm looking to build an AI system to "pick" a fantasy football team. I have only basic knowledge of AI techniques (especially when it comes to game theory), so I am looking for advice on what techniques could be used to accomplish this and pointers to some reading materials.
I am aware that this may be a very difficult or maybe even impossible task for AI to accurately complete: however I am not too concerned on the accuracy, rather I am interested in learning some AI and this seems like a fun way to apply it.
Some basic facts about the game:
A team of 14 players must be picked
There is a limit on the total cost of players picked
The players picked must adhere to a certain configuration (there must always be one goalkeeper, at least two defenders, one midfielder and one forward)
The team may be altered on a weekly basis, but removing/adding more than one player a week will incur a penalty
P.S. I have stats on every match played in last season, could this be used to train the AI system?
This is interesting.
So if you didn't really care about accuracy at all, you could just come up with some heuristic for the quality of a team: for instance, assign a point value to each player and then maximize the team's total value under the cost limit using dynamic programming. Something like: http://www.cse.unl.edu/~goddard/Courses/CSCE310J/Lectures/Lecture8-DynamicProgramming.pdf
This would be similar to the knapsack problem.
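A minimal 0/1-knapsack sketch of that idea. The players, costs, and point values are invented, and the positional constraints from the question (goalkeeper, defenders, etc.) would need a richer DP state or an ILP solver:

```python
# 0/1 knapsack: maximize projected points under a total-cost budget.

def best_team(players, budget):
    """players: list of (name, cost, points); returns (points, names)."""
    # dp[b] = best (total points, chosen names) using at most budget b
    dp = [(0, [])] * (budget + 1)
    for name, cost, pts in players:
        new = dp[:]  # copy so each player is used at most once
        for b in range(cost, budget + 1):
            old_pts, old_names = dp[b - cost]
            if old_pts + pts > new[b][0]:
                new[b] = (old_pts + pts, old_names + [name])
        dp = new
    return max(dp)

players = [("Keeper A", 4, 120), ("Defender B", 5, 150),
           ("Midfielder C", 8, 210), ("Forward D", 7, 190)]
print(best_team(players, 15))  # → (400, ['Midfielder C', 'Forward D'])
```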
Technically this is AI, since a computer is deciding something, but maybe not what you had in mind.
You sound like you want a learning AI (http://en.wikipedia.org/wiki/Machine_learning), which is an interesting field. Here's how you can approach the problem.
Define your inputs. Right now you have last year's data, but you'll probably want data from many years. You might also be able to include pundits' rankings; plenty of magazines rank players, and that seems useful as well.
Take your inputs and feed them into some machine learning algorithm for each season. Wikipedia will help you out there.
Essentially, for each season you'll want to feed in your data, have your AI pick a team, and then rate the team's performance based on that season's results.
Keep doing this and maybe your bot will get better at picking teams, and you can then apply it to this year's data.
(If you only have last year's data, it's okay to train the algorithm on just that, but your AI will probably be overfit to that one set and won't be as accurate.)
This was just a sketch of how it might look. For a romp into AI, this problem is probably pretty hard so don't feel disheartened if it seems overwhelming at first.
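The train/pick/evaluate loop described above might look something like this, with all player data invented and a crude greedy picker standing in for whatever model you actually train:

```python
import random

def greedy_pick(players, budget, size):
    """Pick up to `size` players by predicted points within the budget."""
    team, spent = [], 0
    for p in sorted(players, key=lambda p: -p["predicted"]):
        if len(team) < size and spent + p["cost"] <= budget:
            team.append(p)
            spent += p["cost"]
    return team

def evaluate(team):
    """Rate the pick by the points the players actually went on to score."""
    return sum(p["actual"] for p in team)

# Invented season data: a model's prediction vs. what really happened.
random.seed(0)
season = [{"predicted": random.randint(0, 200),
           "actual": random.randint(0, 200),
           "cost": random.randint(4, 12)} for _ in range(30)]

team = greedy_pick(season, budget=100, size=14)
print(len(team), evaluate(team))
```

Replaying many past seasons this way gives you a score for each candidate picking strategy, which is exactly the feedback signal a learning approach needs.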
I am trying to do the following:
We are trying to design a fraud detection system for the stock market.
I know the specifications for the frauds (they are like templates),
so I want to know if I can design a template and find all records that match it.
Notice:
I can't use traditional queries because the templates are complex.
For example, one of my fraud patterns is circular trading. It works like this:
A bought from B, B bought from C, and C bought from A (it's a cycle),
and the cycle can include 4 or 5 people.
Is there any good suggestion for this situation?
I don't see why you can't use "traditional queries" as you've stated; SQL can be used to write extraordinarily complex queries. For that matter, I'm not sure this is a hugely challenging problem.
Firstly, the behavior you have described looks very transactional, so I'd treat the transactions as the model. I'd likely have a transactions table with columns like buyer, seller, amount, etc.
You could alternatively give the shares their own table and store, say, the previous 100 owners of each share in the same table using STI (Single Table Inheritance), by putting the primary keys of the owners into an "owners" column like 234/823/12334/1234/... That way you can run complex queries to see whether a share was owned by the same person, or look for patterns in the string, easily and quickly.
-update-
I wouldn't suggest making up a "small language"; I don't see why you'd want to do that when you have a huge selection of wonderful languages and databases to choose from, all of which have well-refined and well-tested methods for solving exactly what you are doing.
My best advice is to pop open your IDE (thumbs up for TextMate), pick your favorite language (Ruby in my case), find some sample data, create your database, and start writing code! You can't go wrong experimenting like this; it will expose better approaches than we can dream up here on Stack Overflow.
Definitely data mining. But as you point out, you've already got the models (your templates). Look up fraud detection rather than prevention for better search results.
I know some banks use SPSS PASW Modeler for fraud detection. It is very intuitive, and you can see what you are doing as you play around with the data, so you can implement your templates there. I agree with Joseph: you need to get playing and make some new data structures.
Maybe a time-series model?
Theoretically you could first develop a "small language": something with a simple syntax that makes expressing the domain (in your case, fraud patterns) easy, and from it generate one or more SQL queries.
Like most solutions, this can be thought of as a slider: at one extreme there is the "full fraud detection language"; at the other, you could just build stored procedures for the most common cases, and then write new stored procedures that reuse the more basic blocks to implement the various patterns.
What you are trying to do falls under the data mining umbrella, so you could also try to learn more about that field: maybe you can find a data mining package for your specific DB (you didn't specify which) and see if it helps you find common patterns in your data.
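Whichever front end you pick, the circular-trading template itself maps naturally onto cycle detection in a directed graph of trades. A minimal sketch with invented trade data:

```python
# Detect circular trading as simple cycles in a directed graph of trades.
# "A bought from B" becomes an edge B -> A (goods flow seller -> buyer);
# any consistent direction works, since a cycle stays a cycle.
from collections import defaultdict

def find_cycles(trades, max_len=5):
    """Return simple trading cycles of up to max_len people, as tuples."""
    graph = defaultdict(set)
    for buyer, seller in trades:
        graph[seller].add(buyer)

    cycles = set()

    def walk(start, node, path):
        if len(path) > max_len:
            return
        for nxt in graph.get(node, ()):
            if nxt == start and len(path) >= 2:
                # Rotate so each cycle is reported exactly once.
                i = path.index(min(path))
                cycles.add(tuple(path[i:] + path[:i]))
            elif nxt not in path:
                walk(start, nxt, path + [nxt])

    for trader in list(graph):
        walk(trader, trader, [trader])
    return cycles

trades = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(find_cycles(trades))  # → {('A', 'C', 'B')}
```

On a real transactions table you would run the same search over (buyer, seller) pairs pulled from the database, possibly restricted to a time window and to trades in the same stock.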
There are two courses, "AI" and "AI in Games", each with 15 students running for 15 weeks.
I want to keep them motivated and creative.
I know I want some kind of competition (obvious for the latter course).
Maybe something like Marathon Match or ICFP.
I will need good visualization, so it would be great if it already exists.
One idea was to write an AI for "Battle for Wesnoth", but I guess it's too diverse/boring.
Another idea was the game of Go, but that's too hard.
What are your ideas?
The students will work in groups of 3 for the 15 weeks.
MIT hosts a competition called BattleCode.
BattleCode is a real-time strategy game. Two teams of robots roam the screen, managing resources and attacking each other with different kinds of weapons. In BattleCode, however, each robot functions autonomously; under the hood it runs a Java virtual machine loaded with its team's player program. Robots in the game communicate by radio and must work together to accomplish their goals.
Teams of one to four students are given the BattleCode software and a specification of the game rules. Each team develops a player program, which will be run by each of their robots during BattleCode matches. Contestants often use artificial intelligence, pathfinding, distributed algorithms, and/or network communications to write their player. At the final tournaments, the autonomous players are pitted against each other in a dramatic head-to-head tournament. The final rounds of the MIT tournament are played out in front of a live audience, with the top teams receiving cash prizes.
(source: mit.edu)
BattleCode in action.
You are essentially given the BattleCode software from MIT, and your students can program the AI for their robots. There is a test suite so they can practice running their autonomous bots in a practice arena. Towards the end of the semester they can enter MIT's Open Tournament, where they compete with their AI robots against schools all over the nation. Up to $40,000 in cash and prizes is given away, as well as bragging rights for winning.
If you are looking to teach them about AI, pathfinding, swarm intelligence, etc., I can't think of a more fun way.
May the best AI bot win!
Wii gesture recognition using hidden Markov models.
I wouldn't count out Go. It's computationally hard for Go AI to compete with top human players, but the simple rules of Go (compared to Chess) make it a relatively easy game to write AI for. Your students' programs only need to compete against each other, not against Dan level human players. See An Introduction to the Computer Go Field and Associated Internet Resources for a lot of Go programming resources.
I think it's a good idea to select a theme challenging enough that it can't be completely solved, yet one whose real-world value the students can see, rather than a pure toy problem. My suggestions would thus be:
Word segmentation (e.g. convert "iamaboy" to "I am a boy")
Word sense disambiguation (e.g. "The apple is nice to eat": is the apple a fruit or a company?)
Optical character recognition
What I just listed are some of the more basic problems in natural language processing. If your students are more technically inclined, you can probably take it to the next level and let them tackle machine translation.
Empire: it's as addictive as anything, and there are open-source D versions (1 and 2) and a not-quite-free C++ version.