I run a website that allows users to write blog posts. I would really like to summarize the written content and use it to fill the <meta name="description" .../> tag, for example.
What methods can I employ to automatically summarize/describe user-generated content?
Are there any (preferably free) methods out there that have solved this problem?
(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)
Think of the task of summarization as a challenge to 'select the most important sentences' from the document.
The method described in The Automatic Creation of Literature Abstracts by H.P. Luhn (1958) is naive but actually performs quite well. Give it a shot.
If your website is in Python, coding this algorithm with NLTK (the Natural Language Toolkit) is a fun task.
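To give you a flavor, here is a rough sketch in Python (my own illustrative code, not Luhn's exact measure; it assumes NLTK's "punkt" and "stopwords" data have been downloaded, and names like luhn_summarize are made up):

from collections import Counter

import nltk
from nltk.corpus import stopwords

def luhn_summarize(text, num_sentences=3, top_words=20):
    # Significant words: the most frequent non-stopword alphabetic tokens.
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    freq = Counter(w for w in words if w not in stop)
    significant = {w for w, _ in freq.most_common(top_words)}

    def score(sentence):
        tokens = [w.lower() for w in nltk.word_tokenize(sentence)]
        hits = [i for i, w in enumerate(tokens) if w in significant]
        if not hits:
            return 0.0
        # Simplified version of Luhn's measure:
        # (significant words in the cluster)^2 / cluster length.
        return len(hits) ** 2 / (hits[-1] - hits[0] + 1)

    sentences = nltk.sent_tokenize(text)
    best = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in best)  # original order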
Make it predictable.
From a user's perspective, simply using the first paragraph is not bad at all.
Any automation is bound to fall flat in some cases, so I suggest displaying the first paragraph (perhaps truncated at some point) as the summary and offering an optional field to override it.
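For example, a minimal sketch of that fallback logic (function and field names are purely illustrative):

def meta_description(post_body, author_summary=None, max_len=160):
    # Prefer the author's hand-written summary when one was supplied.
    if author_summary:
        return author_summary.strip()
    # Otherwise fall back to the first paragraph, cut at a word boundary.
    first_paragraph = post_body.strip().split("\n\n")[0]
    if len(first_paragraph) <= max_len:
        return first_paragraph
    return first_paragraph[:max_len].rsplit(" ", 1)[0] + "..."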
I might try using Mechanical Turk or any number of other crowdsourcing options.
Another item to check out is the AutoSummary Semantic Analysis Engine, a SourceForge project.
Not a trivial task... You should look for articles or books on "extractive summarization".
A few starters could be:
Books:
Natural Language Processing with Python
Foundations of Statistical Natural Language Processing
Articles:
Language independent extractive summarization
Extractive summarization: how to identify the gist of a text
Extractive Summarization using Inter- and Intra- Event Relevance
Yahoo has a free API for this:
http://developer.yahoo.com/search/content/V1/termExtraction.html
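A call might look roughly like this; I'm taking the endpoint and the appid/context/output parameter names on trust from that docs page, so verify them there:

import requests

def extract_terms(text, app_id):
    # Endpoint and parameter names ("appid", "context", "output") are
    # as listed on the linked docs page; check them before relying on this.
    resp = requests.post(
        "http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction",
        data={"appid": app_id, "context": text, "output": "json"},
    )
    resp.raise_for_status()
    return resp.json()  # the service returns the extracted key terms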
Apple's patent 6424362, "Auto-summary of document content", contains sample code which might be useful...
This borders on artificial intelligence, so there's not going to be an "easy" solution out there, but there are products that target this problem.
Check out Copernic Summarizer, for one.
Noun phrases typically tend to be important elements of a sentence. Picking sentence(s) with a high density of noun phrases could yield a good summary. You could get noun phrases using a POS tagger.
For a good summary, it is desirable that it consists of meaningful, complete sentences; reading a broken sentence is slightly jarring.
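A rough sketch of that scoring idea, using NLTK's tagger plus a crude regular-expression chunker as a stand-in for proper noun-phrase extraction (assumes the tokenizer and tagger data are downloaded; function names are made up):

import nltk

# A crude NP chunk grammar: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def np_density(sentence):
    tokens = nltk.word_tokenize(sentence)
    if not tokens:
        return 0.0
    tree = chunker.parse(nltk.pos_tag(tokens))
    noun_phrases = [t for t in tree.subtrees() if t.label() == "NP"]
    return len(noun_phrases) / len(tokens)

def pick_summary_sentences(text, n=2):
    sentences = nltk.sent_tokenize(text)
    return sorted(sentences, key=np_density, reverse=True)[:n]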
Alternatively, when posting the article, the author can highlight the keywords to be used in the description, which can then be put into the meta description tag automatically.
Related
I was wondering if there's any open-source tool that I can use to design an AI which would read sentences from a file, understand their structure (by breaking them into their basic components) and then report back those components in detail.
I will provide it with sets of words belonging to different components of sentences (like a set of prepositions, a set of verbs, a set of adjectives, etc.) to help it determine the different components.
I have an elaborate plan for it, but my question is whether there's a tool available or do I have to program it from scratch?
You are looking for a part-of-speech (POS) tagger, of which there are many. They are actually not all that hard to write (I made a simple one in grad school), but robust taggers do take a fair amount of work.
Here is one that is part of the popular NLTK package for Python.
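For example, assuming the tokenizer and tagger models have been downloaded via nltk.download:

import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# A list of (token, tag) pairs, e.g. ('fox', 'NN'), ('jumps', 'VBZ');
# the exact tags depend on the tagger model in use.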
Just as an aside, there is a lot more to natural language understanding than POS, but POS tags could be part of the feature vector that you feed into a larger ML/AI algorithm.
Working on testing a React component, I was reading the docs and found scryRenderedDOMComponentsWithClass. I'm having trouble understanding this function because its name is unpronounceable to me, so I can't see how the naming maps to a mental model of what it does. (There are a number of related names, such as scryRenderedDOMComponentsWithTag.)
What does the scry part of this method name refer to? Scary? Scurry? What concept is this name trying to illustrate?
Short answer
"Scry" in this context just means "find all". See this comment on ReactTestUtils.scryRenderedComponentsWithClass. It's a single word, not an abbreviation, and it's pronounced like "cry" but with an "s" at the beginning.
Longer (and nerdier) answer
Elsewhere in that same file, you'll see a reference to DOM.scry:
/**
* Todo: Support the entire DOM.scry query syntax. For now, these simple
* utilities will suffice for testing purposes.
* @lends ReactTestUtils
*/
zpao explains in a comment on a GitHub issue:
That's a reference to an internal Facebook module. It's basically querySelectorAll with fallback behavior for handling old browsers and special cases. It is pretty unremarkable and doesn't actually translate super well here (except maybe a scryRenderedDOMComponentsWithQSA or something, but meh). We're working on improving the testing in other ways so I don't think there's anything we really want to do with this right now.
jimfb takes it a bit further in another GitHub issue, explaining that the name is a reference to Dungeons & Dragons:
Back in the day, we had a bunch of D&D fans on the team.
For reference:
http://www.dandwiki.com/wiki/SRD:Scrying
http://www.dandwiki.com/wiki/SRD3e:Scry_Skill
https://en.wikipedia.org/wiki/Scrying
Historically, we've used scry to indicate a helper that finds a set of results. As the framework matures, we should start choosing function names based on what the functions actually do instead of fantasy words that have very little meaning to the typical developer.
Though I would agree that the word has very little meaning to most, it's worth noting that "scry" is a real English word:
scry
[skrahy]
verb (used without object), scried, scrying.
to use divination to discover hidden knowledge or future events, especially by means of a crystal ball.
Interestingly, according to the data from Google's Ngram Viewer, it seems that the word fell out of normal usage in the early 19th century and then wallowed in obscurity until the 1980s, presumably after D&D gained popularity.
So I can't say I object to jimfb calling it a "fantasy word", especially considering the kind of imagery my imagination conjures up when I hear it.
I want to implement a conversation system in my RPG (trying to make the AI as advanced as possible). Conversation as in, the player types:
"Hi, I would like a beer"
and the bartender would respond with
"Coming right up"
and then hand the player a beer.
I've got some ideas and some things I'd like to try, but first I would like to look at what's already been done. But extensive Googling does not turn up anything, so I'm wondering: has this been done or is there research being done in it? (I know this is very complicated, but I'm willing to give it a shot.)
Sure it has. Have a look at the "Eliza" program and its descendants. There's also a Wikipedia article on chatterbots that might interest you. Also look at AIML as a way to represent the rules you might use.
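To give a feel for the rule style those systems use, here is a toy sketch of the bartender exchange; the patterns and responses are invented for illustration and are nowhere near a full AIML engine:

import random
import re

# Each rule: a regex over the player's line, plus response templates.
RULES = [
    (re.compile(r"\bI (?:would like|want) (?:a|an|some) (.+)", re.I),
     ["Coming right up, one {0}.", "One {0} it is."]),
    (re.compile(r"\b(hi|hello|hey)\b", re.I),
     ["Welcome in. What'll it be?"]),
]

def bartender(line):
    for pattern, templates in RULES:
        match = pattern.search(line)
        if match:
            return random.choice(templates).format(*match.groups())
    return "The bartender shrugs."

print(bartender("Hi, I would like a beer"))  # e.g. "Coming right up, one beer."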
For an advanced design, look up the game "Façade". The game's site describes the technologies used and gives links to relevant papers. There was also recently an extensive article in Gamasutra about this, called Beyond Façade: Pattern Matching for Natural Language Applications.
You may also want to look into the Turing Test and its relevant scientific following/conferences/publications to see what has been done in the humanizing of AI speech.
For a linguistics course we implemented Part of Speech (POS) tagging using a hidden Markov model, where the hidden variables were the parts of speech. We trained the system on some tagged data, then tested it and compared our results with the gold data.
Would it have been possible to train the HMM without the tagged training set?
In theory you can do that. In that case you would use the Baum-Welch algorithm. It is described very well in Rabiner's HMM tutorial.
However, having applied HMMs to part-of-speech tagging, the error rate you get with the standard form will not be so satisfying. Baum-Welch is a form of expectation maximization, which only converges to a local maximum. Rule-based approaches beat HMMs hands down, IIRC.
I believe the Natural Language Toolkit (NLTK) for Python has an HMM implementation for that exact purpose.
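If I remember the API right, the unsupervised path looks roughly like this; treat the exact names (HiddenMarkovModelTrainer, train_unsupervised, max_iterations) as things to verify against the NLTK docs:

from nltk.tag.hmm import HiddenMarkovModelTrainer

# Toy unlabeled corpus; each token is paired with None because the
# trainer reads (token, tag) tuples and ignores the tag here.
corpus = ["the dog barks", "the cat sleeps", "a dog sleeps"]
sequences = [[(w, None) for w in line.split()] for line in corpus]
symbols = sorted({w for line in corpus for w in line.split()})
states = ["S0", "S1", "S2"]  # arbitrary hidden states, not real POS labels

trainer = HiddenMarkovModelTrainer(states=states, symbols=symbols)
tagger = trainer.train_unsupervised(sequences, max_iterations=10)
print(tagger.tag("the dog sleeps".split()))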
My NLP course was a couple of years ago, but I believe that without tagging, the HMM could help determine the symbol emission/state transition probabilities of n-grams (i.e. the odds of "world" occurring after "hello"), but not parts of speech. It needs the tagged corpus to learn how the POS tags interrelate.
If I'm way off on this let me know in the comments!
I would like to know if there are any tools that can help me model C applications, i.e. function-oriented (procedural) programming.
E.g. I'm currently building a shared library.
But to communicate my design visually, I need something like UML. I would like to do this so that the person reviewing my design need not read through 100s of pages of functions, variables and so on.
I have read about UML for C, which I'm considering.
If there is anything better out there, please let me know.
The bottom line is to visualize the design of C applications and modules without reading through 100s of pages of text, because it takes time and is difficult for the reviewers.
Any help in this area from the experts here would be much appreciated.
Thanks.
Well-written text documentation takes you far, much further than any UML diagram could ever achieve.
You should split this in two parts:
What do you want to say?
What's the best way to say it?
Whatever formalism you use to answer the second part, you should be sure it's not ambiguous.
The good thing about UML is that much of the semantics is already defined by the language, so you don't have to explain what the boxes, lines and arrows in a collaboration diagram mean.
But most importantly, documenting something means creating a path for others to understand the subject you are documenting. A very precise description that offers no clue on how to read it is as good as none. So, use UML, finite state machines, ER diagrams, plain English, whatever you want, but be sure to include a logical path that your "readers" can follow to understand what's going on.
I had a friend who was a fan of "precision at all costs", and he would ask us to go through every detail before some sort of meaning emerged.
I once asked him to try this experiment: on his next trip to an unknown city, he would have to carry the most precise map he could get. Better yet, he would have to carry a 1:1 map of the city with every single detail reported exactly to scale. That way he couldn't get lost!
He declined, but I would have loved to see him try to use that map. Even just folding it! :)
Use whatever you like. UML is not a strict standard, but many devs use it and understand it. If it helps you communicate with other people and document your work, it's for you. If it just takes too much time and you think it's not effective, drop it. Also, don't bother with every detail; as long as it resembles UML and your team can work with it, it's fine.
It's meant to help you, not waste your time.