Entity extraction on large documents - ibm-watson

I need to extract entities from Word and PDF documents. Documents can be in the range of 10 to 20 pages. Are there scalable libraries/APIs available that we can plug into our processing pipeline? Any comparative study of different solutions would be helpful.

Take a look at Watson Natural Language Understanding (you'll need to get an IBM ID and then log in to see this content - don't worry, the cost is $0). With Watson Natural Language Understanding you will want to look at the API Explorer to find the correct API syntax to get the results you are looking for.
I also noticed that you mention Word/PDF documents. You will need to convert those using the Watson Discovery service, and then you can pass the converted documents to Watson Natural Language Understanding, which accepts JSON, text, or HTML input.
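As a rough illustration of that flow, here is a minimal Python sketch of calling the Natural Language Understanding analyze endpoint to pull entities out of text that has already been converted; the credentials, version date, and entity limit are placeholder assumptions, not values from the answer above.

    # Minimal sketch: extract entities from already-converted text with Watson NLU.
    # The username/password and version date below are placeholders (assumptions).
    import requests

    NLU_URL = "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze"

    def extract_entities(text, username, password, version="2017-02-27"):
        payload = {
            "text": text,  # the endpoint also accepts "html" or "url" input
            "features": {"entities": {"limit": 50}},
        }
        resp = requests.post(
            NLU_URL,
            params={"version": version},
            json=payload,
            auth=(username, password),
        )
        resp.raise_for_status()
        return resp.json().get("entities", [])

    # Example usage:
    # for entity in extract_entities(converted_text, "user", "pass"):
    #     print(entity["type"], entity["text"], entity.get("relevance"))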

Related

How is Watson Discovery different from Retrieve and Rank?

I'm trying to understand, from a technical perspective, how Watson Discovery works behind the scenes compared to Retrieve and Rank (which leverages Apache Solr).
Watson Discovery is based on Elasticsearch and exposes its own simplified query language for interaction. This allows structured queries that enable aggregations and content analysis, as well as free-text queries.
Comparing the two services: Retrieve and Rank enables search with a machine-learning ranking model on private data, while Discovery enables search and content analytics on a combination of private and public content with NLP enrichments.
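To give a feel for that simplified query language, here is a rough Python sketch against the Discovery query endpoint; the environment ID, collection ID, credentials, and the particular aggregation are placeholder assumptions used only for illustration.

    # Sketch: a Discovery query that mixes a free-text search with a structured
    # aggregation. environment_id, collection_id and credentials are placeholders.
    import requests

    def query_discovery(environment_id, collection_id, username, password):
        url = ("https://gateway.watsonplatform.net/discovery/api/v1/environments/"
               f"{environment_id}/collections/{collection_id}/query")
        params = {
            "version": "2017-11-07",
            "natural_language_query": "entity extraction",  # free-text part
            "aggregation": "term(enriched_text.entities.type,count:10)",  # structured part
            "count": 5,
        }
        resp = requests.get(url, params=params, auth=(username, password))
        resp.raise_for_status()
        return resp.json()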
In both Watson Discovery & Retrieve and Rank, you need to provide your own data.
But the difference lies in how they process the data. In R&R you need to train your own model, i.e. rank your answers against the kinds of questions that are likely to be asked of Watson. It uses another service, Document Conversion, to retrieve answers from the documents you provide (by dividing each document into small sections).
Discovery, on the other hand, acts as a service on top of R&R: once documents are uploaded, it applies its own cognitive search capability to look for the right answer to your query (basically among the documents you provided).
To better understand the process, please go through their documentation:
Retrieve and Rank
and Discovery
If you are used to Retrieve and Rank, Discovery can provide broadly similar results if you use Relevancy Training for Discovery.
As of now, Relevancy Training in Discovery is still in beta:
https://www.ibm.com/watson/developercloud/doc/discovery/train.html
Technically, as @tmarkiewicz has suggested, the Watson Discovery service uses Elasticsearch and has natural language processing capabilities. That is a good feature to have when it comes to enriching the answers and providing high usability.

IBM Watson service for identifying a particular characteristic such as "helpfulness" in a person's tweets or Facebook posts?

I am currently exploring three services for identifying whether a person's tweets or Facebook posts are helpful or not:
Personality Insights
Natural Language Understanding
Discovery
Will I need to write my own wrapper around these services to identify the helpfulness characteristic, or is there another way to just query and get the result?
Can anyone please advise which service I should use for this task?
Thanks
As Neil says, it all depends on how you define helpfulness.
Discovery:
If you want to use Discovery, you need some body of data to work with, and you can narrow it to the content you care about with filters. Discovery uses data analysis combined with cognitive intuition to take your unstructured data and enrich it so you can discover the information you need.
Personality:
If you want to use Personality Insights: it understands personality characteristics, needs, and values in written text. The service uses linguistic analytics to infer individuals' intrinsic personality characteristics, including the Big Five, Needs, and Values, from digital communications such as email, text messages, tweets, and forum posts.
Watson Knowledge Studio:
If you want to work with models for tweets, you can use WKS (Watson Knowledge Studio). This service provides easy-to-use tools for annotating unstructured domain literature and uses those annotations to create a custom machine-learning model that understands the language of the domain. The accuracy of the model improves through iterative testing, ultimately resulting in an algorithm that can learn from the patterns it sees and recognize those patterns in large collections of new documents. For example, if you want the model to learn about cars, you can simply provide some annotated examples to WKS.
It all depends on how you define helpfulness. Whether it is in general, or helpful to answering a question etc.
For Personality Insights, have a look at https://www.ibm.com/watson/developercloud/doc/personality-insights/models.html which has all the traits, as well as what they mean. The closest trait to helpfulness is probably Conscientiousness.
Neil
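If you go the Personality Insights route, a minimal sketch of calling its profile endpoint from Python could look like the following; the credentials and version date are placeholders, and reading out the Conscientiousness percentile is only one rough proxy for helpfulness.

    # Sketch: profile a batch of posts with Personality Insights and return the
    # Big Five "Conscientiousness" percentile as a rough helpfulness proxy.
    # Credentials and version date are placeholders (assumptions).
    import requests

    PI_URL = "https://gateway.watsonplatform.net/personality-insights/api/v3/profile"

    def conscientiousness(posts, username, password):
        text = "\n".join(posts)  # the service needs a reasonable amount of text
        resp = requests.post(
            PI_URL,
            params={"version": "2017-10-13"},
            headers={"Content-Type": "text/plain;charset=utf-8"},
            data=text.encode("utf-8"),
            auth=(username, password),
        )
        resp.raise_for_status()
        for trait in resp.json().get("personality", []):
            if trait["trait_id"] == "big5_conscientiousness":
                return trait["percentile"]
        return None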

When will we be able to use a custom corpus?

When will IBM make its Watson Q&A API capable of accepting a custom corpus?
Is there a roadmap I can see?
The Question and Answer service currently doesn't provide a way to use your own data. However, you can get similar or better results by combining Document Conversion and Retrieve and Rank.
You will use Document Conversion to convert your corpus documents (PDF, DOCX, HTML) into answer units that will be indexed by the Retrieve and Rank service.
The Retrieve and Rank service is built on top of Apache Solr, and once you load your data into the Solr index, you can create and train a Ranker (machine learning model that knows how to sort results).
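As a rough sketch of the first step, here is one way to call Document Conversion from Python to turn a PDF into answer units; the credentials and file path are placeholders. Each answer unit's title and content can then be indexed as a document in the Retrieve and Rank Solr collection.

    # Sketch: convert a PDF into ANSWER_UNITS with Document Conversion.
    # Credentials and the file path are placeholders (assumptions).
    import json
    import requests

    DC_URL = "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document"

    def to_answer_units(pdf_path, username, password):
        with open(pdf_path, "rb") as f:
            resp = requests.post(
                DC_URL,
                params={"version": "2015-12-15"},
                auth=(username, password),
                files={
                    "config": (None, json.dumps({"conversion_target": "ANSWER_UNITS"})),
                    "file": (pdf_path, f, "application/pdf"),
                },
            )
        resp.raise_for_status()
        return resp.json().get("answer_units", [])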
To expand on German's answer, also take a look at the Watson Natural Language Classifier (NLC) and Dialog services, which are additional building blocks for creating a custom Question and Answer application. NLC classifies text and allows you to trigger an action, and Dialog allows you to create and manage virtual conversations with your users.
Here is a great blog with an introduction to both NLC and Dialog. And another good blog that introduces the Watson Document Conversion and Retrieve and Rank services.
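For completeness, a minimal sketch of calling a trained NLC classifier from Python might look like this; the classifier ID and credentials are placeholders, and the classifier must already have been trained on your own labeled examples.

    # Sketch: classify a user utterance with a trained Natural Language Classifier.
    # classifier_id and credentials are placeholders (assumptions).
    import requests

    NLC_URL = "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

    def classify(text, classifier_id, username, password):
        resp = requests.post(
            f"{NLC_URL}/{classifier_id}/classify",
            json={"text": text},
            auth=(username, password),
        )
        resp.raise_for_status()
        return resp.json()["top_class"]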

Is it possible to get the keywords in the documents returned by Solr

Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents by the keywords inside them. Or do I have to work out the key words myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
Before implementing it, though, try checking the Analysis screen of the Solr (4+) Admin web UI, as it has a section that shows the terms/tokens a particular field actually generates.
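Assuming the Term Vector Component is enabled in solrconfig.xml and term vectors are stored for the field, a request along these lines returns the terms; the core name, field name, and /tvrh handler path follow the stock Solr example and may differ in your setup.

    # Sketch: ask Solr's TermVectorComponent for the terms (and term frequencies)
    # of the "content" field of matching documents.
    # Core name, field name and handler path are assumptions based on the
    # stock example configuration.
    import requests

    resp = requests.get(
        "http://localhost:8983/solr/mycore/tvrh",
        params={
            "q": "author:joe_blogs",
            "fl": "id",
            "tv.fl": "content",  # field to return term vectors for
            "tv.tf": "true",     # include term frequencies
            "wt": "json",
        },
    )
    resp.raise_for_status()
    print(resp.json().get("termVectors"))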
If these are not quite the keywords you are trying to produce, you may need a separate field that generates those keywords, possibly by using an UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel for the content in order to do some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.
What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this, but there are other tools you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Architecture). Rather than describe it here, here is a brilliant presentation.
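If you just want to see what a topic model produces before wiring up Mahout or UIMA, here is a tiny illustration using the gensim library (not mentioned above; purely a stand-in example of LDA on a few toy documents).

    # Illustration only: a tiny LDA topic model with gensim, as a stand-in for
    # the Mahout/UIMA pipelines mentioned above. The toy documents are made up.
    from gensim import corpora, models

    docs = [
        "solr search index query ranking",
        "watson entity extraction natural language",
        "nutch crawl web pages segments",
    ]
    tokenized = [doc.split() for doc in docs]

    dictionary = corpora.Dictionary(tokenized)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.print_topics():
        print(topic_id, words)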

Post-processing of pages crawled using Nutch

I have a set of pages crawled using Nutch, and I understand that these crawled pages are saved as segments. I want to extract certain key values from these pages and feed them to Solr as XML.
A sample situation: I have crawled a shopping website with many product listings. I want to extract key information such as the Name, Price, and Specs of each product and ignore the rest of the data, so that I can provide Solr with XML like
    <add>
      <doc>
        <field name="name">qwerty</field>
        <field name="price">123</field>
        <field name="specs">qwerty</field>
      </doc>
    </add>
This is so that, using Solr, I can sort the different product listings based on price.
Now, how can this extraction part be done? Does MapReduce come into the picture anywhere?
Turning raw web pages into information is not a trivial task. One tool used for this job is Boilerpipe. However, it won't give you a solution on a plate.
If you are working on a fixed target, you might just write your own procedural code to find the data you need. If you need to find this sort of thing in arbitrary HTML, you are facing a very hard problem with no off-the-shelf solutions.
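For the fixed-target case, a sketch of that procedural approach in Python might look like the following; the CSS selectors, Solr URL, and field names are made up for illustration and would have to match your actual pages and Solr schema.

    # Sketch: pull Name/Price/Specs out of a known page layout and post the
    # result to Solr as an update XML document.
    # The CSS selectors, Solr URL and field names are assumptions for illustration.
    import requests
    from bs4 import BeautifulSoup

    def extract_product(html):
        soup = BeautifulSoup(html, "html.parser")
        return {
            "name": soup.select_one(".product-title").get_text(strip=True),
            "price": soup.select_one(".product-price").get_text(strip=True),
            "specs": soup.select_one(".product-specs").get_text(strip=True),
        }

    def post_to_solr(product, solr_url="http://localhost:8983/solr/products/update"):
        # Values should be XML-escaped in real use; omitted here for brevity.
        fields = "".join(
            f'<field name="{k}">{v}</field>' for k, v in product.items()
        )
        xml = f"<add><doc>{fields}</doc></add>"
        resp = requests.post(
            solr_url,
            params={"commit": "true"},
            data=xml.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        resp.raise_for_status()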
