When will we be able to use a custom corpus?

When will IBM make its Watson Q&A API capable of accepting a custom corpus?
Is there a roadmap I can see?

The Question and Answer service currently doesn't provide a way to use your own data. However, you can get similar or better results by combining Document Conversion and Retrieve and Rank.
You will use Document Conversion to convert your corpus documents (PDF, DOCX, HTML) into answer units that will be indexed by the Retrieve and Rank service.
The Retrieve and Rank service is built on top of Apache Solr, and once you load your data into the Solr index, you can create and train a Ranker (machine learning model that knows how to sort results).
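The two-stage flow described above (split documents into answer units, retrieve candidates from an index, then re-sort them with a ranker) can be sketched in plain Python. This is a toy illustration and not the Watson API: `split_into_answer_units`, `retrieve`, and `rerank` are hypothetical names, retrieval is simple keyword overlap rather than Solr, and the "ranker" is a hand-written score standing in for a trained model.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_into_answer_units(document):
    """Split a document into answer units, here one per blank-line-separated paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]

def retrieve(query, units, top_k=3):
    """Stage 1: return up to top_k units sharing at least one token with the query."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(u))), u) for u in units]
    hits = [(s, u) for s, u in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [u for _, u in hits[:top_k]]

def rerank(query, candidates):
    """Stage 2: re-sort candidates by overlap normalized by unit length.

    A real ranker would be a trained model; this fixed formula is only a stand-in.
    """
    q = set(tokenize(query))

    def score(unit):
        tokens = tokenize(unit)
        return len(q & set(tokens)) / (len(tokens) or 1)

    return sorted(candidates, key=score, reverse=True)

corpus = (
    "Solr stores the index.\n\n"
    "A ranker sorts results after retrieval.\n\n"
    "Bananas are yellow."
)
question = "how does the ranker sort results"
units = split_into_answer_units(corpus)
print(rerank(question, retrieve(question, units)))
```

The split between a cheap first-pass retrieval and a more careful second-pass ordering is the design idea; the scoring functions themselves are placeholders.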

To expand on German's answer, also take a look at the Watson Natural Language Classifier (NLC) and Dialog services, which are additional building blocks for creating a custom Question and Answer application. NLC classifies text and allows you to trigger an action, and Dialog allows you to create and manage virtual conversations with your users.
Here is a great blog with an introduction to both NLC and Dialog. And another good blog that introduces the Watson Document Conversion and Retrieve and Rank services.

Related

How to use ML models in Vespa.ai?

We are trying to use an ML model in Vespa on textual data that is already stored in Vespa. Can somebody help us with the following:
An example of an ONNX model trained with scikit-learn and used in Vespa.
Where to add preprocessing steps before model training and prediction when using an ONNX model in Vespa, with an example.

This is a very broad question and the answer very much depends on what your goals are. In general, the documentation for using an ONNX model in Vespa can be found here:
https://docs.vespa.ai/documentation/onnx.html
An example that uses an ONNX BERT model for ranking can be found in the Transformers sample application:
https://github.com/vespa-engine/sample-apps/tree/master/transformers
Note that both these links assume that you have an existing model. In general, Vespa is a serving platform and not usually used in the model training process. As such Vespa doesn't really care where your model comes from, be that scikit-learn, pytorch or any other system. ONNX is a general format for ML model exchange between various systems.
However, there are some foundational ideas that I think may help clarify things. Vespa currently considers all ML models to have numeric inputs and outputs (in the form of tensors). This means you can't directly feed text into your model and have text come out on the other side. Most textual data these days is encoded to some form of numeric representation such as embedding vectors, or, as the BERT example above shows, text is tokenized such that each token gets its own vector representation. After model computation, embedding vectors or token-set representations can be decoded back to text.
Vespa currently handles the computational part; the (pre-)processing that encodes and decodes text to embeddings or other representations is currently up to the user. Vespa does offer a rich set of features to help out in this regard in the form of document and query processors. So you can create a document processor that encodes the text of each incoming document to some representation before storing it. Likewise, a searcher (query processor) can be created that encodes incoming textual queries to a compatible representation before documents are scored against it.
So, in general, you would train your models outside of Vespa using whatever embedding or tokenization strategies are necessary for your model. When deploying the Vespa application you add the models with any required custom processing code, which is used when feeding or querying Vespa.
If you have a more concrete example of what you are trying to achieve I could be more specific.
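The point about numeric inputs can be illustrated with a small sketch of the encoding step that a document processor or searcher would perform before the model sees the data. This is plain Python, not Vespa's API; `build_vocab`, `encode`, and `MAX_LEN` are hypothetical, and a real application would reuse whatever tokenizer the model was trained with.

```python
import re

MAX_LEN = 8            # fixed input length the (hypothetical) model expects
PAD_ID, UNK_ID = 0, 1  # reserved ids for padding and unknown tokens

def build_vocab(texts):
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for text in texts:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, max_len=MAX_LEN):
    """Turn text into a fixed-length list of token ids.

    Long input is truncated, short input is padded, and tokens
    missing from the vocabulary map to UNK_ID.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ids = [vocab.get(t, UNK_ID) for t in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))

vocab = build_vocab(["Vespa serves ONNX models", "models need numeric input"])
print(encode("ONNX models need tensors", vocab))
```

Whatever encoding is chosen, the same function has to be applied during training, at feed time, and at query time, otherwise the ids the model receives will not match the ones it learned.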

How to prepare data for a domain-specific chatbot

I am trying to build a chatbot. All the chatbots I have seen are built from structured data; I looked at Rasa, IBM Watson, and other well-known bots. Is there any way to convert unstructured data into some sort of structure that can be used for bot training? Consider the paragraph below:
Packaging unit
A packaging unit is used to combine a certain quantity of identical items to form a group. The quantity specified here is then used when printing the item labels so that you do not have to label items individually when the items are not managed by serial number or by batch. You can also specify the dimensions of the packaging unit here and enable and disable them separately for each item.
It is possible to store several EAN numbers per packaging unit since these numbers may differ for each packaging unit even when the packaging units are identical. These settings can be found on the Miscellaneous tab:
There are also two more settings in the system settings that are relevant to mobile data entry:
When creating a new item, the item label should be printed automatically. For this reason, we have added the option ‘Print item label when creating new storage locations’ to the settings. When using mobile data entry devices, every item should be assigned to a storage location, where an item label is subsequently printed that should be applied to the shelf in the warehouse to help identify the item faster.
How can I build a bot from data like this? Any lead would be highly appreciated. Thanks!
Would the idea in this picture work? (image: just_a_thought)

The data you are showing seems to be a good candidate for passage search. Basically, you want to answer a user's question with the most relevant paragraph found in your training data. This use case is handled by the Watson Discovery service, which can analyze unstructured data like yours; you then query the service with input text, and it answers with the closest passage found in the data.
In my experience, you can also get good results by implementing your own custom TF/IDF algorithm tailored to your use case (TF/IDF is a simple similarity measure that, among other things, down-weights stopwords for you).
If your goal is instead to bootstrap a rule-based chatbot from this kind of data, the data is not ideal. For a rule-based chatbot, the best data would be actual conversations in which users ask questions about the target domain and a subject matter expert answers them. With your data you might at least be able to do some analysis that helps you pinpoint the relevant topics and domains the chatbot should handle. However, I think you will have a hard time using it to bootstrap a set of intents (the questions users will ask) for a rule-based chatbot.
TL;DR
If I wanted to use a Watson service, I would start with Watson Discovery. Alternatively, I would implement my own search algorithm, starting with TF/IDF (which maps rather nicely to your proposed solution).
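A minimal sketch of the TF/IDF passage search mentioned above, using only the Python standard library. The function names are hypothetical, the passages are shortened from the question's sample text, and the weighting (raw term count times `log(N / df)`, compared by cosine similarity) is one common variant, not necessarily the one the answer had in mind.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(passages):
    """Return (idf table, list of TF/IDF vectors), one vector per passage."""
    tokenized = [tokenize(p) for p in passages]
    n = len(passages)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # A term present in every passage gets idf = log(1) = 0, so it is ignored.
    idf = {term: math.log(n / count) for term, count in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]

def vectorize(tokens, idf):
    """Weight each known term by its count times its idf; unknown terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_passage(question, passages, idf, vectors):
    """Return the passage most similar to the question, with its score."""
    q = vectorize(tokenize(question), idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(passages)), key=scores.__getitem__)
    return passages[best], scores[best]

passages = [
    "A packaging unit is used to combine a certain quantity of identical items to form a group.",
    "It is possible to store several EAN numbers per packaging unit.",
    "When creating a new item, the item label should be printed automatically.",
]
idf, vectors = build_index(passages)
print(best_passage("How many EAN numbers can I store?", passages, idf, vectors)[0])
```

For a real corpus, a score threshold is worth adding so the bot can say it has no answer instead of returning the least bad passage.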

Entity extraction on large documents

I need to extract entities from Word and PDF documents. Documents are typically 10 to 20 pages long. Are there scalable libraries or APIs that we can plug into our processing pipeline? Any comparative study of different solutions would be helpful.

Take a look at Watson Natural Language Understanding (you'll need to get an IBM ID and then log in to see this content; don't worry, the cost is $0). With Watson Natural Language Understanding, you will want to look at the API Explorer to find the correct API syntax for the results you are looking for.
I also noticed that you mention Word/PDF documents. You will need to convert those using the Watson Discovery service, and then you can pass the converted documents to Watson Natural Language Understanding, which takes JSON, text, or HTML inputs.
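As a rough illustration of the "API syntax" step, the sketch below builds a JSON request body that asks only for entities from already-converted text or HTML. The helper `build_entities_request` is hypothetical, and the body shape (`text` or `html` plus a `features.entities` object with a `limit`) is an assumption based on the public Natural Language Understanding API reference, so check it against the current documentation before use. No network call is made here.

```python
import json

def build_entities_request(content, content_type="text", limit=50):
    """Build a JSON-serializable request body asking for entities only.

    content_type selects the input key: "text" for plain text,
    "html" for converted HTML. The body shape is an assumption.
    """
    if content_type not in ("text", "html"):
        raise ValueError("content_type must be 'text' or 'html'")
    if not content.strip():
        raise ValueError("content must not be empty")
    return {
        content_type: content,
        "features": {"entities": {"limit": limit}},
    }

body = build_entities_request("IBM is headquartered in Armonk, New York.")
print(json.dumps(body, indent=2))
```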

How Watson Discovery is different from Retrieve and Rank?

I'm trying to understand, from a technical perspective, how Watson Discovery works behind the curtains compared to Retrieve and Rank (which leverages Apache SOLR)?

Watson Discovery is based on Elasticsearch and exposes its own simplified query language for interaction. This allows for structured queries that enable aggregations and content analysis, as well as free-text queries.
Comparing the two services: Retrieve and Rank enables search with a machine learning rank model on private data while Discovery enables search and content analytics on a combination of private and public content with NLP enrichments.

In both Watson Discovery and Retrieve and Rank, you need to provide your own data.
The difference lies in how they process the data. In R&R you need to train your own model, i.e. rank your answers against all the probable types of questions that can be asked of Watson. It uses another service, Document Conversion, to retrieve the answer from the documents you provide (by dividing each document into small sections).
Discovery, in contrast, acts as a service on top of R&R: once documents are uploaded, it applies its own cognitive search capability to look for the right answer to your query (basically among the documents you provided).
For a better understanding of the process, please go through the documentation for Retrieve and Rank and for Discovery.

If you are used to Retrieve and Rank, Discovery can provide more or less similar results if you use Relevancy Training for Discovery.
As of now, Relevancy Training in Discovery is still in Beta.
https://www.ibm.com/watson/developercloud/doc/discovery/train.html
Technically, as @tmarkiewicz has suggested, Watson Discovery uses Elasticsearch and has Natural Language Processing capabilities. It's a good feature to have when it comes to enriching the answers and providing high usability.

IBM Watson service for identifying a particular characteristic such as "helpfulness" in a person's tweets or Facebook posts?

I am currently exploring three services for identifying whether a person's tweets or Facebook posts are helpful or not:
Personality Insights
Natural Language Understanding
Discovery
Will I need to write my own wrapper on top of these services to identify the helpfulness characteristic, or is there another way to just query and get a result?
Can anyone please advise which service I need to use for this task?
Thanks

As Neil says, it all depends on how you define helpfulness.
Discovery:
If you want to use Discovery, you need a data source to draw from, and you can narrow the data down to what you want with a filter. Discovery uses data analysis combined with cognitive intuition to take your unstructured data and enrich it so you can discover the information you need.
Personality Insights:
If you want to use Personality Insights, you can understand personality characteristics, needs, and values in written text. The service uses linguistic analytics to infer individuals' intrinsic personality characteristics, including Big Five, Needs, and Values, from digital communications such as email, text messages, tweets, and forum posts.
Watson Knowledge Studio:
If you want to work with models for tweets, you can use WKS (Watson Knowledge Studio). This service provides easy-to-use tools for annotating unstructured domain literature and uses those annotations to create a custom machine-learning model that understands the language of the domain. The accuracy of the model improves through iterative testing, ultimately resulting in an algorithm that can learn from the patterns that it sees and recognize those patterns in large collections of new documents. For example, if you want to learn about cars, you can simply give some models to WKS.

It all depends on how you define helpfulness: whether you mean helpfulness in general, helpfulness in answering a question, etc.
For Personality Insights, have a look at https://www.ibm.com/watson/developercloud/doc/personality-insights/models.html which has all the traits, as well as what they mean. The closest trait to helpfulness is probably Conscientiousness.
Neil