Document Conversion Watson service not working? - solr

I've been trying to use the IBM Watson Document Conversion service with the demo PDF, but it isn't splitting the document into smaller pieces. All it does is create one answer unit, which is really long:
"text": "Watson is an artificially intelligent computer system capable of answering questions posed in natural language,[2] developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci. Watson was named after IBM's first CEO and industrialist Thomas J. Watson.[3][4] The computer system was specifically developed to answer questions on the quiz show Jeopardy![5] In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.[3][6] Watson received the first place prize of $1 million.[7] Watson had access to 200 million pages of structured and unstructured content consuming four terabytes of disk storage[8] including the full text of Wikipedia,[9] but was not connected to the Internet during the game.[10][11] For each clue, Watson's three most probable responses were displayed on the television screen. Watson consistently outperformed its human opponents on the game's signaling device, but had trouble responding to a few categories, notably those having short clues containing only a few words. In February 2013, IBM announced that Watson software system's first commercial application would be for utilization management decisions in lung cancer treatment at Memorial Sloan- Kettering Cancer Center in conjunction with health insurance company WellPoint.[12] IBM Watson's former business chief Manoj Saxena says that 90% of nurses in the field who use Watson now follow its guidance.[13]"
Thanks in advance!

Unfortunately, that demo PDF is not the best document to use: currently, Answer Units are split based on heading tags (h1-h6), and that PDF doesn't contain any headings. =(
If you set the conversion_target to NORMALIZED_HTML, you'll be able to see the converted PDF before it is split up into Answer Units. It will contain paragraphs but no headings.
In the future, we expect to also allow splitting Answer Units by paragraph, but that hasn't been released yet.
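For reference, here's a minimal sketch of calling the service from Python to inspect the converted output. The endpoint path, version date, and multipart field names are recalled from the old (now-retired) docs, so treat them as assumptions rather than a verified contract:

```python
import json
import requests

# Placeholder Bluemix service credentials (the service used basic auth).
USERNAME = "your-username"
PASSWORD = "your-password"
URL = "https://gateway.watsonplatform.net/document-conversion/api/v1/convert_document"

# Ask for NORMALIZED_HTML to see what the converter produces before the
# Answer Unit split; switch to "ANSWER_UNITS" to get the split output.
config = {"conversion_target": "NORMALIZED_HTML"}

with open("demo.pdf", "rb") as pdf:
    response = requests.post(
        URL,
        auth=(USERNAME, PASSWORD),
        params={"version": "2015-12-15"},  # assumed API version date
        files={
            "config": ("config.json", json.dumps(config), "application/json"),
            "file": ("demo.pdf", pdf, "application/pdf"),
        },
    )

print(response.text)
```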
UPDATE:
We updated the PDF on the demo site with one that's a much better example.

Related

Watson assistant algorithm used

Can anyone tell me what algorithm is used to classify intents and understand entities in Watson Assistant? Have they published any papers or articles about this?
Yes, they published this paper explaining, at a high level, how Watson works. For more information you should read about Cognitive Systems, but bear in mind that it's not just one algorithm: many approaches are combined to get the desired result.
Another area worth studying, if this interests you, is the computer science field of Information Retrieval, which draws on many subjects to understand what the user wants and return the needed information. The book Modern Information Retrieval is a good starting point.
According to IBM Developer Answers:
"Intents are classified using an SVM, with some pre training by IBM. entities use a fuzzy matching algorithm."
https://developer.ibm.com/answers/questions/387916/watson-conversation-algorithm/
Support Vector Machine (SVM) is a supervised machine learning algorithm.
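To make that concrete, here is a toy sketch of the general approach the quoted answer describes (TF-IDF features plus a linear SVM, via scikit-learn). It is illustrative only, with invented training utterances; it is not Watson's actual pipeline, whose details IBM has not published:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented training set: utterances labeled with their intent.
training_utterances = [
    "what's the weather like today",
    "will it rain tomorrow",
    "play some jazz music",
    "put on my workout playlist",
]
intents = ["weather", "weather", "music", "music"]

# Vectorize the text and train a linear SVM on the labeled examples.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(training_utterances, intents)

# Classify a new utterance into one of the known intents.
print(model.predict(["will it snow tomorrow"]))  # -> ['weather']
```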

IBM Watson for Oncology API

I'll be short to save your time :)
I'm new at StackOverflow and also new with IBM Watson.
We are building an EMR (electronic medical records) system and would be glad to enhance it with Watson cognitive capabilities for healthcare.
Where do I start from?
Is there anyone here who has used a cognitive approach for assisted medical decision making? Can anyone give me some orientation?
I thought of starting with Q&A for doctors, but Q&A has been deprecated by IBM. Predictive analytics would also be exciting for physicians; however, what is the starting point?
Thank you beforehand!
I think you're referring to a deprecated Bluemix API for health. One thing you can do is use the Retrieve and Rank API on a trusted set of documents.
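For illustration, a minimal sketch of querying Retrieve and Rank once a Solr document collection and trained ranker exist. The /fcselect path and parameters are recalled from the old service docs, and every ID below is a placeholder, so treat all of it as an assumption:

```python
import requests

USERNAME = "your-username"
PASSWORD = "your-password"
BASE = "https://gateway.watsonplatform.net/retrieve-and-rank/api/v1"
CLUSTER_ID = "sc1234-placeholder"   # hypothetical Solr cluster ID
COLLECTION = "medical_docs"         # hypothetical collection name
RANKER_ID = "ranker-placeholder"    # hypothetical trained ranker ID

# The fcselect handler ran a Solr query and re-ranked results
# with the trained ranker.
response = requests.get(
    f"{BASE}/solr_clusters/{CLUSTER_ID}/solr/{COLLECTION}/fcselect",
    auth=(USERNAME, PASSWORD),
    params={
        "ranker_id": RANKER_ID,
        "q": "differential diagnosis for acute chest pain",
        "wt": "json",
    },
)

# Standard Solr response shape: ranked documents under response.docs.
for doc in response.json().get("response", {}).get("docs", []):
    print(doc.get("id"), doc.get("title"))
```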
Yes, I have used the following for health with IBM Watson:
Reading chest X-rays - https://www.ibm.com/watson/developercloud/doc/visual-recognition/
Reading EKGs - https://www.ibm.com/watson/developercloud/doc/visual-recognition/
Patient diagnosis for chest pain - https://www.ibm.com/watson/developercloud/dialog.html
Physical exam - we started to use Retrieve and Rank for machine learning over a patient's physical exams over the years. - https://www.ibm.com/watson/developercloud/retrieve-rank.html
Speech to Text (the patient telling Watson where it hurts) - https://www.ibm.com/watson/developercloud/speech-to-text.html
As you can see, there are many different Watson APIs.

Chat/conversation database

As a personal project, I'm trying to build a simulated AI that answers based on information it has learned, plus internet searches to give more details than what the system already knows.
I took the example of a child: when he's born he needs to learn everything; he hears a lot and then proposes some answers, and his mom/dad tell him whether the answers are suitable or not.
To do that, I want to store a lot of chat conversations in a Hadoop system and parse all of those conversations to determine which answers are given most frequently. From that I want to build a neural database that contains conversation types together with the determined answers.
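To make the idea concrete, here's a toy version of the counting step, stripped of Hadoop: the conversation pairs below are invented, and a plain Counter stands in for the MapReduce job:

```python
from collections import Counter, defaultdict

# Invented (prompt, reply) pairs standing in for parsed chat logs.
conversations = [
    ("how are you", "fine thanks"),
    ("how are you", "fine thanks"),
    ("how are you", "not bad"),
    ("what's your name", "i'm a bot"),
]

# Tally every reply seen for each normalized prompt.
answers_by_prompt = defaultdict(Counter)
for prompt, reply in conversations:
    answers_by_prompt[prompt.lower().strip()][reply] += 1

# Keep the most frequent reply per prompt as the "determined answer".
best = {p: c.most_common(1)[0][0] for p, c in answers_by_prompt.items()}
print(best["how are you"])  # -> 'fine thanks'
```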
So my question is: can I legally find, somewhere on the internet, one or more chat/conversation datasets in any format (file, database, CSV, ...)?
The more data I have, the better my chances of determining the answers correctly ;)
Thanks for the help and cheers,
Frédéric
PS: English is not my mother tongue
There is a collection of conversational datasets, most of them gathered from publicly available sources. The most interesting ones for you might be the Santa Barbara corpus (although it's a transcript of spoken conversations) or the movie dialog dataset.
Here is a fairly comprehensive collection of human-human and human-machine text dialogue datasets, as well as audio dialogue datasets.
https://breakend.github.io/DialogDatasets/
Credit goes to "Default picture"'s answer above for the extensive library of human-human and human-machine conversation resources at https://breakend.github.io/DialogDatasets/, including the Let's Go dialogs provided by the Research Center at CMU (https://github.com/DialRC/LetsGoDataset). Those resources are also used to train conversational agents at https://any.company/.
The best way to get a chat dataset is to generate one on your own: you know exactly what you want. That said, IRC has some chat datasets, one of which was used in this research.

Access to the Music Genome Database

I'm interested in finding songs based on attributes (minor key tonality, etc.). These are the kinds of things Pandora lists in its explanations of why it picks songs, but with Pandora I have to give it songs/artists.
Is there any way to get the Music Genome database (or something similar) so I can search for songs based on attributes that someone else has already cataloged?
You can use Gracenote's Global Media Database and search with Track-level attributes.
"Gracenote's Media Technology Lab scientists and engineers take things further by utilizing technologies like Machine-Listening and Digital Signal Processing to create deep and detailed track level descriptors such as Mood and Tempo."
I don't think there is any way to access this proprietary data, something I asked them about long ago. It seems to me they want to protect this unique part of their system; after all, they've paid for the man hours to label each song. Even if Pandora releases a developer API, which they've hinted at, I doubt it will provide access to the Music Genome information.
Give Echo Nest a shot!
To add to the above answers, Pandora's statement (as viewed using the above link in combination with the Internet Archive) was:
"A number of folks also asked about the prospect for an open API, to allow individual developers to start building on the platform. We're not there yet, but it's certainly food for thought."
Given that this was seven years ago, I think their decision is pretty clear.

Google Summer of Code: web classification dataset

I heard that Google hosted (or will host) a web classification competition, providing a large dataset (170k+ documents) of websites classified into multiple categories (sports, computers, science, etc.). I looked around their Summer of Code websites for 2009 through 2011 but didn't find anything. Does anybody know where I can get that dataset?
I think I found it (although I'm not sure the data was provided by Google): the ECML/PKDD 2010 Discovery Challenge Data Set contains 22 training labels (i.e., labels about the content), URLs and hyperlinks, content-based and link-based web spam features, term frequencies, and Natural Language Processing features.