What is the data format required for input in the IBM Watson cloud product?

I'm having a hard time figuring out what type of data Watson accepts: RDF triples, relational data, delimited text, etc.
There's really no documentation anywhere.
Does anyone know?

Watson currently accepts unstructured English prose in HTML, Word documents, plain text, and certain formats of PDF.
Some API documentation can be found here: https://www.ibmdw.net/watson/wp-content/uploads/sites/19/2013/11/An-Ecosystem-Of-Innovation-Creating-Cognitive-Applications-PoweredByWatson.pdf
You can also find a bit more at the bottom of the mobile developer challenge page at ibmwatson.com (see 'Helpful Hints about Watson').
If there's other documentation you're looking for, specific feedback would be helpful to pass on.

Related

Matching article text against pre-existing list of categories

I'm new to Azure Cognitive Services, and while I'm pretty sure it can help me solve my problem, I don't quite understand which part of it to use...
Here's what I want to do:
We have blog posts, say ~1k, and those blog posts all have categories and tags (multiple each). What I want to do is "guess" the right categories/tags for each article based on the content, and then present them to the editor as suggestions at the time of input ("looks like this article is about: health, well-being, ..."). The ~1k articles we already have in the system are currently correctly tagged/categorized, so I'd like to use these as the data source for this "guessing".
I've used Azure Search before, and it seems like some combination of EntityRecognition and KeyPhraseExtraction might be a step in the right direction? Azure Cognitive Services also seems to have an API that supports Text Analytics that would do something similar. I'm a bit confused about why these are two different things (or are they not?)
This also seems like an entirely common problem (matching text against pre-defined categories based on other text that is categorized), so I'm wondering if I'm just missing an obvious solution here?
Thanks in advance.
I think the Azure Cognitive Text Analytics API is your best bet as you are looking for real-time analysis prior to tagging/categorizing for storage.
Text Analytics could return a list of named entities that you could map to your available tags/categories and present to the user.
Azure Cognitive Search requires an indexer and skillset to process the target text, with the end result of storing the processed results in an index built specifically for searching.
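For illustration, here is a rough Python sketch of that flow using the azure-ai-textanalytics client library. The endpoint, key, and the AVAILABLE_TAGS set are placeholders you would replace with your own resource details and your CMS's tag list; treat this as a minimal sketch rather than a production implementation.

```python
# Minimal sketch: suggest tags for a draft post by recognizing named entities
# and key phrases with Text Analytics, then intersecting the candidates with
# the site's existing tag list. Endpoint, key, and AVAILABLE_TAGS are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

AVAILABLE_TAGS = {"health", "well-being", "nutrition", "fitness"}  # your CMS tags

def suggest_tags(article_text):
    entities = client.recognize_entities([article_text])[0]
    phrases = client.extract_key_phrases([article_text])[0]
    candidates = {e.text.lower() for e in entities.entities}
    candidates |= {p.lower() for p in phrases.key_phrases}
    # Only suggest tags that already exist in the CMS
    return {tag for tag in AVAILABLE_TAGS if tag in candidates}

print(suggest_tags("Five easy habits that improve your health and well-being."))
```

Intersecting the candidates with your existing tag list keeps the suggestions constrained to your current taxonomy instead of inventing new tags the editors have never used.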

IBM Watson, how to input data of entire books

I'm using the IBM Watson Analytics trial; it says it only takes data as CSV, Excel, and a few other formats. How can I convert books or bodies of text into an acceptable format? Thank you.
It seems like the architecture of WCA (Watson Content Analytics) does not support PDF itself. Please refer to the following images from IBM Link.
I think it would be better to convert the PDF to text with a converter such as CONVERTER and push it into a database or other store.
Then you can crawl the text data from it.
FYI, the document has to have a KEY column (i.e., the name of the book).
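As a rough illustration of that pipeline, here is a small Python sketch that extracts text from a PDF and writes a CSV with a header row and a key column. pypdf is just one example converter (not the one referenced above), and the file and column names are made up for the sketch.

```python
# Rough sketch: turn a book PDF into a tabular CSV (key column + text) so it
# can be loaded into a tool that expects structured input. File names, column
# names, and the pypdf library are assumptions for illustration.
import csv
from pypdf import PdfReader

reader = PdfReader("my_book.pdf")

with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["book_title", "page", "text"])  # header row is required
    for page_number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if text:
            writer.writerow(["My Book", page_number, text])
```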
Even if you do convert your book into an acceptable text format (.csv, .xls, .xlsx, .sav), Watson Analytics isn't optimized for text analytics. It sounds like Watson Explorer is the offering that'd best suit your needs.
Hope this helps.
Even though CSV or XLS is an acceptable file format, datasets need to be in a specific structure: you need headers for all the columns, with the data following. I am not sure how the text of a book can fit into that format.
I have recently published this blog post on how to structure and refine data before importing into Watson Analytics to get the best results.
For your specific requirement, you can look into Watson Explorer as suggested by Brennan above, or even better you can learn to use IBM Content Analytics here.

Pulling out Popular terms from a Solr core

I have an Apache Solr core from which I need to pull out the popular terms. I am already aware of Luke, facets, and Apache Solr stopwords, but I am not getting what I want. For example, when I try to use Luke to get the popular terms and apply the stopwords to the result set, I get a bunch of words like:
http, img, que ...etc
While what i really want is:
Obama, Metallica, Samsung ...etc
Is there any better way to implement this in Solr? Am I missing something that should be used to do this?
Thank You
Finding relevant words in a text is not easy. The first thing I would have a deeper look at is Natural Language Processing (NLP) with Solr. The article in Solr's wiki is a starting point for this. Reading the page, you will stumble over the Full Example, which extracts nouns and verbs; that alone may already help you.
During the process of getting this running you will need to install additional software (Apache's OpenNLP project), so after reading Solr's wiki, that project's home page may be the next step.
To get a feeling for what is possible, you should have a look at the demonstration by the Searchbox guy. There you can paste a sample text and have relevant words and terms extracted from it.
There are several tutorials out there you may have a look at for further reading.
If you go down that path and the results are not as expected or not as good as required, you may go even further and start thinking about text mining with Apache Mahout. There are again several tutorials out there on combining it with Solr.
In any case, you should then search Stack Overflow or the web for the tutorials and how-tos you will certainly need.
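To get a quick feel for the noun-extraction idea without setting up OpenNLP (which is a Java library), here is an illustrative Python sketch using NLTK as a stand-in. It counts proper nouns (the "Obama, Metallica, Samsung" kind of term the question is after); the exact NLTK data package names can vary by NLTK version, and the sample documents are made up.

```python
# Illustration only: the same noun-extraction idea as the OpenNLP approach
# above, prototyped with NLTK. Proper nouns (NNP/NNPS) are counted as
# candidate "popular terms".
from collections import Counter
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def popular_proper_nouns(texts, top_n=10):
    counts = Counter()
    for text in texts:
        tokens = nltk.word_tokenize(text)
        for word, tag in nltk.pos_tag(tokens):
            if tag in ("NNP", "NNPS"):  # proper nouns only
                counts[word] += 1
    return counts.most_common(top_n)

docs = [
    "Obama met executives from Samsung while Metallica played in the background.",
    "Samsung unveiled a new phone; Metallica announced a tour.",
]
print(popular_proper_nouns(docs))
```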
Update about Arabic
If you are going to use OpenNLP for unsupported languages, which Arabic unfortunately is out of the box as of version 1.5, you will need to train OpenNLP for the language. The reference for this is found in OpenNLP's developer docs. Probably there is already something out there from the Arabic community, but my Arabic google-fu is not that good.
Should you decide to do the work and train it for the Arabic language, why not share your training with the project?
Update about integration in Solr/Lucene
There is work going on to integrate it as a module. In my humble opinion, this is as far as it will and should get. If you compare this problem field to stemming, stemming appears to be rather easy. But even stemming gets complex when supporting different languages. Analysing a language to the level where you can extract nouns, verbs, and so forth is so complex that a whole project evolved around it.
Having a module/contrib at hand that you could simply copy to solr_home/lib would already be very handy, so there would be no need to run a separate installer and so forth.
Well, this is a bit open ended.
First you will need to facet and find "popular terms" in your index, then add all the non-useful items such as http, img, time, what, when, etc. to your stopword list and re-index to get the cream of the data you care about. I do not think there is an easier way of finding popular names unless you can bounce your data against a custom dictionary of nouns during indexing (that is an option, by the way). You can choose to index only names by writing a custom token filter (look at how the stopword filter works) with your own nouns.txt file backing your own nouns filter, so that only words in your dictionary make it into the index; this approach is only possible if you have a finite, known list of nouns.
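As a rough sketch of the "pull high-frequency terms, then filter out the noise" idea, here is one way to query Solr's TermsComponent from Python and drop terms that appear in a custom noise list. The core name, field name, and noise list are placeholders, and the /terms handler has to be enabled in your Solr configuration.

```python
# Rough sketch: ask Solr's TermsComponent for the highest-frequency terms in a
# field, then drop everything in a custom noise/stopword list. Core name,
# field name, and the noise list are placeholders for your own setup.
import requests

SOLR_TERMS_URL = "http://localhost:8983/solr/mycore/terms"
NOISE = {"http", "img", "que", "time", "what", "when"}

params = {
    "terms.fl": "content",   # field to pull terms from
    "terms.limit": 100,      # top N terms by frequency
    "wt": "json",
}
response = requests.get(SOLR_TERMS_URL, params=params)
response.raise_for_status()

# TermsComponent returns a flat [term, count, term, count, ...] list per field
flat = response.json()["terms"]["content"]
pairs = list(zip(flat[0::2], flat[1::2]))

popular = [(term, count) for term, count in pairs if term not in NOISE]
print(popular[:20])
```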

Cakephp website with English and Arabic support for the same database

I'm building a website in CakePHP 1.3. My requirement is to have a website with Arabic and English support. I want that if a user enters information in Arabic, then when an English-speaking user views the same information it should appear in English, and vice versa.
As far as localizing the labels, I've done that using .po files. It's pretty straightforward.
But for the database I'm using CakePHP's built-in Translate behavior. Again, it doesn't translate anything; it just creates another copy of the data under the current locale in use.
Please help me with which direction I should take.
I want to know the best practices that should be followed for this kind of scenario.
Maybe translating DB values is not the best solution, and I should save the values in whatever language they come in.
Any help and suggestions would be highly appreciated.
It isn't actually possible to have CakePHP automatically translate data that is entered.
The Translate Behavior allows you to enter the same content in multiple languages and then retrieve the appropriate language from the database, based on the language that you currently have set in your config. It doesn't actually translate anything for you.
Theoretically, you could add a function to the Model::beforeSave() callback that would submit the Arabic text to a service like Google Translate and then save both Arabic and English versions to their appropriate tables, but the results won't necessarily be very good. As @deceze said in his comment to your question, machine translation is a hard problem.

How does Google Docs store documents (on the backend)?

I half imagine there being these great .docs in the sky... but another part of me doubts that my documents are even being stored in anything we'd traditionally call a "file." Does Google have its own document format? I feel like it must. Some branch of some existing format like ODF, maybe? Any idea what it's like, what's special about it (if anything), and/or why it is the way it is?
As far as I'm aware, Google Docs originally generated RTF files. Now, however, with the recent push of HTML5 and integration of the ContentEditable module, they may very well just store documents as plain HTML within their database.
I would guess that Google definitely extracts some information from the file for indexing. For editing purposes, however, I do not think the internal format will be much different from ODF/MS Office or other file formats. But those are only guesses; maybe someone else knows more.
