IBM Watson Document Conversion not working at all - ibm-watson

We recently implemented the Document Conversion API from IBM Watson.
We always get the following error, even though we specify the document type:
415 Unsupported Media Type - The media type of the input file is not supported. Specify the MIME type of the document if auto-detection was not correct.
We are trying to convert PDFs to plain text. Even the sample PDF from IBM did not work for us.

Related

zkHost for using stream in standalone Solr?

I am trying to use the streaming functionality of Solr:
http://127.0.1.1:8983/solr/ContentArticles/stream?expr=search(ContentArticles,qt=%22/export%22,fl=Title,sort=Title%20asc,q=%22Title:Iron*%22)
but I get the following error:
{
"result-set":{
"docs":[{
"EXCEPTION":"java.io.IOException: invalid expression - zkHost not found for collection 'ContentArticles'",
"EOF":true,
"RESPONSE_TIME":0}]}}
According to the Reference Manual, zkHost is optional. I do not use ZooKeeper, as this is a standalone Solr collection.
What am I doing wrong?
The streaming functionality of Solr is only available in cloud mode.
From the documentation:
"Streaming Expressions provide a simple yet powerful stream processing language for Solr Cloud."
From:
https://solr.apache.org/guide/6_6/streaming-expressions.html
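For reference, the request URL from the question can be assembled as in the sketch below. It only illustrates how the expression is percent-encoded. The host, port and collection name come from the question, and the helper name build_stream_url is hypothetical. The resulting request would still fail against a standalone instance, for the reason given above.

```python
from urllib.parse import quote


def build_stream_url(base_url: str, collection: str, expr: str) -> str:
    """Return the /stream URL for a streaming expression (hypothetical helper)."""
    # Percent-encode the whole expression so quotes, spaces and slashes are safe in a URL.
    return f"{base_url}/{collection}/stream?expr={quote(expr, safe='')}"


# Expression copied from the question, before URL encoding.
expr = 'search(ContentArticles,qt="/export",fl=Title,sort=Title asc,q="Title:Iron*")'
url = build_stream_url("http://127.0.1.1:8983/solr", "ContentArticles", expr)
print(url)
```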

Does the IBM Watson Personality Insights API support Simplified Chinese?

I'm trying to use the Personality Insights API by IBM to get personality and value scores from social media posts written in Simplified Chinese.
The API documentation says that Chinese is supported, but I get an error when I call the API.
The error is:
{
"code": 400,
"sub_code": "C00001",
"error": "The language you requested, zh-cn, is not supported. Languages supported: en,es,ja,ar,ko."
}
Can anyone figure out what is wrong? Am I using the API incorrectly?
The entry for supported languages in the service documentation shows:
Table 2. Specifying request and response languages
Language            Argument   Supported by Content-Language   Supported by Accept-Language
Simplified Chinese  zh-cn      No                              Yes
That is, Chinese is not supported for the input text, but it is supported for the output report. So you can, for example, specify Spanish as the language of the text that you want to analyse and request the analysis report in Chinese.
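The distinction between the two headers can be sketched as follows. This is a minimal illustration, not the service's client library: the helper name build_language_headers is hypothetical, and the set of input languages is taken only from the error message in the question, so it may not match the current service.

```python
# Input languages as listed in the error message from the question (assumption).
INPUT_LANGUAGES = {"en", "es", "ja", "ar", "ko"}


def build_language_headers(content_language: str, accept_language: str) -> dict:
    """Build request headers, rejecting an unsupported input language early."""
    if content_language.lower() not in INPUT_LANGUAGES:
        raise ValueError(
            f"{content_language} is not supported as Content-Language; "
            f"supported: {sorted(INPUT_LANGUAGES)}"
        )
    # Content-Language describes the text being analysed,
    # Accept-Language describes the language of the returned report.
    return {"Content-Language": content_language, "Accept-Language": accept_language}


# Spanish input text, report requested in Simplified Chinese.
headers = build_language_headers("es", "zh-cn")
print(headers)
```

Calling build_language_headers("zh-cn", "en") raises ValueError, which mirrors the 400 response shown in the question.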

Apache Camel Concept Data format vs Data type

I am using Apache Camel. I have a rough idea of the following concepts, but I would like to understand them clearly. Yes, I have gone through the Apache Camel documentation.
Data Format conversion
Data Type conversion
Marshalling and Unmarshalling
What I am looking for is a clear conceptual differentiation. Thanks in advance.
These terms have many different meanings in programming and computing in general. In addition, the terms Data Format and Data Type are sometimes used interchangeably across Camel components.
Data Format -- typically the format of the "data on the wire". Examples are text or binary for file handling and messaging scenarios (.txt, .csv, .bin, JMS, MQTT, STOMP, etc.), or JSON and XML for REST and SOAP web services (generally over HTTP).
Data Type -- heavily overloaded. In Camel (I'll risk being flamed for this), it generally means the Java class that is used as the input or output of a component. Camel also has a large number of automatic type converters, so some of the subtle differences go unnoticed by users. For example, consuming from a JMS queue may produce a javax.jms.TextMessage, while the next step may use a java.lang.String. Camel can convert between those types automatically.
Marshalling and Unmarshalling -- the steps that convert Java class -> Data Format (marshalling) and Data Format -> Java class (unmarshalling). For example, a JSON payload would be unmarshalled to a com.bobtire.Order Java class and used by a Java processor in Camel. Conversely, after some processing, one may need to marshal a com.bobtire.Order Java class to JSON to send it to a REST endpoint. These functions are handled by "data format" modules within Camel. Common ones: JSON, JAXB (for XML), Bindy, PGP and JCE (for encryption).
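The marshal/unmarshal round trip can be illustrated outside Camel. The sketch below is plain Python, not Camel code, and the Order fields are hypothetical: the JSON string plays the role of the data format, and the Order class plays the role of the data type.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Order:
    # Hypothetical fields, chosen only for illustration.
    order_id: int
    item: str
    quantity: int


def unmarshal(payload: str) -> Order:
    """Data format (JSON text) -> typed object."""
    return Order(**json.loads(payload))


def marshal(order: Order) -> str:
    """Typed object -> data format (JSON text)."""
    return json.dumps(asdict(order))


order = unmarshal('{"order_id": 7, "item": "tire", "quantity": 4}')
order.quantity += 1  # processing works on the typed object, not on the text
print(marshal(order))
```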

How to map watson knowledge studio with content analytics?

I am trying to export my analyzed document collection from ICA to WKS, but it says that I have to map the UIMA types to WKS entity types. I could not find any explanation of how to proceed.
Does anyone know what to do?
thanks
I had the same question a few weeks ago and followed these steps:
Mapping UIMA types to entity types
Before you import XMI files into a Watson Knowledge Studio project, you must define mappings between the UIMA types and Watson Knowledge Studio entity types.
Before you begin
The type system in your Watson Knowledge Studio project must include the entity types that you want to map the UIMA types to.
To map UIMA types to WKS entity types:
Create a file named cas2di.tsv in the folder that contains the UIMA TypeSystem descriptor file, such as exported_typesystem.xml or TypeSystem.xml.
Open the cas2di.tsv file with a text editor. Each line in the file specifies a single mapping. The format of the mapping depends on which annotator's annotations you want to map:
You can create mappings by using the basic format:
UIMA_Type_Name[TAB]WKS_Entity_Type
The following example defines mappings between UIMA types produced by the Named Entity Recognition annotator in IBM Watson Explorer Content Analytics and entity types defined in a WKS type system:
com.ibm.langware.Organization ORGANIZATION
com.ibm.langware.Person PERSON
com.ibm.langware.Location LOCATION
Another example defines mappings between UIMA types produced by a custom annotator that was created in IBM Watson Explorer Content Analytics Studio and Watson Knowledge Studio entity types:
com.ibm.Person PERSON
com.ibm.Date DATE
You can create mappings based on facets that are used in the Pattern Matcher annotator or Dictionary Lookup annotator in Watson Explorer Content Analytics. In text analysis rule files (*.pat), the facet is represented as the category attribute. To define a mapping, use the following syntax:
com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=FACET_PATH[TAB]WKS_ENTITY_TYPE
For example:
com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=FACET_PATH[TAB]ORGANIZATION
See the Official Documentation.
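The basic mapping format described above can also be generated with a short script. This is a minimal sketch: the helper name render_cas2di is hypothetical, the mappings are the Named Entity Recognition examples from the steps, and writing the file as UTF-8 is an assumption.

```python
def render_cas2di(mappings: dict) -> str:
    """Render one 'UIMA_Type_Name<TAB>WKS_Entity_Type' line per mapping."""
    lines = [f"{uima_type}\t{wks_type}" for uima_type, wks_type in mappings.items()]
    return "\n".join(lines) + "\n"


mappings = {
    "com.ibm.langware.Organization": "ORGANIZATION",
    "com.ibm.langware.Person": "PERSON",
    "com.ibm.langware.Location": "LOCATION",
}
content = render_cas2di(mappings)

# To create the file next to the TypeSystem descriptor (encoding is an assumption):
# pathlib.Path("cas2di.tsv").write_text(content, encoding="utf-8")
print(content, end="")
```

Generating the file avoids the common mistake of separating the two columns with spaces instead of a real tab character.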

How do I upload/index rich/structured text documents to search with ElasticSearch?

I am building a search engine around a corpus of documents including Microsoft Word Docs, PowerPoints, PDFs, and text files. I have successfully downloaded and installed ElasticSearch and have it running (visible from the command prompt and from a browser - localhost:9200).
I can upload and search data that is entered manually (found in several tutorials online - such as this one: http://www.elasticsearchtutorial.com/elasticsearch-in-5-minutes.html#Indexing)
Now I need to make the (large?) jump from searching manually entered data to searching the large corpus of structured text files. My question is: how do I go about uploading/indexing these documents to make them available to the Elasticsearch instance I am already running?
I understand this may be too large to answer in a single reply - even being pointed to a tool or tutorial link would help.
Versions: Windows 7, Elasticsearch 1.2.1
I would try using the Elasticsearch attachment plugin:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
Attachment Type
The attachment type allows to index different "attachment" type field
(encoded as base64), for example, Microsoft Office formats, open
document formats, ePub, HTML, and so on (full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a
simple zip file that can be downloaded and placed under
$ES_HOME/plugins location. It will be automatically detected and the
attachment type will be added.
It's built using Apache Tika and supports the following file formats:
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format
http://tika.apache.org/0.10/formats.html
It's provided as a plugin - if you're not familiar with the plugin architecture I'd take a look here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
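Since the attachment type expects the file content encoded as base64, preparing a document body could look like the sketch below. It assumes the plugin is installed and that the index mapping declares a field of type attachment. The field name "file" and the helper name build_attachment_doc are hypothetical, and sending the body to Elasticsearch over HTTP is not shown.

```python
import base64
import json


def build_attachment_doc(file_bytes: bytes, field: str = "file") -> str:
    """Return a JSON document body with the file content base64-encoded."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    # 'field' must match a field mapped with type "attachment" (assumption).
    return json.dumps({field: encoded})


# In practice the bytes would come from open(path, "rb").read().
body = build_attachment_doc(b"%PDF-1.4 sample bytes")
print(body)
```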
