How to map Watson Knowledge Studio with Content Analytics? - ibm-watson

I am trying to export my analyzed document collection from ICA to WKS, but it says I have to map the UIMA types to WKS entity types, and I could not find any explanation of how to proceed.
Does anyone know what to do?
Thanks

I ran into this a few weeks ago. These are the steps I followed:
Mapping UIMA types to entity types
Before you import XMI files into a Watson Knowledge Studio project, you must define mappings between the UIMA types and Watson Knowledge Studio entity types.
Before you begin
The type system in your Watson Knowledge Studio project must include the entity types that you want to map the UIMA types to.
To map UIMA types to WKS entity types:
Create a file named cas2di.tsv in the folder that contains the UIMA TypeSystem descriptor file, such as exported_typesystem.xml or TypeSystem.xml.
Open the cas2di.tsv file with a text editor. Each line in the file specifies a single mapping. The format of the mapping depends on which annotator's annotations you want to map:
You can create mappings by using the basic format:
UIMA_Type_Name[TAB]WKS_Entity_Type
The following example defines mappings between UIMA types produced by the Named Entity Recognition annotator in IBM Watson Explorer Content Analytics and entity types defined in a WKS type system:
com.ibm.langware.Organization ORGANIZATION
com.ibm.langware.Person PERSON
com.ibm.langware.Location LOCATION
Another example defines mappings between UIMA types produced by a custom annotator created in IBM Watson Explorer Content Analytics Studio and Watson Knowledge Studio entity types:
com.ibm.Person PERSON
com.ibm.Date DATE
You can create mappings based on facets that are used in the Pattern Matcher annotator or Dictionary Lookup annotator in Watson Explorer Content Analytics. In text analysis rule files (*.pat), the facet is represented as the category attribute. To define a mapping, use the following syntax:
com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=FACET_PATH[TAB]WKS_ENTITY_TYPE
For example:
com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=FACET_PATH[TAB]ORGANIZATION
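Putting the pieces together, a complete cas2di.tsv might look like the following. The two columns must be separated by a literal tab character (shown as spaces here for readability, as in the examples above), and the facet path "organization" is just a hypothetical placeholder:
com.ibm.langware.Organization ORGANIZATION
com.ibm.langware.Person PERSON
com.ibm.langware.Location LOCATION
com.ibm.takmi.nlp.annotation_type.ContiguousContext:category=organization ORGANIZATION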
See the Official Documentation.

Related

IBM Watson Knowledge Studio - Get mention class from entities

When annotating mentions using Watson Knowledge Studio, one has the possibility to specify the mention class: "Specific", "Negative" or "Generic" mention (Establishing a type system).
Once the annotation has been completed and the machine learning model is created, it can be deployed to a service (for example, Watson Discovery). It would be useful to be able to query entity types with the result including each entity's mention class.
In an earlier question related to this (IBM Watson Knowledge Studio - Annotating negative / negated mentions), from a year ago, it was still not possible to get this information for new text.
So the question is, is it still not possible to get this information?
We are still investigating how to enhance this in the future. Currently, mention attributes cannot be extracted in NLU.

Custom UIMA annotators in IBM Watson Retrieve&Rank

Is it possible to use custom UIMA annotators in the Retrieve&Rank service?
How can I upload my custom annotator (packaged as jar file) to the service?
I need to create an entity annotator to discover my custom domain entities.
I don't think there is an obvious, straightforward way to use a custom UIMA annotator in R&R.
That said, if you want to try integrating the two, here are some possible approaches:
Use a UIMA pipeline to annotate your documents before storing them in R&R, or as you query R&R for them. I've not tried this myself, but I've seen references to this sort of thing - e.g. http://wiki.apache.org/solr/SolrUIMA - so there might be some value in trying it. A minimal sketch of this approach follows this list.
Use the annotations from your UIMA pipeline to generate additional feature scores that the ranker you train can include in its training. For example, if your annotator detects the presence or absence of a particular custom domain entity, it could turn this into a score that contributes to the feature scores for a search result. For an example of contributing custom feature scorers to R&R, see https://github.com/watson-developer-cloud/answer-retrieval
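As promised, here is a minimal, hypothetical sketch of the first approach: running a custom UIMA analysis engine over a document and copying its annotations into extra fields before indexing with SolrJ. The descriptor path, field names, and collection URL are all assumptions; since R&R exposes a Solr collection, a recent SolrJ client should be able to talk to it.
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class PreAnnotateAndIndex {
    public static void main(String[] args) throws Exception {
        // Load the custom annotator from its UIMA descriptor (path is an assumption)
        XMLInputSource in = new XMLInputSource("MyEntityAnnotator.xml");
        ResourceSpecifier spec = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

        // Run the pipeline over one document
        CAS cas = ae.newCAS();
        cas.setDocumentText("IBM announced a new Watson service in New York.");
        ae.process(cas);

        // Copy every annotation's covered text into a multi-valued Solr field
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("body", cas.getDocumentText());
        for (AnnotationFS a : cas.getAnnotationIndex()) {
            doc.addField("entities", a.getCoveredText()); // field name is an assumption
        }

        // Index into the R&R Solr collection (URL and auth are assumptions/omitted)
        HttpSolrClient solr = new HttpSolrClient.Builder(
                "https://example.com/solr/example_collection").build();
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}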

Integration of Solr with EMC Documentum

We have a bunch of PDF documents available in EMC Documentum.
We have a requirement to integrate Apache Solr with Documentum, so that we can search for a specific document in Solr and retrieve the documents from Documentum.
I looked into the link below, which does not have sufficient information:
https://community.emc.com/docs/DOC-6520
Help is really appreciated.
The link you have posted would get you a working solution. That author proposes to write a custom crawler that connects to the Documentum repository and then use Apache Tika to perform the content extraction for Solr.
However, I would suggest you use:
Apache ManifoldCF to act as the crawler that gets the content from Documentum into Solr. You should not write this by hand, as it has already been done and tested.
Apache ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft Sharepoint and EMC Documentum, to target repositories or indexes, such as Apache Solr, Open Search Server, or ElasticSearch. Apache ManifoldCF also defines a security model for target repositories that permits them to enforce source-repository security policies.
Apache Tika to perform the content extraction (PDF to text) so that the content of the documents is searchable in Solr later on.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
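For the extraction step, Tika's facade API makes this nearly a one-liner. A minimal sketch (file path, field names, and Solr URL are assumptions) of extracting a PDF's text and indexing it into Solr:
import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class IndexPdf {
    public static void main(String[] args) throws Exception {
        // Extract plain text from the PDF with Tika's facade API
        Tika tika = new Tika();
        String text = tika.parseToString(new File("contract.pdf"));

        // Index the extracted text into Solr (URL and field names are assumptions)
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "contract.pdf");
        doc.addField("content", text);

        HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/documents").build();
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
In production, ManifoldCF's Documentum and Tika connectors would handle this crawl-extract-index loop for you; the snippet only shows the mechanics.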
I have built my own connector to extract data from Documentum and insert it into Elasticsearch or Solr, and I am willing to share. Please contact me.

How do I upload/index rich/structured text documents to search with ElasticSearch?

I am building a search engine around a corpus of documents including Microsoft Word Docs, PowerPoints, PDFs, and text files. I have successfully downloaded and installed ElasticSearch and have it running (visible from the command prompt and from a browser - localhost:9200).
I can upload and search data that is entered manually (found in several tutorials online - such as this one: http://www.elasticsearchtutorial.com/elasticsearch-in-5-minutes.html#Indexing)
Now I need to make the (large?) jump from searching manually entered data to searching the large corpus of structured text files. My question is: how do I go about uploading/indexing these documents to make them available to the Elasticsearch instance I am already running?
I understand this may be too large to answer in a single reply - even being pointed to a tool or tutorial link would help.
Versions: Windows 7, Elasticsearch 1.2.1
I would try using the Elasticsearch attachment plugin:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
Attachment Type
The attachment type allows indexing different "attachment" type fields (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a simple zip file that can be downloaded and placed under the $ES_HOME/plugins location. It will be automatically detected and the attachment type will be added.
It's built using Apache Tika and supports the following file formats:
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format
http://tika.apache.org/0.10/formats.html
It's provided as a plugin - if you're not familiar with the plugin architecture I'd take a look here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
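As a concrete illustration, here is a minimal, hypothetical Java sketch that base64-encodes a file and PUTs it into an index whose mapping declares a field of type "attachment". The index name "docs", type "doc", and field name "file" are assumptions; the plugin must already be installed and the mapping created before this runs.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class IndexAttachment {
    public static void main(String[] args) throws Exception {
        // The attachment plugin expects the file content base64-encoded
        byte[] bytes = Files.readAllBytes(Paths.get("report.docx"));
        String b64 = Base64.getEncoder().encodeToString(bytes);

        // Index, type, and field names are assumptions; the mapping for "file"
        // must have been created with "type": "attachment" beforehand
        String json = "{\"file\": \"" + b64 + "\"}";

        URL url = new URL("http://localhost:9200/docs/doc/1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
Once indexed, a regular match query against the "file" field searches the text that Tika extracted from the document.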

JAXB Naming Collision Salesforce Integration

I'm attempting to integrate with Salesforce using MyEclipse. The wizard fails because of a naming collision on a complex type "DescribeLayout". I need to write a JAXB binding file to ensure that the two interfaces that are created by the xjc compiler are in different packages, but I have absolutely no idea how to do this.
I do not have the URIs for the schemas that make up the WSDL, only the URNs.
This blog post shows how to append a suffix to type names to avoid this. I'm not a JAXB expert, but presumably there is a way to configure it to use a different package instead of a suffix; a sketch of a bindings file that does this follows the link below.
http://blog.teamlazerbeez.com/2009/05/23/salesforcecom-partner-soap-api-jax-ws-tutorial-part-1/
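I'm not sure this matches what the MyEclipse wizard does under the hood, but with stock JAX-WS/wsimport you would normally solve this with an external bindings file that puts each embedded schema in its own package, targeting the schemas inside the WSDL by their target namespace rather than by a schema URI. A hypothetical bindings.xjb (the WSDL file name, URNs, and package names are assumptions; check them against your WSDL):
<jaxws:bindings wsdlLocation="partner.wsdl"
    xmlns:jaxws="http://java.sun.com/xml/ns/jaxws"
    xmlns:jaxb="http://java.sun.com/xml/ns/jaxb"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Each embedded schema, selected by target namespace, gets its own package -->
  <jaxws:bindings
      node="//xsd:schema[@targetNamespace='urn:partner.soap.sforce.com']">
    <jaxb:schemaBindings>
      <jaxb:package name="com.example.sforce.partner"/>
    </jaxb:schemaBindings>
  </jaxws:bindings>
  <jaxws:bindings
      node="//xsd:schema[@targetNamespace='urn:sobject.partner.soap.sforce.com']">
    <jaxb:schemaBindings>
      <jaxb:package name="com.example.sforce.sobject"/>
    </jaxb:schemaBindings>
  </jaxws:bindings>
</jaxws:bindings>
You would then pass the file to the tooling, e.g. wsimport -b bindings.xjb partner.wsdl. With the colliding types in separate packages, the generated interfaces no longer clash.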
