Is there a way to show a PDF in its original structure during human review for custom entity labelling in AWS SageMaker?

I have modified this sample to read PDFs in tabular format. I would like to keep the tabular structure of the original PDF during the human review process. I notice the custom worker task template uses the crowd-entity-annotation element, which seems to accept only plain text. I am aware that the human review process reads from an S3 key containing raw text written by the Textract process.
I have considered writing to S3 using tabulate, but I don't think that is the best solution. I would like to keep the structure and still be able to annotate custom entities.

Comprehend now natively supports detecting custom-defined entities in PDF documents. To do so, you can try the following steps (a boto3 sketch of steps 2 and 3 follows the list):
Follow this GitHub readme to start the annotation process for PDF documents.
Once the annotations are produced, use the Comprehend CreateEntityRecognizer API to train a custom entity model for semi-structured documents.
Once the entity recognizer is trained, use the StartEntitiesDetectionJob API to run inference on PDF documents.
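For illustration, here is a minimal boto3 sketch of the training and inference calls. The bucket names, role ARN, entity types, and manifest locations are placeholders, and the recognizer must reach TRAINED status before the inference job is started:

```python
# Hypothetical sketch: train and run a custom entity recognizer for
# semi-structured (PDF) documents with boto3. All S3 URIs, the IAM role ARN,
# and the entity types are placeholders.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# 1. Train a recognizer from the augmented manifest that the PDF annotation
#    process (see the GitHub readme above) writes to S3.
recognizer = comprehend.create_entity_recognizer(
    RecognizerName="invoice-entities",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [{"Type": "INVOICE_NUMBER"}, {"Type": "DUE_DATE"}],
        "AugmentedManifests": [
            {
                "S3Uri": "s3://my-bucket/annotations/output.manifest",
                "AttributeNames": ["my-labeling-job"],
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "AnnotationDataS3Uri": "s3://my-bucket/annotations/",
                "SourceDocumentsS3Uri": "s3://my-bucket/pdfs/",
            }
        ],
    },
)

# 2. Once the recognizer reaches TRAINED status, run inference on new PDFs.
comprehend.start_entities_detection_job(
    JobName="pdf-entity-inference",
    EntityRecognizerArn=recognizer["EntityRecognizerArn"],
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",
    InputDataConfig={
        "S3Uri": "s3://my-bucket/new-pdfs/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://my-bucket/inference-output/"},
)
```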

SageMaker Inference Endpoint with fitted Encoder

Since I'm not getting anywhere with the documentation and blog posts, I'll ask here:
I want to deploy a SageMaker endpoint by fitting a SageMaker Pipeline. The endpoint should be backed by a PipelineModel consisting of two models: a fitted model that encodes my data and a model that predicts with an XGBoost estimator. I am following the documentation, but the example doesn't show how to integrate the fitted preprocessor model in a pipeline step. What step do I have to use? A TrainingStep? Thanks in advance. I am desperate.
Check out this official example: Train register and deploy a pipeline model.
The two variations to keep in mind:
For models that need training (usually those based on TensorFlow/PyTorch), a TrainingStep must be used so that the output (the model artifact) is generated correctly and automatically, ready to be used later for inference.
For models produced by a simple fit on the data (e.g., a scaler with sklearn), you could create a TrainingStep in disguise (an extra pipeline component; not strictly correct, but a workaround). The cleaner method is to configure the preprocessing script so that it internally saves a model.tar.gz with the necessary files inside (e.g., pickle or joblib objects), which can then be used in later steps as model_data. In fact, once you have a model.tar.gz, you can define a Model of various types (e.g., an SKLearnModel) that is already fitted.
At this point, you define your PipelineModel with the trained/fitted models and can either proceed to direct endpoint deployment or go through the model registry for a more robust approach. A sketch of the final composition is below.
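As a minimal sketch of that last step: the model_data URIs, role ARN, and entry-point scripts below are placeholders, and the preprocessor tarball is assumed to have been written by your preprocessing script as described above.

```python
# Hypothetical sketch: compose a fitted sklearn preprocessor and a trained
# XGBoost model into one PipelineModel behind a single endpoint.
import sagemaker
from sagemaker.sklearn.model import SKLearnModel
from sagemaker.xgboost.model import XGBoostModel
from sagemaker.pipeline import PipelineModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
session = sagemaker.Session()

# Fitted encoder, saved by the preprocessing script as model.tar.gz
# (e.g., a joblib-dumped scaler plus an inference script implementing
# model_fn/input_fn/predict_fn/output_fn).
preprocessor = SKLearnModel(
    model_data="s3://my-bucket/preprocessor/model.tar.gz",
    role=role,
    entry_point="preprocessor_inference.py",
    framework_version="1.2-1",
    sagemaker_session=session,
)

# Trained XGBoost estimator artifact (the output of a TrainingStep).
xgb = XGBoostModel(
    model_data="s3://my-bucket/xgboost/model.tar.gz",
    role=role,
    entry_point="xgb_inference.py",
    framework_version="1.7-1",
    sagemaker_session=session,
)

# Requests hit the preprocessor first; its output is fed to XGBoost.
pipeline_model = PipelineModel(
    name="encode-then-predict",
    role=role,
    models=[preprocessor, xgb],
    sagemaker_session=session,
)

predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```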

How do I export data with attachments from a Lotus Notes Database into an Excel Spreadsheet or into a Microsoft Access Database?

I'm not a Lotus Notes developer, but I have to get data from a Lotus Notes database into SharePoint. All of the LN entries have attachments. I tried to export to a CSV file, but that doesn't include the attachments. I then created a new view with the Attachments field, but that only returns the number of attachments. How can I extract the attachments associated with each LN form? Thanks in advance.
Your question is pretty broad. Attachments are (sometimes) treated as embedded objects in a Rich Text Field. This URL has some sample code:
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_EXAMPLES_EMBEDDEDOBJECTS_PROPERTY_RTITEM.html
Copy/paste may not work for you because the attachments may not be in a field called "Body", there may be multiple "Body" fields on the document (which requires other considerations beyond the scope of this question), or the attachments may be embedded objects in the document. Or all of the above. Still, that code will give you a sense of what you need to do.
Also, see this:
How to retrieve Lotus Notes attachments?
I have done this by writing LotusScript code to detach all the attachments from all docs into a single folder, using the document's UNID plus the attachment name for the filename in the folder. Adding the UNID covers cases where attachments with the same name exist in multiple documents and might actually have different content. I do not attempt to de-duplicate.
The agent adds a NotesItem to each document giving the filename(s) of the detached attachment(s).
I then create a view containing all the fields that I want to export, including the new field with the filenames. I export that view to CSV. I hand the CSV and a zip file containing the attachments over to the SharePoint team.
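The original approach was LotusScript; for anyone scripting this from outside the Notes client, here is a rough Python-over-COM sketch of the same idea. It assumes Windows with a Notes client and pywin32 installed, and the database path, export folder, and tracking item name are made up:

```python
# Rough Python-via-COM equivalent of the LotusScript approach described above.
# Requires a Windows machine with a Notes client and pywin32 installed.
import os
import win32com.client

EMBED_ATTACHMENT = 1454  # NotesEmbeddedObject type constant for attachments

session = win32com.client.Dispatch("Lotus.NotesSession")
session.Initialize("notes-id-password")  # password for the local Notes ID

db = session.GetDatabase("", "data/archive.nsf")
export_dir = r"C:\export\attachments"
os.makedirs(export_dir, exist_ok=True)

docs = db.AllDocuments
doc = docs.GetFirstDocument()
while doc is not None:
    exported = []
    body = doc.GetFirstItem("Body")
    if body is not None and body.EmbeddedObjects:
        for obj in body.EmbeddedObjects:
            if obj.Type == EMBED_ATTACHMENT:
                # Prefix with the UNID so same-named attachments from
                # different documents do not collide.
                filename = f"{doc.UniversalID}_{obj.Name}"
                obj.ExtractFile(os.path.join(export_dir, filename))
                exported.append(filename)
    if exported:
        # Record the detached filenames on the document, as in the answer.
        doc.ReplaceItemValue("ExportedAttachments", exported)
        doc.Save(True, False)
    doc = docs.GetNextDocument(doc)
```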
Maybe a bit late but... I do have extensive experience (approx. 15 years) with data extraction from IBM Notes applications/databases - independent of the type of application - and have supported migrations of quite a few large IBM Notes applications to various targets for companies around the world.
You can access IBM Notes databases using the native C API, LotusScript, COM, or Java, or make a document available for further processing by exporting it to Domino XML (DXL) format.
The C API is the foundation of IBM Notes, meaning that the COM and Java APIs only offer a subset of its functionality. Any of the APIs should give you the ability to extract a document's metadata and attachments. However:
A document, including its attachments, can be encrypted using an IBM Notes ID. If you do not have access to the ID that was used to encrypt the document, you will be able to extract neither the document nor the attachment.
Attachments can be "real attachments" or so-called "embedded objects". Depending on the type of attachment, it needs to be handled differently when it comes to the API calls required for the export.
Attachments can be compressed. In most cases, the API should handle the decompression transparently. However, there is at least one proprietary compression algorithm (based on Huffman coding) that is widely used. If you extract documents in DXL format, you will not be able to read those attachments, as they are embedded into the DXL in compressed form.
Objects embedded into a document using Object Linking and Embedding (OLE) cannot be extracted using the COM or Java API. That is, even if you gain access to the documents, you will not be able to transform them into a readable format.
If the information you are trying to transfer from IBM Notes to SharePoint is important to the company you work for, I would recommend relying on a proven solution for the export/migration rather than developing this on your own, as the details can really be tricky.
Should you have any further questions, don't hesitate to get in touch.

How to instruct IBM Watson Discovery about the format of my documents?

I am trying to use the Watson Discovery service to build a virtual customer support agent. We have many documents with tons of Q and A in various formats. In the simplest case, we just have a doc, with an array of:
Q:..
A:...
Q:...
A:...
etc. When we upload these PDF files and then query them, the service returns the full document that contains the relevant answer. Is there a way to instruct the Discovery service to return only the relevant question and answer pair instead of the full document?
To have Discovery return the individual relevant QA pairs, they should be split up and passed to the service as separate documents. Discovery does not have a method to split a single document on its own. A minimal sketch of that splitting step follows.
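For illustration, here is one way to do the split-and-upload with the Python SDK. The service URL, API key, environment/collection IDs, and file path are placeholders:

```python
# Minimal sketch: parse a "Q:/A:" text dump into pairs and upload each pair
# as its own Discovery document, so queries return individual QA pairs.
import io
import json
import re

from ibm_watson import DiscoveryV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

discovery = DiscoveryV1(
    version="2019-04-30",
    authenticator=IAMAuthenticator("my-api-key"),
)
discovery.set_service_url("https://api.us-south.discovery.watson.cloud.ibm.com")

text = open("faq.txt", encoding="utf-8").read()
# Each pair: a "Q:" line followed by everything up to the next "Q:" (or EOF).
pairs = re.findall(r"Q:(.*?)\nA:(.*?)(?=\nQ:|\Z)", text, re.DOTALL)

for i, (question, answer) in enumerate(pairs):
    doc = {"question": question.strip(), "answer": answer.strip()}
    discovery.add_document(
        environment_id="my-environment-id",
        collection_id="my-collection-id",
        file=io.BytesIO(json.dumps(doc).encode("utf-8")),
        filename=f"qa_{i}.json",
        file_content_type="application/json",
    )
```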
If your primary requirement is Q&A, you might also look into Retrieve and Rank.
Discovery is meant for dealing with complex unstructured data; in your case the data is in a consistent format.
Have a look at this sample app here.

Salesforce: script to create custom object and fields

Is there a way to create custom objects and fields by using a script or an IDE?
Salesforce is very easy to use; however, it's time-consuming to create so many fields through the web interface. So I just wonder if there's a way to use a script or an IDE to create objects and fields in Salesforce.
You're looking for the Metadata API, or already-developed tools which use the Metadata API.
http://www.salesforce.com/us/developer/docs/api_meta/Content/meta_intro.htm
http://www.salesforce.com/us/developer/docs/api_meta/index.htm
Though using it directly will still require some development, which may not save you much time: you get metadata as XML, but you still need to process it into what you want to achieve.
It also depends on the nature of what you want to do. I, for instance, had a requirement today for 150 custom labels based on an input file. It was much faster to generate the metadata XML than to ever do that in the web interface, and then deploy it using the Force.com IDE. A sketch of that XML-generation approach is below.
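To make the approach concrete, here is a small Python sketch that generates a CustomObject metadata file from a field list. The object and field definitions are made up for illustration; the resulting file would go into the objects/ folder of a deployment package for the Metadata API (via the Force.com IDE or the ant migration tool):

```python
# Hypothetical sketch: generate Salesforce CustomObject metadata XML
# (an Invoice__c object with a few fields) from a simple field list.
FIELDS = [
    # (API name, label, type, extra type-specific elements)
    ("Amount__c", "Amount", "Number",
     "        <precision>18</precision>\n        <scale>2</scale>"),
    ("Invoice_Date__c", "Invoice Date", "Date", ""),
    ("Notes__c", "Notes", "LongTextArea",
     "        <length>32768</length>\n        <visibleLines>5</visibleLines>"),
]

def field_xml(api_name, label, field_type, extra):
    lines = [
        "    <fields>",
        f"        <fullName>{api_name}</fullName>",
        f"        <label>{label}</label>",
        f"        <type>{field_type}</type>",
    ]
    if extra:
        lines.append(extra)
    lines.append("    </fields>")
    return "\n".join(lines)

body = "\n".join(field_xml(*f) for f in FIELDS)

object_xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<CustomObject xmlns="http://soap.sforce.com/2006/04/metadata">
    <label>Invoice</label>
    <pluralLabel>Invoices</pluralLabel>
    <nameField>
        <label>Invoice Name</label>
        <type>Text</type>
    </nameField>
    <deploymentStatus>Deployed</deploymentStatus>
    <sharingModel>ReadWrite</sharingModel>
{body}
</CustomObject>
"""

# Written as objects/Invoice__c.object inside the deployment package.
with open("Invoice__c.object", "w", encoding="utf-8") as f:
    f.write(object_xml)
```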

Programmatically accessing excel files located on salesforce

I am new to the Salesforce platform. The task I'm working on involves some Excel files on Salesforce. I have to write a program to analyze the data in these Excel files and generate a report. I have the following questions about doing this:
Do I need to programmatically download these Excel files locally to my machine? If yes, what API should I use for this? An example would be really appreciated.
Is this something that can be done directly on Salesforce?
Thank you.
You have a multitude of choices. If you're using .NET or Java, you probably want to start with the SOAP API: you can run a SOQL query to access the Body field of the Document object (I'm assuming you're storing these as Documents). The SOAP API docs have examples for this. For other languages you'll probably want to start with the REST API; you'll be able to access the body resource of your document and get back the binary stream. Again, there are good examples in the docs, and a sketch follows below.
No.
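A minimal sketch of the REST route, using the simple-salesforce package. The credentials, API version, and document name are placeholders, and it assumes the spreadsheets are stored as Document records:

```python
# Query a Document record, then fetch its binary body through the REST
# sobject Body endpoint and save it locally for analysis.
import requests
from simple_salesforce import Salesforce

sf = Salesforce(
    username="user@example.com",
    password="password",
    security_token="token",
)

# Find the document's Id; the Body is fetched separately as a binary stream.
result = sf.query("SELECT Id, Name FROM Document WHERE Name = 'sales.xls'")
doc_id = result["records"][0]["Id"]

url = f"https://{sf.sf_instance}/services/data/v57.0/sobjects/Document/{doc_id}/Body"
resp = requests.get(url, headers={"Authorization": f"Bearer {sf.session_id}"})
with open("sales.xls", "wb") as f:
    f.write(resp.content)
```

From there you can open the file locally (e.g., with pandas' read_excel) to build your report.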
Although you can't open an Excel document to inspect or modify it in Apex (to the best of my knowledge), you can create one, FYI.
