Watson Discovery: Problem with processing data through the API - ibm-watson

I'm trying to process 30Mb JSON file to Watson Discovery with the Node SDK, but it gives the error that the file is too large. In the documentation, there is explicitly said that we can process up to 50Mb JSON data with the API.
Where the issue might come from? (The JSON has a root array element and each of the objects contain only two Strings)
UPDATE: The JSON file has the following structure
{
"elements":[
{
"Q":" ... ",
"A":" ... "
},
...
]
}

From the API Documentation - https://cloud.ibm.com/apidocs/discovery-data#adddocument
413 : Too large. Returned if you attempt to add a document or document
metadata that exceeds the maximum possible.
So either the document or metadata is too large. Your data shouldn't be, so that leaves metadata. Again from the API Documentation -
metadata string The maximum supported metadata file size is 1 MB.
Metadata parts larger than 1 MB are rejected.
Example: { "Creator": "Johnny Appleseed", "Subject": "Apples" }

Related

How To Upload A Large File (>6MB) To SalesForce Through A Lightning Component Using Apex Aura Methods

I am aiming to take a file a user attaches through an Lightning Component and create a document object containing the data.
So far I have overcome the request size limits by chunking the data being uploaded into 1MB chunks. When the Apex Aura method receives these chunks of data it will either create a new document (if it is the first chunk), or will retrieve the existing document and add the new chunk to the end.
Data is received Base64 encoded, and then decoded server-side.
As the document data is stored as a Blob, the original file contents will be read as a String, and then appended with the chunk received. The new contents are then converted back into a Blob to be stored within the ContentVersion object.
The problem I'm having is that strings in Apex have a maximum length of 6,000,000 or so. Whenever the file size exceeds 6MB, this limit is hit during the concatenation, and will cause the file upload to halt.
I have attempted to avoid this limit by converting the Blob to a String only when necessary for the concatenation (as suggested here https://developer.salesforce.com/forums/?id=906F00000008w9hIAA) but this hasn't worked. I'm guessing it was patched because it's still technically allocating a string larger then the limit.
Code's really simple when appending so far:
ContentVersion originalDocument = [SELECT Id, VersionData FROM ContentVersion WHERE Id =: <existing_file_id> LIMIT 1];
Blob originalData = originalDocument.VersionData;
Blob appendedData = EncodingUtil.base64Decode(<base_64_data_input>);
Blob newData = Blob.valueOf(originalData.toString() + appendedData.toString());
originalDocument.VersionData = newData;
You will have hard time with it.
You could try offloading the concatenation to asynchronous process (#future/Queueable/Schedulable/Batchable), they'll have 12MB RAM instead of 6. Could buy you some time.
You could try cheating by embedding an iframe (Visualforce or lightning:container tag? Or maybe a "canvas app") that would grab your file and do some manual JavaScript magic calling normal REST API for document upload: https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/dome_sobject_insert_update_blob.htm (last code snippet is about multiple documents). Maybe jsforce?
Can you upload it somewhere else (SharePoint? Heroku?) and have that system call into SF to push them (no Apex = no heap size limit). Or even look "Files Connect" up.
Can you send an email with attachments? Crude but if you write custom Email-to-Case handler class you'll have 36 MB of RAM.
You wrote "we needed multiple files to be uploaded and the multi-file-upload component provided doesn't support all extensions". That may be caused by these:
In Experience Builder sites, the file size limits and types allowed follow the settings determined by site file moderation.
lightning-file-upload doesn't support uploading multiple files at once on Android devices.
if the Don't allow HTML uploads as attachments or document records security setting is enabled for your organization, the file uploader cannot be used to upload files with the following file extensions: .htm, .html, .htt, .htx, .mhtm, .mhtml, .shtm, .shtml, .acgi, .svg.

How do we get the document file url using the Watson Discovery Service?

I don't see a solution to this using the available api documentation.
It is also not available on the web console.
Is it possible to get the file url using the Watson Discovery Service?
If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.
I also struggled with this request but ultimately got it working using Python bindings into Watson Discovery. The online documentation and API reference is very poor; here's what I used to get it working:
(Assume you have a Watson Discovery service and have a created collection):
# Programmatic upload and retrieval of documents and metadata with Watson Discovery
from watson_developer_cloud import DiscoveryV1
import os
import json
discovery = DiscoveryV1(
version='2017-11-07',
iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
url='https://gateway-syd.watsonplatform.net/discovery/api'
)
environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))
This gives you your environment ID. Now append to your code:
collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))
This will show you the collection ID for uploading documents into programmatically. You should have a document to upload (in my case, an MS Word document), and its accompanying URL from your own source document system. I'll use a trivial fictitious example.
NOTE: the documentation DOES NOT tell you to append , 'rb' to the end of the open statement, but it is required when uploading a Word document, as in my example below. Raw text / HTML documents can be uploaded without the 'rb' parameter.
url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
add_doc = discovery.add_document('{environment-id}', '{collections-id}', metadata=json.dumps(url), file=fileinfo).get_result()
print(json.dumps(add_doc, indent=2))
print(add_doc["document_id"])
Note the setting up of the metadata as a JSON dictionary, and then encoding it using json.dumps within the parameters. So far I've only wanted to store the original source URL but you could extend this with other parameters as your own use case requires.
This call to Discovery gives you the document ID.
You can now query the collection and extract the metadata using something like a Discovery query:
my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))
Note - I'm extracting just the stored metadata here from within the overall returned results - if you instead just had:
print(my_query) you'll get the full response from Discovery ... but ... there's a lot to go through to identify just your own custom metadata.

Google speech API v1beta1 (syncrecognize and asyncrecognize API call)

I am a Java developer and I have couple of questions related to Google speech API V1Beta1.
Question1 (Syncrecognize case):
I tried to upload (through GCS) small size (less than one min running file) audio file to google speech api it is working But the confidence output level is 0.32497215 only. That is my result is not exactly same to my audio input.
How to increase the confidence level output?
Question 2 (Asyncrecognize case):
I tried big size audio file (more than one min running file). This case I used the API call:
https://speech.googleapis.com/v1beta1/speech:asyncrecognize?key=XXXXXXXXXXXXXXXXXXXX
and Payload:
"{"config":{"encoding":"LINEAR16","sample_rate": 16000},"audio":{"uri":"gs://" + bucketName +"/"+ objectName + ""}}"
Here I got the output json like
{"name": "57...........................95"}.
After getting this output I make new API call (Operation interface) with this name value.
https://speech.googleapis.com/v1beta1/operations/57.................................95?key=XXXXXXXXXXXXXXXXX
I got the output
{
"name": "57....................................95",
"done": true,
"response": {
"#type": "type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeResponse"
}
}
How to proceed the work with this value? I need to get audio speech text.
Please help me to fix this issues. Thanks in advance.
Ideas to Question 1:
You should give more details in RecognitionConfig object, for example specify the languageCode and add hints via the SpeechContext object.
Answer to Question 2:
Check the sample rate of the audio file, you must be sure that is equal to the rate you gave in the request. You can check it e.g. with the following code soxi audio_file.flac (sox needed for this one).

How to view JSON logs of a managed VM in the Log Viewer?

I'm trying to get JSON formatted logs on a Compute Engine VM instance to appear in the Log Viewer of the Google Developer Console. According to this documentation it should be possible to do so:
Applications using App Engine Managed VMs should write custom log
files to the VM's log directory at /var/log/app_engine/custom_logs.
These files are automatically collected and made available in the Logs
Viewer.
Custom log files must have the suffix .log or .log.json. If the suffix
is .log.json, the logs must be in JSON format with one JSON object per
line. If the suffix is .log, log entries are treated as plain text.
This doesn't seem to be working for me: logs ending with .log are visible in the Log Viewer, but displayed as plain text. Logs ending with .log.json aren't visible at all.
It also contradicts another recent article that states that file names must end in .log and its contents are treated as plain text.
As far as I can tell Google uses fluentd to index the log files into the Log Viewer. In the GitHub repository I cannot find any evidence that .log.json files are being indexed.
Does anyone know how to get this working? Or is the documentation out-of-date and has this feature been removed for some reason?
Here is one way to generate JSON logs for the Managed VMs logviewer:
The desired JSON format
The goal is to create a single line JSON object for each log line containing:
{
"message": "Error occurred!.",
"severity": "ERROR",
"timestamp": {
"seconds": 1437712034000,
"nanos": 905
}
}
(information sourced from Google: https://code.google.com/p/googleappengine/issues/detail?id=11678#c5)
Using python-json-logger
See: https://github.com/madzak/python-json-logger
def get_timestamp_dict(when=None):
"""Converts a datetime.datetime to integer milliseconds since the epoch.
Requires special handling to preserve microseconds.
Args:
when:
A datetime.datetime instance. If None, the timestamp for 'now'
will be used.
Returns:
Integer time since the epoch in milliseconds. If the supplied 'when' is
None, the return value will be None.
"""
if when is None:
when = datetime.datetime.utcnow()
ms_since_epoch = float(time.mktime(when.utctimetuple()) * 1000.0)
return {
'seconds': int(ms_since_epoch),
'nanos': int(when.microsecond / 1000.0),
}
def setup_json_logger(suffix=''):
try:
from pythonjsonlogger import jsonlogger
class GoogleJsonFormatter(jsonlogger.JsonFormatter):
FORMAT_STRING = "{message}"
def add_fields(self, log_record, record, message_dict):
super(GoogleJsonFormatter, self).add_fields(log_record,
record,
message_dict)
log_record['severity'] = record.levelname
log_record['timestamp'] = get_timestamp_dict()
log_record['message'] = self.FORMAT_STRING.format(
message=record.message,
filename=record.filename,
)
formatter = GoogleJsonFormatter()
log_path = '/var/log/app_engine/custom_logs/worker'+suffix+'.log.json'
make_sure_path_exists(log_path)
file_handler = logging.FileHandler(log_path)
file_handler.setFormatter(formatter)
logging.getLogger().addHandler(file_handler)
except OSError:
logging.warn("Custom log path not found for production logging")
except ImportError:
logging.warn("JSON Formatting not available")
To use, simply call setup_json_logger - you may also want to change the name of worker for your log.
I am currently working on a NodeJS app running on a managed VM and I am also trying to get my logs to be printed on the Google Developper Console. I created my log files in the ‘/var/log/app_engine’ directory as described in the documentation. Unfortunately this doesn’t seem to be working for me, even for the ‘.log’ files.
Could you describe where your logs are created ? Also, is your managed VM configured as "Managed by Google" or "Managed by User" ? Thanks!

What ObjectMapper config / annotations required for successful JSON / JSON array exhange using Jackson 2.2

Using Jackson 2.2.2 and Apache CXF web services client and server API.
I'm finding it impossible to serialize / deserialize JSON without failure.
Java class:
MyPojo
{
..... various properties
}
JSON produced from Jackson:
{
"MyPojo":
{
..... various properties
}
}
When I send the exact same JSON back to Jackson for it to consume, it fails with:
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException:
Unrecognized field "MyPojo" (class app.model.MyPojo), not marked as ignorable (17 known properties: ,.....
Ideally, Jackson would not wrap the MyPojo object with {"MyPojo":} because I only ever exchange MyPojo objects, so it is implied.
To that end, how can I get Jackson to produce:
{
..... various properties
}
Then, how do I get jackson to consume the same JSON without failing? i.e. what ObjectMapper configuration or annotations or combination of both do I have to use?
If this is impossible, then how do I configure / annotate to get Jackson to consume the "wrapped" JSON without failing?
ALSO,
I have the same issues when producing / consuming an array of MyPojo objects:
JSON produced from Jackson:
{
"MyPojo":
[
{
..... various properties
},
{
..... various properties
}
]
}
..when consumed fails with:
com.fasterxml.jackson.databind.JsonMappingException:
Can not deserialize instance of app.model.MyPojo[] out of START_OBJECT token
Again, ideally (but not essential) Jackson would produce / consume:
[
{
..... various properties
},
{
..... various properties
}
]
Note, the Apache CXF WS appears to perform some magic via its #GET, #POST, etc, annotations as when used in conjuntion with a RESTful WS resource method which returns a MyPojo object, i.e. it appears that after my method returns the object, it is transformed into JSON.
To that end, I am unsure if a local or even global ObjectMapper will influence the output, so this should also be considered when answering.
Another note, I also need the same POJO to be produced and consumed in XML via JAXB.
EDIT:
I am now quite certain that TomEE/CXF is not using Jackson and that this is the cause of my issues. I'll update when I get it to work.
RESOLVED:
Further investigation revealed that whilst the JSON was being deserailized by Jackson, the default Jettison provider was not being overriden with Jackson when serializing due to misconfiguration of default JSON provider in CXF/TomEE. This resulted in a Jackson - Jettison formatting mismatch.
On stackOverflow are more than many answered questions like yours.
This is a solution:
objectMapper.configure(SerializationConfig.Feature.WRAP_ROOT_VALUE, true);
You can see similar questions:
Use class name as root key for JSON Jackson serialization
Enable Jackson to not output the class name when serializing (using Spring MVC)

Resources