Error 2005: While attempting to GET response from form recognizer - azure-form-recognizer

Currently I'm using Form Recognizer version 2.1 preview to train a custom model. I'm able to test the model in the Form Recognizer labeling tool and get the expected output. But when I pass the same file that worked in the labeling tool to my program, I get the error below.
{"status": "failed", "createdDateTime": "2020-09-25T20:03:21Z", "lastUpdatedDateTime": "2020-09-25T20:03:21Z", "analyzeResult": {"version": "2.1.0", "errors": [{"code": "2005", "message": "The file submitted couldn't be parsed. This can be due to one of the following reasons: the file format is not supported ( Supported formats include JPEG, PNG, BMP, PDF and TIFF), the file is corrupted or password protected."}]}}
The GET request code used is:
resp = requests.get(url=get_url,headers={"Ocp-Apim-Subscription-Key":FORM_RECOGNIZER_SUBSCRIPTION_KEY})

I saved the file to the server, read it from there, and passed the file contents to Form Recognizer. This worked for me, but I don't know why it made a difference.
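For reference, here is a minimal sketch of that workaround: POST the raw file bytes to the analyze endpoint, then poll the returned Operation-Location with the same GET request as above (it reuses the FORM_RECOGNIZER_SUBSCRIPTION_KEY variable from the question). The endpoint, model ID and file path are placeholders, and the exact version segment in the URL depends on which 2.1 preview you are calling.

import time
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
model_id = "<your-custom-model-id>"                                # placeholder
analyze_url = f"{endpoint}/formrecognizer/v2.1/custom/models/{model_id}/analyze"

with open("form.pdf", "rb") as f:   # the file saved on the server
    file_bytes = f.read()

post = requests.post(
    analyze_url,
    data=file_bytes,
    headers={
        "Ocp-Apim-Subscription-Key": FORM_RECOGNIZER_SUBSCRIPTION_KEY,
        "Content-Type": "application/pdf",   # must match the actual file type
    },
)
get_url = post.headers["Operation-Location"]   # URL to poll for the result

resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": FORM_RECOGNIZER_SUBSCRIPTION_KEY})
while resp.json().get("status") not in ("succeeded", "failed"):
    time.sleep(1)
    resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": FORM_RECOGNIZER_SUBSCRIPTION_KEY})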

I also encountered this exact error message following this article.
The article describes these steps:
Train model (requires a SAS URL for the blob container - the whole folder)
Get model result
Analyze (requires a SAS URL for a single file)
Get analyze result
Profit
I got this error on step 4.
After monkeying around, I figured out that the cause is not actually in step 4 but in step 3: I was providing the SAS URL of the blob container instead of the SAS URL of the single file.
After I corrected the SAS URL it works perfectly.
(The original answer included portal screenshots showing how to generate a SAS URL for the blob container and how to generate one for a single file.)
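To make the difference concrete, this is roughly what the Analyze request of step 3 looks like when the document is passed by URL; the SAS URL must point to the individual blob, not to the container. The endpoint, model ID, key and SAS URL below are placeholders.

import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
model_id = "<your-custom-model-id>"                                # placeholder
analyze_url = f"{endpoint}/formrecognizer/v2.1/custom/models/{model_id}/analyze"
file_sas_url = "https://<account>.blob.core.windows.net/<container>/<file>.pdf?<sas-token>"

resp = requests.post(
    analyze_url,
    json={"source": file_sas_url},   # SAS for the single file, not for the whole folder
    headers={"Ocp-Apim-Subscription-Key": "<your-subscription-key>",
             "Content-Type": "application/json"},
)
print(resp.headers.get("Operation-Location"))   # poll this URL for the result (step 4)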

What file are you trying to use as input? Form Recognizer supports PDF, TIFF and images (PNG and JPEG) as input file types for the analyze API. See more details here - https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/python-labeled-data?tabs=v2-0#analyze-forms-for-key-value-pairs-and-tables

Related

How To Upload A Large File (>6MB) To SalesForce Through A Lightning Component Using Apex Aura Methods

I am aiming to take a file a user attaches through a Lightning Component and create a document object containing the data.
So far I have overcome the request size limits by chunking the data being uploaded into 1MB chunks. When the Apex Aura method receives these chunks of data it will either create a new document (if it is the first chunk), or will retrieve the existing document and add the new chunk to the end.
Data is received Base64 encoded, and then decoded server-side.
As the document data is stored as a Blob, the original file contents will be read as a String, and then appended with the chunk received. The new contents are then converted back into a Blob to be stored within the ContentVersion object.
The problem I'm having is that strings in Apex have a maximum length of 6,000,000 or so. Whenever the file size exceeds 6MB, this limit is hit during the concatenation, and will cause the file upload to halt.
I have attempted to avoid this limit by converting the Blob to a String only when necessary for the concatenation (as suggested here: https://developer.salesforce.com/forums/?id=906F00000008w9hIAA) but this hasn't worked. I'm guessing it was patched, because it's still technically allocating a string larger than the limit.
The code for appending is really simple so far:
ContentVersion originalDocument = [SELECT Id, VersionData FROM ContentVersion WHERE Id =: <existing_file_id> LIMIT 1];
Blob originalData = originalDocument.VersionData;
Blob appendedData = EncodingUtil.base64Decode(<base_64_data_input>);
Blob newData = Blob.valueOf(originalData.toString() + appendedData.toString());
originalDocument.VersionData = newData;
You will have a hard time with it.
You could try offloading the concatenation to an asynchronous process (@future/Queueable/Schedulable/Batchable); those get 12 MB of heap instead of 6. That could buy you some time.
You could try cheating by embedding an iframe (Visualforce or a lightning:container tag? Or maybe a "canvas app") that grabs your file and does some manual JavaScript magic, calling the normal REST API for document upload: https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/dome_sobject_insert_update_blob.htm (the last code snippet there is about multiple documents). Maybe jsforce?
Can you upload it somewhere else (SharePoint? Heroku?) and have that system call into SF to push the files (no Apex = no heap size limit; see the sketch after this list)? Or even look up "Files Connect".
Can you send an email with attachments? Crude, but if you write a custom Email-to-Case handler class you'll have 36 MB of heap.
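A minimal sketch of the "call the normal REST API from outside Apex" idea (for example from Heroku or any other external system). The instance URL, access token, API version and field values are placeholders; because this route sends the body as base64 JSON it is still bounded by the REST request size limit, so truly huge files would need the multipart/binary upload described in the linked docs.

import base64
import requests

INSTANCE_URL = "https://yourInstance.my.salesforce.com"   # placeholder
ACCESS_TOKEN = "<oauth-access-token>"                     # placeholder

with open("big_report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    f"{INSTANCE_URL}/services/data/v57.0/sobjects/ContentVersion",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    json={
        "Title": "big_report",
        "PathOnClient": "big_report.pdf",
        "VersionData": encoded,   # base64-encoded file body
    },
)
resp.raise_for_status()
print(resp.json())   # e.g. {'id': '068...', 'success': True, 'errors': []}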
You wrote "we needed multiple files to be uploaded and the multi-file-upload component provided doesn't support all extensions". That may be caused by these documented limitations:
In Experience Builder sites, the file size limits and types allowed follow the settings determined by site file moderation.
lightning-file-upload doesn't support uploading multiple files at once on Android devices.
If the "Don't allow HTML uploads as attachments or document records" security setting is enabled for your organization, the file uploader cannot be used to upload files with the following file extensions: .htm, .html, .htt, .htx, .mhtm, .mhtml, .shtm, .shtml, .acgi, .svg.

Spark Streaming display (streaming) not working

I am following this example to simulate streaming in Spark from a source file. At the end of the example, a function named display is used, which is supported only in Databricks. I run my code in a Jupyter notebook. What is the alternative in Jupyter to get the same output that the display function produces?
(Screenshot of the example omitted.)
Update 1:
The code:
# Source
sourceStream = spark.readStream.format("csv") \
    .option("header", True) \
    .schema(schema) \
    .option("ignoreLeadingWhiteSpace", True) \
    .option("mode", "dropMalformed") \
    .option("maxFilesPerTrigger", 1) \
    .load("D:/PHD Project/Paper_3/Tutorials/HeartTest_1/") \
    .withColumnRenamed("output", "label")

# Stream test data to the ML model
streamingHeart = pModel.transform(sourceStream).select('label')
I do the following:
streamingHeart.writeStream.outputMode("append") \
    .format("csv") \
    .option("path", "D:/PHD Project/Paper_3/Tutorials/sa1/") \
    .option("checkpointLocation", "checkpoint/filesink_checkpoint") \
    .start()
The problem is that the generated files (output files) are empty. What might be the reason behind that?
I solved the problem by changing the checkpoint location, as follows:
    .option("path", "D:/PHD Project/Paper_3/Tutorials/sa1/") \
    .option("checkpointLocation", "checkpoint/filesink_checkpoint_1")

Google speech API v1beta1 (syncrecognize and asyncrecognize API call)

I am a Java developer and I have a couple of questions related to the Google Speech API v1beta1.
Question 1 (syncrecognize case):
I uploaded (through GCS) a small audio file (less than one minute of audio) to the Google Speech API. It works, but the confidence of the output is only 0.32497215, and the result does not exactly match my audio input.
How to increase the confidence level output?
Question 2 (asyncrecognize case):
I tried a big audio file (more than one minute of audio). In this case I used the API call:
https://speech.googleapis.com/v1beta1/speech:asyncrecognize?key=XXXXXXXXXXXXXXXXXXXX
with the payload:
{"config": {"encoding": "LINEAR16", "sample_rate": 16000}, "audio": {"uri": "gs://<bucketName>/<objectName>"}}
Here I got an output JSON like:
{"name": "57...........................95"}
After getting this output I made a new API call (the operations interface) with this name value:
https://speech.googleapis.com/v1beta1/operations/57.................................95?key=XXXXXXXXXXXXXXXXX
I got the output:
{
  "name": "57....................................95",
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1beta1.AsyncRecognizeResponse"
  }
}
How do I proceed with this value? I need to get the transcribed text of the audio.
Please help me fix these issues. Thanks in advance.
Ideas for Question 1:
You should give more details in the RecognitionConfig object: for example, specify the languageCode and add hints via the SpeechContext object.
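A rough sketch of what such a request body could look like for the v1beta1 syncrecognize call. The field names follow the v1beta1 REST reference as best I recall (the question's snake_case spellings are generally accepted too), and the language code and phrases are placeholder examples, so double-check them against the current docs.

{
  "config": {
    "encoding": "LINEAR16",
    "sampleRate": 16000,
    "languageCode": "en-US",
    "speechContext": { "phrases": ["domain specific phrase", "another hint"] }
  },
  "audio": { "uri": "gs://<bucketName>/<objectName>" }
}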
Answer to Question 2:
Check the sample rate of the audio file; you must be sure that it is equal to the rate you gave in the request. You can check it e.g. with soxi audio_file.flac (sox is needed for this one).
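Once the operation actually contains results, extracting the text from the AsyncRecognizeResponse looks roughly like this. This is a sketch using the Python requests library rather than Java, with the operation name and API key as placeholders.

import requests

OPERATION_NAME = "<name-from-the-asyncrecognize-response>"   # placeholder
API_KEY = "<your-api-key>"                                   # placeholder

op = requests.get(
    f"https://speech.googleapis.com/v1beta1/operations/{OPERATION_NAME}",
    params={"key": API_KEY},
).json()

if op.get("done"):
    # AsyncRecognizeResponse: results -> alternatives -> transcript / confidence
    for result in op.get("response", {}).get("results", []):
        for alternative in result.get("alternatives", []):
            print(alternative.get("transcript"), alternative.get("confidence"))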

How to view JSON logs of a managed VM in the Log Viewer?

I'm trying to get JSON formatted logs on a Compute Engine VM instance to appear in the Log Viewer of the Google Developer Console. According to this documentation it should be possible to do so:
Applications using App Engine Managed VMs should write custom log files to the VM's log directory at /var/log/app_engine/custom_logs. These files are automatically collected and made available in the Logs Viewer.
Custom log files must have the suffix .log or .log.json. If the suffix is .log.json, the logs must be in JSON format with one JSON object per line. If the suffix is .log, log entries are treated as plain text.
This doesn't seem to be working for me: logs ending with .log are visible in the Log Viewer, but displayed as plain text. Logs ending with .log.json aren't visible at all.
It also contradicts another recent article that states that file names must end in .log and that their contents are treated as plain text.
As far as I can tell Google uses fluentd to index the log files into the Log Viewer. In the GitHub repository I cannot find any evidence that .log.json files are being indexed.
Does anyone know how to get this working? Or is the documentation out-of-date and has this feature been removed for some reason?
Here is one way to generate JSON logs for the Managed VMs logviewer:
The desired JSON format
The goal is to create a single line JSON object for each log line containing:
{
  "message": "Error occurred!.",
  "severity": "ERROR",
  "timestamp": {
    "seconds": 1437712034000,
    "nanos": 905
  }
}
(information sourced from Google: https://code.google.com/p/googleappengine/issues/detail?id=11678#c5)
Using python-json-logger
See: https://github.com/madzak/python-json-logger
import datetime
import errno
import logging
import os
import time


def make_sure_path_exists(path):
    # Helper assumed by the original snippet: create the custom_logs directory
    # for the given log file path if it does not exist yet.
    try:
        os.makedirs(os.path.dirname(path))
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise


def get_timestamp_dict(when=None):
    """Converts a datetime.datetime to a timestamp dict for the format above.

    Requires special handling to preserve microseconds.

    Args:
        when:
            A datetime.datetime instance. If None, the timestamp for 'now'
            will be used.

    Returns:
        A dict with 'seconds' (integer milliseconds since the epoch) and
        'nanos' keys.
    """
    if when is None:
        when = datetime.datetime.utcnow()
    ms_since_epoch = float(time.mktime(when.utctimetuple()) * 1000.0)
    return {
        'seconds': int(ms_since_epoch),
        'nanos': int(when.microsecond / 1000.0),
    }


def setup_json_logger(suffix=''):
    try:
        from pythonjsonlogger import jsonlogger

        class GoogleJsonFormatter(jsonlogger.JsonFormatter):
            FORMAT_STRING = "{message}"

            def add_fields(self, log_record, record, message_dict):
                super(GoogleJsonFormatter, self).add_fields(log_record,
                                                            record,
                                                            message_dict)
                log_record['severity'] = record.levelname
                log_record['timestamp'] = get_timestamp_dict()
                log_record['message'] = self.FORMAT_STRING.format(
                    message=record.message,
                    filename=record.filename,
                )

        formatter = GoogleJsonFormatter()
        log_path = '/var/log/app_engine/custom_logs/worker' + suffix + '.log.json'
        make_sure_path_exists(log_path)
        file_handler = logging.FileHandler(log_path)
        file_handler.setFormatter(formatter)
        logging.getLogger().addHandler(file_handler)
    except OSError:
        logging.warn("Custom log path not found for production logging")
    except ImportError:
        logging.warn("JSON Formatting not available")
To use it, simply call setup_json_logger; you may also want to change the worker name for your log.
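For example (the suffix here is just an illustrative value; under the naming scheme above the log would be written to /var/log/app_engine/custom_logs/worker-tasks.log.json):

setup_json_logger(suffix='-tasks')
logging.getLogger().setLevel(logging.INFO)
logging.error("Error occurred!")   # written as a single-line JSON object with severity ERROR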
I am currently working on a Node.js app running on a managed VM and I am also trying to get my logs to show up in the Google Developer Console. I created my log files in the /var/log/app_engine directory as described in the documentation. Unfortunately this doesn't seem to be working for me, even for the .log files.
Could you describe where your logs are created? Also, is your managed VM configured as "Managed by Google" or "Managed by User"? Thanks!

Python Google appengine 'Attachment' object does not support indexing

Since sometime after 3pm EST on January 9th I am getting
TypeError: 'Attachment' object does not support indexing
errors when trying to access the data portion of an email attachment:
attach = mail_message.attachments.pop()
encodedAttachment = attach[1]
The format of the emails I am processing has not changed in that time, and this code worked flawlessly up until then.
The latest version (1.8.9) has introduced an Attachment class that is now returned instead of the (filename, content) tuple that was returned previously. The class does implement __iter__, so unpacking works exactly the same:
filename, content = attachment
But it doesn't implement __getitem__, so accessing via index as you're doing will cause the error you're seeing. It's possible that creating an issue will get the code changed to be completely backwards-compatible, but the practical thing would be to change your code.
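Applied to the snippet from the question, the fix would look something like this (a sketch based on the unpacking behaviour described above):

attach = mail_message.attachments.pop()
filename, encodedAttachment = attach   # unpacking works because Attachment implements __iter__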
