openpyxl load_workbook from backend service - google-app-engine

I have some openpyxl code in my backend service (Google App Engine) and I'd like to load a file from Google Cloud Storage / the Blobstore, but passing the file stream (via a BlobReader) doesn't appear to be valid for load_workbook. xlrd has an option to pass the file contents (see "Reading contents of excel file in python webapp2"). Is there something similar for openpyxl?
blobstore_filename = '/gs{}'.format('/mybucket/mycloudstorefilename.xlsx')
blob_key = blobstore.create_gs_key(blobstore_filename)
# Any one of these constructors would do; the last assignment is the one used.
blob_reader = blobstore.BlobReader(blob_key)
blob_reader = blobstore.BlobReader(blob_key, buffer_size=1048576)
blob_reader = blobstore.BlobReader(blob_key, position=0)
blob_reader_data = blob_reader.read()
load_workbook(blob_reader_data)  # fails: load_workbook expects a filename or a file-like object, not raw bytes
The error is:
UnicodeDecodeError
'ascii' codec can't decode byte 0x9d in position 11: ordinal not in range(128)

Found the missing link:
Using openpyxl to read file from memory
I needed to wrap the file contents in a BytesIO so load_workbook receives a file-like object.
from io import BytesIO
...
wb = load_workbook(BytesIO(blob_reader_data))
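Putting it together, a minimal sketch of the whole flow (bucket and object names are placeholders) looks like this:
from io import BytesIO

from google.appengine.ext import blobstore
from openpyxl import load_workbook

# '/gs/<bucket>/<object>' is the path form expected by create_gs_key
blobstore_filename = '/gs/mybucket/mycloudstorefilename.xlsx'
blob_key = blobstore.create_gs_key(blobstore_filename)

# Read the whole object into memory via the Blobstore API
blob_reader = blobstore.BlobReader(blob_key)
blob_reader_data = blob_reader.read()

# load_workbook accepts a file-like object, so wrap the raw bytes in BytesIO
wb = load_workbook(BytesIO(blob_reader_data))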

Related

Save as .PARQUET file in Logic App workflow

How can I save the output file from the "Run query and list results" action in the .PARQUET file format?
This is my current workflow.
My Logic App is working, but the .parquet file it creates is not valid whenever I view it in Apache Parquet Viewer.
Can someone help me with this matter? Thank you!
Output: (screenshot omitted)
I see that you are trying to add .parquet to the CSV file you are receiving, but that is not how it gets converted to a Parquet file.
One workaround you can try is to get the CSV file, add an Azure Function that converts it to a Parquet file, and then call that Azure Function from your Logic App.
Here is the function that worked for me:
// Uses the Azure.Storage.Blobs SDK and the ChoETL Parquet writer
BlobServiceClient blobServiceClient = new BlobServiceClient("<YOUR CONNECTION STRING>");
BlobContainerClient containerClient = blobServiceClient.GetBlobContainerClient("<YOUR CONTAINER NAME>");
BlobClient blobClient = containerClient.GetBlobClient("sample.csv");

// Download the blob
Stream file = File.OpenWrite(@"C:\Users\<USER NAME>\source\repos\ParquetConsoleApp\ParquetConsoleApp\bin\Debug\netcoreapp3.1\" + blobClient.Name);
await blobClient.DownloadToAsync(file);
Console.WriteLine("Download completed!");
file.Close();

// Read the downloaded blob back (Stream has no ReadToEnd, so wrap it in a StreamReader)
Stream file1 = new FileStream(blobClient.Name, FileMode.Open);
Console.WriteLine(new StreamReader(file1).ReadToEnd());
file1.Close();

// Convert to Parquet
ChoParquetRecordConfiguration csv = new ChoParquetRecordConfiguration();
using (var r = new ChoCSVReader(@"C:\Users\<USER NAME>\source\repos\ParquetConsoleApp\ParquetConsoleApp\bin\Debug\netcoreapp3.1\" + blobClient.Name))
{
    using (var w = new ChoParquetWriter(@"C:\Users\<USER NAME>\source\repos\ParquetConsoleApp\ParquetConsoleApp\bin\Debug\netcoreapp3.1\convertedParquet.parquet"))
    {
        w.Write(r);
        w.Close();
    }
}
After this step you can publish your Azure Function and add the Azure Function connector to your Logic App.
You can skip the first two steps (i.e. reading and downloading the blob), get the blob directly from the Logic App, send it to your Azure Function, and follow the same method as above. The generated Parquet file will be at this path:
C:\Users\<USERNAME>\source\repos\ParquetConsoleApp\ParquetConsoleApp\bin\Debug\netcoreapp3.1\convertedParquet.parquet
Here convertedParquet.parquet is the name of the Parquet file. Now you can read the converted file in an Apache Parquet reader.
Here is the output (screenshot omitted):
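As an alternative sketch (not from the answer above), the same CSV-to-Parquet conversion can be done in Python with pandas and pyarrow; the file names are placeholders:
import pandas as pd  # assumes pandas and pyarrow are installed

def csv_to_parquet(csv_path, parquet_path):
    # Read the CSV into a DataFrame and rewrite it in Parquet format
    df = pd.read_csv(csv_path)
    df.to_parquet(parquet_path, engine="pyarrow")

csv_to_parquet("sample.csv", "convertedParquet.parquet")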

Converting the response of a Python get request (jpg content) into a Numpy array

The workflow of my function is the following:
retrieve a jpg through python get request
save the image as png on disk (even though it is downloaded as jpg)
use imageio to read from disk image and transform it into numpy array
work with the array
This is what I do to save:
response = requests.get(urlstring, params=params)
if response.status_code == 200:
    with open('PATH%d.png' % imagenumber, 'wb') as output:
        output.write(response.content)
This is what I do to load the png and transform it into an np.array:
imagearray = im.imread('PATH%d.png' % imagenumber)
Since I don't need to store what I download permanently, I tried to modify my function to transform response.content into a NumPy array directly. Unfortunately, every imageio-like library seems to work the same way, reading a URI from disk and converting it to an np.array.
I tried this, but obviously it didn't work, since imread needs a URI as input:
response = requests.get(urlstring, params=params)
imagearray = im.imread(response.content)
Is there any way to overcome this issue? How can I transform my response.content in a np.array?
imageio.imread is able to read from urls:
import imageio
url = "https://example_url.com/image.jpg"
# image is going to be type <class 'imageio.core.util.Image'>
# that's just an extension of np.ndarray with a meta attribute
image = imageio.imread(url)
You can look for more information in the documentation, they also have examples: https://imageio.readthedocs.io/en/stable/examples.html
You can use BytesIO as a file object to skip writing to an actual file.
# b64decode is only needed if the payload is base64-encoded;
# for a raw JPEG response, BytesIO(response.content) is enough.
bites = BytesIO(base64.b64decode(response.content))
Now you have it as a BytesIO, so you can use it just like a file:
img = Image.open(bites)  # PIL.Image
img_np = np.array(img)
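For reference, a minimal end-to-end sketch that goes straight from the HTTP response to an array without touching disk (imageio accepts a file-like object; the URL is a placeholder):
from io import BytesIO

import imageio
import requests

response = requests.get("https://example.com/image.jpg")  # placeholder URL
response.raise_for_status()

# Wrap the raw bytes in BytesIO and hand the file-like object to imageio
imagearray = imageio.imread(BytesIO(response.content))
print(imagearray.shape, imagearray.dtype)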

Programmatically emulating "gsutil mv" on appengine cloudstorage in python

I would like to implement a mv (copy-in-the-cloud) operation on google cloud storage that is similar to how gsutil does it (http://developers.google.com/storage/docs/gsutil/commands/mv).
I read somewhere earlier that this involves a read and write (download and reupload) of the data, but I cannot find the passages again.
Is this the correct way to move a file in cloud storage, or does one have to go a level down to the boto library to avoid copying the data over the network for renaming the file?
# Stream the object through the app: read from the source, write to the destination
istream = cloudstorage.open(src, mode='r')
ostream = cloudstorage.open(dst, content_type=src_content, mode='w')
while True:
    buf = istream.read(500000)
    if not buf:
        break
    ostream.write(buf)
istream.close()
ostream.close()
Update: I found the REST API, which supports copy and compose operations and much more. So there is hope that we do not have to copy data across continents just to rename something.
Useful links I have found so far:
Boto based approach: https://developers.google.com/storage/docs/gspythonlibrary
GCS Client Lib: https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
GCS Lib: https://code.google.com/p/appengine-gcs-client
RAW JSON API: https://developers.google.com/storage/docs/json_api
Use the JSON API; there is a copy method. Here is the official example for Python, using the Python Google API Client library:
# The destination object resource is entirely optional. If empty, we use
# the source object's metadata.
if reuse_metadata:
    destination_object_resource = {}
else:
    destination_object_resource = {
        'contentLanguage': 'en',
        'metadata': {'my-key': 'my-value'},
    }
req = client.objects().copy(
    sourceBucket=bucket_name,
    sourceObject=old_object,
    destinationBucket=bucket_name,
    destinationObject=new_object,
    body=destination_object_resource)
resp = req.execute()
print json.dumps(resp, indent=2)
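To emulate mv rather than cp, the source object also has to be removed once the copy succeeds. A minimal sketch of that second step, assuming the same client, bucket_name and old_object as in the example above:
# Delete the source object after a successful copy to complete the "move"
delete_req = client.objects().delete(
    bucket=bucket_name,
    object=old_object)
delete_req.execute()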

How can I read a blob that was written to the Blobstore by a Pipeline within the test framework?

I have a pipeline that creates a blob in the blobstore and places the resulting blob_key in one of its named outputs. When I run the pipeline through the web interface I have built around it, everything works wonderfully. Now I want to create a small test case that will execute this pipeline, read the blob out from the blobstore, and store it to a temporary location somewhere else on disk so that I can inspect it. (Since testbed.init_files_stub() only stores the blob in memory for the life of the test).
The pipeline within the test case seems to work fine, and results in what looks like a valid blob_key, but when I pass that blob_key to the blobstore.BlobReader class, it cannot find the blob for some reason. From the traceback, it seems like the BlobReader is trying to access the real blobstore, while the writer (inside the pipeline) is writing to the stubbed blobstore. I have --blobstore_path set up on dev_appserver.py, and I do not see any blobs written to disk by the test case, but when I run it from the web interface, the blobs do show up there.
Here is the traceback:
Traceback (most recent call last):
File "/Users/mattfaus/dev/webapp/coach_resources/student_use_data_report_test.py", line 138, in test_serial_pipeline
self.write_out_blob(stage.outputs.xlsx_blob_key)
File "/Users/mattfaus/dev/webapp/coach_resources/student_use_data_report_test.py", line 125, in write_out_blob
writer.write(reader.read())
File "/Users/mattfaus/Desktop/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/blobstore/blobstore.py", line 837, in read
self.__fill_buffer(size)
File "/Users/mattfaus/Desktop/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/blobstore/blobstore.py", line 809, in __fill_buffer
self.__position + read_size - 1)
File "/Users/mattfaus/Desktop/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/blobstore/blobstore.py", line 657, in fetch_data
return rpc.get_result()
File "/Users/mattfaus/Desktop/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/apiproxy_stub_map.py", line 604, in get_result
return self.__get_result_hook(self)
File "/Users/mattfaus/Desktop/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/blobstore/blobstore.py", line 232, in _get_result_hook
raise _ToBlobstoreError(err)
BlobNotFoundError
Here is my test code:
def write_out_blob(self, blob_key, save_path='/tmp/blob.xlsx'):
    """Reads a blob from the blobstore and writes it out to the file."""
    print str(blob_key)
    # blob_info = blobstore.BlobInfo.get(str(blob_key))  # Returns None
    # reader = blob_info.open()  # Returns None
    reader = blobstore.BlobReader(str(blob_key))
    writer = open(save_path, 'w')
    writer.write(reader.read())
    print blob_key, 'written to', save_path

def test_serial_pipeline(self):
    stage = student_use_data_report.StudentUseDataReportSerialPipeline(
        self.query_config)
    stage.start_test()
    self.assertIsNotNone(stage.outputs.xlsx_blob_key)
    self.write_out_blob(stage.outputs.xlsx_blob_key)
It might be useful if you showed how you finalized the blobstore file, or if you could try that finalization code separately. It sounds like the Files API didn't finalize the file correctly on the dev appserver.
Turns out that I was simply missing the .value property, here:
self.assertIsNotNone(stage.outputs.xlsx_blob_key)
self.write_out_blob(stage.outputs.xlsx_blob_key.value) # Don't forget .value!!
[UPDATE]
The SDK dashboard also exposes a list of all blobs in your blobstore, conveniently sorted by creation date. It is available at http://127.0.0.1:8000/blobstore.
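Independent of the .value fix, a typical testbed setup that lets the Files API writer and blobstore.BlobReader hit the same in-memory stubs looks roughly like this (your pipeline may need additional stubs, e.g. for the task queue):
from google.appengine.ext import testbed

def setUp(self):
    self.testbed = testbed.Testbed()
    self.testbed.activate()
    # Both stubs are needed: the pipeline writes through the Files API,
    # and the test reads the result back with blobstore.BlobReader.
    self.testbed.init_blobstore_stub()
    self.testbed.init_files_stub()

def tearDown(self):
    self.testbed.deactivate()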

Need help with processing attachments with GAE InboundMailHandler

I have properly implemented InboundMailHandler and I'm able to process all the other mail_message fields except mail_message.attachments. The attachment filename is read properly, but the contents are not being saved with the proper mime_type.
if not hasattr(mail_message, 'attachments'):
    raise ProcessingFailedError('Email had no attached documents')
else:
    logging.info("Email has %i attachment(s) " % len(mail_message.attachments))
    for attach in mail_message.attachments:
        filename = attach[0]
        contents = attach[1]
        # Create the file
        file_name = files.blobstore.create(mime_type="application/pdf")
        # Open the file and write to it
        with files.open(file_name, 'a') as f:
            f.write(contents)
        # Finalize the file. Do this before attempting to read it.
        files.finalize(file_name)
        # Get the file's blob key
        blob_key = files.blobstore.get_blob_key(file_name)
        return blob_key

blob_info = blobstore.BlobInfo.get(blob_key)
When I try to display the imported pdf file by going to the url: '/serve/%s' % blob_info.key()
I get a page with what seems like encoded data, instead of the actual pdf file.
Looks like this:
From nobody Thu Aug 4 23:45:06 2011 content-transfer-encoding: base64 JVBERi0xLjMKJcTl8uXrp/Og0MTGCjQgMCBvYmoKPDwgL0xlbmd0aCA1IDAgUiAvRmlsdGVyIC9G bGF0ZURlY29kZSA+PgpzdHJlYW0KeAGtXVuXHLdxfu9fgSef2RxxOX2by6NMbSLalOyQK+ucyHpQ eDE3IkWKF0vJj81vyVf3Qu9Mdy+Z40TswqKAalThqwJQjfm1/Hv5tWzxv13blf2xK++el+/LL+X+ g/dtefq
Any ideas? Thanks
Email's attachments are EncodedPayload objects; to get the data you should call the decode() method.
Try with:
# Open the file and write to it
with files.open(file_name, 'a') as f:
    f.write(contents.decode())
If you want attachments larger than 1 MB to be processed successfully, decode, convert to str, and write in chunks:
# decode and convert to string
datastr = str(contents.decode())
with files.open(file_name, 'a') as f:
    f.write(datastr[0:65536])
    datastr = datastr[65536:]
    while len(datastr) > 0:
        f.write(datastr[0:65536])
        datastr = datastr[65536:]
Found the answer in this excellent blog post:
http://john-smith.appspot.com/app-engine--what-the-docs-dont-tell-you-about-processing-inbound-mail
This is how to decode an email attachment for GAE inbound mail:
for attach in mail_message.attachments:
    filename, encoded_data = attach
    data = encoded_data.payload
    if encoded_data.encoding:
        data = data.decode(encoded_data.encoding)
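For completeness, a rough sketch of the whole pattern inside an inbound mail handler on the legacy Python 2 runtime; the handler class name is illustrative, while the Files API calls and the application/pdf mime type follow the question's code:
from google.appengine.api import files
from google.appengine.ext.webapp.mail_handlers import InboundMailHandler

class AttachmentHandler(InboundMailHandler):
    def receive(self, mail_message):
        for attach in getattr(mail_message, 'attachments', []):
            filename, encoded_data = attach
            # EncodedPayload.decode() applies the transfer encoding (e.g. base64)
            contents = encoded_data.decode()
            file_name = files.blobstore.create(mime_type='application/pdf')
            with files.open(file_name, 'a') as f:
                f.write(contents)
            files.finalize(file_name)
            blob_key = files.blobstore.get_blob_key(file_name)
            # store or log blob_key as needed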
