How to get the MD5 of an Object that has already been stored? - md5

I have lots of Objects already stored in GCS. Now I want to get their MD5 hashes that GCS has already calculated, from my Google App Engine application using the Google Cloud Storage Client Library, but I can't find anything that exposes it.
I know that the ETag contains the MD5 in the XML api for backwards compatibility, but it says don't rely on that being there for any other APIs.
I know that I can get to them with the JSON api, but I don't want to write up all that HTTPClient and Auth code and JSON parsing to get to a single attribute.
I know I can also calculate the MD5 at upload time but these files are already uploaded, and reading them all again to just calculate the MD5 seems wasteful.

The MD5 hash is not available for every object. While most objects have an MD5 hash, many objects do not. For example, objects that have been created using object composition will often not have a recorded MD5 hash.
The XML API and the JSON API both expose the MD5 hash, where one is available. If you don't want to mess with HTTP calls, you could use the Google API client for Java to speak to the GCS JSON API. Here are instructions for setting it up on App Engine: https://code.google.com/p/google-api-java-client/wiki/GoogleAppEngine#Getting_Started
You can use the AppIdentityCredential class to use the credentials of your app to access Google Cloud Storage objects, and when you do, the MD5 is in the "md5Hash" property of the object. Please note that it is a base64 encoding rather than a hexadecimal encoding.

Related

Loading TensorFlow.js model from File Server

I am trying to load Tensorflow.js model via HTTP protocol. Tensorflow.js requires me to store 'model.json' and 'weights.bin' files in the same folder. But I can only call 'model.json' as a parameter. It refers to the binary file by itself. That is how it works as far as I know.
For now, in the local environment, I am loading the model from the localhost(Http://127.0.0.1:8080) and it works fine.
However, the actual application accepts HTTPS protocol only. So I have tried to store them with models and weights in the same buckets in S3 and called via Lambda but it seems like only 'model.json' is retrieved. I am thinking of using EC2 instances where the Python Flask server is running but it seems like the same that only model.json is retrieved, not binary files.
Is there any way that I can retrieve 'model.json' with referring to the weight file? Is there anyway to host file server remotely with HTTPS protocol?
TFJS downloads model JSON, parses it and uses whatever paths are specified in the JSON - you can edit that file and set any URL you want for weights.
Alternatively, you can also use lower-level methods to load weights manually (in case you want to have a custom loader, etc.), but leave that for future until you're more comfortable with TFJS.

API to Database?

Please presume that I do not know anything about any of the things I will be mentioning because I really do not.
Most OpenData sites have the possibility of exporting the presented file either in for example .csv or .json formats (Example). They also always have an API tab (Example API).
I presume using the API would mean that if the data is updated you would receive the change whereas exporting it as .csv would mean the content will not be changed anymore.
My questions is: how does one use this API code to display the same table one would get when exporting a .csv file.
Would you use a database to extract this information? What kind of database and how do you link the API to the database?
I presume using the API would mean that if the data is updated you
would receive the change whereas exporting it as .csv would mean the
content will not be changed anymore.
You are correct in the sense that, if you download the csv to your computer, that csv file won't be updated any more.
An API is something you would call - in this case, you can call the API, saying "Hey, do you have the latest data on xxx?", and you will be given back the latest information about what you have asked. This does not mean though, that this site will notify you when there's a new update - you will have to keep calling the API (every hour, every day etc) to see if there are any changes.
My questions is: how does one use this API code to display the same
table one would get when exporting a .csv file.
You would:
Call the API from a server code, or a cloud service
Let the server code or cloud service decipher (or "Parse") the response
Use the deciphered response to create a table made out of HTML, or to place it into a database
Would you use a database to extract this information? What kind of
database and how do you link the API to the database?
You wouldn't necessarily need a database to extract information, although a database would be nice to place the final data inside.
You would first need some sort of way to "call the REST API". There are many ways to do this - using Shell Script, using Python, using Excel VBA etc.
I understand this is hard to visualize, so here is an example of step 1, where you can retrieve information.
Try placing in the below URL (taken from the site you showed us) in your address bar of your Chrome browser, and hit enter
http://opendata.brussels.be/api/records/1.0/search/?dataset=associations-clubs-sportifs
See how it gives back a lot of text with many brackets and commas? You've basically asked the site to give you some data, and this is the response they gave back (different browsers work differently - IE asks you to download the response as a .json file). You've basically called an API.
To see this data more cleanly, open your developer tools of your Chrome browser, and enter the following JavaScript code
var url = 'http://opendata.brussels.be/api/records/1.0/search/?dataset=associations-clubs-sportifs';
var xhr = new XMLHttpRequest();
xhr.open('GET', url);
xhr.setRequestHeader('X-Requested-With', 'XMLHttpRequest');
xhr.onload = function() {
if (xhr.status === 200) {
// success
console.log(JSON.parse(xhr.responseText));
} else {
// error
console.log(JSON.parse(xhr.responseText));
}
};
xhr.send();
When you hit enter, a response will come back, stating "Object". If you click through the arrows, you can see this is a cleaner version of the data we just saw - more human readable.
In this case, I used JavaScript to retrieve the data, but you can use whatever code you want. You could proceed to use JavaScript to decipher the data, manipulate it, and push it into a database.
kintone is an online cloud database where you can customize it to run JavaScript codes, and have it store the data in their database, so you'll have the data stored online like in the below image. This is just one example of a database you can use.
There are other cloud services which allow you to connect API end points of different services with each other, like IFTTT and Zapier, but I'm not sure if they connect with open data.
The page you linked to shows that the API returns values as a JSON object. To access the data you can just send an appropriate http request and the response will be the requested data as a JSON. You can send requests like that over your browser if you want to.
Most languages allow JSON objects to be manipulated pro grammatically if you need to do work on the data.
Restful APIs publish model is "request and publish". Wen you request data via an API endpoint, you would receive response strings in JSON objects, CSV tables or XML.
The publisher, in this case Opendata.brussel.be would update their database on regular basis and publish the results via an API endpoint.
If you want to download the table as a relational data table in a CSV file, you'd need to parse the JSON objects into relational tables. This can be tricky since each JSON response string can vary in their paths.
There're several ways to do it. You can either write scripts to flatten the JSON objects or use a tool to parse and flatten the objects for you.
I use a tool called Acho to turn API endpoints into CSV files. It would parse almost all API endpoints through the parameters and even configure for multiple requests, such as iterative and recursive requests.
Acho API parser

Limit upload size for appengine interface to cloud store

Consider an image (avatar) uploader to Google Cloud Storage which will start from the user's web browser, and then pass through a Go appengine instance which will handle standard compression/cropping etc. and then set the resulting image as an object in Cloud Storage
How can I ensure that the appengine instance isn't overloaded by too much or bad data? In other words, I think I'm asking two questions (or possibly not):
How can I limit the amount of data allowed to be sent to an appengine instance in a single request, or is there already a default safe limit?
How can I validate the data to make sure it's proper jpg/png/gif before attempting to process it with standard go image libraries?
All App Engine requests are limited to 32MB.
You can check the size of the file being uploaded before the upload starts.
You can verify the file's mime-type and only allow correct files to be uploaded.

Decode an App Engine Blobkey to a Google Cloud Storage Filename

I've got a database full of BlobKeys that were previously uploaded through the standard Google App Engine create_upload_url() process, and each of the uploads went to the same Google Cloud Storage bucket by setting the gs_bucket_name argument.
What I'd like to do is be able to decode the existing blobkeys so I can get their Google Cloud Storage filenames. I understand that I could have been using the gs_object_name property from the FileInfo class, except:
You must save the gs_object_name yourself in your upload handler or
this data will be lost. (The other metadata for the object in GCS is stored
in GCS automatically, so you don't need to save that in your upload handler.
Meaning gs_object_name property is only available in the upload handler, and if I haven't been saving it at that time then its lost.
Also, create_gs_key() doesn't do the trick because it instead takes a google storage filename and creates a blobkey.
So, how can I take a blobkey that was previously uploaded to a Google Cloud Storage bucket through app engine, and get it's Google Cloud Storage filename? (python)
You can get the cloudstorage filename only in the upload handler (fileInfo.gs_object_name) and store it in your database. After that it is lost and it seems not to be preserved in BlobInfo or other metadata structures.
Google says: Unlike BlobInfo metadata FileInfo metadata is not
persisted to datastore. (There is no blob key either, but you can
create one later if needed by calling create_gs_key.) You must save
the gs_object_name yourself in your upload handler or this data will
be lost.
https://developers.google.com/appengine/docs/python/blobstore/fileinfoclass
Update: I was able to decode a SDK-BlobKey in Blobstore-Viewer: "encoded_gs_file:base64-encoded-filename-here". However the real thing is not base64 encoded.
create_gs_key(filename, rpc=None) ... Google says: "Returns an encrypted blob key as a string." Does anyone have a guess why this is encrypted?
From the statement in the docs, it looks like the generated GCS filenames are lost. You'll have to use gsutil to manually browse your bucket.
https://developers.google.com/storage/docs/gsutil/commands/ls
If you have blobKeys you can use: ImagesServiceFactory.makeImageFromBlob

ndb.BlobProperty vs BlobStore: which is more private and more secure

I have been reading all over stackoverflow concerning datastore vs blobstore for storing and retrieving image files. Everything is pointing towards blobstore except one: privacy and security.
In the datastore, the photos of my users are private: I have full control on who gets a blob. In the blobstore, however, anyone who knows the url can conceivable access my users photos? Is that true?
Here is a quote that is supposed to give me peace of mind, but it's still not clear. So anyone with the blob key can still access the photos? (from Store Photos in Blobstore or as Blobs in Datastore - Which is better/more efficient /cheaper?)
the way you serve a value out of the Blobstore is to accept a request
to the app, then respond with the X-AppEngine-BlobKey header with the
key. App Engine intercepts the outgoing response and replaces the body
with the Blobstore value streamed directly from the service. Because
app logic sets the header in the first place, the app can implement
any access control it wants. There is no default URL that serves
values directly out of the Blobstore without app intervention.
All of this is to ask: Which is more private and more secure for trafficking images, and why: datastore or blobstore? Or, hey, google-cloud-storage (which I know nothing about presently)
If you use google.appengine.api.images.get_serving_url then yes, the url returned is public. However the url returned is not guessable from a blob's key, nor does the url even exist before calling get_serving_url. (Or after calling delete_serving_url).
If you need access control on top of the data in the blobstore you can write your own handlers and add the access control there.
BlobProperty is just as private and secure as BlobStore, all depends on your application which serves the requests. your application can implement any permission checking before sending the contents to the user, so I don't see any difference as long as you serve all the images yourself and don't intentionally create publicly available URLs.
Actually, I would not even thinlk about storing photos in the BlobProperty, because this way the data ends up in the database instead of the BlobStore and it costs significantly more to store data in the database. BlobStore, on the other hand, is cheap and convenient.

Resources