How to MapReduce over a Google Cloud Storage file? - google-app-engine

From the App Engine MapReduce console (myappid.appspot.com/mapreduce/status)
I have a MapReduce defined with input_reader: mapreduce.input_readers.BlobstoreLineInputReader.
I have used it successfully with a regular Blobstore file, but it doesn't work with a blob key created from Cloud Storage with create_gs_key. When I run it, I get the error "BadReaderParamsError: Could not find blobinfo for key THEKEY". The input reader checks for the existence of a BlobInfo. Is there any workaround for this? Shouldn't BlobInfo.get(BLOBKEY FROM CS) return a BlobInfo?
To get a blob_key from a Google Cloud Storage file, I run this:
from google.appengine.ext import blobstore
READ_PATH = '/gs/mybucket/myfile.json'
blob_key = blobstore.create_gs_key(READ_PATH)
print blob_key

A community member created a LineInputReader for Cloud Storage as an issue on the appengine-mapreduce library: http://code.google.com/p/appengine-mapreduce/issues/detail?id=140
We've posted our modifications here: https://github.com/thinkjson/CloudStorageLineInputReader
We're using this to do MapReduce over about 4TB of data, and have been happy with it so far.

Cloud Storage and Blobstore are two different storage systems; you can't pass a Cloud Storage key as a Blobstore key.
You will need to implement your own line reader over the Cloud Storage file; a rough sketch follows below.
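For reference, a minimal sketch of such a reader's core loop, using the App Engine Python GCS client library (cloudstorage), might look like the following; the bucket and file name are placeholders, and this is not the reader from the issue linked above:

import cloudstorage as gcs

def read_lines(gcs_filename='/mybucket/myfile.json'):
    # gcs.open expects a '/bucket/object' path (no '/gs' prefix) and returns
    # a file-like buffer that supports readline().
    gcs_file = gcs.open(gcs_filename)
    try:
        line = gcs_file.readline()
        while line:
            yield line.rstrip('\n')
            line = gcs_file.readline()
    finally:
        gcs_file.close()

A full input reader would additionally need to split the file into shards and track byte offsets for checkpointing, which is what the CloudStorageLineInputReader linked above does.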

Related

google cloud storage: access cloud storage and provide download link for users

In my development server, I am able to use the blob key to download a CSV object. The problem is that in production, the blob key does not download anything (it returns a 404), presumably because the blob key is inaccurate. I think this is because of Google's deprecation of the Blobstore, which means blob keys are no longer used. This means I need to try to download from the Google Storage bucket. I am not sure how to do this; in the development server, I would go to the endpoint /data?key=<blob_key> to download the blob object.
I can also download the CSV object if I navigate to the bucket and to the item and then click download. Is there some minor adjustment I can make to get the download to occur? I would appreciate it if someone could point me in a particular direction.
To download objects from your Cloud Storage buckets, depending on your preferences, you can use the following code sample (Python):
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
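For example, with a hypothetical bucket and object name:

download_blob("my-bucket", "reports/data.csv", "/tmp/data.csv")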
Be sure that you are no longer using Python 2.7, since it is deprecated and no longer supported. If you are still on Python 2.7, please upgrade to Python 3.7 or later.

Reading Play Store CSV review files from a Google Storage bucket using Java App Engine

This problem has stumped me for the better part of my day.
BACKGROUND
I am attempting to read the Play Store reviews for my apps via my own Google App Engine Java project.
Now I am able to get the list of all the files using the Google Cloud Storage client API (Java).
I can also read the metadata for each of the CSV files in that bucket and print it to the logs.
PROBLEM
I simply can't find a way to read the actual object and get the CSV data.
My Java code snippet:
String BUCKET_NAME = "pubsite_prod_rev_*******";
String objectFileName = "reviews/reviews_*****_***.csv";
Storage.Objects.Get obj = client.objects().get(BUCKET_NAME, objectFileName);
InputStream is = obj.executeMediaAsInputStream();
Now when I print this InputStream, it tells me it's a GZIPInputStream (java.util.zip.GZIPInputStream#f0be2c). Converting this InputStream to byte[] or String (desired) does not work.
And if I try to wrap it in a GZIPInputStream using:
GZIPInputStream zis = new GZIPInputStream(is);
it throws ZipException: Not in GZIP format.
Metadata of the file:
"contentType": "text/csv; charset=utf-16le",
"contentEncoding": "gzip",
What am I doing wrong?
Sub-question: In the past I have successfully read text data from Google Cloud Storage using GcsService, but it does not seem to work with the buckets that contain the Play Store review CSV files. Does anybody know whether my Google App Engine project (connected to the same Google developer account) can read these buckets?
Solved it using executeMedia() and parseAsString().
HttpResponse response = obj.executeMedia();
response.parseAsString(); //works!!

How to upload media (image, video) to Google Cloud Storage with Java

I am now facing a problem when uploading media files to Google Cloud Storage using the Google Cloud Storage Java API. To be more specific, the GCS Java API example only helps us upload text files to a Google bucket and is not useful for media files. I also saw in this discussion: Written file not showing up in Google Cloud Storage, that the team suggests using the gsutil tool, which is written in Python. I am not using the Blobstore either.
My question is how I can meet the following requirements:
-Creating and deleting buckets.
-Uploading, downloading, and deleting objects (such as media files).
-Listing buckets and objects.
-Moving, copying, and renaming objects.
by implementing them in Java?
I thank you very much for your time and look forward to hearing from you.
Upload/Download files: you can use the Blobstore API, which can be configured to store blobs in Google Cloud Storage by specifying your bucket in BlobstoreService's createUploadUrl. Similarly, to download, you can create a blob key with createGsBlobKey from the bucket name + object name, which can then be served by the Blobstore service.
Create/Delete buckets: the Google Cloud Storage Java Client Library does not offer a way to create/delete buckets. You will need to use the Google Cloud Storage REST API to create and delete them programmatically. Though, you might want to consider organizing your data within one bucket.
Moving, copying, and renaming objects: make use of the Google Cloud Storage Java Client Library
I faced a similar requirement when I had to deal with all sorts of documents including media objects such as images and videos. This is the implementation I followed based on the official documentation of Google Cloud Examples project on GitHub:
Source Link
To Upload a file
public boolean uploadFile(String filePath, byte[] file) {
    try {
        setDefaultStorageCredentials();
        storage.create(BlobInfo.newBuilder(bucketName, filePath).build(),
                new ByteArrayInputStream(file));
        return true;
    } catch (Exception e) {
        return false;
    }
}
To download a file
public byte[] downloadFile(String filePath) throws FileNotFoundException, IOException {
    setDefaultStorageCredentials();
    return storage.get(bucketName).get(filePath).getContent();
}
To delete a file
public boolean deleteFile(String filePath) {
    setDefaultStorageCredentials();
    return storage.delete(storage.get(bucketName).get(filePath).getBlobId());
}
To provide temporary access to a file using a signed URL
public String getTemporaryFileLink(String filePath) throws Exception {
    setDefaultStorageCredentials();
    Blob blob = storage.get(bucketName).get(filePath);
    String blobName = blob.getName();
    URL signedUrl = storage.signUrl(BlobInfo.newBuilder(bucketName, blobName).build(),
            5, TimeUnit.MINUTES);
    return signedUrl.toExternalForm();
}
Most of these methods are mentioned in this Google GitHub project. I just removed the clutter in my implementation. Hope this helps.

Decode an App Engine Blobkey to a Google Cloud Storage Filename

I've got a database full of BlobKeys that were previously uploaded through the standard Google App Engine create_upload_url() process, and each of the uploads went to the same Google Cloud Storage bucket by setting the gs_bucket_name argument.
What I'd like to do is be able to decode the existing blob keys so I can get their Google Cloud Storage filenames. I understand that I could have been using the gs_object_name property from the FileInfo class, except:
You must save the gs_object_name yourself in your upload handler or this data will be lost. (The other metadata for the object in GCS is stored in GCS automatically, so you don't need to save that in your upload handler.)
Meaning the gs_object_name property is only available in the upload handler, and if I haven't been saving it at that time then it's lost.
Also, create_gs_key() doesn't do the trick because it instead takes a Google Storage filename and creates a blob key.
So, how can I take a blob key that was previously uploaded to a Google Cloud Storage bucket through App Engine, and get its Google Cloud Storage filename? (Python)
You can get the Cloud Storage filename only in the upload handler (fileInfo.gs_object_name) and store it in your database; a sketch of such a handler follows below the quoted documentation. After that it is lost, and it seems not to be preserved in BlobInfo or other metadata structures.
Google says: Unlike BlobInfo metadata, FileInfo metadata is not persisted to datastore. (There is no blob key either, but you can create one later if needed by calling create_gs_key.) You must save the gs_object_name yourself in your upload handler or this data will be lost.
https://developers.google.com/appengine/docs/python/blobstore/fileinfoclass
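As an illustration only (the model and handler names below are hypothetical, not part of the original answer), a minimal webapp2 upload handler that persists gs_object_name at upload time could look like this:

from google.appengine.ext import blobstore, ndb
from google.appengine.ext.webapp import blobstore_handlers

class UploadedFile(ndb.Model):
    # Hypothetical model that keeps the GCS object name next to the blob key
    blob_key = ndb.BlobKeyProperty()
    gs_object_name = ndb.StringProperty()

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        file_info = self.get_file_infos()[0]   # FileInfo for this upload
        gs_name = file_info.gs_object_name     # e.g. '/gs/bucket/object'
        key = blobstore.BlobKey(blobstore.create_gs_key(gs_name))
        UploadedFile(blob_key=key, gs_object_name=gs_name).put()
        self.redirect('/')

With the gs_object_name stored alongside the blob key, the Cloud Storage filename can be looked up later instead of having to be reconstructed from the key.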
Update: I was able to decode an SDK blob key in the Blobstore viewer: "encoded_gs_file:base64-encoded-filename-here". However, the real thing is not base64-encoded.
create_gs_key(filename, rpc=None) ... Google says: "Returns an encrypted blob key as a string." Does anyone have a guess why this is encrypted?
From the statement in the docs, it looks like the generated GCS filenames are lost. You'll have to use gsutil to manually browse your bucket.
https://developers.google.com/storage/docs/gsutil/commands/ls
If you have blob keys, you can use: ImagesServiceFactory.makeImageFromBlob

Location of GS File in Local/Dev AppEngine

I'm trying to troubleshoot some issues I'm having with an export task I have created. I'm attempting to export CSV data using Google Cloud Storage and I seem to be unable to export all my data. I'm assuming it has something to do with the (far too low) 30-second file limit when I attempt to restart the task.
I need to troubleshoot, but I can't seem to find where my local/development server is writing the files out. I see numerous entries in the GsFileInfo table, so I assume something is going on, but I can't seem to find the actual output file.
Can someone point me to the location of the Google Cloud Storage files in the local AppEngine development environment?
Thanks!
Looking at the dev_appserver code, it looks like you can specify a path, or it will calculate a default based on the OS you are using.
blobstore_path = options.blobstore_path or os.path.join(storage_path,
                                                        'blobs')
Then it passes this path to blobstore_stub (GCS storage is backed by the blobstore stub), which seems to shard files by their blobstore key.
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.

    This method does not check to see if the file actually exists.

    Args:
      blob_key: Blob key of blob to calculate file for.

    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    return os.path.join(self._DirectoryForBlob(blob_key), str(blob_key)[1:])
For example, I'm using Ubuntu and started with dev_appserver.py --storage_path=~/tmp; then I was able to find files under ~/tmp/blobs and the datastore under ~/tmp/datastore.db. Alternatively, you can go to the local admin console; the blobstore viewer link will also display GCS files.
As tkaitchuck mentions above, you can use the included LocalRawGcsService to pull the data out of the local.db. This is the only way to get the files, as they are stored in the local DB using the blobstore. Here's the original answer:
which are the files uri on GAE java emulating cloud storage with GCS client library?
