Search for files in an S3 bucket by filename

Suppose I upload a file to a bucket that already contains about 1,000 files. I would now like to search by filename to check whether a given file is in the bucket. Help is appreciated, as I was unable to find any documentation on this. If anyone has tried this, please post your comments :)

There is no search functionality in the S3 API. If you know the exact name of the file, you can issue a HEAD request for that object in the S3 bucket (the boto lookup() method does this); if you get a 200 response back from the server, the file is there, and if you get a 404, it is not.
If you don't know the exact name of the file you are looking for, all you can really do is list the contents of the bucket until you find it. This is very inefficient, and if you need to do it on a regular basis, I would recommend storing the filenames in a separate database that allows you to search efficiently.
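A minimal sketch of both approaches, assuming boto3 (the current successor to the boto library mentioned above); the bucket and key names are placeholders:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def key_exists(bucket, key):
    # HEAD the object: 200 means it exists, 404 means it does not
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response['ResponseMetadata']['HTTPStatusCode'] == 404:
            return False
        raise  # some other problem (permissions, throttling, ...)

def keys_with_prefix(bucket, prefix):
    # fall back to a (paginated) listing when the exact key is unknown
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

print(key_exists('my-bucket', 'reports/2020-01-01.csv'))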

Related

Google App Engine: searching buckets to find a particular content_type, text/csv

I have multiple buckets and I would like to find the buckets that store CSV files. I do not know how to search the buckets to find what I need. Is there a method to query the buckets and only find objects with content type "text/csv"? Ultimately I am attempting to find the CSV files' blobkey that begins with "encoded_gs_file:". Also, what is the relationship between the datastore and storage?
The blobstore viewer that I am running on localhost only shows the encoded_gs_file for images, but I know that there should be an encoded_gs_file for the CSV files as well.
When I visit the following URL:
http://localhost:8000/datastore?kind=__GsFileInfo__
I can see the CSV file type, but when I go to this URL:
http://localhost:8000/datastore?kind=__BlobInfo__
the CSV file does not appear. I think if I can get the CSV file to appear under __BlobInfo__, then I can download it.
There is no specific method to search objects in a bucket, but you can combine different API methods; for example, using the JSON API:
1. List all the buckets in your project: https://cloud.google.com/storage/docs/json_api/v1/buckets/list?apix_params=%7B%22project%22%3A%22edp44591%22%7D
2. Then, having the list of buckets, list all the objects in each one: https://cloud.google.com/storage/docs/json_api/v1/objects/list
3. Once you have the list of objects inside a bucket, filter it with your preferred programming language (a Python sketch of these three steps is shown below).
You can do the same with the XML API; here is the reference for it:
https://cloud.google.com/storage/docs/xml-api/reference-methods
Or use the gsutil tool:
gsutil list : lists all the buckets in your project: https://cloud.google.com/storage/docs/listing-buckets
gsutil ls -r gs://[BUCKET_NAME]/** : lists all the objects inside the bucket:
https://cloud.google.com/storage/docs/listing-objects
If you want to see examples of how to use the API from different languages, see the Cloud Storage Client Libraries documentation: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-nodejs
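For reference, here is a rough Python sketch of the three JSON-API steps above using the google-cloud-storage client library; the project and credentials come from your environment, and the content-type filter is the part to adapt:
from google.cloud import storage

client = storage.Client()  # uses your default project and credentials

csv_objects = []
for bucket in client.list_buckets():         # step 1: list the buckets
    for blob in client.list_blobs(bucket):   # step 2: list objects in each bucket
        if blob.content_type == 'text/csv':  # step 3: filter client-side
            csv_objects.append((bucket.name, blob.name))

for bucket_name, blob_name in csv_objects:
    print('gs://{}/{}'.format(bucket_name, blob_name))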

google-appengine: finding out which bucket data is being stored in

I have CSV objects that I can download with a blobkey. The endpoint looks like this: /data?key=[blobkey]. If I go into the datastore, find the CSV's key and substitute it for the blobkey, I get the CSV downloaded. The problem occurs when trying to generate new CSV files. The bucket I am saving the new CSV files to is one I created just recently. What should happen is that I get a blobkey that I can substitute into the endpoint to download a CSV object, but I do not get anything.
My thinking is that I am just saving it into the wrong bucket. I should try to save it into the original bucket, where the other CSV objects are, but I do not know how to search for those objects. I do know for sure that the CSV exists, because I can navigate to the bucket and download it from the URL https://storage.cloud.google.com/.../my.csv
My question is: how do I find the location of the bucket where the previous CSVs were saved?
How I am downloading the CSV on the development server:
from google.appengine.ext import blobstore
from google.appengine.ext.blobstore import BlobKey

# obtain a blobkey for the GCS object
blob_key = BlobKey(blobstore.create_gs_key(u'/gs' + gcs_filename))
# pass the blobkey to the endpoint
/data?key=blob_key
The bucket I am saving to is:
/user-exports/user_key/timestamp/mydata.csv

Uploading to Google Cloud Storage using Blobstore: Blobstore doesn't retain file name upon upload

I'm trying to upload to GCS using the Blobstore. I have set the GCS bucket name while generating the upload url, and the file gets uploaded successfully.
In the upload handler, blobInfo.getFilename() returns the right file name, but the file actually got saved in the GCS bucket under a different name. Each time, the name is a random hash like this one:
L2FwcGhvc3RpbmdfcHJvZC9ibG9icy9BRW5CMlVvbi1XNFEyWEJkNGlKZHNZRlJvTC0wZGlXVS13WTF2c0g0LXdzcEVkaUNEbEEyc3daS3Vham1MVlZzNXlCSk05ZnpKc1RudDJpajF1TmxwdWhTd2VySVFLdUw3US56ZXFHTEZSLVoxT3lablBI
Is this how it will work? Is this an anomaly?
I store the file name in the datastore based on the value returned from blobInfo.getFilename(), which is the correct file name. But I'm unable to access the file using the GcsFilename, since the file is stored in GCS with that random hash as its name.
Any pointers would be greatly helpful.
Thanks!
PS: The blobstore page says that BlobInfo is currently not available for GCS objects, but BlobInfo.getFilename returns the right value for me. Is something wrong on my end?
That's how it works; see https://cloud.google.com/appengine/docs/python/blobstore/fileinfoclas ...:
FileInfo metadata is not persisted to datastore [...] You must save the gs_object_name yourself in your upload handler or this data will be lost
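A rough sketch (not from the docs linked above) of persisting gs_object_name in a webapp2 upload handler; the FileRecord model and the handler/redirect names are made up for illustration:
from google.appengine.ext import ndb
from google.appengine.ext.webapp import blobstore_handlers

class FileRecord(ndb.Model):
    # hypothetical model for remembering where each upload went
    filename = ndb.StringProperty()
    gs_object_name = ndb.StringProperty()

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # FileInfo (not BlobInfo) carries the GCS object name; save it here,
        # because it is not persisted to the datastore for you
        file_info = self.get_file_infos()[0]
        FileRecord(filename=file_info.filename,
                   gs_object_name=file_info.gs_object_name).put()
        self.redirect('/done')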
I personally recommend that new applications use https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/ directly, rather than the blobstore emulation on top of it.
The latter is currently provided essentially only for (limited, partial) backwards compatibility: it's not really all that suitable for new applications.
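If you do go with the GCS client library directly, a minimal write looks roughly like this (the bucket and object names are placeholders), with no blobstore upload URL involved at all:
import cloudstorage as gcs

# write an object straight to GCS under the name you choose
with gcs.open('/my-bucket/reports/my.csv', 'w', content_type='text/csv') as f:
    f.write('col_a,col_b\n1,2\n')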

Location of GS File in Local/Dev AppEngine

I'm trying to troubleshoot some issues I'm having with an export task I have created. I'm attempting to export CSV data using Google Cloud Storage, and I seem to be unable to export all my data. I'm assuming it has something to do with the (FAR TOO LOW) 30-second file limit when I attempt to restart the task.
I need to troubleshoot, but I can't seem to find where my local/development server is writing the files out. I see numerous entries in the GsFileInfo table, so I assume something is going on, but I can't seem to find the actual output file.
Can someone point me to the location of the Google Cloud Storage files in the local AppEngine development environment?
Thanks!
Looking at the dev_appserver code, it looks like you can specify a path, or it will calculate a default based on the OS you are using.
blobstore_path = options.blobstore_path or os.path.join(storage_path, 'blobs')
It then passes this path to blobstore_stub (GCS storage is backed by the blobstore stub), which seems to shard files by their blobstore key.
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.

    This method does not check to see if the file actually exists.

    Args:
      blob_key: Blob key of blob to calculate file for.

    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    return os.path.join(self._DirectoryForBlob(blob_key), str(blob_key)[1:])
For example, I'm using Ubuntu and started with dev_appserver.py --storage_path=~/tmp; I was then able to find the files under ~/tmp/blobs and the datastore under ~/tmp/datastore.db. Alternatively, you can go to the local admin console; the blobstore viewer link will also display GCS files.
As tkaitchuck mentions above, you can use the included LocalRawGcsService to pull the data out of the local.db. This is the only way to get the files, as they are stored in the local DB using the blobstore. Here's the original answer:
which are the files uri on GAE java emulating cloud storage with GCS client library?

Uploading multiple files to blobstore (redux)

Yes, I've seen this question already, but I'm finding information in the GAE docs that contradicts its accepted answer and Nick Johnson's blog.
The docs talk about uploading more than one file at the same time - the function to get uploaded files returns a list:
The get_uploads() method returns a list of BlobInfo objects, one for each uploaded file in the request.
But everywhere I've looked, the going assumption is that only one file at a time can be uploaded, and that a new upload URL needs to be created each time.
Is it even possible to upload more than one file at the same time with HTML5/Flash using Plupload?
Currently, the blobstore service upload URLs only support one file upload per post. In order to upload multiple files, you need to use the pattern documented in my blog posts. In future, we may extend the blobstore API to support more flexible upload URLs, supporting multiple uploaded files in a single request.
Edit: The blobstore now supports multiple file uploads in a single request.
Here's how I use the get_uploads() method for more than one file:
blob_info = self.get_uploads()[0]
blob_info2 = self.get_uploads()[1]
Nick Johnson's dropbox service is another example and I hope you find what suits your needs.
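For completeness, a small sketch of a handler that loops over every file in a single multi-file upload; the class name and redirect target are illustrative only:
import logging

from google.appengine.ext.webapp import blobstore_handlers

class MultiUploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # get_uploads() returns one BlobInfo per uploaded file in the request,
        # so iterating handles any number of files posted to the same upload URL
        for blob_info in self.get_uploads():
            logging.info('uploaded %s as %s', blob_info.filename, blob_info.key())
        self.redirect('/uploads')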
