Google App Engine: searching buckets to find objects with a particular content type ("text/csv")

I have multiple buckets and I would like to find the buckets that store CSV files. I do not know how to search buckets to find what I need. Is there a method to query the buckets for objects with the content type "text/csv" only? Ultimately, I am trying to find the blob keys of the CSV files, which begin with "encoded_gs_file:". Also, what is the relationship between the datastore and storage?
The blobstore viewer that I am running on localhost only shows the encoded_gs_file keys for images, but I know that there should be an encoded_gs_file key for the CSV files.
When I visit the following URL:
http://localhost:8000/datastore?kind=__GsFileInfo__
I can see the CSV file type, but when I go to this URL:
http://localhost:8000/datastore?kind=__BlobInfo__
the CSV file does not appear. I think that if I can get the CSV file to appear in the __BlobInfo__ view, then I can download it.

There is no specific method for searching the objects in a bucket, but you can get the same result by combining API methods. For example, using the JSON API:
1. List all the buckets in your project: https://cloud.google.com/storage/docs/json_api/v1/buckets/list?apix_params=%7B%22project%22%3A%22edp44591%22%7D
2. With the list of buckets, list all the objects in each one:
https://cloud.google.com/storage/docs/json_api/v1/objects/list
3. Once you have the list of objects in a bucket, filter it in your preferred programming language.
You can do essentially the same with the XML API; here is its reference:
https://cloud.google.com/storage/docs/xml-api/reference-methods
Or use the gsutil tool:
gsutil list : lists all the buckets in your project: https://cloud.google.com/storage/docs/listing-buckets
gsutil ls -r gs://[BUCKET_NAME]/** : lists all the objects inside the bucket.
https://cloud.google.com/storage/docs/listing-objects
For examples of how to use the API from different programming languages, see the Cloud Storage Client Libraries document: https://cloud.google.com/storage/docs/reference/libraries#client-libraries-install-nodejs
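As a rough illustration of step 3, here is a minimal sketch of the filtering part only. It assumes you have already fetched the object listings and hold them as a list of dicts shaped like the "items" array of an objects.list response (with "bucket", "name" and "contentType" fields). The function name and the sample data are hypothetical, not part of any Google library.

```python
def filter_by_content_type(objects, content_type="text/csv"):
    """Return (bucket, name) pairs for objects whose contentType matches.

    `objects` is assumed to be a list of dicts shaped like the "items"
    array returned by the JSON API objects.list method.
    """
    return [
        (obj.get("bucket"), obj.get("name"))
        for obj in objects
        if obj.get("contentType") == content_type
    ]


# Hypothetical sample data standing in for a real objects.list response.
sample_items = [
    {"bucket": "example-bucket", "name": "reports/2019.csv", "contentType": "text/csv"},
    {"bucket": "example-bucket", "name": "images/logo.png", "contentType": "image/png"},
    {"bucket": "example-bucket", "name": "notes.txt"},  # no contentType field; handled defensively
]

csv_objects = filter_by_content_type(sample_items)
```

Collecting the distinct bucket names from the result would then give the buckets that store CSV files.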

Related

How do you retrieve thumbnails from the cloud accounts?

When you ask Kloudless to retrieve the files from an account using GET /v0/accounts/{account_id}/folders/{id}/contents/, it only lists the actual files; there are no thumbnail files.
So you cannot use the file contents request, GET /v0/accounts/{account_id}/files/{id}/contents/,
because it needs a specific file ID for the thumbnail file, and you don't get one because none are listed in the folder listing.
So how do you retrieve thumbnails for the files?
2016-09 Update: A thumbnails endpoint (docs) is now available for select services. The prior SO answer has been preserved below because it describes the File Download endpoint, which is useful for obtaining file contents from services that do not yet support thumbnails.
At the current time, the Kloudless API does not support returning thumbnails for files stored in users' cloud storage accounts.
The request that you are making:
GET /v0/accounts/{account_id}/files/{id}/contents/
is a download request, which fetches the full contents of the file.
The file ID can be obtained from the objects listed in the children request that you referenced before:
GET /v0/accounts/{account_id}/folders/{id}/contents/
This returns a list of file/folder objects that contain the ID of the resource as well as other metadata. The ID in the returned file objects can be used in the download request to fetch the contents of the file.
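To make the relationship between the two requests concrete, here is a small sketch that only builds the request paths quoted above. The helper names are hypothetical, and the shape of the listing (a list of dicts with "id" and "type" keys, where "type" is "file" or "folder") is an assumption made for illustration, so check the actual response format in the Kloudless docs.

```python
def folder_contents_path(account_id, folder_id):
    """Path of the children request that lists a folder's files and folders."""
    return "/v0/accounts/{}/folders/{}/contents/".format(account_id, folder_id)


def file_download_path(account_id, file_id):
    """Path of the download request that fetches a file's full contents."""
    return "/v0/accounts/{}/files/{}/contents/".format(account_id, file_id)


def download_paths(account_id, listing):
    """Build one download path per file object in a folder listing.

    ASSUMPTION: `listing` is a list of dicts with "id" and "type" keys;
    these field names are illustrative, not confirmed by the answer above.
    """
    return [
        file_download_path(account_id, item["id"])
        for item in listing
        if item.get("type") == "file"
    ]
```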

Location of GS File in Local/Dev AppEngine

I'm trying to troubleshoot some issues I'm having with an export task I have created. I'm attempting to export CSV data using Google Cloud Storage, and I seem to be unable to export all my data. I'm assuming it has something to do with the (FAR TOO LOW) 30-second file limit when I attempt to restart the task.
I need to troubleshoot, but I can't seem to find where my local/development server is writing the files. I see numerous entries in the GsFileInfo table, so I assume something is going on, but I can't seem to find the actual output file.
Can someone point me to the location of the Google Cloud Storage files in the local AppEngine development environment?
Thanks!
Looking at the dev_appserver code, it looks like you can specify a path, or it will calculate a default based on the OS you are using.
blobstore_path = options.blobstore_path or os.path.join(storage_path,
                                                        'blobs')
It then passes this path to blobstore_stub (GCS storage is backed by the blobstore stub), which seems to shard files by their blobstore key.
def _FileForBlob(self, blob_key):
    """Calculate full filename to store blob contents in.
    This method does not check to see if the file actually exists.
    Args:
      blob_key: Blob key of blob to calculate file for.
    Returns:
      Complete path for file used for storing blob.
    """
    blob_key = self._BlobKey(blob_key)
    return os.path.join(self._DirectoryForBlob(blob_key), str(blob_key)[1:])
For example, I'm using Ubuntu and started with dev_appserver.py --storage_path=~/tmp; I was then able to find the files under ~/tmp/blobs and the datastore under ~/tmp/datastore.db. Alternatively, you can go to the local admin console; the blobstore viewer link also displays GCS files.
As tkaitchuck mentions above, you can use the included LocalRawGcsService to pull the data out of local.db. This is the only way to get the file, as the files are stored in the local DB using the blobstore. Here's the original answer:
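If you want to see what the dev server actually wrote, a quick sketch like the following walks the blobs directory and lists every file with its size. It assumes the layout described above (a "blobs" subdirectory under the storage path, sharded into subdirectories); the function name is made up for this example.

```python
import os


def list_blob_files(storage_path):
    """Return sorted (path, size_in_bytes) pairs for files under <storage_path>/blobs."""
    blobs_dir = os.path.join(os.path.expanduser(storage_path), "blobs")
    found = []
    # os.walk yields nothing if the directory does not exist.
    for dirpath, _dirnames, filenames in os.walk(blobs_dir):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            found.append((full_path, os.path.getsize(full_path)))
    return sorted(found)


if __name__ == "__main__":
    # "~/tmp" matches the --storage_path value used in the example above.
    for path, size in list_blob_files("~/tmp"):
        print("{}\t{} bytes".format(path, size))
```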
which are the files uri on GAE java emulating cloud storage with GCS client library?

Search for files in an S3 bucket by filename

Suppose I upload a file to a bucket that already contains some 1,000 files. Now I would like to search by filename to check whether the file is in the bucket. Help is appreciated, as I was unable to find any documentation on this. If anyone has tried this, please post your comments :)
There is no search functionality in the S3 API. If you know the exact name of the file, you can issue a HEAD request for that object in the S3 bucket (the boto lookup() method does this); if you get a 200 response back from the server, then you know the file is there. If you get a 404, it's not there.
If you don't know the exact name of the file you are looking for, all you can really do is list the contents of the bucket until you find the file you are looking for. This is very inefficient and if you need to do this on a regular basis, I would recommend storing the filenames in a separate database that would allow you to search efficiently.
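As a sketch of the "separate database" idea, the following keeps the key names in a local SQLite table. It assumes you insert each key when you upload the file (or load the table once from a full bucket listing); the table name and helper functions are hypothetical, and nothing here talks to S3 itself.

```python
import sqlite3


def build_index(keys):
    """Create an in-memory SQLite index of S3 key names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE s3_keys (key TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO s3_keys (key) VALUES (?)", [(k,) for k in keys]
    )
    conn.commit()
    return conn


def key_exists(conn, key):
    """Exact-name lookup; served by the primary-key index."""
    row = conn.execute("SELECT 1 FROM s3_keys WHERE key = ?", (key,)).fetchone()
    return row is not None


def search_keys(conn, fragment):
    """Substring search (LIKE is case-insensitive for ASCII in SQLite)."""
    # Escape LIKE wildcards so the fragment is matched literally.
    escaped = fragment.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT key FROM s3_keys WHERE key LIKE ? ESCAPE '\\' ORDER BY key",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]
```

The substring search still scans the local table, but that avoids listing the bucket again for every query.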

In Drupal, is there a way to index files (pdf, doc) that were submitted via a Webform?

I'm trying to figure out a solution on how to be able to index/search PDF, doc, and maybe txt files that were uploaded via a webform. I've found a module (Search API attachments) that will index files but it appears that it only indexes files that are attached to nodes. :(
Our client wants to be able to search the contents of resumés that are submitted from a webform.
If your clients are expecting hundreds of nodes, it might be worthwhile to set up an Apache Solr server. Then you can use Tika to index all kinds of files: http://tika.apache.org/
If that's not an option, you can write a custom module that uses the Webform API to save the attached file as a node, and then use your Search API attachments module.

How do you upload data in bulk to Google App Engine Datastore?

I have about 4,000 records that I need to upload to Datastore. They are currently in CSV format. I'd appreciate it if someone would point me to or explain how to upload data in bulk to GAE.
You can use the bulkloader.py tool:
The bulkloader.py tool included with the Python SDK can upload data to your application's datastore. With just a little bit of set-up, you can create new datastore entities from CSV files.
I don't have the perfect solution, but I suggest you have a go with the App Engine Console. App Engine Console is a free plugin that lets you run an interactive Python interpreter in your production environment. It's helpful for one-off data manipulation (such as initial data imports) for several reasons:
- It's the good old read-eval-print interpreter. You can do things one at a time instead of having to write the perfect import code all at once and run it in batch.
- You have interactive access to your own data model, so you can read/update/delete objects from the data store.
- You have interactive access to the URL Fetch API, so you can pull data down piece by piece.
I suggest something like the following:
1. Get your data model working in your development environment.
2. Split your CSV records into chunks of under 1,000. Publish them somewhere like Amazon S3 or any other URL.
3. Install App Engine Console in your project and push it up to production.
4. Log in to the console. (Only admins can use the console, so you should be safe. You can even configure it to return HTTP 404 to "cloak" it from unauthorized users.)
5. For each chunk of your CSV:
   5.1. Use URLFetch to pull down a chunk of data.
   5.2. Use the built-in csv module to chop up your data until you have a list of useful data structures (most likely a list of lists or something like that).
   5.3. Write a for loop, iterating through each data structure in the list:
        - Create a data object with all correct properties.
        - put() it into the data store.
You should find that after one iteration through #5, you can either copy and paste or write simple functions to speed up your import task. Also, because you fetch and process your data in steps 5.1 and 5.2, you can take your time until you are sure that you have it perfect.
(Note: App Engine Console currently works best with Firefox.)
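For the splitting step (step 2 above), a minimal local sketch could look like this. It assumes the whole CSV fits in memory as a text string; the function name and the default of 999 rows per chunk (to stay under 1,000) are choices made for this example. Publishing the chunks and fetching them with URLFetch are not shown.

```python
import csv
import io


def split_csv_rows(csv_text, max_rows=999):
    """Parse CSV text and split the rows into chunks of at most max_rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return [rows[i:i + max_rows] for i in range(0, len(rows), max_rows)]
```

Each chunk is a list of rows (lists of strings), which is also the shape the csv module gives you in step 5.2.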
You can do this by using the Remote API and operations on multiple entities. I will show an example with NDB in Python, where our Test.csv contains the following semicolon-separated values:
1;2;3;4
5;6;7;8
First we need to import modules:
import csv
from TestData import TestData
from google.appengine.ext import ndb
from google.appengine.ext.remote_api import remote_api_stub
Then we need to create remote api stub:
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func, 'your-app-id.appspot.com')
For more information on using remote api have a look at this answer.
Then comes the main code, which does the following:
1. Opens the Test.csv file.
2. Sets the delimiter. We are using a semicolon.
3. Creates a list of entities, for which you have two different options:
   - Using the map and reduce functions.
   - Using a list comprehension.
4. Batch puts the whole list of entities.
Main code:
# Open the CSV file for reading.
with open('Test.csv', 'rb') as csv_file:
    # Set the delimiter.
    reader = csv.reader(csv_file, delimiter=';')
    # Option 1: reduce the 2D list into a 1D list, then map every element to an entity.
    test_data_list = map(lambda number: TestData(number=int(number)),
                         reduce(lambda numbers, row: numbers + row, reader))
    # Option 2: use a list comprehension instead. Use only one of the two options,
    # because the reader can be iterated over only once.
    # test_data_list = [TestData(number=int(number)) for row in reader for number in row]
# Batch put the whole list into HRD.
ndb.put_multi(test_data_list)
The put_multi operation also takes care of batching an appropriate number of entities into a single HTTP POST request.
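The flattening step can be tried on its own, without App Engine. The sketch below leaves out the TestData entities and put_multi entirely and just turns the semicolon-separated rows into a flat list of integers. It is written for Python 3, where reduce lives in functools, so it uses itertools.chain.from_iterable instead; the function name is made up for this example.

```python
import csv
import io
from itertools import chain


def read_numbers(csv_file, delimiter=";"):
    """Flatten every row of a delimited file into one list of integers."""
    reader = csv.reader(csv_file, delimiter=delimiter)
    return [int(value) for value in chain.from_iterable(reader)]


# In-memory stand-in for the Test.csv contents shown earlier.
sample = io.StringIO("1;2;3;4\n5;6;7;8\n")
numbers = read_numbers(sample)  # [1, 2, 3, 4, 5, 6, 7, 8]
```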
Have a look at this documentation for more information:
CSV File Reading and Writing
Using the Remote API in a Local Client
Operations on Multiple Keys or Entities
NDB functions
In later versions of the App Engine SDK, you can upload data using appcfg.py.
See appcfg.py.

Resources