Until now I have been using the BigQuery web tool to load data from a backup of my datastore that is automatically saved to Cloud Storage. I store these backups three times a week, in three different buckets depending on the weekday (Monday, Wednesday, Friday).
The GAE backup tool saves the .backup_info files with a very long name (for example: ahNzfmVnb2xpa2Vwcm9kdWN0aW9uckELEhxfQUVfRGF0YXN0b3JlQWRtaW5fT3BlcmF0aW9uGIrD6wMMCxIWX0FFX0JhY2t1cF9JbmZvcm1hdGlvbhgBDA.entityName.backup_info), and I don't know how that name is determined or whether I can choose an easier one. I can only name the "output-X-retry-Y" files. Is there any way to change this?
On the other hand, I'm trying the command-line tool, since I want to move from the web tool to it.
I've tried the load command, but I don't know how to have the schema generated automatically from the backup, the way I do it in the web tool's 'specify schema' section.
I always get an error about the missing schema when trying this format:
bq load dataset.table gs://path
Is it possible to skip specifying the schema, the same way I do in the web tool?
If you're running bq load to import a GAE datastore backup, you should add the --source_format=DATASTORE_BACKUP flag. Note you need to add that flag after load but before the table name:
bq load --source_format=DATASTORE_BACKUP dataset.table gs://path
That will tell BigQuery that you're loading from a datastore backup, which has a self-describing schema.
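For example, a fuller sketch of the command pointing at one of the generated .backup_info files (the bucket, dataset, and table names below are placeholders, not taken from your setup):
bq load --source_format=DATASTORE_BACKUP mydataset.mytable gs://your-backup-bucket/your_long_generated_name.EntityName.backup_info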
As far as I know, there isn't a way to control the generated name of the datastore backup.
Related
I am using WSO2 API Manager 02.01.00 on a Linux system. The API Manager is deployed in folder A. The databases (H2) are deployed in folder B, which is not inside folder A. The datasources in /repository/conf/datasources/master-datasources.xml point correctly to the databases in folder B. I configured it like that because I want to preserve the databases across deployments (a few developers are using the API Manager and they don't want to lose their data). But it seems that WSO2AM_DB.h2.db is created anew whenever there is an API Manager deployment. I suspect this because I looked at the database size: I started with a size of 1750 KB for WSO2AM_DB.h2.db, I published a few APIs in the Manager and the size increased to 2774 KB, then I did a deployment and the size returned to 1750 KB.
The effect is that the API Store/Publisher says "There are no APIs published yet".
But I can still see the APIs under Application Subscriptions and in the Carbon resources at /_system/governance/apimgt/applicationdata/provider/admin.
I tried to force a new indexing with this, but it didn't change anything.
Can I configure somewhere that the database should not be created or manipulated at startup?
Meanwhile I'm getting really desperate about this problem. Maybe you can help me.
Thank you for your time.
WSO2 does not recommend running on the H2 database. You need to use a production database such as MySQL, Oracle, etc.; H2 is only for trying things out.
Basically, WSO2 servers store data in databases as well as on the file system. For this kind of deployment, you need to do the following.
Point to an external database. If you are using this for demo purposes, you can still go with the current mode (H2 database).
Use dep-sync. The content under the WSO2_HOME/repository/deployment/server location needs to be preserved. You can use SVN-based dep-sync or rsync (see the rsync sketch after this list). The basic idea is that a new deployment needs to start with the data of the previous deployment.
Preserve the Solr index. If you have hundreds or thousands of APIs in the system, re-indexing takes time. To avoid that, you can copy the content of WSO2_HOME/solr to the new deployment.
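A minimal sketch of the rsync approach for the last two points, assuming the old and new installations sit side by side under /opt (the paths are placeholders for your own layout):
rsync -av /opt/wso2am-old/repository/deployment/server/ /opt/wso2am-new/repository/deployment/server/
rsync -av /opt/wso2am-old/solr/ /opt/wso2am-new/solr/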
Based on the subdomain that is accessing my application, I need to include a different configuration file that sets some variables used throughout the application (the file is included on every page). I'm in two minds about how to do this:
1) Include the file from GCS
2) Store the information in a table on Google Cloud SQL and query the database on every page through an included file.
Or am I better off combining one of these options with Memcache?
I've been looking everywhere for which option is faster (loading from GCS or selecting from Cloud SQL), but haven't been able to find anything.
NB: I don't want to have the files as normal PHP includes, because I don't want to redeploy the app every time I set up a new subdomain (different users get different subdomains). I would rather just update the database or upload a new config file to Cloud Storage, leaving the app alone.
I would say the sanest solution is to store the configuration in Cloud SQL, since you can easily make changes to it even from within the app, and to use Memcache on top, since it was built exactly for this kind of thing.
The problem with GCS is that you cannot simply edit the file; you have to delete it and upload a new version every time, which is not optimal in the long run.
GCS is cheaper, although for small text files it does not matter much. Otherwise, I don't see much of a difference.
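If you do take the GCS route, keep in mind that updating the configuration means uploading a whole new object rather than editing it in place. A minimal sketch, with a hypothetical bucket and file name:
gsutil cp config.php gs://your-config-bucket/config.php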
Hi guys, I've dumped (made a backup of) my App Engine datastore entities, following this tutorial. Now I wonder if there is a way to restore the data locally, so I can do some testing and debugging.
On Windows, the local datastore is in the directory
C:\Users\UserName\AppData\Local\Temp\AppName
On OS X, this question can help you.
This directory stores datastore.db (the local storage). Rename it (the app should not be running, and if the file is locked, kill all the Python processes).
Now go to the App Engine dashboard
click on your app's link
click on Blob Viewer (I'm assuming you did the backup into the Blobstore)
click on the file name
click Download
rename the file to datastore.db
copy it to the path above
start the app
Remote API (as koma mentions) is the main GAE-documented approach, and it's a good approach. Alternatively, you can download the entities using the cloud download tool, write your own store reader/deserializer, and run it inside your local dev server instance: http://gbayer.com/big-data/app-engine-datastore-how-to-efficiently-export-your-data. Read the part about the New Approach...
While these options are not automatic and require engineering, I really wanted to point out the side effect of doing this: We have been facing performance issues in the local development server for months now, specifically when the datastore has more than 1,000 entities with over 50 indexes. Just search for "require_indexes slow" and you'll see what I'm talking about.
I'm sure you have a solid reason to import lots of data locally for testing and debugging; I just wanted to let you know that your application will perform extremely slowly, and debug mode will be impossibly slow; we can't even use debug mode with our setup anymore.
If you want to get some test data into your local datastore, you could copy some of it over using the remote API.
On my local machine (i.e. http://localhost:8080/), I have entered data into my GAE datastore for some entity called Article. After turning off my computer and restarting it the next day, I found the datastore empty: no entities. Is there a way to prevent this in the future?
How do I make a copy of the data in my local datastore? Also, will I be able to upload said data later into both localhost and production?
My model is ndb.
I am using Mac OS X and Python 2.7, if these matter.
I have experienced the same problem. Declaring the datastore path when running dev_appserver.py should fix it. These are the arguments I use when starting the dev_appserver:
python dev_appserver.py --high_replication --use_sqlite --datastore_path=myapp.datastore --blobstore_path=myapp_blobs
This will use SQLite and save the data in the file myapp.datastore. If you want to save it in a different directory, use --datastore_path=/path/to/myapp/myapp.datastore.
I also use --blobstore_path to save my blobs in a specific directory; I have found it more reliable to declare explicitly where the blobs should be saved. Again, that is --blobstore_path=/path/to/myapp/blobs or wherever you would like.
Since declaring blob and datastore paths, I haven't lost any data locally. More info can be found in the App Engine documentation here:
https://developers.google.com/appengine/docs/python/tools/devserver#Using_the_Datastore
Data in the local datastore is preserved unless you start it with the -c flag to clear it, at least on the PC. You therefore probably have a different issue with temp files or permissions or something.
The local data is stored using a different method than on the actual production servers, so I'm not sure you can make a direct backup as such. If you want to upload data to both the local and deployed servers, you can use the upload tool suite: uploading data
The bulk loader tool can upload and download data to and from your application's datastore. With just a little bit of setup, you can upload new datastore entities from CSV and XML files, and download entity data into CSV, XML, and text files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your app. You can customize the upload and download logic to use different kinds of files, or do other data processing.
So you can 'back up' by downloading the data to a file.
To load data into the local development server, just give the tool the local URL.
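As a rough sketch of that round trip with the bulk loader (the app ID, kind, config file, and file names are placeholders, and the remote_api path depends on how it is configured in your app.yaml):
appcfg.py download_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/_ah/remote_api --filename=articles.csv --kind=Article
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://localhost:8080/_ah/remote_api --filename=articles.csv --kind=Article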
The datastore typically saves to disk when you shut down. If you turned off your computer without shutting down the server, I could see this happening.
I have a 10 MB CSV file of geolocation data that I tried to upload to my App Engine datastore yesterday. I followed the instructions in this blog post and used the bulkloader/appcfg tool. The datastore indicated that records were uploaded, but it took several hours and used up my entire CPU quota for the day. The process broke down with errors towards the end, before I actually exceeded my quota. But needless to say, 10 MB of data shouldn't require this much time and power.
So, is there some other way to get this CSV data into my App Engine datastore (for a Java app)?
I saw a post by Ikai Lan about using a mapper tool he created for this purpose, but it looks rather complicated.
Instead, what about uploading the CSV to Google Docs - is there a way to transfer it to the App Engine datastore from there?
I do daily uploads of 100,000 records (20 MB) through the bulkloader. Settings I played with:
- bulkloader.yaml config: set to auto-generate keys.
- include a header row in the raw CSV file.
- speed parameters set to the max (not sure whether reducing them would reduce the CPU consumed).
These settings burn through my 6.5 hours of free quota in about 4 minutes, but they get the data loaded (maybe the cost comes from the indexes being generated).
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename --bandwidth_limit=999999 --rps_limit=100 --batch_size=50 --http_limit=15
(I autogenerate this line with a script and use Autohotkey to send my credentials).
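If you want to skip the interactive credential prompt entirely, older SDK versions of appcfg.py also accept --email together with --passin, which reads the password from stdin; a rough sketch (the e-mail address and environment variable are my own placeholders):
echo "$APPCFG_PASSWORD" | appcfg.py upload_data --email=you@example.com --passin --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename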
I wrote this gdata connector to pull data out of a Google Docs Spreadsheet and insert it into the datastore, but it uses Bulkloader, so it kind of takes you back to square one of your problem.
http://code.google.com/p/bulkloader-gdata-connector/source/browse/gdata_connector.py
What you could do, however, is take a look at the source to see how I pull data out of Google Docs, and create a task (or tasks) that does the same thing instead of going through the bulkloader.
You could also upload your document to the Blobstore and similarly create a task that reads the CSV data out of the Blobstore and creates entities. (I think this would be easier and faster than working with gdata feeds.)