How can I speed up the App Engine bulk downloader?

I'm trying to use the App Engine bulkloader to download entities from the datastore (the high-replication one, if it matters). It works, but it's quite slow (85KB/s). Is there some magical set of parameters I can pass to make it faster? I'm receiving about 5MB/minute, or 20,000 records/minute, and given that my connection can do 1MB/second (and App Engine can presumably serve faster than that), there must be a way to do it faster.
Here's my current command. I've tried high numbers, low numbers, and every permutation:
appcfg.py download_data
--application=xxx
--url=http://xxx.appspot.com/_ah/remote_api
--filename=backup.csv
--rps_limit=30000
--bandwidth_limit=100000000
--batch_size=500
--http_limit=32
--num_threads=30
--config_file=bulkloader.yaml
--kind=foo
I already tried the suggestions in
App Engine Bulk Loader Performance
and it's no faster than what I already have. The numbers he mentions are on par with what I'm seeing as well.
Thanks in advance.

Did you set an index on the key of the entity you're trying to download?
I don't know if that helps, but check whether you get a warning at the beginning of the download that says something about "using sequential download".
Put this in index.yaml to create an index on the entity key, upload it, and wait for the index to be built.
- kind: YOUR_ENTITY_TYPE
  properties:
  - name: __key__
    direction: desc

Related

Does google app engine java datastore cache JPA query results?

I am using DN3 and GAE 1.7.4.
I use JPA2, which according to the documentation has the Level 2 cache enabled by default.
Here is my question:
If I run a query that returns some objects, will these objects be put in the cache automatically by their ID?
If I run em.find() with the ID of an object that has already been loaded by another query (createQuery().getResultList()), will it be available in the cache?
Do I need to run my em.find() or query in a transaction in order for the cache to kick in?
I need some clarification on how this cache works and how I could do my queries/finds/persists in order to make the best use of the cache.
Thanks
From Google App Engine: Using JPA with App Engine
Level2 Caching is enabled by default. To get the previous default behavior, set the persistence property datanucleus.cache.level2.type to none. (Alternatively include the datanucleus-cache plugin in the classpath, and set the persistence property datanucleus.cache.level2.type to javax.cache to use Memcache for L2 caching.)
As for your doubts, this depends on your query as well as on DataNucleus and GAE Datastore adapter implementation specifics. As Carol McDonald suggested, I believe the best path to find the answers to your questions is the JPA2 Cache interface, more specifically its contains method.
Run your query, get access to the Cache interface through the EntityManagerFactory and see if the Level 2 cache contains the desired entity.
Enabling DataNucleus logs will also give you good hints about what is happening behind the scenes.
After debugging in local GAE development mode, I found that the level 2 cache works. There is no need for a transaction begin/commit. The results of my simple query on primary keys, as well as em.find(), are put in the cache by their primary keys.
However, the default cache timeout in the local development server is only a few seconds, so I had to add this:
<property name="datanucleus.cache.level2.timeout" value="3600000" />
to persistence.xml.

HRD migration broke datastore queries & indexes

The Google App Engine HRD migration has been a nightmare for me. I migrated my 55GB datastore to HRD yesterday. Since then, many queries and indexes have been broken:
Some examples:
Select * from table1 where col1=val1 => query.get() returns empty in Python. However, it works in the datastore viewer.
Select * from table1 where col1=val1 => query.count() > 0. However, query.get() is empty.
Select * from table1 where col1=val1 order by col2 desc => Almost half of the rows are missing from the response. Same behavior in the datastore viewer.
How do I get these tables and indexes repaired? Is there any way of getting Google App Engine team support to address this issue? It's a GAE migration tool bug.
Will appreciate any help.
When the migration tool is used, a new app id is assigned, which makes all the keys change.
To recreate the custom indexes:
Temporarily empty index.yaml.
Vacuum the indexes (check out How can I remove unused indexes in Google Application Engine? for further information).
Wait until all the indexes have been deleted.
Restore index.yaml.
Create indexes by either redeploying the application or running appcfg.py update_indexes <path> (check out the documentation for further information).
You may also need to manually update all of the other references (e.g. a ListProperty of keys) if you have any.
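If you do have stored keys to fix up, a remapping helper might look roughly like this (a sketch only; the 's~' application-id prefix and the helper name are assumptions, not part of the original answer):
from google.appengine.ext import db

def remap_key(old_key, new_app='s~yourappid'):
    # Rebuild the full key path (kind, id-or-name pairs) under the new app id.
    path = []
    k = old_key
    while k:
        path[:0] = [k.kind(), k.id() or k.name()]
        k = k.parent()
    return db.Key.from_path(*path, _app=new_app)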
Edit
The simple, mono-property indexes that are managed automatically by App Engine are created/updated when a property is put.
To regenerate them, I recommend creating and running a simple MapReduce task to put every existing entity. This procedure should rebuild all the indexes (including those defined in index.yaml).
As this is a costly process, first do it manually with a few entities to see if it solves the problem.
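A minimal version of that re-put pass, without the MapReduce framework, could look something like this (a sketch; the model class and batch size are placeholders):
from google.appengine.ext import db

def reput_all(model_class, batch_size=100):
    # Re-writing each entity makes the datastore regenerate its indexes.
    query = model_class.all()
    entities = query.fetch(batch_size)
    while entities:
        db.put(entities)
        cursor = query.cursor()
        query = model_class.all().with_cursor(cursor)
        entities = query.fetch(batch_size)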
Tables get repaired automatically in about 2-3 days. That's an HRD problem. My problem is now resolved.
Update: it finally fixed itself in 24 hours =)
I have the same problem as you:
query.count() > 0. However, query.get() or fetch() is empty.
It's strange that some tables work fine but some have this problem.
I think it is a Google App Engine problem from migrating a very large table (model).
I hope my tables will be recovered like yours in 2-3 days too.

Best way to get CSV data into App Engine when bulkloader takes too long/generates errors?

I have a 10 MB CSV file of Geolocation data that I tried to upload to my App Engine datastore yesterday. I followed the instructions in this blog post and used the bulkloader/appcfg tool. The datastore indicated that records were uploaded but it took several hours and used up my entire CPU quota for the day. The process broke down in errors towards the end before I actually exceeded my quota. But needless to say, 10 MB of data shouldn't require this much time and power.
So, is there some other way to get this CSV data into my App Engine datastore (for a Java app)?
I saw a post by Ikai Lan about using a mapper tool he created for this purpose but it looks rather complicated.
Instead, what about uploading the CSV to Google Docs - is there a way to transfer it to the App Engine datastore from there?
I do daily uploads of 100000 records (20 megs) through the bulkloader. Settings I played with:
- bulkloader.yaml config: set to auto-generate keys (a sample is sketched after the command below).
- include the header row in the raw CSV file.
- speed parameters are set to max (not sure whether reducing them would reduce the CPU consumed).
These settings burn through my 6.5 hrs of free quota in about 4 minutes -- but the data gets loaded (maybe it's from the indexes being generated).
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename --bandwidth_limit=999999 --rps_limit=100 --batch_size=50 --http_limit=15
(I autogenerate this line with a script and use Autohotkey to send my credentials).
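For reference, a bulkloader.yaml along these lines lets the loader auto-generate keys simply by omitting a __key__ entry from the property_map (the kind and column names here are placeholders, not from the original answer):
python_preamble:
- import: google.appengine.ext.bulkload.transform

transformers:
- kind: yourtablename
  connector: csv
  connector_options:
    columns: from_header
  property_map:
  - property: city
    external_name: city
  - property: lat
    external_name: lat
    import_transform: float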
I wrote this gdata connector to pull data out of a Google Docs Spreadsheet and insert it into the datastore, but it uses Bulkloader, so it kind of takes you back to square one of your problem.
http://code.google.com/p/bulkloader-gdata-connector/source/browse/gdata_connector.py
What you could do, however, is take a look at the source to see how I pull data out of Google Docs, and create a task (or tasks) that does that instead of going through the bulkloader.
Also, you could upload your document into the blobstore and similarly create a task that reads the CSV data out of the blobstore and creates entities. (I think this would be easier and faster than working with gdata feeds.)
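To illustrate the blobstore idea (the question concerns a Java app, but the shape is the same; this Python sketch, its GeoPoint model, and the blob_key parameter are assumptions, not the asker's actual schema):
import csv
from google.appengine.ext import blobstore, db, webapp

class GeoPoint(db.Model):
    city = db.StringProperty()
    lat = db.FloatProperty()
    lng = db.FloatProperty()

class ImportCsvHandler(webapp.RequestHandler):
    def post(self):
        # The task payload carries the key of the uploaded CSV blob.
        blob_key = self.request.get('blob_key')
        reader = csv.reader(blobstore.BlobReader(blob_key))
        batch = []
        for city, lat, lng in reader:
            batch.append(GeoPoint(city=city, lat=float(lat), lng=float(lng)))
            if len(batch) >= 100:
                db.put(batch)  # write in batches to keep each datastore call small
                batch = []
        if batch:
            db.put(batch)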

Wiping the datastore?

I'm working on an App Engine project (Java). I'm using the JDO interface. I haven't pushed the application yet (it's just running at localhost). Is there a way I can totally wipe my datastore after I publish? In Eclipse, when working locally, I can just wipe the datastore by deleting the local file:
appengine-generated/local_db.bin
Is there any facility like that once the app is published?
I'm using JDO right now, but I might switch to Objectify or Slim3, and would want a convenient way to wipe my datastore should I switch over, or otherwise make heavy modifications to my classes.
Otherwise, it seems like I have to set up methods to delete instances myself, right?
Thanks
You can delete entities from the admin console if there are not many stored in your app. Go to http://appengine.google.com and do it manually. This is easy for fewer than 2000-5000 entities.
This question addressed the same topic. There is no single-command way to drop an entire datastore's worth of data. The only suggestion I have beyond those given in that previous question would be to try out the new Mapper functionality, which would make it easy to map over an entire set of entities, deleting them as you go.
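As a simpler (if slower) alternative to the Mapper, if you enable the remote_api endpoint you can also loop over keys-only queries from the remote_api shell; a rough sketch (the kind name is a placeholder):
from google.appengine.api import datastore

def wipe_kind(kind, batch_size=200):
    while True:
        # Fetch only keys to keep the queries cheap, then batch-delete them.
        keys = datastore.Query(kind, keys_only=True).Get(batch_size)
        if not keys:
            break
        datastore.Delete(keys)

wipe_kind('YourJdoEntity')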

App Engine index building stalled/stuck

I am having a problem with indexes building in my App Engine application. There are only about 200 entities in the indexes that are being built, and the process has now been running for over 24 hours.
My application name is romanceapp.
Is there any way that I can re-start or clear the indexes that are being built?
Try redeploying your application to appspot. I had the same issue and this solved it.
Let me know if this helps.
To handle "Error" indexes, first
remove them from your index.yaml file
and run appcfg.py vacuum_indexes.
Then, either reformulate the index
definition and corresponding queries
or remove the entities that are
causing the index to "explode."
Finally, add the index back to
index.yaml and run appcfg.py
update_indexes.
I found it here and it helped me.
