I have a file which will be stored in Google Cloud Storage. I then need to store data about this file in my datastore so I can find the file. If I get the file and send it to storage, and something happens right after that which prevents the information from being stored in the datastore, the file would exist but I would have no way to find it. Is there some technique or method to make this whole process a "transaction"? Either everything gets inserted or nothing does?
Google Storage and Google Datastore don't provide an API to do this directly, but it can be accomplished with a 3-step dance. I'll assume that your Datastore entity is called FileInfo. We'll need another datastore entity called PreStoreFile.
1. Write a PreStoreFile entity to the datastore that contains the Cloud Storage path where the file will be written, plus a timestamp.
2. Write the file to Cloud Storage.
3. In a single transaction, write the FileInfo entity and delete the PreStoreFile from step 1.
Finally, add a cron job that runs every hour or day, finds old PreStoreFile entities, and deletes them along with the corresponding Cloud Storage object if one was written.
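A minimal sketch of step 3 with the low-level Datastore API (a cross-group transaction is assumed, since FileInfo and PreStoreFile will typically live in different entity groups; property names are illustrative):

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Transaction;
import com.google.appengine.api.datastore.TransactionOptions;

public class FileInfoWriter {
    // Atomically create the FileInfo entity and remove the PreStoreFile marker.
    public static void commitFileInfo(Key preStoreFileKey, String gcsPath) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        // Cross-group transaction, since the two entities sit in different entity groups.
        Transaction txn = datastore.beginTransaction(TransactionOptions.Builder.withXG(true));
        try {
            Entity fileInfo = new Entity("FileInfo");
            fileInfo.setProperty("gcsPath", gcsPath);
            fileInfo.setProperty("created", new java.util.Date());
            datastore.put(txn, fileInfo);
            datastore.delete(txn, preStoreFileKey);
            txn.commit();              // both writes apply, or neither does
        } finally {
            if (txn.isActive()) {
                txn.rollback();        // e.g. on a concurrent-modification failure
            }
        }
    }
}
```

If you instead make the PreStoreFile the parent of the FileInfo, both entities share an entity group and a plain single-group transaction is enough.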
I am writing a web crawler for Linked Data and I need to store crawled URIs on disk (not necessarily distributed, but it could be). My crawler will constantly check whether a URI exists in the storage. If a URI exists, it will do nothing; if it does not exist, it will crawl the URI and write it to the storage. At first, since the storage will be rather empty, there will be more writes than reads, but at some point reads will outnumber writes, and I favor faster reads. I don't need any join operations, etc.
I am thinking about a document-based NoSQL store where I define key = "domain of a URI" and value = "an array of the whole URIs". I am not sure whether I need a secondary index on the value.
Since you are only interested in searching and storing, SQLite is suitable for your purposes. It's a lightweight database engine.
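For example, here is a minimal sketch using the xerial sqlite-jdbc driver (my assumption; the table layout and class names are illustrative). Keying the table on the URI itself makes both the existence check and the insert a single indexed lookup, with no secondary index needed:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class UriStore {
    private final Connection conn;

    public UriStore(String dbFile) throws SQLException {
        conn = DriverManager.getConnection("jdbc:sqlite:" + dbFile);
        try (Statement st = conn.createStatement()) {
            // The primary key doubles as the lookup index.
            st.execute("CREATE TABLE IF NOT EXISTS uris (uri TEXT PRIMARY KEY)");
        }
    }

    // Returns true if the URI was not seen before (and records it in one statement).
    public boolean markIfNew(String uri) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT OR IGNORE INTO uris(uri) VALUES (?)")) {
            ps.setString(1, uri);
            return ps.executeUpdate() == 1;   // 1 row inserted => URI was new
        }
    }

    // Read-only existence check for the "more reads than writes" phase.
    public boolean contains(String uri) throws SQLException {
        try (PreparedStatement ps =
                 conn.prepareStatement("SELECT 1 FROM uris WHERE uri = ?")) {
            ps.setString(1, uri);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next();
            }
        }
    }
}
```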
Regards.
To move data from the datastore to BigQuery tables I currently follow a manual and time-consuming process: backing up to Google Cloud Storage and restoring to BigQuery. There is scant documentation on the restoring part, so this post is handy: http://sookocheff.com/posts/2014-08-04-restoring-an-app-engine-backup/
There is also a seemingly outdated article (with code) to do it: https://cloud.google.com/bigquery/articles/datastoretobigquery
I have, however, been waiting for access to an experimental tester program that seems to automate the process, but have gotten no access for months: https://docs.google.com/forms/d/1HpC2B1HmtYv_PuHPsUGz_Odq0Nb43_6ySfaVJufEJTc/viewform?formkey=dHdpeXlmRlZCNWlYSE9BcE5jc2NYOUE6MQ
For some entities, I'd like to push the data to BigQuery as it comes in (inserts and possibly updates). For more business-intelligence-style analysis, a daily push is fine.
So, what's the best way to do it?
There are three ways of getting data into BigQuery:
through the UI
through the command line
via API
If you choose the API, then you have two different options: "batch" mode or the streaming API.
If you want to send data "as it comes", then you need to use the streaming API. Every time you detect a change in your datastore (or maybe once every few minutes, depending on your needs), call the insertAll method of the API. Note that you need to have created the table beforehand, with a structure that matches your datastore. (This can be done via the API too, if needed.)
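A rough sketch of the streaming path with the google-api-services-bigquery Java client, assuming you already have an authorized Bigquery instance and an existing table whose columns match the row map you build from your entity (project, dataset, and table names are placeholders):

```java
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.TableDataInsertAllRequest;
import com.google.api.services.bigquery.model.TableDataInsertAllResponse;
import java.io.IOException;
import java.util.Collections;
import java.util.Map;

public class StreamingLoader {
    // Streams one row (a column-name -> value map built from the entity) into BigQuery.
    public static void streamRow(Bigquery bigquery, Map<String, Object> row,
                                 String entityKey) throws IOException {
        TableDataInsertAllRequest.Rows r = new TableDataInsertAllRequest.Rows()
            .setInsertId(entityKey)   // lets BigQuery de-duplicate retried inserts
            .setJson(row);
        TableDataInsertAllRequest request = new TableDataInsertAllRequest()
            .setRows(Collections.singletonList(r));
        TableDataInsertAllResponse response = bigquery.tabledata()
            .insertAll("my-project", "my_dataset", "my_table", request)
            .execute();
        if (response.getInsertErrors() != null && !response.getInsertErrors().isEmpty()) {
            throw new IOException("Streaming insert failed: " + response.getInsertErrors());
        }
    }
}
```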
For your second requirement, ingesting data once a day, you have the full code in the link you provided. All you need to do is adjust the JSON schema to that of your datastore and you should be good to go.
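As an alternative to the article's pipeline, the daily push can also be a load job pointed at the backup files already sitting in Cloud Storage. This is an assumption-laden sketch, not the linked sample: with sourceFormat set to DATASTORE_BACKUP, BigQuery infers the schema from the backup itself (bucket, project, dataset, and table names are placeholders):

```java
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;
import java.io.IOException;
import java.util.Collections;

public class DailyLoader {
    // Loads one kind's backup (the .backup_info file written by the datastore backup) into a table.
    public static Job loadBackup(Bigquery bigquery) throws IOException {
        JobConfigurationLoad load = new JobConfigurationLoad()
            .setSourceFormat("DATASTORE_BACKUP")
            .setWriteDisposition("WRITE_TRUNCATE")   // replace yesterday's snapshot
            .setSourceUris(Collections.singletonList(
                "gs://my-bucket/backups/MyKind.backup_info"))
            .setDestinationTable(new TableReference()
                .setProjectId("my-project")
                .setDatasetId("my_dataset")
                .setTableId("my_kind_daily"));
        Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));
        return bigquery.jobs().insert("my-project", job).execute();
    }
}
```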
I want to export some data from our App Engine app. The current data set is around 70k entities (and will grow) which need to be filtered.
The filtering is done with a cron job (an App Engine task), 1k entities per batch. Is there a mechanism that will allow me to append lines to an existing file, rather than uploading it in bulk (as Google Cloud Storage requires)?
You can use the Datastore API to access the Datastore from your own PC or a Compute Engine instance and write all the entities to your hard drive (or Compute Engine instance). It's different from using the Datastore from within the App Engine instances, but only slightly, so you should have no problems writing the code.
I must observe, however, that writing 100 files to the Cloud Storage with 1,000 entities in each sounds like a good solution to me. Whatever you want to do with these records later, having 100 smaller files instead of one large super-file may be a good idea.
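A minimal sketch of the approach from the previous paragraph, using the low-level Datastore API with a query cursor and appending one line per entity to a local file (the Remote API setup needed to run this from your own PC or a Compute Engine instance is omitted; the kind, filter, and file names are illustrative):

```java
import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.QueryResultList;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class Exporter {
    public static void exportKind(String kind, String outFile) throws IOException {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Cursor cursor = null;
        // Append mode, so each 1k batch adds lines to the same local file.
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile, true))) {
            while (true) {
                FetchOptions options = FetchOptions.Builder.withLimit(1000);
                if (cursor != null) {
                    options.startCursor(cursor);
                }
                QueryResultList<Entity> batch =
                    datastore.prepare(new Query(kind)).asQueryResultList(options);
                for (Entity e : batch) {
                    // Apply your filter here, then write whatever representation you need.
                    out.write(e.getKey().toString() + "\t" + e.getProperties());
                    out.newLine();
                }
                if (batch.isEmpty()) {
                    break;                    // no more entities for this kind
                }
                cursor = batch.getCursor();   // resume the next batch after this one
            }
        }
    }
}
```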
In an effort to reduce the number of datastore PUTs I am consuming, I wish to use memcache much more frequently. The idea is to store entities in memcache for n minutes before writing all entities to the datastore and clearing the cache.
I have two questions:
Is there a way to batch PUT every entity in the cache? I believe makePersistentAll() is not really a batch save but rather saves each entity individually, is it not?
Secondly, is there a "callback" function you can place on entities as you put them in memcache? I.e. if I add an entity to the cache (with a 2-minute expiration delta), can I tell App Engine to save the entity to the datastore when it is evicted?
Thanks!
makePersistentAll does indeed do a batch PUT, which the log ought to tell you clearly enough.
There's no way to fetch the entire contents of memcache in App Engine. In any case, this is a bad idea - items can be evicted from memcache at any time, including between when you inserted them and when you tried to write the data to the datastore. Instead, use memcache as a read cache, and write data to the datastore immediately (using batch operations when possible).
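A minimal sketch of that pattern with the low-level APIs (class and property names are illustrative): writes go straight to the datastore as a single batch RPC, and memcache is only consulted on reads:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.memcache.Expiration;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.util.List;

public class ReadCache {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
    private final MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();

    // Writes hit the datastore immediately, as one batch RPC, then warm the cache.
    public void putAll(List<Entity> entities) {
        datastore.put(entities);
        for (Entity e : entities) {
            memcache.put(e.getKey(), e, Expiration.byDeltaSeconds(120));
        }
    }

    // Reads check memcache first and fall back to the datastore on a miss.
    public Entity get(Key key) throws EntityNotFoundException {
        Entity cached = (Entity) memcache.get(key);
        if (cached != null) {
            return cached;
        }
        Entity entity = datastore.get(key);
        memcache.put(key, entity, Expiration.byDeltaSeconds(120));
        return entity;
    }
}
```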
I am using Google App Engine to create a web application. The app has an entity whose records will be inserted through an upload facility by the user. A user may select up to 5K rows (objects) of data. I am using the DataNucleus project as the JDO implementation. Here is the approach I am taking to insert the data into the datastore.
Data is read from the CSV and converted to entity objects and stored in a list.
The list is divided into smaller groups of objects, say around 300 per group.
Each group is serialized and stored in memcache using a unique id as the key.
For each group, a task is created and inserted into the queue along with the key. Each task calls a servlet which takes this key as an input parameter, reads the data from memcache, inserts it into the datastore, and deletes the data from memcache.
The queue has a maximum rate of 2/min and a bucket size of 1. The problem I am facing is that the task is not able to insert all 300 records into the datastore. Out of 300, the maximum that gets inserted is around 50. I have validated the data once it is read from memcache and am able to get all the stored data back from memory. I am using the makePersistent method of the PersistenceManager to save data to the datastore. Can someone please tell me what the issue could be?
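For reference, the worker servlet does roughly the following (a simplified sketch; MyRecord and the PMF singleton are illustrative placeholders for my actual model class and the usual PersistenceManagerFactory helper):

```java
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import java.io.IOException;
import java.util.List;
import javax.jdo.PersistenceManager;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class GroupInsertServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String cacheKey = req.getParameter("key");
        MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
        @SuppressWarnings("unchecked")
        List<MyRecord> group = (List<MyRecord>) memcache.get(cacheKey);  // ~300 objects per group
        if (group == null) {
            return;   // nothing cached under this key
        }
        PersistenceManager pm = PMF.get().getPersistenceManager();
        try {
            for (MyRecord record : group) {
                pm.makePersistent(record);   // one record at a time, as described above
            }
        } finally {
            pm.close();
            memcache.delete(cacheKey);       // remove the group from memcache once done
        }
    }
}
```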
Also, I want to know whether there is a better way of handling bulk inserts/updates of records. I have used the BulkInsert tool, but in cases like these it will not satisfy the requirement.
This is a perfect use case for App Engine MapReduce. MapReduce can read lines of text from a blob as input, and it will shard your input for you and execute it on the task queue.
When you say that the bulkloader "will not satisfy the requirement", though, it would help if you said what requirement you have that it doesn't satisfy. I presume that in this case the issue is that you need non-admin users to upload data.