I am using Google App Engine to create a web application. The app has an entity whose records will be inserted through an upload facility by the user. The user may select up to 5K rows (objects) of data. I am using the DataNucleus project as the JDO implementation. Here is the approach I am taking for inserting the data into the datastore.
1. Data is read from the CSV, converted to entity objects, and stored in a list.
2. The list is divided into smaller groups of around 300 objects each.
3. Each group is serialized and stored in memcache using a unique id as the key.
4. For each group, a task is created and inserted into the queue along with the key. Each task calls a servlet that takes this key as an input parameter, reads the data from memcache, inserts it into the datastore, and deletes the data from memcache.
The queue has a maximum rate of 2/min and a bucket size of 1. The problem I am facing is that the task is not able to insert all 300 records into the datastore. Out of 300, at most around 50 get inserted. I have validated the data once it is read from memcache and am able to get all the stored data back. I am using the makePersistent method of the PersistenceManager to save the data to the datastore. Can someone please tell me what the issue could be?
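The question is Java/JDO, but to make the chunked-write idea concrete, here is a hedged Python sketch using the google-cloud-datastore client (batch size and names are arbitrary): writing each group in small sub-batches means a failure raises and the task is retried, rather than part of the group silently going missing.

```python
# Illustration only: same idea as the question's per-task insert, but each
# group is written in small sub-batches so a failed or oversized write raises
# (and the task queue retries) instead of dropping the rest of the group.
from google.cloud import datastore

client = datastore.Client()

def insert_group(entities, batch_size=50):
    """Write `entities` in chunks of `batch_size`."""
    written = 0
    for start in range(0, len(entities), batch_size):
        chunk = entities[start:start + batch_size]
        client.put_multi(chunk)  # raises on failure, so the task is retried
        written += len(chunk)
    return written
```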
Also, I want to know whether there is a better way of handling bulk insert/update of records. I have used the BulkInsert tool, but in cases like these it will not satisfy the requirement.
This is a perfect use-case for App Engine mapreduce. Mapreduce can read lines of text from a blob as input, and it will shard your input for you and execute it on the taskqueue.
When you say that the bulkloader "will not satisfy the requirement", it would help if you said what requirement you have that it doesn't satisfy - I presume in this case the issue is that you need non-admin users to upload data.
In our application we currently use DynamoDB to store the notification details. A scheduler runs twice a day and queries by "notificationType" (pk -> notificationType, sk -> userId).
Each item has a timestamp attribute; based on it, if the timestamp is past the current time, a trigger is sent (there is more business logic, e.g. for some records a mail needs to be sent one day after the timestamp). Once the user performs the activity for which the notification was sent, the entry is deleted.
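For reference, a minimal boto3 sketch of the scheduler's read as I understand it (the table name and attributes are assumptions based on the question). It paginates through every item for a notificationType, which is exactly the read cost in question.

```python
# Hypothetical sketch of the twice-daily scheduler query described above.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Notifications")

def fetch_notifications(notification_type: str):
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("notificationType").eq(notification_type)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            # Every item is read, even ones whose notification will never fire.
            return items
```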
My concern is that if the data grows large for a notificationType, retrieving all of it is wasteful, because for some records the notification is never going to be sent. More read capacity is consumed, which might increase the cost at a later point in time.
In this case, would it be wise to keep using the existing DynamoDB, or move to another database like MongoDB, Cassandra, or something else?
Note: My primary concern is the cost
Another option is to use a workflow engine that can model the notification process per user instead of a batch job. This way you can avoid scanning large amounts of data as the engine would rely on durable timers to execute actions at the appropriate time.
My open-source project temporal.io, which I led at Uber, is used by multiple companies for notification-like scenarios and has been tested up to 200 million open parallel workflows.
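As a rough illustration of the durable-timer idea (a sketch, not the definitive way to model it), here is what one notification per user could look like with the Temporal Python SDK: the workflow sleeps until the mail is due, and a signal marks it as no longer needed if the user completes the activity first. Activity and field names are made up.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def send_notification_email(user_id: str) -> None:
    # Real mail-sending logic would go here.
    print(f"sending notification mail to {user_id}")


@workflow.defn
class NotificationWorkflow:
    def __init__(self) -> None:
        self._completed = False

    @workflow.signal
    def user_completed_activity(self) -> None:
        # The user already performed the activity, so no mail is needed.
        self._completed = True

    @workflow.run
    async def run(self, user_id: str, delay_seconds: float) -> None:
        # Durable timer: survives worker restarts, no table scans or polling.
        await asyncio.sleep(delay_seconds)
        if not self._completed:
            await workflow.execute_activity(
                send_notification_email,
                user_id,
                start_to_close_timeout=timedelta(minutes=1),
            )
```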
I wanted to understand how to optimize access to the Firebase database.
I have a database containing mountain refuges and routes.
Practically every time the application is opened, all the necessary data is downloaded from the database.
My idea would be to save the relevant data in a JSON file, and only when the user wants more information on this data would a read access to the database take place.
How can I reduce read access to the database? Is it possible to save the most "relevant" data in a JSON file on the device so as not to download it every time the user opens the app?
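To make this idea concrete, a rough plain-Python sketch (hypothetical file name and refresh interval) of keeping a local JSON snapshot and only going back to the database when it is stale:

```python
# Sketch of the local-cache idea: reuse a JSON snapshot on the device and only
# refresh it when it is older than MAX_AGE_SECONDS.
import json, os, time

CACHE_FILE = "places_cache.json"
MAX_AGE_SECONDS = 24 * 3600

def load_places(fetch_from_db):
    if os.path.exists(CACHE_FILE):
        age = time.time() - os.path.getmtime(CACHE_FILE)
        if age < MAX_AGE_SECONDS:
            with open(CACHE_FILE) as f:
                return json.load(f)
    places = fetch_from_db()          # only hit the database when the cache is stale
    with open(CACHE_FILE, "w") as f:
        json.dump(places, f)
    return places
```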
If you're asking about pricing: Firestore is billed by interactions with the database (reads, writes, deletes), not by the quantity of data (ignoring stored-data charges for this use case).
Additionally, Queries are shallow: they only return documents in a particular collection or collection group and do not return subcollection data.
So as long as your structure supports it, showing the higher level List of Places should be the only reads you're doing until the user actually selects a place to get more details on.
If you have a million places, leverage pagination to only load enough to support the UI - say 100 at a time. That will limit the number of reads needed as well.
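A small sketch of that pagination pattern with the google-cloud-firestore Python client; the "places" collection and the ordering field are assumptions:

```python
from google.cloud import firestore

db = firestore.Client()

def first_page(page_size: int = 100):
    query = db.collection("places").order_by("name").limit(page_size)
    return list(query.stream())

def next_page(last_doc, page_size: int = 100):
    # start_after resumes after the last document of the previous page,
    # so each page costs only `page_size` document reads.
    query = (
        db.collection("places")
        .order_by("name")
        .start_after(last_doc)
        .limit(page_size)
    )
    return list(query.stream())

docs = first_page()
if docs:
    more = next_page(docs[-1])
```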
I have a requirement to load hundreds of tables into BigQuery from Google Cloud Storage (GCS -> temp table -> main table). I have created a Python process to load the data into BigQuery and scheduled it in App Engine. Since App Engine has a maximum 10-minute timeout, I submit the jobs asynchronously and check the job status at a later point in time. Since I have hundreds of tables, I need to build a monitoring system to check the status of the load jobs.
I need to maintain a couple of tables and a bunch of views to check the job status.
The operational process is a little complex. Is there any better way?
Thanks
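For what it's worth, a condensed sketch of the async pattern with the google-cloud-bigquery Python client: start the load without waiting, keep the job ID, and poll it from a later request. Dataset, table and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

def start_load(gcs_uri: str, table: str) -> str:
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        autodetect=True,
    )
    job = client.load_table_from_uri(gcs_uri, table, job_config=job_config)
    return job.job_id  # persist this (a status table, Datastore, a queue, ...)

def check_load(job_id: str) -> str:
    job = client.get_job(job_id)  # refreshes job state from the API
    if job.state == "DONE":
        return "FAILED: %s" % job.error_result if job.error_result else "OK"
    return "RUNNING"

job_id = start_load("gs://my-bucket/data/customers.csv", "my_dataset.customers_temp")
print(check_load(job_id))
```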
When we did this, we simply used a message queue like Beanstalkd, where we pushed something that later had to be checked, and we wrote a small worker that subscribed to the channel and dealt with the task.
On the other hand: BigQuery offers support for querying data directly from Google Cloud Storage.
Use cases:
- Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.
https://cloud.google.com/bigquery/federated-data-sources
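A hedged sketch of that load-and-clean-in-one-pass idea with the Python client, querying a CSV in GCS as a temporary external table and writing the cleaned result into a native table; bucket, dataset and column names are invented:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Describe the CSV sitting in GCS as an external (federated) source.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/raw/events.csv"]
external_config.autodetect = True

job_config = bigquery.QueryJobConfig(
    table_definitions={"raw_events": external_config},  # temporary external table name
    destination=bigquery.TableReference.from_string("my-project.my_dataset.events_clean"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

sql = """
    SELECT TRIM(user_id) AS user_id, SAFE_CAST(amount AS NUMERIC) AS amount
    FROM raw_events
    WHERE user_id IS NOT NULL
"""
client.query(sql, job_config=job_config).result()  # blocks until the job is done
```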
I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication - all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry would be saved as a single document, where each client uploads its documents on a daily basis and can produce 100K log entries each day.
Plus, there are some limitations and questions that could break the requirements:
- 60-second timeout on bulk transactions - How many log entries per second will I be able to insert? If 100K won't fit into the 60-second frame, this will affect the design and the work that needs to be put into the server.
- 5 inserts per entity per second - Is a transaction considered a single insert?
- Post analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
- Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching the data from ES.
Thanks!
It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child, as a comment mentions when comparing performance.
Those numbers do not apply, but Datastore is simply not designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
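If it helps, a minimal sketch of streaming log rows into BigQuery with the Python client; the table and the row layout are assumptions (in practice you would partition the table by day):

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.logs.app_logs"   # hypothetical day-partitioned table

rows = [
    {"client_id": "client-42", "ts": "2024-01-01T12:00:00Z",
     "severity": "ERROR", "message": "disk full"},
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert, near real time
if errors:
    raise RuntimeError(f"failed rows: {errors}")
```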
I do not agree. Datastore is a fully managed NoSQL document database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as you would a MongoDB log-analysis use case in Google Cloud.
I have an App Engine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the sanity/performance of the application, now and also in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in Datastore, if you want to update a stats entity transactionally, you will need to shard it or you will be limited to 5 updates per second.
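A rough sketch of that sharding idea with the google-cloud-datastore Python client (kind name and shard count are arbitrary): writes are spread over N shard entities, and the total is the sum over all shards.

```python
import random
from google.cloud import datastore

client = datastore.Client()
NUM_SHARDS = 20

def increment(stat_name: str, amount: int = 1) -> None:
    # Pick a random shard so concurrent updates hit different entities.
    shard = random.randint(0, NUM_SHARDS - 1)
    key = client.key("StatShard", f"{stat_name}-{shard}")
    with client.transaction():
        entity = client.get(key) or datastore.Entity(key=key)
        entity["count"] = entity.get("count", 0) + amount
        client.put(entity)

def total(stat_name: str) -> int:
    keys = [client.key("StatShard", f"{stat_name}-{i}") for i in range(NUM_SHARDS)]
    return sum(e["count"] for e in client.get_multi(keys) if e)
```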
If the data to insert is small (one row to insert in BQ or one entity in the Datastore), then you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be cautious, because they can run more than once.
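One hedged way to blunt the run-more-than-once problem, sketched with the BigQuery Python client: derive a deterministic row id from the payload so a re-delivered task is de-duplicated by the streaming insert's best-effort dedup. Table name and payload fields are made up.

```python
import hashlib
from google.cloud import bigquery

client = bigquery.Client()

def handle_task(payload: dict) -> None:
    row = {"file": payload["file"], "rows_loaded": payload["rows_loaded"]}
    # Same payload -> same row id, so a re-run task does not double-count.
    row_id = hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()
    errors = client.insert_rows_json(
        "my_project.stats.load_stats", [row], row_ids=[row_id]
    )
    if errors:
        raise RuntimeError(errors)  # let the task queue retry
```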