I have a Google App Engine application which fetches data from multiple sources, and then I want to process all of the data in a serial manner.
What is the best way to do it?
Ido
An approach: store the fetched data in the datastore. Use TaskQueues to process the data.
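A minimal sketch of that approach, assuming the Python 2.7 runtime and a push queue named "serial-queue" defined in queue.yaml with max_concurrent_requests: 1, so the tasks are executed one at a time:

```python
# Sketch only: assumes a push queue "serial-queue" configured in queue.yaml
# with max_concurrent_requests: 1, which makes App Engine run tasks serially.
from google.appengine.api import taskqueue

def enqueue_processing(entity_key):
    # The fetched data is already stored in the datastore; the task only
    # carries the key of the entity to process.
    taskqueue.add(queue_name='serial-queue',
                  url='/tasks/process',
                  params={'key': entity_key})
```

The handler mapped to /tasks/process then loads the entity by key and does the actual work, with App Engine retrying the task automatically on failure.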
The ERP used in the company I work for exposes some of its information through API requests. Currently my code just grabs each document and replaces it in Mongo. It does not handle deleted (excluded) documents, for example. Moreover, this code takes hours to fetch all the information because API requests are limited.
Is there a tool, library, app, etc. that can sync these databases?
What is the best practice for keeping the API data and the MongoDB database in sync?
PS: why do I need to sync them? Because API requests are limited.
PS2: not all, but some API requests return a "last edited" timestamp. I am using it to reduce the running time.
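A rough sketch of what an incremental sync loop could look like, assuming a hypothetical erp_client.fetch() wrapper that filters by that "last edited" timestamp (the client, database, and collection names are placeholders):

```python
# Sketch only: erp_client.fetch() is a hypothetical wrapper around the ERP's
# API that accepts a modified_since filter; "erp_mirror" and "invoices" are
# placeholder names. Documents are upserted by their ERP id, so re-running
# the sync is idempotent.
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["erp_mirror"]

def sync_collection(name, erp_client, since):
    for doc in erp_client.fetch(name, modified_since=since):
        # replace_one(..., upsert=True) inserts new documents and
        # overwrites ones that changed since the last run.
        db[name].replace_one({"_id": doc["id"]}, doc, upsert=True)

# Example: pull only the documents edited in the last 24 hours.
# sync_collection("invoices", erp_client, datetime.utcnow() - timedelta(days=1))
```

Deleted documents still need a separate pass (for example, comparing the set of ids returned by the API against the ids stored in Mongo).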
I created an App Engine service to transcode video files as well as images. Video files can be large and thus take longer to process. Cloud Tasks seemed like a great fit, but my front-end clients need to monitor tasks during execution.
For example, if a user uploads a video file, I would like to keep the client informed about the progress of their upload. I can't see anything in the docs that shows how to request this information from an actively executing task (or an API I can send these updates to). My current implementation uses WebSockets to relay this information; however, this doesn't seem scalable if lots of clients start uploading videos. My thought is to store task state in a NoSQL DB, return the DB task ID to the client when a video file is uploaded, and then poll the DB for updates.
Is this correct or is there a better approach?
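For reference, a minimal sketch of that polling idea, assuming Firestore as the NoSQL store (any document database would do; the collection name, field names, and status values are placeholders):

```python
# Sketch only: Firestore stands in for the NoSQL store; names are placeholders.
from google.cloud import firestore

db = firestore.Client()

def report_progress(task_id, percent, status="RUNNING"):
    # Called periodically by the transcoding worker while it runs.
    db.collection("transcode_tasks").document(task_id).set(
        {"percent": percent, "status": status}, merge=True)

def get_progress(task_id):
    # Backs a small GET /tasks/<task_id>/progress endpoint that the client polls.
    snapshot = db.collection("transcode_tasks").document(task_id).get()
    return snapshot.to_dict() if snapshot.exists else None
```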
Approaches differ, but my suggestion is to move as few pieces on the board as possible.
If you have developed a Node.js application to perform your video operations, complement it with a notification system or a tool like a Service Worker: https://www.youtube.com/watch?v=HlYFW2zaYQM
Meanwhile, you can use Pub/Sub to close the loop on notifications/events:
"Ingest events at any scale"
Data ingestion is the foundation for analytics and machine learning, whether you are building stream, batch, or unified pipelines. Cloud Pub/Sub provides a simple and reliable staging location for your event data on its journey towards processing, storage, and analysis.
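Sketched in Python here for brevity (the Node.js client is analogous); the project and topic names are placeholders:

```python
# Sketch only: publishes transcoding-progress events to a Pub/Sub topic.
# "my-project" and "transcode-progress" are placeholder names.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "transcode-progress")

def publish_progress(task_id, percent):
    payload = json.dumps({"task_id": task_id, "percent": percent}).encode("utf-8")
    # Attributes let subscribers route/filter without decoding the payload.
    future = publisher.publish(topic_path, payload, task_id=task_id)
    return future.result()  # raises if the publish ultimately fails
```

A subscriber (your notification service, or the front end via a push subscription to your backend) can then relay these events to the user.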
I have a large amount of data stored in a Postgres database, and I need to send it to the client via a REST API using Django. The requirement is to send the data in chunks and not load the entire content into memory at once. I understand that there is a StreamingHttpResponse class in Django, which I will explore. But are there any other better options? I've heard about Kafka and Spark for streaming applications, but the tutorials I've checked tend to involve streaming live data, like interacting with Twitter data, etc. Is it possible to stream data from a database using either of these two? If yes, how do I then integrate it with REST so that clients can interact with it? Any leads would be appreciated. Thanks.
You could use Debezium or Apache Kafka Connect to bulk-load your database into Kafka.
Once the data is there, you can either put a Kafka consumer within your Django application or run it outside of it and make REST requests as messages are consumed. Spark isn't necessary here, and shouldn't be run inside Django.
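A rough sketch of the "consumer outside Django" variant, assuming the table was loaded into a topic named postgres.public.mytable by Debezium/Kafka Connect (the topic name, broker address, and endpoint URL are placeholders):

```python
# Sketch only: forwards each Kafka message to a Django REST endpoint as it
# arrives, so no one ever holds the full table in memory.
import json
import requests
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "postgres.public.mytable",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    requests.post("http://localhost:8000/api/rows/", json=message.value)
```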
My application currently runs on an App Engine server. It writes records (for logging and reporting) continuously.
Scenario: view counts on the website. When someone opens the website, it hits the server to add a record with the time and type of view. These counts are then shown in the user's dashboard.
These requests have become very frequent, currently around 40/sec. The Google App Engine writes are getting heavy and the cost is increasing rapidly.
Is there any way to reduce this, or another database better suited to logging the views?
Google App Engine's Datastore is NOT suitable for a requirement like this, where you continuously write to the datastore and read much less often.
You need to offload this task to a third-party service (either write one yourself or use an existing one).
A better option for user tracking and analytics is Google Analytics (although you won't be able to show the hit counters directly on the website using Analytics).
If you want to show your user page hit count use a page hit counter: https://www.google.com/search?q=hit+counter
In this case you should avoid Datastore.
For this kind of analytics it's best to do the following:
Dump the data to the GAE log (yes, this sounds counter-intuitive, but it's actually advice from Google engineers). The GAE log is persistent and is guaranteed not to lose data you write to it.
Periodically parse the log for your data and then export it to BigQuery.
BigQuery has a quite powerful query language so it's capable of doing complex analytics reports.
Luckily this has already been done: see the Mache framework. Also see the related video.
Note: there is now a new BigQuery feature called streaming inserts, which could potentially replace the cumbersome middle step (files on Cloud Storage) used in Mache.
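A minimal sketch of such a streaming insert, assuming a table my-project.analytics.page_views and the google-cloud-bigquery client (the table id and schema are placeholders):

```python
# Sketch only: streams one page-view row straight into BigQuery.
# "my-project.analytics.page_views" is a placeholder table id.
import datetime
from google.cloud import bigquery

client = bigquery.Client()
TABLE_ID = "my-project.analytics.page_views"

def record_view(page, view_type):
    rows = [{
        "page": page,
        "type": view_type,
        "viewed_at": datetime.datetime.utcnow().isoformat(),
    }]
    errors = client.insert_rows_json(TABLE_ID, rows)
    if errors:
        # Streaming inserts are best-effort here; log and retry elsewhere.
        raise RuntimeError("BigQuery insert failed: %s" % errors)
```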
We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE data store to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google-provided bulkloader.py to export the data to CSV, somehow transfer the CSV to Amazon, and process it there.
Create a Java app that runs on GAE, reads the data from the data store and sends the data to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration or backup tools do:
1. Mark modified entities with a child entity marker.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those entities with a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to steps 3 and 4, you could make multiple urlfetch (POST) calls to send each serialized entity to the remote host directly, but this is more fragile, as a single failure could compromise the integrity of your data import.
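A hedged sketch of that urlfetch alternative, assuming the old db API's protobuf helpers for serialization (the remote URL is a placeholder, and the remote host is assumed to decode the payload with db.model_from_protobuf):

```python
# Sketch only: POSTs one serialized entity per request to a placeholder URL.
from google.appengine.api import urlfetch
from google.appengine.ext import db

def push_entity(entity, remote_url="https://example.com/import"):  # placeholder
    payload = db.model_to_protobuf(entity).Encode()
    result = urlfetch.fetch(url=remote_url,
                            payload=payload,
                            method=urlfetch.POST,
                            headers={"Content-Type": "application/octet-stream"})
    if result.status_code != 200:
        # A failed POST only affects this one entity; retry it later.
        raise RuntimeError("import failed for %s" % entity.key())
```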
You could look at the datastore admin source code for inspiration.