I have a large amount of data stored in a Postgres database, and I need to send it to clients via a REST API using Django. The requirement is to send the data in chunks and not load the entire content into memory at once. I understand that Django has a StreamingHttpResponse class, which I will explore. But are there any better options? I've heard about Kafka and Spark for streaming applications, but the tutorials I've checked for these two tend to involve streaming live data, such as interacting with Twitter data. Is it possible to stream data from a database using either of these two? If yes, how do I then integrate it with REST so that clients can interact with it? Any leads would be appreciated. Thanks.
You could use Debezium or Kafka Connect to bulk-load your database into Kafka.
Once the data is there, you can put a Kafka consumer either within your Django application or outside of it, and make REST requests as messages are consumed. Spark isn't really necessary, and shouldn't be run within Django.
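For instance, a minimal sketch of the consumer side with kafka-python, assuming a `db_snapshot` topic populated by Debezium/Kafka Connect; the topic name and the REST endpoint URL are placeholders, not part of any real setup:

```python
# A minimal sketch: consume records that Kafka Connect loaded from the
# database and forward each one to a (hypothetical) REST endpoint.
import json

import requests
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "db_snapshot",                        # placeholder topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # One REST request per consumed message, as described above.
    requests.post("http://localhost:8000/api/records/", json=message.value)
```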
Related
I'm implementing a Django web service that will serve several platform apps: React.js for computers, a Swift app for iOS, and Kotlin for Android devices. The protocol is a REST API, and if a chat feature is included then Django Channels will be used as well. The data format is JSON. For deployment I intend to use Docker, covering Django, Celery, and the React.js app; the database, PostgreSQL, is on a separate server. I am thinking of collecting some user activity data and some history logs to show users what they have done so far. After hours of searching, I came up with Kafka! Unfortunately, I have no idea how to use Kafka and integrate all of these pieces together, or how to deploy them. I wish there were a system schema for this specific kind of system that shows what's what and what's where.
Kafka will only integrate your database and Django, and only with some effort, ideally via a separate Kafka Connect service.
From React (or other clients), you'll need to query Django API routes, which will then query your database. Kafka won't help with your frontend, and isn't really what exposes the history/activity you're interested in displaying. In other words, you could simply write that data to the database and skip Kafka entirely.
Essentially, you're following the CQRS design pattern if you properly separate Kafka writes from end user / UI reads.
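As a rough sketch of the "skip Kafka" option above, the activity history could be a plain Django model written on each request and read back over your existing REST routes; the model and field names here are placeholders, not a prescribed schema:

```python
# models.py -- a plain activity log, written directly to PostgreSQL.
# Model and field names are illustrative placeholders.
from django.conf import settings
from django.db import models


class ActivityLog(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    action = models.CharField(max_length=255)      # e.g. "viewed_report"
    created_at = models.DateTimeField(auto_now_add=True)


# Somewhere in your views: record an event, then let the UI read it
# back through a normal REST endpoint -- no Kafka involved.
def log_activity(user, action):
    ActivityLog.objects.create(user=user, action=action)
```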
"shows what's what and what's where!"
It's unclear what this means, but data lineage and metadata tools are a whole separate thing. For example, LinkedIn DataHub collects information such as this.
I am new to Apache Kafka and Apache Spark. I want to integrate Kafka with my AngularJS code. Basically, I want to make sure that when a user clicks on any link or searches for anything on my website, those searches and clicks are triggered as events and sent to the Kafka data pipeline for analytics use.
My question is: how can I integrate frontend code, which is in AngularJS, with Apache Kafka?
Can I send the search and click-stream data directly to Apache Spark through the Kafka pipeline, or do I need to send the data to Kafka, with Apache Spark polling the Kafka server and receiving the data in batches?
I don't think there is a Kafka client for front-end JavaScript (at least I cannot find one at a glance). I also cannot imagine a stable setup in which millions of producers (each client's browser) write to the same Kafka topic.
What you need to do in Angular is call a server-side function that logs your events to Kafka.
The server-side code can be written in any of a number of languages, including JavaScript via Node.js.
Take a look at the available clients in the Kafka documentation.
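To illustrate (in Python with Django here, though any server-side language with a Kafka client would do), a minimal sketch of an endpoint the Angular app could call; the `clickstream` topic, the broker address, and the view itself are assumptions, not a prescribed design:

```python
# A sketch of a server-side event-logging endpoint using kafka-python.
# Topic name, broker address, and route are placeholders.
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


@csrf_exempt
def track_event(request):
    # The frontend POSTs a JSON event; we forward it to Kafka.
    event = json.loads(request.body)
    producer.send("clickstream", event)
    return JsonResponse({"status": "queued"})
```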
Update 2019: there are several projects implementing a REST-over-HTTP(S) proxy for the producer and consumer interfaces, for example the Kafka REST project (source). I have never tried these myself, though.
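For reference, a sketch of producing through such a proxy with plain HTTP, using the Confluent REST Proxy v2 media type; the host, port, and topic name are placeholders:

```python
# A sketch of producing via a Kafka REST proxy over plain HTTP.
# Host, port, and topic are placeholders (Confluent REST Proxy v2 API).
import requests

response = requests.post(
    "http://localhost:8082/topics/clickstream",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    json={"records": [{"value": {"event": "click", "target": "/pricing"}}]},
)
response.raise_for_status()
```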
So I have these two applications connected via a REST API (JSON messages), one written in Django and the other in PHP. I have an exact database replica on both sides (using MySQL).
My question is: how can I keep these two applications' databases synchronized?
In other words, when I press "submit" on one of them, I want that data to be saved in the current app's database and, via REST, in the remote database of the other app.
Is there a Django app that does this? I read about django-synchro but didn't see anything REST-related.
I would also like to keep things asynchronous; in other words, the user must be able to keep using the app while this process runs in the background, keeping the data consistent.
I had a look at Celery and Redis, and it seems like a cron job will do what I need.
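A minimal sketch of that background sync as a Celery task; the remote endpoint URL and the payload shape are assumptions:

```python
# tasks.py -- a sketch of mirroring a local save to the other app's
# REST API in the background. URL and payload shape are placeholders.
import requests
from celery import shared_task


@shared_task(bind=True, max_retries=3)
def push_to_remote(self, payload):
    try:
        # Mirror the saved record to the remote app over REST.
        response = requests.post(
            "https://other-app.example.com/api/sync/",
            json=payload,
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException as exc:
        # Retry later so a transient failure doesn't lose the update.
        raise self.retry(exc=exc, countdown=60)


# After a local save, enqueue the mirror write without blocking the user:
# push_to_remote.delay({"model": "order", "id": 42, "status": "submitted"})
```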
We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data we're storing in the GAE datastore to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire datastore on every sync. The replication does not need to be anywhere close to real time, so a once- or twice-a-day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google-provided bulkloader.py to export the data to CSV, then somehow transfer the CSV to Amazon and process it there.
Create a Java app that runs on GAE, reads the data from the datastore, and sends it to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration and backup tools do:
1. Mark modified entities with a child marker entity.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those entities with a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to steps 3 and 4, you could make multiple urlfetch(POST) calls to send each serialized entity to the remote host directly, but this is more fragile, as a single failure could compromise the integrity of your data import.
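A rough sketch of that urlfetch(POST) alternative, assuming ndb entities; the importer URL is a placeholder, and real code would wrap each call in retries:

```python
# A sketch of pushing one serialized entity to the remote importer.
# The importer URL is a placeholder; add retries in real code.
import json

from google.appengine.api import urlfetch


def push_entity(entity):
    # Serialize the datastore entity (ndb) and POST it to the remote host.
    payload = json.dumps(entity.to_dict(), default=str)
    result = urlfetch.fetch(
        url="https://importer.example.com/import",  # hypothetical endpoint
        payload=payload,
        method=urlfetch.POST,
        headers={"Content-Type": "application/json"},
    )
    if result.status_code != 200:
        raise RuntimeError("import failed for %s" % entity.key)
```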
You could look at the datastore admin source code for inspiration.
I have a Google App Engine application that fetches data from multiple sources, after which I want to process all the data serially.
What is the best way to do it?
Ido
One approach: store the fetched data in the datastore, then use Task Queues to process it.
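A minimal sketch of that pattern; the model, the handler URL, and the queue name are placeholders:

```python
# A sketch of "store, then process via a task queue". Model, handler
# URL, and queue name are placeholders.
from google.appengine.api import taskqueue
from google.appengine.ext import ndb


class FetchedItem(ndb.Model):
    source = ndb.StringProperty()
    payload = ndb.TextProperty()
    processed = ndb.BooleanProperty(default=False)


def enqueue_processing(item_key):
    # A push queue configured with max_concurrent_requests: 1 in
    # queue.yaml processes one task at a time, i.e. serially.
    taskqueue.add(
        url="/tasks/process",                  # hypothetical handler
        params={"key": item_key.urlsafe()},
        queue_name="serial-processing",
    )
```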