Creating a Datamart with GitHub information - analytics

I would like to create a data mart with GitHub information: commits, pull requests, reverts and so on.
GitHub provides webhooks for many of these events. I am trying to design an architecture to process these events and load them into an RDS database.
I was thinking of using API Gateway + Kinesis Firehose to dump the events to S3, and then a scheduler (e.g. https://airflow.apache.org/) to process those files.
Pros and cons:
(+) It's reliable, as we have a simple API Gateway + Kinesis Firehose setup dumping to S3.
(+) It's easy to reprocess, as I am using Airflow.
(-) It seems a bit over-architected.
(-) It will not be a real-time data mart.
Can you think of and propose another architecture, with its pros and cons?
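For reference, the Airflow side of what I have in mind would be roughly like the sketch below (the bucket, prefix and the actual load into RDS are just placeholders):

```python
# Rough sketch only: pick up the files Firehose delivered to S3 and load them
# into the RDS data mart. Bucket, prefix and the load step are placeholders.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator


def load_github_events():
    s3 = boto3.client("s3")
    # Simplified: lists everything under the prefix, no bookmarking of processed files.
    resp = s3.list_objects_v2(Bucket="my-github-events", Prefix="firehose/")
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket="my-github-events", Key=obj["Key"])["Body"].read()
        # Parse the webhook payloads here and upsert them into RDS.
        print(f"would load {obj['Key']} ({len(body)} bytes)")


with DAG(
    dag_id="github_events_to_rds",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_events", python_callable=load_github_events)
```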

Personally I would go with:
API Gateway -> Lambda -> Kinesis Stream -> Kinesis Analytics
This will satisfy the real-time requirement.
You can then offload the streams to S3 using Kinesis Firehose for any ad-hoc querying.
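A minimal sketch of the Lambda piece, assuming an API Gateway proxy integration and a stream named github-events (names are illustrative); keying each record by repository keeps events for one repo ordered within a shard:

```python
# Hedged sketch: forward each GitHub webhook delivered via API Gateway into a
# Kinesis stream. Stream name and partition-key choice are assumptions.
import json

import boto3

kinesis = boto3.client("kinesis")


def handler(event, context):
    payload = json.loads(event["body"])  # API Gateway proxy integration body
    partition_key = payload.get("repository", {}).get("full_name", "unknown")
    kinesis.put_record(
        StreamName="github-events",
        Data=event["body"].encode("utf-8"),
        PartitionKey=partition_key,
    )
    return {"statusCode": 200, "body": "ok"}
```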

Related

How to use NATS Streaming Server with Apache Flink?

I want to use NATS Streaming Server to stream data and use Flink to process it. How can I use Apache Flink to process real-time streaming data from NATS Streaming Server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support (there's a rough sketch of the latter at the end of this answer). There is no NATS connector among the connectors that ship with Flink, in Apache Bahir, or in the collection of Flink community packages, but if you search around you will find some relevant projects on GitHub and elsewhere.
When evaluating a connector implementation, in addition to the usual considerations, consider these factors:
Does it provide both consumer and producer interfaces?
Does it do checkpointing?
What processing guarantees does it provide (at least once, exactly once)?
How good is the error handling?
How is the performance, e.g., does it batch writes?
How does it handle serialization?
Does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. You should also be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
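If you go the mirroring route instead of writing a connector, a small bridge process that republishes NATS messages into Kafka (which Flink supports well) can be enough. A rough sketch, assuming the nats-py and kafka-python clients and using plain NATS for brevity; the NATS Streaming (STAN) client would be analogous:

```python
# Rough bridge: subscribe to a NATS subject and republish each message to a
# Kafka topic that Flink can consume with its Kafka connector. Subject and
# topic names are illustrative.
import asyncio

import nats  # nats-py
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")


async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def forward(msg):
        # Preserve the NATS subject as the Kafka key so related messages
        # land in the same partition.
        producer.send("nats-mirror", key=msg.subject.encode(), value=msg.data)

    await nc.subscribe("events.>", cb=forward)
    await asyncio.Event().wait()  # keep the bridge running


if __name__ == "__main__":
    asyncio.run(main())
```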

Can we use Apache Flink for real-time PDF/Excel file generation?

Currently we have a system which generates (exports) an Excel file of a webpage where we show lots of numbers in a UI grid. This system is written in Java. The problem is that as the number of users grows, the service is slowing down. Below are the high-level steps of how the service works.
The user submits a request for a file export.
The request is received by the Excel generation service, which makes a set of HTTP API calls and generates an Excel file that is uploaded to Google Cloud Storage.
At the end, the file is downloaded by the user.
So, can we use Apache Flink to export Excel files in parallel?
While you probably could implement this with Apache Flink, I don't think it's a good fit for this application. I would suggest you look at an event-driven, serverless computing platform instead.
I would say it is better to use an event-driven architecture instead of Flink. You can create two services: one that handles the HTTP requests and inserts them into a queue/log (this way you decouple the services, and it is easy to increase the throughput, among many other advantages), and another that consumes those events and generates the Excel files.
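A rough sketch of the consumer side of that design, assuming Kafka as the queue, openpyxl for the spreadsheet, and the Cloud Storage client for the upload (topic, bucket and the data fetch are placeholders). You scale it out by simply running more workers in the same consumer group:

```python
# Sketch of an export worker: pull export requests from a queue, build the
# Excel file, upload it to Cloud Storage. All names are illustrative.
import json

from google.cloud import storage
from kafka import KafkaConsumer
from openpyxl import Workbook

consumer = KafkaConsumer(
    "export-requests",
    bootstrap_servers="localhost:9092",
    group_id="excel-workers",  # add more workers to this group to scale out
)
bucket = storage.Client().bucket("exports-bucket")

for message in consumer:
    request = json.loads(message.value)
    rows = [["placeholder", 1, 2, 3]]  # in reality: the HTTP API calls go here

    wb = Workbook()
    ws = wb.active
    for row in rows:
        ws.append(row)
    local_path = f"/tmp/{request['export_id']}.xlsx"
    wb.save(local_path)

    bucket.blob(f"exports/{request['export_id']}.xlsx").upload_from_filename(local_path)
```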

Apache Nifi Site To Site Data Partitioning

I have a single output port in my NiFi flow, and I have a Flink job that consumes data from this port using the NiFi Site-to-Site protocol (Flink provides the appropriate connector). The consumption is parallel - i.e. there are multiple Flink sources reading from the same NiFi port.
What I would like to achieve is a kind of partitioned load balancing between the running Flink sources - i.e. ensuring that data with the same key is always delivered to the same Flink source (similar to ActiveMQ message groups or Kafka partitioning). This is needed for ordering purposes.
Unfortunately, I was unable to find any documentation telling how to accomplish that.
Any suggestions really appreciated.
Thanks in advance,
Site-to-Site wasn't really made to do what you are asking for. The best way to achieve it would be to have NiFi publish to Kafka and then have Flink consume from Kafka.
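The per-key delivery you are asking about then comes from Kafka's keyed-producer semantics: records with the same key hash to the same partition, and each Flink Kafka source subtask owns a fixed set of partitions, so per-key ordering is preserved. In NiFi that is what setting the message key on the PublishKafka processor gives you; the sketch below just illustrates the idea with a plain producer (topic and keys are placeholders):

```python
# Records that share a key always hash to the same Kafka partition, so any
# single consumer of that partition (e.g. one Flink source subtask) sees them
# in order. Topic name and keys below are purely illustrative.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

for i in range(10):
    key = f"sensor-{i % 3}"  # plays the role of the ActiveMQ "message group"
    producer.send("nifi-events", key=key.encode("utf-8"), value=f"event {i}".encode("utf-8"))

producer.flush()
```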

Google Cloud Dataflow ETL (Datastore -> Transform -> BigQuery)

We have an application running on Google App Engine using Datastore as the persistence back-end. Currently the application has mostly 'OLTP' features and some rudimentary reporting. While implementing reports we experienced that processing large amounts of data (millions of objects) is very difficult using Datastore and GQL. To enhance our application with proper reports and business intelligence features, we think it's better to set up an ETL process to move data from Datastore to BigQuery.
Initially we thought of implementing the ETL process as an App Engine cron job, but it looks like Dataflow can also be used for this. We have the following requirements for setting up the process:
Be able to push all existing data to BigQuery using the non-streaming API of BigQuery.
Once the above is done, push any new data to BigQuery using the streaming API whenever it is updated/created in Datastore.
My questions are:
Is Cloud Dataflow the right candidate for implementing this pipeline?
Will we be able to push the existing data? Some of the Kinds have millions of objects.
What should be the right approach to implement it? We are considering two approaches.
The first approach is to go through Pub/Sub: for the existing data, create a cron job that pushes all data to Pub/Sub, and for any new updates, push the data to Pub/Sub at the same time it is updated in Datastore. The Dataflow pipeline will pick it up from Pub/Sub and push it to BigQuery.
The second approach is to create a batch pipeline in Dataflow that queries Datastore and pushes any new data to BigQuery.
The question is: are these two approaches doable? Which one is better cost-wise? Is there any other way that is better than the above two?
Thank you,
rizTaak
Dataflow can absolutely be used for this purpose. In fact, Dataflow's scalability should make the process fast and relatively easy.
Both of your approaches should work -- I'd give preference to the second one: use a batch pipeline to move the existing data, and then a streaming pipeline to handle new data via Cloud Pub/Sub. In addition to the data movement, Dataflow allows arbitrary analytics/manipulation to be performed on the data itself.
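For the batch part, a rough sketch with the Beam Python SDK might look like the following; the project, kind, table and schema are placeholders, and the entity-to-row mapping depends entirely on your data model. In batch mode WriteToBigQuery uses load jobs rather than the streaming API, which matches your first requirement:

```python
# Sketch of the batch pipeline: read one Datastore kind and write it to a
# BigQuery table on Dataflow. All names/fields below are placeholders.
import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query
from apache_beam.options.pipeline_options import PipelineOptions


def entity_to_row(entity):
    # Flatten the Datastore entity into a BigQuery row (illustrative fields).
    props = entity.properties
    return {"name": props.get("name"), "count": props.get("count")}


options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadDatastore" >> ReadFromDatastore(Query(kind="MyKind", project="my-project"))
        | "ToRow" >> beam.Map(entity_to_row)
        | "WriteBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.my_kind",
            schema="name:STRING,count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```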
That said, BigQuery and Datastore can also be connected directly. See, for example, Loading Data From Cloud Datastore in the BigQuery documentation.

SalesForce Notifications - Reliable Integration

I need to develop a system that listens to changes happening to Salesforce objects and transfers them to my end.
Initially I considered the Salesforce Streaming API, which allows exactly that: create a push topic that subscribes to object notifications, and later have a set of clients that read them using long polling.
However, such an approach doesn't guarantee durability and reliable delivery of notifications, which I need.
What would be an architecture that allows implementing the same functionality in a reliable way?
One approach I have in mind is to create a Force.com application that uses Salesforce triggers to subscribe to notifications and then just sends them over HTTPS to the cloud or to my data server. Would this be a valid option, or are there better ones?
There are two very good questions on salesforce.stackexchange.com covering this very topic in detail:
https://salesforce.stackexchange.com/questions/16587/integrating-a-real-time-notification-application-with-salesforce
https://salesforce.stackexchange.com/questions/20600/best-approach-for-a-package-to-respond-to-dml-events-dynamically-without-object
