Update Dgraph in real time via Flink

I am searching for a low-latency graph DB that allows for in-depth queries while being updated in real time.
Is it possible to update Dgraph in real time through Flink processes?
I would like to validate an idea along these lines:
read a stream from Kafka and pass it to Flink to build a data table / graph
pass the data table / graph to Dgraph along with edge / vertex attributes
update Dgraph in real time (edge / vertex attributes)
periodically copy / lift the latest version of Dgraph back into Flink to perform computations
If that is not possible: Dgraph is based on RocksDB; does anyone know whether data can be passed to Dgraph via RocksDB?

What you describe sounds straightforward; Dgraph should be able to handle those operations. Is the concern about high throughput, i.e. whether Dgraph would be able to keep up with the mutation and query load thrown at it by Flink?
The main issue that you might run into here is that the data would need to be converted into RDF format for mutations, and the queries would need to be in the GraphQL-like format that we use.
For more documentation, you can see our wiki: https://wiki.dgraph.io/Main_Page
Also, happy to understand your specific use case and give more detailed answers here: https://discuss.dgraph.io
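To make the RDF point above concrete, here is a minimal sketch of a Flink sink that turns each edge update into an RDF N-Quads mutation and posts it to Dgraph over HTTP. The EdgeEvent type, the predicate names, and the endpoint/header details (which vary by Dgraph version) are assumptions for illustration, not a definitive integration.

```java
// A rough sketch, not production code: a Flink sink that converts each edge
// update into RDF N-Quads and posts them to Dgraph's HTTP mutate endpoint.
// EdgeEvent, the predicate names, and the endpoint/header details are
// assumptions; check the HTTP (or gRPC) API of the Dgraph version you run.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class DgraphMutationSink extends RichSinkFunction<DgraphMutationSink.EdgeEvent> {

    /** Hypothetical event emitted by the Flink job: one edge plus its attributes. */
    public static class EdgeEvent {
        public String sourceId;
        public String targetId;
        public double weight;
    }

    private transient HttpClient http;

    @Override
    public void open(Configuration parameters) {
        http = HttpClient.newHttpClient();
    }

    @Override
    public void invoke(EdgeEvent e, Context ctx) throws Exception {
        // Convert the event into RDF N-Quads, the mutation format Dgraph expects;
        // the edge attribute is attached as a facet.
        String nquads =
              "_:a <xid> \"" + e.sourceId + "\" .\n"
            + "_:b <xid> \"" + e.targetId + "\" .\n"
            + "_:a <connected_to> _:b (weight=" + e.weight + ") .";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8080/mutate?commitNow=true"))
            .header("Content-Type", "application/rdf")
            .POST(HttpRequest.BodyPublishers.ofString("{ set { " + nquads + " } }"))
            .build();
        http.send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

In practice you would batch mutations per time interval or checkpoint rather than issuing one synchronous call per record, and you would likely use an official Dgraph client instead of raw HTTP.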

Related

How to get the throughput of KafkaSource in Flink?

I want to know the throughput of a KafkaSource; in other words, I want to measure the speed at which Flink reads data. My idea is to add a map operator after the source and use the built-in metrics in the map operator. Will this increase the overhead? I would like to get this metric without adding much overhead. What should I do? Or is there a way to get the output throughput of the topic in Kafka itself? Or should I fetch the KafkaSource's numRecordsOutPerSecond metric through the REST API?
Take a look at Kafka Manager, which displays many metrics related to Kafka. It's a tool used to manage Kafka and acts as a real-time dashboard; you need to install and configure it separately.
It can be used to check the consumption rate of your Flink consumer.
You can also make use of the built-in metrics on the source operator without adding a Map solely for that purpose.
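If you do go the operator-metrics route, here is a minimal sketch (names are illustrative) of a pass-through RichMapFunction that registers a Flink Meter. One counter increment per record is cheap, and the resulting rate shows up in the web UI and REST API alongside the source's built-in numRecordsOutPerSecond metric.

```java
// A minimal sketch of the metrics route: a pass-through RichMapFunction that
// registers a Meter, so records-per-second shows up in Flink's web UI / REST API.
// The metric name "recordsPerSecond" is just an example.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;
import org.apache.flink.metrics.SimpleCounter;

public class ThroughputMeter<T> extends RichMapFunction<T, T> {

    private transient Meter recordsPerSecond;

    @Override
    public void open(Configuration parameters) {
        // Rate averaged over the last 60 seconds.
        recordsPerSecond = getRuntimeContext()
            .getMetricGroup()
            .meter("recordsPerSecond", new MeterView(new SimpleCounter(), 60));
    }

    @Override
    public T map(T value) {
        recordsPerSecond.markEvent(); // one cheap counter increment per record
        return value;                 // pass the record through unchanged
    }
}
```

You would attach it directly after the source, e.g. `stream.map(new ThroughputMeter<>())`, so the measured rate matches what the KafkaSource emits.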

Using Apache Flink for data streaming

I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with, say, 50 partitions (incoming rate: 100,000 msgs/sec)
Read data from Kafka and process each record (do some computation, compare with old data, etc.) in real time
Store the output in Cassandra
I was looking for a real-time streaming platform and found Flink to be a great fit for both real-time and batch processing.
Do you think Flink is the best fit for my use case, or should I use Storm, Spark Streaming, or some other streaming platform?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to perform a sequence of steps for real-time streaming?
Say each of my computations takes around 20 milliseconds; how can I best design this with Flink to get better throughput?
Can I use Redis or Cassandra to fetch some data within Flink for each computation?
Will I be able to use a JVM in-memory cache inside Flink?
Also, can I aggregate data by key over some time window (for example, 5 seconds)? Say 100 messages come in and 10 of them have the same key; can I group all messages with the same key together and process them as one unit?
Are there any tutorials on best practices for using Flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput, and the network buffer timeout lets you tune the trade-off between the two. You can read and write data from and to Redis or Cassandra; however, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow a Flink training to learn the API.
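As a rough sketch of the keyed 5-second aggregation asked about above: the topic name, key extraction, reduce logic, and the Kafka connector class are placeholders that depend on your data and on your Flink/Kafka versions.

```java
// Minimal sketch of a Kafka -> keyed 5-second window -> aggregation job using the
// DataStream API. Topic, key extraction and the reduce logic are placeholders;
// the Kafka connector class name depends on your Flink/Kafka versions.
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KeyedWindowJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setBufferTimeout(10); // trade a little latency for higher throughput

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "my-group");

        DataStream<String> raw = env.addSource(
            new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props));

        raw
            // assume "key|value" records; extract the key before the delimiter
            .keyBy(line -> line.split("\\|")[0])
            // group everything with the same key that arrives within 5 seconds
            .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
            // placeholder aggregation: concatenate; replace with your computation
            .reduce((a, b) -> a + "," + b)
            // write results out; a Cassandra sink (flink-connector-cassandra) would go here
            .print();

        env.execute("keyed 5s window");
    }
}
```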

Querying Data from Apache Flink

I am looking to migrate from a homegrown streaming server to Apache Flink. One thing that we have is an Apache Storm-style DRPC interface to run queries against the state held in the processing topology.
For example: I have a bunch of sensors that I am computing a moving average over. I want to run a query against the topology and return all the sensors where that average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not currently come with a solution for querying the internal state of operators. You're in luck, however, because there are two options. We built a stateful word count job that allows querying its state; it is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink, we are also working on a generic solution for the queryable state use case. This will allow querying the state of any internal operator.
Alternatively, it might suffice in your case to periodically output the values to something like Elasticsearch using a window operation; the results could then simply be queried from Elasticsearch.
An out-of-the-box solution called Queryable State is coming in the next release.
Here is an example:
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest you read up on it first and then look at the example.
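For orientation, here is a minimal sketch of how a keyed moving-average stream could be exposed through that Queryable State feature. The types, names, and the surrounding job are placeholders, and the exact API has shifted between Flink releases.

```java
// A minimal sketch of exposing keyed state via Queryable State: the latest
// (sensorId, movingAverage) pair per key becomes externally queryable under the
// name "sensor-averages". Types and names are placeholders; the API details
// differ between Flink releases.
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

public class QueryableAverages {

    /** sensorAverages: a stream of (sensorId, current moving average). */
    public static void expose(DataStream<Tuple2<String, Double>> sensorAverages) {
        ValueStateDescriptor<Tuple2<String, Double>> descriptor =
            new ValueStateDescriptor<>("latest-average",
                Types.TUPLE(Types.STRING, Types.DOUBLE));

        sensorAverages
            .keyBy(value -> value.f0)                         // key by sensor id
            .asQueryableState("sensor-averages", descriptor); // publish latest value per key
    }
}
```

An external process would then use the queryable state client to look up the value for a given sensor id and apply the "above a fixed value" filter on its side.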

Database and large Timeseries - Downsampling - OpenTSDB InfluxDB Google DataFlow

I have a project where we sample a "large" amount of data on a per-second basis. Some operations are performed on it (filtering and so on), and it then needs to be accessible at second, minute, hour, or day intervals.
We currently do this with an SQL-based system and software that updates different tables (daily averages, hourly averages, etc.).
We are looking at whether other solutions could fit our needs, and I came across several: OpenTSDB, Google Cloud Dataflow, and InfluxDB.
All seem to address time-series needs, but it is difficult to find information about their internals. OpenTSDB does offer downsampling, but it is not clearly specified how.
The concern is that, since we may query vast amounts of data (for instance a year), a query may take a very long time if the DB downsamples at query time instead of pre-computing the aggregates.
Also, the downsampled values need to be "updated" whenever "delayed" data points arrive.
On top of that, upon data arrival we perform some processing (outlier filtering, calibration), and those intermediate results should not be written to disk. A RAM-based DB could be used, but perhaps a more elegant solution exists that also fits the previous requirements.
I believe this application is nothing "extravagant" and that tools must exist to do this; I'm thinking of stock tickers, monitoring, and so forth.
Perhaps you have some good suggestions as to which technologies / DBs I should look into.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow. Data preprocessing and query optimization is one of the major scenarios for Cloud Dataflow.
We don't provide a built-in "downsample" primitive, but you can write such a data transformation easily. If you are simply looking to drop unnecessary data, you can just use a ParDo. For really simple cases, the Filter.byPredicate primitive can be even simpler.
Alternatively, if you are looking at merging many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then, you can use a Combine to merge elements per window.
The additional processing that you mention can easily be tacked onto the same data processing pipeline.
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
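To illustrate the window-plus-Combine pattern described above, here is a minimal sketch using the Apache Beam SDK (the successor of the Dataflow SDK these answers refer to); the input PCollection, its timestamps, and the key type are assumed to come from elsewhere in the pipeline.

```java
// A minimal sketch of downsampling with Apache Beam: fixed 1-minute windows plus
// a Combine that reduces all per-second readings of a series to their mean.
import org.apache.beam.sdk.transforms.Mean;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class Downsample {

    /** readings: timestamped (seriesId, value) pairs sampled every second. */
    public static PCollection<KV<String, Double>> toMinuteMeans(
            PCollection<KV<String, Double>> readings) {

        return readings
            // subdivide the collection into 1-minute buckets by element timestamp
            .apply(Window.<KV<String, Double>>into(
                FixedWindows.of(Duration.standardMinutes(1))))
            // merge the elements of each window per key into their mean
            .apply(Mean.<String, Double>perKey());
    }
}
```

Late ("delayed") points can be handled by configuring allowed lateness and accumulating triggers on the window, so updated means are re-emitted when stragglers arrive.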
Dataflow is for inline processing as the data comes in. If you are only interested in summary and calculations, dataflow is your best bet.
If you want to later take that data and access it via time (time-series) for things such as graphs, then InfluxDB is a good solution though it has a limitation on how much data it can contain.
If you're OK with a 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers; then you submit the result into BigQuery. HINT: partition your tables by DAY to reduce costs and make re-calculations much easier.
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)

Which approach and database to use in performance-critical solution

I have the following scenario:
Around 70 million pieces of equipment send a signal to the server every 3~5 minutes, reporting id, status (online or offline), IP, location (latitude and longitude), parent node, and some other information.
The other information might not be in a standard format (so no schema for me), but I still need to query it.
The equipment might disappear for some time (or forever), not sending signals in the meantime, so I need a way to "forget" equipment that has not sent a signal in the last X days. Also, new equipment might come online at any time.
I need to query all this data, e.g. knowing how many pieces of equipment are offline in a specific region or over an IP range. There won't be many queries running at the same time.
Some of the queries need to run fast (less than 3 min per query) while the database is being updated, so I need indexes on the main attributes (id, status, IP, location, and parent node). The query results do not need to be 100% accurate; eventual consistency is fine as long as it doesn't take too long (more than 20 min on average) for changes to appear in the query results.
I don't need persistence at all; if the power goes out, it's okay to lose everything.
Given all this, I thought of using a NoSQL approach, maybe MongoDB or CouchDB, since I have experience with MapReduce and JavaScript, but I don't know which one is better for my problem (I'm gravitating towards CouchDB) or whether they are fit at all to handle this massive workload. I don't even know if I actually need a "traditional" database, since I don't need persistence to disk (maybe a main-memory approach would be better?), but I do need a way to build custom queries easily.
The main problems I see are the following:
Inserting/updating lots of tuples really fast, without knowing beforehand whether the signal I receive is already in the database. Almost all of the signals will be in the same state as last time, so maybe query by id and check whether the tuple changed: if not, do nothing; if it did, update it?
Forgetting offline equipment. A batch job that runs overnight removing expired tuples would solve this problem.
There won't be many queries running at the same time, but they need to run fast, so I guess I need a cluster that can run a single query across multiple nodes (does CouchDB MapReduce split the workload across multiple nodes of the cluster?). I'm not entirely sure I need a cluster, though; could a single, more expensive machine handle all the load?
I have never used a NoSQL system before, but I have theoretical knowledge of the subject.
Does this make sense?
Apache Flume for collecting the signals.
It is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Easy to configure and scale. Store the data in HDFS as files using Flume.
Hive for batch queries.
Map the data files in HDFS as external tables in the Hive warehouse. Write SQL-like queries using HiveQL whenever you need offline batch processing.
HBase for random real-time reads/writes.
Since HDFS, being a filesystem, lacks random read/write capability, you need a DB to serve that purpose. Looking at your use case, HBase seems like a good fit to me (a minimal sketch of such writes follows below). I would not suggest MongoDB or CouchDB, as you are not dealing with documents here and both of these are document-oriented databases.
Impala for fast, interactive queries.
Impala allows you to run fast, interactive SQL queries directly on your data stored in HDFS or HBase. Unlike Hive, it does not use MapReduce; it instead leverages the power of MPP, so it's good for real-time work. And it's easy to adopt since it uses the same metadata, SQL syntax (HiveQL), ODBC driver, etc. as Hive.
HTH
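As a rough illustration of the HBase suggestion above (table and column names are assumptions): an upsert of the latest signal is just a Put on the row keyed by the equipment id. Writing the same row key again simply overwrites the previous cell values, and a TTL on the column family can cover the "forget after X days" requirement without a nightly batch job.

```java
// Minimal sketch of the HBase route: upsert the latest signal for a piece of
// equipment, keyed by its id. Table name "equipment" and column family "s"
// are assumptions for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class EquipmentStore {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("equipment"))) {

            // A Put on an existing row key overwrites the latest cell values,
            // so "insert or update" is a single cheap operation.
            Put put = new Put(Bytes.toBytes("equipment-12345"));       // row key = id
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("status"),
                          Bytes.toBytes("online"));
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("ip"),
                          Bytes.toBytes("10.0.0.7"));
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("lat"),
                          Bytes.toBytes("48.8566"));
            put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("lon"),
                          Bytes.toBytes("2.3522"));
            table.put(put);
        }
    }
}
```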
Depending on the type of analysis, CouchDB, HBase, or Flume may all be good choices. For strictly numeric "write-once" metrics data, Graphite is a very popular open-source solution.
