What is the main difference between data subscription and streaming computing in TDengine?

As far as I can tell, both push data to the consumer in real time, so I can't tell the difference between them.

The difference between data subscription and streaming computing in TDengine is this:
Data subscription pushes data out to the subscribed consumer,
but streaming computing pushes its computed results into a TDengine table.
Hence the former is faster to deliver, while the latter is persistent.

Related

Problems and solutions when using a secondary datastore alongside the main database?

I am in the middle of an interview simulation and I got stuck on one question. Can someone provide the answer for me, please?
The question:
We use a secondary datastore (Elasticsearch alongside our main database) for real-time analytics and reporting. What problems might you anticipate with this sort of approach? Explain how you would go about solving or mitigating them.
Thank you
There are several problems:
No transactional cover : If your main database is transactional (which it usually is), so you either commit or you don't. After the record is inserted into your main database, there is no guarentee that it will be committed to ES. In fact if you commit several records to your primary DB, you may have a situation where some of them are committed to ES, and few others are not. This is a MAJOR issue.
Refresh Interval : Elasticsearch by default refreshes every second. That means "Real-time" is generally 1 second later, or at least when the data is queried for. If you commit a record into your primary db, and immediately query for it via ES, it may not get found. THe only way around this is to GET the record using its ID.
Data-Duplication : Elasticsearch cannot do joins. You need to denormalize all data that is coming from a RDBMS. If one user has many posts, you cannot "join" to search. You have to add the user id an any other user specific details to every post object.
Hardware : Elasticsearch needs RAM (bare minimum of 1 gb) to work properly. This is assuming you don't use anything else from the ELK stack. THis is an important cost wise consideration.
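To make the data-duplication point concrete: since Elasticsearch cannot join a users table to a posts index at query time, each post document has to carry the user fields it will be searched by. A minimal sketch (the field names here are hypothetical, not a real schema):

```python
# Relational shape: posts reference users by id and would be JOINed at query time.
user = {"id": 7, "name": "Alice", "country": "DE"}
posts = [
    {"id": 1, "user_id": 7, "text": "hello"},
    {"id": 2, "user_id": 7, "text": "world"},
]

# Denormalized shape for Elasticsearch: embed the user fields into every post
# document, so a single index can answer "posts by users from DE" without a join.
def denormalize(post, user):
    doc = dict(post)
    doc["user_name"] = user["name"]
    doc["user_country"] = user["country"]
    return doc

docs = [denormalize(p, user) for p in posts]
```

The trade-off is that any update to the user now fans out to every one of their post documents.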
One problem might be synchronization issues, where the Elasticsearch store gets out of sync and starts serving stale data. To catch this, you will have to implement monitoring of your data pipeline, Elasticsearch, and the primary database, detecting problems by checking update times, delay, the number of records in each of them (within some margin of error), and overall system status (up / down).
Another is disconnection and recovery: what happens if your data pipeline or Elasticsearch loses its connection to the rest of the system? You will need an automatic way to reconnect when the network is restored and start synchronizing data again.
You also have to take into account a sudden influx of data: how do you scale Elasticsearch ingestion or your data processor (the pipeline) when there is a large volume of updates and inserts during peak hours, or after reconnecting from a network outage?
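A minimal sketch of the drift check described above, assuming you can cheaply count records and read a last-update timestamp on both sides (the tolerance and lag thresholds are made-up examples):

```python
def check_sync(primary_count, es_count, primary_last_update, es_last_update,
               count_tolerance=0.01, max_lag_seconds=60):
    """Return a list of human-readable problems; empty if the stores look in sync."""
    problems = []
    # Allow a small relative difference in record counts (in-flight writes).
    if primary_count and abs(primary_count - es_count) / primary_count > count_tolerance:
        problems.append(f"count drift: primary={primary_count} es={es_count}")
    # Flag Elasticsearch as stale if it lags the primary DB by too long.
    if primary_last_update - es_last_update > max_lag_seconds:
        problems.append(f"es is stale: lag exceeds {max_lag_seconds}s")
    return problems
```

A monitor would run this periodically and alert whenever the returned list is non-empty.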

How to correctly verify existence of a record in database when designing an ETL based on Apache Flink?

I am in the process of creating an ETL and fraud management module using flink to analyze a sequence of real time credit card transactions.
All transactions are received by an exposed API that pushes the data into a Kafka topic.
First, the received data needs to be checked and cleaned, and then stored in a database.
The next step is a fraud analysis of these transactions.
In this first step, with Flink, I have to check against the card database that the card is known before continuing. The problem is, there are around a billion cards in this database, and new cards can be added over time.
So I'm not sure whether I can cache the entire set of card numbers in memory, or how to handle this check efficiently: can Flink maintain some kind of sliding cache to check cards for existence in batches?
What you might do is mirror the card database into Flink's key-partitioned state, either on-heap or using RocksDB if you want this to spill to disk. Key-partitioned state is sharded across the cluster, so if you do want to keep the entire card database in memory, you can scale up the cluster until that's feasible.
To keep only recently seen values, you could rely on state TTL to expire records that haven't been accessed recently.
An alternative: Flink SQL has support for doing streaming lookup joins against JDBC databases, and you can configure caching for that.
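The "recently seen" idea behind state TTL can be sketched outside Flink as a small cache whose entries expire when they have not been accessed within the TTL, with a miss falling back to the card database. This is an illustration of the logic, not Flink's actual state API; the clock is injectable so the sketch is testable:

```python
import time

class TtlCache:
    """Keep only recently accessed keys, mimicking state TTL refreshed on access."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._last_seen = {}  # key -> last access time

    def contains(self, key):
        now = self.clock()
        seen = self._last_seen.get(key)
        if seen is not None and now - seen <= self.ttl:
            self._last_seen[key] = now  # refresh the TTL on access
            return True
        self._last_seen.pop(key, None)  # expired or unknown
        return False

    def add(self, key):
        self._last_seen[key] = self.clock()

def card_exists(card_id, cache, lookup_db):
    """Check the cache first; on a miss, query the card database and cache hits."""
    if cache.contains(card_id):
        return True
    if lookup_db(card_id):
        cache.add(card_id)
        return True
    return False
```

Hot cards stay cached and skip the database entirely; cold cards cost one lookup and then stay warm until the TTL lapses.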

Periodically refreshing static data in Apache Flink?

I have an application that receives much of its input from a stream, but some of its data comes from both an RDBMS and a series of static files.
The stream will continuously emit events, so the Flink job will never end. How do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?
I am currently using the JDBCInputFormat to read data from the database.
For each of your two sources that might change (RDBMS and files), create a Flink source that uses a broadcast stream to send updates to the Flink operators that are processing the data from Kafka. A broadcast stream sends each record to every parallel task/instance of the receiving operator.
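The broadcast idea can be sketched in plain Python (this only illustrates the delivery semantics; Flink's broadcast state API handles this for you, and the class and field names here are hypothetical):

```python
class EnrichTask:
    """One parallel instance of the enriching operator: keeps its own copy of the
    broadcast reference data and uses it to enrich main-stream events."""
    def __init__(self):
        self.reference = {}

    def on_broadcast(self, update):
        # Called on EVERY parallel instance, so all copies stay identical.
        self.reference.update(update)

    def on_event(self, event):
        # Enrich a Kafka event with the locally held reference data.
        return {**event, "ref": self.reference.get(event["key"])}

def broadcast(tasks, update):
    # A broadcast stream delivers each update to every parallel task.
    for task in tasks:
        task.on_broadcast(update)
```

Whenever the RDBMS or a file changes, the source emits the new reference data on the broadcast stream, and every task starts enriching with the fresh values.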
For each of your sources, files and RDBMS, you can periodically create a snapshot in HDFS or another store (for example, every 6 hours) and calculate the difference between two consecutive snapshots. The result is then pushed to Kafka. This solution works when you cannot modify the database or file structure to add extra information (e.g., in the RDBMS, a column named last_update).
Another solution is to add a column named last_update, use it to filter the data that has changed between two queries, and push that data to Kafka.
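A minimal sketch of the snapshot-diff approach, assuming each snapshot can be represented as a mapping from primary key to row:

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts of primary key -> row) and return the
    changes to push to Kafka: inserted/updated rows and deleted keys."""
    upserts = {k: v for k, v in new.items() if old.get(k) != v}
    deletes = [k for k in old if k not in new]
    return upserts, deletes
```

For example, diffing `{1: {"n": "a"}, 2: {"n": "b"}, 4: {"n": "d"}}` against `{1: {"n": "a"}, 2: {"n": "B"}, 3: {"n": "c"}}` yields the updated row 2, the new row 3, and the deletion of key 4; only those changes go to Kafka, not the whole snapshot.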

Loading data from google cloud storage to BigQuery

I have a requirement to load hundreds of tables into BigQuery from Google Cloud Storage (GCS -> temp table -> main table). I have created a Python process to load the data into BigQuery and scheduled it in App Engine. Since App Engine has a maximum 10-minute timeout, I submit the jobs in asynchronous mode and check the job status at a later point in time. Since I have hundreds of tables, I need a monitoring system to check the status of the load jobs.
I have to maintain a couple of tables and a bunch of views just to check the job status.
The operational process is a little complex. Is there a better way?
Thanks
When we did this, we simply used a message queue like Beanstalkd, where we pushed something that later had to be checked, and we wrote a small worker that subscribed to the channel and dealt with the task.
On the other hand: BigQuery offers support for querying data directly from Google Cloud Storage.
Use cases:
- Loading and cleaning your data in one pass by querying the data from a federated data source (a location external to BigQuery) and writing the cleaned result into BigQuery storage.
- Having a small amount of frequently changing data that you join with other tables. As a federated data source, the frequently changing data does not need to be reloaded every time it is updated.
https://cloud.google.com/bigquery/federated-data-sources
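The monitoring side can stay simple: when you poll your asynchronous load jobs, partition the results so that only failures and stragglers need attention. A sketch of that triage step (the record shape below is an assumption for illustration, not the BigQuery API's job object):

```python
def triage_jobs(jobs):
    """Split polled job records into done/failed/pending buckets.
    Each record is assumed to look like {"id": ..., "state": ..., "error": ...}."""
    done, failed, pending = [], [], []
    for job in jobs:
        if job["state"] != "DONE":
            pending.append(job["id"])      # still running: poll again later
        elif job.get("error"):
            failed.append(job["id"])       # finished with an error: alert/retry
        else:
            done.append(job["id"])         # loaded successfully
    return done, failed, pending
```

The worker re-queues the pending IDs and alerts (or retries) on the failed ones, which replaces the hand-maintained tables and views with one small loop.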

Using Apache Flink for data streaming

I am working on building an application with the requirements below, and I am just getting started with Flink.
Ingest data into Kafka with say 50 partitions (Incoming rate - 100,000 msgs/sec)
Read data from Kafka and process each data (Do some computation, compare with old data etc) real time
Store the output on Cassandra
I was looking for a real time streaming platform and found Flink to be a great fit for both real time and batch.
Do you think flink is the best fit for my use case or should I use Storm, Spark streaming or any other streaming platforms?
Do I need to write a data pipeline in Google Dataflow to execute my sequence of steps on Flink, or is there another way to run a sequence of steps for real-time streaming?
Say each of my computations takes about 20 milliseconds; how can I design this better with Flink and get higher throughput?
Can I use Redis or Cassandra to get some data within flink for each computation?
Will I be able to use JVM in-memory cache inside flink?
Also, can I aggregate data by key over some time window (for example, 5 seconds)? For example, let's say 100 messages come in and 10 of them have the same key; can I group all messages with the same key together and process them as one unit?
Are there any tutorials on best practices using flink?
Thanks and appreciate all your help.
Given your task description, Apache Flink looks like a good fit for your use case.
In general, Flink provides low latency and high throughput, and has parameters to tune the trade-off between them. You can read and write data from and to Redis or Cassandra. However, you can also store state internally in Flink. Flink also has sophisticated support for windows. You can read the blog on the Flink website, check out the documentation for more information, or follow the Flink training to learn the API.
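The keyed 5-second aggregation asked about above can be sketched in plain Python. Flink's window API does this bucketing for you (with watermarks and state handling); this only shows the grouping logic, with a count standing in for "some computation":

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=5):
    """Group (timestamp, key) events into fixed non-overlapping windows and
    count occurrences per key. Returns {(window_start, key): count}."""
    buckets = defaultdict(int)
    for ts, key in events:
        # Align each event's timestamp down to the start of its window.
        window_start = (ts // window_seconds) * window_seconds
        buckets[(window_start, key)] += 1
    return dict(buckets)
```

So events at t=0, 1, 4 with key "k1"/"k2" land in the [0, 5) window, while an event at t=6 starts the [5, 10) window; within each window, all messages sharing a key are aggregated together.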
