Near real-time data ingestion from SQL Server to HDFS in Cloudera - sql-server

We have PLC data in SQL Server which gets updated every 5 minutes.
We have to push the data to HDFS in the Cloudera distribution at the same interval.
Which tools are available for this?

I would suggest using Confluent's Kafka connectors for this task (https://www.confluent.io/product/connectors/).
The idea is as follows:
SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS
Both connectors are already available via the Confluent website.
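As a rough sketch, assuming the Confluent JDBC source and HDFS sink connectors, the two configurations could look something like the following (host names, credentials, table and column names are placeholders; check the connector documentation for the full option set):

    # jdbc-source.properties -- poll SQL Server every 5 minutes
    name=sqlserver-plc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:sqlserver://sqlhost:1433;databaseName=plc
    connection.user=etl_user
    connection.password=********
    mode=timestamp
    timestamp.column.name=updated_at
    table.whitelist=plc_readings
    topic.prefix=plc-
    poll.interval.ms=300000

    # hdfs-sink.properties -- flush the topic to the Cloudera cluster
    name=hdfs-plc-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    hdfs.url=hdfs://namenode:8020
    topics=plc-plc_readings
    flush.size=1000

The poll interval of 300000 ms matches your 5-minute update cycle, and the sink writes a file to HDFS once flush.size records have accumulated for a topic partition.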

I'm assuming your data is being written to some directory in the local FS. You may use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you the Spark Streaming solution.
Using Structured Streaming, your streaming consumer will watch your data directory. Spark reads and processes data in configurable micro-batches (trigger interval), which in your case would be 5 minutes. You can save each micro-batch as text files, which will use your Cloudera Hadoop cluster for storage.
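A minimal PySpark sketch of that idea, assuming the PLC records land as CSV files in a local staging directory (paths and the schema are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("plc-to-hdfs").getOrCreate()

    # Watch the staging directory for new files (a schema must be supplied for file streams)
    plc_stream = (spark.readStream
                  .format("csv")
                  .option("header", "true")
                  .schema("tag STRING, value DOUBLE, updated_at TIMESTAMP")
                  .load("file:///data/plc/incoming"))

    # Write each 5-minute micro-batch to HDFS on the Cloudera cluster
    query = (plc_stream.writeStream
             .format("csv")
             .option("path", "hdfs:///data/plc/raw")
             .option("checkpointLocation", "hdfs:///data/plc/_checkpoints")
             .trigger(processingTime="5 minutes")
             .start())

    query.awaitTermination()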
Let me know if this helped. Cheers.

You can look at Apache Sqoop, an open-source tool for transferring data between relational databases and HDFS.
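For example, an incremental import along these lines could be scheduled every 5 minutes (connection string, credentials, table and column names are placeholders):

    sqoop import \
      --connect "jdbc:sqlserver://sqlhost:1433;databaseName=plc" \
      --username etl_user --password-file /user/etl/.sqlserver.password \
      --table plc_readings \
      --target-dir /data/plc/raw \
      --incremental lastmodified \
      --check-column updated_at \
      --last-value "2018-01-01 00:00:00" \
      -m 1

You will need the Microsoft SQL Server JDBC driver jar in Sqoop's lib directory, and wrapping this in a saved job (sqoop job --create ...) lets Sqoop track the last value between runs instead of you passing it by hand.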

Related

Can YugabyteDB be used for storage & streaming?

We have a GoLang backend service used to:
Store data in YugabyteDB using the YCQL driver
Publish the same data to Kafka
Step 2 was necessary so that consumers can stream the data through Kafka.
Can YugabyteDB help stream data once a new row is created in a table, to avoid maintaining state in Kafka?
If yes, does YugabyteDB support streaming with a push model?
The CDC feature is actively being worked on; see https://github.com/yugabyte/yugabyte-db/issues/9019. Support for step 2, pushing changes into Kafka, is also in the works.

Real-time Streaming Data Pipeline using Kafka Connect and Flink

I am planning to put together the following data pipeline for one of our requirements.
IBM MQ -> Kafka Connect -> Flink -> MongoDB
The Flink real-time streaming job is to perform filtering, apply business rules, and enrich incoming records.
The IBM MQ part is a legacy component which cannot be changed.
The Confluent or Cloudera platform will possibly be used to house the Kafka and Flink parts of the flow.
I could use some thoughts/suggestions on the above approach.
I would take a closer look at whether you really need Kafka Connect. I believe IBM MQ supports JMS, and there's a JMS-compatible connector for Flink in Apache Bahir: http://bahir.apache.org/docs/flink/current/flink-streaming-activemq/.

MS SQL CDC with Kafka Connect and Apache Kafka

In my current use case, I am using Spark Core to read data from MS SQL Server, doing some processing on the data, and sending it to Kafka every 1 minute. I am using Spark and Phoenix to maintain the CDC information in an HBase table.
But this design has some issues: for example, if there is a surge in MS SQL records, Spark processing takes longer than the batch interval and Spark ends up sending duplicate records to Kafka.
As an alternative, I am thinking of using Kafka Connect to read the messages from MS SQL, send the records to a Kafka topic, and maintain the MS SQL CDC state in Kafka. Spark Streaming will read records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
I have a few questions in order to implement this architecture:
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
If yes, can you recommend a GitHub project offering connectors with which I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
Does Kafka Connect support a Kerberos Kafka setup?
Can I achieve this architecture with open-source Kafka connectors and Apache Kafka 0.9?
Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions, though. If possible, you should use the latest version of Apache Kafka (0.11).
If yes, can you recommend a GitHub project offering connectors with which I can CDC MS SQL tables using a SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime} and store the records in a Kafka topic?
You can use the JDBC Source connector, which is available as part of Confluent Platform (or separately), and you may also want to investigate kafka-connect-cdc-mssql.
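A rough configuration sketch for the JDBC source in timestamp mode, which gives the same effect as your SELECT ... WHERE COLUMN > ${lastExtractUnixTime} query (connection details, table, column and topic names are placeholders):

    name=mssql-something-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:sqlserver://mssqlhost:1433;databaseName=mydb
    connection.user=connect_user
    connection.password=********
    mode=timestamp
    timestamp.column.name=last_modified_time
    query=SELECT * FROM SOMETHING
    topic.prefix=mssql-something
    poll.interval.ms=60000

In timestamp mode the connector appends the "greater than the last seen value" predicate itself and stores the offset via Kafka Connect, so you would no longer need to track lastExtractUnixTime in HBase.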
Does Kafka Connect support a Kerberos Kafka setup?
Yes -- see here and here
Regarding this point :
Spark Streaming will read records from the Kafka topic, process them, store them in HBase, and send them to other Kafka topics.
You can actually use Kafka Connect here too -- there are Sinks available for HBase -- see the full list of connectors here.
For further manipulation of data in Kafka, there is the Kafka Streams API, and KSQL.
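For example, a hypothetical KSQL snippet that derives a filtered stream from the CDC topic (topic, stream and column names are made up):

    -- register the CDC topic as a stream
    CREATE STREAM something_raw (id INT, amount DOUBLE, last_modified_time BIGINT)
      WITH (KAFKA_TOPIC='mssql-something', VALUE_FORMAT='JSON');

    -- continuously write a filtered view to a new topic
    CREATE STREAM something_large AS
      SELECT id, amount FROM something_raw WHERE amount > 100;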

How to back up Cloudant to mass low-cost storage such as AWS Glacier?

One approach organisations sometimes use for backing up Cloudant is to run a standalone instance of CouchDB in their private network or a public network and replicate data from Cloudant to that CouchDB instance. The CouchDB data can then be exported to mass storage such as Amazon Glacier.
Questions:
What are the steps required to implement this?
Are there any gotchas to be aware of?
Here are the approximate steps:
a server running CouchDB (e.g. in EC2)
continuous replication from Cloudant --> CouchDB
periodic (e.g. nightly) cron job to
copy the relevant .couch file over somewhere
zip it up
use AWS command-line tools to put the zipped file on S3
use AWS command-line tools to send that S3 file to Glacier
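A minimal Python sketch of that cron job using boto3 (file paths, bucket and vault names are placeholders; you could equally do this with the aws CLI, or use an S3 lifecycle rule for the Glacier step):

    #!/usr/bin/env python3
    # Nightly CouchDB backup sketch: compress the .couch file, copy to S3, archive to Glacier.
    import gzip
    import shutil
    import boto3

    DB_FILE = "/var/lib/couchdb/mydb.couch"   # the replicated database file
    ARCHIVE = "/backups/mydb.couch.gz"

    # 1. copy + zip the .couch file
    with open(DB_FILE, "rb") as src, gzip.open(ARCHIVE, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # 2. put the zipped file on S3
    boto3.client("s3").upload_file(ARCHIVE, "my-couch-backups", "mydb.couch.gz")

    # 3. send the same archive to a Glacier vault
    with open(ARCHIVE, "rb") as body:
        boto3.client("glacier").upload_archive(
            accountId="-",
            vaultName="couch-backups",
            archiveDescription="nightly mydb backup",
            body=body,
        )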
Things to remember:
Glacier keeps everything unless you explicitly say "kill that backup from 30 days ago", so you keep paying for old backups; it is best to delete really old stuff.
with continuous replication: if you delete a doc on Cloudant, it is immediately deleted from your backup too (oops)
restoring from Glacier is a pain; once the archive is retrieved you can restore it to CouchDB and then replicate it back to Cloudant.
Cloudant will not be able to support your CouchDB installation - you will need to support it yourself.

Need of a redis server in a centralized setup

I have gone through the documentation for the centralized Logstash setup and found that we require a Redis server which will act as a broker.
Here is the link:
http://logstash.net/docs/1.1.12/tutorials/getting-started-centralized
But what is not clear to me is why we use Redis as a broker at all.
We could instead ship the logs directly from Logstash to Elasticsearch, which would remove the need for the Redis broker. So why do we split the setup into a shipper and an indexer?
I need a clear explanation.
Thanks.
I believe you can find an answer here:
https://groups.google.com/forum/#!topic/logstash-users/VakCOAzZI8k
Redis basically acts as a temporary key-value store for the raw shipper output, which is then parsed by the indexer. The log data is ultimately stored in Elasticsearch, not in Redis.
Apparently the idea is to offload indexing to a server dedicated to such tasks, since indexing is CPU-intensive. Calling Redis a broker seems appropriate, I guess.
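To make the shipper/indexer split concrete, here is a sketch of the two Logstash configs from that era (host names, log paths and the Redis key are placeholders):

    # shipper.conf -- runs on each application host, pushes raw events to Redis
    input {
      file { path => "/var/log/myapp/*.log" }
    }
    output {
      redis {
        host => "redis-broker"
        data_type => "list"
        key => "logstash"
      }
    }

    # indexer.conf -- pulls from Redis, parses, and stores in Elasticsearch
    input {
      redis {
        host => "redis-broker"
        data_type => "list"
        key => "logstash"
      }
    }
    output {
      elasticsearch {
        host => "es-node"
      }
    }

The CPU-heavy parsing filters live on the indexer, which is exactly the work being offloaded from the shipping hosts.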
When using Logstash with Redis, you can configure Redis to store all the log entries in memory only, so it behaves like an in-memory queue (like memcache).
You may reach the point where the volume of incoming logs exceeds what Logstash can process, and this can regularly bring down your system (observed in our environment).
If you feel Redis is an overhead for your disk, you can configure it to keep all the logs in memory until they are processed by Logstash.
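For reference, a redis.conf sketch that keeps the broker queue purely in memory (the values are examples, not recommendations):

    # disable RDB snapshots and the AOF log so nothing is persisted to disk
    save ""
    appendonly no
    # cap the memory used by queued log entries; fail writes rather than evict them
    maxmemory 2gb
    maxmemory-policy noeviction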
