Tajo: Does tsql need Hadoop?

I know Tajo requires Hadoop to be installed first, but I am not sure about bin/tsql. Is Hadoop required for tsql to run? If so, is there any plan to make it lighter? Any insight/help would be appreciated.

I'm a PMC member of Apache Tajo. Thank you for your interest in Apache Tajo.
Currently, Apache Tajo requires Hadoop 2.2 or higher because Tajo depends on some Hadoop libraries. In other words, Tajo just needs an unpacked Hadoop distribution, not a running Hadoop cluster. So, without a running Hadoop cluster, you can launch a Tajo cluster on the local file system or Amazon S3 as long as you set the HADOOP_HOME environment variable in your shell.
The Tajo team plans to make this Hadoop dependency lighter. Once that is done, you will be able to deploy and use bin/tsql or a Tajo cluster with your own storage even more easily.
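For example, a minimal sketch of a local setup (paths are placeholders; this assumes the standard Tajo start scripts):
# No Hadoop daemons need to be running; Tajo only needs the unpacked distribution.
export HADOOP_HOME=/opt/hadoop        # placeholder: unpacked Hadoop 2.x distribution
export JAVA_HOME=/usr/lib/jvm/java    # placeholder
cd /opt/tajo                          # placeholder: unpacked Tajo distribution
bin/start-tajo.sh                     # launches a local Tajo cluster (local file system by default)
bin/tsql                              # interactive SQL shell against that cluster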

Related

What is the recommended way to automate Flink job submission on an AWS EMR cluster during pipeline deployment

I am new to Flink and EMR cluster deployment. Currently we have a Flink job, and we deploy it manually on an AWS EMR cluster via the Flink CLI stop/start-job commands.
I want to automate this process (updating the Flink job jar on every pipeline deployment, with savepoints) and need some recommendations on possible approaches that could be explored.
One option is to automate this process via the Flink REST API, which supports all Flink job operations (a sketch of the sequence follows the links below):
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
Sample project which used the same approach : https://github.com/ing-bank/flink-deployer
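A rough sketch of that sequence with curl (host, job id, jar id, and savepoint paths are placeholders; the exact endpoints are documented in the REST API reference above):
# 1. Stop the running job with a savepoint (the response contains a trigger id;
#    poll /jobs/<job-id>/savepoints/<trigger-id> for the resulting savepoint path).
curl -X POST http://<jobmanager>:8081/jobs/<job-id>/stop \
  -H 'Content-Type: application/json' \
  -d '{"targetDirectory": "s3://<bucket>/savepoints", "drain": false}'

# 2. Upload the newly built job jar (the response contains the jar id).
curl -X POST http://<jobmanager>:8081/jars/upload \
  -F 'jarfile=@target/my-flink-job.jar'

# 3. Start the new jar from the savepoint taken in step 1.
curl -X POST http://<jobmanager>:8081/jars/<jar-id>/run \
  -H 'Content-Type: application/json' \
  -d '{"savepointPath": "s3://<bucket>/savepoints/savepoint-xxxx"}'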

Kubernetes volume snapshot vs. SQL backup?

I am running a database inside a Kubernetes pod. I am planning to run a K8s Job to take automatic backups of the databases from the pod.
Alternatively, I could write a shell script to take a snapshot of the volume (PV).
Which method is better to use? In an emergency, which one will save time when restoring data?
You can use Stash by AppsCode, which is a great solution for backing up Kubernetes volumes.
For supported versions check here
Stash by AppsCode is a Kubernetes operator for restic. If you are running production workloads in Kubernetes, you might want to take backups of your disks. Traditional tools are too complex to set up and maintain in a dynamic compute environment like Kubernetes. restic is a backup program that is fast, efficient and secure with few moving parts. Stash is a CRD controller for Kubernetes built around restic to address these issues.
Using Stash, you can back up Kubernetes volumes mounted in the following types of workloads:
Deployment, DaemonSet, ReplicaSet, ReplicationController, StatefulSet
After installing Stash using the script or Helm (a Helm-based sketch is shown below), you will want to follow the instructions for Backup and Restore if you are not familiar with it.
I find it very useful.
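A rough sketch of a Helm-based install (chart repository, release names, and namespaces differ between Stash versions, so treat these as placeholders and check the Stash docs for your version):
# Add the AppsCode chart repository and install the Stash operator.
helm repo add appscode https://charts.appscode.com/stable/
helm repo update
helm install stash appscode/stash --namespace kube-system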

How to "submit" an ad-hoc SQL to Beam on Flink

I'm using Apache Beam with the Flink runner and the Java SDK. It seems that deploying a job to Flink means building an 80-megabyte fat jar that gets uploaded to the Flink job manager.
Is there a way to easily deploy a lightweight SQL query to run as Beam SQL? Maybe have a job already deployed that can somehow receive and run ad hoc queries?
I don't think it's possible at the moment, if I understand your question correctly. Right now the Beam SDK will always build a fat jar that implements the pipeline and includes all pipeline dependencies, and it will not be able to accept lightweight ad hoc queries.
If you're interested in a more interactive experience in general, you can look at the ongoing efforts to make Beam more interactive, for example:
SQL shell: https://s.apache.org/beam-sql-packaging . This describes a work-in-progress Beam SQL shell, which should allow you to quickly execute small SQL queries locally in a REPL environment, so that you can interactively explore your data and design the pipeline before submitting a long-running job (a rough build-and-run sketch follows below). It does not change the way the job gets submitted to Flink (or any other runner), though, so after you submit the long-running job you will likely still have to use the job management tools you currently have to control it.
Python: https://s.apache.org/interactive-beam . Describes an approach for wrapping an existing runner in an interactive wrapper.
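If you want to try the SQL shell, the build steps look roughly like this (a sketch based on the work-in-progress docs above; the Gradle project path and the bundled runner/version depend on your Beam release):
# From a clone of the Beam repository: build the SQL shell, bundling the Flink runner.
./gradlew -p sdks/java/extensions/sql/shell \
  -Pbeam.sql.shell.bundled=':runners:flink:1.10,:sdks:java:io:kafka' installDist

# Launch the interactive shell.
./sdks/java/extensions/sql/shell/build/install/shell/bin/shell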

Near real-time data ingestion from SQL Server to HDFS in Cloudera

We have PLC data in SQL Server that gets updated every 5 minutes.
We have to push the data to HDFS in a Cloudera distribution at the same interval.
Which tools are available for this?
I would suggest using the Confluent Kafka connectors for this task (https://www.confluent.io/product/connectors/).
The idea is as follows:
SQLServer --> [JDBC-Connector] --> Kafka --> [HDFS-Connector] --> HDFS
All of these connectors are already available via the Confluent web site.
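As a rough sketch, the two connectors could be registered against the Kafka Connect REST API like this (this assumes a Connect worker with the Confluent JDBC source and HDFS sink connectors installed; hosts, table, and column names are placeholders):
# Register a JDBC source connector that polls SQL Server every 5 minutes.
curl -X POST http://<connect-host>:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "plc-jdbc-source",
    "config": {
      "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      "connection.url": "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=plc",
      "mode": "timestamp",
      "timestamp.column.name": "updated_at",
      "table.whitelist": "plc_readings",
      "topic.prefix": "plc-",
      "poll.interval.ms": "300000"
    }
  }'

# Register an HDFS sink connector that writes the topic into the Cloudera cluster.
curl -X POST http://<connect-host>:8083/connectors \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "plc-hdfs-sink",
    "config": {
      "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
      "topics": "plc-plc_readings",
      "hdfs.url": "hdfs://<namenode>:8020",
      "flush.size": "1000"
    }
  }'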
I'm assuming your data is being written to some directory on the local FS. You may use a streaming engine for this task. Since you've tagged this with apache-spark, I'll give you the Spark Streaming solution.
Using Structured Streaming, your streaming consumer will watch your data directory. Spark reads and processes data in configurable micro-batches (the trigger interval), which in your case would be 5 minutes. You can save the data from each micro-batch as text files, using your Cloudera Hadoop cluster for storage.
Let me know if this helped. Cheers.
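A rough sketch of how such a Structured Streaming application might be submitted to the Cloudera (YARN) cluster; the jar, class name, and paths are hypothetical placeholders for your own build:
# Submit the streaming application to YARN in cluster mode.
spark-submit \
  --master yarn --deploy-mode cluster \
  --class com.example.PlcFileStream \
  --conf spark.sql.streaming.checkpointLocation=hdfs:///user/etl/checkpoints/plc \
  plc-stream.jar hdfs:///landing/plc hdfs:///data/plc_text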
You can also look at the tool named Sqoop. It is open-source software for transferring data between relational databases and HDFS.
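For example, a rough sketch of a saved incremental Sqoop import (connection string, table, and column names are placeholders; Sqoop tracks the last imported value for the job):
# Create a saved Sqoop job that does incremental imports based on the update timestamp.
sqoop job --create plc_incremental -- import \
  --connect "jdbc:sqlserver://<sqlserver-host>:1433;databaseName=plc" \
  --username etl --password-file /user/etl/.sqlserver.pwd \
  --table plc_readings \
  --incremental lastmodified \
  --check-column updated_at \
  --merge-key reading_id \
  --target-dir /data/plc_readings

# Run it every 5 minutes, e.g. from cron or Oozie:
sqoop job --exec plc_incremental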

Solr and Zookeeper with a single node

I have SolrCloud running on my local machine with the internal ZooKeeper, i.e. the ZooKeeper that Solr runs internally, on a single node.
My question is: when I move Solr to the production environment, is it recommended to run ZooKeeper as an isolated/separate/external instance, or is it better to go with the internal ZooKeeper instance that comes along with Solr?
Using Solr's internal ZooKeeper is discouraged for production environments. This is even stated in the SolrCloud documentation:
Although Solr comes bundled with Apache ZooKeeper, you should consider yourself discouraged from using this internal ZooKeeper in production, because shutting down a redundant Solr instance will also shut down its ZooKeeper server, which might not be quite so redundant. Because a ZooKeeper ensemble must have a quorum of more than half its servers running at any given time, this can be a problem.
The solution to this problem is to set up an external ZooKeeper ensemble. You should create this ensemble on different machines so that if any of the Solr machines goes down, it will not impact ZooKeeper or the rest of the Solr instances. I know you are currently running a single Solr instance.
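Once the ensemble is running, you point Solr at it instead of the embedded ZooKeeper, for example (hostnames and the /solr chroot are placeholders):
# Either set ZK_HOST in solr.in.sh:
ZK_HOST="zk1:2181,zk2:2181,zk3:2181/solr"
# or pass the ensemble on the command line when starting Solr in cloud mode:
bin/solr start -c -z "zk1:2181,zk2:2181,zk3:2181/solr"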
As mentioned, for production it is not a good idea to run the internal ZooKeeper inside Solr, but for development it is entirely OK and very practical. For that, you just need to add these lines to your /etc/default/solr.in.sh file:
SOLR_MODE=solrcloud
ZK_CREATE_CHROOT=true
As an alternative, you can also start Solr manually with the command $SOLR_HOME_DIR/bin/solr start -c
Tested with Apache Solr 9 on a Debian-based Linux.

Resources