Using Kylin without HDFS and HBase

Is it possible to run Apache Kylin without other databases like HBase (plus HDFS)? In other words, can the raw data and the cube metadata be stored somewhere else?

I think you could use Apache Hive with managed tables backed by storage handlers
(Hive storage handlers).
Hive could then connect to MySQL over a JDBC/ODBC driver, for example; a rough sketch follows.
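A minimal sketch of that idea, assuming Hive 3's JDBC storage handler and a HiveServer2 reachable from Python via PyHive; the MySQL host, credentials, and table names are all hypothetical, not something the answer above specifies:

from pyhive import hive  # pip install pyhive

# HiveQL DDL registering a Hive table that is actually stored in MySQL,
# via the JDBC storage handler shipped with Hive 3.
ddl = """
CREATE EXTERNAL TABLE mysql_backed_table (item_id INT, name STRING)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver"   = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url"      = "jdbc:mysql://mysql-host:3306/mydb",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "secret",
  "hive.sql.table"         = "source_table"
)
"""

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()
cur.execute(ddl)  # register the MySQL-backed table in Hive
cur.execute("SELECT * FROM mysql_backed_table LIMIT 10")
print(cur.fetchall())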

To use Kylin, HDFS is mandatory. Both the raw data and the cube data are stored in HDFS.
If you want to use another NoSQL datastore like Cassandra, consider a different framework, such as FiloDB.

Related

How to transfer data from ELK to SQL Server?

Our organization uses Elasticsearch, Logstash & Kibana (ELK), and we use a SQL Server data warehouse for analysis and reporting. There are some data items from ELK that we want to copy into the data warehouse. I have found many websites describing how to load SQL Server data into ELK. However, we need to go in the other direction. How can I transfer data from ELK to SQL Server, preferably using SSIS?
I have implemented a similar solution in Python, where we ingest data from an Elasticsearch cluster into our SQL data warehouse. You can use the Elasticsearch package for Python, which allows you to do exactly that.
You can find more information here:
https://elasticsearch-py.readthedocs.io/en/master/
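For illustration, a minimal sketch of that approach, using elasticsearch-py's scan helper together with pyodbc; the hosts, index pattern, credentials, and target table are hypothetical:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan
import pyodbc

es = Elasticsearch(["http://elastic-host:9200"])
sql = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                     "SERVER=dwh-host;DATABASE=dwh;UID=etl;PWD=secret")
cur = sql.cursor()

# scan() pages through every matching document without loading them all at once
for doc in scan(es, index="logs-*", query={"query": {"match_all": {}}}):
    src = doc["_source"]
    cur.execute("INSERT INTO dbo.elk_events (doc_id, message) VALUES (?, ?)",
                doc["_id"], src.get("message"))

sql.commit()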

Which is the most Spark-compatible database for data visualization?

I am using Twitter Streaming and want to visualize my data. Which database is the most compatible and feature-rich for this?
You could set up a data pipeline where you fetch and move your data using a tool like Apache Flume and/or Apache Kafka, analyze it with Spark, and store it in a sink like Elasticsearch (or any other NoSQL db). After that you can query your data using a visualization tool like Kibana.
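As a hedged sketch of the middle of that pipeline, here is roughly what reading the Kafka topic with Spark Structured Streaming and writing into Elasticsearch could look like. It assumes the spark-sql-kafka and elasticsearch-hadoop connector packages are on the classpath; hosts, topic, and index names are made up:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("tweets-to-es").getOrCreate()

# Read the raw tweet stream from Kafka and keep the message value as a string
tweets = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-host:9092")
          .option("subscribe", "tweets")
          .load()
          .select(col("value").cast("string").alias("tweet")))

# Write the stream into an Elasticsearch index via the es-hadoop connector,
# where Kibana can then pick it up for visualization
query = (tweets.writeStream
         .format("org.elasticsearch.spark.sql")
         .option("es.nodes", "es-host:9200")
         .option("checkpointLocation", "/tmp/es-checkpoint")
         .start("tweets/_doc"))

query.awaitTermination()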

How to explore HBase data

I am currently building an app that loads data into HBase; I chose HBase because the data is not structured, so a column-oriented database seemed the right fit.
Once the data is in HBase I thought of integrating Solr with it, but I found little information about the subject and no answer to my question "https://stackoverflow.com/questions/36542936/integrating-solr-to-hbase".
So I wanted to ask: how can I query data stored in HBase? Spark Streaming doesn't seem to be made for that.
Any help please?
Thanks in advance.
Assuming that your question is about how to query data from HBase, there are several options:
Apache Phoenix provides a SQL wrapper over HBase (a sketch follows below).
Hive–HBase integration: Hive also provides a SQL wrapper over HBase.
The Spark HBase plugin lets your Apache Spark application interact with Apache HBase.
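For example, a minimal sketch of the Phoenix route from Python, assuming a Phoenix Query Server is running and a Phoenix table or view named EVENTS is already mapped over the HBase table (both assumptions, not part of the answer above):

import phoenixdb  # pip install phoenixdb; talks to the Phoenix Query Server

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cur = conn.cursor()

# Plain SQL over HBase data, with qmark-style bind parameters
cur.execute("SELECT event_id, payload FROM EVENTS WHERE event_id < ?", (100,))
for row in cur.fetchall():
    print(row)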

Library to move data between repositories

Is there any open source library (any programming language) that helps to load data from any data source (file, SQL db, NoSQL db, etc.) and store it into any other data repository? I've checked some ETL libraries like Talend or Octopus but they only deal with SQL databases.
Try https://flywaydb.org/. Since NoSQL is different in nature from a relational structure, you would have to write your own converter. For example, how should this document be translated into an RDBMS?
{ "item_id" : 1, "tags" : ["a","b","c"] }
One option is sketched below. Note that you can use Flyway for relational-to-relational DB migration.
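One conventional translation is to normalize the array into a child table. A minimal sketch with Python's built-in sqlite3, using hypothetical table names:

import json
import sqlite3

doc = json.loads('{ "item_id" : 1, "tags" : ["a","b","c"] }')

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (item_id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE item_tags (item_id INTEGER REFERENCES items, tag TEXT)")

# The JSON array becomes a child table: one row per tag, keyed by item_id
con.execute("INSERT INTO items VALUES (?)", (doc["item_id"],))
con.executemany("INSERT INTO item_tags VALUES (?, ?)",
                [(doc["item_id"], t) for t in doc["tags"]])

print(con.execute("SELECT * FROM item_tags").fetchall())
# [(1, 'a'), (1, 'b'), (1, 'c')]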
Have a look at Apache Camel and their ETL Example. Camel knows how to load and store from a large variety of sources and repositories, including files, SQL, and various NoSQL databases like Cassandra and MongoDB.
You could also check out 10 Open Source ETL Tools.
By the way, Talend is not limited to SQL databases, as shown in these blog posts:
Talend & MongoDB: An Introduction to Simple Relational Mapping into MongoDB
How to Offload Oracle and MySQL Databases into Hadoop using Apache Spark and Talend

PostgreSQL -> Oracle replication

I'm looking for a tool to export data from a PostgreSQL DB to an Oracle data warehouse. I'm really looking for a heterogeneous DB replication tool, rather than an export->convert->import solution.
Continuent Tungsten Replicator looks like it would do the job, but PostgreSQL support won't be ready for another couple months.
Are there any open-source tools out there that will do this? Or am I stuck with some kind of scheduled pg_dump/SQL*Loader solution?
You can create a database link from Oracle to Postgres (this is called heterogeneous connectivity). This makes it possible to select data from Postgres with a SELECT statement in Oracle. You can then use materialized views to schedule and store the results of those selects; a rough sketch follows.
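A rough sketch of that setup, driven from Python with the python-oracledb package. It assumes Oracle Heterogeneous Services / Database Gateway is already configured with a DSN for the Postgres database; all names and credentials are hypothetical:

import oracledb  # pip install oracledb

conn = oracledb.connect(user="dwh", password="secret", dsn="oracle-host/orclpdb")
cur = conn.cursor()

# Database link over the (pre-configured) heterogeneous connection to Postgres
cur.execute("""CREATE DATABASE LINK pglink
               CONNECT TO "pg_user" IDENTIFIED BY "pg_pass" USING 'PGDSN'""")

# Snapshot the remote Postgres table and refresh the copy nightly
cur.execute("""CREATE MATERIALIZED VIEW orders_mv
               REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1
               AS SELECT * FROM "orders"@pglink""")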
It sounds like SymmetricDS would work for your scenario. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time.
Sounds like you want an ETL (extract, transform, load) tool. There are a lot of open source options; Enhydra Octopus and Talend Open Studio are a couple I've come across.
In general, ETL tools offer you better flexibility than the straight-across replication option.
Some offer scheduling, data quality checks, and data lineage.
Consider using the Confluent Kafka Connect JDBC sink and source connectors if you'd like to replicate data changes across heterogeneous databases in real time.
The source connector can select the entire database, particular tables, or rows returned by a provided query, and send the data as Kafka messages to your Kafka broker. The source connector can calculate the diffs based on an incrementing ID column, a timestamp column, or be run in bulk mode where the entire contents are recopied periodically. The sink can read these messages, optionally check them against an Avro or JSON schema, and populate the target database with the results. It's all free, and several sink and source connectors exist for many relational and non-relational databases.
*One major caveat: some JDBC Kafka connectors cannot capture hard deletes.
To get around that limitation, you can use a log-based change-data-capture connector such as Debezium (http://www.debezium.io); see also
Delete events from JDBC Kafka Connect Source.
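For illustration, registering a JDBC source connector comes down to POSTing a JSON config to the Kafka Connect REST API (port 8083 by default). A hedged sketch in Python; the Connect host, connection URL, and table/column names are hypothetical:

import requests

source_config = {
    "name": "pg-orders-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://pg-host:5432/shop",
        "connection.user": "replicator",
        "connection.password": "secret",
        "table.whitelist": "orders",
        # Diff detection, as described above: new rows by id, updates by timestamp
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=source_config)
resp.raise_for_status()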
