I am currently building an app that loads data into HBase. I chose HBase because the data is not structured, and a column-oriented database is recommended for that.
Once the data is in HBase, I thought of integrating Solr with it, but I found little information about the subject and no answer to my question "https://stackoverflow.com/questions/36542936/integrating-solr-to-hbase".
So I wanted to ask: how can I query data stored in HBase? Spark Streaming doesn't seem to be made for that.
Any help, please?
Thanks in advance
Assuming your question is about how to query data from HBase:
Apache Phoenix provides a SQL wrapper over HBase.
Hive HBase integration: Hive also provides a SQL wrapper over HBase.
The Spark HBase connector lets your Apache Spark application interact with Apache HBase.
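If you'd rather query HBase directly from application code than through a SQL layer, here is a minimal, hedged sketch using the happybase Python client, which talks to the HBase Thrift server. The table name `events`, column family `cf`, and row-key prefix are assumptions for illustration, not anything your schema must use:

```python
# Sketch: scanning HBase rows from Python via the Thrift gateway (happybase).
# Assumed schema: table 'events' with column family 'cf'.

def decode_row(raw):
    """Decode HBase's bytes-keyed cell dict into plain strings."""
    return {k.decode("utf-8"): v.decode("utf-8") for k, v in raw.items()}

def scan_events(host="localhost", prefix=b"user1"):
    import happybase  # requires a running HBase Thrift server
    conn = happybase.Connection(host)
    table = conn.table("events")
    # Scan only the rows whose key starts with the given prefix.
    for key, raw in table.scan(row_prefix=prefix):
        yield key.decode("utf-8"), decode_row(raw)

if __name__ == "__main__":
    for key, cells in scan_events():
        print(key, cells)
```

You would need the Thrift server running (`hbase thrift start`) for this to work; for ad-hoc SQL-style queries, the Phoenix or Hive options above are usually more convenient.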
I am trying to migrate a Cassandra cluster to AWS Keyspaces (for Apache Cassandra).
After the migration is done, how can I verify that the data has been migrated successfully, as-is?
Many solutions are possible. For instance, you could read all rows of a partition, compute a checksum/signature, and compare it with your original data; then iterate through all your partitions, and repeat for all your tables. Checksums work.
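The checksum idea can be sketched in a few lines of Python. The canonical row serialization and the XOR combination below are one possible choice (an assumption, not a prescribed format) that makes the signature independent of the order in which rows are read back:

```python
import hashlib

def partition_checksum(rows):
    """Order-independent checksum of a partition's rows.

    Each row (a dict) is serialized to a canonical string, hashed,
    and the digests are XOR-combined so row order does not matter.
    """
    combined = 0
    for row in rows:
        canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        combined ^= int.from_bytes(digest, "big")
    return f"{combined:064x}"

# Same rows in a different order produce the same signature:
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
target = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
assert partition_checksum(source) == partition_checksum(target)
```

In practice you would run this per partition on both the source cluster and Keyspaces and compare the two signatures, which avoids shipping whole partitions around just to compare them.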
You could use AWS Glue to perform an 'except' operation. Spark has a lot of useful functions for working with massive datasets, and Glue is serverless Spark. You can use the Spark Cassandra Connector with both Cassandra and Keyspaces to work with datasets in Glue. For example, you may want to see the data that is not in Keyspaces:
cassandraTableDataframe.except(keyspacesTableDataframe)
You could also do this by exporting both datasets to s3 and performing these queries in Athena.
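For intuition, the 'except' comparison is just a set difference over rows. A plain-Python toy version of the same idea (Spark/Glue does this distributed and at scale; the row contents are made up for illustration):

```python
# Toy set-difference over rows: which source rows are missing from the target?
def rows_missing_from_target(source_rows, target_rows):
    """Return rows present in source but absent from target."""
    target_set = {tuple(sorted(r.items())) for r in target_rows}
    return [r for r in source_rows
            if tuple(sorted(r.items())) not in target_set]

cassandra_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
keyspaces_rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
print(rows_missing_from_target(cassandra_rows, keyspaces_rows))
# -> [{'id': 3, 'v': 'c'}]
```

In PySpark, the corresponding DataFrame operations are `subtract` and `exceptAll`.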
Here is a helpful repository of Glue and Keyspaces functions including export, count, and distinct.
As web developers, we hear about new technologies every day. Recently I came to know about Elasticsearch, which is used to analyze big volumes of data. My data is in MongoDB; is it possible to use Elasticsearch on it?
MongoDB Atlas has a feature called 'Atlas Search', which implements the Apache Lucene engine. This could be a solution for your search requirements.
See Atlas Search for details.
It depends on what you mean by "analyze big volumes of data": what are your requirements? Don't pay too much attention to marketing slogans. Maybe you can connect Elasticsearch to MongoDB via an ODBC driver. Elasticsearch is a document-oriented NoSQL database, as is MongoDB. As usual, both have their pros and cons.
MongoDB is more like a database, i.e. it supports CRUD (Create, Read, Update, Delete) operations, and its Aggregation Framework is very powerful.
In Elasticsearch you can store data and analyze or query it. I remember that in earlier releases it was not so easy to delete or update single existing documents.
I am very new to Cassandra and come from a database background. I want to know the process/tools/utilities by which I can load bulk data into a Cassandra column family and read the data back for analytics.
Thank you in advance!
The commands you use to talk to Cassandra keyspaces (which are to Cassandra what databases are to SQL) are similar to SQL.
Check the CQL query language tutorials.
Also check the DataStax Cassandra tutorials or the ones at TutorialsPoint, but make sure what you read matches the version you want to use.
Once you get the hang of the basics, you can move on to Cassandra-specific concepts like data replication and partitioning.
A quick and easy way to start is to get Cassandra on Docker and set up a container running your keyspace.
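To make the bulk-loading part concrete, here is a hedged sketch using the DataStax Python driver (`pip install cassandra-driver`). The keyspace `demo`, table `users`, and column names are assumptions for illustration:

```python
# Sketch: bulk-loading rows into Cassandra with prepared statements and batches.
# Assumed schema: keyspace 'demo', table users(id int PRIMARY KEY, name text).

def chunked(rows, size):
    """Split a list of rows into batches of at most `size` items."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def bulk_load(rows, batch_size=100):
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo")
    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
    for batch_rows in chunked(rows, batch_size):
        batch = BatchStatement()
        for row in batch_rows:
            batch.add(insert, (row["id"], row["name"]))
        session.execute(batch)
    cluster.shutdown()
```

For very large loads, note that big multi-partition batches are a Cassandra anti-pattern; group batch rows by partition key, or use a dedicated loader such as DSBulk.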
I am using Twitter Streaming and want to visualize my data. Which database is recommended as the most compatible and feature-rich?
You could set up a data pipeline where you fetch and move your data using a tool like Apache Flume and/or Apache Kafka, analyze it with Spark, and store it in a sink like Elasticsearch (or any other NoSQL database). After that you can explore your data with a visualization tool like Kibana.
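The pipeline shape described above (ingest, analyze, sink) can be illustrated with a toy in-process version. The tweet fields and the plain list standing in for Elasticsearch are assumptions for illustration:

```python
# Toy pipeline: source -> analyze -> sink. In a real deployment the source
# would be Kafka/Flume, analyze() a Spark job, and the sink Elasticsearch.

def analyze(tweet):
    """Derive simple features for later visualization."""
    text = tweet["text"]
    return {
        "text": text,
        "length": len(text),
        "hashtags": [w for w in text.split() if w.startswith("#")],
    }

def run_pipeline(source, sink):
    for tweet in source:
        sink.append(analyze(tweet))

sink = []  # stand-in for an Elasticsearch index
run_pipeline([{"text": "spark + kafka rocks #bigdata"}], sink)
print(sink[0]["hashtags"])  # -> ['#bigdata']
```

The point of the architecture is that each stage is replaceable: the analysis logic stays the same whether the source is a list in memory or a Kafka topic.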
Is it possible to run Apache Kylin without other databases like HBase (plus HDFS)? That is, can you store the raw data and the cube metadata somewhere else?
I think you could use Apache Hive with managed native tables
(Hive storage handlers).
Hive can connect over an ODBC driver to MySQL, for example.
To use Kylin, HDFS is mandatory. Both the raw data and the cube data will be stored in HDFS.
If you want to support another NoSQL datastore like Cassandra, you can consider another framework, FiloDB.