Writing and reading data from Cassandra 3.9 - database

I am very new to Cassandra and come from a database background. I want to know the process/tools/utilities by which I can load bulk data into a Cassandra column family and read the data back for analytics.
Thank you in advance!

The commands you use to talk to Cassandra keyspaces (which are to Cassandra what databases are to SQL) are similar to SQL.
Check the CQL query language tutorials.
Also check the DataStax Cassandra tutorials or the ones at tutorialspoint, but make sure what you read matches the version you want to use.
Once you get the hang of the basics you can move on to Cassandra-specific concepts like data replication and partitioning.
A quick & easy start would be to get Cassandra on Docker and set up a container running your keyspace.
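For bulk loads, cqlsh's COPY FROM command and the DataStax Bulk Loader (dsbulk) are the usual starting points for CSV data; for programmatic reads and writes you'd use a driver. Below is a minimal sketch with the DataStax Python driver, assuming a local node and made-up keyspace/table names:

from cassandra.cluster import Cluster     # pip install cassandra-driver
from uuid import uuid4

cluster = Cluster(["127.0.0.1"])          # contact point(s) of your cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        id uuid PRIMARY KEY,
        category text,
        value double
    )
""")

# Write a row with a prepared statement, then read it back
insert = session.prepare("INSERT INTO demo.events (id, category, value) VALUES (?, ?, ?)")
session.execute(insert, (uuid4(), "clicks", 42.0))

for row in session.execute("SELECT category, value FROM demo.events LIMIT 10"):
    print(row.category, row.value)

From there, analytics queries are just SELECT statements over your tables, keeping in mind that in Cassandra you model tables around the queries you want to run (partition key first).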

Related

How do I check the data integrity after migrating a Cassandra database onto AWS Keyspaces

I am trying to migrate a Cassandra cluster onto AWS Keyspaces for Apache Cassandra.
After the migration is done, how can I verify that the data has been migrated successfully as-is?
Many solutions are possible. For instance, you could read all rows of a partition, compute a checksum/signature, and compare it with your original data; then iterate through all your partitions, and repeat for all your tables. Checksums work.
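Roughly, the idea looks like this with the DataStax Python driver; it's only a sketch, the keyspace/table/column names are made up, and the TLS/SigV4 settings Keyspaces requires are omitted:

import hashlib
from cassandra.cluster import Cluster

def partition_checksum(session, pk):
    # Hash every row of one partition in its natural clustering order
    digest = hashlib.sha256()
    for row in session.execute("SELECT * FROM ks.my_table WHERE pk = %s", (pk,)):
        digest.update(repr(tuple(row)).encode("utf-8"))
    return digest.hexdigest()

source = Cluster(["source-host"]).connect()
# Keyspaces endpoint; TLS and SigV4 auth configuration omitted for brevity
target = Cluster(["cassandra.us-east-1.amazonaws.com"], port=9142).connect()

for pk in ["a", "b", "c"]:                 # iterate over all partition keys
    assert partition_checksum(source, pk) == partition_checksum(target, pk), pk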
You could use AWS Glue to perform an 'except' function. Spark has a lot of useful functions for working with massive datasets. Glue is serverless Spark. You can use the Spark Cassandra connector with Cassandra and Keyspaces to work with datasets in Glue. For example, you may want to see the data that is not in Keyspaces:
cassandraTableDataframe.except(keyspacesTableDataframe)
You could also do this by exporting both datasets to S3 and performing these queries in Athena.
Here is a helpful repository of Glue and Keyspaces functions including export, count, and distinct.
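In PySpark terms the comparison might look something like the sketch below; it assumes the spark-cassandra-connector is available to the Glue job, uses made-up keyspace/table names, and leaves out the connection details (contact points, credentials, TLS for Keyspaces):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-vs-keyspaces-diff").getOrCreate()

cassandra_df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="my_table")     # hypothetical keyspace/table
    .load())

# The Keyspaces side is read with the same connector, but it must be pointed
# at the Keyspaces endpoint through its own connection configuration
keyspaces_df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="my_table")
    .load())

# Rows present in the source cluster but missing from Keyspaces
# (exceptAll is PySpark's equivalent of the Scala except shown above)
missing = cassandra_df.exceptAll(keyspaces_df)
missing.show(truncate=False)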

Can we use Elasticsearch on MongoDB?

As a web developer you hear about new technologies every day. Recently I came across Elasticsearch, which is used to analyze big volumes of data. My data is in MongoDB; is it possible to use Elasticsearch on it?
MongoDB Atlas has a feature called 'Atlas Search', which implements the Apache Lucene engine. This could be a solution for your search requirements.
See Atlas Search for details
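As a rough illustration, an Atlas Search query is just another aggregation stage; the connection string, collection, field, and index names below are made up:

from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")
products = client["shop"]["products"]

results = products.aggregate([
    {"$search": {                        # Atlas Search aggregation stage
        "index": "default",              # name of the Atlas Search index
        "text": {"query": "running shoes", "path": "description"},
    }},
    {"$limit": 10},
])
for doc in results:
    print(doc.get("name"))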
It depends on what you mean by "analyze big volumes of data"; what are your requirements? Don't pay too much attention to marketing slogans. Maybe you can connect Elasticsearch with MongoDB via an ODBC driver. Elasticsearch is a document-oriented NoSQL database, just like MongoDB. As usual, both have their pros and cons.
MongoDB is more like a database, i.e. it supports CRUD (Create, Read, Update, Delete) operations and the Aggregation Framework is very powerful.
In Elasticsearch you can store data and analyze or query it. I remember in earlier releases it was not so easy to delete or update existing single documents.

Elasticsearch 5 and SQL Server synchronisation

I am starting an Elasticsearch 5 project with data that currently lives in SQL Server, so I am starting from scratch:
I am thinking about how to import the data from my SQL Server, and especially how to synchronise it when data is updated or added.
I saw that it is advised here not to run batches too frequently.
But how do I build these synchronisation batches? Do I have to write them myself, or are there widely used tools and practices?
Rivers and the JDBC feeder plugin appear to have been widely used, but they don't work with Elasticsearch 5.x.
Any help would be very welcome.
I'd recommend using Logstash:
It's easy to use and set up
You can do your own ETL in Logstash configuration files
You can have multiple JDBC sources in one file
You'll have to figure out how to make incremental (batched) updates to keep your data in sync; it really depends on your data model (see the sketch below)
This is a nice blog piece to begin with:
https://www.elastic.co/blog/logstash-jdbc-input-plugin
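If you do end up hand-rolling the incremental part rather than letting Logstash's JDBC input track it for you, the idea is roughly the following; this is only a sketch, the DSN, table, column, and index names are hypothetical, and note that Elasticsearch 5.x still requires a mapping type:

import pyodbc
from elasticsearch import Elasticsearch, helpers

sql = pyodbc.connect("DSN=sqlserver;UID=etl;PWD=secret")
es = Elasticsearch(["localhost:9200"])

last_seen = "1970-01-01 00:00:00"   # persist this high-water mark between runs

rows = sql.cursor().execute(
    "SELECT id, name, updated_at FROM dbo.products "
    "WHERE updated_at > ? ORDER BY updated_at",
    last_seen)

actions = ({"_index": "products", "_type": "doc", "_id": row.id,
            "_source": {"name": row.name, "updated_at": str(row.updated_at)}}
           for row in rows)
helpers.bulk(es, actions)           # re-indexing by _id makes repeated runs idempotent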

How to explore HBase data

I am currently building an app that loads data into HBase; I chose HBase because the data is not structured, so a column-oriented database is recommended.
Once the data is in HBase I thought of integrating Solr with it, but I found little information on the subject and no answer to my question: https://stackoverflow.com/questions/36542936/integrating-solr-to-hbase
So I wanted to ask: how can I query data stored in HBase? Spark Streaming doesn't seem to be made for that.
Any help please ?
Thanks in advance
Assuming that your question is about how to query data from HBase:
Apache Phoenix provides a SQL wrapper over HBase.
Hive HBase integration: Hive also provides a SQL wrapper over HBase.
The Spark HBase plugin lets your Apache Spark application interact with Apache HBase.
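For example, the Phoenix route from Python might look like the sketch below, assuming the Phoenix Query Server is running and the phoenixdb client is installed; the table and column names are made up:

import phoenixdb   # pip install phoenixdb

# Phoenix Query Server endpoint (default port 8765)
conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

cursor.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT PRIMARY KEY, payload VARCHAR)")
cursor.execute("UPSERT INTO events VALUES (?, ?)", (1, "hello"))
cursor.execute("SELECT id, payload FROM events WHERE id = ?", (1,))
print(cursor.fetchall())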

PostgreSQL -> Oracle replication

I'm looking for a tool to export data from a PostgreSQL DB to an Oracle data warehouse. I'm really looking for a heterogeneous DB replication tool, rather than an export->convert->import solution.
Continuent Tungsten Replicator looks like it would do the job, but PostgreSQL support won't be ready for another couple months.
Are there any open-source tools out there that will do this? Or am I stuck with some kind of scheduled pg_dump/SQL*Loader solution?
You can create a database link from Oracle to Postgres (this is called heterogeneous connectivity). This makes it possible to select data from Postgres with a select statement in Oracle. You can use materialized views to schedule and store the results of those selects.
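A hedged sketch of that approach (the SQL is the important part; python-oracledb is only used to run it), assuming Oracle Database Gateway/Heterogeneous Services is already configured with an ODBC DSN for the Postgres database; the link, DSN, and table names are hypothetical:

import oracledb   # pip install oracledb

conn = oracledb.connect(user="dwh", password="secret", dsn="oracle-host/ORCLPDB1")
cur = conn.cursor()

# Database link through Heterogeneous Services (ODBC DSN configured beforehand)
cur.execute("""
    CREATE DATABASE LINK pg_link
    CONNECT TO "pg_user" IDENTIFIED BY "pg_password"
    USING 'PG_DSN'
""")

# Materialized view that pulls from Postgres and refreshes itself every hour
cur.execute("""
    CREATE MATERIALIZED VIEW customers_mv
    REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/24
    AS SELECT * FROM "customers"@pg_link
""")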
It sounds like SymmetricDS would work for your scenario. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time.
Sounds like you want an ETL (extract, transform, load) tool. There are a lot of open-source options; Enhydra Octopus and Talend Open Studio are a couple I've come across.
In general, ETL tools offer you more flexibility than straight-across replication.
Some offer scheduling, data quality, and data lineage.
Consider using the Confluent Kafka Connect JDBC sink and source connectors if you'd like to replicate data changes across heterogeneous databases in real time.
The source connector can select the entire database, particular tables, or the rows returned by a provided query, and send the data as Kafka messages to your Kafka broker. The source connector can calculate the diffs based on an incrementing id column, a timestamp column, or be run in bulk mode where the entire contents are recopied periodically. The sink can read these messages, optionally check them against an Avro or JSON schema, and populate the target database with the results. It's all free, and several sink and source connectors exist for many relational and non-relational databases.
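For instance, registering a JDBC source connector is just a POST to the Kafka Connect REST API; the sketch below uses made-up connection details, column names, and topic prefix:

import requests

source_config = {
    "name": "postgres-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://pg-host:5432/mydb",
        "connection.user": "replicator",
        "connection.password": "secret",
        "table.whitelist": "customers",
        # diffs computed from a timestamp column plus an incrementing id column
        "mode": "timestamp+incrementing",
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "topic.prefix": "pg-",
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=source_config)
resp.raise_for_status()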
*One major caveat - some JDBC Kafka connectors cannot capture hard deletes.
To get around that limitation, you can use a log-based change data capture connector such as Debezium (http://www.debezium.io), see also
Delete events from JDBC Kafka Connect Source.

Resources