How do I check the data integrity after migrating a Cassandra database onto AWS Keyspaces

I am trying to migrate a Cassandra cluster onto AWS Keyspaces (for Apache Cassandra).
After the migration is done, how can I verify that the data has been migrated successfully as-is?

Many solutions are possible. For instance, you could read all rows of a partition, compute a checksum or signature, and compare it with the same checksum computed on your original data. Repeat this for every partition, and then for every table.
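
As a rough illustration of the checksum idea, here is a minimal Python sketch. It assumes the rows of each table have already been read from both clusters into plain dictionaries (for example through a Cassandra driver); the function names and the `partition_key` argument are made up for this example. Using `repr` as the canonical form is a simplification that only holds if both sides return the same Python types for each column.

```python
import hashlib
from collections import defaultdict


def partition_checksums(rows, partition_key):
    """Return {partition key value: hex digest} for an iterable of dict rows."""
    per_partition = defaultdict(list)
    for row in rows:
        # Canonical form: sort columns by name so column order never matters.
        canonical = repr(sorted(row.items()))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        per_partition[row[partition_key]].append(digest)
    # Sort the row digests so the result does not depend on read order.
    return {
        key: hashlib.sha256("".join(sorted(digests)).encode("utf-8")).hexdigest()
        for key, digests in per_partition.items()
    }


def mismatched_partitions(source_rows, target_rows, partition_key):
    """Return sorted partition keys whose checksum differs or exists on one side only."""
    src = partition_checksums(source_rows, partition_key)
    dst = partition_checksums(target_rows, partition_key)
    return sorted(k for k in src.keys() | dst.keys() if src.get(k) != dst.get(k))


# Hypothetical sample data: row 2 was altered during the migration.
source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
target = [{"name": "a", "id": 1}, {"id": 2, "name": "B"}]
print(mismatched_partitions(source, target, "id"))  # prints [2]
```

For large tables you would compute the digests partition by partition on each side and only ship the digests for comparison, rather than holding all rows in memory as this sketch does.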

You could use AWS Glue to perform an 'except' operation. Spark has a lot of useful functions for working with massive datasets, and Glue is serverless Spark. You can use the Spark Cassandra Connector with both Cassandra and Keyspaces to work with datasets in Glue. For example, you may want to see the rows that are in Cassandra but not in Keyspaces:
cassandraTableDataframe.except(keyspacesTableDataframe)
You could also do this by exporting both datasets to S3 and performing these queries in Athena.
Here is a helpful repository of Glue and Keyspaces functions including export, count, and distinct.

Related

Is it feasible to use Delta Lake without Databricks?

We have our data lake in AWS S3.
The metadata is in Hive, and we have a small running cluster (we haven't used Athena/Glue).
We use Spark and Presto in our Airflow pipeline.
The processed data gets dumped into Snowflake.
The data lake has various formats, but mostly Parquet.
We want to experiment with Databricks. Our plan is to:
Create Delta Lake tables instead of Hive ones for the entire data lake.
Use Databricks for processing and warehousing for a significant part of the data.
We cannot replace Snowflake with Databricks, at least at this moment.
So we need the Delta Lake tables to be usable by other Spark pipelines as well.
Is this last step possible without challenges, or is it tricky?

How is data loaded or synced in Snowflake

We are considering using Snowflake. I tried looking into the documentation and Google, but without luck. How does Snowflake query/store data? For example, if I have a CSV file, a database, a data lake, ... does it query the sources in real time, or does it replicate the data to Snowflake? If it replicates, how often does it update?
Maybe an introduction to the Snowflake architecture will help you here: https://docs.snowflake.com/en/user-guide/intro-key-concepts.html
Let's split your question into two parts:
How does Snowflake store data? Snowflake stores data in its own proprietary file format. The files are called micro-partitions, use a hybrid columnar format, and are stored in, for example, S3 if you are using Snowflake on AWS.
How does Snowflake query data? Snowflake uses compute clusters called virtual warehouses, which correspond to compute instances of your cloud provider underneath. The files are accessed and queried through them.

Historical data migration from Teradata to Snowflake

What are the steps to be taken to migrate historical data load from Teradata to Snowflake?
Imagine there is 200TB+ of historical data combined from all tables.
I am thinking of two approaches, but I don't have enough expertise or experience to execute them, so I am looking for someone to fill in the gaps and offer suggestions.
Approach 1 - Using TPT/FEXP scripts
I know that TPT/FEXP scripts can be written to generate files for a table. How can I create a single script that generates files for all the tables in the database? (Creating 500-odd scripts, one per table, is impractical.)
Once this script is ready, how is it executed in practice? Do we create a shell script and schedule it through an enterprise scheduler like Autosys/Tidal?
Once these files are generated, how do you split them on a Linux machine if each file is huge (the recommended size for data loading in Snowflake is between 100 and 250 MB)?
How do I move these files to Azure Data Lake?
Use COPY INTO / Snowpipe to load into Snowflake Tables.
Approach 2
Using ADF copy activity to extract data from Teradata and create files in ADLS.
Use COPY INTO/ Snowpipe to load into Snowflake Tables.
Which of these two approaches is recommended?
In general, what are the challenges faced in each of these approaches?
Using ADF will be a much better solution. It also allows you to design a data lake as part of your solution.
You can design a generic solution that imports all the tables listed in a configuration. For this, you can choose the recommended file format (Parquet), the size of the files, and parallel loading.
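
To give an idea of what "configuration driven" could look like on the loading side, here is a small Python sketch that builds one Snowflake COPY INTO statement per configured table. The stage name `@migration_stage`, the table names, and the folder layout are all hypothetical; it also assumes the files were written as Parquet and that the target tables already exist.

```python
def build_copy_statements(tables, stage="@migration_stage"):
    """Build one Snowflake COPY INTO statement per configured table."""
    statements = []
    for table in tables:
        statements.append(
            f"COPY INTO {table['target']} "
            f"FROM {stage}/{table['path']}/ "
            "FILE_FORMAT = (TYPE = PARQUET) "
            "MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;"
        )
    return statements


# Hypothetical configuration: one entry per table to migrate.
config = [
    {"target": "SALES.ORDERS", "path": "teradata/orders"},
    {"target": "SALES.CUSTOMERS", "path": "teradata/customers"},
]

for statement in build_copy_statements(config):
    print(statement)
```

The same configuration list could drive the extraction side as well, so that adding a table means adding one entry rather than writing a new script.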
The main challenge you will probably encounter is a poorly working ADF connector to Snowflake. My recommendations on how to bypass the connector problem and how to use Data Lake Gen2 are here:
Trouble loading data into Snowflake using Azure Data Factory
More about the recommendation on how to build Azure Data Lake Storage Gen2 structures can be found here: Best practices for using Azure Data Lake Storage Gen2
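
If you do go the TPT/FEXP route from Approach 1, the file-splitting step asked about in the question could be sketched in Python roughly as follows. This is only an illustration: the function name and the `part_00000.csv` naming are made up, and it assumes plain delimited text with one record per line (no header row, no quoted fields containing newlines).

```python
import os


def split_by_size(src_path, out_dir, max_bytes=200 * 1024 * 1024):
    """Split a line-delimited file into parts of at most max_bytes without breaking a line.

    A single line longer than max_bytes still gets a part of its own.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts, out, written = [], None, 0
    with open(src_path, "rb") as src:
        for line in src:
            # Start a new part for the first line, or when this line would overflow.
            if out is None or (written and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"part_{len(parts):05d}.csv")
                out = open(part_path, "wb")
                parts.append(part_path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

The default of 200 MB is simply a value inside the 100 to 250 MB range mentioned in the question; it is a parameter, not a recommendation.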

Data pipeline - dumping large files from API responses into AWS then with final destination being on premises SQL Server

I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the easiest, most logical and most robust way. We have chosen AWS as our cloud provider, but since we're in the beginning phases we are not attached to any particular architecture or services. Because I'm no expert in the cloud or AWS, I thought I'd post my thoughts on how we can accomplish our goal and see if anyone has any advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.
1) Gather data from multiple sources (using APIs)
2) Dump responses from APIs into S3 buckets
3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets
4) Use Athena to query summaries of the data in S3
5) Store data summaries obtained from Athena queries in on-premises SQL Server
Note: We will program the entire data pipeline using Python (which seems like a good call and easy no matter what AWS services we utilize as boto3 is pretty awesome from what I've seen thus far).
You may use Glue jobs (PySpark) for #4 and #5. You may automate the flow using Glue triggers.

PostgreSQL -> Oracle replication

I'm looking for a tool to export data from a PostgreSQL DB to an Oracle data warehouse. I'm really looking for a heterogeneous DB replication tool, rather than an export->convert->import solution.
Continuent Tungsten Replicator looks like it would do the job, but PostgreSQL support won't be ready for another couple months.
Are there any open-source tools out there that will do this? Or am I stuck with some kind of scheduled pg_dump/SQL*Loader solution?
You can create a database link from Oracle to Postgres (this is called heterogeneous connectivity). This makes it possible to select data from Postgres with a select statement in Oracle. You can use materialized views to schedule and store the results of those selects.
It sounds like SymmetricDS would work for your scenario. SymmetricDS is web-enabled, database independent, data synchronization/replication software. It uses web and database technologies to replicate tables between relational databases in near real time.
Sounds like you want an ETL (extract, transform, load) tool. There are a lot of open source options; Enhydra Octopus and Talend Open Studio are a couple I've come across.
In general ETL tools offer you better flexibility than the straight across replication option.
Some offer scheduling, data quality, and data lineage.
Consider using the Confluent Kafka Connect JDBC sink and source connectors if you'd like to replicate data changes across heterogeneous databases in real time.
The source connector can select the entire database, particular tables, or rows returned by a provided query, and send the data as Kafka messages to your Kafka broker. The source connector can calculate the diffs based on an incrementing id column or a timestamp column, or it can be run in bulk mode, where the entire contents are recopied periodically. The sink can read these messages, optionally check them against an Avro or JSON schema, and populate the target database with the results. It's all free, and several sink and source connectors exist for many relational and non-relational databases.
*One major caveat: some JDBC Kafka connectors cannot capture hard deletes.
To get around that limitation, you can use a log-based change data capture connector such as Debezium (http://www.debezium.io); see also
Delete events from JDBC Kafka Connect Source.
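
To make the incrementing-id mode and the hard-delete caveat more concrete, here is a toy Python sketch of query-based polling. It uses an in-memory SQLite table instead of Kafka Connect, so it only shows the idea, not the connector's actual implementation; the `orders` table and the function name are invented for the example, and the table and column names are assumed to be trusted identifiers.

```python
import sqlite3


def poll_new_rows(conn, table, id_column, last_seen_id):
    """Return rows whose id is greater than last_seen_id, plus the new offset."""
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {id_column} > ? ORDER BY {id_column}",
        (last_seen_id,),
    )
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, values)) for values in cursor.fetchall()]
    new_offset = rows[-1][id_column] if rows else last_seen_id
    return rows, new_offset


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders (item) VALUES (?)", [("a",), ("b",)])

# First poll: both rows are new.
first_batch, offset = poll_new_rows(conn, "orders", "id", 0)

conn.execute("INSERT INTO orders (item) VALUES ('c')")
conn.execute("DELETE FROM orders WHERE id = 1")  # hard delete

# Second poll: only the inserted row shows up; the delete is never seen.
second_batch, offset = poll_new_rows(conn, "orders", "id", offset)
print(second_batch)
```

Because the poll only asks for ids above the stored offset, a deleted row simply stops existing without producing any event, which is why a log-based approach is needed when deletes have to be replicated.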
