We have our data lake in AWS S3.
Metadata is in Hive, and we have a small running cluster (we haven't used Athena/Glue).
We use Spark and Presto in our Airflow pipelines.
The processed data gets dumped into Snowflake.
The data lake holds various formats, but mostly Parquet.
We want to experiment with Databricks. Our plan is to:
Create Delta Lake tables instead of Hive ones for the entire data lake.
Use Databricks for processing and warehousing for a significant part of the data.
We cannot replace Snowflake with Databricks, at least at this moment.
So we need the Delta Lake tables to be usable by other Spark pipelines as well.
Is this last step possible without challenges, or is it tricky?
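To make the last point concrete, here is a minimal sketch of how we imagine a non-Databricks Spark pipeline reading one of those Delta Lake tables, assuming the open-source delta-spark (delta-core) package is available on the job's classpath; the bucket and table path below are placeholders.

from pyspark.sql import SparkSession

# Open-source Spark needs the delta-spark (delta-core) package plus these two
# settings to read Delta tables; package coordinates depend on the Spark version.
spark = (
    SparkSession.builder
    .appName("read-delta-from-external-spark")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical location; replace with the actual Delta table path in the data lake
events = spark.read.format("delta").load("s3a://my-datalake/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()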
I am trying to migrate a Cassandra cluster to Amazon Keyspaces (for Apache Cassandra).
After the migration is done, how can I verify that the data has been migrated successfully, as-is?
Many solutions are possible. For instance, you could read all rows of a partition, compute a checksum/signature, and compare it with your original data; then iterate through all your partitions, and do the same for all your tables. Checksums work.
You could use AWS Glue to perform an 'except' comparison. Spark has a lot of useful functions for working with massive datasets, and Glue is serverless Spark. You can use the Spark Cassandra connector with both Cassandra and Keyspaces to work with the datasets in Glue. For example, you may want to see the data that is not in Keyspaces:
cassandraTableDataframe.except(keyspacesTableDataframe)
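A minimal PySpark sketch of that comparison follows, assuming the spark-cassandra-connector is on the Glue job's classpath; hosts, keyspace, and table names are placeholders, and PySpark's equivalent of the Scala except() call is exceptAll()/subtract().

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-vs-keyspaces-diff").getOrCreate()

# Source: the self-managed Cassandra cluster (host is a placeholder)
cassandra_df = (spark.read.format("org.apache.spark.sql.cassandra")
                .option("spark.cassandra.connection.host", "10.0.0.10")
                .options(keyspace="my_keyspace", table="my_table")
                .load())

# Target: Amazon Keyspaces via its service endpoint (also a placeholder)
keyspaces_df = (spark.read.format("org.apache.spark.sql.cassandra")
                .option("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")
                .options(keyspace="my_keyspace", table="my_table")
                .load())

# Rows present in the source cluster but missing or different in Keyspaces
missing_in_keyspaces = cassandra_df.exceptAll(keyspaces_df)
print(missing_in_keyspaces.count())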
You could also do this by exporting both datasets to S3 and performing these queries in Athena.
Here is a helpful repository of Glue and Keyspaces functions, including export, count, and distinct.
We are considering using Snowflake. I tried looking into the documentation and Google, but without luck. How does Snowflake query/store data? As an example, if I have a CSV file, a database, a data lake ... does it query the sources in real time, or does it replicate the data into Snowflake? If it replicates, how often does it update?
Maybe an introduction to the Snowflake architecture will help you here: https://docs.snowflake.com/en/user-guide/intro-key-concepts.html
Let's split your question into two parts:
How does Snowflake store data? Basically, Snowflake stores data in its own proprietary file format. The files are called micro-partitions, are in a hybrid columnar format, and are stored in, for example, S3 if you are using Snowflake on AWS.
How does Snowflake query data? For this, Snowflake leverages compute instances called virtual warehouses, which correspond to compute instances of your cloud provider underneath. With them, the files are accessed and queried.
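To make the second point concrete, here is a minimal sketch using the Snowflake Python connector; the account, credentials, warehouse, and table names are placeholders. The warehouse named in USE WAREHOUSE is the compute that actually scans the micro-partitions.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    database="MY_DB",
    schema="PUBLIC",
)
cur = conn.cursor()
cur.execute("USE WAREHOUSE MY_WH")            # the compute that will scan the micro-partitions
cur.execute("SELECT COUNT(*) FROM MY_TABLE")  # executed by the warehouse, not by the client
print(cur.fetchone()[0])
cur.close()
conn.close()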
What are the steps to be taken to migrate a historical data load from Teradata to Snowflake?
Imagine there is 200 TB+ of historical data combined across all tables.
I am thinking of two approaches, but I don't have enough expertise and experience to execute them, so I am looking for someone to fill in the gaps and offer some suggestions.
Approach 1: Using TPT/FEXP scripts
I know that TPT/FEXP scripts can be written to generate files for a table. How can I create a single script that can generate files for all the tables in the database? (Creating 500-odd scripts, one per table, is impractical.)
Once this script is ready, how is it executed in practice? Do we create a shell script and schedule it through an enterprise scheduler like Autosys/Tidal?
Once these files are generated, how do you split them on a Linux machine if each file is huge (the recommended size for loading into Snowflake is between 100 and 250 MB)?
How do we move these files to Azure Data Lake?
Use COPY INTO / Snowpipe to load into Snowflake tables.
Approach 2: Using ADF
Use an ADF copy activity to extract data from Teradata and create files in ADLS.
Use COPY INTO / Snowpipe to load into Snowflake tables.
Which of these two is the recommended approach?
In general, what are the challenges faced in each of these approaches?
Using ADF will be a much better solution. It also allows you to design a data lake as part of your solution.
You can design a generic solution that imports all the tables provided in a configuration. For this, you can choose the recommended file format (Parquet), the size of the output files, and parallel loading.
The main challenge you will encounter is probably the poorly working ADF connector for Snowflake; here you will find my recommendations on how to work around the connector problem and how to use Data Lake Gen2:
Trouble loading data into Snowflake using Azure Data Factory
More recommendations on how to build Azure Data Lake Storage Gen2 structures can be found here: Best practices for using Azure Data Lake Storage Gen2
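For the final loading step shared by both approaches, here is a rough sketch issued through the Snowflake Python connector; the stage, storage integration, container, and table names are placeholders for whatever your ADF pipeline produces in ADLS Gen2.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    database="MY_DB", schema="STAGING", warehouse="LOAD_WH",
)
cur = conn.cursor()

# One-time setup: an external stage pointing at the ADLS Gen2 container that ADF writes to
cur.execute("""
    CREATE STAGE IF NOT EXISTS teradata_migration_stage
    URL = 'azure://myaccount.blob.core.windows.net/teradata-extracts/'
    STORAGE_INTEGRATION = my_azure_integration
    FILE_FORMAT = (TYPE = PARQUET)
""")

# Per-table load; MATCH_BY_COLUMN_NAME maps Parquet columns to the target table's columns
cur.execute("""
    COPY INTO MY_DB.STAGING.CUSTOMER
    FROM @teradata_migration_stage/customer/
    MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
""")
print(cur.fetchall())  # one result row per loaded file
cur.close()
conn.close()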
I am working on a requirement where I need to stream data from Snowflake to Oracle for some value-added processing.
A few methods I have come across are unloading files to S3 and then loading them into Oracle, or using Informatica.
But both of those approaches require some effort, so is there any simpler way of streaming data from Snowflake to Oracle?
Snowflake cannot connect directly to Oracle. You'll need some tooling or code "in between" the two.
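As a rough illustration of such "in between" code, the sketch below pulls batches from Snowflake with the Python connector and inserts them into Oracle with the python-oracledb driver; connection details, tables, and columns are placeholders, and for large volumes the S3 unload route is still the better option.

import snowflake.connector
import oracledb  # python-oracledb driver

sf = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    database="MY_DB", schema="PUBLIC", warehouse="MY_WH",
)
ora = oracledb.connect(user="ora_user", password="ora_password",
                       dsn="orahost:1521/ORCLPDB1")

sf_cur = sf.cursor()
sf_cur.execute("SELECT id, amount, updated_at FROM sales WHERE updated_at > %s",
               ("2023-01-01",))

ora_cur = ora.cursor()
while True:
    rows = sf_cur.fetchmany(10000)
    if not rows:
        break
    # Batched inserts into the Oracle side; table and columns are placeholders
    ora_cur.executemany(
        "INSERT INTO sales_copy (id, amount, updated_at) VALUES (:1, :2, :3)",
        rows,
    )
    ora.commit()

sf_cur.close()
sf.close()
ora_cur.close()
ora.close()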
I came across this data migration tool the other day, and it appears to support both Snowflake and Oracle: https://github.com/markddrake/YADAMU---Yet-Another-DAta-Migration-Utility/releases/tag/v1.0
-Paul-
I'm looking for a tutorial or something that allows me to learn Presto step by step.
The idea is to start by integrating files and MSSQL, which is my area of knowledge.
Unfortunately, since it is a relatively new area, I didn't find anything beyond the Facebook page or the Presto.io page; however, that is not good enough for someone who wants to start learning the big data world from scratch.
I would appreciate your help and/or orientation in this area.
Presto has 2 primary use cases:
querying data stored in a cluster (on Hadoop's HDFS) or in a cloud (e.g. Amazon S3)
data federation, i.e. querying (and joining) data from multiple data sources (e.g. HDFS, S3, traditional RDBMS like PostgreSQL or SQL Server)
As far as SQL Server support is concerned -- Presto has supported connecting to SQL Server since https://github.com/prestosql/presto/commit/072440cbb2c8df2a689c4c903dd325013eae41a0.
When it comes to querying files -- Presto uses the Hive metastore to keep track of metadata (everything besides actually reading the data). Thus the files must reside on HDFS or S3 to be accessible (other cloud data stores, like Azure Blob Storage, are, AFAIK, not supported yet).
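As a rough illustration of the federation use case, the sketch below joins a Hive/S3 table with a SQL Server table in a single Presto query, submitted here through the presto-python-client; the coordinator host, catalogs, schemas, and table names are placeholders for a hypothetical setup.

import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.internal",
    port=8080,
    user="analyst",
    catalog="hive",      # default catalog; others are referenced explicitly in the query
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT o.order_id, o.total, c.customer_name
    FROM hive.sales.orders o          -- Parquet files on S3, tracked by the Hive metastore
    JOIN sqlserver.dbo.customers c    -- table living in SQL Server
      ON o.customer_id = c.customer_id
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)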