We are looking into automatically syncing external tables from the AWS Glue Data Catalog to Snowflake. I am aware of the following: https://docs.snowflake.com/en/user-guide/tables-external-hive.html#integrating-existing-hive-tables-and-partitions-with-snowflake and we tried deploying the Hive metastore connector to an EMR cluster, which is integrated with the AWS Glue Data Catalog.
Unfortunately, that doesn't seem to work: newly created tables in Hive (from the EMR cluster) are not replicated to Snowflake.
The connector works as long as we are using a local metastore, so I guess our setup is not supported. Any ideas how we could achieve our goal?
Thanks!
I am new to Azure and have no prior experience or knowledge of working with Azure data warehouse systems (now Azure Synapse Analytics).
I have access to a "read only" data warehouse (not in Azure) that looks like this:
I want to replicate this data warehouse as it is on the Azure cloud. Can anyone point me in the right direction (video tutorials or documentation) and tell me the number of steps involved in this process? There are around 40 databases in this warehouse. And what if I wanted to replicate only specific ones?
You can't do that with only read-only permission. No matter which data warehouse, you need server admin or database owner permission to replicate the database.
You can see this in any of the documentation related to database backup/migration/replication, for example: https://learn.microsoft.com/en-us/sql/t-sql/statements/backup-transact-sql?view=sql-server-ver15#permissions
If you have enough permission, then you can do that. But for Azure SQL Data Warehouse, now called dedicated SQL pool (formerly SQL DW), you can't replicate an on-premises data warehouse to Azure directly.
The official documentation provides a way to import the data into the Azure SQL pool (formerly SQL DW):
Once your dedicated SQL pool is created, you can import big data with simple PolyBase T-SQL queries, and then use the power of the distributed query engine to run high-performance analytics.
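As a rough illustration of that import path (not something from the original answer), here is a minimal Python sketch that drives a PolyBase load into a dedicated SQL pool through pyodbc; the server, pool, storage account, and object names are all placeholders.

```python
# Hedged sketch: PolyBase-style import into a dedicated SQL pool, driven from Python via pyodbc.
# All server, database, storage, and table names below are placeholders, not real values.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-workspace>.sql.azuresynapse.net,1433;"
    "Database=<your-sql-pool>;Uid=<user>;Pwd=<password>;Encrypt=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# PolyBase objects: external data source, file format, and an external table over CSV files
# in Blob Storage. (A DATABASE SCOPED CREDENTIAL would also be needed for a private container.)
cursor.execute("""
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (TYPE = HADOOP, LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net');
""")
cursor.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));
""")
cursor.execute("""
CREATE EXTERNAL TABLE dbo.SalesExternal (Id INT, Amount DECIMAL(10, 2))
WITH (LOCATION = '/sales/', DATA_SOURCE = AzureStorage, FILE_FORMAT = CsvFormat);
""")

# CTAS pulls the external data into a distributed table inside the SQL pool.
cursor.execute("""
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.SalesExternal;
""")
conn.close()
```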
You could also use another ETL tool to migrate the data from the on-premises data warehouse to Azure, for example Data Factory; combine these two tutorials:
Copy data to and from SQL Server by using Azure Data Factory
Copy and transform data in Azure Synapse Analytics by using Azure Data Factory
My current model looks like this:
Gather disparate data sources and import into SQL Server.
Process and transform data using SSIS packages.
The final step in the SSIS package uploads data to the data warehouse.
BI tools pull data from the data warehouse for end users.
Is this a logical workflow? I initially was going to use Data Factory and the Azure-SSIS integration runtime to process data. However, I didn't understand why these steps were needed, as it would seem simpler in my situation just to build my SSIS packages on premises and upload the processed data to my data warehouse. What benefits would I gain from using Data Factory and the integration runtime? My main concern is that my current model will make automation difficult, but I'm not entirely sure. Any help is appreciated.
Your possible paths here would be SSIS on-prem, SSIS on a VM in the cloud, SSIS in ADF, or natively building the pipelines in ADF.
ADF is an Azure Cloud PaaS managed service for data movement and data integration orchestration. To reach back into on-prem data sources, you need to use an Integration Runtime gateway on the source side. So, if you are looking to move to a Cloud-first architecture or migrating into Azure, ADF is a good solution (use V2).
If you are remaining all on-prem, SSIS on-prem is the best scenario.
If this is hybrid, where you will continue to have some data on-prem and load Azure Data Warehouse in the cloud, then you can still use SSIS on-prem with connectors into ADW as the target. Or, if you have to eliminate the local server concept, you can run that SSIS in a VM in Azure.
If you want to eliminate both the datacenter server and the need to patch, maintain, etc. the SSIS server, then use SSIS in ADF, which provides SSIS as a Service. In that case, you can still move data in a hybrid manner.
It really is going to depend on factors such as: are you more comfortable developing SSIS jobs in Visual Studio, or do you want to build the pipelines in JSON in ADF? Do you have a plan or a need to move to the cloud? Do you want to move to a cloud-managed service (i.e. ADF V2)?
I hope that helps!!
Our organization uses Elasticsearch, Logstash & Kibana (ELK), and we use a SQL Server data warehouse for analysis and reporting. There are some data items from ELK that we want to copy into the data warehouse. I have found many websites describing how to load SQL Server data into ELK. However, we need to go in the other direction. How can I transfer data from ELK to SQL Server, preferably using SSIS?
I have implemented a similar solution in Python, where we ingest data from an Elasticsearch cluster into our SQL data warehouse. You can import the Elasticsearch package for Python, which allows you to do that.
You can find more information here: https://elasticsearch-py.readthedocs.io/en/master/
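To make that concrete, here is a minimal sketch of the Python approach described above, assuming the elasticsearch-py and pyodbc packages; the index name, staging table, and connection strings are placeholders rather than values from the original answer.

```python
# Hedged sketch: stream documents out of an Elasticsearch index and bulk-insert them
# into a SQL Server staging table. Index, table, and connection details are placeholders.
import pyodbc
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(["http://localhost:9200"])
sql = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<dwh-server>;Database=<dwh-db>;Uid=<user>;Pwd=<password>;"
)
cursor = sql.cursor()

# Scroll through every document in the source index without loading it all at once.
rows = []
for hit in scan(es, index="app-logs", query={"query": {"match_all": {}}}):
    doc = hit["_source"]
    rows.append((hit["_id"], doc.get("timestamp"), doc.get("message")))

# Bulk insert into the staging table in the data warehouse.
cursor.fast_executemany = True
cursor.executemany(
    "INSERT INTO dbo.ElkStaging (DocId, EventTime, Message) VALUES (?, ?, ?)",
    rows,
)
sql.commit()
sql.close()
```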
I'm researching the differences between AWS and Azure for my company. We are going to make a web-based application that will span 3 regions, and each region needs to have an MS SQL database.
But I can't figure out how to do the following with AWS: the databases need to sync between each region (two-way), so the data stays the same in every database.
Why do we want this? For example, a customer* from the EU adds a record to the database. This database then needs to sync with the other regions, so that a customer from the US region can see the added records. (*Customers can add products to the database.)
Do you guys have any idea how we can achieve this?
It's a requirement to use MS SQL.
If you are using SQL Server on EC2 instances, then the only way to achieve multi-region, multi-master for MS SQL Server is to use Peer-to-Peer Transactional Replication; however, it doesn't protect against individual row conflicts.
https://technet.microsoft.com/en-us/library/ms151196.aspx
This isn't a feature of AWS RDS for MS SQL; however, there is another product for multi-region replication available on the AWS Marketplace, but it only works for read replicas.
http://cloudbasic.net/aws/rds/alwayson/
At present, AWS doesn't support read replicas for SQL Server RDS databases.
However, replication between AWS RDS SQL Server databases can be done using DMS (Database Migration Service). Refer to the link below for more details:
https://aws.amazon.com/blogs/database/introducing-ongoing-replication-from-amazon-rds-for-sql-server-using-aws-database-migration-service/
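The blog post walks through the console setup; purely as a hedged sketch, the same ongoing replication task could also be created programmatically with boto3. Every ARN, identifier, and table name below is a placeholder, the replication instance and the two endpoints are assumed to already exist, and note that a DMS task replicates in one direction only, so a two-way setup would need one task per direction.

```python
# Hedged sketch: create and start an ongoing (full load + CDC) DMS replication task
# between two RDS SQL Server endpoints. All ARNs and names are placeholders.
import boto3

dms = boto3.client("dms", region_name="eu-west-1")

response = dms.create_replication_task(
    ReplicationTaskIdentifier="rds-sqlserver-eu-to-us",
    SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing change replication
    TableMappings="""{
      "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-products",
        "object-locator": {"schema-name": "dbo", "table-name": "Products"},
        "rule-action": "include"
      }]
    }""",
)

dms.start_replication_task(
    ReplicationTaskArn=response["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```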
I have created an application on Bluemix. I need to copy my database to Bluemix so that it can be accessed from my adapter. Can anyone give me detailed steps on how to proceed?
First thing: if your database is reachable through the Internet and you only need to connect to it from the application, please note that a CF application on Bluemix can access the public network, so it is already able to connect to your DB in this scenario.
Assuming that you have a requirement to migrate the DB to Bluemix (you didn't specify which kind of database you want to migrate), here are the main (not all) possibilities you currently have:
RDBMS:
PostgreSQL by Compose (you need an account on compose.io)
SQL Database (DB2, only Premium plan available)
ClearDB (MySQL)
ElephantSQL (this is basically a PostgreSQL as a Service - that is you have to work on the db via API)
You could use the RDBMS capability of dashDB
No-SQL:
Cloudant (documental)
Redis by Compose (ultra fast key-value db. You need an account on compose.io)
MongoDB by Compose (you need an account on compose.io)
IBM Graph (graph No-SQL db)
I suggest you take a look at the Bluemix Catalog (subcategory Data and Analytics) and refer to the Docs as well.
You can create a dashDB service on Bluemix and copy/upload your data to the Bluemix dashDB database, using the dashDB VCAP credentials to connect to it from your adapter, or you can bind your dashDB service to your application on Bluemix.
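As a hedged illustration of that last step, here is a minimal Python sketch that connects to dashDB (Db2) with the ibm_db driver using values taken from the service's VCAP credentials; the hostname, credentials, and table name are placeholders, and this is just a quick connectivity check rather than the adapter code itself.

```python
# Hedged sketch: connect to a dashDB (Db2) instance using the ibm_db driver, with
# connection values copied from the service's VCAP_SERVICES credentials on Bluemix.
# Hostname, credentials, and the table name are placeholders for illustration only.
import ibm_db

# Values from the dashDB service's VCAP_SERVICES credentials block.
dsn = (
    "DATABASE=BLUDB;"
    "HOSTNAME=<dashdb-host>.services.dal.bluemix.net;"
    "PORT=50000;"
    "PROTOCOL=TCPIP;"
    "UID=<username>;"
    "PWD=<password>;"
)

conn = ibm_db.connect(dsn, "", "")

# Simple check that the copied data is readable before wiring up the adapter.
stmt = ibm_db.exec_immediate(conn, "SELECT COUNT(*) AS N FROM MYSCHEMA.MYTABLE")
row = ibm_db.fetch_assoc(stmt)
print("rows copied:", row["N"])

ibm_db.close(conn)
```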