Clone data across Snowflake accounts on a schema basis - snowflake-cloud-data-platform

I have two Snowflake accounts. I want to clone data from some schemas of a database in one account to schemas of a database in another account.
Account1: https://account1.us-east-1.snowflakecomputing.com
Database: DB_ONE
SCHEMAS: A1SCHEMA1, A1SCHEMA2, A1SCHEMA3, A1SCHEMA4 (has external tables)
Account2: https://account2.us-east-1.snowflakecomputing.com
Database: DB_TWO
SCHEMAS: A2SCHEMA1, A2SCHEMA3, A2SCHEMA4 (has external tables)
Both accounts are under same organization.
I want to clone A1SCHEMA1 of DB_ONE from account1 to A2SCHEMA1 of DB_TWO in account2.
Is this possible? If so, what are the instructions? I have found information at the database level but not at the schema level. I would also need to refresh the cloned data on demand.
Can I also clone A1SCHEMA4 of DB_ONE from account1 to A2SCHEMA4 of DB_TWO in account2, given that it has external tables?
Note: DB_ONE is not created from a share. Basically, I want to get data from prod into a lower environment, by replication or cloning, but I want to be able to refresh it as well.

Since your goal appears to be to leverage prod data for development purposes, data sharing isn't a good solution, since it is read-only. Data replication is probably the best solution here, but since you want it at the schema level, you're going to need to change things up a bit:
Create a schema-level clone on Prod into a separate database on Prod
On dev, create a database that is a replica of the Prod clone database. This will be a read-only replica of prod, so you'll need the next step.
Once the prod clone is replicated to dev, you can then clone that database/schema into your persistent development structures at the schema level.
This sounds like a lot of data hops, but keep in mind that clones are zero-copy, so the only true data movement happens across the replication process. This will cost you some replication processing, but since the two accounts are in the same region, you will not be charged for data egress and the process will run pretty fast.
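As a rough sketch of those steps in SQL (the organization name myorg and the staging database names DB_ONE_CLONE / DB_ONE_CLONE_REPLICA are placeholders; this assumes replication is enabled for both accounts in the organization):

-- On account1 (prod): stage the schema-level clone in its own database
CREATE DATABASE IF NOT EXISTS DB_ONE_CLONE;
CREATE OR REPLACE SCHEMA DB_ONE_CLONE.A1SCHEMA1 CLONE DB_ONE.A1SCHEMA1;

-- On account1: allow the clone database to be replicated to account2
ALTER DATABASE DB_ONE_CLONE ENABLE REPLICATION TO ACCOUNTS myorg.account2;

-- On account2 (dev): create a read-only replica of the clone database
CREATE DATABASE DB_ONE_CLONE_REPLICA AS REPLICA OF myorg.account1.DB_ONE_CLONE;

-- On account2: refresh on demand (re-run the schema clone on prod first if you
-- want the latest data), then clone into your persistent dev structures
ALTER DATABASE DB_ONE_CLONE_REPLICA REFRESH;
CREATE OR REPLACE SCHEMA DB_TWO.A2SCHEMA1 CLONE DB_ONE_CLONE_REPLICA.A1SCHEMA1;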

Related

Advice on Azure platform to host Data Warehouse

I am a Data Warehouse developer currently looking into using the Azure platform to host a new Data Warehouse.
My experience is with using on-premises servers hosting standard SQL Server databases, one for the staging database and one for the Data Warehouse. Typically I would use a combination of SSIS and stored procedures running in a scheduled SQL Server Agent job for the ETL.
How can I replicate this kind of setup within Azure?
The storage size will be less than 1 TB, so could I just use Azure SQL Database rather than Azure SQL Data Warehouse?
If so would I need separate databases for staging and the data warehouse using the elastic pool option?
The data that I will be loading into staging will all be on-premises. Will SSIS still be suitable for loading to Azure, or will Azure Data Factory be a better fit?
Any help at all would be greatly appreciated! Thanks.
Leon has lots of good information there. But from a Data Warehouse perspective, I wouldn't use Data Sync for ETL purposes (mentioned as "not preferred" in the Data Sync link Leon provided, in the list "When to use Data Sync").
For DW, Azure DB is a good option. Azure SQL Data Warehouse (known as Azure Synapse Analytics nowadays) is a heavy-duty beast for handling DW. Are you really sure you need this kind of system with < 1 TB of data? I'd personally leave Azure Synapse for now and try Azure DB first. It's a LOT cheaper and you can upgrade later if necessary.
One thing to note about Azure DB, though: Azure DB doesn't support cross-database queries. That's not a deal breaker, since everything can be handled in the same database. I personally use a schema to differentiate staging from the DW (and of course I use other schemas in the DW as well). It's not very difficult to use separate databases, of course, but the border between them is a lot deeper in Azure DB than in on-premises SQL Server or other Azure solutions (Managed Instance, for example).
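To illustrate that schema-based separation, a minimal sketch (the schema names stg and dw are just examples):

-- One Azure SQL database; separate schemas instead of separate databases
CREATE SCHEMA stg;   -- staging area for raw extracts
GO
CREATE SCHEMA dw;    -- the warehouse itself (facts and dimensions)
GO
-- The ETL then reads from stg tables and writes to dw tables in the same database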
SSIS is still an option, but the problem is what you use to run the packages. There are options like:
continue running them from on-premise (all the hard work is still done in the cloud)
rent a VM with SQL Server from Azure, deploy the packages to the VM and run them from VM
use Data Factory to run the SSIS packages
None of those is a perfect solution for every use case. The first two options come with quite a heavy cost if running SSIS is the only thing you need them for. Using Data Factory to run SSIS is a bit cumbersome at the moment, but it's an option anyway.
Data Factory itself is a good option as well (I haven't personally tried it, but I have heard good things about it). If you use Data Factory to run your SSIS packages, why not start using Data Factory without SSIS packages in the first place? Of course Data Factory has some limitations compared to SSIS, which might be a reason to stick with SSIS, but if your packages are simple enough, why not give Data Factory a try.
I would suggest using Azure SQL Database. It provides many price tiers with different storage sizes, so you can select the most suitable tier for you. Azure SQL Database also supports scaling up/down based on usage.
Ref: Service tiers in the DTU-based purchase model
And as you said, the data that you will be loading into staging will all be on-premises.
Azure SQL Database has a feature, Data Sync, that can help you do that:
Data Sync is useful in cases where data needs to be kept updated across several Azure SQL databases or SQL Server databases. Here are the main use cases for Data Sync:
Hybrid Data Synchronization: With Data Sync, you can keep data synchronized between your on-premises databases and Azure SQL databases to enable hybrid applications. This capability may appeal to customers who are considering moving to the cloud and would like to put some of their application in Azure.
Distributed Applications: In many cases, it's beneficial to separate different workloads across different databases. For example, if you have a large production database, but you also need to run a reporting or analytics workload on this data, it's helpful to have a second database for this additional workload. This approach minimizes the performance impact on your production workload. You can use Data Sync to keep these two databases synchronized.
Globally Distributed Applications: Many businesses span several regions and even several countries/regions. To minimize network latency, it's best to have your data in a region close to you. With Data Sync, you can easily keep databases in regions around the world synchronized.
When you create the SQL database, you can migrate the schema or data to Azure with many tools, such as the Data Migration Assistant (DMA).
Then set up SQL Data Sync between Azure SQL Database and SQL Server on-premises; it can sync the data automatically as often as every 5 minutes.
Hope this helps.
If you want to start on the less expensive options in Azure, go with a general purpose SQL database and an Azure Data Factory pipeline with a few activities.
Dynamic Resource Scaling ETL
You can scale up the database by issuing an ALTER DATABASE statement and then move on to your stored-procedure-based ETL. I would even use a "master" proc to call the dimension and fact procs to control the execution flow. Then scale down the database with another ALTER DATABASE statement. I even created my own stored proc to issue these scaling statements.
You also cannot predict when the scaling will be completed, so I have a wait activity. You could be a little more nerdy with a loop that checks the service objective property and then proceeds when it is complete, but it was just easier to wait for 10 minutes. I have only been burnt a couple of times when the scaling took longer. A sketch of the scaling statements follows the activity list below.
Data Pipeline Activities:
Scale up, proceed if successful
Wait about 10 minutes, proceed always
Execute the ETL, proceed always
Scale down
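A rough sketch of those scaling statements (the database name MyDwDb, the service objectives, and the ETL proc name are placeholders):

-- Scale up before the heavy ETL run
ALTER DATABASE [MyDwDb] MODIFY (SERVICE_OBJECTIVE = 'S4');

-- Poll this if you prefer a loop over a fixed wait; it reports the tier
-- currently in effect, so proceed once it shows the target objective
SELECT DATABASEPROPERTYEX('MyDwDb', 'ServiceObjective') AS CurrentObjective;

-- Run the ETL (e.g. EXEC dbo.MasterEtlProc), then scale back down
ALTER DATABASE [MyDwDb] MODIFY (SERVICE_OBJECTIVE = 'S1');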
Elastic Query
You can query across databases with vertically partitioned Elastic Query. Performance isn't great, and they don't recommend it for ETL, but it will work. To improve performance, try dumping any large table you need into a temp table and then transforming the data locally.
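For reference, a minimal vertical-partitioning Elastic Query setup might look like the following (server, database, credential, and table names are all hypothetical):

-- In the database that runs the ETL
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';
CREATE DATABASE SCOPED CREDENTIAL RemoteCred
  WITH IDENTITY = 'remote_user', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE RemoteStagingDb WITH (
  TYPE = RDBMS,
  LOCATION = 'myserver.database.windows.net',
  DATABASE_NAME = 'StagingDb',
  CREDENTIAL = RemoteCred
);

-- Mirrors the remote table's definition; no data is stored locally
CREATE EXTERNAL TABLE dbo.StagingOrders (
  OrderId    INT,
  CustomerId INT,
  Amount     DECIMAL(18, 2)
) WITH (DATA_SOURCE = RemoteStagingDb);

-- Pull the remote rows into a local temp table before transforming
SELECT * INTO #Orders FROM dbo.StagingOrders;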

db replication vs mirroring

Can anyone explain the differences between a replication DB and a mirroring DB server?
I have huge reports to run. I want to use a secondary database server to run my reports so I can offload work from the primary server.
Should I setup a replication server or a mirrored server and why?
For your requirements, replication is the way to go (assuming you're talking about transactional replication). As stated before, mirroring will "mirror" the whole database, but you won't be able to query it unless you create snapshots from it.
The good point of replication is that you can select which objects you will use and also filter them, and since the DB will be open you can delete info if it's not required (just be careful, as this can lead to problems maintaining the replication itself) or create specific indexes for the reports which are not needed in "production". I maintained this kind of solution for a long time with no issues.
(Assuming you are referring to Transactional Replication)
The biggest differences are: 1) Replication operates on an object-by-object basis whereas mirroring operates on an entire database. 2) You can't query a mirrored database directly - you have to create snapshots based on the mirrored copy.
In my opinion, mirroring is easier to maintain, but the constant creation of snapshots may prove to be a hassle.
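To illustrate the snapshot step on a mirrored setup, a snapshot of the mirror copy would be created with something like the following (database, logical file, and path names are hypothetical; the logical name must match the source database's data file):

-- Run on the mirror server; report queries hit the snapshot, not the mirror itself
CREATE DATABASE ReportsDb_Snap_20240101
ON ( NAME = ReportsDb_Data, FILENAME = 'D:\Snapshots\ReportsDb_Snap_20240101.ss' )
AS SNAPSHOT OF ReportsDb;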
As mentioned here
Database mirroring and database replication are two high data availability techniques for database servers. In replication, data and database objects are copied and distributed from one database to another. It reduces the load from the original database server, and all the servers on which the database was copied are as active as the master server. On the other hand, database mirroring creates copies of a database in two different server instances (principal and mirror). These mirror copies work as standby copies and are not always active like in the case of data replication.
This question can also be helpful, or have a look at the MS documentation.

Postgres daily data dump and hydration clogs disk space?

I take a daily dump from my production environment by doing:
pg_dump <database name> > dump_<date>.sql
then I transfer this over to staging and import it into the staging DB by first dropping the public schema:
drop schema public cascade;
create schema public;
and then doing the following:
psql <database name> < dump_<date>.sql
However, it seems like the staging DB is getting unusually bigger and bigger every day. At this point, even after I drop the tables and data, 150 GB of space is taken up in the DB folder.
It feels like something like logs or metadata is clogging the folders.
What's the proper way to do this, or is there a good way to clean up this extra data other than deleting the DB and re-initializing it every time?
Thanks!
There is a better way, a much much better way.
https://www.postgresql.org/docs/9.5/static/high-availability.html
Database servers can work together to allow a second server to take over quickly if the primary server fails (high availability), or to allow several computers to serve the same data (load balancing). Ideally, database servers could work together seamlessly. Web servers serving static web pages can be combined quite easily by merely load-balancing web requests to multiple machines. In fact, read-only database servers can be combined relatively easily too. Unfortunately, most database servers have a read/write mix of requests, and read/write servers are much harder to combine. This is because though read-only data needs to be placed on each server only once, a write to any server has to be propagated to all servers so that future read requests to those servers return consistent results.
When you read the documentation it seems very intimidating at first. However, in reality all you need to do is take one base backup of the entire cluster (e.g. with pg_basebackup), enable WAL archiving in postgresql.conf, and then copy the WAL archive files daily, weekly, or monthly to another server.
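A minimal sketch of enabling WAL archiving from SQL (the archive path is hypothetical; the same settings can be put directly in postgresql.conf):

ALTER SYSTEM SET wal_level = 'replica';   -- use 'archive' or 'hot_standby' on PostgreSQL 9.5 and older
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/wal_archive/%f';
-- wal_level and archive_mode only take effect after a server restart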

What is the DBA function/tool called for keeping many remote DBs synchronized with a main DB

I am a front-end developer being asked to fulfil some DBA tasks. Uncharted waters.
My client has 10 remote (off network) data collection terminals hosting a PostgreSQL application. My task is to take the .backup or .sql files those terminals generate and add them to the main DB. The schema for all of these DBs will match. But the merge operation will lead to many duplicates. I am looking for a tool that can add a backup file to an existing DB, filter out duplicates, and provide a report on the merge.
Is there a term for this kind of operation in the DBA domain?
Is this function normally built into basic DB admin suites (e.g. pgAdmin III), are enterprise-level tools required, or is this something that can be done on the command-line easily enough?
Update
PostgreSQL articles on DB replication here and glossary.
You can't "merge a bunch of tables" but you could use Slony to replicate child tables (i.e. one partition per location) back to a master db.
This is not an out of the box solution but with something like Bucardo or Slony it can be done, albeit with a fair bit of work and added maintenance.
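If instead you restore each terminal's backup into a staging schema and merge by hand, a dedup-on-insert sketch in PostgreSQL (9.5+) might look like this; the table, column, and schema names are hypothetical, and it assumes a unique constraint on the key columns:

-- Insert only rows not already present, and count how many were merged
WITH merged AS (
  INSERT INTO main.readings (terminal_id, reading_ts, value)
  SELECT terminal_id, reading_ts, value
  FROM staging.readings
  ON CONFLICT (terminal_id, reading_ts) DO NOTHING
  RETURNING 1
)
SELECT count(*) AS rows_inserted FROM merged;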

create data patch for database (synchronize databases)

There are 2 databases: "temp" and "production". Each night the production database should be "synchronized" so it has exactly the same data as "temp". The databases are several GB in size, and just copying all the data is not an option. But the changes are usually quite small: ~100 rows added, ~1000 rows updated, and some removed, about 5-50 MB per day.
I was thinking maybe there is some tool (preferably free) that could go through both databases and create a patch that could be applied to the "production" database, or, as an option, just "synchronize" both databases. It should also be quite fast. In other words, something like rsync for data in databases.
A solution for a specific database (MySQL, H2, DB2, etc.) would also be fine.
PS: the structure is guaranteed to be the same, so this question is only about transferring data.
Finally, I found a way to do it in Kettle (PDI):
http://wiki.pentaho.com/display/EAI/Synchronize+after+merge
Only one con: I need to create such a transformation for each table separately.
Why not set up database replication from the temp database to your production database, where temp acts as the master and production as a slave? Here is a link for setting up replication in MySQL. MSSQL supports database replication as well. Google should turn up many tutorials.
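A rough MySQL sketch of that setup, using the classic (pre-8.0.22) statements; host, user, and log file names are hypothetical, and it assumes binary logging (log_bin) and a unique server-id are already configured on the master:

-- On the master ("temp"): create a user the replica can connect as
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

-- On the slave ("production"): point it at the master and start replicating
CHANGE MASTER TO
  MASTER_HOST = 'temp-db.example.com',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',   -- from SHOW MASTER STATUS on the master
  MASTER_LOG_POS = 4;
START SLAVE;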
