Looking for a way to automate pg_dumps of all logical databases on all of our RDS instances and then send those dumps to S3. I see AWS Batch as a possible solution, but that involves a lot of services, like Lambda, that I do not have a background in. Any advice would be appreciated.
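One lighter-weight option than Batch might be a single scheduled script (cron on a small EC2 host, for example). A minimal sketch, assuming pg_dump is installed on that host, the password comes from ~/.pgpass or PGPASSWORD, and `s3_client` is `boto3.client("s3")`. The endpoint names, database names, bucket and key layout are hypothetical placeholders, not anything from the question:

```python
import datetime
import subprocess


def dump_command(host, database, user, out_path, port=5432):
    """Build a pg_dump invocation that writes a custom-format archive."""
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",
        "--file", out_path,
        database,
    ]


def s3_key(host, database, day):
    """Lay dumps out as <host>/<database>/<YYYY-MM-DD>.dump (an assumed layout)."""
    return f"{host}/{database}/{day.isoformat()}.dump"


def backup_all(instances, user, bucket, s3_client, workdir="/tmp"):
    """Dump every database on every instance and upload each file to S3.

    `instances` maps an RDS endpoint to a list of database names.
    """
    today = datetime.date.today()
    for host, databases in instances.items():
        for database in databases:
            out_path = f"{workdir}/{host}_{database}.dump"
            # check=True stops the run if pg_dump exits with an error
            subprocess.run(dump_command(host, database, user, out_path), check=True)
            s3_client.upload_file(out_path, bucket, s3_key(host, database, today))
```

The list of databases per instance is passed in by hand here; it could also be read from each instance's pg_database catalog.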
Related
I have some tables in three databases whose data I want to copy to another database in an automated way, and the data is quite large. My servers are running on AWS. What is the simplest and most reliable way to do so?
Edit
I want them to stay in sync (I am automating this process as a DevOps engineer).
The databases are all MySQL and everything moves between AWS EC2 instances. The data is between 100 GiB and 200 GiB.
Currently, Maxwell takes the data from the tables and moves it to Kafka, and then a script written in Java feeds the other database.
I believe you can use AWS Database Migration Service (DMS) to replicate tables from each source into a single target. You would have a single target endpoint and three source endpoints. You would have three replication tasks that would take data from each source and put it into your target. DMS can keep data in sync via ongoing replication. Be sure to read up on the documentation before proceeding as it isn't the most intuitive service to use, but it should be able to do what you are asking.
https://docs.aws.amazon.com/dms/latest/userguide/Welcome.html
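To make the "three tasks, one target" layout a bit more concrete, here is a rough sketch that only builds the arguments for boto3's `create_replication_task`. It assumes the endpoints and the replication instance already exist; every ARN, schema name and table name you pass in is your own, and the rule names are made up for illustration:

```python
import json


def table_mappings(schema, tables):
    """Build a DMS table-mapping document that includes only the given tables."""
    rules = []
    for index, table in enumerate(tables, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(index),
            "rule-name": f"include-{table}",  # hypothetical naming scheme
            "object-locator": {"schema-name": schema, "table-name": table},
            "rule-action": "include",
        })
    return json.dumps({"rules": rules})


def task_settings(name, source_arn, target_arn, instance_arn, schema, tables):
    """Keyword arguments for one replication task (one per source database)."""
    return {
        "ReplicationTaskIdentifier": name,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # full load first, then ongoing replication to keep the target in sync
        "MigrationType": "full-load-and-cdc",
        "TableMappings": table_mappings(schema, tables),
    }
```

Each of the three sources would get its own call, along the lines of `boto3.client("dms").create_replication_task(**task_settings(...))`, all pointing at the same target endpoint ARN.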
To stage data in Snowflake, we need an AWS S3 layer, Azure, or a local machine. Instead, can we FTP a file from a source team directly to Snowflake internal storage, so that Snowpipe can pick up the file from there and load it into our Snowflake table?
If yes, please tell me how. If no, please confirm that as well. If no, isn't it a big drawback of Snowflake to depend on other platforms every time?
You can use just about any Snowflake driver or client to move files to an internal stage on Snowflake: ODBC, JDBC, Python, SnowSQL, etc. FTP isn't a very common protocol in the cloud, though. Snowflake has a lot of customers without any presence on AWS, Azure, or GCP that are using Snowflake without issues in this manner.
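As a rough illustration of the driver route, the upload itself is a PUT command run through whichever client you pick. A minimal sketch, assuming a stage already exists; the stage name and file path are whatever you pass in, and loading from the stage (COPY INTO or Snowpipe) is a separate step not shown here:

```python
def put_statement(local_path, stage, auto_compress=True):
    """Build a Snowflake PUT command that uploads a local file to a stage."""
    compress = "TRUE" if auto_compress else "FALSE"
    return f"PUT file://{local_path} @{stage} AUTO_COMPRESS={compress}"


def upload(cursor, local_path, stage):
    """Run the PUT through a DB-API style cursor, e.g. one from snowflake.connector."""
    cursor.execute(put_statement(local_path, stage))
```

With the Python connector, `cursor` would come from a `snowflake.connector` connection; the same statement can also be typed into SnowSQL.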
I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the easiest, most logical and most robust way. We have chosen AWS as our cloud provider, but since we're in the beginning phases we are not attached to any particular architecture or services. Because I'm no expert with the cloud or AWS, I thought I'd post my thoughts on how we can accomplish our goal and see if anyone has any advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.
1) Gather data from multiple sources (using APIs)
2) Dump responses from APIs into S3 buckets
3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets
4) Use Athena to query summaries of the data in S3
5) Store data summaries obtained from Athena queries in on-premises SQL Server
Note: We will program the entire data pipeline using Python (which seems like a good call and easy no matter what AWS services we utilize as boto3 is pretty awesome from what I've seen thus far).
You may use Glue jobs (PySpark) for #4 and #5. You may automate the flow using Glue triggers.
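If you stay with Athena for step 4 rather than moving to Glue jobs, the boto3 side is roughly "start the query, poll until it finishes, then read the result". A minimal sketch, assuming `athena` is `boto3.client("athena")` and that the database name and S3 output location are supplied by you; the function name is just for illustration:

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, wait for it to finish, and return its execution id."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        # still QUEUED or RUNNING, so wait before asking again
        time.sleep(poll_seconds)
```

Step 5 would then read the summary rows (for example with `get_query_results`, or from the CSV Athena writes to the output location) and insert them into SQL Server with whatever driver you already use.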
We are planning to implement a project in the Azure cloud where data storage will be Azure Data Lake for now; in future HDP will be implemented and ADLS will be the extended datanode. From ADLS we want to expose data for dashboard creation using Tableau. The initial plan was to use Hive, with Tableau connecting to the data through Hive. But here comes the performance issue:
There will be multiple users (100+) who will access the data through Tableau.
We will also have to expose data to a different portal through API calls.
This means many connections will be established at the same time, all hitting Hive. My questions are:
Can Hive serve this purpose with minimal latency?
How can I measure the performance?
I don't want my users to run a query in Tableau and then sit and wait a long time to see the dashboard.
Would you please share your experience with this design issue? Should we use Hive, or should we use some other tool that performs better with Tableau and HDFS storage? Someone suggested that I use Azure SQL Server and connect Tableau to it. But that is the old-fashioned approach, and cost is also a concern, as the price is tied to the execution of each query.
If you have experience with a better solution, please share; it would be greatly appreciated.
Thanks in advance.
Hive LLAP could work, if you can get it installed.
Otherwise, at my work, we've had good experience with PrestoDB and Tableau on S3 data.
Some teams use Spark SQL, and you can set up a Spark Thrift Server, which should be compatible with the Hive JDBC/ODBC drivers.
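On the "how can I measure the performance" part: whichever engine you try, one simple approach is to replay a typical dashboard query from many threads at once and look at the latency spread. A minimal sketch, where `run_query` is a hypothetical callable you supply that executes one query against Hive, Presto or Spark SQL through your own driver:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(run_query):
    """Return how many seconds one call to run_query takes."""
    start = time.perf_counter()
    run_query()
    return time.perf_counter() - start


def load_test(run_query, concurrent_users, queries_per_user=1):
    """Run queries from many threads at once and summarise the latencies."""
    total = concurrent_users * queries_per_user
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(lambda _: timed_call(run_query), range(total)))
    latencies.sort()
    # nearest-rank style index for the 95th percentile
    p95_index = max(0, round(0.95 * len(latencies)) - 1)
    return {
        "count": len(latencies),
        "median_s": statistics.median(latencies),
        "p95_s": latencies[p95_index],
        "max_s": latencies[-1],
    }
```

Running it with `concurrent_users=100` against each candidate engine gives a like-for-like comparison, though a real Tableau workload mixes many different queries, so treat the numbers as a rough guide only.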
I'm aware of the various options in place for migrating a single database up to Azure. My problem is that these all only seem to cater for a single database at a time. However, I have a db per tenant model with over 2000 databases to migrate and not a lot of time to play with.
Can anyone point me towards the best (i.e. fastest) way of doing this?
In the end we accomplished this with PowerShell and the Azure API: essentially batch-creating bacpacs on the source server, uploading them to blob storage, then importing them into Azure SQL Server pools.
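The export half of that batch can be sketched in a few lines. This is not the original PowerShell; it is a Python illustration that assumes the SqlPackage command-line tool is on the PATH and that the source server accepts your Windows login. Server name, database names and output directory are placeholders you supply:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def export_command(server, database, out_dir):
    """Build a SqlPackage command line that exports one database to a .bacpac."""
    return [
        "SqlPackage",
        "/Action:Export",
        f"/SourceServerName:{server}",
        f"/SourceDatabaseName:{database}",
        f"/TargetFile:{out_dir}/{database}.bacpac",
    ]


def export_all(server, databases, out_dir, parallel=4):
    """Run the exports a few at a time and return each database's exit code.

    `databases` should be a list, since it is iterated twice.
    """
    commands = [export_command(server, db, out_dir) for db in databases]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        codes = pool.map(lambda cmd: subprocess.run(cmd).returncode, commands)
        return dict(zip(databases, codes))
```

Any database with a non-zero exit code can be retried on its own, which matters when there are a couple of thousand of them.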
If I was facing the same challenge now I'd take a look at the Azure Database Migration Service - https://azure.microsoft.com/en-gb/services/database-migration/
I am also facing this problem and am going down the route of using the Visual Studio data compare tool.
All my tenant databases have the same schema, so I made an empty template database in Azure and just use the CREATE DATABASE ... AS COPY OF command to make a new one each time, ready to receive the migration.
Then I ask Visual Studio to compare the empty database with the live database and automatically insert the data for me.
Seems to be working well so far; there are very few manual steps needed, and it doesn't involve using the Azure Portal or blob storage, or creating databases outside of the elastic pool, which is great. But the overall time to migrate data for all the databases will be slow.
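For anyone scripting the template-copy step across many tenants, the statements can be generated rather than typed. A minimal sketch, assuming the copies should land straight in an elastic pool; the database, template and pool names are hypothetical and come from your own tenant list:

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def copy_statement(new_db, template_db, pool):
    """T-SQL that clones template_db into new_db inside an elastic pool."""
    return (
        f"CREATE DATABASE {quote_name(new_db)} "
        f"AS COPY OF {quote_name(template_db)} "
        f"(SERVICE_OBJECTIVE = ELASTIC_POOL(name = {quote_name(pool)}));"
    )


def copy_script(tenant_names, template_db, pool):
    """One CREATE DATABASE statement per tenant, in input order."""
    return [copy_statement(name, template_db, pool) for name in tenant_names]
```

The statements would be run against the master database of the target Azure SQL server; the data compare step then fills each copy.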