Trying to query a large dataset from Athena using AWS Data Wrangler. The query fails for large datasets. This is for setting up a Data Wrangler pipeline using the UI in SageMaker Studio, trying to add an Athena source.
Some observations:
Small Athena queries work.
The same dataset is read successfully from S3 after querying it with Athena.
First I get a warning in the UI saying the query is taking longer than usual, then a failure message with no specific reason. There is no useful message in the CloudFormation logs either.
The same query completes directly in Athena in around 30 minutes.
Has anyone encountered a similar problem? Are there any timeout settings for Data Wrangler?
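If the goal is mainly to pull the large result into Python (or to check whether the timeout is specific to the Studio UI), one option is to run the same query through the standalone awswrangler library instead. A minimal sketch, assuming placeholder database, table, and staging-bucket names:

```python
import awswrangler as wr

# Placeholder query, database and staging bucket -- replace with your own.
SQL = "SELECT * FROM my_db.my_large_table WHERE dt >= '2023-01-01'"

# ctas_approach=True makes Athena materialise the result as Parquet via CTAS,
# which is usually faster and more robust for large result sets than paging
# through the default CSV output. chunksize returns an iterator of DataFrames
# so the whole result never has to fit in memory at once.
chunks = wr.athena.read_sql_query(
    sql=SQL,
    database="my_db",
    ctas_approach=True,
    s3_output="s3://my-athena-staging-bucket/results/",
    chunksize=1_000_000,
)

for df in chunks:
    # Placeholder for whatever downstream step consumes the data.
    print(df.shape)
```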
Related
We have our datalake in AWS S3.
Metadata is in Hive, and we have a small running cluster (we haven't used Athena/Glue).
We use Spark and Presto in our Airflow pipelines.
The processed data gets dumped into Snowflake.
The datalake has various formats but is mostly Parquet.
We want to experiment with Databricks. Our plan is to:
Create Delta Lake tables instead of Hive ones for the entire datalake.
Use Databricks for processing and warehousing for a significant part of the data.
We cannot replace Snowflake with Databricks, at least at this moment.
So we need the Delta Lake tables to be usable by other Spark pipelines as well.
Is this last step possible without challenges, or is it tricky?
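For the Spark side specifically, Delta tables written by Databricks are Parquet files plus a `_delta_log` directory, so open-source Spark jobs can read and write them as long as the delta-spark package is on the classpath. A rough sketch, assuming Spark 3.x with delta-spark installed and placeholder S3 paths:

```python
from pyspark.sql import SparkSession

# Assumes the open-source delta-spark package (pip install delta-spark) on a
# Spark 3.x cluster; the S3 paths below are placeholders.
spark = (
    SparkSession.builder
    .appName("read-delta-outside-databricks")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read a Delta table produced by Databricks directly from the lake path.
df = spark.read.format("delta").load("s3://my-datalake/events/")

# Downstream pipelines can keep writing Delta the same way.
df.filter("event_date >= '2023-01-01'") \
  .write.format("delta").mode("append").save("s3://my-datalake/events_filtered/")
```

The part that tends to need more care is the metastore and Presto side rather than Spark itself, e.g. registering the Delta tables or generating manifest files so Presto can read them.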
I am using SQL Server RDS as the source database and Apache Kafka as the target in AWS DMS. I want to receive both data and control records for every CDC change made in the source database, but I am only getting data records for CRUD commands and control records for DDL commands. I went through the AWS DMS documentation but couldn't find anything relevant.
Is it possible to get both the control and data records in the Kafka topic?
It is not possible to get both the control and data records using AWS DMS.
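For completeness, the Kafka target endpoint does expose settings that control how much extra detail DMS attaches to the records it already emits, but they do not change which record types (data vs. control) are produced for a given source change. A sketch of toggling them with boto3; the endpoint ARN is a placeholder:

```python
import boto3

dms = boto3.client("dms")

# These Kafka endpoint settings add metadata to the messages DMS already
# produces; they do not make DMS emit both a data and a control record for
# the same change. The endpoint ARN is a placeholder.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE",
    KafkaSettings={
        "IncludeControlDetails": True,        # extra table/column definition detail
        "IncludeTransactionDetails": True,    # commit timestamps, transaction ids
        "IncludeTableAlterOperations": True,  # flag ALTER operations in control records
        "IncludePartitionValue": True,        # add a partition-value field to the output
    },
)
```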
I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the easiest, most logical, and most robust way. We have chosen AWS as our cloud provider, but since we're in the beginning phases we are not attached to any particular architecture or services. Because I'm no expert with the cloud or AWS, I thought I'd post my thinking for how we can accomplish our goal and see if anyone has any advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.
1) Gather data from multiple sources (using APIs)
2) Dump responses from APIs into S3 buckets
3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets
4) Use Athena to query summaries of the data in S3
5) Store data summaries obtained from Athena queries in on-premises SQL Server
Note: We will program the entire data pipeline in Python (which seems like a good call and should stay easy no matter which AWS services we use, as boto3 is pretty awesome from what I've seen so far).
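For orientation, a minimal sketch of what steps 4 and 5 could look like in plain Python with boto3, pandas, and SQLAlchemy; the query, database, bucket, table, and connection details are all placeholders:

```python
import time

import boto3
import pandas as pd
from sqlalchemy import create_engine

athena = boto3.client("athena")

# Step 4: run a summary query against the Glue Data Catalog table and wait for it.
qid = athena.start_query_execution(
    QueryString="SELECT api_name, dt, COUNT(*) AS calls FROM raw_events GROUP BY 1, 2",
    QueryExecutionContext={"Database": "raw_data"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

# Step 5: Athena writes its result as a CSV in the output location; read it
# (pandas handles s3:// paths when s3fs is installed) and push it into the
# on-prem SQL Server. The DSN below is a placeholder.
engine = create_engine("mssql+pyodbc://etl_user:***@onprem_dsn")
summary = pd.read_csv(f"s3://my-athena-results/{qid}.csv")
summary.to_sql("api_call_summary", con=engine, if_exists="append", index=False)
```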
You can use Glue jobs (PySpark) for #4 and #5. You can automate the flow using Glue triggers.
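A rough shape of such a Glue (PySpark) job, assuming the catalog database/table names and JDBC connection details are placeholders, the SQL Server JDBC driver is available to the job, and the job has network connectivity to the on-prem server:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw data through the Glue Data Catalog (names are placeholders).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data", table_name="api_responses"
)

# Summarise with Spark SQL instead of Athena.
raw.toDF().createOrReplaceTempView("api_responses")
summary = glue_context.spark_session.sql(
    "SELECT api_name, dt, COUNT(*) AS calls FROM api_responses GROUP BY api_name, dt"
)

# Write the summary to the on-prem SQL Server over JDBC (details are placeholders).
summary.write.format("jdbc").options(
    url="jdbc:sqlserver://onprem-host:1433;databaseName=reporting",
    dbtable="dbo.api_call_summary",
    user="etl_user",
    password="***",
).mode("append").save()

job.commit()
```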
I am working on a project where I need to display an MSSQL Server database's performance metrics, for example memory consumed/free, free storage space, etc. I researched this and one thing that came up was DogStatsD.
Datadog provides a library for .NET projects to collect custom metrics, but that is not a solution for me because those metrics only show up on the Datadog website. I have to display all the data received from MSSQL Server myself (in graphs or whatever is suited). There will be multiple servers/instances.
Is there a way to do that? Our web app is connected to multiple databases and we want to receive and display this information.
I cannot use off-the-shelf tools for these insights.
You can easily get all the data you need by querying DMVs and other resources inside SQL Server. A good start is here.
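For example, `sys.dm_os_sys_memory` covers the instance's memory picture and `sys.dm_os_volume_stats` gives free space on the volumes holding the database files. A small Python/pyodbc sketch (the same queries work from the existing .NET web app; the connection string is a placeholder):

```python
import pyodbc

# Placeholder connection string -- point it at each monitored server/instance.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-sql-host;"
    "DATABASE=master;UID=monitor_user;PWD=***"
)

# Memory picture for the instance.
memory_sql = """
SELECT total_physical_memory_kb / 1024      AS total_memory_mb,
       available_physical_memory_kb / 1024  AS available_memory_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
"""

# Free space on the volumes that hold each database file.
disk_sql = """
SELECT DISTINCT vs.volume_mount_point,
       vs.total_bytes / 1048576      AS total_mb,
       vs.available_bytes / 1048576  AS free_mb
FROM sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;
"""

cursor = conn.cursor()
for label, sql in [("memory", memory_sql), ("disk", disk_sql)]:
    print(label, cursor.execute(sql).fetchall())
```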
We have regression tests using Selenium that pick a value from a web page as the actual result. The data in the web UI is sourced from Elasticsearch.
The regression tests that have been written compare this UI value directly with the original data in SQL Server, before the transfer to Elasticsearch.
My question is:
Should the regression test look for the expected result in SQL Server or Elasticsearch?
If we pull the expected data from SQL Server, then we are including the SQL-to-Elasticsearch data transfer processing in the test.
If we pull the expected data from Elasticsearch, we are only testing the UI and the layers down to Elasticsearch, but not the DB --> Elasticsearch configuration.
I can see the benefits of both methods. Any thoughts?
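One way to keep both benefits is to split the check into two assertions instead of choosing a single oracle. A pytest-style sketch; `query_sql_server`, `query_elastic`, and `read_ui_value` are hypothetical stand-ins for whatever helpers the existing test framework provides:

```python
# Hypothetical helpers -- query_sql_server, query_elastic and read_ui_value are
# stand-ins for the project's own data-access and Selenium wrappers.

def test_sql_to_elastic_transfer():
    # Covers the DB --> Elasticsearch pipeline without involving the UI.
    expected = query_sql_server("SELECT balance FROM accounts WHERE id = 42")
    actual = query_elastic(index="accounts", doc_id=42, field="balance")
    assert actual == expected


def test_ui_displays_elastic_value():
    # Covers the UI and the layers down to Elasticsearch, with Elasticsearch as the oracle.
    expected = query_elastic(index="accounts", doc_id=42, field="balance")
    actual = read_ui_value(page="account-detail", element="balance")
    assert actual == expected
```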