Has anyone here tried pulling data from MaestroQA to Snowflake?
There is a way to push from MaestroQA to Snowflake, but I'm wondering if there's a way to go in the other direction, i.e. to pull MaestroQA data from the Snowflake side, without using any APIs.
In addition, I'm trying to find a way to automate this.
I looked for documentation and any threads online, but couldn't find any.
Below are the documents/links I have seen so far, but they describe MaestroQA pushing data to Snowflake.
https://help.maestroqa.com/en/articles/1982484-data-warehouse-table-overview
https://help.maestroqa.com/en/articles/1557390-push-qa-data-to-your-data-warehouse
Snowflake can only load data from its internal/external stages; it has no capability to reach out and pull data from an external source on its own.
You'll either need to use a tool with ETL capabilities or write your own process in, for example, Python.
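For the "write your own process" route, a minimal sketch with the Snowflake Python connector (snowflake-connector-python) might look like the following. All names here (account, stage, table, file path) are hypothetical, and it assumes you already have an export file on disk - e.g. one produced by MaestroQA's own push/export - since Snowflake cannot reach out to MaestroQA without some API or export step in between:

import snowflake.connector

# Sketch of a hand-rolled load: upload a local export file to an internal
# stage, then COPY it into a table. All object names are placeholders.
conn = snowflake.connector.connect(
    account="my_account",       # hypothetical account identifier
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="MAESTROQA",
)
cur = conn.cursor()

# Upload the local file to a named internal stage (PUT compresses it to .gz by default)...
cur.execute("PUT file:///tmp/maestroqa_export.csv @maestroqa_stage AUTO_COMPRESS=TRUE")

# ...then load it from the stage into the target table.
cur.execute("""
    COPY INTO qa_scores
    FROM @maestroqa_stage/maestroqa_export.csv.gz
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

cur.close()
conn.close()

A script like this can then be scheduled with cron, Airflow, or whatever orchestrator you already have, which covers the automation part of the question.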
Related
I have source tables in Snowflake and Destination tables in Snowflake.
I need to load data from source to destination using ADF.
Requirement: I need to load data using a single pipeline for all the tables.
E.g.: suppose I have 40 tables in the source and need to load all 40 tables' data into the destination tables. I need to create a single pipeline that loads all the tables at once.
Can anyone help me in achieving this?
Thanks,
P.
This is a fairly broad question. So take this all as general thoughts, more than specific advice.
Feel free to ask more specific questions, and I'll try to update/expand on this.
ADF is useful as an orchestration/monitoring process, but it can be tricky to manage the actual copying and maneuvering of data in Snowflake with it. My high-level recommendation is to write your logic and loading code in Snowflake stored procedures;
then you can use ADF to orchestrate by simply calling those stored procedures. You get the benefits of using ADF for what it is good at, and you allow Snowflake to do the heavy lifting, which is what it is good at.
Hopefully you'd be able to parameterize the procedures so that you can have one procedure (or a few) that takes a table name and dynamically figures out column names and the like to run your loading process.
Assorted Notes on implementation.
ADF does have a native Snowflake connector. It is fairly new, so a lot of online posts will tell you how to set up a custom ODBC connector. You don't need to do this. Use the native connector with the auto-resolve integration runtime and it should work for you.
You can write a query in an ADF lookup activity to output your list of tables, along with any needed parameters (like primary key, order by column, procedure name to call, etc.), then feed that list into an ADF foreach loop.
foreach loops are a little limited in that there are some things you can't nest inside of a loop (like conditionals). If you need extra functionality, you can have the foreach loop call a child ADF pipeline (passing in those parameters) and have the child pipeline manage your table-processing logic.
Snowflake has pretty good options for querying metadata based on a table name. See INFORMATION_SCHEMA. Between that and just a tiny bit of JavaScript logic, it's not too bad to generate dynamic queries (e.g. with column names specific to a provided table name) - see the sketch after these notes for a rough illustration.
If you do want to use ADF's Copy activities, I think you'll need to set up an intermediary Azure Storage account connection. I believe this is because it uses COPY INTO under the hood, which requires using external storage.
ADF doesn't have many good options for preventing the same pipeline from running multiple times at once. Either be careful to make sure your code can handle edge cases like this, or make sure your scheduling/timeouts don't allow the scenario of a pipeline running too long and overlapping with the next run.
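To make the INFORMATION_SCHEMA note above concrete, here is a rough sketch of the pattern in Python via the connector; the equivalent logic is what you would put into a JavaScript or Snowpark stored procedure so that ADF only has to call it with a table name. The database/schema names, the STAGING source schema, and the plain INSERT ... SELECT load are all assumptions for illustration:

import snowflake.connector

def build_insert_for_table(cur, database, schema, table_name):
    # Look up the column names for this table in INFORMATION_SCHEMA and
    # build a dynamic INSERT ... SELECT from a staging schema into the
    # destination table. All object names here are placeholders.
    cur.execute(
        f"""
        SELECT column_name
        FROM {database}.INFORMATION_SCHEMA.COLUMNS
        WHERE table_schema = %s AND table_name = %s
        ORDER BY ordinal_position
        """,
        (schema, table_name),
    )
    column_list = ", ".join(row[0] for row in cur.fetchall())
    return (
        f"INSERT INTO {database}.{schema}.{table_name} ({column_list}) "
        f"SELECT {column_list} FROM {database}.STAGING.{table_name}"
    )

conn = snowflake.connector.connect(account="my_account", user="my_user",
                                    password="my_password", warehouse="LOAD_WH")
cur = conn.cursor()
cur.execute(build_insert_for_table(cur, "ANALYTICS", "PUBLIC", "CUSTOMERS"))

In a real pipeline you would wrap this kind of logic in a stored procedure and have the ADF foreach loop pass in the table name (plus any other parameters from the lookup activity).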
Extra note:
I don't know how tied you are to ADF, but without more context, I might suggest a quick look into DBT for this use case. It's a great tool for this specific scenario of Snowflake to Snowflake processing/transforming. My team's been much happier since moving some of our projects from ADF to DBT. (not sponsored :P )
My last couple of questions have been about how to connect to Snowflake and add and read data with the Python connector in an IPython notebook. However, I am having trouble with the next best step to create a report with the data I seek to visualize.
I would like to upload all of the data, store it, then analyze it, kind of like a homemade dashboard.
So what I have done so far is a small version:
Staged my data from a local file, and I will add new data each time I open the notebook
Then I will use the Python connector to pull any data from storage
Create visualizations with NumPy objects in the local notebook.
My data will start out very small, but over time I would imagine I would have to move computation to the cloud to minimize the memory used locally for the small dashboard.
My question is: my data comes from an API that returns JSON files, and the new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls to the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
Put the raw data into Snowflake. Use tasks and procedures to aggregate it and store the result. Or better yet, don't do any aggregations except for when you want the data - let Snowflake do the aggregations in real-time off the raw data.
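A rough sketch of the tasks-and-procedures route, run here through the Python connector (object names, the CRON schedule and the summary query are all invented for illustration; you could just as easily create the task from a worksheet):

import snowflake.connector

# Create a scheduled Snowflake TASK that rolls the raw data up into a
# summary table once a month. Table, warehouse and column names are
# placeholders; a real task would usually MERGE or replace the current
# month instead of blindly appending.
conn = snowflake.connector.connect(account="my_account", user="my_user",
                                    password="my_password")
cur = conn.cursor()

cur.execute("""
    CREATE OR REPLACE TASK monthly_rollup
      WAREHOUSE = transform_wh
      SCHEDULE = 'USING CRON 0 2 1 * * UTC'   -- 02:00 UTC on the 1st of each month
    AS
      INSERT INTO monthly_summary
      SELECT DATE_TRUNC('month', event_ts) AS month, SUM(metric) AS total
      FROM raw_events
      GROUP BY 1
""")

cur.execute("ALTER TASK monthly_rollup RESUME")  # tasks are created suspended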
I think what you might be asking is whether you should ETL your data or ELT your data:
ETL: Extract, Transform, Load (in that order) - Extract data from your API. Transform it locally on your computer. Load it into Snowflake.
ELT: Extract, Load, Transform (in that order) - Extract data from your API. Load it into Snowflake. Transform it after it's in Snowflake.
Both ETL and ELT are valid. Many companies use both approaches with Snowflake interchangeably. But Snowflake was built to kind of be your data lake - the idea being, "Just throw all your data up here and then use our awesome compute and storage resources to transform it quickly and easily."
Do a Google search on "Snowflake ELT" or "ELT vs ETL" for more information.
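To make the ELT path concrete, here is a minimal sketch with the Python connector, assuming your API returns a JSON array (the URL, table and stage are invented); the "T" then happens entirely in Snowflake SQL against the VARIANT column:

import json
import requests
import snowflake.connector

# Extract: pull raw JSON from the API and write it out as newline-delimited JSON.
records = requests.get("https://api.example.com/daily-export").json()
with open("/tmp/daily.json", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Load: land the raw documents untouched in a VARIANT column via the table stage.
conn = snowflake.connector.connect(account="my_account", user="my_user",
                                    password="my_password", warehouse="LOAD_WH",
                                    database="ANALYTICS", schema="RAW")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
cur.execute("PUT file:///tmp/daily.json @%raw_events")
cur.execute("COPY INTO raw_events FROM @%raw_events FILE_FORMAT = (TYPE = JSON)")

# Transform: from here on, flattening, casts and aggregates are plain SQL in Snowflake.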
Here are some considerations either way off the top of my head:
Tools you're using: Some tools like SSIS were built w/ ETL in mind - transformation of the data before you store it in your warehouse. That's not to say you can't ELT, but it wasn't built w/ ELT in mind. More modern tools - like Fivetran or even Snowpipe - assume you're going to aggregate all your data into Snowflake, and then transform it once it's up there. I really like the ELT paradigm - i.e. just get your data into the cloud, then transform it quickly once it's up there.
Size and growth of your data: If your data is growing, it becomes harder and harder to manage it on local resources. It might not matter when your data is in gigabytes or millions of rows. But as you get into billions of rows or terabytes of data, the scalability of the cloud can't be matched. If you feel like this might happen and you think putting it into the cloud isn't a premature optimization, I'd load your raw data into Snowflake and transform it after it's up there.
Compute and Storage Capacity: Maybe you have a massive amount of storage and compute at your fingertips. Maybe you have an on-prem cluster you can provision resources from at the drop of a hat. Most people don't have that.
Short-Term Compute and Storage Cost: Maybe you have some modest resources you can use today and you'd rather not pay Snowflake while your modest resources can do the job. Having said that, it sounds like the compute to transform this data will be pretty minimal, and you'll only be doing it once a day or once a month. If that's the case, the compute cost will be very minimal.
Data Security or Privacy: Maybe you have a need to anonymize data before moving it to the public cloud. If this is important to you, you should look into Snowflake's security features, but if you're in an organization where it's super difficult to get a security review and you need to move forward with something, transforming it on-prem while waiting for the security review is a good alternative.
Data Structure: Do you have duplicates in your data? Do you need access to other data in Snowflake to join on in order to perform your transformations? As you start putting more and more data into Snowflake, it makes sense to transform it after it's in Snowflake - that's where all your data is and you will find it easier to join, query and transform in the cloud where all your other data is.
My question is: my data comes from an API that returns JSON files, and the new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls to the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
I would flatten your data in Python or Snowflake - depending on which you feel more comfortable using or how complex the data is. You could just do everything on the straight JSON, although I would rarely look to design something that way myself (it's going to be the slowest to query).
As far as aggregating the data, I'd always do that in Snowflake. If you would like to slice and dice the data various ways, you may look to design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good with that, but for additional speed, aggregating it up to months may be a good idea too.
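To make "aggregate in Snowflake, visualize locally" concrete, here is a rough sketch that pushes the monthly roll-up down to Snowflake and only pulls the small result set into the notebook. It assumes a raw VARIANT table like the one in the earlier sketch, invented field names inside the JSON, and that pandas/pyarrow are installed so fetch_pandas_all() is available:

import snowflake.connector

# Run the aggregation in Snowflake; only the summarized rows come back locally.
conn = snowflake.connector.connect(account="my_account", user="my_user",
                                    password="my_password", warehouse="QUERY_WH",
                                    database="ANALYTICS", schema="RAW")
cur = conn.cursor()
cur.execute("""
    SELECT DATE_TRUNC('month', payload:event_ts::timestamp) AS month,
           payload:category::string                         AS category,
           SUM(payload:metric::number)                      AS total
    FROM raw_events
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
df = cur.fetch_pandas_all()   # a handful of rows per month - cheap to plot locally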
You can probably mature your process from being driven by a local Python script to something like a serverless Lambda that is event-driven with a scheduler as well.
Firstly, I'm very new to DynamoDB, and AWS services in general - so I'm finding it hard when bombarded with all the details.
My problem is that I have an excel file with my data in CSV format, and I'm looking to add said data to a DynamoDB table, for easy access for the Alexa function I'm looking to build. The format of the table is as follows:
ID, Name, Email, Number, Room
1534234, Dr Neesh Patel, Patel.Neesh#work.com, +44 (0)3424 111111, HW101
Some of the rows have empty fields.
But everywhere I look online, there doesn't appear to be an easy way to actually achieve this - and I can't find any official means either. So, with my limited knowledge of this area, I am questioning whether I'm going about this entirely the wrong way. So firstly, am I thinking about this wrong? Should I be looking at a completely different solution for a backend database? I would have thought this would be a common task, but given the lack of support or easy solutions, am I wrong?
Secondly, if I'm going about this fine - how can it be done? I understand that DynamoDB requires a specific JSON format - and again, there doesn't appear to be a straightforward way to convert my CSV into that format.
Thanks, guys.
I had the same problem when I started using DynamoDB. When you come to distributed, big-data systems, you really need to architect how to move data across systems. This is where you start.
It is clearly documented here:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/SampleData.LoadData.html
Adding more details to understand the process.
Step 1: Convert your CSV to a JSON file.
If you have a small amount of data, you can use online tools, for example:
http://www.convertcsv.com/csv-to-json.htm
{
  "ID": 1534234,
  "Name": "Dr Neesh Patel",
  "Email": "Patel.Neesh#work.com",
  "Number": "+44 (0)3424 111111",
  "Room": "HW101"
}
You can see how nicely it is formatted - spaces removed, etc. Choose the right options and perform your conversion.
If your data is huge, then you need to use big-data tools to process and convert it in parallel.
Step 2: Upload using the CLI for small, one-time uploads
aws dynamodb batch-write-item --request-items file://data.json
If you want to regularly upload the file, you need to create a data pipeline or a different process.
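One thing to watch: batch-write-item expects the request-items JSON format (items wrapped in PutRequest entries with DynamoDB type descriptors such as "S" and "N"), so the flat objects that come out of the converter need reshaping before the CLI will accept them. If you end up wanting a repeatable process rather than a one-off upload, a small boto3 script can read the CSV directly; this is only a sketch, with the table name, region and key schema (ID as a numeric partition key) assumed:

import csv
from decimal import Decimal

import boto3

# Read the CSV and write items with batch_writer(), which buffers and sends
# BatchWriteItem calls for you. Table name, region and key schema are assumptions.
table = boto3.resource("dynamodb", region_name="eu-west-1").Table("StaffDirectory")

with open("staff.csv", newline="") as f, table.batch_writer() as batch:
    for row in csv.DictReader(f):
        # Drop empty fields rather than writing empty attributes.
        item = {k: v for k, v in row.items() if v not in (None, "")}
        if "ID" in item:
            item["ID"] = Decimal(item["ID"])   # DynamoDB numbers go in as Decimal
        batch.put_item(Item=item)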
Hope it helps.
DynamoDB is cool. However, before you use it you have to know your data usage patterns. For your case, if you're only ever going to query the DynamoDB table by ID, then it is great. If you need to query by any one column or combination of columns, then there are solutions for that:
Elasticsearch in conjunction with DynamoDB (which can be expensive)
secondary indexes on the DynamoDB table (understand that each secondary index creates a full copy of your DynamoDB table with the columns you choose to store in the index - see the sketch after this list)
ElastiCache in conjunction with DynamoDB (for tying searches back to the ID column)
RDS instead of DynamoDB ('cause a SQL-ish DB is better when you don't know your data usage patterns and you just don't want to think about it)
etc.
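As a quick illustration of the secondary-index option (the index name, its partition key and the table are assumed - you would create a global secondary index on the attribute you want to search by, then query the index instead of the base table):

import boto3
from boto3.dynamodb.conditions import Key

# Query by a non-key attribute via a hypothetical GSI ("Name-index" with
# partition key "Name") on the same StaffDirectory table.
table = boto3.resource("dynamodb", region_name="eu-west-1").Table("StaffDirectory")

response = table.query(
    IndexName="Name-index",
    KeyConditionExpression=Key("Name").eq("Dr Neesh Patel"),
)
for item in response["Items"]:
    print(item["ID"], item.get("Room"))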
It really depends on how much data you have and how you'll query the data - that is what should define your architecture. For me it would come down to weighing the cost and performance of each of the options available.
In terms of getting the data into your DynamoDb or RDS table:
AWS Glue may be able to work for you
AWS Lambda to programmatically get the data into your data store(s)
perhaps others
I've got an RDS database with a table containing a ton of data in several columns (some with geospatial data) that I want to search across. SQL queries with good covering indexes on this data are still far too slow to use for something like an AJAX type-ahead suggestion field.
As such, I'm investigating options for search and came across Amazon CloudSearch (now powered by Apache Solr), and it seems to fit my needs. The problem is, I can't seem to find a way via the AWS console to import or provide data from RDS. Am I missing something? Other solutions like Elasticsearch have plugins like river to connect and transform MySQL data.
I know there are command line tools for uploading CSV and XML data into CloudSearch. So far the easiest thing I can find is to mysqldump the table into CSV or XML format and manually load it with the CLI tools. Is this, with some recurring cron job, the best way to get the data in?
As of 2014-06-17, this feature is not available in Amazon CloudSearch.
I think AWS Data Pipeline can help. It works like a cron, and you can program recurring jobs easily with it.
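If you do script it yourself instead of (or alongside) Data Pipeline, the dump-and-upload step can be done in one small program: read rows out of the RDS MySQL table and post them to the CloudSearch document endpoint as a batch. This is just a sketch - hostnames, credentials, the table and its fields, and the document endpoint are all placeholders:

import json

import boto3
import pymysql

# Extract rows from the RDS MySQL table.
conn = pymysql.connect(host="mydb.xxxxxx.eu-west-1.rds.amazonaws.com",
                       user="app", password="secret", database="appdb")
with conn.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SELECT id, name, city FROM places")
    rows = cur.fetchall()

# CloudSearch batch format: a list of {"type": "add", "id": ..., "fields": {...}}.
batch = [{"type": "add", "id": str(r["id"]),
          "fields": {k: v for k, v in r.items() if k != "id"}}
         for r in rows]

# Post the batch to the domain's document endpoint.
client = boto3.client("cloudsearchdomain",
                      endpoint_url="https://doc-mydomain-xxxxxxxxxx.eu-west-1.cloudsearch.amazonaws.com")
client.upload_documents(documents=json.dumps(batch, default=str).encode("utf-8"),
                        contentType="application/json")

Scheduling that with cron, Data Pipeline, or a Lambda on a timer gives you the recurring refresh you described.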
Ran into the same thing; CloudSearch can only pull directly from a database if you are using NoSQL, i.e. AWS's DynamoDB - not RDS.
Looking into Elasticsearch after finding this out.
I need to fetch data from a normalized MSSQL DB and feed it into a Solr index.
I was just wondering whether Apatar can be used to perform the job. I've gone through its documentation, but I can't find the information I'm looking for. It states that it can fetch data from SQL Server and post it over HTTP, but I'm still not sure whether it can post the fetched data as XML over HTTP or not.
Any advice will be highly valuable. Thank you.
I am not familiar with Apatar, but seeing as it is a Java application, it may be a bit challenging to implement in a Windows environment. However, for various scenarios where I need to fetch data from an MSSQL database and feed it to Solr, I have written custom C# code leveraging the SolrNet client. This tends to be pretty straightforward, simple code, and in the cases where we need to load data at specified intervals we use scheduled tasks calling a console application. I would recommend checking out the Create/Update section of the SolrNet site for some examples of loading/updating data with the .NET client.
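If you would rather stay on the scripting side than C#, the same fetch-and-post pattern looks roughly like this in Python (connection string, table, Solr core and field names are assumptions); Solr's JSON update handler accepts a plain array of documents:

import pyodbc
import requests

# Read rows from MSSQL...
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=mydb;UID=loader;PWD=secret")
cur = conn.cursor()
cur.execute("SELECT id, title, description FROM products")
docs = [{"id": str(r.id), "title": r.title, "description": r.description}
        for r in cur.fetchall()]

# ...and post them to Solr's update handler. commit=true makes them searchable
# immediately; for large loads you would batch and commit less often.
resp = requests.post("http://localhost:8983/solr/products/update?commit=true",
                     json=docs, timeout=60)
resp.raise_for_status()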