What is the difference between Snowflake Tasks and dbt?

Hi :) I'm a novice when it comes to setting up a DWH from scratch. I have chosen Snowflake as our DWH, and am now trying to set up the ELT flows.
I understand that Snowflake has a Task object, which can run SQL and be scheduled as well.
However, there is a big push from the data community to use dbt for managing the T part of ELT.
Could you please tell me what the difference is between using Snowflake's Stream and Task objects for scheduling transformations and using dbt?
Thank you.

Here are two good links that explain why dbt is so well suited to data transformation work.
If I had to sum it up:
Modularity (models, references, lineage)
Environments (development, deployment, scheduling)
Quality maintenance (version control (GitHub) & collaboration, tests, documentation)
Links
DBT's viewpoint: https://docs.getdbt.com/docs/about/viewpoint
And even the Snowflake community agrees: https://community.snowflake.com/s/article/Don-t-Do-Analytics-Engineering-in-Snowflake-Until-You-Read-This-Hint-dbt
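To make the contrast concrete, here is a minimal sketch (the warehouse, schemas, tables and model names are made up). With a plain Snowflake Task you write, schedule and chain the SQL yourself:

-- Snowflake Task: you own the SQL, the schedule and the dependency chain
CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = transform_wh
  SCHEDULE = '60 MINUTE'
AS
  CREATE OR REPLACE TABLE analytics.fct_orders AS
  SELECT o.order_id, o.customer_id, SUM(l.amount) AS order_total
  FROM raw.orders o
  JOIN raw.order_lines l ON l.order_id = o.order_id
  GROUP BY o.order_id, o.customer_id;

ALTER TASK transform_orders RESUME;  -- tasks are created suspended

With dbt, the same logic lives in a model file (e.g. models/fct_orders.sql); dbt resolves the ref() calls into a dependency graph, handles the materialization, and layers tests, documentation and environments on top:

-- models/fct_orders.sql: dbt builds this model after its upstream ref()s
SELECT o.order_id, o.customer_id, SUM(l.amount) AS order_total
FROM {{ ref('stg_orders') }} o
JOIN {{ ref('stg_order_lines') }} l ON l.order_id = o.order_id
GROUP BY o.order_id, o.customer_id

Note that dbt still needs something to trigger dbt run on a schedule (dbt Cloud, Airflow, cron, or even a Snowflake Task), so the two are not mutually exclusive.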

Related

What is the difference between Databricks and Spark?

I am trying to get a clear picture of how they are interconnected and whether the use of one always requires the use of the other. If you could give a non-technical definition or explanation of each of them, I would appreciate it.
Please do not paste a technical definition of the two. I am not a software engineer or data analyst or data engineer.
These two paragraphs summarize the difference quite well (from this source):
Spark is a general-purpose cluster computing system that can be used for numerous purposes. Spark provides an interface similar to MapReduce, but allows for more complex operations like queries and iterative algorithms. Databricks is a tool that is built on top of Spark. It allows users to develop, run and share Spark-based applications.
Spark is a powerful tool that can be used to analyze and manipulate data. It is an open-source cluster computing framework that is used to process data in a much faster and efficient way. Databricks is a company that uses Apache Spark as a platform to help corporations and businesses accelerate their work. Databricks can be used to create a cluster, to run jobs and to create notebooks. It can be used to share datasets and it can be integrated with other tools and technologies. Databricks is a useful tool that can be used to get things done quickly and efficiently.
In simple words, Databricks is a 'tool' built on top of Apache Spark: it wraps Spark in an intuitive way that is easier for people to use.
This is, in principle, the same as the difference between Hadoop and AWS EMR.

Oracle Data Masking Clarification

Hi, I am looking at the possibility of implementing the Oracle Data Masking Pack but require some clarification. I am looking to mask just my non-production data. In the process of doing this, does Oracle create clones of the database? I am hoping to mask the PII data without creating a clone or needing to create additional Oracle users. Does the Oracle solution meet these requirements? I am also being told that deploying ODM will require application changes; can anyone elaborate on this? My apologies, I am extremely new to the DB world. Are there any other data masking solutions anyone can recommend?
Which version of Oracle are you using? Do you have an Advanced Security license?
You can make use of the Data Redaction feature, part of Advanced Security and available from 12.1. Oracle provides DBMS_REDACT to hide values at query execution time, i.e. only when they are displayed on screen, so it does not impact the performance of any dependencies.
There are multiple options available, with full or partial redaction.
Let me know if you require any more details; I have recently implemented it in a production environment to protect PII data.
Official documentation
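As a rough sketch of what this looks like (the HR.EMPLOYEES table and SSN column are made-up examples), you register a policy with DBMS_REDACT.ADD_POLICY and the stored data itself is left unchanged:

BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'HR',
    object_name   => 'EMPLOYEES',
    policy_name   => 'redact_employee_ssn',
    column_name   => 'SSN',
    function_type => DBMS_REDACT.FULL,   -- full redaction; partial and regexp variants also exist
    expression    => '1=1');             -- apply to every session; refine with SYS_CONTEXT checks if needed
END;
/

Users granted the EXEMPT REDACTION POLICY privilege still see the real values.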

SSIS - Integration Services Catalog Deployment & Logging

We are using SSIS 2012. I am pretty new to it. My goal is to implement custom logging of:
Execution start/end times of tasks/package
record counts (both extraction & loading) in the DFT
errors
events
Please note that in the package there could be several DFTs which may contain multiple source(s) and destination(s).
Very recently, while sifting through info on the internet, I came across the SSIS catalog deployment functionality, which supposedly provides several log tables (e.g. [executions], [execution_parameter_values], [executable_statistics], [execution_data_statistics], etc.) in SSISDB to capture metadata pertaining to package execution, performance, parameters, configurations, etc.
I wish to know whether my understanding is correct and whether this feature can be leveraged to capture log data for auditing, performance analysis, etc. I am also keen to know whether this logging to SSISDB happens automatically when packages are deployed through the catalog, or whether the packages need some settings/configuration enabled in SSIS first to facilitate logging this metadata to the SSISDB catalog tables.
Having found this topic, and hopeful of its potential, I am inclined to leverage this functionality in the hope that it will save a huge amount of effort/time versus implementing the same logging by hand.
NB: Additionally, I would like to understand the limitations (if any) of this approach that could manifest as a showstopper after deployment to production.
Please share your thoughts, and point me to any informative and relevant resources.
MSDN help resource here.
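To give a sense of what the catalog captures automatically once packages run from SSISDB (execution and error data is logged at the default Basic logging level without extra package configuration; the queries below are only a sketch), you can query the catalog views directly. Row counts per data flow path additionally require the Verbose logging level:

-- Execution start/end times and status of recent package runs
SELECT e.execution_id, e.folder_name, e.project_name, e.package_name,
       e.start_time, e.end_time, e.status
FROM SSISDB.catalog.executions AS e
ORDER BY e.start_time DESC;

-- Error events raised during execution (message_type 120 = OnError)
SELECT m.operation_id AS execution_id, m.message_time, m.message
FROM SSISDB.catalog.event_messages AS m
WHERE m.message_type = 120
ORDER BY m.message_time DESC;

-- Rows moved per data flow path (populated only at the Verbose logging level)
SELECT s.execution_id, s.package_name, s.source_component_name,
       s.destination_component_name, s.rows_sent, s.created_time
FROM SSISDB.catalog.execution_data_statistics AS s
ORDER BY s.created_time DESC;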
Thanks

Cassandra Schema Management with CQL scripts

To Cassandra experts: I am tasked with coming up with recommendations on Cassandra CQL script management and deployment: how teams manage (or should manage) a large number of CQL scripts (schema definition scripts (DDL) and data manipulation scripts (INSERT/UPDATE/DELETE)) from the inception of Cassandra development and through subsequent changes to the application schema model. If I may, I would like to point out that the development team is not that small (10+ developers per application functional area).
One way (probably the wrong way) is to do what a typical relational database shop would do: app developers or development DBAs design and create the DDL, DML, etc. scripts, store and maintain them in a version control system (e.g. SVN), and deploy the scripts to an environment (dev, QC, etc.) using some automation (maybe as simple as a shell or Perl script). I think where this breaks down in a NoSQL solution such as Cassandra is in the actors involved in these three steps:
1 - design and create CQL scripts: should this be done by DevOps (Cassandra admins) or by application developers?
2 - store and maintain them in SVN: should this be handled similarly to (1) above?
3 - deployment of scripts: should someone from application development do this, or DevOps?
I would also like to get answers from an application schema control and auditing viewpoint. For example, for #1 and #2 above, if application developers design, create and store the CQL scripts in SVN, how can one control what gets into the CQL schema and prevent costly errors? If there is a dedicated, single team owning the data model rather than all Cassandra developers (akin to DBAs/administrators), it is easier to achieve that control.
I am hoping someone who has done this before can shed some light on the choices and best practices for CQL code development, deployment and maintenance in a large environment.
Thanks as always.
I think the main issue you'll be confronted with is that you'll need to write code to perform some migrations, which is a significant difference to applying delta patches in a typical SQL scenario. Basic changes to the schema (as defined using CQL) can easily be applied using the cqlsh tool in a DevOps/DBA style. These types of changes would include adding columns and removing columns. But if you need to do something more fundamental, then you're going to have to write CQL client code to migrate the old data. This is especially true the more denormalization and non-declarative indexing your app requires.
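For those basic changes, a versioned script applied through cqlsh can be as simple as the sketch below (the keyspace, table and the schema_version bookkeeping table are made up for illustration):

-- 003_add_loyalty_tier.cql, applied with: cqlsh -f 003_add_loyalty_tier.cql
ALTER TABLE shop.customers ADD loyalty_tier text;

-- record what has been applied in a bookkeeping table owned by whoever controls the schema
INSERT INTO shop.schema_version (version, applied_at, description)
VALUES (3, toTimestamp(now()), 'add loyalty_tier to customers');

Whoever owns step 3 in your list simply runs the outstanding scripts in version order per environment; anything that rewrites existing data still needs the client-side migration code described above.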
FWIW and YMMV I was able to automate one aspect of CQL schema management which was to find a way to keep the schema and the application code in sync. To achieve this I wrote a CQL schema compiler that generates boilerplate application source code so that data binding is always in sync with the current schema in Cassandra. But that is just one aspect of the overall problem.

Testing database performance tools?

What do you think is the best tool for testing database performance? I'm looking for a tool that helps me find weak performance spots in my DB while the application is being used.
There are at least two not obvious tools that can help you:
SoapUI has support for JDBC
JMeter has a JDBC sampler (don't miss these wonderful plugins!)
I said these tools are not obvious because they are typically used for different purposes (SOAP web service functional testing and HTTP load testing, respectively). JMeter seems a bit better suited, as it is aimed at performance testing, but SoapUI can do this as well.
I'd just use SQL Server Profiler to capture a database-side trace, and then just sort by duration.
I do stuff like this 5 times a day.
Hope that helps
-Aaron MCITP: DBA
You could use a source-code-level profiler to profile the application that accesses your database. Profilers can identify the slowest lines of code, and most can filter their results by namespace or naming pattern, so you could filter out all non-database-access code. You can then look at which database queries are made on those slow lines.
In some database systems you can set up logging to record which queries were run and how long they took. Database monitoring applications can show you which queries are running at the moment, so you can identify the slowest or most frequently executed queries very easily. If that is not an option, you can log the queries your app issues to a text file and then run them manually against the database; the time taken is usually displayed.
A good aid for optimizing database queries is the EXPLAIN command supported by many DBMSs.
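For example, in PostgreSQL (and recent MySQL versions) you can prefix a suspect query with EXPLAIN, or EXPLAIN ANALYZE to also execute it and see actual timings (the table here is made up):

EXPLAIN ANALYZE
SELECT o.customer_id, SUM(o.total) AS lifetime_value
FROM orders o
WHERE o.created_at >= '2023-01-01'
GROUP BY o.customer_id;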
If you told us exactly which database you are running, we could help more.
