Databricks Delta tables vs SQL Server delta tables

Is there a difference between a SQL Server delta table and a Databricks Delta table? It looks like in SQL Server we use the name conceptually: a table that stores the differences from a base table is a delta table. Is it the same for Databricks?

No. Databricks Delta (Delta Lake) is a storage layer that provides ACID transactions and other improvements for storing large amounts of data for use with Apache Spark. It is used to store complete datasets, which can be updated if necessary. Delta Lake is an open-source project, with some enhancements available on the Databricks platform.
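To make the distinction concrete, here is a minimal PySpark sketch of writing a full dataset as a Delta table and then updating it in place. It assumes a Databricks cluster (or a local Spark session configured with the open-source delta-spark package); the path, schema and values are made up for illustration.

```python
# Minimal sketch: writing and updating a Delta table with PySpark.
# Assumes a Databricks cluster or a SparkSession configured for delta-spark;
# the path and column names below are illustrative placeholders.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Write a complete dataset as a Delta table (not a "diff of a base table").
events = spark.createDataFrame(
    [(1, "open"), (2, "click")], ["event_id", "action"]
)
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Update rows in place; Delta handles this as an ACID transaction.
tbl = DeltaTable.forPath(spark, "/tmp/delta/events")
tbl.update(condition="event_id = 2", set={"action": "'purchase'"})
```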

Related

How is querying a data warehouse different than querying a database?

Say I have a data warehouse like BigQuery or Redshift. I store data which is fit for online analytical processing (OLAP). Similarly, suppose I have a database like MySQL or Microsoft SQL Server which has some data fit for online transaction processing (OLTP).
What are the different parameters on which querying a data warehouse and a database would be different?
This is a very general question; nevertheless, I think the following can help you make your decision:
1. How much data you have vs. relational features
2. Cloud solution vs. on-premises
3. Pricing models (derived from 2): for example, BigQuery's model is per scan while others are per storage

Using Microsoft SQL Server for a data warehouse (star schema)

We currently have a Microsoft SQL Server instance (OLTP) that we use as both our transactional and reporting database. We want to pull reporting out into a separate database.
We are currently vetting Redshift and Snowflake. A question came up today: why can't we create a new SQL Server instance for reporting that has the star schema and just use that (instead of Redshift or Snowflake)? We don't have many tables over a million rows, so a columnar data warehouse may be overkill for us.
Does anyone know the pros and cons of using Microsoft SQL Server as a reporting database (data warehouse) with a star schema?
We also have a requirement to handle real-time or near-real-time updates.
You can use SQL Server as a data warehouse repository. As long as you have a well-designed star schema, there is no reason not to use it for that purpose.
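For illustration, this is roughly what reporting against a SQL Server star schema looks like from Python. The table and column names (FactSales, DimDate, DimProduct) are hypothetical, and pyodbc with a SQL Server ODBC driver is assumed.

```python
# Sketch only: querying a hypothetical star schema (a fact table joined to
# conformed dimensions) in SQL Server via pyodbc. Names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=reporting-server;DATABASE=ReportingDW;Trusted_Connection=yes;"
)

sql = """
SELECT d.CalendarYear,
       p.ProductCategory,
       SUM(f.SalesAmount) AS TotalSales
FROM   dbo.FactSales  AS f
JOIN   dbo.DimDate    AS d ON d.DateKey    = f.DateKey
JOIN   dbo.DimProduct AS p ON p.ProductKey = f.ProductKey
GROUP  BY d.CalendarYear, p.ProductCategory
ORDER  BY d.CalendarYear, TotalSales DESC;
"""

for row in conn.cursor().execute(sql):
    print(row.CalendarYear, row.ProductCategory, row.TotalSales)
```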

AWS Glue: SQL Server multiple partitioned databases ETL into Redshift

Our team is trying to create an ETL into Redshift to serve as our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into one Redshift database.
Looking at AWS Glue, it doesn't seem possible to achieve this. Since Glue opens up the job script to be edited by developers, I was wondering if anyone else has had experience looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid having to create a job for each database... unless we can programmatically loop through and create multiple jobs for each database.
We've also taken a look at DMS, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would solve the multiple partitioned data source issue either.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL Server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
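Alternatively, if you do keep this inside a Glue job script, the same parameterise-and-iterate idea can be sketched directly in PySpark. The connection list, credentials, table name and S3 staging path below are all placeholders, not a tested pipeline.

```python
# Rough sketch of the iterate-over-identical-sources pattern in a PySpark
# (e.g. AWS Glue) job script. Hostnames, credentials, table and bucket
# names are placeholders.
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

sources = [  # one entry per partitioned SQL Server database
    {"name": "tenant_01", "url": "jdbc:sqlserver://host01:1433;databaseName=app01"},
    {"name": "tenant_02", "url": "jdbc:sqlserver://host02:1433;databaseName=app02"},
    # ... 40+ more
]

frames = []
for src in sources:
    df = (spark.read.format("jdbc")
          .option("url", src["url"])
          .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
          .option("dbtable", "dbo.Orders")            # same table in every source
          .option("user", "etl_user")
          .option("password", "***")                  # placeholder credential
          .load()
          .withColumn("source_db", lit(src["name"])))  # keep lineage per source
    frames.append(df)

# Union everything and stage it on S3 as Parquet, ready for a Redshift COPY
# (or to be queried via Spectrum).
combined = reduce(lambda a, b: a.unionByName(b), frames)
combined.write.mode("overwrite").parquet("s3://my-staging-bucket/orders/")
```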
You can use AWS DMS for this.
Steps:
1. Set up and configure a DMS replication instance.
2. Set up a target endpoint for Redshift.
3. Set up a source endpoint for each SQL Server instance (see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html).
4. Set up a task for each SQL Server source; you can specify the tables to copy/synchronise and use a transformation rule to specify which schema name(s) on Redshift you want to write to (see the boto3 sketch after these steps).
You will then have all of the data in identical schemas on Redshift.
If you want to query all of those together, you can do that either by running some transformation code inside Redshift to combine the data into new tables, or you may be able to use views.
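Those steps can also be scripted. A rough boto3 sketch of the per-source loop follows; the ARNs, identifiers and credentials are placeholders, and the table-mapping rule is trimmed to the minimum.

```python
# Sketch: creating one DMS source endpoint and replication task per
# SQL Server database with boto3. All identifiers/ARNs are placeholders.
import json
import boto3

dms = boto3.client("dms")

REPLICATION_INSTANCE_ARN = "arn:aws:dms:...:rep:EXAMPLE"
TARGET_ENDPOINT_ARN = "arn:aws:dms:...:endpoint:REDSHIFT-TARGET"

sources = [
    {"id": "sqlserver-01", "server": "host01.example.com", "db": "app01"},
    {"id": "sqlserver-02", "server": "host02.example.com", "db": "app02"},
    # ... one entry per partitioned database
]

table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-dbo",
        "object-locator": {"schema-name": "dbo", "table-name": "%"},
        "rule-action": "include",
    }]
}

for src in sources:
    # One source endpoint per SQL Server database.
    endpoint = dms.create_endpoint(
        EndpointIdentifier=src["id"],
        EndpointType="source",
        EngineName="sqlserver",
        ServerName=src["server"],
        Port=1433,
        DatabaseName=src["db"],
        Username="dms_user",
        Password="***",  # placeholder credential
    )
    # One task per source, all writing to the same Redshift target.
    dms.create_replication_task(
        ReplicationTaskIdentifier=f"{src['id']}-to-redshift",
        SourceEndpointArn=endpoint["Endpoint"]["EndpointArn"],
        TargetEndpointArn=TARGET_ENDPOINT_ARN,
        ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
        MigrationType="full-load-and-cdc",
        TableMappings=json.dumps(table_mappings),
    )
```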

How to perform Lookups in Azure Data Factory?

I'm an SSIS developer. I do a lot of lookups in SSIS using SQL stored procedures. But coming to Azure Data Factory, I have no idea how to perform a lookup using a SQL stored procedure.
Could anyone please guide me on this?
Thanks in advance !
Jay
Azure Data Factory (ADF) is more of an ELT tool than an ETL tool, so direct lookups are not supported. Instead, this type of operation, along with other transforms, is pushed down into the compute you are actually using. For example, if you are moving data to SQL Server, Azure SQL Database or Azure SQL Data Warehouse, you would ensure all the data is on the same server and use a Stored Procedure activity to execute the lookups using T-SQL and joins. If you are using Azure Data Lake Analytics (ADLA), you would use the U-SQL activity to run U-SQL or execute ADLA stored procedures, again doing lookups via joins or custom U-SQL code such as a Combiner, Applier or Reducer. In fact, you can use any of the ADF compute options, such as SQL, HDInsight (including Hive, Pig, MapReduce, Streaming and Spark scripts), Machine Learning or custom .NET activities.
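To make the "lookup is just a join" idea concrete, here is a hedged sketch of the kind of T-SQL you might wrap in a stored procedure and call from a Stored Procedure activity. The staging and dimension table names are invented, and pyodbc is used only to run the statement for testing outside ADF.

```python
# Illustration only: an SSIS-style "lookup" expressed as a T-SQL join,
# the sort of statement you would wrap in a stored procedure and invoke
# from an ADF Stored Procedure activity. Table names are invented.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=MyDW;Trusted_Connection=yes;"
)

lookup_sql = """
-- Resolve the customer surrogate key for newly staged orders,
-- i.e. the join an SSIS Lookup transform would have done row by row.
INSERT INTO dw.FactOrders (OrderID, CustomerKey, OrderAmount)
SELECT s.OrderID,
       ISNULL(c.CustomerKey, -1) AS CustomerKey,  -- -1 = unknown member
       s.OrderAmount
FROM   stg.Orders AS s
LEFT   JOIN dw.DimCustomer AS c
       ON c.CustomerBusinessKey = s.CustomerID;
"""

cursor = conn.cursor()
cursor.execute(lookup_sql)
conn.commit()
```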
So you need to think about things differently with ADF. Have a look through this article to gain greater understanding of transforming data in ADF:
Transform data in Azure Data Factory
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-transformation-activities
As an aside, I would rarely use Lookups in SSIS, as performance in early versions used to be poor. Although this has improved in later versions, generally if you can do it in SQL you probably should. That pattern harnesses the power of SQL Server, rather than dragging data up into the SSIS pipeline (e.g. for the purposes of lookups, which are essentially joins) and pushing it back out again. I reserve Data Flow transformations mainly for when non-relational data is involved, e.g. XML, or joining your email server with relational data. This is my personal view anyway : )

Representing Oracle SQL cubes with MicroStrategy

Hi, I have several cube tables in an Oracle 12c database. How do I represent them in MicroStrategy? The MicroStrategy Intelligent Cube object doesn't represent these cubes correctly, and it saves the SQL results in memory. I need to execute SQL in real time against the cube tables.
A MicroStrategy cube is an in-memory copy of the results of an SQL query executed against your data warehouse. It's not intended to be a representation of the Oracle cubes.
I assume both of these "cubes" organize data in a way that is easy and fast to use for dimensional queries, but I don't think you can directly import an Oracle cube into MicroStrategy IServer memory.
I'm not an expert with Oracle cubes, but I think you need to map dimensions and facts just as you would with any other Oracle table. In the end, an Oracle cube is a tool that Oracle provides to organize your data (once dimensions and metrics are defined) and speed up your queries, but you still need to query it: MicroStrategy will write your queries, but MicroStrategy also needs to be aware of your dimensions and metrics (MicroStrategy facts).
In the end, a cube speeds up your queries by organizing and aggregating your data, and it seems to me that you have achieved this already with your Oracle cube. A MicroStrategy cube is an in-memory structure that also saves the time required by a query against the database.
If your requirement is to execute SQL against your database at all times, then you need to disable caching on the MicroStrategy side (this can be done on a report-by-report basis, or at the project level).
MicroStrategy Intelligent Cubes aren't going to be a good fit for you here, because they explicitly cache data, in order to decrease response time, and reduce load on your source database.
