I have a scenario where I need to read data from a SQL Server database on Azure, perform calculations, and save the calculated data back to the same database.
Here, I'm using a Timer Trigger Function so that I can schedule the calculations one after another, since they depend on each other (a total of 10 calculations running in sequence).
The same can be achieved more easily via stored procedures, since they reside in the backend. I want to understand which is the better way to handle such a scenario in terms of performance, scalability, debugging capabilities, cost, etc.
If you are using SQL Server, then a stored procedure is definitely the right approach, because it runs inside the database engine: the calculations happen next to the data, and nothing has to be shipped in and out of SQL Server between steps.
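As a minimal sketch (the procedure names here are hypothetical), the ten dependent calculations can be chained inside a single master procedure, so the ordering lives in the database instead of in timer schedules:

    -- Hypothetical master procedure: runs the dependent calculations in order.
    -- Each calculation procedure reads from and writes back to tables in the
    -- same database, so no data leaves SQL Server between steps.
    CREATE PROCEDURE dbo.RunAllCalculations
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            BEGIN TRANSACTION;
            EXEC dbo.Calculation01;
            EXEC dbo.Calculation02;  -- depends on the results of Calculation01
            -- ... Calculation03 through Calculation10, in dependency order
            EXEC dbo.Calculation10;
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            THROW;  -- surface the error to the caller/scheduler
        END CATCH
    END;

A single schedule (a timer trigger or an ADF trigger) then only needs to call dbo.RunAllCalculations.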
Another recommended approach is to use the Data Flow activity in Azure Data Factory and transform the data using the available functions. This method is easy to use, as all the required transformation functions are built in.
You can also run a stored procedure in Azure Data Factory using the Stored Procedure activity.
Refer: Create Azure Data Factory data flows
I'm dealing with a three-piece data path: client application, host integration server, DB server. The client application (MS Mashup Engine) generates queries that pass through MS SQL Server to a legacy IBM iSeries DB backend.
I'm running into issues where the client is generating queries like
select * from x where numericValue = 1.4651E+003
I'm checking the execution plan for these on the SQL Server and they result in a full data load with the comparison occurring on SQL Server (which is acting as the Host Integration Server).
By comparison, a human-generated query
select * from x where numericValue = 1465.1
results in no scan and performance two orders of magnitude faster.
I have tried playing with the client application to force it to generate something like the human generated query, but I've had no luck.
I'm not sure if I can massage the way the query plan is generated in SQL Server by playing with column data types, i.e. exposing a view over the backend DB with explicitly defined data types, or otherwise forcing query plan generation.
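To illustrate the view idea (the linked-server parts and column names below are all placeholders), I mean something like:

    -- Hypothetical sketch: expose the remote table through a view with an
    -- explicit numeric type, hoping the optimizer can fold the literal and
    -- push the comparison down to the iSeries instead of scanning.
    -- IBMI / MYCAT / MYLIB are placeholder linked-server name parts.
    CREATE VIEW dbo.vw_X
    AS
    SELECT CAST(numericValue AS DECIMAL(18, 4)) AS numericValue,
           otherColumn
    FROM IBMI.MYCAT.MYLIB.X;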
Any thoughts?
No, the answer is no. I tried all of the following:
Casting types prior to delivering to Power Query/Power BI changes nothing.
Adapter properties set for DB2OLEDB or IBMDASQL (e.g. Decimal As Numeric=True; Derive Parameters=True; etc.) have no effect.
Creating stored procedures over the data requires parameterization in Power Query, which works, but does not integrate with the UI, i.e. using a parameter field instead of the column filters is not clean.
TBH, this has become less and less of a problem as backend performance improves and frontend caching is available.
Hope it helps,
-Alex T.
Our team is trying to create an ETL into Redshift to serve as our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned our database into 40+ data sources. We are looking for a way to pipe the data from all of these identical data sources into one Redshift DB.
Looking at AWS Glue, it doesn't seem possible to achieve this. Since the job script is open for developers to edit, I was wondering if anyone else has had experience with looping through multiple databases and transferring the same table into a single data warehouse. We are trying to avoid creating a job for each database... unless we can programmatically loop through and create multiple jobs, one per database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to Redshift, but it doesn't seem like it would solve the multiple-partitioned-data-source issue either.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool: it will Extract data from your (numerous) SQL Server databases and Load it, via an efficient Redshift COPY, into staging tables (which can be stored inside Redshift in the usual way, or held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify whether you are breaking your data down across those servers horizontally or vertically), you can parameterise the connection details in your jobs and use iteration to run them against each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
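To give a flavour of what gets pushed down (the table and column names below are invented for illustration), a transformation job ultimately compiles to SQL that Redshift runs in parallel, along the lines of:

    -- Hypothetical transformation pushed down to Redshift: join staged
    -- rows to dimension tables and materialize a reporting fact table.
    CREATE TABLE reporting.fact_sales AS
    SELECT s.sale_id,
           d.date_key,
           c.customer_key,
           s.amount
    FROM staging.sales AS s
    JOIN reporting.dim_date AS d     ON d.full_date   = s.sale_date
    JOIN reporting.dim_customer AS c ON c.customer_id = s.customer_id;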
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
set up and configure a DMS instance
set up a target endpoint for Redshift
set up a source endpoint for each SQL Server instance; see https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
set up a task for each SQL Server source; you can specify the tables to copy/synchronise, and you can use a transformation to specify which schema name(s) on Redshift you want to write to
You will then have all of the data in identical schemas on Redshift.
If you want to query all of those together, you can do that either by running some transformation code inside Redshift to combine the data and make new tables, or you may be able to use views.
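For example, if each SQL Server source lands in its own Redshift schema via a DMS transformation rule (the schema and table names below are placeholders), a view can present them all as one table:

    -- Hypothetical: sources were written to schemas src01, src02, ...;
    -- union the identical tables into a single queryable view.
    CREATE VIEW reporting.orders_all AS
    SELECT 'src01' AS source_db, o.* FROM src01.orders AS o
    UNION ALL
    SELECT 'src02' AS source_db, o.* FROM src02.orders AS o
    UNION ALL
    SELECT 'src03' AS source_db, o.* FROM src03.orders AS o;
    -- ...repeat for the remaining source schemas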
I'm an SSIS developer. I use SQL stored procedure lookup patterns a lot in SSIS, but when it comes to Azure Data Factory I have no idea how to perform a lookup using a SQL stored procedure.
Could anyone please guide me on this?
Thanks in advance !
Jay
Azure Data Factory (ADF) is more of an ELT tool than an ETL tool, so direct lookups are not supported. Instead, this type of operation, along with other transforms, is pushed down into the compute you are actually using. For example, if you are moving data to SQL Server, Azure SQL Database or Azure SQL Data Warehouse, you would ensure all the data is on the same server and use a Stored Procedure activity to execute the lookups using T-SQL and joins. If you are using Azure Data Lake Analytics (ADLA), you would use the U-SQL activity to run U-SQL or execute ADLA stored procedures, again doing lookups via joins or custom U-SQL code such as a Combiner, Applier, or Reducer. In fact you can use any of the ADF compute options, like SQL, HDInsight (including Hive, Pig, Map Reduce, Streaming and Spark script), Machine Learning, or custom .NET activities.
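To make the first option concrete, a lookup in this pattern is just a join inside a stored procedure that ADF's Stored Procedure activity calls (all object names below are hypothetical):

    -- Hypothetical T-SQL equivalent of an SSIS Lookup: enrich staged rows
    -- by joining to a reference table, then load the target table.
    CREATE PROCEDURE dbo.LoadOrdersWithCustomerLookup
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.FactOrders (OrderID, CustomerKey, Amount)
        SELECT s.OrderID,
               c.CustomerKey,                    -- the "looked up" value
               s.Amount
        FROM stg.Orders AS s
        LEFT JOIN dbo.DimCustomer AS c
            ON c.CustomerID = s.CustomerID;      -- lookup condition
    END;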
So you need to think about things differently with ADF. Have a look through this article to gain greater understanding of transforming data in ADF:
Transform data in Azure Data Factory
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-transformation-activities
As an aside, I would rarely use Lookups in SSIS, as performance in early versions used to be poor. Although this has improved in later versions, generally if you can do it in SQL you probably should. That pattern harnesses the power of SQL Server, rather than dragging data up into the SSIS pipeline for the purposes of lookups (which are essentially joins) and pushing it back out again. I reserve Data Flow transformations mainly for when non-relational data is involved, e.g. XML, or joining your email server with relational data. This is my personal view anyway : )
When developing an Azure SQL Data Warehouse with SSIS, we have a two-phase process:
1) copy the data source to a staging table, 2) copy the staging table to the report table.
My question is, will SSIS actually extract the data through its own server, even when it knows the source and target are the same OLE DB provider? Or is it smart enough to use something like "SELECT ... INTO ... FROM ..."? This makes a difference to us because Azure charges for data exported out of Azure, we have a lot of similar copying actions in the DW, and the SSIS machine is the only one on-premises.
We could define a series of Execute SQL Tasks with nested queries, but TransactionOption is hard to manage across that many tasks.
Thanks.
I do not think it will do this: SSIS was designed with so many hooks into the pipeline that it would be counter-intuitive for it to optimize by skipping the pipeline entirely.
You could, however, use T-SQL to do the SELECT INTO and keep the processing on the same server as the database engine.
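A minimal sketch of that (table names are invented), run from an Execute SQL Task so the rows never travel through the SSIS pipeline:

    -- Hypothetical server-side copy: the data stays inside Azure SQL.
    INSERT INTO dbo.ReportTable (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.StagingTable;

    -- Or, when the target table does not exist yet:
    -- SELECT Col1, Col2 INTO dbo.ReportTable FROM dbo.StagingTable;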
If you need to switch between the two methods, you can add a parameter to your package and control conditional execution via precedence constraints.
I have an Access 2003 database that holds all of my business data. This Access database gets updated every few hours during the day.
We're currently writing a website that will need to use the data from the Access database. This website (for the time being) will be read-only, meaning there only needs to be a one-way transfer of data (Access -> SQL).
I imagine there's a way to perform this data migration from Access to SQL Server programmatically. Does anyone have any links to something I can read about this?
If this practice sounds odd, and you'd like to suggest another way to do this (or a situation where data can go both ways (Access -> SQL, SQL -> Access), that's perfectly fine.
The company is going to continue using Access 2003 for their business functionality. There's no way around that. But I'd like to build the (readonly) website on top of SQL Server.
The strategy you outlined can be very challenging. You could use INSERT queries to copy new Access rows to SQL Server, as described in another answer.
However, if you have changes to existing Access rows, and you also want those changes propagated to SQL Server, it won't be so simple. And it will be more complicated still if you want deleted Access rows deleted from SQL Server, too.
It seems more reasonable to me to use a different approach. Migrate the data to SQL Server once. Then replace the tables in your Access database with ODBC links to the SQL Server tables. Thereafter, changes to the data from within your Access application will not require a separate synchronization step ... they will already be in SQL Server. And you won't need to write any code to synchronize them.
If your concern is that the connections between the web server and SQL Server be read-only, just set them up that way. You can still independently allow read-write permissions for your Access application.
To do the initial data migration and set up the SQL Server tables automatically, I would use the SQL Server Migration Assistant. The only thing you should definitely change that I can think of is to turn off the Identity property on any columns that have it, to be explained below (MS Access calls Identity AutoNumber). Once you have your tables loaded, you can set up a DSN-less connection to the database (and tables) you just created.
I haven't used the method just linked, but I believe it allows you to use SQL Server authentication to connect to the DB. The benefit of using this method is that you can easily change which SQL Server instance and/or database you are connecting to for development and testing.
There might be a better, automated way, but you can create several INSERT queries, each doing a left join from the primary key of the Access table to the SQL Server table, with a WHERE clause specifying that the SQL Server primary key must be null. This is why you need to turn off the Identity property in the SQL Server tables: so that you can insert rows with their existing key values.
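A sketch of one such query (Customers is the local Access table, dbo_Customers a hypothetical linked SQL Server table):

    -- Hypothetical Access append query: insert only rows whose primary
    -- key does not already exist in the linked SQL Server table.
    INSERT INTO dbo_Customers (CustomerID, CustomerName)
    SELECT a.CustomerID, a.CustomerName
    FROM Customers AS a
    LEFT JOIN dbo_Customers AS s
        ON a.CustomerID = s.CustomerID
    WHERE s.CustomerID IS NULL;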
Finally, put the name of each query in one function, then run the function periodically.
I have used Microsoft's free SQL Server Migration Assistant (SSMA) to migrate Access to SQL Server. The tool is very simple to use. The only problem I have encountered with it was overly wide data types after migration: what I mean is that a small string will sometimes get converted to an NVARCHAR(MAX). Otherwise, the tool is very handy and can be reused after setting up a 'profile'.