Stored procedures for inserting into Cassandra - sql-server

I am migrating my database from SQL to Cassandra.
Right now, my application logic is decoupled from insert logic as the application only calls the specific procedures in SQL which calls some data validations and makes corresponding insertions.
I want to do the same in Cassandra. However, I am unable to find anything similar for Cassandra. UDFs are not useful as they are mostly used in select query.

Related

Azure Functions vs Stored Procedures for Sequential Dependent Calculation Scenario

I have a scenario where I need to read the data from SQL Server Database (Azure) and perform calculations and save the calculated data back to SQL Server Database.
Here, I'm using the Timer Trigger Function so that I can schedule one after another as calculations are dependent on each other (a totally of 10 calculations running one after another)
The same can be achieved via Stored Procedures in an easy way as they reside in the backend. I want to understand which is the better way to perform/handle such a scenario? in terms of Performance, Scalability, Debugging Capabilities, Cost, etc.
If you are using SQL Server then definitely SQL Procedure is the right approach because of it's compatibility with SQL Server.
Another recommended approach is use data flow activity in Azure Data Factory and transform the data using the available functions. This is easy to use methods as you will get all the required transformation functions in-built.
You can also run Store Procedure in Azure data factory using Stored Procedure activity.
Refer: Create Azure Data Factory data flows

SQL Server tables connection

I have to connect multiple tables that are part of single or multiple databases. Approximately 10-15 tables in each query have to be connected to generate data for the analysis in SQL Server 2014.
I don't have access to the database diagram or architecture and these reports are to be sent out weekly. I want to understand the approach on how to begin writing these kind of queries which are of basic and advanced level and identify the relationship between tables and what kind of advanced level queries I can learn or utilize like CTE, Rank Partition, Subqueries etc.
Anybody who can provide a rough flow diagram or structure about the approach will be really helpful.
It's very unlikely that owners of those source systems want to be directly queried every time someone runs a report. Since you already have access to SQL Server, I would suggest building a data warehouse with that.
You haven't provided a whole lot of information to go on, but SSIS packages could be created to connect to the source systems and load into your data warehouse. And furthermore, those packages can be scheduled through Agent.
As for modeling... Again it is difficult with the lack of information, but generally the star model works great for reporting, which is a fact table surrounded by dimension (or attribute) tables.
As for figuring out relationships without a diagram, this will have to be done via experimentation and tieing to existing reports to make sure your joins aren't dropping records or cascading.
Good luck.

AWS Glue: SQL Server multiple partitioned databases ETL into Redshift

Our team is trying to create an ETL into Redshift to be our data warehouse for some reporting. We are using Microsoft SQL Server and have partitioned out our database into 40+ datasources. We are looking for a way to be able to pipe the data from all of these identical data sources into 1 Redshift DB.
Looking at AWS Glue it doesn't seem possible to achieve this. Since they open up the job script to be edited by developers, I was wondering if anyone else has had experience with looping through multiple databases and transfering the same table into a single data warehouse. We are trying to prevent ourselves from having to create a job for each database... Unless we can programmatically loop through and create multiple jobs for each database.
We've taken a look at DMS as well, which is helpful for getting the schema and current data over to redshift, but it doesn't seem like it would work for the multiple partitioned datasource issue as well.
This sounds like an excellent use-case for Matillion ETL for Redshift.
(Full disclosure: I am the product manager for Matillion ETL for Redshift)
Matillion is an ELT tool - it will Extract data from your (numerous) SQL server databases and Load them, via an efficient Redshift COPY, into some staging tables (which can be stored inside Redshift in the usual way, or can be held on S3 and accessed from Redshift via Spectrum). From there you can add Transformation jobs to clean/filter/join (and much more!) into nice queryable star-schemas for your reporting users.
If the table schemas on your 40+ databases are very similar (your question doesn't clarify how you are breaking your data down into those servers - horizontal or vertical) you can parameterise the connection details in your jobs and use iteration to run them over each source database, either serially or with a level of parallelism.
Pushing down transformations to Redshift works nicely because all of those transformation queries can utilize the power of a massively parallel, scalable compute architecture. Workload Management configuration can be used to ensure ETL and User queries can happen concurrently.
Also, you may have other sources of data you want to mash-up inside your Redshift cluster, and Matillion supports many more - see https://www.matillion.com/etl-for-redshift/integrations/.
You can use AWS DMS for this.
Steps:
set up and configure DMS instance
set up target endpoint for redshift
set up source endpoints for each sql server instance see
https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.SQLServer.html
set up a task for each sql server source, you can specify the tables
to copy/synchronise and you can use a transformation to specify
which schema name(s) on redshift you want to write to.
You will then have all of the data in identical schemas on redshift.
If you want to query all those together, you can do that by wither running some transformation code inside redsshift to combine and make new tables. Or you may be able to use views.

How to perform Lookups in Azure Data Factory?

I'm a SSIS Developer. I do lots of SQL stored procedure lookup concepts in SSIS. But when coming to Azure Data Factory I haven't any idea how to perform a lookup using a SQL stored procedure.
Could anyone please guide me on this?
Thanks in advance !
Jay
Azure Data Factory (ADF) is more of an ELT tool rather than ETL, therefore direct lookups are not supported. Instead, this type of operation, along with other transforms is pushed down into the compute you are actually using. For example, if you are moving data to SQL Server, Azure SQL Database or Azure SQL Data Warehouse, you would ensure all data is on the same server and use a Stored Procedure task to execute the lookups using T-SQL and joins. If you are using Azure Data Lake Analytics (ADLA) you would use the U-SQL Activity to run U-SQL or execute ADLA stored procedures, again doing lookups via joins or custom U-SQL code such as Combiner, Applier, Reducer. In fact you can use any of the ADF compute options like SQL, HDInsight (including Hive, Pig, Map Reduce, Streaming and Spark script), Machiine Learning or custom .net activities.
So you need to think about things differently with ADF. Have a look through this article to gain greater understanding of transforming data in ADF:
Transform data in Azure Data Factory
https://learn.microsoft.com/en-us/azure/data-factory/data-factory-data-transformation-activities
As an aside, I would rarely use Lookups in SSIS as performance in early versions used to be poor. Although this has been improved in later versions, generally if you can do it in SQL you probably should. This pattern harnesses the power of SQL Server, rather than dragging data up into the SSIS pipeline, eg for the purposes of lookups (which are essentially joins) and pushing the data back out again. I reserve Data Flow transformations mainly when non-relational data is involved, eg xml or joining your email server with relational data. This is my personal view anyway : )

Best way to perform distributed SQL query and joins, calling from .Net code

Here's my scenario:
I have to query two PeopleSoft Databases on different servers (both are SQL Server 2000) and do a join of the data. My application is a .Net application (BizTalk).
I'm wondering what the best option is with regards to performance?
use standard select queries to get data
and do the join in memory (e.g. LINQ) for example
generated complex dynamic queries using LINKED Server, e.g.
select blah
from Server1.HRDB.dbo.MyTable1
left join Server2.FinanceDb.dbo.MyTable2
use standard select queries to get the data into an intermediate / staging sql server database and do my queries / joins on this database instead.
should I consider using SSIS? ( are there features here that might be better than doing an in-memory, e.g. LINQ? )
I wish I could use stored procedures on the source database, but the owners of the PeopleSoft database refuse it
The main constraints we have is that the source database is old (SQL Server 2000) and that performance of the source database is paramount. Whatever queries I run on this server must not block the other users. Hence, the DBAs are adamant about no Stored Procedures. They also believe that queries involving Linked Servers will trump (i.e. take higher priority) to other queries being run against the the database.
Any feedback would be greatly appreciated.
Thanks!
Update: additional background information on the project
We are primarily integrating PeopleSoft databases (the HR and Finance) into another product. Some are simple - like AccountCode and Department. Others are more complex, like the personal data, job, and leave accrual. Some are real-time, other's are scheduled, and other's are 'batch' (e.g. at payroll runs).
Regardless, we have to get source data out of PeopleSoft database -- and my hope had been to let the (source) database do the 'heavy' lifting by executing SQL Queries. I don't really want BizTalk, or SSIS, or C# LINQ to be the ones doing the transformations/filtering.
Definitely open to suggestions.

Resources