How can I join data from both cosmos and sql database? - sql-server

I have a requirement where I need to get data from different database i.e. cosmos and sql.
How can I join both the table and get the data?
Below is the data that needs to be fetched. The common column in both is DossierGloabalId which can be use to join both databases tables.
Name--SQL
TaxableYear--SQL
Period--COSMOS
DossierType--SQL
VATType--COSMOS
LastUpdated--SQL

There is no way to do a join between SQL Server and Cosmos DB, as they are two completely different database engines.
Further: While Cosmos DB's SQL API does have a SQL-based query language, it's not a relational database, and is not compatible with SQL Server.
You'll need to do separate sets of queries: one against Cosmos DB, and one against SQL Server.

If you're asking to perform a relational join on data stored in Azure SQL with data stored in CosmosDB, there is no out-of-the-box support for doing so. You are going to have to query records in Azure SQL, query the corresponding records in CosmosDB, and then join them together in your own application code. There are a bunch of approaches to how you go about this in your own application code, and it's highly dependent on your own application.

Related

SSIS, query Oracle table using ID's from SQL Server?

Here's the basic idea of what I want to do in SSIS:
I have a large query against a production Oracle database, and I need the following where clause that brings in a long list of ids from SQL Server. From there, the results are sent elsewhere.
select ...
from Oracle_table(s) --multi-join
where id in ([select distinct id from SQL_SERVER_table])
Alternatively, I could write the query this way:
select ...
from Oracle_table(s) --multi-join
...
join SQL_SERVER_table sst on sst.ID = Oracle_table.ID
Here are my limitations:
The Oracle query is large and cannot be run without the where id in (... clause
This means I cannot run the Oracle query, then join it against the ids in another step. I tried this, and the DBA's killed the temp table after it became 3 TB in size.
I have 160k id's
This means it is not practical to iterate through the id's one by one. In the past, I have run against ~1000 IDs, using a comma-separated list. It runs relatively fast - a few minutes.
The main query is in Oracle, but the ids are in SQL Server
I do not have the ability to write to Oracle
I've found many questions like this.
None of the answers I have found have a solution to my limitations.
Similar question:
Query a database based on result of query from another database
To prevent loading all rows from the Oracle table. The only way is to apply the filter in the Oracle database engine. I don't think this can be achieved using SSIS since you have more than 160000 ids in the SQL Server table, which cannot be efficiently loaded and passed to the Oracle SQL command:
Using Lookups and Merge Join will require loading all data from the Oracle database
Retrieving data from SQL Server, building a comma-separated string, and passing it to the Oracle SQL command cannot be done with too many IDs (160K).
The same issue using a Script Task.
Creating a Linked Server in SQL Server and Joining both tables will load all data from the Oracle database.
To solve your problem, you should search for a way to create a link to the SQL Server database from the Oracle engine.
Oracle Heterogenous Services
I don't have much experience in Oracle databases. Still, after a small research, I found something in Oracle equivalent to "Linked Servers" in SQL Server called "heterogeneous connectivity".
The query syntax should look like this:
select *
from Oracle_table
where id in (select distinct id from SQL_SERVER_table#sqlserverdsn)
You can refer to the following step-by-step guides to read more on how to connect to SQL Server tables from Oracle:
What is Oracle equivalent for Linked Server and can you join with SQL Server?
Making a Connection from Oracle to SQL Server - 1
Making a Connection from Oracle to SQL Server - 2
Heterogeneous Database connections - Oracle to SQL Server
Importing Data from SQL Server to a staging table in Oracle
Another approach is to use a Data Flow Task that imports IDs from SQL Server to a staging table in Oracle. Then use the staging table in your Oracle query. It would be better to create an index on the staging table. (If you do not have permission to write to the Oracle database, try to get permission to a separate staging database.)
Example of exporting data from SQL Server to Oracle:
Export SQL Server Data to Oracle using SSIS
Minimizing the data load from the Oracle table
If none of the solutions above solves your issue. You can try minimizing the data loaded from the Oracle database as much as possible.
As an example, you can try to get the Minimum and Maximum IDs from the SQL Server table, store both values within two variables. Then, you can use both variables in the SQL Command that loads the data from the Oracle table, like the following:
SELECT * FROM Oracle_Table WHERE ID > #MinID and ID < #MaxID
This will remove a bunch of useless data in your operation. In case your ID column is a string, you can use other measures to filter data, such as the string length, the first character.

Cross database queries.How to proper use cross database features?

I am investigating the possibility to split one DB into multiple. We decided to move some tables into another database, but we have queries with join on these tables. I found a few solutions about how to achieve that:
Azure SQL Database elastic query
EXTERNAL DATA SOURCE
But I don`t know what the difference between them and what to choose.
Thanks for any help!
Azure SQL Database Elastic Queries and External data sources are two names for the same concept.
My suggestion is to avoid cross database queries and avoid splitting one database into multiples because query performance involving external data sources won't be the same no matter what strategy you choose to query those external tables.
If you still want to stick with the plan of splitting the database into multiple databases, then know that cross database queries show good performance when the remote tables are not big. When remote tables are big, this article shows you how to perform joins remotely using table variables and improve performance. This other article shows you also how to push parameterized operations to remote databases and improve performance.
if you are thinking to split your DB into multiple SQL server DB with the different host then you can prefer Linked server which has flexible to join across SQL servers

One-Way Replication Into Different Schema

I'm looking for the best (best practice) option for one way replication between two databases. I would like to keep this purely SQL but, can write something in C# or use an ETL tool if there are no other good options.
Current setup:
DB1 - There are three instances of this database. It is a large relational database, the schema is the same for each but, they are separate data pots (no replication). Two databases on a 2012 server and one on a 2014 server
DB2 - There are two instances of this database on seperate servers (Europe, Americas) and the data is merge replicated between the two. The publisher is the 2014 server.
The Goal:
DB2 is tied to some reports. It has one table and a small application attached to that table. Users from many different countries enter data via a small application into DB2 and generate reports out of the application.
DB1 is a relational database that has a very large application on top of it but with fewer users. If users are using the application for DB1 then they should not need to duplicate their records into DB2.
There should be one-way replication from the multiple seperate DB1s into DB2. How quickly this happens is not too important.
The important things are:
No backwards replication occurs from DB2s > DB1s (Data only flows from DB1s into one of the DB2s)
Create, Update, and Delete actions should occur in DB2 based on the results
of a comparisson with DB1 (the one way replication)
Current Approach:
I currently have a flat sql view on each DB1 database that has the same schema as the table in the DB2 db's that the data needs to go into.
The servers are also joined as linked servers.
My though was to do a sort of manually written replication script on one of the DB2 databases that calls the views from the DB1s and does the CUD actions on a timed basis.
It seems to me that there should be an easier way though!?
Any thoughts on how to do this would be very much appreciated.
Keep in mind that since several of the DB1s exist on a SQL 2012 server that there may be some issues as 2012 might not be allowed to be a publisher for replication to a 2014 server.

Best way to perform distributed SQL query and joins, calling from .Net code

Here's my scenario:
I have to query two PeopleSoft Databases on different servers (both are SQL Server 2000) and do a join of the data. My application is a .Net application (BizTalk).
I'm wondering what the best option is with regards to performance?
use standard select queries to get data
and do the join in memory (e.g. LINQ) for example
generated complex dynamic queries using LINKED Server, e.g.
select blah
from Server1.HRDB.dbo.MyTable1
left join Server2.FinanceDb.dbo.MyTable2
use standard select queries to get the data into an intermediate / staging sql server database and do my queries / joins on this database instead.
should I consider using SSIS? ( are there features here that might be better than doing an in-memory, e.g. LINQ? )
I wish I could use stored procedures on the source database, but the owners of the PeopleSoft database refuse it
The main constraints we have is that the source database is old (SQL Server 2000) and that performance of the source database is paramount. Whatever queries I run on this server must not block the other users. Hence, the DBAs are adamant about no Stored Procedures. They also believe that queries involving Linked Servers will trump (i.e. take higher priority) to other queries being run against the the database.
Any feedback would be greatly appreciated.
Thanks!
Update: additional background information on the project
We are primarily integrating PeopleSoft databases (the HR and Finance) into another product. Some are simple - like AccountCode and Department. Others are more complex, like the personal data, job, and leave accrual. Some are real-time, other's are scheduled, and other's are 'batch' (e.g. at payroll runs).
Regardless, we have to get source data out of PeopleSoft database -- and my hope had been to let the (source) database do the 'heavy' lifting by executing SQL Queries. I don't really want BizTalk, or SSIS, or C# LINQ to be the ones doing the transformations/filtering.
Definitely open to suggestions.

How do I create a named query to join multiple data sources in SSAS 2005?

In the SQL Server 2005 books online section "Defining Named Queries in a Data Source View (Analysis Services)", it states:
A named query can also be used to join multiple database tables from one or more data sources into a single data source view table.
Does anyone know where I can find examples or tutorials on how this can be done?
EDIT: To provide some additional background...
I am working with an analysis services project in the SQL Server Business Intelligence Development Studio for SQL Server 2005. I have defined a data source for each of my databases which are on different servers. I am trying to create a named query which will be a union of a table from each data source. The problem is that the named query requires me to choose a single data source for the query. The query is executed against this data source which does not know anything about the data sources in my project. However, according to the SQL Server 2005 books online, what I am trying to accomplish should be possible based on my quote from above.
MSDN has this link describing Named Queries and this link walking you through the process of creating one.
Edit: I think that to use multiple datasources, you would need to fully qualify your table to hit other datasources when creating your query, like this:
SELECT user_id, first_name, 'DB1' as DB FROM users
UNION
SELECT user_id, first_name, 'DB2' as DB FROM Database2Name.dbo.users
to get results like
user_id first_name DB
1 Bob DB1
2 Joe DB1
11 Greg DB2
12 Mark DB2
If by "multiple data sources" you mean multiple databases, then you can do this if you fully qualify the database name.
For example if I have two databases I can do this:
SELECT * FROM DatabaseA.dbo.SomeTable
JOIN DatabaseB.dbo.OtherTable
ON DatabaseA.dbo.SomeTable.Id = DatabaseB.dbo.OtherTable.Id
Make sure that you don't forget the dbo bit (the owner), otherwise it won't work.
The only other sort of "multiple data sources" that I'm aware of is distributed queries which allows you to perform queries over multiple remote instances of sql server:
sp_addlinkedserver 'server\instance'
SELECT * FROM [server\instance].DatabaseA.dbo.SomeTable
JOIN DatabaseB.dbo.OtherTable
ON [server\instance].DatabaseA.dbo.SomeTable.Id = DatabaseB.dbo.OtherTable.Id

Resources