SQL Server Merge Update With Partial Sources

I have a target table for which partial data arrives at different times from 2 departments. The keys they use are the same, but the fields they provide are different. Most of the rows they provide have common keys, but there are some rows that are unique to each department. My question is about the fields, not the rows:
Scenario:
The target table has a key and 30 fields.
Dept. 1 provides fields 1-20.
Dept. 2 provides fields 21-30.
Suppose I loaded Q1 data from Dept. 1, and that created new rows 100-199 and populated fields 1-20. Later, I receive Q1 data from Dept. 2. Can I execute the same merge code I previously used for Dept. 1 to update rows 100-199 and populate fields 21-30 without unintentionally changing fields 1-20? Alternatively, would I have to tailor separate merge code for each Dept.?
In other words, does (or can) "Merge / Update" operate only on target fields that are present in the source table while ignoring target fields that are NOT present in the source table? In this way, Dept. 1 fields would NOT be modified when merging Dept. 2, or vice-versa, in the event I get subsequent corrections to this data from either Dept.

You can use a MERGE statement, where you define a source and a target, and what happens when a record is found in both, only in the source, or only in the target; you can even extend it with custom logic, such as "it's only in the source and it's older than X" or "it's from department Y". Crucially, the UPDATE in a MERGE only touches the columns you explicitly list in its SET clause, so target fields the source doesn't provide are left unchanged.
-- I'm skipping fields 2-20 and 22-30, just to keep this shorter.
create table #target (
    id int primary key,
    field1 varchar(100), -- and so on until 20
    field21 varchar(100) -- and so on until 30
)
create table #dept1 (
    id int primary key,
    field1 varchar(100)
)
create table #dept2 (
    id int primary key,
    field21 varchar(100)
)
/*
Creates some data to merge into the target.
The expected result is:
| id | field1 | field21 |
| - | - | - |
| 1 | dept1: 1 | dept2: 1 |
| 2 | | dept2: 2 |
| 3 | dept1: 3 | |
| 4 | dept1: 4 | dept2: 4 |
| 5 | | dept2: 5 |
*/
insert into #dept1 values
(1,'dept1: 1'),
--(2,'dept1: 2'),
(3,'dept1: 3'),
(4,'dept1: 4')
insert into #dept2 values
(1,'dept2: 1'),
(2,'dept2: 2'),
--(3,'dept2: 3'),
(4,'dept2: 4'),
(5,'dept2: 5')
-- Insert the data from the first department. This could also be a merge, if necessary.
insert into #target(id, field1)
select id, field1 from #dept1
merge into #target t
using (select id, field21 from #dept2) as source_data(id, field21)
on (source_data.id = t.id)
when matched then update set field21 = source_data.field21
when not matched by source and t.field21 is not null then delete -- you can even use merge to remove records that match your criteria
when not matched by target then insert (id, field21) values (source_data.id, source_data.field21); -- a MERGE statement must be terminated with a semicolon
select * from #target
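To answer the second part of the question: yes, you tailor a separate MERGE per department, and that is exactly what keeps the other department's fields safe, because an UPDATE only modifies the columns named in its SET clause. A minimal sketch of the matching Dept. 1 merge, assuming the same temp tables as above:
-- Only field1 appears in the SET and INSERT column lists, so field21 is
-- never modified when Dept. 1 data (or later corrections) arrives.
merge into #target t
using (select id, field1 from #dept1) as source_data(id, field1)
on (source_data.id = t.id)
when matched then update set field1 = source_data.field1
when not matched by target then insert (id, field1) values (source_data.id, source_data.field1);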

Related

Break up the data in a database column of a record into multiple records

Azure SQL Server - we have a table like this:
MyTable:
ID Source ArticleText
-- ------ -----------
1 100 <nvarchar(max) field with unstructured text from media articles>
2 145 "
3 866 "
4 232 "
The ID column is the primary key and auto-increments on insert.
I run this query to find the records with the largest data size in the ArticleText column:
SELECT TOP 500
    ID, Source, DATALENGTH(ArticleText)/1048576 AS Size_in_MB
FROM MyTable
ORDER BY DATALENGTH(ArticleText) DESC
We are finding that for many reasons both technical and practical, the data in the ArticleText column is just too big in certain records. The above query allows me to look at a range of sizes for our largest records, which I'll need to know for what I'm trying to formulate here.
The feat I need to accomplish is: for every existing record in this table whose ArticleText DATALENGTH is greater than some threshold, break that record into N records, where each resulting record keeps the same value in the Source column but the ArticleText data is split across those records in smaller chunks.
How would one achieve this if the exact requirement were, say: take all records whose ArticleText DATALENGTH is greater than 10MB and break each into 3 records, where the resulting records' Source column value is the same across the 3 records but the ArticleText data is separated into three chunks?
In essence, we would need to divide the DATALENGTH by 3 and apply the first 1/3 of the text data to the first record, 2nd 1/3 to the 2nd record, and the 3rd 1/3 to the third record.
Is this even possible in SQL Server?
You can use the following code to create a side table with the needed data:
CREATE TABLE #mockup (ID INT IDENTITY, [Source] INT, ArticleText NVARCHAR(MAX));
INSERT INTO #mockup([Source],ArticleText) VALUES
(100,'This is a very long text with many many words and it is still longer and longer and longer, and even longer and longer and longer')
,(200,'A short text')
,(300,'A medium text, just long enough to need a second part');
DECLARE @partSize INT = 50;
WITH recCTE AS
(
SELECT ID,[Source]
,1 AS FragmentIndex
,A.Pos
,CASE WHEN A.Pos>0 THEN LEFT(ArticleText,A.Pos) ELSE ArticleText END AS Fragment
,CASE WHEN A.Pos>0 THEN SUBSTRING(ArticleText,A.Pos+2,DATALENGTH(ArticleText)/2) END AS RestString
FROM #mockup
CROSS APPLY(SELECT CASE WHEN DATALENGTH(ArticleText)/2 > @partSize
                        THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(ArticleText,@partSize)))
                        ELSE -1 END AS Pos) A
UNION ALL
SELECT r.ID,r.[Source]
,r.FragmentIndex+1
,A.Pos
,CASE WHEN A.Pos>0 THEN LEFT(r.RestString,A.Pos) ELSE r.RestString END
,CASE WHEN A.Pos>0 THEN SUBSTRING(r.RestString,A.Pos+2,DATALENGTH(r.RestString)/2) END AS RestString
FROM recCTE r
CROSS APPLY(SELECT CASE WHEN DATALENGTH(r.RestString)/2 > @partSize
                        THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(r.RestString,@partSize)))
                        ELSE -1 END AS Pos) A
WHERE DATALENGTH(r.RestString)>0
)
SELECT ID,[Source],FragmentIndex,Fragment
FROM recCTE
ORDER BY [Source],FragmentIndex;
GO
DROP TABLE #mockup
The result
+----+--------+---------------+---------------------------------------------------+
| ID | Source | FragmentIndex | Fragment |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 1 | This is a very long text with many many words and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 2 | it is still longer and longer and longer, and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 3 | even longer and longer and longer |
+----+--------+---------------+---------------------------------------------------+
| 2 | 200 | 1 | A short text |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 1 | A medium text, just long enough to need a second |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 2 | part |
+----+--------+---------------+---------------------------------------------------+
Now you have to update the existing row with the value at FragmentIndex = 1, and insert the values with FragmentIndex > 1. Do this sorted by FragmentIndex, and your IDENTITY ID column will reflect the correct order.
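A minimal sketch of that write-back, assuming the CTE output was first captured in a hypothetical staging table #fragments(ID, Source, FragmentIndex, Fragment):
-- 1) Overwrite each original row with its first fragment.
UPDATE t
SET t.ArticleText = f.Fragment
FROM MyTable t
JOIN #fragments f ON f.ID = t.ID AND f.FragmentIndex = 1;
-- 2) Append the remaining fragments as new rows; the ORDER BY makes the
--    IDENTITY values follow the fragment order.
INSERT INTO MyTable ([Source], ArticleText)
SELECT [Source], Fragment
FROM #fragments
WHERE FragmentIndex > 1
ORDER BY ID, FragmentIndex;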

Return all results from left table when NULL present in right table and results of inner join when no null present in right table

Hi, just wondering if this scenario is possible?
I have two tables and a relationship table that creates a many-to-many relationship between the two. See the tables below for a simple representation:
Security groups:
| Security ID | Security Group |
| 1 | Admin |
| 2 | Basic |
Relationship table:
| Security ID | Access ID |
| 1 | NULL |
| 2 | 1 |
Functions:
| Function ID | Function Code |
| 1 | Search |
| 2 | Delete |
What I want to achieve is while checking the relationship table I want to return all functions a user on a security group has access to. If the user is assigned to a security group that contains a NULL value in the relationship table then grant them access to all functions.
For instance, a user on the "Basic" security group would have access to the search function while a user on the "Admin" security group should have access to both Search and Delete.
The reason it is set up this way is that a user can have 0 to many security groups, and the list of functions is very large, which calls for a whitelist of functions you can access rather than a blacklist of functions you can't.
Thank you for your time.
Sample data for your tables:
CREATE TABLE #G
(
Security_ID INT,
Security_Group VARCHAR(32)
)
INSERT INTO #G
VALUES (1, 'Admin'), (2, 'Basic')
CREATE TABLE #A
(
Security_ID INT,
Access_ID INT
)
INSERT INTO #A
VALUES (1, NULL), (2, 1)
CREATE TABLE #F
(
Function_ID INT,
Function_CODE VARCHAR(32)
)
INSERT INTO #F
VALUES (1, 'Search'), (2, 'Delete')
Query (the OR condition makes a NULL Access_ID act as a wildcard that matches every function):
SELECT #G.Security_Group, #F.Function_CODE
FROM #G
JOIN #A ON #G.Security_ID = #A.Security_ID
JOIN #F ON #F.Function_ID = #A.Access_ID OR #A.Access_ID IS NULL
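With the sample rows above, this returns exactly the behavior the question asks for:
| Security_Group | Function_CODE |
| Admin | Search |
| Admin | Delete |
| Basic | Search |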
Dropping the sample tables:
DROP TABLE #G
DROP TABLE #A
DROP TABLE #F

Lookup primary ID from multiple tables having another (unique) field

I'm trying to add values in a junction table of a many to many relationship.
Tables look like these (all IDs are integers):
Table A
+------+----------+
| id_A | ext_id_A |
+------+----------+
| 1 | 100 |
| 2 | 101 |
| 3 | 102 |
+------+----------+
Table B is conceptually similar
+------+----------+
| id_B | ext_id_B |
+------+----------+
| 1 | 200 |
| 2 | 201 |
| 3 | 202 |
+------+----------+
The tables' PKs are id_A and id_B, and the columns in my junction table are FKs to those columns, but I have to insert values having only the external IDs (ext_id_A, ext_id_B).
The external ID columns are unique (and therefore 1:1 with the table's own ID), so given an ext_id I can look up the exact row and get the ID I need to insert into the junction table.
This is an example of what I've done so far, but it doesn't look like an optimized SQL statement:
-- Example table I receive, with test values
declare @temp as table (
    ext_id_a int not null,
    ext_id_b int not null
);
insert into @temp values (100, 200), (101, 200), (101, 201);
-- Insertion code from my stored procedure
declare @final as table (
    id_a int not null,
    id_b int not null
);
insert into @final
select a.id_a, b.id_b
from @temp as t
inner join table_a a on a.ext_id_a = t.ext_id_a
inner join table_b b on b.ext_id_b = t.ext_id_b;
merge into junction_table as jt
using @final as f
on f.id_a = jt.id_a and f.id_b = jt.id_b
when not matched by target then
insert (id_a, id_b) values (f.id_a, f.id_b);
I was thinking about a MERGE statement since my stored procedure receives data in a table-valued parameter and I also have to check for already existing references.
Is there anything I can do to improve the insertion of these values?
No need to use the @final table variable:
; with cte as (
    select tA.id_A, tB.id_B
    from @temp t
    join table_A tA on t.ext_id_a = tA.ext_id_A
    join table_B tB on t.ext_id_B = tB.ext_id_B
)
merge into junction_table
using cte
on cte.id_A = junction_table.id_A and cte.id_B = junction_table.id_B
when not matched by target then
insert (id_A, id_B) values (cte.id_A, cte.id_B);
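If you don't need MERGE's other branches, a plain INSERT with NOT EXISTS covers the "skip existing references" requirement just as well; a sketch under the same table assumptions:
insert into junction_table (id_A, id_B)
select tA.id_A, tB.id_B
from @temp t
join table_A tA on t.ext_id_a = tA.ext_id_A
join table_B tB on t.ext_id_B = tB.ext_id_B
-- skip pairs that are already present in the junction table
where not exists (
    select 1
    from junction_table jt
    where jt.id_A = tA.id_A and jt.id_B = tB.id_B
);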

SQL Server : Bulk insert a Datatable into 2 tables

Consider this datatable:
word wordCount documentId
---------- ------- ---------------
Ball 10 1
School 11 1
Car 4 1
Machine 3 1
House 1 2
Tree 5 2
Ball 4 2
I want to insert this data into two tables with this structure:
Table WordDictionary
(
Id int,
Word nvarchar(50),
DocumentId int
)
Table WordDetails
(
Id int,
WordId int,
WordCount int
)
FOREIGN KEY (WordId) REFERENCES WordDictionary(Id)
But because I have thousands of records in the initial table, I have to do this in a single transaction (a batch query); for example, a bulk insert could help accomplish this.
The question here is how I can separate this data into the two tables WordDictionary and WordDetails.
For more details:
The final result must be like this:
Table WordDictionary:
Id word
---------- -------
1 Ball
2 School
3 Car
4 Machine
5 House
6 Tree
and table WordDetails:
Id wordId WordCount DocumentId
---------- ------- ----------- ------------
1 1 10 1
2 2 11 1
3 3 4 1
4 4 3 1
5 5 1 2
6 6 5 2
7 1 4 2
Notice:
The words in the source can be duplicated, so I must check for each word's existence in WordDictionary before inserting, and if a word is already there, its existing ID must be used in WordDetails (see the word Ball).
Finally, the million-dollar problem: this insertion must be done as fast as possible.
If you're looking to just load the tables the first time, without any updates to them over time, you could potentially do it this way (I'm assuming you've already created the tables you're loading into):
You can put all of the distinct words from the datatable into the WordDictionary table first:
INSERT INTO WordDictionary (Word) -- assumes WordDictionary.Id is an IDENTITY column, as in the setup further below
SELECT DISTINCT word
FROM datatable;
Then, after you populate WordDictionary, you can use its ID values and the rest of the information from datatable to load your WordDetails table:
INSERT INTO WordDetails (WordId, WordCount, DocumentId)
SELECT WD.Id, DT.wordCount, DT.documentId
FROM datatable AS DT
INNER JOIN WordDictionary AS WD ON WD.word = DT.word;
There is a little discrepancy between the declared table schema and your example data, but it is resolved below:
1) Setup
-- this the table with the initial data
-- drop table DocumentWordData
create table DocumentWordData
(
Word NVARCHAR(50),
WordCount INT,
DocumentId INT
)
GO
-- these are result table with extra information (identity, primary key constraints, working foreign key definition)
-- drop table WordDictionary
create table WordDictionary
(
Id int IDENTITY(1, 1) CONSTRAINT PK_WordDictionary PRIMARY KEY,
Word nvarchar(50)
)
GO
-- drop table WordDetails
create table WordDetails
(
Id int IDENTITY(1, 1) CONSTRAINT PK_WordDetails PRIMARY KEY,
WordId int CONSTRAINT FK_WordDetails_Word REFERENCES WordDictionary,
WordCount int,
DocumentId int
)
GO
2) The actual script to put data in the last two tables
begin tran
-- this is to make sure that if anything in this block fails, then everything is automatically rolled back
set xact_abort on
-- the dictionary is obtained by considering all distinct words
insert into WordDictionary (Word)
select distinct Word
from DocumentWordData
-- details are generating from initial data joining the word dictionary to get word id
insert into WordDetails (WordId, WordCount, DocumentId)
SELECT W.Id, DWD.WordCount, DWD.DocumentId
FROM DocumentWordData DWD
JOIN WordDictionary W ON W.Word = DWD.Word
commit
-- just to test the results
select * from WordDictionary
select * from WordDetails
I expect this script to run very fast, as long as you do not have a very large number of records (millions or more).
This is the query; I'm using temp tables to be able to test it. If you use the two CTEs, you'll be able to generate the final result.
1. Setting up sample data for the test.
create table #original (word varchar(10), wordCount int, documentId int)
insert into #original values
('Ball', 10, 1),
('School', 11, 1),
('Car', 4, 1),
('Machine', 3, 1),
('House', 1, 2),
('Tree', 5, 2),
('Ball', 4, 2)
2. Use cte1 and cte2. In your real database, you need to replace #original with the actual table that holds all the initial records.
;with cte1 as (
    select ROW_NUMBER() over (order by word) Id, word
    from #original
    group by word
)
select * into #WordDictionary
from cte1
;with cte2 as (
    select ROW_NUMBER() over (order by o.word) Id,
           d.Id as wordId, o.word, o.wordCount, o.documentId
    from #WordDictionary d
    inner join #original o on o.word = d.word
)
select * into #WordDetails
from cte2
select * from #WordDetails
This will be the data in #WordDetails:
+----+--------+---------+-----------+------------+
| Id | wordId | word | wordCount | documentId |
+----+--------+---------+-----------+------------+
| 1 | 1 | Ball | 10 | 1 |
| 2 | 1 | Ball | 4 | 2 |
| 3 | 2 | Car | 4 | 1 |
| 4 | 3 | House | 1 | 2 |
| 5 | 4 | Machine | 3 | 1 |
| 6 | 5 | School | 11 | 1 |
| 7 | 6 | Tree | 5 | 2 |
+----+--------+---------+-----------+------------+

Find "regional" relationships in SQL data using a query, or SSIS

Edit for clarification: I am compiling data weekly, based on Zip_Code, but some Zip_Codes are redundant. I know I should be able to compile a small amount of data, and derive the redundant zip_codes if I can establish relationships.
I want to define a zip code's region by the unique set of items and values that appear in that zip code, in order to create a "Region Table"
I am looking to find relationships by zip code with certain data. Ultimately, I have tables which include similar values for many zip codes.
I have data similar to:
ItemCode | Value | Zip_Code
---------|-------|---------
1        | 10    | 1
2        | 15    | 1
3        | 5     | 1
1        | 10    | 2
2        | 15    | 2
3        | 5     | 2
1        | 10    | 3
2        | 10    | 3
3        | 15    | 3
Or to simplify the idea, I could even concatenate ItemCode + Value into unique values:
ItemCode+Value | Zip_Code
---------------|---------
A              | 1
B              | 1
C              | 1
A              | 2
B              | 2
C              | 2
A              | 3
D              | 3
E              | 3
As you can see, Zip_Codes 1 and 2 have the same distinct ItemCode and Value. Zip_Code 3, however, has different values for certain ItemCodes.
I need to create a table that establishes a relationship between Zip_Codes that contain the same data.
The final table will look something like:
Zip_Code | Region
1 | 1
2 | 1
3 | 2
4 | 2
5 | 1
6 | 3
...etc
This will allow me to collect data only once for each unique Region, and derive the zip_code appropriately.
Things I'm doing now:
I am currently using a query, similar to a self-join, that compares data between two Zip_Codes, something along the lines of:
SELECT a.ItemCode
,a.value
,a.zip_code
,b.ItemCode
,b.value
,b.zip_code
FROM mytable as a, mytable as b -- select from table twice, similar to a join
WHERE a.zip_code = 1 -- left table will have all ItemCode and Value from zip 1
AND b.zip_code = 2 -- right table will have all ItemCode and Value from zip 2
AND a.ItemCode = b.ItemCode -- matches rows on ItemCode
AND a.Value != b.Value
ORDER BY a.ItemCode -- qualified, since both a and b expose ItemCode
This returns nothing if the two zip codes have exactly the same ItemCode and Value pairs, and returns a slew of differences between the two zip codes if there are differences.
This needs to move from a manual process to an automated process however, as I am now working with more than 100 zip_codes.
I do not have much programming experience in specific languages, so tools in SSIS are somewhat limited to me. I have some experience using the Fuzzy tools, and feel like there might be something in Fuzzy Grouping that might shine a light on apparent regions, but can't figure out how to set it up.
Does anyone have any suggestions? I have access to SQL Server and its related tools, and Visual Studio. I am trying to avoid writing a program to automate this, as my C# skills are relatively nooby, but I will figure it out if necessary.
Sorry for being so verbose: this is my first question, and the page I agreed to in order to ask suggested explaining in detail and talking about what I've tried...
Thanks in advance for any help I might receive.
Give this a shot (I used the simplified example, but this can easily be expanded). I think the really interesting part of this code is the recursive CTE... Note that data(ZipCode, Val) below stands for the simplified table above, with Val playing the role of the concatenated ItemCode+Value.
;with matches as (
--Find all pairs of zip_codes that have matching values.
select d1.ZipCode zc1, d2.ZipCode zc2
from data d1
join data d2 on d1.Val=d2.Val
group by d1.ZipCode, d2.ZipCode
having count(*) = (select count(distinct Val) from data where zipcode = d1.Zipcode)
), cte as (
-- Trace each zip_code to its "smallest" matching zip_code id.
select zc1 tempRegionID, zc2 ZipCode
from matches
where zc1<=zc2
UNION ALL
select c.tempRegionID, m.zc2
from cte c
join matches m on c.ZipCode=m.zc1
and c.ZipCode!=m.zc2
where m.zc1<=m.zc2
)
-- For each zip_code, use its smallest matching zip_code as its region.
select zipCode, min(tempRegionID) as regionID
from cte
group by ZipCode
Demonstrating that there's a use for everything, though normally it makes me cringe: concatenate the values for each zip code into a single field. Store ZipCode and ConcatenatedValues in a lookup table (PK on the one, UQ on the other). Now you can assess which zip codes are in the same region by grouping on ConcatenatedValues.
Here's a simple function to concatenate text data:
CREATE TYPE dbo.List AS TABLE
(
    Item VARCHAR(1000)
)
GO
CREATE FUNCTION dbo.Implode (@List dbo.List READONLY, @Separator VARCHAR(10) = ',') RETURNS VARCHAR(MAX)
AS BEGIN
    DECLARE @Concat VARCHAR(MAX)
    SELECT @Concat = CASE WHEN Item IS NULL THEN @Concat ELSE COALESCE(@Concat + @Separator, '') + Item END FROM @List
    RETURN @Concat
END
GO
DECLARE @List AS dbo.List
INSERT INTO @List (Item) VALUES ('A'), ('B'), ('C'), ('D')
SELECT dbo.Implode(@List, ',')
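For the region assignment itself, here is a minimal sketch of the grouping step, assuming the same simplified data(ZipCode, Val) table as the recursive-CTE answer; STRING_AGG requires SQL Server 2017+, so on older versions you would build ConcatenatedValues with dbo.Implode or FOR XML PATH instead:
with zipValues as (
    -- one row per zip code, with its distinct values concatenated in a
    -- deterministic order so that identical value sets compare equal
    select ZipCode,
           string_agg(Val, ',') within group (order by Val) as ConcatenatedValues
    from (select distinct ZipCode, Val from data) d
    group by ZipCode
)
-- identical concatenations share a region number
select ZipCode,
       dense_rank() over (order by ConcatenatedValues) as Region
from zipValues;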
