Find "regional" relationships in SQL data using a query, or SSIS - sql-server

Edit for clarification: I compile data weekly, keyed on Zip_Code, but some Zip_Codes are redundant. If I can establish the relationships between them, I should be able to collect a much smaller amount of data and derive the redundant zip_codes from it.
I want to define a zip code's region by the unique set of items and values that appear in that zip code, in order to create a "Region Table".
I am looking for relationships between zip codes in this data. Ultimately, I have tables that contain the same values across many zip codes.
I have data similar to:
ItemCode | Value | Zip_Code
---------|-------|---------
1        | 10    | 1
2        | 15    | 1
3        | 5     | 1
1        | 10    | 2
2        | 15    | 2
3        | 5     | 2
1        | 10    | 3
2        | 10    | 3
3        | 15    | 3
Or, to simplify the idea, I could even concatenate ItemCode + Value into unique values:
ItemCode+Value | Zip_Code
---------------|---------
A              | 1
B              | 1
C              | 1
A              | 2
B              | 2
C              | 2
A              | 3
D              | 3
E              | 3
As you can see, Zip_Code 1 and 2 have the same distinct ItemCode and Value pairs. Zip_Code 3, however, has different values for certain ItemCodes.
I need to create a table that establishes a relationship between Zip_Codes that contain the same data.
The final table will look something like:
Zip_Code | Region
---------|-------
1        | 1
2        | 1
3        | 2
4        | 2
5        | 1
6        | 3
...etc
This will allow me to collect data only once for each unique Region, and derive the zip_code appropriately.
Things I'm doing now:
I am currently comparing two zip codes at a time with a query that selects from the table twice (effectively a self-join), something along the lines of:
SELECT a.ItemCode
,a.value
,a.zip_code
,b.ItemCode
,b.value
,b.zip_code
FROM mytable as a, mytable as b -- select from table twice, similar to a join
WHERE a.zip_code = 1 -- left table will have all ItemCode and Value from zip 1
AND b.zip_code = 2 -- right table will have all ItemCode and Value from zip 2
AND a.ItemCode = b.ItemCode -- matches rows on ItemCode
AND a.Value != b.Value
ORDER BY a.ItemCode
This returns nothing if the two zip codes have exactly the same ItemCode and Value pairs, and returns a slew of rows showing the differences between the two zip codes if they differ.
This needs to move from a manual process to an automated one, however, as I am now working with more than 100 zip_codes.
I do not have much programming experience in specific languages, so the tools available to me in SSIS are somewhat limited. I have some experience with the Fuzzy tools, and I feel like Fuzzy Grouping might shine a light on apparent regions, but I can't figure out how to set it up.
Does anyone have any suggestions? I have access to SQL Server and its related tools, and Visual Studio. I am trying to avoid writing a program to automate this, as my C# skills are pretty green, but I will figure it out if necessary.
Sorry for being so verbose: this is my first question, and the guidelines I agreed to before asking suggested explaining in detail and describing what I've tried...
Thanks in advance for any help I might receive.

Give this a shot (I used the simplified example, but this can easily be expanded). I think the really interesting part of this code is the recursive CTE...
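For reference, here is a minimal setup for the simplified table the query below assumes (the names data, ZipCode and Val are taken straight from the query; the rows mirror the simplified ItemCode+Value example above):
--Sample data matching the simplified example.
create table data (ZipCode int, Val char(1));
insert into data (ZipCode, Val) values
(1,'A'),(1,'B'),(1,'C'),
(2,'A'),(2,'B'),(2,'C'),
(3,'A'),(3,'D'),(3,'E');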
;with matches as (
    --Find all pairs of zip_codes that have matching values.
    select d1.ZipCode zc1, d2.ZipCode zc2
    from data d1
    join data d2 on d1.Val = d2.Val
    group by d1.ZipCode, d2.ZipCode
    having count(*) = (select count(distinct Val) from data where ZipCode = d1.ZipCode)
), cte as (
    --Trace each zip_code to its "smallest" matching zip_code id.
    select zc1 tempRegionID, zc2 ZipCode
    from matches
    where zc1 <= zc2
    UNION ALL
    select c.tempRegionID, m.zc2
    from cte c
    join matches m on c.ZipCode = m.zc1
                  and c.ZipCode != m.zc2
    where m.zc1 <= m.zc2
)
--For each zip_code, use its smallest matching zip_code as its region.
select ZipCode, min(tempRegionID) as regionID
from cte
group by ZipCode
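Note that regionID here is the smallest matching zip code rather than the sequential 1, 2, 3 shown in the desired output. If you want sequential region numbers, a small sketch of an optional renumbering of the same final select:
--Optional: renumber the regions 1..N instead of using the smallest zip code as the id.
select ZipCode,
       dense_rank() over (order by min(tempRegionID)) as regionID
from cte
group by ZipCode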

Demonstrating that there's a use for everything, though normally it makes me cringe: concatenate the values for each zip code into a single field. Store ZipCode and ConcatenatedValues in a lookup table (PK on the one, UQ on the other). Now you can assess which zip codes are in the same region by grouping on ConcatenatedValues.
Here's a simple function to concatenate text data:
CREATE TYPE dbo.List AS TABLE
(
    Item VARCHAR(1000)
)
GO
CREATE FUNCTION dbo.Implode (@List dbo.List READONLY, @Separator VARCHAR(10) = ',') RETURNS VARCHAR(MAX)
AS BEGIN
    DECLARE @Concat VARCHAR(MAX)
    -- Skip NULL items; add the separator only once a first item has been collected.
    SELECT @Concat = CASE WHEN Item IS NULL THEN @Concat ELSE COALESCE(@Concat + @Separator, '') + Item END FROM @List
    RETURN @Concat
END
GO
DECLARE @List AS dbo.List
INSERT INTO @List (Item) VALUES ('A'), ('B'), ('C'), ('D')
SELECT dbo.Implode(@List, ',')
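And a rough sketch of the grouping step itself, against the question's mytable (ItemCode, Value, Zip_Code). For a single set-based pass I've used FOR XML PATH for the per-zip concatenation instead of feeding dbo.Implode row by row, and DENSE_RANK() to hand out region numbers; the region numbers are arbitrary but consistent, which is all the lookup table needs:
WITH zipStrings AS
(
    SELECT Zip_Code,
           ConcatenatedValues = STUFF((SELECT ',' + CAST(i.ItemCode AS VARCHAR(20)) + ':' + CAST(i.Value AS VARCHAR(20))
                                       FROM mytable i
                                       WHERE i.Zip_Code = o.Zip_Code
                                       ORDER BY i.ItemCode, i.Value   -- fixed order so identical sets give identical strings
                                       FOR XML PATH('')), 1, 1, '')
    FROM mytable o
    GROUP BY Zip_Code
)
SELECT Zip_Code,
       Region = DENSE_RANK() OVER (ORDER BY ConcatenatedValues)
FROM zipStrings;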

Related

SQL Server Merge Update With Partial Sources

I have a target table for which partial data arrives at different times from 2 departments. The keys they use are the same, but the fields they provide are different. Most of the rows they provide have common keys, but there are some rows that are unique to each department. My question is about the fields, not the rows:
Scenario
the target table has a key and 30 fields.
Dept. 1 provides fields 1-20
Dept. 2 provides fields 21-30
Suppose I loaded Q1 data from Dept. 1, and that created new rows 100-199 and populated fields 1-20. Later, I receive Q1 data from Dept. 2. Can I execute the same merge code I previously used for Dept. 1 to update rows 100-199 and populate fields 21-30 without unintentionally changing fields 1-20? Alternatively, would I have to tailor separate merge code for each Dept.?
In other words, does (or can) MERGE / UPDATE operate only on target fields that are present in the source table while ignoring target fields that are NOT present in the source table? That way, Dept. 1 fields would NOT be modified when merging Dept. 2 data, or vice versa, in the event I get subsequent corrections to this data from either Dept.
You can use a MERGE statement, where you define a source and a target and specify what happens when a record is found in both, only in the source, or only in the target. You can even extend those branches with custom logic, such as "it is only in the source and is older than X" or "it comes from department Y".
-- I'm skipping the fields 2-20 and 22-30, just to make this shorter.
create table #target (
    id int primary key,
    field1 varchar(100), -- and so on until 20
    field21 varchar(100) -- and so on until 30
)
create table #dept1 (
    id int primary key,
    field1 varchar(100)
)
create table #dept2 (
    id int primary key,
    field21 varchar(100)
)
/*
Creates some data to merge into the target.
The expected result is:
| id | field1 | field21 |
| - | - | - |
| 1 | dept1: 1 | dept2: 1 |
| 2 | | dept2: 2 |
| 3 | dept1: 3 | |
| 4 | dept1: 4 | dept2: 4 |
| 5 | | dept2: 5 |
*/
insert into #dept1 values
(1,'dept1: 1'),
--(2,'dept1: 2'),
(3,'dept1: 3'),
(4,'dept1: 4')
insert into #dept2 values
(1,'dept2: 1'),
(2,'dept2: 2'),
--(3,'dept2: 3'),
(4,'dept2: 4'),
(5,'dept2: 5')
-- Inserts the data from the first department. This could also be a merge, if necessary.
insert into #target(id, field1)
select id, field1 from #dept1
merge into #target t
using (select id, field21 from #dept2) as source_data(id, field21)
on (source_data.id = t.id)
when matched then update set field21=source_data.field21
when not matched by source and t.field21 is not null then delete -- you can even use merge to remove some records that match your criteria
when not matched by target then insert (id, field21) values (source_data.id, source_data.field21); -- Every merge statement should end with ;
select * from #target
You can see this code running on this DB Fiddle

Query from multiple XML columns in SQL Server

Environment: SQL Server 2016; abstracted example of "real" data.
From a first query to a table containing XML data I have a SQL result set that has the following columns:
ID (int)
Names (XML)
Times (XML)
Values (XML)
Columns 2-4 contain multiple values in XML format, e.g.
Names:
Row 1: <name>TestR1</name><name>TestR2</name>...
Row 2: <name>TestS1</name><name>TestS2</name>...
Times:
Row 1: <time>0.1</time><time>0.2</time>...
Row 2: <time>-0.1</time><time>-0.2</time>...
Values:
Row 1: <value>1.1</value><value>1.2</value>...
Row 2: <value>-1.1</value><value>-1.2</value>...
The XML in all of the XML columns contains exactly the same number of elements.
What I want now is to create a select that has the following output:
| ID | Name | Time | Value |
+----+--------+------+-------+
| 1 | TestR1 | 0.1 | 1.1 |
| 1 | TestR1 | 0.2 | 1.2 |
| .. | ...... | .... | ..... |
| 2 | TestS1 | -0.1 | -1.1 |
| 2 | TestS2 | -0.2 | -1.2 |
| .. | ...... | .... | ..... |
For a single column CROSS APPLY works fine:
SELECT ID, N.value('.', 'nvarchar(50)') AS ExtractedName
FROM <source>
CROSS APPLY <source>.nodes('/name') AS T(N)
Applying multiple CROSS APPLY statements makes no sense here to me.
I would guess it could work if I created a select for each column, produced individual result sets, and then selected over all of them,
but that is very likely not the best solution, as I would be duplicating selects for each additional column.
Any suggestions on how to design a query like that would be highly appreciated!
I'd suggest this approach:
First I create a declared table variable and fill it with your sample data to simulate your issue. This is called "MCVE", please try to provide this yourself in your next question.
DECLARE #tbl TABLE(ID INT, Names XML,Times XML,[Values] XML);
INSERT INTO #tbl VALUES
(1,'<name>TestR1</name><name>TestR2</name>','<time>0.1</time><time>0.2</time>','<value>1.1</value><value>1.2</value>')
,(2,'<name>TestS1</name><name>TestS2</name>','<time>0.3</time><time>0.4</time>','<value>2.1</value><value>2.2</value>');
--The query
SELECT t.ID
,t.Names.value('(/name[sql:column("tally.Nmbr")])[1]','nvarchar(max)') AS [Name]
,t.Times.value('(/time[sql:column("tally.Nmbr")])[1]','decimal(10,4)') AS [Time]
,t.[Values].value('(/value[sql:column("tally.Nmbr")])[1]','decimal(10,4)') AS [Value]
FROM #tbl t
CROSS APPLY
(
SELECT TOP(t.Names.value('count(*)','int')) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) Nmbr FROM master..spt_values
) tally;
The idea in short:
We create a tally on the fly by using APPLY to produce a list of numbers.
The TOP clause limits this list to the count of <name> elements in the given row.
In this case I take master..spt_values as a source of many rows. We do not need its content, just an appropriately long list to build the tally from. That said, if there is a physical numbers table in your database, that would be even better.
Finally, we pick the content by the element's position, using sql:column() to introduce the tally's value into the XQuery predicate.
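If you want to go the numbers-table route, here is a minimal sketch (the name dbo.Numbers is my own choice, not something that exists by default):
-- One-off: build a persistent tally table with enough rows for your longest XML.
CREATE TABLE dbo.Numbers (Nmbr INT NOT NULL PRIMARY KEY);
INSERT INTO dbo.Numbers (Nmbr)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values a CROSS JOIN master..spt_values b;
-- The CROSS APPLY in the query above then becomes a plain filtered read:
-- CROSS APPLY (SELECT Nmbr FROM dbo.Numbers WHERE Nmbr <= t.Names.value('count(*)','int')) tally;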

Using STRING_SPLIT for 2 columns in a single table

I've started from a table like this
ID | City                                | Sales
1  | London,New York,Paris,Berlin,Madrid | 20,30,,50
2  | Istanbul,Tokyo,Brussels             | 4,5,6
There can be an unlimited amount of cities and/or sales.
I need to get each city and its sales amount into its own record. So my result should look something like this:
ID | City | Sales
1 | London | 20
1 | New York | 30
1 | Paris |
1 | Berlin | 50
1 | Madrid |
2 | Istanbul | 4
2 | Tokyo | 5
2 | Brussels | 6
What I've got so far is
SELECT ID, splitC.Value, splitS.Value
FROM Table
CROSS APPLY STRING_SPLIT(Table.City,',') splitC
CROSS APPLY STRING_SPLIT(Table.Sales,',') splitS
With one cross apply, this works perfectly. But when executing the query with a second one, it starts to multiply the number of records a lot (which makes sense I think, because it's trying to split the sales for each city again).
What would be an option to solve this issue? STRING_SPLIT is not necessary; it's just how I started on it.
STRING_SPLIT() is not an option, because (as mentioned in the documentation) the output rows might be in any order, and the order is not guaranteed to match the order of the substrings in the input string.
But you may try a JSON-based approach, using OPENJSON() and a string transformation (the comma-separated values are transformed into a valid JSON array, e.g. London,New York,Paris,Berlin,Madrid into ["London","New York","Paris","Berlin","Madrid"]). The result from OPENJSON() with the default schema is a table with columns key, value and type, where the key column holds the 0-based index of each item in the array:
Table:
CREATE TABLE Data (
ID int,
City varchar(1000),
Sales varchar(1000)
)
INSERT INTO Data
(ID, City, Sales)
VALUES
(1, 'London,New York,Paris,Berlin,Madrid', '20,30,,50'),
(2, 'Istanbul,Tokyo,Brussels', '4,5,6')
Statement:
SELECT d.ID, a.City, a.Sales
FROM Data d
CROSS APPLY (
SELECT c.[value] AS City, s.[value] AS Sales
FROM OPENJSON(CONCAT('["', REPLACE(d.City, ',', '","'), '"]')) c
LEFT OUTER JOIN OPENJSON(CONCAT('["', REPLACE(d.Sales, ',', '","'), '"]')) s
ON c.[key] = s.[key]
) a
Result:
ID | City     | Sales
1  | London   | 20
1  | New York | 30
1  | Paris    |
1  | Berlin   | 50
1  | Madrid   | NULL
2  | Istanbul | 4
2  | Tokyo    | 5
2  | Brussels | 6
STRING_SPLIT has no concept of ordinal positions. In fact, the documentation specifically states that it does not care about them:
The order of the output may vary as the order is not guaranteed to match the order of the substrings in the input string.
As a result, you need to use something that is aware of such basic things, such as DelimitedSplit8k_LEAD.
Then you can do something like this:
WITH Cities AS(
SELECT ID,
DSc.Item,
DSc.ItemNumber
FROM dbo.YourTable YT
CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.City,',') DSc),
Sales AS(
SELECT ID,
DSs.Item,
DSs.ItemNumber
FROM dbo.YourTable YT
CROSS APPLY dbo.DelimitedSplit8k_LEAD(YT.Sales,',') DSs)
SELECT ISNULL(C.ID,S.ID) AS ID,
C.Item AS City,
S.Item AS Sale
FROM Cities C
FULL OUTER JOIN Sales S ON C.ItemNumber = S.ItemNumber;
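As a side note, if you happen to be on SQL Server 2022 or Azure SQL Database, STRING_SPLIT accepts a third enable_ordinal argument that adds an ordinal column, which removes the ordering objection. A sketch, valid only on those versions:
SELECT YT.ID,
       C.value AS City,
       S.value AS Sales
FROM dbo.YourTable YT
CROSS APPLY STRING_SPLIT(YT.City, ',', 1) AS C      -- 1 = enable_ordinal
OUTER APPLY (SELECT [value]
             FROM STRING_SPLIT(YT.Sales, ',', 1)
             WHERE ordinal = C.ordinal) AS S;       -- line Sales up with City by position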
Of course, the real solution is to fix your design. This type of design is only going to cause you hundreds of problems in the future. Fix it now, not later; the sooner you do it, the sooner you'll reap the rewards.

Break up the data in a database column of a record into multiple records

Azure SQL Server - we have a table like this:
MyTable:
ID Source ArticleText
-- ------ -----------
1 100 <nvarchar(max) field with unstructured text from media articles>
2 145 "
3 866 "
4 232 "
ID column is the primary key and auto-increments on INSERTS.
I run this query to find the records with the largest data size in the ArticleText column:
SELECT TOP 500
ID, Source, DATALENGTH(ArticleText)/1048576 AS Size_in_MB
FROM
MyTable
ORDER BY
DATALENGTH(ArticleText) DESC
We are finding that for many reasons both technical and practical, the data in the ArticleText column is just too big in certain records. The above query allows me to look at a range of sizes for our largest records, which I'll need to know for what I'm trying to formulate here.
The feat I need to accomplish is: for all existing records in this table, take any record whose ArticleText DATALENGTH is greater than some threshold and break it into N records, where each of those records has the same value in the Source column but the data in the ArticleText column is split across them in smaller chunks.
How would one achieve this if the exact requirement was say, take all records whose ArticleText DATALENGTH is greater than 10MB, and break each into 3 records where the resulting records' Source column value is the same across the 3 records, but the ArticleText data is separated into three chunks.
In essence, we would need to divide the DATALENGTH by 3 and apply the first 1/3 of the text data to the first record, 2nd 1/3 to the 2nd record, and the 3rd 1/3 to the third record.
Is this even possible in SQL Server?
You can use the following code to create a side table with the needed data:
CREATE TABLE #mockup (ID INT IDENTITY, [Source] INT, ArticleText NVARCHAR(MAX));
INSERT INTO #mockup([Source],ArticleText) VALUES
(100,'This is a very long text with many many words and it is still longer and longer and longer, and even longer and longer and longer')
,(200,'A short text')
,(300,'A medium text, just long enough to need a second part');
DECLARE @partSize INT = 50;

WITH recCTE AS
(
    SELECT ID,[Source]
          ,1 AS FragmentIndex
          ,A.Pos
          ,CASE WHEN A.Pos>0 THEN LEFT(ArticleText,A.Pos) ELSE ArticleText END AS Fragment
          ,CASE WHEN A.Pos>0 THEN SUBSTRING(ArticleText,A.Pos+2,DATALENGTH(ArticleText)/2) END AS RestString
    FROM #mockup
    CROSS APPLY(SELECT CASE WHEN DATALENGTH(ArticleText)/2 > @partSize
                            THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(ArticleText,@partSize)))
                            ELSE -1 END AS Pos) A
    UNION ALL
    SELECT r.ID,r.[Source]
          ,r.FragmentIndex+1
          ,A.Pos
          ,CASE WHEN A.Pos>0 THEN LEFT(r.RestString,A.Pos) ELSE r.RestString END
          ,CASE WHEN A.Pos>0 THEN SUBSTRING(r.RestString,A.Pos+2,DATALENGTH(r.RestString)/2) END AS RestString
    FROM recCTE r
    CROSS APPLY(SELECT CASE WHEN DATALENGTH(r.RestString)/2 > @partSize
                            THEN @partSize - CHARINDEX(' ',REVERSE(LEFT(r.RestString,@partSize)))
                            ELSE -1 END AS Pos) A
    WHERE DATALENGTH(r.RestString)>0
)
SELECT ID,[Source],FragmentIndex,Fragment
FROM recCTE
ORDER BY [Source],FragmentIndex;
GO
DROP TABLE #mockup
The result
+----+--------+---------------+---------------------------------------------------+
| ID | Source | FragmentIndex | Fragment |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 1 | This is a very long text with many many words and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 2 | it is still longer and longer and longer, and |
+----+--------+---------------+---------------------------------------------------+
| 1 | 100 | 3 | even longer and longer and longer |
+----+--------+---------------+---------------------------------------------------+
| 2 | 200 | 1 | A short text |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 1 | A medium text, just long enough to need a second |
+----+--------+---------------+---------------------------------------------------+
| 3 | 300 | 2 | part |
+----+--------+---------------+---------------------------------------------------+
Now you have to update each existing row with the value at FragmentIndex = 1, and insert new rows for the values with FragmentIndex > 1 (a sketch follows below). Do the inserts sorted by FragmentIndex and your IDENTITY ID column will reflect the correct order.
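A minimal sketch of that final step against the #mockup table (against your real data you would target MyTable instead). It assumes the output of recCTE has first been landed in a temp table, e.g. by ending the CTE statement above with SELECT ... INTO #fragments instead of the plain SELECT:
-- #fragments holds ID, [Source], FragmentIndex, Fragment from recCTE.

-- 1) The first fragment replaces the original text in place.
UPDATE m
SET m.ArticleText = f.Fragment
FROM #mockup AS m
JOIN #fragments AS f ON f.ID = m.ID AND f.FragmentIndex = 1;

-- 2) The remaining fragments become new rows; the IDENTITY column assigns new IDs,
--    and inserting in FragmentIndex order keeps them in the right sequence.
INSERT INTO #mockup ([Source], ArticleText)
SELECT [Source], Fragment
FROM #fragments
WHERE FragmentIndex > 1
ORDER BY ID, FragmentIndex;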

TSQL Conditional Where or Group By?

I have a table like the following:
id | type | duedate
-------------------------
1 | original | 01/01/2017
1 | revised | 02/01/2017
2 | original | 03/01/2017
3 | original | 10/01/2017
3 | revised | 09/01/2017
Where there may be either one or two rows for each id. If there are two rows with same id, there would be one with type='original' and one with type='revised'. If there is one row for the id, type will always be 'original'.
What I want as a result are all the rows where type='revised', but if there is only one row for a particular id (thus type='original') then I want to include that row too. So desired output for the above would be:
id | type | duedate
1 | revised | 02/01/2017
2 | original | 03/01/2017
3 | revised | 09/01/2017
I do not know how to construct a WHERE clause that conditionally checks whether there are 1 or 2 rows for a given id, nor am I sure how to use GROUP BY, because the revised date could be greater than or less than the original date, so the aggregate functions MAX and MIN don't work. I thought about using CASE somehow, but I also do not know how to construct a conditional that chooses between two different rows of data (if there are two rows) and displays one of them rather than the other.
Any suggested approaches would be appreciated.
Thanks!
You can use ROW_NUMBER() for this:
WITH T AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Type DESC) AS RN
FROM YourTable
)
SELECT *
FROM T
WHERE RN = 1
Is something like this sufficient?
SELECT *
FROM mytable m1
WHERE type='revised'
or 1=(SELECT COUNT(*) FROM mytable m2 WHERE m2.id=m1.id)
You could use a subquery to take the MAX([type]). In this case it works for [type] because we want 'revised' to win over 'original', and "r" comes after "o" in the alphabet. We can then INNER JOIN back onto the same table with the matching conditions.
SELECT T2.*
FROM (
SELECT id, MAX([type]) AS [MAXtype]
FROM myTABLE
GROUP BY id
) AS dT INNER JOIN myTable T2 ON dT.id = T2.id AND dT.[MAXtype] = T2.[type]
ORDER BY T2.[id]
Gives output:
id type duedate
1 revised 2017-02-01
2 original 2017-03-01
3 revised 2017-09-01
Here is the sqlfiddle: http://sqlfiddle.com/#!6/14121f/6/0
