Large Table Select and Insert - sql-server

I haven't been able to find anything that solves this really, though I have found many things that seem to point in the right direction.
I have a table with ~4.7 million records in it. This table also has ~319 columns. Of those ~319 columns, there are 16 that I am interested in, and I want to put them into another table that has just 2 columns. Basically, column "A" is just an ID and columns 1-15 are codes. None of the columns are grouped either (not sure if that matters).
I have tried things like:
Insert Into NewTable(ID,Profession)
Select ID, ProCode1 From OriginalTable WHERE ProCode1 > ''
UNION
Select ID, ProCode2 From OriginalTable WHERE ProCode2 > ''
And so on. This didn't seem to do anything at all and I let it go for ~ 20 minutes.
Now I can get a small result by doing the same thing but dropping the UNION and using a TOP (1000) statement; however, even that approach will never work for the full table.
So the question is what can I do to take this:
ID|PID|blah|blah|blah|...|ProCode1|ProCode2|ProCode3|...|ProCode15|blah|...
into:
ID|PID|ProCode|
across all ~4.7 million rows without running:
Insert Into NewTable(PID,ProCode)
select PID, ProCode1 FROM OriginalTable WHERE ProCode1 > ''
Insert Into NewTable(PID, ProCode)
select PID, ProCode2 FROM OriginalTable WHERE ProCode2 > ''
Insert Into NewTable(PID, ProCode)
Select PID, ProCode3 FROM OriginalTable WHERE ProCode3 > ''
...
...
...
EDIT: I forgot to mention that a majority of the ProCodeX columns are blank. All ProCode1 values are populated, but occupancy drops off sharply with each successive column (e.g. ProCode2 is <50% occupied, ProCode3 is <10% occupied).

Use CROSS APPLY with a table value constructor to unpivot the data instead of using multiple UNION ALLs:
Insert Into NewTable(PID, ProCode)
select PID, ProCode
FROM OriginalTable
Cross apply
(
    values (ProCode1), (ProCode2), (ProCode3), ... , (ProCode15)
) cs (ProCode)
Where ProCode <> ''
This will be much faster than the UNION approach, since it needs only a single pass over the physical table.
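For reference, below is a small self-contained sketch of the same pattern, trimmed to three hypothetical code columns; the table and data are made up purely for illustration.
-- Hypothetical demo of the CROSS APPLY (VALUES ...) unpivot; all names are made up
declare @Original table (PID int, ProCode1 varchar(10), ProCode2 varchar(10), ProCode3 varchar(10))
declare @New table (PID int, ProCode varchar(10))
insert into @Original values
 (1, 'A01', 'B02', '')
,(2, 'C03', '',    '')
,(3, 'D04', 'E05', 'F06')
insert into @New (PID, ProCode)
select o.PID, cs.ProCode
from @Original o
cross apply (values (o.ProCode1), (o.ProCode2), (o.ProCode3)) cs (ProCode)
where cs.ProCode <> ''   -- blank (and NULL) codes are dropped here
select * from @New       -- returns 6 rows: one per non-blank code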

Related

SQL Server & SSMS 2012 - Move a value from one column to a new one to ensure only one row

This is a problem that has troubled me several times in the past, and I have always wondered if a solution is possible.
I have a query using several tables; one of the values is a mobile phone number.
I have name, address etc. I also have income information in the table, which is used for a summary in Excel.
The problem occurs when a contact has more than one mobile number; as you know, this creates extra rows where most of the data, including the income, is duplicated.
Question: is it possible for the query to identify whether the contact has more than one number and, if so, create a new column with the 2nd mobile number?
Effectively returning the contact's information to one row and creating new columns.
My SQL is intermediate and I cannot think of a solution so thought I would ask.
Many thanks
I am pretty sure this isn't the best possible solution, since we don't have information on how many records are in your dataset and I didn't have enough time, but here is an idea of how you can solve your original problem of two different numbers for the same customer.
declare @t table (id int
    ,firstName varchar(20)
    ,lastName varchar(20)
    ,phoneNumber varchar(20)
    ,income money)
insert into @t values
 (1,'John','Doe','1234567',50)
,(1,'John','Doe','6789856',50)
,(2,'Mike','Smith','5687456',150)
,(3,'Stela','Hodhson','3334445',500)
,(4,'Nick','Slotter','5556667',550)
,(4,'Nick','Slotter','8889991',550)
,(5,'Abraham','Lincoln','4578912',52)
,(6,'Ronald','Regan','6987456',587)
,(7,'Thomas','Jefferson','8745612',300);
with a as (
    select id
          ,max(phoneNumber) maxPhone
    from @t group by id
),
b as (
    select id
          ,min(phoneNumber) minPhone
    from @t group by id
)
select distinct t.id
      ,t.firstName
      ,t.lastName
      ,t.income
      ,a.maxPhone as phoneNumber1
      ,case when b.minPhone = a.maxPhone then ''
            else b.minPhone end as phoneNumber2
from @t t
inner join a on a.id = t.id
inner join b on b.id = t.id
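Not part of the original answer, but another common way to handle this is ROW_NUMBER() with conditional aggregation; a minimal sketch against the same @t sample data, assuming at most two numbers per contact:
-- Hedged alternative: number each contact's phones, then pivot them into columns
;with numbered as (
    select id, firstName, lastName, income, phoneNumber,
           row_number() over (partition by id order by phoneNumber) as rn
    from @t
)
select id, firstName, lastName, income,
       max(case when rn = 1 then phoneNumber end) as phoneNumber1,
       max(case when rn = 2 then phoneNumber end) as phoneNumber2
from numbered
group by id, firstName, lastName, income
Contacts with a single number get NULL in phoneNumber2, and the CASE list can be extended if more than two numbers per contact are possible.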

sql server - insert if not exist else update

I have here a list of 100 types of item flavor. Then I have a table where I need a record for every item in every flavor. So if I have 50 items, I need 100 records for each of the 50 items in this table_A. So there will be a total of 100x50 records in this table at the end.
What I have now is a random mix of data and I know I don't have a record for each type of flavor for every item.
What I need help with is an idea/algorithm to solve this problem; pseudocode would do. I have a table with all possible flavors (tbl_flavor) and a table with all 50 items (tbl_items). These two will dictate what needs to be put in table_A, which is basically an inventory.
Please advise.
If I'm understanding your question correctly, a SQL Server EXCEPT query will help.
As already pointed out in the comments, here's how to get the matrix of items and flavors:
SELECT Items.Item, Flavors.Flavor
FROM Items
CROSS JOIN Flavors
Here's how to get the matrix of items and flavors, omitting the combinations that are already in your other table.
SELECT Items.Item, Flavors.Flavor
FROM Items
CROSS JOIN Flavors
EXCEPT SELECT Item, Flavor
FROM Table_A
So the INSERT becomes:
INSERT INTO Table_A (Item, Flavor)
SELECT Items.Item, Flavors.Flavor
FROM Items
CROSS JOIN Flavors
EXCEPT SELECT Item, Flavor
FROM Table_A
This query is untested because I'm not 100% sure about the question. If you post more detail I'll test it.
There are a few ways you can tackle this sort of problem. Here is pseudocode for one of those ways.
Update table
set Col1 = SomeValue
where MyKeys = Mykeys
if (@@ROWCOUNT = 0)
begin
    Insert into table
    (Cols)
    Values
    (Vals)
end
Or you can use MERGE. https://msdn.microsoft.com/en-us/library/bb510625.aspx
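A minimal sketch of what that MERGE could look like for the item/flavor scenario, reusing the Items, Flavors and Table_A names from the answer above; this is an illustration, not tested code:
-- Hedged sketch: insert only the item/flavor combinations that Table_A is missing
MERGE Table_A AS target
USING (SELECT i.Item, f.Flavor
       FROM Items i
       CROSS JOIN Flavors f) AS source (Item, Flavor)
   ON target.Item = source.Item
  AND target.Flavor = source.Flavor
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Item, Flavor)
    VALUES (source.Item, source.Flavor);
A WHEN MATCHED THEN UPDATE clause can be added if existing rows also need to be refreshed.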
Try This
UPDATE MyTable
SET ColumnToUpdate = NewValue
WHERE EXISTS
(
    SELECT 1
    FROM TableWithNewValue
    WHERE ColumnFromTable1 = MyTable.ColumnName
)

INSERT INTO MyTable
(
    Column1,
    Column2,
    ...
    ColumnN
)
SELECT
    Value1,
    Value2,
    ...
    ValueN
FROM TableWithNewValue
WHERE NOT EXISTS
(
    SELECT 1
    FROM MyTable
    WHERE ColumnName = TableWithNewValue.ColumnFromTable1
)

Table Vs Table Variable

Below are two approaches to accomplishing the same task.
The first selects from the table directly multiple times; the second selects the desired columns from the table into a table variable first and then uses that table variable multiple times. Which one would perform better, and why?
declare
    @var1 varchar(10),
    @var2 varchar(10)
----------------------------------------------------------------------------
-- 1st approach
----------------------------------------------------------------------------
select *
from tab1
where tab1.col1 in (select tab2.col1 from tab2 where tab2.col2 <> @var1) or
      tab1.col2 in (select tab2.col2 from tab2 where tab2.col3 <> @var2)
----------------------------------------------------------------------------
-- 2nd approach
----------------------------------------------------------------------------
declare @tab2 table (col1 varchar(10), col2 varchar(10), col3 varchar(10))
insert into @tab2
select col1,
       col2,
       col3
from tab2
select *
from tab1
where tab1.col1 in (select t.col1 from @tab2 as t where t.col2 <> @var1) or
      tab1.col2 in (select t.col2 from @tab2 as t where t.col3 <> @var2)
In my opinion, the first approach will be faster and more efficient.
If you look at the execution plans, the extra cost of the table insert is added to the second approach.
Execution plan for the first approach:
Execution plan for the second approach:
EDIT: I misunderstood the question. Please disregard my answer.
I don't think there is any performance difference between your two methods, because the only difference is a tiny query to retrieve your columns, which should be negligible.
The bonus you get with the second approach is that if you change your column names in the future, you won't need to update your script.
PS:
I think your query to retrieve your columns isn't quite right. You're not retrieving column names here, but data. I don't know your DBMS, but if it's Oracle, it should be something like:
SELECT column_name
FROM USER_TAB_COLUMNS
WHERE table_name = 'MYTABLE'
Why in the world would you think two selects are faster than one?
Why would you not just select col1, col2 from tab1 where ... ?
In both cases you have a SELECT with a WHERE.
A select on a table is faster than a select on a table variable.
So all you have done is add the overhead of inserting into a table variable to get a less efficient select.
A table variable is stored in tempdb.
Microsoft has all sorts of warnings on the use of table variables in its documentation: for one, not to use them for more than 100 rows; they also do not have indexes.
Really, what if tab1 had a million rows and the WHERE limited it to 10?
Do you really think inserting a million rows into @tab2 is going to make it faster?

SQL WHERE NOT EXISTS (skip duplicates)

Hello, I'm struggling to get the query below right. What I want is to return rows with unique names and surnames. What I get is all rows, with duplicates.
This is my SQL:
DECLARE @tmp AS TABLE (Name VARCHAR(100), Surname VARCHAR(100))
INSERT INTO @tmp
SELECT CustomerName, CustomerSurname FROM Customers
WHERE
NOT EXISTS
    (SELECT Name, Surname
     FROM @tmp
     WHERE Name = CustomerName
       AND Surname = CustomerSurname
     GROUP BY Name, Surname)
Please can someone point me in the right direction here.
//Desperate (I tried without GROUP BY as well but get same result)
DISTINCT would do the trick.
SELECT DISTINCT CustomerName, CustomerSurname
FROM Customers
If you only want the records that really don't have duplicates (as opposed to getting duplicates represented as a single record) you could use GROUP BY and HAVING:
SELECT CustomerName, CustomerSurname
FROM Customers
GROUP BY CustomerName, CustomerSurname
HAVING COUNT(*) = 1
At first I thought that @David's answer is what you want. But rereading your comments, perhaps you want all combinations of Names and Surnames:
SELECT n.CustomerName, s.CustomerSurname
FROM
( SELECT DISTINCT CustomerName
FROM Customers
) AS n
CROSS JOIN
( SELECT DISTINCT CustomerSurname
FROM Customers
) AS s ;
Are you doing that while your @tmp table is still empty?
If so: your entire SELECT is fully evaluated before the INSERT statement; it doesn't "run the query and get one row, insert the row, run the query and get another row, insert the row", and so on.
If you want to insert unique customers only, use that same Customers table in your NOT EXISTS clause:
SELECT c.CustomerName,c.CustomerSurname FROM Customers c
WHERE
NOT EXISTS
(SELECT 1
FROM Customers c1
WHERE c.CustomerName = c1.CustomerName
AND c.CustomerSurname = c1.CustomerSurname
AND c.Id <> c1.Id)
If you want to insert a unique set of customers, use "distinct"
Typically, if you're doing a WHERE NOT EXISTS, WHERE EXISTS, or WHERE NOT IN subquery, you should use what is called a "correlated subquery", as in ypercube's answer above, where table aliases are used for both the inner and outer tables and the inner table is joined to the outer one. ypercube gave a good example.
And often, NOT EXISTS is preferred over NOT IN (unless the WHERE NOT IN is selecting from a totally unrelated table that you can't join on).
Sometimes, if you're tempted to do a WHERE EXISTS (SELECT from a small table with no duplicate values in the column), you could do the same thing by joining the main query with that table on the column you want in the EXISTS. This is not always the best or safest solution: it might make the query slower if that table has many rows, and it can produce many duplicate rows if there are duplicate values for that column in the joined table, in which case you'd have to add DISTINCT to the main query, which forces a sort of the data on all columns.
Not efficient at all.
Similarly, the WHERE NOT IN and NOT EXISTS correlated subqueries can be accomplished (and give the exact same execution plan) if you LEFT OUTER JOIN the table you were going to subquery and add a WHERE <joined column> IS NULL.
You have to be careful using that, but you don't need a DISTINCT. Frankly, I prefer the WHERE NOT IN subqueries or NOT EXISTS correlated subqueries, because the syntax makes the intention clear and it's hard to go wrong.
And you do not need a DISTINCT in the SELECT inside such subqueries (correlated or not). It would be a waste of processing (and for WHERE EXISTS or WHERE IN subqueries, the SQL optimizer would ignore it anyway and just use the first value that matched for each row in the outer query). (Hope that makes sense.)
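To illustrate the last points, here is a minimal sketch of the LEFT OUTER JOIN / IS NULL pattern described above, reusing the Customers and @tmp names from this question purely as an assumption:
-- Hedged sketch: same effect as NOT EXISTS, written as LEFT OUTER JOIN ... IS NULL
SELECT c.CustomerName, c.CustomerSurname
FROM Customers c
LEFT OUTER JOIN @tmp t
       ON t.Name = c.CustomerName
      AND t.Surname = c.CustomerSurname
WHERE t.Name IS NULL   -- keep only the customers with no match in @tmp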

SQL Server: Order By DateDiff Performance issue

I'm having a problem getting the top 100 rows from a table with 2M rows in reasonable time.
The problem is the ORDER BY part; it takes more than 50 minutes to get results for this query.
What can be the best solution for this problem?
select top 100 * from THETABLE TT
Inner join SecondTable ST on TT.TypeID = ST.TypeID
ORDER BY DATEDIFF(Day, TT.LastCheckDate, GETDATE()) * ST.SomeParam DESC
Many thanks,
Bentzy
Edit:
* TheTable is the one with 2M rows.
* SomeParam has 15 distinct values (more or less)
There are two things that come to mind to speed up this fetch:
If you need to run this query often, you should index the LastCheckDate column; no matter which SQL database you are using, a well-defined index on the column will allow for faster selects, especially with an ORDER BY clause (a minimal index sketch follows below).
Perform the date math before doing the SELECT query. You are getting the difference in days between the row's LastCheckDate and the current date, times some parameter. Does the multiplication affect the ordering of the rows? Can this simply be ordered by LastCheckDate desc? Explore other sorting options that return the same result.
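A minimal sketch of such an index; the index name and the INCLUDE column are assumptions based on the question's join:
-- Hedged sketch: support filtering/ordering on LastCheckDate and cover the join column
CREATE INDEX IX_THETABLE_LastCheckDate
    ON THETABLE (LastCheckDate)
    INCLUDE (TypeID);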
Two ideas come to mind:
a) If ST.SomeParam doesn't change often, perhaps you can cache the result of the multiplication somewhere. The numbers would be "off" after a day, but the relative values would be the same, i.e. the sort order wouldn't change.
b) Find a way to reduce the size of the input tables. There are probably some values of LastCheckDate and/or SomeParam that will never be in the top 100. For example,
Select *
into #tmp
from THETABLE
where LastCheckDate between '2012-06-01' and getdate()
select top 100 *
from #tmp join SecondTable ST on #tmp.TypeID = ST.TypeID
order by DateDiff(day, LastCheckDate, getdate()) * ST.SomeParam desc
It's a lot faster to search a small table than a big one.
DATEDIFF(Day, TT.LastCheckDate, GETDATE()) is the number of days since "last check".
If you just order by TT.LastCheckDate you get a similar order.
EDIT
Maybe you can work out which dates you don't expect to get back and filter on them. Of course you then also need an index on that LastCheckDate column. If everything works out, you can at least shorten the list of records to check from 2M to some manageable amount.
It is quite complicated. Do you seriously need all columns in the query? There is one thing you could try here. First just get the top 100 rows' typeid,
something like below
select top 100 tt.typeid
       ,tt.lastcheckdate, st.someparam -- do not use these if typeid is unique in both tables...
       -- or just the PK columns of both tables plus typeid, so that these can be joined on the PK
into #temptable
from SecondTable st inner join THETABLE tt on st.typeid = tt.typeid
ORDER BY DATEDIFF(Day, tt.LastCheckDate, GETDATE()) * st.SomeParam DESC
The above sorts very minimal data and thus should be faster. Depending on how many columns you have in your tables and on your indexes, this should be much faster than the actual query: it helps most if you have many columns in both tables, because this query uses just three or four of them, and those columns (st.typeid, st.someparam, tt.typeid and tt.lastcheckdate) may already be covered by some of your indexes, so there is no need to read the underlying tables, which reduces the IO as well. Then join this data back to both tables.
If that doesn't work the way you expect, you can create an indexed view from the above select by adding the ORDER BY expression as a column. Then use this indexed view to get the top 100 and join it with the main tables. This will surely reduce the amount of work and thus improve performance. But an indexed view has overhead, which will depend on how frequently the data in table TT changes.
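A hedged sketch of the "join this data back" step; it assumes typeid plus lastcheckdate identify the narrowed-down rows well enough, whereas in practice you would join on the primary keys, as the comment in the query above suggests:
-- Join the 100 pre-sorted rows in #temptable back to the full tables
select tt.*, st.*
from #temptable x
inner join THETABLE tt
        on tt.typeid = x.typeid
       and tt.lastcheckdate = x.lastcheckdate
inner join SecondTable st
        on st.typeid = x.typeid
order by DATEDIFF(Day, x.lastcheckdate, GETDATE()) * x.someparam DESC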
To lessen the number of rows, you might retrieve the top (100) for each SecondTable record ordered by LastCheckDate, then UNION ALL them, and finally select the overall top (100), by means of a temporary table or a dynamically generated query.
This solution uses a cursor to fetch the top 100 records for each value in SecondTable. With an index on (TypeID, LastCheckDate) on TheTable it runs instantaneously (tested on my system with a table of 700,000 records and 50 SecondTable entries).
declare @SomeParam varchar(3)
declare @TypeID int
declare @tbl table (TheTableID int, LastCheckDate datetime, SomeParam float)

declare rstX cursor local fast_forward for
    select TypeID, SomeParam
    from SecondTable
open rstX
while 1 = 1
begin
    fetch next from rstX into @TypeID, @SomeParam
    if @@fetch_status <> 0
        break
    insert into @tbl
    select top 100 ID, LastCheckDate, @SomeParam
    from TheTable
    where TypeID = @TypeID
    order by LastCheckDate
end
close rstX
deallocate rstX

select top 100 *
from @tbl
order by DATEDIFF(Day, LastCheckDate, GETDATE()) * SomeParam desc
Obviously this solution fetches IDs only. You might want to expand the @tbl table variable with additional columns.
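As a follow-up, a hedged sketch of joining the collected IDs back to TheTable to return full rows, assuming ID is TheTable's key column:
-- Join the collected candidate IDs back to TheTable for the full rows
select top 100 t.*
from @tbl b
inner join TheTable t on t.ID = b.TheTableID
order by DATEDIFF(Day, b.LastCheckDate, GETDATE()) * b.SomeParam desc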
