Looping through SQLDW table with Python

I have a table in a Microsoft Azure SQLDW that has a date field, as well as other columns. There is a row for each business day over the past 10 years. Currently the table is not stored in any particular order (e.g. ordered by date). Below is the code I have so far:
import pyodbc

# server, database, username and password are assumed to be defined earlier
driver = '{ODBC Driver 13 for SQL Server}'
conn = pyodbc.connect('DRIVER=' + driver +
                      ';PORT=1433'
                      ';SERVER=' + server +
                      ';DATABASE=' + database +
                      ';UID=' + username +
                      ';PWD=' + password)

cursor = conn.cursor()
cursor.execute("SELECT * FROM index_att")
i = 0
for row in cursor:
    i += 1
    print(i)
If I use Python to loop through every row in this table and I want to go in chronological order (from the oldest date to the current date), does the table need to be stored ordered by date? And do I need to add an incremental ID that starts at 1 for the oldest date and goes up by one for each day?
Or is there a way to loop through the dates without the data being sorted?

I would also ask you to review your process, because doing row-level operations on an Azure SQL Data Warehouse can cause large amounts of data movement.

"Currently the table is not stored in any particular order (e.g. ordered by date)."
How the data is stored is irrelevant. If you want ordered data, the client just needs to request it with an ORDER BY clause on a SELECT query.

Add an order by to your select statement:
cursor = conn.cursor()
cursor.execute("SELECT * FROM index_att ORDER BY [MyDate]")
i = 0
for row in cursor:
    i += 1
    print(i)

Related

Problem with SQL Server or SQL query OR both

I have a SQL query like this:
DECLARE @cdate1 date = '20200401 00:00:00'
DECLARE @cdate2 date = '20200630 23:59:59'
SELECT DISTINCT ([hf].[id])
FROM ((([hf]
JOIN [pw] AS [PP] ON [PP].[identity] = [hf].[id]
AND [PP].[type] = 67
AND [PP].[ideletestate] = 0
AND [PP].[datein] = (SELECT MAX([datein])
FROM [pw]
WHERE [pw].[identity] = [hf].[id]
AND [pw].[ideletestate] = 0
AND [pw].[type] = 67))
JOIN [px] ON [px].[idpaper] = [PP].[id]
AND [px].[ideletestate] = 0
AND [px].[type] = 30036
AND [px].[nazvanie] NOT LIKE '')
JOIN [pw] ON ([pw].[identity] = [hf].[id]
AND ([pw].[id] > 0)
AND ([pw].[ideletestate] = 0)
AND ([pw].[type] IN (16, 2, 3012, 19, 3013)))
LEFT JOIN [px] AS [px102] ON [px102].[idpaper] = [pw].[id]
AND [px102].[type] = 102
AND [px102].[ideletestate] = 0)
WHERE
(([pw].[idcompany] in (12461, 12466, 12467, 12462, 12463, 13258)) OR
([pw].[idcompany2] in (12461, 12466, 12467, 12462, 12463, 13258)) OR
([px102].[idcompany] in (12461, 12466, 12467, 12462, 12463, 13258)) ) AND
[pw].[datein] >= @cdate1 AND [pw].[datein] <= @cdate2
It works fine, but if I inline the dates as literals, like ...AND [pw].[datein] >= '20200401 00:00:00' AND [pw].[datein] <= '20200630 23:59:59', it runs very slowly: 10 minutes vs. 1 second.
One more strange thing: if I use '20200101 00:01:00' as the first date, it is fast too. If the date is later than 10 March 2020, it is very slow (when the date is a string literal in the query; with a variable it works fine).
Do I have a bad query? But then why does it work with a variable? Or is this some issue with SQL Server?
This looks like a statistics problem.
SQL Server will build a histogram of the values in a table to give it some idea of what kind of query plan to create.
For example, if you have a table T with a million rows in it, and the value of column C is always 1, and you then run select * from T where C = 1, the engine will choose a plan that expects a lot of rows to be returned, because the histogram says "it is statistically likely that this table contains a hell of a lot of rows where C = 1".
Alternatively, if you have a table T with a million rows in it, but the value of column C is never 1, then the histogram tells the engine "very few rows are likely to be returned for the query select * from T where C = 1 so pick a plan optimized for a small number of rows".
A problem can arise when the values in a column have changed significantly, but the histogram (statistics) has yet to be updated. Then SQL might pick a plan based on the histogram where a different plan would have been much better. In your case, the histogram may not indicate to the engine that there are any values greater than about the 10th of March 2020. Statistics issues are fairly common with dates, because you are often inserting getdate(), which means newer rows in the table contain values that have never been seen before, and thus won't be covered by the histogram until it gets updated.
SQL will automatically update statistics based on a number of different triggers (the details may have changed slightly between engine versions), as long as you have auto update statistics enabled in the database settings.
You can find out whether this is the issue by forcing SQL to update statistics on your table. Statistics can be refreshed by either fully scanning the table, or sampling it. Sampling is much faster for large tables, but the result won't be as accurate.
To find out whether statistics is the problem in your case, do:
update statistics PW with fullscan
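If you want to see what the optimizer actually knows, you can also inspect the histogram directly. This is a sketch, not part of the original answer: sp_helpstats lists the statistics objects on the table, and the statistics name passed to DBCC SHOW_STATISTICS below is hypothetical, so substitute one of the real names it returns.
-- list the statistics objects that exist on the pw table
exec sp_helpstats 'dbo.pw', 'ALL';

-- show the histogram for one of them ('IX_pw_datein' is a made-up name);
-- the last RANGE_HI_KEY on datein is the newest date the optimizer knows about
dbcc show_statistics ('dbo.pw', 'IX_pw_datein') with histogram;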

How do I save an archive from a SQL Server database

I have a database in SQL Server. Basically, the table consists of a number of XML documents that represent the same table's data at given points in time (like a backup history). What is the best method to cut off all the old backups (3 months and older), remove them from the DB, and save them archived?
There is no such export out of the box in SQL Server.
Assuming that your table can be pretty big (since it looks like you save an image of the table every minute), and that you want to do it all from inside SQL Server, I'll suggest doing the cleanup in chunks.
The usual process in SQL Server for deleting in chunks is a DELETE combined with the OUTPUT clause.
The easiest way to archive and remove, then, would be to have the OUTPUT go to a table in another database created for that sole purpose.
So your steps would be:
Create a new database (ArchiveDatabase)
Create an archive table (ArchiveTable) in ArchiveDatabase with the same structure as the table you want to clean up (a quick way to do this is shown after the loop below)
In a while loop, perform the DELETE/OUTPUT
Backup the ArchiveDatabase
TRUNCATE the ArchiveTable table in ArchiveDatabase
The DELETE/OUTPUT loop will look something like this:
declare @RowsToDelete int = 1000
declare @DeletedRowsCNT int = 1000

while @DeletedRowsCNT = @RowsToDelete
begin
    delete top (@RowsToDelete)
    from MyDataTable
    output deleted.* into ArchiveDatabase.dbo.ArchiveTable
    where dt < dateadd(month, -3, getdate())  -- rows older than 3 months

    set @DeletedRowsCNT = @@ROWCOUNT
end
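For step 2, a quick way to create the empty archive table with the same column structure is SELECT ... INTO with TOP (0). This is a sketch to run in the source database; note that SELECT INTO copies columns and nullability, but not indexes or constraints:
-- TOP (0) copies the structure of MyDataTable without copying any rows
select top (0) *
into ArchiveDatabase.dbo.ArchiveTable
from MyDataTable;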

Increment Table Name Based on Current Date

I am trying to add a table to my Power BI report using the SQL Server data source. The thing is that a new table is added daily, named with the date in YYMMDD format. Example: MYTABLE191121; tomorrow it will be MYTABLE191122.
How can I write a query in Power BI that always looks at the latest table based on today's date? I want to be able to refresh the content and have the latest table's data.
Thank you
If your database is named TEST_DB on localhost, your query can look like this:
let
    Source = Sql.Databases("localhost"),
    TEST_DB = Source{[Name="TEST_DB"]}[Data],
    // chain a few functions to create today's date in YYMMDD format
    TODAYS_DATE = Text.Range(Text.Remove(Date.ToText(DateTime.Date(DateTime.LocalNow())), "-"), 2),
    // simply use & to concat MYTABLE and today's date
    todays_table = TEST_DB{[Schema="dbo", Item="MYTABLE" & TODAYS_DATE]}[Data]
in
    todays_table
This will generate a query like SELECT * FROM [dbo].[MYTABLE191122] and load the result into the todays_table step.
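Alternatively, you could resolve the newest table on the SQL Server side and point Power BI at the result, for example via a stored procedure. The sketch below is my addition, not part of the original answer, and the LIKE pattern assumes the exact MYTABLEyymmdd naming convention:
declare @table sysname, @sql nvarchar(max);

-- YYMMDD names sort chronologically (within the same century),
-- so the newest table sorts last by name
select top (1) @table = [name]
from sys.tables
where [name] like 'MYTABLE[0-9][0-9][0-9][0-9][0-9][0-9]'
order by [name] desc;

set @sql = N'select * from dbo.' + quotename(@table) + N';';
exec sp_executesql @sql;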

How to convert rows of data into a single row through Informatica/SQL Server

How do I convert diagonal rows into a single row in SQL Server 2014 for a particular ID field? Any help would be greatly appreciated.
Please review.
My source is SQL Server 2014
Note:
I don't have write access to DB so will not be able to create functions etc.
I have the option to get the desired output either by taking the source given below (which is itself the result of a query) from SQL Server into Informatica and applying some transformations there, or by writing the logic in SQL Server 2014 itself. So, in short, I can apply the logic at the initial level either in Informatica or in SQL Server to get the desired output.
I have several other joins and columns being pulled from different tables along with the below fields
Regarding Input:
The ID field will be constant, but the POS field will differ for a particular ID.
There can be 1 to 10 such occurrences of the POS field (POS can have values from 1 to 10). The DESC1 value from POS = 1 will go to DESC1, the value from POS = 2 will go to desc2, and so on.
Right now I have shown only 8 occurrences, but actually there are 10.
INPUT:
ID|POS|DESC1|desc2|desc3|desc4|desc5|desc6|desc7|desc8|
1|1|ItemA|null|null|null|null|null|null|null
1|2|null|ItemB|null|null|null|null|null|null
1|3|null|null|Item C|null|null|null|null|null
1|4|null|null|null|ItemD|null|null|null|null
1|5|null|null|null|null|ItemE|null|null|null
1|6|null|null|null|null|null|value-random|null|null
1|7|null|null|null|null|null|null|Check!A|null
1|8|null|null|null|null|null|null|null|123456
OUTPUT:
ID|DESC1|desc2|desc3|desc4|desc5|Desc6|desc7|Desc8
1|ItemA|ItemB|Item C|ItemD|ItemE|value-random|Check!A|123456
This is simplified because I don't know what your current query looks like, but you could use MAX() and GROUP BY ID.
SELECT
    ID,
    MAX(desc1) AS [desc1],
    MAX(desc2) AS [desc2],
    MAX(desc3) AS [desc3],
    MAX(desc4) AS [desc4],
    MAX(desc5) AS [desc5],
    MAX(desc6) AS [desc6],
    MAX(desc7) AS [desc7],
    MAX(desc8) AS [desc8]
FROM dbo.YourTable
GROUP BY ID
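If you want to sanity-check the shape of that query without touching a real table, here is a minimal self-contained sketch using a VALUES constructor, trimmed to three DESC columns (MAX ignores NULLs, which is what collapses the diagonal into one row):
with src (ID, POS, desc1, desc2, desc3) as (
    select * from (values
        (1, 1, 'ItemA', null,    null),
        (1, 2, null,    'ItemB', null),
        (1, 3, null,    null,    'Item C')
    ) as v (ID, POS, desc1, desc2, desc3)
)
select ID,
       max(desc1) as [desc1],
       max(desc2) as [desc2],
       max(desc3) as [desc3]
from src
group by ID;
-- returns: 1 | ItemA | ItemB | Item C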

Verifying Syntax of a Massive SQL Update Command

I'm new to SQL Server and am doing some cleanup of our transaction database. However, to accomplish the last step, I need to update a column in one table of one database with the value from another column in another table from another database.
I found a SQL update code snippet and re-wrote it for our own needs but would love someone to give it a once over before I hit the execute button since the update will literally affect hundreds of thousands of entries.
So here are the two databases:
Database 1: Movement
Table 1: ItemMovement
Column 1: LongDescription (datatype: text / up to 40 char)
Database 2: Item
Table 2: ItemRecord
Column 2: Description (datatype: text / up to 20 char)
Goal: set Column 1 from DB 1 to the value of Column 2 from DB 2.
Here is the code snippet:
update table1
set table1.longdescription = table2.description
from movement..itemmovement as table1
inner join item..itemrecord as table2 on table1.itemcode = table2.itemcode
where table1.longdescription <> table2.description
I added the last "where" line to prevent SQL from updating the column where it already matches the source table.
This should execute faster and just update the columns that have garbage. But as it stands, does this look like it will run? And lastly, is it a straightforward process, using SQL Server 2005 Express, to just back up the entire Movement DB before I execute, and, if it messes up, just restore it?
Alternatively, is it even necessary to alias the tables as table1 and table2? Is it valid to execute a SQL query like this:
update movement..itemmovement
set itemmovement.longdescription = itemrecord.description
from movement..itemmovement
inner join item..itemrecord on itemmovement.itemcode = itemrecord.itemcode
where itemmovement.longdescription <> itemrecord.description
Many thanks in advance!
You don't necessarily need to alias your tables, but I recommend you do: it makes for faster typing and reduces the chances of making a typo.
update m
set m.longdescription = i.description
from movement..itemmovement as m
inner join item..itemrecord as i on m.itemcode = i.itemcode
where m.longdescription <> i.description
In the above query I have shortened the alias using m for itemmovement and i for itemrecord.
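Before running the update for real, it can be reassuring to see how many rows would actually change. This count query is my addition, not part of the answer's original steps; it just reuses the same join and filter:
select count(*) as RowsThatWouldChange
from movement..itemmovement as m
inner join item..itemrecord as i on m.itemcode = i.itemcode
where m.longdescription <> i.description;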
When a large number of records are to be updated and there's a question of whether the update will succeed, always make a copy in a test database (residing on a test server) and try it out there first. In this case, one of the safest bets would be to create a new field first and call it longdescription_test. You can create it with SQL Server Management Studio Express (SSMS) or using the command below:
use movement;
alter table itemmovement add longdescription_test varchar(100);
The syntax here says: alter table itemmovement and add a new column called longdescription_test with a datatype of varchar(100). (Note that SQL Server's ALTER TABLE takes ADD without the COLUMN keyword.) If you create the new column using SSMS instead, SSMS will run the same alter table statement in the background.
You can then execute
update m
set m.longdescription_test = i.description
from movement..itemmovement as m
inner join item..itemrecord as i on m.itemcode = i.itemcode
where m.longdescription <> i.description
Check data in longdescription_test randomly. You can actually do a spot check faster by running:
select * from movement..itemmovement
where longdescription <> longdescription_test
and longdescription_test is not null
If information in longdescription_test looks good, you can change your update statement to set m.longdescription = i.description and run the query again.
It is easier to just create a copy of your itemmovement table before you do the update. To make a copy, you can just do:
use movement;
select * into itemmovement_backup from itemmovement;
If the update does not succeed as desired, you can truncate itemmovement and copy the data back from itemmovement_backup.
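That restore path would look roughly like this (a sketch; it assumes itemmovement has no identity column and no foreign keys referencing it, which would otherwise complicate the truncate and re-insert):
use movement;

-- throw away the bad data and copy the original rows back
truncate table itemmovement;

insert into itemmovement
select * from itemmovement_backup;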
Zedfoxus provided a GREAT explanation on this and I appreciate it. It is an excellent reference for next time around. After reading over some syntax examples, I was confident enough to run the second SQL update query from my OP. Luckily, the data here is not necessarily "live", so there was little risk of damaging anything, even during operating hours. The update executed perfectly, updating all 345,000 entries!
