CTAS operation in netezza consuming /nzscratch - netezza

When I create a table via CTAS(Create Table AS)with the help of 1 or more base tables which contains million of records. Then I can see transient data is getting saved to /nzscratch/tmp dir Or I can say during the CTAS operation, nzscratch/tmp is keep on filling unless it gets completed. Once CTAS is successful, everything gets cleared out from /nzscratch/tmp dir.
So I would like to know if it is expected behavior or not?
2nd thing: Does the same concept applies while running any normal query also which keeps on running for long time or transient data is getting save to memory side in this case?
An example of a CTAS that is generating transient data:
CREATE TABLE T1 AS
SELECT track_seq,
MAX(campaign_label) AS campaign_label,
MAX(creative_label) AS creative_label,
MAX(lob_label) AS lob_label,
MAX(placement_label) AS placement_label,
MAX(site_label) AS site_label
FROM (
SELECT *
FROM master_test
UNION
SELECT *
FROM labels_test_1
)
a
GROUP BY 1;
The relevant DDL:
CREATE TABLE admin.master_test
(
track_seq character varying(40),
campaign_label character varying(200),
creative_label character varying(200),
lob_label character varying(200),
placement_label character varying(200),
site_label character varying(200)
)
DISTRIBUTE ON (track_seq)
;
CREATE TABLE admin.labels_test_1
(
track_seq character varying(40),
campaign_label character varying(200),
creative_label character varying(200),
lob_label character varying(200),
placement_label character varying(200),
site_label character varying(200)
)
DISTRIBUTE ON (track_seq)
;

Generally speaking, you should only see transient data in the /nzscratch/tmp directory if your CTAS query does significant host-based processing (e.g. using a row_number() function with no PARTITION BY clause, which is an all too common mistake for generating surrogate keys on an MPP platform, in my opinion).
You may also see usage there in cases where you are doing a SELECT to a remote system (your desktop, or a BI server) where the receiving system can't keep up with the speed the Netezza system sends the data. In these cases you will see data spooled in /nzscratch as well.
For the specific CTAS example you provided, the culprit is the UNION in the subselect. A UNION must provide duplicate detection, and what is happening in this case is that both tables are being sent up to the host to be processed so that all the rows can be de-duplicated there. This is what is eating up your /nzscratch space.
Here is an alternative using GROUP BY to do the de-deuplication work that should avoid that host-based processing, and keep the work on the MPP backend, by specifying the distribution column in the GROUP BY.
CREATE TABLE T1 AS
SELECT track_seq,
MAX(campaign_label) AS campaign_label,
MAX(creative_label) AS creative_label,
MAX(lob_label) AS lob_label,
MAX(placement_label) AS placement_label,
MAX(site_label) AS site_label
FROM (
select *
from (
SELECT *
FROM master_test
UNION ALL
SELECT *
FROM labels_test_1
)
foo
group by 1,2,3,4,5,6
)
a
GROUP BY 1;

Related

Explanation for the fn_dblog() function's output on SQL Server 2008 R2

I have a query to get me some basic information regarding the transaction log (.ldf) file. Here it is:
WITH CTE AS
(
SELECT
AllocUnitName,
Operation,
Context,
[Lock Information],
SUM(CONVERT(BIGINT, [Log Record Length])) AS TotalTranLogBytes,
SUM(CONVERT(BIGINT, [Log Record Length])) * 100 /
SUM(CONVERT(MONEY, SUM(CONVERT(BIGINT, [Log Record Length]))))
OVER() AS PercentOfLog
FROM
sys.fn_dblog(NULL,NULL)
GROUP BY
AllocUnitName,
Operation,
Context,
[Lock Information]
)
SELECT
AllocUnitName,
Operation,
Context,
[Lock Information],
TotalTranLogBytes,
PercentOfLog
FROM
CTE
WHERE
PercentOfLog >= 0
ORDER BY
TotalTranLogBytes DESC
Unfortunately, I don't actually understand the output... I'm primarily concerned with only the very top row from that query's results, it's the largest amount of space used in the transaction log, simple!
However, there are other columns, AllocUnitName, Operation and Context. In my case, I get:
dbo.MyMassiveTable.PK_MyMassiveTable LOP_MODIFY_ROW LCX_TEXT_MIX 3848564 61.6838
...as my output. But what on EARTH does LOP_MODIFY_ROW, and LCX_TEXT_MIX actually MEAN?
Obviously I can vaguely understand that it's something to do with the primary key for that table, that it's associated with an UPDATE command, and that there was something happening with a Text column?
But I need precision!
Anyone that can help me understand why this particular part of the transaction log is so HUGE would be a great help!
This indicates that the table contains a column of some Large Object datatype that was subject to insert or update activity (i.e. a MAX datatype, XML, CLR datatype or IMAGE or [N]Text).
dbo.MyMassiveTable.PK_MyMassiveTable must either be the clustered index or a non clustered index that INCLUDE-s one or more LOB columns.
LCX_TEXT_MIX presumably indicates text mix page:
A text page that holds small chunks of LOB values plus internal parts
of text tree. These can be shared between LOB values in the same
partition of an index or heap.
LOP_MODIFY_ROW usually appears in the log when a value is updated but the example below shows that insert can also reproduce this same logging outcome.
CREATE TABLE dbo.MyMassiveTable
(
pk INT IDENTITY CONSTRAINT PK_MyMassiveTable PRIMARY KEY,
Blob1 NVARCHAR(MAX)
)
INSERT INTO dbo.MyMassiveTable
VALUES (REPLICATE(CAST(N'X' AS VARCHAR(MAX)), 3848564 / 2));

Convert Date Stored as VARCHAR into INT to compare to Date Stored as INT

I'm using SQL Server 2014. My request I believe is rather simple. I have one table containing a field holding a date value that is stored as VARCHAR, and another table containing a field holding a date value that is stored as INT.
The date value in the VARCHAR field is stored like this: 2015M01
The data value in the INT field is stored like this: 201501
I need to compare these tables against each other using EXCEPT. My thought process was to somehow extract or TRIM the "M" out of the VARCHAR value and see if it would let me compare the two. If anyone has a better idea such as using CAST to change the date formats or something feel free to suggest that as well.
I am also concerned that even extracting the "M" out of the VARCHAR may still prevent the comparison since one will still remain VARCHAR and the other is INT. If possible through a T-SQL query to convert on the fly that would be great advice as well. :)
REPLACE the string and then CONVERT to integer
SELECT A.*, B.*
FROM TableA A
INNER JOIN
(SELECT intField
FROM TableB
) as B
ON CONVERT(INT, REPLACE(A.varcharField, 'M', '')) = B.intField
Since you say you already have the query and are using EXCEPT, you can simply change the definition of that one "date" field in the query containing the VARCHAR value so that it matches the INT format of the other query. For example:
SELECT Field1, CONVERT(INT, REPLACE(VarcharDateField, 'M', '')) AS [DateField], Field3
FROM TableA
EXCEPT
SELECT Field1, IntDateField, Field3
FROM TableB
HOWEVER, while I realize that this might not be feasible, your best option, if you can make this happen, would be to change how the data in the table with the VARCHAR field is stored so that it is actually an INT in the same format as the table with the data already stored as an INT. Then you wouldn't have to worry about situations like this one.
Meaning:
Add an INT field to the table with the VARCHAR field.
Do an UPDATE of that table, setting the INT field to the string value with the M removed.
Update any INSERT and/or UPDATE stored procedures used by external services (app, ETL, etc) to do that same M removal logic on the way in. Then you don't have to change any app code that does INSERTs and UPDATEs. You don't even need to tell anyone you did this.
Update any "get" / SELECT stored procedures used by external services (app, ETL, etc) to do the opposite logic: convert the INT to VARCHAR and add the M on the way out. Then you don't have to change any app code that gets data from the DB. You don't even need to tell anyone you did this.
This is one of many reasons that having a Stored Procedure API to your DB is quite handy. I suppose an ORM can just be rebuilt, but you still need to recompile, even if all of the code references are automatically updated. But making a datatype change (or even moving a field to a different table, or even replacinga a field with a simple CASE statement) "behind the scenes" and masking it so that any code outside of your control doesn't know that a change happened, not nearly as difficult as most people might think. I have done all of these operations (datatype change, move a field to a different table, replace a field with simple logic, etc, etc) and it buys you a lot of time until the app code can be updated. That might be another team who handles that. Maybe their schedule won't allow for making any changes in that area (plus testing) for 3 months. Ok. It will be there waiting for them when they are ready. Any if there are several areas to update, then they can be done one at a time. You can even create new stored procedures to run in parallel for any updated app code to have the proper INT datatype as the input parameter. And once all references to the VARCHAR value are gone, then delete the original versions of those stored procedures.
If you want everything in the first table that is not in the second, you might consider something like this:
select t1.*
from t1
where not exists (select 1
from t2
where cast(replace(t1.varcharfield, 'M', '') as int) = t2.intfield
);
This should be close enough to except for your purposes.
I should add that you might need to include other columns in the where statement. However, the question only mentions one column, so I don't know what those are.
You could create a persisted view on the table with the char column, with a calculated column where the M is removed. Then you could JOIN the view to the table containing the INT column.
CREATE VIEW dbo.PersistedView
WITH SCHEMA_BINDING
AS
SELECT ConvertedDateCol = CONVERT(INT, REPLACE(VarcharCol, 'M', ''))
--, other columns including the PK, etc
FROM dbo.TablewithCharColumn;
CREATE CLUSTERED INDEX IX_PersistedView
ON dbo.PersistedView(<the PK column>);
SELECT *
FROM dbo.PersistedView pv
INNER JOIN dbo.TableWithIntColumn ic ON pv.ConvertedDateCol = ic.IntDateCol;
If you provide the actual details of both tables, I will edit my answer to make it clearer.
A persisted view with a computed column will perform far better on the SELECT statement where you join the two columns compared with doing the CONVERT and REPLACE every time you run the SELECT statement.
However, a persisted view will slightly slow down inserts into the underlying table(s), and will prevent you from making DDL changes to the underlying tables.
If you're looking to not persist the values via a schema-bound view, you could create a non-persisted computed column on the table itself, then create a non-clustered index on that column. If you are using the computed column in WHERE or JOIN clauses, you may see some benefit.
By way of example:
CREATE TABLE dbo.PCT
(
PCT_ID INT NOT NULL
CONSTRAINT PK_PCT
PRIMARY KEY CLUSTERED
IDENTITY(1,1)
, SomeChar VARCHAR(50) NOT NULL
, SomeCharToInt AS CONVERT(INT, REPLACE(SomeChar, 'M', ''))
);
CREATE INDEX IX_PCT_SomeCharToInt
ON dbo.PCT(SomeCharToInt);
INSERT INTO dbo.PCT(SomeChar)
VALUES ('2015M08');
SELECT SomeCharToInt
FROM dbo.PCT;
Results:

Removing characters from a alphanumeric field SQL

Im moving data from one table to another using insert into. in the select bit need to transfer from column with characters and numerical in to another with only the numerical. The original column is in varchar format.
original column -
ABC100
XYZ:200
DD2000
Wanted column
100
200
2000
Cant write a function because cant have a function in side select statement when inserting
Using MS SQL
I encourage you to read this:
Extracting Data
There is an example function that removes alpha characters from a string. This will be much faster than a bunch of replace statements.
You can probably do that with a regex replace. The syntax for this depends on your database software (which you haven't specified).
You should be able to do function calls in your SELECT statement, even when you're using it to INSERT INTO.
If your data is fixed-format I'd do something like
INSERT INTO SOME_TABLE(COLUMN1, COLUMN2, COLUMN3)
SELECT TO_NUMBER(SUBSTR(SOURCE_COLUMN, 4, 3)),
TO_NUMBER(SUBSTR(SOURCE_COLUMN, 12, 3)),
TO_NUMBER(SUBSTR(SOURCE_COLUMN, 18, 4))
FROM SOME_OTHER_TABLE
WHERE <conditions>;
The above code is for Oracle. Depending on the database you're using you may have to do things a bit differently.
I hope this helps.
You certainly can have a function inside a SELECT statement during an INSERT:
INSERT INTO CleanTable (CleanColumn)
SELECT dbo.udf_CleanString(DirtyColumn)
FROM DirtyTable
Your main problem is going to be getting the function right (the one the G Mastros linked to is pretty good) and getting it performing. If you're only talking thousands of rows, this should be fine. If you are talking about millions of rows, you might need a different strategy.
Writing a UDF is how I've solved this problem in the past. However, I got to thinking if there was a set-based solution. Here's what I have:
First my table which I used Red Gate's Data Generator to populate with a bunch of random alpha numeric values:
Create Table MixedValues (
Id int not null identity(1,1) Primary Key
, AlphaValue varchar(50)
)
Next I built a Tally table on the fly using a CTE but normally I have a fixed table for this. A Tally table is just a table of sequential numbers.
;With Tally As
(
Select ROW_NUMBER() OVER ( ORDER BY object_id ) As Num
From sys.columns
)
, IndividualChars As
(
Select MX.Id, Substring(MX.AlphaValue, Num, 1) As CharValue, Num
From Tally
Cross Join MixedValues As MX
Where Num Between 1 And Len(MX.AlphaValue)
)
Select MX.Id, MX.AlphaValue
, Replace(
(
Select '' + CharValue
From IndividualChars As IC
Where IC.Id = MX.Id
And PATINDEX('[ 0-9]', CharValue) > 0
Order By Num
For Xml Path('')
)
, ' ', ' ') As NewValue
From MixedValues As MX
From a top level, the idea here is to split the string into one row per individual character, test the type of pattern you want and then re-constitute it.
Note that my sys.columns table only contains 500 some odd rows. If you had strings larger than 500 characters, you could simply cross join sys.columns to itself and get 500^2 rows. In addition, For Xml Path returns a string with spaces escaped (note the space in my pattern index [ 0-9] which tells the system to include spaces.) so I use the replace function to reverse the escaping.
EDIT: Btw, this will only work in SQL 2005+ because of my use of the CTE. If you wanted a SQL 2000 solution, you would need to break up the CTE into separate table creation calls (e.g. Temp tables) but it could still be done.
EDIT: I added the Num column in the IndividualChars CTE and added an Order By to the NewValue query at the end. Although it probably will reconstitute the string in order, I wanted to ensure that it would by explicitly ordering the results.

How does sql server choose values in an update statement where there are multiple options?

I have an update statement in SQL server where there are four possible values that can be assigned based on the join. It appears that SQL has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The match column contains only 1's, the Data column increments by 1 for every row)
Match Data
`--------------------------
1 1
1 2
1 3
1 4
That table will update another table called Destination with the same two columns structured as below:
Match Data
`--------------------------
1 NULL
If you want to update the ID field in Destination in the following way:
UPDATE
Destination
SET
Data = Source.Data
FROM
Destination
INNER JOIN
Source
ON
Destination.Match = Source.Match
there will be four possible options that Destination.ID will be set to after this query is run. I've found that messing with the indexes of Source will have an impact on what Destination is set to, and it appears that SQL Server just updates the Destination table with the first value it finds that matches.
Is that accurate? Is it possible that SQL Server is updating the Destination with every possible value sequentially and I end up with the same kind of result as if it were updating with the first value it finds? It seems to be possibly problematic that it will seemingly randomly choose one row to update, as opposed to throwing an error when presented with this situation.
Thank you.
P.S. I apologize for the poor formatting. Hopefully, the intent is clear.
It sets all of the results to the Data. Which one you end up with after the query depends on the order of the results returned (which one it sets last).
Since there's no ORDER BY clause, you're left with whatever order Sql Server comes up with. That will normally follow the physical order of the records on disk, and that in turn typically follows the clustered index for a table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, Sql Server will return the results in whatever order it thinks it can do fastest.
You can play with this by turning your upate query into a select query, so you can see the results. Notice which record comes first and which record comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal, and you can use it to effectively JOIN to another table and guarantee the join only matches one record.
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE #Source TABLE(Match INT, Data INT);
INSERT INTO #Source
VALUES
(1, 1),
(1, 2),
(1, 3),
(1, 4);
DECLARE #Destination TABLE(Match INT, Data INT);
INSERT INTO #Destination
VALUES
(1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN #Source Source
ON Destination.Match = Source.Match;
SELECT *
FROM #Destination;
And look at the actual execution plan. I see the following.
The output columns from #Destination are Bmk1000, Match. Bmk1000 is an internal row identifier (used here due to lack of clustered index in this example) and would be different for each row emitted from #Destination (if there was more than one).
The single row is then joined onto the four matching rows in #Source and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(#Source.[Data]).
The ANY aggregate is an internal aggregate function not available in TSQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
SELECT Match,
MAX(Data) AS Data
FROM #Source
GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN GroupedSource Source
ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
SELECT Match,
Data,
ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
FROM #Source
)
UPDATE Destination
SET Data = Source.Data
FROM #Destination Destination
INNER JOIN RankedSource Source
ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful as in the event you need to set multiple columns this will ensure that all values used are from the same source row. In order to be deterministic the combination of partition by and order by columns should be unique.

SQL Server Full Text Search Escape Characters?

I am doing a MS SQL Server Full Text Search query. I need to escape special characters so I can search on a specific term that contains special characters. Is there a built-in function to escape a full text search string? If not, how would you do it?
Bad news: there's no way. Good news: you don't need it (as it won't help anyway).
I've faced similar issue on one of my projects. My understanding is that while building full-text index, SQL Server treats all special characters as word delimiters and hence:
Your word with such a character is represented as two (or more) words in full-text index.
These character(s) are stripped away and don't appear in an index.
Consider we have the following table with a corresponding full-text index for it (which is skipped):
CREATE TABLE [dbo].[ActicleTable]
(
[Id] int identity(1,1) not null primary key,
[ActicleBody] varchar(max) not null
);
Consider later we add rows to the table:
INSERT INTO [ActicleTable] values ('digitally improvements folders')
INSERT INTO [ActicleTable] values ('digital"ly improve{ments} fold(ers)')
Try searching:
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'digitally')
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'improvements')
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'folders')
and
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'digital')
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'improve')
SELECT * FROM [ArticleTable] WHERE CONTAINS(*, 'fold')
First group of conditions will match first row (and not the second) while the second group will match second row only.
Unfortunately I could not find a link to MSDN (or something) where such behaviour is clearly stated. But I've found an official article that tells how to convert quotation marks for full-text search queries, which is [implicitly] aligned with the above described algorithm.

Resources