nzload - skiprows not working when 1st column is not matching the table elements - netezza

When using nzload for fixed-width data where the first row is a header row, -skipRows works fine as long as that first row has the same number of elements as the layout, for example:
1HelloWorld2011-12-07
1HelloWorld2011-12-07
2Netezza 2010-02-16
But when the first row is a single piece of text that I want -skipRows to discard, it does not have the same number of elements as the layout, and nzload throws an error:
DummyRow
1HelloWorld2011-12-07
2Netezza 2010-02-16
Script example:
nzload -t "textFixed_tbl" -format fixed -layout "col1 int bytes 1, col2 char(10) bytes 10, col3 date YMD '-' bytes 10" -df /tmp/fixed_width.dat -bf /tmp/testFixedWidth.bad -lf /tmp/testFixedWidth.nzlog -skipRows 1 -maxErrors 1
Data File
DummyRow
1HelloWorld2011-12-07
2Netezza 2010-02-16
Error:
Error: Operation canceled
Error: External Table : count of bad input rows reached maxerrors limit
Record Format: FIXED Record Null-Indicator: 0
Record Length: 0 Record Delimiter:
Record Layout: 3 zones : "col1" INT4 DECIMAL BYTES 1 NullIf &&1 = '', "col2" CHAR(10) INTERNAL BYTES 10, "col3" DATE YMD '-' BYTES 10 NullIf &&3 = ''
Statistics
number of records read: 1
number of bytes read: 22
number of records skipped: 0
number of bad records: 1
number of records loaded: 0
Elapsed Time (sec): 0.0

The skipRows option for nzload / external tables discards the specified number of rows, but it still parses them. Consequently the skipped rows must be properly formed, so this option will not behave the way you hoped.
This is noted in the documentation:
You cannot use the SkipRows option for header row processing in a data file, because even the skipped rows are processed first. Therefore, data in the header rows should be valid with respect to the external table definition.
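One possible workaround, assuming the header is always a single newline-terminated line, is to strip it from the data file before loading and drop -skipRows entirely. A sketch, reusing the file and table names from the example above:
tail -n +2 /tmp/fixed_width.dat > /tmp/fixed_width_noheader.dat
nzload -t "textFixed_tbl" -format fixed -layout "col1 int bytes 1, col2 char(10) bytes 10, col3 date YMD '-' bytes 10" -df /tmp/fixed_width_noheader.dat -bf /tmp/testFixedWidth.bad -lf /tmp/testFixedWidth.nzlog -maxErrors 1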

Related

SQL Server - poor performance during Insert transaction

I have a stored procedure which executes a query and returns the row into variables, like below:
SELECT @item_id = I.ID, @label_ID = SL.label_id
FROM tb_A I
LEFT JOIN tb_B SL ON I.ID = SL.item_id
WHERE I.NUMBER = @VAR
I have an IF to check whether @label_ID is null. If it is null, it goes to an INSERT statement, otherwise it goes to an UPDATE statement. Let's focus on the INSERT, where I know I'm having problems. The INSERT part is like below:
IF @label_ID IS NULL
BEGIN
INSERT INTO tb_B (item_id, label_qrcode, label_barcode, data_leitura, data_inclusao)
VALUES (@item_id, @label_qrcode, @label_barcode, @data_leitura, GETDATE())
END
So, tb_B has a PK on the ID column and an FK on the item_id column, which references the ID column in tb_A.
I ran SQL Server Profiler and saw that this stored procedure sometimes takes around 2300 ms, while the normal average is 16 ms.
I looked at the execution plan, and the biggest cost is in the "Clustered Index Insert" operator, shown below:
Estimated Execution Plan
Actual Execution Plan
Details
More details about the tables:
tb_A Storage:
Index space: 6,853.188 MB
Row count: 45,988,842
Data space: 5,444.297 MB
tb_B Storage:
Index space: 1,681.688 MB
Row count: 15,552,847
Data space: 1,663.281 MB
Statistics for INDEX 'PK_tb_B'.
Name | Updated | Rows | Rows Sampled | Steps | Density | Average Key Length | String Index | Unfiltered Rows
PK_tb_B | Sep 23 2018 2:30AM | 15369616 | 15369616 | 5 | 1 | 4 | NO | 15369616
All Density | Average Length | Columns
6.506343E-08 | 4 | id
Histogram Steps
RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS
1 | 0 | 1 | 0 | 1
8192841 | 8192198 | 1 | 8192198 | 1
8270245 | 65535 | 1 | 65535 | 1
15383143 | 7111878 | 1 | 7111878 | 1
15383144 | 0 | 1 | 0 | 1
Statistics for INDEX 'IDX_tb_B_ITEM_ID'.
Name | Updated | Rows | Rows Sampled | Steps | Density | Average Key Length | String Index | Unfiltered Rows
IDX_tb_B_ITEM_ID | Sep 23 2018 2:30AM | 15369616 | 15369616 | 12 | 1 | 7.999424 | NO | 15369616
All Density | Average Length | Columns
6.50728E-08 | 3.999424 | item_id
6.506343E-08 | 7.999424 | item_id, id
Histogram Steps
RANGE_HI_KEY | RANGE_ROWS | EQ_ROWS | DISTINCT_RANGE_ROWS | AVG_RANGE_ROWS
0 2214 0 1
16549857 | 0 | 1 | 0 | 1
29907650 | 65734 | 1 | 65734 | 1
32097131 | 131071 | 1 | 131071 | 1
32296132 | 196607 | 1 | 196607 | 1
32406913 | 98303 | 1 | 98303 | 1
40163331 | 7700479 | 1 | 7700479 | 1
40237216 | 65535 | 1 | 65535 | 1
47234636 | 6946815 | 1 | 6946815 | 1
47387143 | 131071 | 1 | 131071 | 1
47439431 | 31776 | 1 | 31776 | 1
47439440 | 0 | 1 | 0 | 1
PK_tb_B Index fragmentation
IDX_tb_B_Item_ID
Are there any best practices I can apply to make this execution duration stable?
Hope you can help me !!!
Thanks in advance...
The problem is probably the data type of the clustered index key. Clustered indexes store the table data ordered by the key values. By default, your primary key is created with a clustered index. This is often the best place to have it,
but not always. If you have, for example, a clustered index over an NVARCHAR column, every time an INSERT is performed it needs to find the right place to insert the new record. For example, if your table has one million rows ordered alphabetically and your new record starts with A, the clustered index needs to move records from B to Z to put your new record in the A group. If your new record starts with Z, it moves a smaller number of records, but that still isn't free. If you don't have a column that lets you insert new records sequentially, you can create an identity column for this, or use another column that is logically sequential for any transaction entered regardless of the system, for example a datetime column that records the time at which the insert occurs.
If you want more info, please check this Microsoft documentation.
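A rough sketch of that approach (all object names here are made up for illustration, not taken from the question):
-- Hypothetical example: cluster on an ever-increasing IDENTITY key so new rows
-- always append to the end of the clustered index, and support lookups on the
-- natural key with a separate nonclustered index.
CREATE TABLE dbo.tb_Example
(
    ID         int IDENTITY(1,1) NOT NULL,
    ItemNumber nvarchar(50)      NOT NULL,
    CreatedAt  datetime2         NOT NULL DEFAULT SYSDATETIME(),
    CONSTRAINT PK_tb_Example PRIMARY KEY CLUSTERED (ID)
);
-- Searches by the natural key use this index; inserts still land at the end
-- of the clustered index instead of splitting pages in the middle.
CREATE NONCLUSTERED INDEX IX_tb_Example_ItemNumber
    ON dbo.tb_Example (ItemNumber);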

TSQL BULK INSERT with auto incremented key from .txt file

I am getting an error when bulk inserting into an already created table:
CREATE TABLE SERIES(
SERIES_NAME VARCHAR(225) NOT NULL UNIQUE, --MADE VARCHAR(225) & UNIQUE FOR FK REFERENCE
ONGOING_SERIES BIT, --BOOL FOR T/F IF SERIES IS COMPLETED OR NOT
RUN_START DATE,
RUN_END DATE,
MAIN_CHARACTER VARCHAR(20),
PUBLISHER VARCHAR(12),
S_ID INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
CONSTRAINT chk_DATES CHECK (RUN_START < RUN_END)
)
and the text file is organized as:
GREEN LANTERN,0,2005-07-01,2011-09-01,HAL JORDAN,DC
SPIDERMAN,0,2005-07-01,2011-09-01,PETER PARKER,MARVEL
I have already tried adding commas to the end of each line in .txt file
I have also tried adding ,' ' to the end of each line.
Any suggestions?
Indeed, KEEPIDENTITY prevents the bulk insert from taking place. Removing the option, however, won't resolve the problem:
Msg 4864, Level 16, State 1, Line 13
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 7 (S_ID).
The bulk insert expects values for all of the columns. Another way of solving this issue is to add a format file for the text file; see MS Docs - Use a Format File to Bulk Import Data.
You can create a format file for your text file with the following command.
bcp yourdatabase.dbo.series format nul -c -f D:\test.fmt -t, -T
Remove the last row (the one for S_ID), update the number of columns from 7 to 6, and replace the last field's comma with the row terminator. The result will look as shown below.
13.0
6
1 SQLCHAR 0 255 "," 1 SERIES_NAME SQL_Latin1_General_CP1_CI_AS
2 SQLCHAR 0 1 "," 2 ONGOING_SERIES ""
3 SQLCHAR 0 11 "," 3 RUN_START ""
4 SQLCHAR 0 11 "," 4 RUN_END ""
5 SQLCHAR 0 510 "," 5 MAIN_CHARACTER SQL_Latin1_General_CP1_CI_AS
6 SQLCHAR 0 510 "\r\n" 6 PUBLISHER SQL_Latin1_General_CP1_CI_AS
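With that format file saved (here as D:\test.fmt, matching the bcp command above), the load could then look something like the sketch below; the data file path is just a placeholder:
BULK INSERT dbo.SERIES
FROM 'D:\series.txt'
WITH (FORMATFILE = 'D:\test.fmt');
Because the format file only maps the six fields present in the text file, the S_ID identity column is never referenced and SQL Server generates its values.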
Remove KEEPIDENTITY from your BULK INSERT, since that specifies that you want to use the values in the source text file as your IDENTITY.
If this still fails, try adding a VIEW on the table that excludes the IDENTITY field, and INSERT into that instead, e.g.:
CREATE VIEW SeriesBulkInsertTarget
AS
SELECT Series_Name,
Ongoing_Series,
Run_Start,
Run_End,
Main_Character,
Publisher
FROM SERIES
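A bulk insert through that view might then look like the sketch below; the file path is a placeholder, and the terminators are assumptions based on the sample lines above:
BULK INSERT SeriesBulkInsertTarget
FROM 'D:\series.txt'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\r\n');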

Netezza: ERROR: 65536 : Record size limit exceeded

Can someone please explain the behavior below?
KAP.ADMIN(ADMIN)=> create table char1 ( a char(64000),b char(1516));
CREATE TABLE
KAP.ADMIN(ADMIN)=> create table char2 ( a char(64000),b char(1517));
ERROR: 65536 : Record size limit exceeded
KAP.ADMIN(ADMIN)=> insert into char1 select * from char1;
ERROR: 65540 : Record size limit exceeded
Why does the insert raise this error when the CREATE TABLE for the same table did not throw any error, as shown above?
KAP.ADMIN(ADMIN)=> \d char1
Table "CHAR1"
Attribute | Type | Modifier | Default Value
-----------+------------------+----------+---------------
A | CHARACTER(64000) | |
B | CHARACTER(1516) | |
Distributed on hash: "A"
./nz_ddl_table KAP char1
Creating table: "CHAR1"
CREATE TABLE CHAR1
(
A character(64000),
B character(1516)
)
DISTRIBUTE ON (A)
;
/*
Number of columns 2
(Variable) Data Size 4 - 65520
Row Overhead 28
====================== =============
Total Row Size (bytes) 32 - 65548
*/
I would like to know how the row size is calculated in the case above.
I checked the Netezza database user guide, but I was not able to understand the calculation for the example above.
I think this link does a good job of explaining the overhead of Netezza / PDA data types:
For every row of every table, there is a 24-byte fixed overhead of the rowid, createxid, and deletexid. If you have any nullable columns, a null vector is required and it is N/8 bytes where N is the number of columns in the record.
The system rounds up the size of
this header to a multiple of 4 bytes.
In addition, the system adds a record header of 4 bytes if any of the following is true:
Column of type VARCHAR
Column of type CHAR where the length is greater than 16 (stored internally as VARCHAR)
Column of type NCHAR
Column of type NVARCHAR
Using UTF-8 encoding, each Unicode code point can require 1 - 4 bytes of storage. A 10-character string requires 10 bytes of storage if it is ASCII and up to 20 bytes if it is Latin, or as many as 40 bytes if it is Kanji.
The only time a record does not contain a header is if all the columns are defined as NOT NULL, there are no character data types larger than 16 bytes, and no variable character data types.
https://www.ibm.com/support/knowledgecenter/SSULQD_7.2.1/com.ibm.nz.dbu.doc/c_dbuser_data_types_calculate_row_size.html
First create a temp table based on one row of data.
create temp table tmptable as
select *
from Table
limit 1
Then check the used bytes of the temp table. That should be the size per row.
select used_bytes
from _v_sys_object_storage_size a inner join
_v_table b
on a.tblid = b.objid
and b.tablename = 'tmptable'
Netezza has some limitations:
1) Maximum number of characters in a char/varchar field: 64,000
2) Maximum row size: 65,535 bytes
A record longer than 65,535 bytes is simply not possible in Netezza.
Although a Netezza box offers huge space, it is a much better idea to size columns based on accurate forecasting of the data than to pick lengths at random. For your requirement, check whether all of the attributes really need to be char(64000), or whether they can be compacted based on analysis of the real data; if they can, revisit the attribute lengths.
Also, in situations like this, avoid insert into char1 select * ... statements, because they let the system choose its preferred data types, which tend toward larger sizes that might not be necessary.

Total length of all characters in all columns of each row

I'm new to SQL Server, so I apologize if my question seems too easy. I tried finding an answer on my own, but I'm failing miserably. I am trying to create a query which will return the total size on the drive of each row in the table.
I thought about using DBCC SHOWCONTIG, but it doesn't work for varchar(max), which appears in my table. Also, it doesn't return the size of each row, but rather the average, max, and min size. My reading so far suggests that it is not possible to write a query that shows the size of each individual row in the table, so I decided to settle for the total length of all characters in each column of each row. Indirectly, that will give me an idea of the size of each row.
I have a table with some varchar(500) and varchar(max) columns. I noticed that some of the rows are a lot bigger than others.
What I need is top 1000 longest rows in the table, preferably in an output showing two columns:
Column 1 showing EntryID
Column 2 showing total length of the characters in all columns together for that record (eg total length of the characters in the column 1 + total length of the characters in the column 2 + column3 + column4 etc...) It would be great if this could be aliased RowLength.
What I tried so far is:
SELECT TOP 1000
(LEN(columnname1) + LEN(columnname2) + LEN(columnname3) + LEN(columnname4)) as RowLength,
FROM dbo.tablename
ORDER BY Length Desc
It works, but it doesn't show entry ID corresponding to the total length of all characters in the row. How do I add it?
It also doesn't show the alias for the column showing number of characters in the row.
Could you please suggest how I can change the query to get the expected outcome? I'll be very grateful for any suggestions.
it doesn't show EntryID corresponding to the total length of all
characters in the row. It also doesn't show the alias for the column
showing number of characters in the row.
You have not specified an alias, so what should it show? You also haven't selected EntryID. If you want the longest 1000 rows you have to order by the length:
SELECT TOP 1000
EntryID,
Length = LEN(columnname1) + LEN(columnname2) + LEN(columnname3) + LEN(columnname4)
FROM dbo.tablename
ORDER BY Length DESC
SELECT TOP 1000 EntryID,
(LEN(columnname1) + LEN(columnname2) + LEN(columnname3) + LEN(columnname4)) AS RowLength
FROM dbo.tablename
ORDER BY EntryID
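If any of those columns can be NULL, or if the on-disk byte size matters more than the character count (LEN also ignores trailing spaces), a variation along these lines may be closer to what you want; the column names are still the placeholder ones from the question:
SELECT TOP 1000
    EntryID,
    RowLength = DATALENGTH(ISNULL(columnname1, '')) + DATALENGTH(ISNULL(columnname2, ''))
              + DATALENGTH(ISNULL(columnname3, '')) + DATALENGTH(ISNULL(columnname4, ''))
FROM dbo.tablename
ORDER BY RowLength DESC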

Calculate storage space used by sql server sql_variant data type to store fixed length data types

I'm trying to calculate the storage space used by sql_variant to store fixed length data types.
For my test I created a table with two columns:
Key int identity(1,1) primary key
Value sql_variant
I added one row with Value 1 of type int and used DBCC PAGE to check the size of the row, which turned out to be 21 bytes.
Using Estimate the Size of a Clustered Index I have:
Null_bitmap = 3
Fixed_Data_Size = 4 (Key column int)
Variable_Data_Size = 2 + 2 + 4 (Value column with an int)
Row_Size = 4 + 8 + 3 + 4 = 19 bytes
Why does the row take 21 bytes? What am I missing in my calculation?
I tried the same analysis with a table using an int column instead of the sql_variant, and the used byte count reported by DBCC PAGE is 15, which matches my calculation:
Null_bitmap = 3
Fixed_Data_Size = 8 (Key column int, Value column int)
Variable_Data_Size = 0
Row_Size = 4 + 8 + 3 = 15 bytes
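For reference, a minimal sketch of the test described above; the table name dbo.VariantTest and the use of sys.dm_db_index_physical_stats (rather than DBCC PAGE) to read the record size are my own choices for illustration:
CREATE TABLE dbo.VariantTest
(
    [Key]   int IDENTITY(1,1) PRIMARY KEY,  -- clustered PK by default
    [Value] sql_variant
);
INSERT INTO dbo.VariantTest ([Value]) VALUES (CAST(1 AS int));
-- Physical record size of the single row, including the sql_variant metadata bytes.
SELECT avg_record_size_in_bytes, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.VariantTest'), NULL, NULL, 'DETAILED');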
The extra space is the sql_variant metadata information. From the BOL:
http://msdn.microsoft.com/en-us/library/ms173829.aspx
Each instance of a sql_variant column records the data value and the metadata information. This includes the base data type, maximum size, scale, precision, and collation.
For compatibility with other data types, the catalog objects, such as the DATALENGTH function, that report the length of sql_variant objects report the length of the data. The length of the metadata that is contained in a sql_variant object is not returned.
You missed part 7.
7. Calculate the number of rows per page (8096 free bytes per page):
Rows_Per_Page = 8096 / (Row_Size + 2)
Because rows do not span pages, the number of rows per page should be
rounded down to the nearest whole row. The value 2 in the formula is
for the row's entry in the slot array of the page.
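Plugging the 21-byte row observed above into that formula gives, as a quick sanity check:
Rows_Per_Page = 8096 / (21 + 2) = 8096 / 23 = 352
(which happens to divide evenly here, so no rounding down is needed).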

Resources