Netezza insert same row into table multiple times

I have a table with the following values:
location  date    count
2150      4/5/14  100
Now I need to insert 100 rows into another table. The table should have 100 rows of:
location  date
2150      4/5/14
Help me in achieving this. My database is Netezza.

Netezza has a system view that has 1024 rows each with an idx value from 0 to 1023. You can exploit this to drive an arbitrary number of rows by joining to it. Note that this approach requires you to have determined some reasonable upper limit in order to know how many times to join to _v_vector_idx.
INSERT INTO target_table
SELECT a.location,
       a.date
FROM base_table a
JOIN _v_vector_idx b
    ON b.idx < 100;
Then if you want to drive it based on the third column from the base_table, you could do this:
INSERT INTO target_table
SELECT a.location,
       a.date
FROM base_table a
JOIN _v_vector_idx b
    ON b.idx < a.count;
One could also take a procedural approach and create a stored procedure if you didn't have a feel for what that reasonable upper limit might be.
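For illustration, a rough, untested NZPLSQL sketch of that procedural approach, reusing the table and column names from the question, might look like this:
CREATE OR REPLACE PROCEDURE insert_repeated_rows()
RETURNS INTEGER
LANGUAGE NZPLSQL
AS
BEGIN_PROC
DECLARE
    rec RECORD;
    i INTEGER;
BEGIN
    -- For each source row, insert (location, date) into the target "count" times
    FOR rec IN SELECT location, date, count AS cnt FROM base_table
    LOOP
        FOR i IN 1 .. rec.cnt
        LOOP
            INSERT INTO target_table (location, date)
            VALUES (rec.location, rec.date);
        END LOOP;
    END LOOP;
    RETURN 0;
END;
END_PROC;
Keep in mind that row-at-a-time inserts are slow on Netezza, so the set-based join above is usually preferable whenever a reasonable upper limit is known.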

Related

Using SQL Server Export Wizard in batches

I'm using SQL Server Export Wizard to migrate 2 million rows over to a Postgres database. After 10 hours, I got to 1.5 million records and it quit. Argh.
So I'm thinking the safest way to get this done is to do it in batches. 100k rows at a time. But how do you do that?
Conceptually, I'm thinking:
SELECT * FROM invoices WHERE RowNum BETWEEN 300001 AND 400000
But RowNum doesn't exist, right? Do I need to create a new column and somehow get a +1 incremental ID in there that I can use in a statement like this? There is no primary key and there are no columns with unique values.
Thanks!
The rows are invoices, so I created a new variable 'Quartile' that divides the invoice dollar values into quartiles using:
SELECT *,
       NTILE(4) OVER (ORDER BY TOTAL_USD) AS QUARTILE
INTO invoices2
FROM invoices;
This created four groups of 500k rows each. Then in the export wizard, I asked to:
SELECT * FROM invoices2 WHERE QUARTILE = 1 -- (or 2, 3, 4 etc)
And I'm going to send each group of 500k rows to its own Postgres table and then merge them back together in pgAdmin. That way, if any one batch crashes, I can just redo that smaller grouping without affecting the integrity of the others. Does that make sense? Maybe it would have been just as easy to create an incrementing primary key?
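If you do go the incrementing-ID route instead, one hedged sketch (the invoices_numbered name here is made up, and TOTAL_USD is used only as an arbitrary ordering column) is to materialize a ROW_NUMBER() once and then filter on ranges in the wizard:
SELECT *,
       ROW_NUMBER() OVER (ORDER BY TOTAL_USD) AS RowNum
INTO invoices_numbered
FROM invoices;

-- then, per batch, in the export wizard:
SELECT * FROM invoices_numbered
WHERE RowNum BETWEEN 300001 AND 400000;
Because there is no unique key, ties are numbered arbitrarily, but once the numbering is materialized into invoices_numbered it stays stable across batches.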
Update:
All four batches transferred successfully. Worth noting that the total transfer time was 4x faster when sending the 2M rows as four simultaneous batches of 500k: 4 hours instead of 16! I combined them back into a single table using the following query in pgAdmin:
--Combine tables back into one, total row count matches original
SELECT * INTO invoices_all FROM (
    SELECT * FROM quar1
    UNION ALL
    SELECT * FROM quar2
    UNION ALL
    SELECT * FROM quar3
    UNION ALL
    SELECT * FROM quar4
) AS tmp;
And I checked the sums of all variables that had to be converted from SQL Server "money" to Postgres "numeric":
--All numeric sums match those from original
SELECT SUM("TOTAL_BEFORE_TP_TAX")
,SUM("TP_SELLER")
,SUM("TOTAL_BEFORE_TAX")
,SUM("TAX_TOTAL")
,SUM("TOTAL")
,SUM("TOTAL_BEFORE_TP_TAX_USD")
,SUM("TP_SELLER_USD")
,SUM("TOTAL_BEFORE_TAX_USD")
,SUM("TAX_TOTAL_USD")
,SUM("TOTAL_USD")
FROM PUBLIC.invoices_all

Fastest way to process Millions of Rows in SQL Server for a Chart

We are logging real-time data every second to a SQL Server database and we want to generate charts from 10 million rows or more. At the moment we use something like the code below. The goal is to get at least 1000-2000 values to pass into the chart.
In the query below, we average each consecutive group of n rows, where n depends on how many rows fall into the range we pick out of LargeTable. This works fine for up to 200,000 selected rows, but beyond that it is way too slow.
SELECT
    AVG(X),
    AVG(Y)
FROM
    (SELECT
         X, Y,
         (Id / @AvgCount) AS [Group]
     FROM
         [LargeTable]
     WHERE
         Timestmp > @From
         AND Timestmp < @Till) j
GROUP BY
    [Group]
ORDER BY
    [Group];
Now we tried selecting only every n'th row from LargeTable and then averaging that data to get more performance, but it takes nearly the same time.
SELECT
    X, Y
FROM
    (SELECT
         X, Y,
         ROW_NUMBER() OVER (ORDER BY Id) AS rownr
     FROM
         LargeTable
     WHERE
         Timestmp >= @From
         AND Timestmp <= @Till) a
WHERE
    a.rownr % (@count / 10000) = 0;
It is only pseudo code! We have indexes on all relevant columns.
Are there better and faster ways to get chart data?
I can think of two approaches to improve the performance of the charts:
Improving the performance of the queries.
Reducing the amount of data that needs to be read.
It's almost impossible for me to improve the performance of the queries without the full DDL and execution plans, so I'm suggesting you reduce the amount of data to be read.
The key is to summarize the data at a given granularity as it arrives and store it in a separate table like the following:
CREATE TABLE SummarizedData
(
    GroupId int PRIMARY KEY,
    FromDate datetime,
    ToDate datetime,
    SumX float,
    SumY float,
    GroupCount int
);
GroupId should be equal to Id/100 or Id/1000, depending on how much granularity you want in the groups. With larger groups you get coarser granularity but more efficient charts.
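Under those assumptions, the chart query only has to read the pre-aggregated rows; a minimal sketch, reusing the @From/@Till parameters from the question:
SELECT SumX / GroupCount AS AvgX,
       SumY / GroupCount AS AvgY
FROM SummarizedData
WHERE FromDate >= @From
  AND ToDate <= @Till
ORDER BY GroupId;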
I'm assuming the LargeTable Id column increases monotonically, so you can store the last Id that has been processed in another table called SummaryProcessExecutions.
You would need a stored procedure ExecuteSummaryProcess that:
Reads LastProcessedId from SummaryProcessExecutions
Reads the last Id in LargeTable and stores it in a @NewLastProcessedId variable
Summarizes all rows from LargeTable with Id > @LastProcessedId and Id <= @NewLastProcessedId and stores the results in the SummarizedData table
Stores the @NewLastProcessedId value in the SummaryProcessExecutions table
You can execute the ExecuteSummaryProcess stored procedure frequently from a SQL Server Agent job.
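A minimal sketch of such a procedure, assuming SummaryProcessExecutions is pre-seeded with a single row holding a LastProcessedId column and that a group covers 1000 rows:
CREATE PROCEDURE ExecuteSummaryProcess
AS
BEGIN
    DECLARE @LastProcessedId bigint,
            @NewLastProcessedId bigint,
            @GroupSize int = 1000;  -- rows per summary group (assumption)

    SELECT @LastProcessedId = LastProcessedId FROM SummaryProcessExecutions;
    SELECT @NewLastProcessedId = MAX(Id) FROM LargeTable;

    -- Only summarize complete groups, so no group ever has to be revisited
    SET @NewLastProcessedId = (@NewLastProcessedId / @GroupSize) * @GroupSize - 1;

    IF @NewLastProcessedId > @LastProcessedId
    BEGIN
        INSERT INTO SummarizedData (GroupId, FromDate, ToDate, SumX, SumY, GroupCount)
        SELECT Id / @GroupSize,
               MIN(Timestmp), MAX(Timestmp),
               SUM(X), SUM(Y),
               COUNT(*)
        FROM LargeTable
        WHERE Id > @LastProcessedId
          AND Id <= @NewLastProcessedId
        GROUP BY Id / @GroupSize;

        UPDATE SummaryProcessExecutions
        SET LastProcessedId = @NewLastProcessedId;
    END;
END;
Scheduled from a SQL Server Agent job, each run then only touches the rows added since the previous run.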
I believe that grouping by date would be a better choice than grouping by Id; it would simplify things. The SummarizedData GroupId column would not be related to the LargeTable Id, and you would not need to update SummarizedData rows, only insert them.
Since the time to scan the table increases with the number of rows in it, I assume there is no index on the Timestmp column. An index like the one below may speed up your query:
CREATE NONCLUSTERED INDEX [IDX_Timestmp] ON [LargeTable](Timestmp) INCLUDE(X, Y, Id)
Please note that creating such an index may take a significant amount of time, and it will impact your inserts too.

Save result of select statement into wide table SQL Server

I have read about the possibility to create wide tables (30,000 columns) in SQL Server (1).
But how do I actually save the result of a select statement (one that has 1024+ columns) into a wide table?
Because if I do:
Select *
Into wide_table
From (
    -- select statement with 1024+ columns
) b
I get: CREATE TABLE failed because column 'c157' in table 'wide_table' exceeds the maximum of 1024 columns.
And will I be able to query that table and all its columns in a regular manner?
Thank you for your help!
You are right that you are allowed to create a table with 30,000 columns, but you can SELECT or INSERT 'only' 4,096 columns in one statement.
So, in the case of a SELECT, you will need to get the columns in parts or concatenate the results. None of this seems practical, easy, or performance-efficient.
If you are going to have so many columns, it may be better to UNPIVOT the data and normalize it further.
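For illustration, a minimal UNPIVOT sketch; the wide_source table and its c1..c3 columns are hypothetical, and all unpivoted columns must have compatible types:
SELECT id, col_name, col_value
FROM (
    SELECT id, c1, c2, c3
    FROM wide_source
) p
UNPIVOT (
    col_value FOR col_name IN (c1, c2, c3)
) AS u;
Each source row then becomes one output row per unpivoted column, which keeps the table narrow no matter how many attributes you add.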

sql select performance (Microsoft)

I have a very big table with many rows (50 million) and more than 500 columns. Its indexes are period and client. I need to retrieve, for one period, the client and another column that is not part of an index. It takes too much time, so I'm trying to understand why.
If I do:
select count(*)
from table
where cd_periodo=201602
It takes less than 1 second and returns a count of about 2 million.
If I select the period into a temp table, it also takes almost no time (2 secs):
select cd_periodo
into #table
from table
where cd_periodo=201602
But if I select another column that is not part of an index, it takes more than 3 minutes:
select not_index_column
into #table
from table
where cd_periodo=201602
Why is this happening? I'm not filtering on that column.
When you select only indexed columns, the engine doesn't have to go to the table and read entire rows; the index alone is enough to answer the query.
When you select a non-indexed column, the opposite happens: the engine has to read the underlying table rows to get the values for that column.
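One common fix, sketched here with a placeholder table name standing in for the one in the question, is a covering index that INCLUDEs the extra column:
CREATE NONCLUSTERED INDEX IX_bigtable_periodo_covering
ON dbo.bigtable (cd_periodo)
INCLUDE (not_index_column);
With such an index in place, the third query can again be answered from the index alone, without touching the base table rows.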

query plan shows cost of 54% for an insert when no rows actually involved

In one of my queries there's an insert of data into a temp table. Looking at the query plan, it shows that the actual insert into the temp table took 54% of the cost (just inserting data into the temp table). However, no rows are being inserted into the temp table.
Why does the plan show a non-zero cost when no rows are being inserted?
Even in the actual query plan, the subtree costs shown are based on estimates as well as various heuristics and magic numbers used by the cost-based optimiser. They can be woefully wrong and should be taken with a big pinch of salt.
Example to reproduce
create table #t
(
    i int
)

insert into #t
select number
from master.dbo.spt_values
where number = 99999999
The actual insert was zero rows, but the estimate was 1 row, which is where the subtree cost comes from.
Edit: I just tried the following
insert into #t
select top 0 number
from master.dbo.spt_values
where number = 99999999
Even when it gets the estimated number of rows right, it still assigns a small non-zero cost to the insert. I guess the heuristic it uses always includes some small fixed cost.
Take a look at this:
insert into #temp
select * from sometable
where left(Somecol,3) = 'BLA'
That is not sargable, so it will cause a scan; even if no rows are found and the insert doesn't happen, the scan still happens.
But if you did this instead, the cost should drop dramatically, because now an index on Somecol can be used:
insert into #temp
select * from sometable
where Somecol like 'BLA%'
BTW, I would use SET STATISTICS TIME and SET STATISTICS IO instead to measure performance; those two are much better indicators. When you see 3 reads vs 10,000 reads, you know what is happening. What does 54% tell you, exactly, when the whole process could run for 3 minutes or 3 seconds?
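For example, reusing the hypothetical sometable/Somecol from above:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

insert into #temp
select * from sometable
where Somecol like 'BLA%';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
The messages output then reports logical reads and CPU/elapsed time for the statement, which tells you far more than a relative cost percentage.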
A cost of 54% doesn't mean that rows need to be involved. Perhaps there was an index scan or other seek operation, or perhaps some non-optimal lookup in the WHERE clause of that INSERT into the temp table?
