I have a very big table (50 million rows, more than 500 columns). It is indexed on period and client. For a given period, I need to keep the client and one other column that is not part of any index, and it takes too much time. So I'm trying to understand why:
If I do:
select count(*)
from table
where cd_periodo=201602
It takes less than 1 sec and returns a count of about 2 million.
If I select just the period column into a temp table, it also takes almost no time (2 secs):
select cd_periodo
into #table
from table
where cd_periodo=201602
But if I select another column that's not part of an index, it takes more than 3 minutes:
select not_index_column
into #table
from table
where cd_periodo=201602
Why is this happening? I'm not doing any filter on the column.
When you select only an indexed column, the engine doesn't have to go into the table and read entire rows: every value it needs is already stored in the index, so it can answer the query from the index alone.
When you select a non-indexed column, the opposite happens: the engine has to read the full row for every match (scanning the table or doing a lookup per row) just to get the value of that column.
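In this case, the count(*) and the cd_periodo-only select can be answered entirely from the index on cd_periodo, while selecting not_index_column forces a read of every qualifying row. If that query matters, a covering index is one way to avoid those row reads. This is only a sketch, assuming SQL Server syntax; the index name is made up:
-- Covering index: the non-indexed column is carried in the index leaf,
-- so the query no longer has to touch the base table
CREATE NONCLUSTERED INDEX IX_table_cd_periodo_covering
ON dbo.[table] (cd_periodo)
INCLUDE (not_index_column);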
I'm using the SQL Server Export Wizard to migrate 2 million rows over to a Postgres database. After 10 hours, it had gotten to 1.5 million records and then it quit. Argh.
So I'm thinking the safest way to get this done is to do it in batches. 100k rows at a time. But how do you do that?
Conceptually, I'm thinking:
SELECT * FROM invoices WHERE RowNum BETWEEN 300001 AND 400000
But RowNum doesn't exist, right? Do I need to create a new column and somehow get a +1 incremental ID in there that I can use in a statement like this? There is no primary key and there are no columns with unique values.
Thanks!
The rows are invoices, so I created a new column, QUARTILE, that divides the invoice dollar values into quartiles using:
SELECT *,
NTILE(4) OVER(ORDER BY TOTAL_USD) AS QUARTILE
INTO invoices2
FROM invoices
This created four groups of 500k rows each. Then in the export wizard, I asked to:
SELECT * FROM invoices2 WHERE QUARTILE = 1 -- (or 2, 3, 4 etc)
And I'm going to send each group of 500k rows to its own Postgres table and then merge them back together in pgAdmin. That way, if any one batch crashes, I can just redo that smaller group without affecting the integrity of the others. Does that make sense? Maybe it would have been just as easy to create an incrementing primary key?
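For reference, an incrementing row number can be added even when there is no natural key, by numbering over a constant. This is only a sketch; invoices3 and RowNum are made-up names:
-- Assign an arbitrary but persisted row number (no natural key needed)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum, *
INTO invoices3
FROM invoices;

-- Then export in fixed-size batches
SELECT * FROM invoices3 WHERE RowNum BETWEEN 300001 AND 400000;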
Update:
All four batches transferred successfully. Worth noting that the total transfer was 4x faster when sending the 2M rows as four simultaneous batches of 500k (4 hours instead of 16!). I combined them back into a single table using the following query in pgAdmin:
--Combine tables back into one, total row count matches original
SELECT * INTO invoices_all FROM (
SELECT * FROM quar1
UNION All
SELECT * FROM quar2
UNION All
SELECT * FROM quar3
UNION All
SELECT * FROM quar4
) as tmp
Then I checked the sums of all the columns that had to be converted from SQL Server "money" to Postgres "numeric":
--All numeric sums match those from original
SELECT SUM("TOTAL_BEFORE_TP_TAX")
,SUM("TP_SELLER")
,SUM("TOTAL_BEFORE_TAX")
,SUM("TAX_TOTAL")
,SUM("TOTAL")
,SUM("TOTAL_BEFORE_TP_TAX_USD")
,SUM("TP_SELLER_USD")
,SUM("TOTAL_BEFORE_TAX_USD")
,SUM("TAX_TOTAL_USD")
,SUM("TOTAL_USD")
FROM PUBLIC.invoices_all
I have a very big table with about 6.5 million records. When I try to select even 10 records from it, I have to wait a long and unpredictable time.
SELECT [Column1], [Column2], [Column3], [Column4], [Column5]
FROM [table]
WHERE deviceDataId = '640'
ORDER BY id ASC OFFSET 10 ROWS
FETCH NEXT 10 ROWS ONLY
My database is deployed on Azure. I also downloaded it and deployed it on my local system, but it takes the same amount of time.
Query execution plan: (screenshot not included)
Now that we have your real query, we can see that you appear to have no index on the column deviceDataId, meaning that the entire table needs to be scanned. So even though you only want 10 rows, all 6.5M rows have to be read and the value of deviceDataId checked for each.
If you create an index on deviceDataId and, at a minimum, INCLUDE the other columns in your query, you'll have a covering index, which will help greatly. This assumes id is the column your CLUSTERED INDEX is keyed on.
CREATE NONCLUSTERED INDEX IX_Table_DeviceDataId_Cols1_5
ON dbo.[table] (deviceDataId)
INCLUDE ([Column1], [Column2], [Column3], [Column4], [Column5]);
Also, as I note in the comments, if deviceDataId is an int, use an int value in your WHERE clause. Don't wrap numeric values in single quotes; that is for string literals.
WHERE deviceDataId = 640
I have a table with the following values:
location    date      count
2150        4/5/14    100
Now I need to insert 100 rows into another table. The table should have 100 rows of:
location    date
2150        4/5/14
Help me in achieving this. My database is Netezza.
Netezza has a system view, _v_vector_idx, that has 1024 rows, each with an idx value from 0 to 1023. You can exploit this to drive an arbitrary number of rows by joining to it. Note that this approach requires you to have determined some reasonable upper limit in order to know how many times to join to _v_vector_idx.
INSERT INTO target_table
-- select the two target columns explicitly (SELECT * would also pull in count and b.idx)
SELECT a.location,
       a.date
FROM base_table a
JOIN _v_vector_idx b
  ON b.idx < 100;
Then if you want to drive it based on the third column from the base_table, you could do this:
INSERT INTO target_table
SELECT a.location,
       a.date
FROM base_table a
JOIN _v_vector_idx b
  ON b.idx < a.count;
You could also take a procedural approach and create a stored procedure if you don't have a feel for what that reasonable upper limit might be.
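Alternatively, if the count could exceed 1024, one workaround (just a sketch, untested here) is to cross-join _v_vector_idx to itself, which can drive up to 1024 * 1024 rows:
INSERT INTO target_table
SELECT a.location,
       a.date
FROM base_table a
-- derive up to 1,048,576 distinct idx values from the 1024-row view
JOIN (SELECT b1.idx * 1024 + b2.idx AS idx
      FROM _v_vector_idx b1
      CROSS JOIN _v_vector_idx b2) b
  ON b.idx < a.count;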
I have a view that may contain more than one row, looking like this:
[rate] | [vendorID]
8374   | 1234
6523   | 4321
5234   | 9374
In a SPROC, I need to set a param equal to the value of the first column from the first row of the view. Something like this:
DECLARE @rate int;
SET @rate = (select top 1 rate from vendor_view where vendorID = 123)
SELECT @rate
But this ALWAYS returns the LAST row of the view.
In fact, if I simply run the subselect by itself, I only get the last row.
With 3 rows in the view, TOP 2 returns the FIRST and THIRD rows, in order. With 4 rows, it returns the top 3 in order. Yet TOP 1 still returns the last row.
DERP?!?
This works:
DECLARE @rate int;
CREATE TABLE #temp (vRate int)
INSERT INTO #temp (vRate) (select rate from vendor_view where vendorID = 123)
SET @rate = (select top 1 vRate from #temp)
SELECT @rate
DROP TABLE #temp
...but can someone tell me why the first behaves so fudgely, and how to do what I want? As explained in the comments, there is no meaningful column by which I can do an ORDER BY. Can I force the order in which rows are inserted to be the order in which they are returned?
[EDIT] I've also noticed that: select top 1 rate from ([view definition select]) also returns the correct values time and again.[/EDIT]
That is by design.
If you don't specify how the query should be sorted, the database is free to return the records in whatever order is convenient. A table has no natural order that serves as a default sort order.
What the order will actually be depends on how the query is planned, so you can't even rely on the same query giving a consistent result over time, as the database will gather statistics about the data and may change how the query is planned based on that.
To get the record that you expect, you simply have to specify how you want them sorted, for example:
select top 1 rate
from vendor_view
where vendorID = 123
order by rate
I ran into this problem on a query that had worked for years. We upgraded SQL Server and all of a sudden, an unordered select top 1 was not returning the final record in a table. We simply added an order by to the select.
My understanding is that, when no ORDER BY is provided, SQL Server will generally return results based on the clustered index, or on whatever index the engine picks. But this is not a guarantee of any particular order.
If you don't have something to order off of, you need to add it. Either add a date inserted column and default it to GETDATE() or add an identity column. It won't help you historically, but it addresses the issue going forward.
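For example, something along these lines (only a sketch; dbo.some_table is a placeholder name):
-- Add an identity column to give future queries something to order by
ALTER TABLE dbo.some_table ADD row_id int IDENTITY(1,1);

-- Or record the insertion time going forward (existing rows stay NULL)
ALTER TABLE dbo.some_table ADD inserted_at datetime DEFAULT GETDATE();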
While it doesn't necessarily make sense that the results of the query should be consistent, in this particular instance they are, so we decided to leave it 'as is'. Ultimately it would be best to add a column, but this was not an option. The application this belongs to is slated to be discontinued sometime soon, and the database server will not be upgraded from SQL 2005. I don't necessarily like this outcome, but it is what it is: until it breaks, it shall not be fixed. :-x
In one of my queries there's an insert of data into a temp table. Looking at the query plan, it shows that the actual insert into the temp table took 54% of the cost (just inserting data into the temp table). However, no rows are being inserted into the temp table.
Why does the plan show a non zero value when no rows are being inserted?
Even in the actual query plan, the subtree costs shown are based on estimates, as well as various heuristics and magic numbers used by the cost-based optimiser. They can be woefully wrong and should be taken with a big pinch of salt.
Example to reproduce
create table #t
(
i int
)
insert into #t
select number
from master.dbo.spt_values
where number = 99999999
The actual insert was zero rows, but the estimate was 1 row, which is where the subtree cost comes from.
Edit: I just tried the following
insert into #t
select top 0 number
from master.dbo.spt_values
where number = 99999999
Even when it gets the estimated number of rows right, it still assigns a small non-zero cost to the insert. I guess the heuristic it uses always includes some small fixed-cost element.
Take a look at this:
insert into #temp
select * from sometable
where left(Somecol,3) = 'BLA'
That is not sargable, so it will cause a scan; if no rows are found, the insert doesn't happen, but the scan still does.
But if you did this instead, the cost should drop dramatically, because now the index can be used:
insert into #temp
select * from sometable
where Somecol like 'BLA%'
BTW, I would use STATISTICS TIME and STATISTICS IO instead to measure performance; those two are much better indicators. When you see 3 reads vs 10,000 reads, you know what is happening. What does 54% tell you, exactly, when the whole process could run for 3 minutes or 3 seconds?
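For example, wrapping the statement like this (reusing the table names from above):
-- Report actual CPU/elapsed time and logical reads for each statement
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

insert into #temp
select * from sometable
where Somecol like 'BLA%'

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
The messages tab will then show the logical reads and CPU/elapsed times for the insert, which is far more actionable than a relative cost percentage.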
A cost of 54% doesn't mean that rows need to be involved. Perhaps there was an index scan or other seek operation, or some non-optimal lookup in the WHERE clause of that INSERT into the temp table?