I have a table with over 40 million records, and I find reads from this table incredibly slow.
The table has around 70 columns, and most of them have ordinary data types such as nvarchar(20), int, bit, etc. Only a few (3-5 of them) are nvarchar(1000) or nvarchar(4000).
If I select the TOP 1 row, I see an IO cost of over 2000.
When I select everything from the table, it takes more than an hour, without any transformation.
Is that normal? Is there any way to improve it?
If I could lower my IO cost, maybe that would improve performance.
Each nvarchar(4000) can use up to 8,000 bytes, and 8 KB is one page, so 5 columns of nvarchar(4000) can potentially require reading 5 pages. 70 columns of nvarchar(20) come to 2,800 bytes. The result is that one row can use up to 6 pages.
40m rows (not records; that is a COBOL term) will use a maximum of 40,000,000 x 6 pages = 240m pages.
Consider yourself lucky that your query costs just 2816 in terms of IO with such an obese table.
Do you really think the design of your table respects normalization theory?
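The worst-case arithmetic above can be sketched in a few lines of Python. This only restates the answer's own figures (one 8 KB page per nvarchar(4000) column, 70 columns counted at 40 bytes each); the function and variable names are made up for this sketch, and real SQL Server storage (row overflow, page headers) is more nuanced.

```python
import math

PAGE_BYTES = 8 * 1024  # SQL Server data page size (8 KB)

def worst_case_pages_per_row(wide_cols: int, narrow_cols: int) -> int:
    """Upper bound on pages touched by one row, following the answer's reasoning.

    Assumptions (taken from the answer, not measured):
    - each nvarchar(4000) column can fill a page of its own (4000 chars x 2 bytes)
    - every other column is counted as nvarchar(20), i.e. 40 bytes
    """
    narrow_bytes = narrow_cols * 20 * 2
    narrow_pages = math.ceil(narrow_bytes / PAGE_BYTES)
    return wide_cols + narrow_pages

pages_per_row = worst_case_pages_per_row(wide_cols=5, narrow_cols=70)
total_pages = 40_000_000 * pages_per_row
print(pages_per_row, total_pages)
```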
My application (industrial automation) uses SQL Server 2017 Standard Edition on a Dell T330 server with the following configuration:
Xeon E3-1200 v6
16gb DDR4 UDIMMs
2 x 2tb HD 7200RPM (Raid 1)
In this database, I am saving the following tables:
Table: tableHistory
Insert Range: Every 2 seconds
410 columns type float
409 columns type int
--
Table: tableHistoryLong
Insert Range: Every 10 minutes
410 columns type float
409 columns type int
--
Table: tableHistoryMotors
Insert Range: Every 2 seconds
328 columns type float
327 columns type int
--
Table: tableHistoryMotorsLong
Insert Range: Every 10 minutes
328 columns type float
327 columns type int
--
Table: tableEnergy
Insert Range: Every 700 milliseconds
220 columns type float
219 columns type int
Note:
When I generate reports/graphs, my application holds the inserts in a buffer, because the system cannot insert and query at the same time; the queries are heavy.
The columns hold values of current, temperature, level, etc. This information is kept for one year.
Question
With this level of processing can I have any performance problems?
Do I need better hardware due to high demand?
Can my application break at some point due to the hardware?
Your question may be closed as too broad but I want to elaborate more on the comments and offer additional suggestions.
How much RAM you need for adequate performance depends on the reporting queries. Factors include the number of rows touched, execution plan operators (sort, hash, etc.), number of concurrent queries. More RAM can also improve performance by avoiding IO, especially costly with spinning media.
A reporting workload (large scans) against a 1-2TB database with traditional tables needs fast storage (SSD) and/or more RAM (hundreds of GB) to provide decent performance. The existing hardware is the worst-case scenario because data are unlikely to be cached with only 16GB RAM and a single spindle can only read about 150MB per second. Based on my rough calculation of the schema in your question, a monthly summary query of tableHistory will take about a minute just to scan 10 GB of data (assuming a clustered index on a date column). Query duration will increase with the number of concurrent queries, such that it would take at least 5 minutes per query with 5 concurrent users running the same query, due to disk bandwidth limitations. SSD storage can sustain multiple GB per second so, with the same query and RAM, the data transfer time for the query above will be under 5 seconds.
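The scan-time figures above follow from simple bandwidth arithmetic. The sketch below is one way to reproduce them in Python, not necessarily the calculation the answer used: the row layout (8-byte float, 4-byte int, one row per 8 KB page because two such rows do not fit on a page) and the 3 GB/s SSD throughput are assumptions of this sketch, and the function names are made up.

```python
PAGE_BYTES = 8192        # SQL Server data page size
HDD_MB_PER_S = 150       # single-spindle sequential read rate quoted above
SSD_MB_PER_S = 3000      # assumed SSD throughput ("multiple GB per second")

def monthly_scan_gb(float_cols, int_cols, insert_interval_s, days=30):
    """Rough size of one month of rows in a rowstore table, in GB (10^9 bytes)."""
    row_bytes = float_cols * 8 + int_cols * 4        # float = 8 bytes, int = 4 bytes
    rows_per_page = max(1, PAGE_BYTES // row_bytes)  # ignores page and row headers
    rows = days * 24 * 3600 / insert_interval_s
    return rows / rows_per_page * PAGE_BYTES / 1e9

def scan_seconds(size_gb, mb_per_s, concurrent_queries=1):
    """Time to read size_gb when disk bandwidth is shared evenly between queries."""
    return size_gb * 1000 / (mb_per_s / concurrent_queries)

size_gb = monthly_scan_gb(float_cols=410, int_cols=409, insert_interval_s=2)
print(f"one month of tableHistory: ~{size_gb:.1f} GB")
print(f"HDD, 1 query:   {scan_seconds(10, HDD_MB_PER_S):.0f} s")
print(f"HDD, 5 queries: {scan_seconds(10, HDD_MB_PER_S, 5) / 60:.1f} min each")
print(f"SSD, 1 query:   {scan_seconds(10, SSD_MB_PER_S):.1f} s")
```

Under these assumptions a month of data comes to roughly 10 GB, which takes about a minute on one spindle, over five minutes each for five concurrent scans, and a few seconds on SSD.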
A columnstore (e.g. a clustered columnstore index), as suggested by @ConorCunninghamMSFT, will greatly reduce the amount of data transferred from storage, because only data for the columns specified in the query are read and inherent columnstore compression will reduce both the size of data on disk and the amount transferred from disk. The compression savings will depend heavily on the actual column values, but I'd expect 50 to 90 percent less space compared to a rowstore table.
Reporting queries against measurement data are likely to specify date range criteria, so partitioning the columnstore by date will limit scans to the specified date range without a traditional b-tree index. Partitioning will also facilitate purging for the 12-month retention criteria with sliding window partition maintenance (partition TRUNCATE, MERGE, SPLIT) and thereby greatly improve performance of the process compared to a delete query.
I'm using SQL Server 2016 and I have a table in my database whose size is 120 GB. It has 300 columns, all of them NVARCHAR(MAX), and it holds 1,200,000 records. Mostly, 100 of the columns are NULL all the time or hold only a short value. My question is: why do 1,200,000 records take 120 GB? Is it because of the data type?
This is an audit table that holds CDC historical information. On average, about 10,000 records are inserted into it per day. Because of this, my database size is increasing and SQL queries are slow. The table is not used for any queries.
Please let me know why my table is so big.
Of course, it depends on how you are measuring the size of the table and what other operations occur.
You are observing about 100,000 bytes per record (120 GB across 1,200,000 records). That does seem large, but there are things you need to consider.
NVARCHAR(MAX) has a minimum size:
nvarchar [ ( n | max ) ]
Variable-length Unicode string data. n defines the string length and
can be a value from 1 through 4,000. max indicates that the maximum
storage size is 2^31-1 bytes (2 GB). The storage size, in bytes, is
two times the actual length of data entered + 2 bytes. The ISO
synonyms for nvarchar are national char varying and national character
varying.
Even the empty fields occupy 2 bytes plus the nullable flag (one bit per field). With 300 fields, that is 600-plus bytes right there (600 + 300 / 8).
You may also have issues with pages that are only partially filled. This depends on how you insert data, the primary key, and system parameters.
And there are other considerations, depending on how you are measuring the size:
How large are the largest fields?
How often are rows occupying multiple pages (each additional page has additional overhead)?
You are using wide characters, so values take up more space than their character count suggests.
Is your estimate including indexes?
If you are measuring database size, you may be including log tables.
I would suggest that you have your DBA investigate the table to see if there are any obvious problems, such as many pages that are only partially filled.
Edit: updated answer upon clarification of the number of rows that the table really has.
Taking into account that 120 GB is about 120,000 MB, you are getting 100 KB per row, which is about 330 bytes per column on average. That would usually be quite high, but not for a table with 300 nvarchar(max) columns (note that the nchar and nvarchar types take 2 bytes per character, not 1).
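As a quick check of those averages, here is the same division written out in Python. It uses decimal units (1 GB = 10^9 bytes), as the estimate above does; the variable names are just for this sketch.

```python
TABLE_BYTES = 120 * 1000**3   # 120 GB, decimal units as in the estimate above
ROWS = 1_200_000
COLUMNS = 300

bytes_per_row = TABLE_BYTES / ROWS
bytes_per_column = bytes_per_row / COLUMNS
chars_per_column = bytes_per_column / 2   # nvarchar stores 2 bytes per character

print(f"{bytes_per_row:,.0f} bytes per row")
print(f"{bytes_per_column:.0f} bytes per column on average")
print(f"{chars_per_column:.0f} characters per column on average")
```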
You also commented that one of those columns holds 2,000-90,000 characters (!). Supposing that column averages 46k characters, we get a size of:
1,200,000 rows x 46k chars x 2 byte/char = 105GB only for the data of that column.
That leaves 15GB for the rest of columns, or about 13KB per row, which is 44 bytes per column, quite low taking into account that almost all are nvarchar(max).
But those are only estimates. To get the real size of any column, use:
select sum(datalength(ColumnName))/1024.00 as SizeKB from TableName
All of this only takes the data into account, which is not accurate, because the database structures need space of their own. For example, indexes add to the total size of a table; roughly, they take the sum of the sizes of the columns included in the index (for example, if you defined an index on the big column, it would take another 100GB).
You can find out how much space the whole table uses with the following script from another question (it shows the size of each table in the DB):
Get size of all tables in database
Check the column UsedSpaceMB, which is the size needed for the data and the indexes. If for some reason the table is using more space (usually because you deleted data), that size shows up in UnusedSpaceMB (a bit of unused space is normal).
I execute a simple query:
SELECT * FROM TABLE1
WHERE ID > 9 AND ID < 11
and the query verbose plan is:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1, ...
But after changing the where clause to
WHERE ID = 10
the query verbose plan changes:
[SPU Sequential Scan table "TABLE1" {(TABLE1."ID")}]
-- Estimated Rows = 1000, ...
(where 1000 is the total number of rows in TABLE1).
Why is it so? How does the estimation work?
The optimizer of any cost-based database is always full of surprises, and this one is not unusual across the platforms I'm familiar with.
A couple of questions:
- Have you created statistics on the table? (Otherwise you are flying blind.)
- What is the datatype of that column? (I hope it is an integer of some sort, not a NUMBER(x,y), even if y=0.)
Furthermore:
The statistics for a column in Netezza contain no distribution statistics (it won't know whether there are more "solved" than "unsolved" cases in a support-system table with 5 years' worth of data). Instead it relies on two things:
1) for all tables: simple statistics if you create them (number of distinct values, max+min values, number of nulls)
2) for largish tables (I think the configurable minimum is close to 100 million rows): JIT (Just In Time) statistics, which it creates for this one query by scanning a few random data pages on the dataslices, all of which satisfy the zone-mappable where clauses.
The last feature is actually quite powerful, even though it adds runtime to the planning phase of the query. It significantly increases the likelihood that, if there is SOME correlation between two where clauses on a table, it will be taken into account.
An example: a whereclause on (AGE>60 and Retired=true) in a list of all citizens in a major city. It is most likely more or less irrelevant to add the AGE restriction, and Netezza will know that.
In general you should not worry about the estimated number of rows being a bit off (as in this case) with Netezza; it will most often get it "right enough" and throw hardware at the problem to compensate for any minor mistakes.
Until recently I worked with SQL Server, which is notorious (it may be better in newer versions) for being overly optimistic about the value of where clauses and ending up with access plans containing 5 levels of nested-loop joins, with millions of rows in each, when joining 6 tables. Changing where clauses much like you did in the question will cause SQL Server to put LESS emphasis on a specific restriction, and that can turn the 5 joins into a more efficient HASH or other algorithm, resulting in better performance. In my experience that is MUCH too frequent an occurrence on databases that rely TOO heavily on these estimates, perhaps because the optimizer was not created/tuned for a warehouse-like workload.
Does the number of records in a db affect the speed of select queries?
I mean, if one db has 50 records and another has 5 million records, will selects from the second one be slower? Assume I have all the indexes in the right place.
Yes, but it doesn't have to be a large penalty.
At the most basic level an index is a b-tree. Performance is somewhat correlated to the number of levels in the b-tree, so (treating it as a binary tree) a 5 record database has about 2 levels and a 5 million record database has about 22 levels. But growth is logarithmic, so a 10 million row database has only about 23 levels, and really, index access times are typically not the problem in performance tuning. The usual problem is tables that aren't indexed properly.
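The level counts above are roughly log2 of the row count. Here is a quick Python check that treats the index as a binary tree, as the answer does. The fan-out of 100 in the second function is purely an assumed figure, included to illustrate that real b-tree pages hold many keys each, which makes the tree much shallower still.

```python
import math

def binary_levels(rows: int) -> int:
    """Approximate depth of a balanced binary tree holding `rows` keys."""
    return round(math.log2(rows))

def btree_levels(rows: int, fanout: int = 100) -> int:
    """Levels needed when each page holds `fanout` keys (fanout is an assumption)."""
    return max(1, math.ceil(math.log(rows, fanout)))

for rows in (5, 5_000_000, 10_000_000):
    print(rows, binary_levels(rows), btree_levels(rows))
```

Doubling the row count adds one level in the binary model and, with a fan-out of 100, no level at all in this range.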
As noted by odedsh, caching is also a large contributor, and small databases will be cached well. Sqlite stores records in primary key sequence, so picking a primary key that allows records that are commonly used together to be stored together can be a big benefit.
Yeah, it matters for the reasons the others gave.
There are other things that can affect the speed of SELECT statements too, such as how many columns you're grabbing data from.
I once did some speed tests on a table with over 150 columns, where I needed only about 40 of the columns but all 20,000+ records. While the speed differences were minimal (we're talking 20 to 40 milliseconds), it was actually faster to grab the data from all the columns with 'SELECT ALL *' than with 'SELECT ALL Field1, Field2, etc'.
I assume the more records and columns in your table, the greater the speed difference this example will net you, but I never had a need to test it any further in more extreme cases like 5 million records in a table.
Yes.
If a table is tiny and the entire db is tiny, then when you select anything from the table it is very likely that all the data is already in memory and the result can be returned immediately.
If the table is huge but you have an index and you are doing a simple select on the indexed columns, then the index can be scanned, the correct blocks read from disk, and the result returned.
If there is no index that can be used, then the db will do a full table scan, reading the table block by block looking for matches.
If there is a partial match between the index columns and the select query columns, then the db can try to minimize the number of blocks that have to be read. A lot of thought can go into properly choosing the index structure and type (BITMAP / REGULAR).
And this is just for the most basic SQL that selects from a single table without any calculations.
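The difference between the full-scan case and the indexed case can be observed directly. The sketch below uses SQLite through Python's standard sqlite3 module as a stand-in for whatever database you run; the table, column and index names are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [(f"s{i % 100}", float(i)) for i in range(10_000)],
)

def plan(sql: str) -> str:
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the detail text

query = "SELECT * FROM readings WHERE sensor = 's42'"
plan_without_index = plan(query)   # no usable index: full table scan
conn.execute("CREATE INDEX idx_readings_sensor ON readings (sensor)")
plan_with_index = plan(query)      # index lookup, then fetch the matching rows

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of the whole table; the second reports a search using idx_readings_sensor.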
I have a table with 40 million rows.
I want to pick up about 2 million at a time and "process" them.
Why?
Because processing 10 million+ rows degrades performance and often times out. (I need this to work independent of data size, so I can't just keep increasing the timeout limit.)
Also, I'm using SQL Server.
Is there an increasing key, such as an identity key? And is it the clustered index? If so, it should be fairly simple to track the last key you got to, and do things like:
SELECT TOP 1000000 *
FROM [MyTable]
WHERE [Id] > @LastId
ORDER BY [Id]
Also - be sure to read it with something like ExecuteReader, so that you aren't buffering too many rows.
Of course, beyond a few thousand rows, you might as well just accept the occasional round-trip, and make a number of requests for (say) 10000 rows at a time. I don't think this would be any less efficient in real terms (a few milliseconds here and there).
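The "remember the last key and continue from there" loop can be sketched end to end as follows. This is a minimal illustration in Python with the standard sqlite3 module rather than SQL Server: SQLite's LIMIT stands in for TOP, iterating the cursor stands in for ExecuteReader, and the Payload column, the batch size and the function name are all hypothetical.

```python
import sqlite3

def process_in_batches(conn, batch_size, handle_row):
    """Walk MyTable in Id order, batch_size rows at a time (keyset pagination)."""
    last_id = 0      # assumes Id values are positive and increasing
    batches = 0
    while True:
        cursor = conn.execute(
            "SELECT Id, Payload FROM MyTable WHERE Id > ? ORDER BY Id LIMIT ?",
            (last_id, batch_size),
        )
        seen = 0
        for row in cursor:       # rows are consumed one by one, not buffered in a list
            handle_row(row)
            last_id = row[0]     # remember the last key reached
            seen += 1
        if seen == 0:            # nothing left past last_id
            return batches
        batches += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Id INTEGER PRIMARY KEY, Payload TEXT)")
conn.executemany(
    "INSERT INTO MyTable (Payload) VALUES (?)",
    [(f"row {i}",) for i in range(2500)],
)

processed = []
batch_count = process_in_batches(conn, batch_size=1000, handle_row=processed.append)
print(batch_count, len(processed))
```

Because each batch starts from the last key rather than from an offset, every query is a short range seek on the clustered key no matter how far into the table the process has got.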