Does record size affect SQL performance? - sql-server

Our team is using Microsoft SQL Server, accessed using Entity Framework Core.
We have a table with 5-40 million records anticipated, which we want to optimize for high-velocity record create, read and update.
Each record is small and efficient:
5 integer columns (one of which is the indexed primary key)
3 bit columns
1 datetime column
plus 2 varchar(128) columns - substantially larger than the others.
The two varchar columns are populated during creation, but used in only a tiny minority of subsequent reads, and never updated. Assume 10 reads and 4 updates per create.
Our question is: does it improve performance to put these larger columns in a different table (imposing a join penalty on create, but on only a tiny minority of reads), versus writing two stored procedures - one which retrieves the non-varchar columns for the majority of queries, and one which retrieves all columns when required?
Put another way: how much does individual record size affect SQL performance?

does it improve performance to put these larger fields in a different table (imposing a join penalty for create, but only a tiny minority of reads)
A much better alternative is to create indexes which exclude the larger columns and which support frequent access patterns.
The larger columns on the row will have very little cost on single-row operations on the table, but will substantially reduce the row density on the clustered index. So if you have to scan the clustered index, having large, unused columns drives up the IO cost of your queries. That's where an appropriate non-clustered index can offload any scanning operations away from the clustered index.
But, as always, you should simply test. 40M rows are simple to generate; write your top few queries and test their performance with different combinations of indexes.
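As a rough sketch of that idea (the table and column names below are hypothetical, since the real schema isn't shown), a narrow non-clustered index keeps the frequently read integer/bit/datetime columns together and leaves the varchar columns out of those index pages entirely:
-- Hypothetical table: dbo.Widget(WidgetId PK, StatusId, OwnerId, TypeId, BatchId,
--   IsActive, IsLocked, IsArchived, CreatedAt, Note1 varchar(128), Note2 varchar(128))
-- Narrow index supporting a common "by owner and status" read path; because the
-- varchar columns are never referenced, they never inflate these index pages.
CREATE NONCLUSTERED INDEX IX_Widget_Owner_Status
    ON dbo.Widget (OwnerId, StatusId)
    INCLUDE (TypeId, IsActive, CreatedAt);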

About: How much does individual record size affect SQL performance?
It depends on the query that is being executed.
For example, in a SELECT statement, if the varchar columns are not included in any part of the query, then they have almost no effect on performance.
You could try both models and measure the queries in SSMS, using
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
Or analyze the query in the Database Engine Tuning Advisor.
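For instance, a quick side-by-side measurement could look like the following (table and column names are invented for illustration); comparing the narrow and wide SELECTs shows how much, if anything, the varchar columns add to the elapsed time and IO counts:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
GO
-- Narrow read: only the small fixed-width columns.
SELECT WidgetId, StatusId, OwnerId, IsActive, CreatedAt
FROM dbo.Widget
WHERE OwnerId = 42;
-- Wide read: same predicate, but pulling the two varchar columns as well.
SELECT WidgetId, StatusId, OwnerId, IsActive, CreatedAt, Note1, Note2
FROM dbo.Widget
WHERE OwnerId = 42;
GO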

Indexing on the fields which are not varchar can help
It depends on your mix of reads, writes, and updates
It depends on whether you are appending to the end of the table, as in real-time data collection
Entity Framework is most likely not a good fit for a simple table structure with millions of rows, especially if the table is not related to many other tables

Related

WHERE clause vs Smaller table

Is there a meaningful difference (or a rule of thumb for a given table size) in query time between a table with a WHERE clause limiting the result set and a smaller table whose size equals that post-WHERE, limited result set?
For example:
Scenario 1: Your table has records with timestamps spanning many years. You run a query that contains a WHERE clause limiting your result to the last 10 days only.
Scenario 2: Your table has only 10 days of data, and you run the same query as above (obviously without the WHERE clause since it's not necessary in this case).
Should I expect a query performance difference in the two scenarios above? Note that I'm using Redshift. Obviously there is a $$ cost savings of storing less data, which is one benefit of scenario 2. Any others?
It depends entirely on the table and the indexes (in the case of Redshift, the sort key). Traditionally, if you have a descending index on the timestamp and use the timestamp in the WHERE clause, then the query engine will pretty quickly find the records it needs and stop looking.
There may still be some benefit from having fewer records, perhaps even from maintaining two tables, but duplicating data should be a last resort, used only if testing shows that the performance benefit is real and necessary.
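As a minimal sketch of that traditional row-store case (names are hypothetical; SQL Server syntax shown, whereas Redshift itself would rely on the sort key rather than an index):
-- Hypothetical events table with a load_ts timestamp column.
CREATE INDEX ix_events_load_ts ON events (load_ts DESC);
-- The WHERE clause confines the work to the last 10 days, so the engine
-- can seek to that range and stop as soon as it leaves it.
SELECT *
FROM events
WHERE load_ts >= DATEADD(day, -10, GETDATE());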
In Redshift, the answer is yes: it is always quicker to query a smaller table than to use a WHERE clause on a larger table. This is because Redshift will generally scan all of the rows in the table, or at least those rows which are not excluded by the distribution/sort key optimisations.
Let's also address the other important aspects of this question:
In almost all cases Redshift storage is cheap, which means storage is usually not the deciding factor when capacity planning a Redshift cluster. It is more about getting the performance you need for the queries that you want to run.
You can improve the performance of Redshift queries in 4 ways:
Increase the size of the cluster.
Tune the query.
Alter the definition of the Redshift tables, taking into account contents and usage patterns. Sort and distribution keys can make a big difference (see the sketch below); compression types should also be considered.
Implement Redshift performance management, to give priority to higher-priority queries.
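On the table-definition point, a hedged Redshift sketch (table, column names, and the encoding choice are invented for illustration) might declare the keys so that time-range filters and the most common join both prune work:
-- Hypothetical Redshift table: sort on the timestamp used in range filters,
-- distribute on the column most often used to join, and pick a compression
-- encoding for the narrow integer column.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type SMALLINT ENCODE az64,
    load_ts    TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (load_ts);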

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could be more cost than benefit.
For example, there is no point creating indexes on a table with fewer than 100 rows (or even 1,000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I find. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question is very much about IO write performance. In scenarios like SQL Server HA synchronous-commit mode, the cost of each IO write is high when the database servers reside in data centers on different subnets. Adding indexes adds to that already expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
I am not sure that adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are used to join with bigger tables.
The cost of an index has two parts: storage space, and processing time during inserts and updates. The first is very cheap these days, so it can almost be disregarded. So your only real consideration is a table with a lot of updates and inserts, where you should apply the proper configuration.
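A minimal sketch of that join case, with hypothetical names: even though the lookup table is tiny, the index that comes with its primary key is what lets the optimizer do cheap seeks when the big table joins to it.
-- Small reference table, perhaps only a few hundred rows;
-- the primary key gives it an index essentially for free.
CREATE TABLE dbo.Status (
    StatusId   INT PRIMARY KEY,
    StatusName VARCHAR(50) NOT NULL
);
-- Large table joining to it: the small table's index still earns its keep here.
SELECT o.OrderId, s.StatusName
FROM dbo.Orders AS o
JOIN dbo.Status AS s ON s.StatusId = o.StatusId
WHERE o.CreatedAt >= '2024-01-01';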

SQL Server 2008 indexes - performance gain on queries vs. loss on INSERT/UPDATE

How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have a table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
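For instance, if the two lookup columns were, hypothetically, CustomerId and OrderDate, the candidate index and the lookup it serves might look like the sketch below; benchmarking the insert path with and without the index is what actually settles the question.
-- Candidate composite index for the pre-insert lookup (column names are assumptions).
CREATE NONCLUSTERED INDEX IX_Orders_Customer_Date
    ON dbo.Orders (CustomerId, OrderDate);
-- The lookup that runs alongside every insert.
DECLARE @CustomerId INT = 42, @OrderDate DATE = '2024-06-01';
SELECT TOP (1) OrderId
FROM dbo.Orders
WHERE CustomerId = @CustomerId
  AND OrderDate = @OrderDate;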
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete within a certain time.
Case 1: If the table is static and is queried heavily (e.g. an item table in a shopping cart application), then indexes on the appropriate fields are highly beneficial.
Case 2: If the table is highly dynamic and not much querying is done on a daily basis (e.g. log tables used for auditing purposes), then indexes will slow down the writes.
If the above two cases are the boundary cases, then whether or not to build indexes on a table depends on which case the table in question comes closest to.
If not, leave it to the judgement of the Database Engine Tuning Advisor. Good luck.

Best choice for a huge database table which contains only integers (have to use SUM() or AVG() )

I'm currently using a MySQL table for an online game under LAMP.
One of the tables is huge (soon millions of rows) and contains only integers (IDs, timestamps, booleans, scores).
I did everything to never have to JOIN on this table. However, I'm worried about scalability. I'm thinking about moving this single table to another, faster database system.
I use intermediary tables to calculate the scores, but in some cases I have to use SUM() or AVG() directly on some filtered rowsets of this table.
For you, what is the best database choice for this table?
My requirements/specs:
This table contains only integers (around 15 columns)
I need to filter by certain columns
I'd like to have UNIQUE KEYS
It could be nice to have "INSERT ... ON DUPLICATE KEY UPDATE", but I suppose my scripts can manage it by themselves.
I have to use SUM() or AVG()
Thanks
Just make sure you have the correct indexes in place, and selecting should be quick.
Millions of rows in a table isn't huge. You shouldn't expect any problems in selecting, filtering or upserting data if you index on relevant keys as #Tom-Squires suggests.
Aggregate queries (sum and avg) may pose a problem though. The reason is that they require a full table scan and thus multiple fetches of data from disk to memory. A couple of methods to increase their speed:
If your data changes infrequently then caching those query results in your code is probably a good solution.
If it changes frequently then the quickest way to improve performance is probably to ensure that your database engine keeps the table in memory; most engines allow you to tune that. A quick calculation of expected size: 15 columns x 8 bytes x millions of rows =~ hundreds of MB - not really an issue (unless you're on a shared host). If your RDBMS does not support tuning this for a specific table, then simply put the table in a different database schema; that shouldn't be a problem since you're not doing any joins on this table.
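A rough MySQL sketch under those assumptions (table, column, and index names invented): a composite index on the filter column plus the summed column lets the filtered aggregates read from the index rather than the whole table, and the upsert the question mentions maps onto INSERT ... ON DUPLICATE KEY UPDATE.
-- Hypothetical integer-only scores table.
CREATE TABLE player_scores (
    player_id  INT UNSIGNED NOT NULL,
    level_id   INT UNSIGNED NOT NULL,
    score      INT NOT NULL,
    updated_at INT UNSIGNED NOT NULL,       -- unix timestamp
    PRIMARY KEY (player_id, level_id),      -- the UNIQUE KEY requirement
    KEY idx_level_score (level_id, score)   -- covers filtered SUM()/AVG() on score
) ENGINE=InnoDB;
-- Filtered aggregate that can be served from idx_level_score alone.
SELECT SUM(score), AVG(score) FROM player_scores WHERE level_id = 7;
-- The "INSERT ... ON DUPLICATE KEY UPDATE" pattern from the question.
INSERT INTO player_scores (player_id, level_id, score, updated_at)
VALUES (123, 7, 990, UNIX_TIMESTAMP())
ON DUPLICATE KEY UPDATE score = VALUES(score), updated_at = VALUES(updated_at);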

Will SQL Server Partitioning increase performance without changing filegroups

Scenario: I have a 10 million row table. I partition it into 10 partitions, which results in 1 million rows per partition, but I do not do anything else (like move the partitions to different filegroups or spindles).
Will I see a performance increase? Is this in effect like creating 10 smaller tables? If I have queries that perform key lookups or scans, will the performance increase as if they were operating against a much smaller table?
I'm trying to understand how partitioning is different from just having a well indexed table, and where it can be used to improve performance.
Would a better scenario be to move the old data (using partition switching) out of the primary table to a read only archive table?
Is having a table with a 1 million row partition and a 9 million row partition analogous (performance wise) to moving the 9 million rows to another table and leaving only 1 million rows in the original table?
Partitioning is not a performance feature; it is for maintenance operations like moving data between tables and dropping data very fast. Partitioning a 10M row table into 10 partitions of 1M rows each not only won't increase the performance of most queries, it will likely make quite a few of them slower.
No query can operate against the smaller set of rows in a single partition unless it can be determined that the query only needs rows from that partition alone. And that can always be solved, and solved better, by properly choosing the clustered index on the table, or at least a good covering non-clustered index.
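To illustrate the kind of maintenance work partitioning is actually good at, here is a hedged sketch (names, types, and boundary values are invented) of switching an old month out to an archive table as a near-instant, metadata-only operation:
-- Monthly boundaries; with RANGE RIGHT, partition 2 holds January 2024.
CREATE PARTITION FUNCTION pf_OrderMonth (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');
-- Everything stays on the same filegroup, matching the question's scenario.
CREATE PARTITION SCHEME ps_OrderMonth
    AS PARTITION pf_OrderMonth ALL TO ([PRIMARY]);
-- Assuming dbo.Orders is built on ps_OrderMonth and dbo.OrdersArchive is an empty
-- table with an identical structure on the same filegroup, moving January out
-- touches no data pages:
ALTER TABLE dbo.Orders SWITCH PARTITION 2 TO dbo.OrdersArchive;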
Well, first, partitioning will only help you if you are using the Enterprise edition.
I believe it will improve your performance, although the actual benefit you'll get will depend on your specific workload (yeah, it always depends).
It is not exactly like creating 10 smaller tables, but if your queries align with the ranges of your partitions, only the data in the relevant partition will be touched. In those cases I think the performance improvement will be noticeable. In cases where queries run across the partition ranges, performance will be worse.
I think you'll have to test which solution fits your data best; in some cases partitioning will help you, and in others it will do the opposite. Another benefit of partitioning is that you won't have to worry about moving your data around.
This is a good article about partitioning - bol
