Environment:
SQL Server 2019 on Windows Server 2019, on KVM backed by TrueNAS, 16 cores, 32 GB RAM.
Application runs 50 parallel threads all inserting into the same massive table.
This combination appears to work against the SQL Server architecture.
Additional details
The problem table is both deep and wide: 20,000,000 rows with over 300 columns and 40-50 indexes.
The application uses the JDBC batch APIs. This particular table, due to row size, is inserted in batches of 1,000 rows.
Tables with more reasonable row sizes are inserted in batches of 10,000 rows.
I can't share the actual DDL, but it's pretty mundane apart from the row simply being massive (a surrogate key BIGINT ID column, two natural key VARCHAR columns, 300 or so cargo columns, 0 BLOB/CLOB columns, then 40-50 indexes)
The primary key index DDL is "create unique index mytable_pk on dbo.mytable (keycolumn);"
The only other unique index DDL is "create unique index mytable_ndx1 on dbo.mytable (division, itemnum);"
The product that owns the database is used by hundreds of Fortune 2000 customers, so changing the data model is not an option for me or the product vendor.
Restrictions
Since the database is ultimately a third party's, any changes I make to it must be in-place. Once the data is inserted into it, I no longer have any access to it.
The database is owned by a third-party off-the-shelf application.
The primary key is a sequential integer.
Observations and metrics
Early in the process, we were bottlenecked on CPU resources.
Once we hit about 1,000,000 rows, we were single-threading on latches, sometimes spending over two seconds in a latch and rarely less than 500 ms. Latching and IO buffer waits were both excessive. CPU dropped to about 12% usage.
In a second test, I dropped all of the indexes and re-ran the job. The job completed 8 times as quickly, showing zero load on the SQL Server and bottlenecking on CPU in the application, which is very good from the SQL Server perspective.
After reading Microsoft's literature, I came to the conclusion that the data model works against SQL Server's indexing architecture when tuning for massive inserts.
I will not always have the option of dropping and recreating the indexes. Is there a way to tune the table to distribute the I/O?
**Now to the real question**
Is there a way to tune SQL Server, under the covers, to distribute the IO so that sequential key values in an index do not land in the same buffer page when doing massive inserts of sequential data?
There are several well-known approaches to addressing last page insert contention in SQL Server.
Many of these are covered in the documentation at Resolve last-page insert PAGELATCH_EX contention in SQL Server. Summarising the options from that link:
1. Use OPTIMIZE_FOR_SEQUENTIAL_KEY (details); a sketch follows this list
2. Move the primary key off the identity column
3. Make the leading key a non-sequential column
4. Add a non-sequential value as a leading key
5. Use a GUID as a leading key
6. Use table partitioning and a computed column with a hash value
7. Switch to In-Memory OLTP
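For method 1, a minimal sketch, assuming the primary key index from your DDL can be altered in place (OPTIMIZE_FOR_SEQUENTIAL_KEY requires SQL Server 2019, which matches your environment):

```sql
-- Enable the high-concurrency insert optimization for a sequential key.
-- Index and table names are taken from the question's DDL.
ALTER INDEX mytable_pk ON dbo.mytable
SET (OPTIMIZE_FOR_SEQUENTIAL_KEY = ON);
```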
Method 7 can also be implemented as an in-memory OLTP table to handle a high rate of ingestion, with regular batch moves to the final destination table. For the very highest concurrency, use natively compiled code with the in-memory table as much as possible (including for the inserts). The frequency and size of moves are dictated by your requirements.
As mentioned in another answer, delayed durability can also improve insert performance in many cases.
Related Q & A: Solving periodic high PAGELATCH_EX Waits. Last page contention?
All that said, you haven't shown evidence of a last-page contention issue at all. More likely, you're encountering problems related to updating all those secondary indexes, plus a lack of memory on the instance, meaning index maintenance often has to wait for pages to be brought in from storage for modification. You don't mention the type of latch you see waits on, but I imagine they'd be PAGEIOLATCH_*.
The primary solution would be to dramatically increase the memory available to SQL Server for its buffer pool so fewer IOs are necessary. Failing that, a faster storage subsystem would be required.
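If the instance is currently capped well below the VM's 32 GB, raising the cap is worth checking before buying hardware. A minimal sketch, where the value is purely illustrative:

```sql
-- Let SQL Server use more of the host's RAM for the buffer pool.
-- 28672 MB (28 of the VM's 32 GB) is an illustrative value, not a recommendation.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 28672;
RECONFIGURE;
```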
Have you tried using Delayed Durability?
When to use delayed transaction durability
Some of the cases in which you could benefit from using delayed transaction durability are:
You can tolerate some data loss.
If you can tolerate some data loss, for example, where individual records are not critical as long as you have most of the data, then delayed durability may be worth considering. If you cannot tolerate any data loss, do not use delayed transaction durability.
You are experiencing a bottleneck on transaction log writes.
If your performance issues are due to latency in transaction log writes, your application will likely benefit from using delayed transaction durability.
Your workloads have a high contention rate.
If your system has workloads with a high contention level, much time is lost waiting for locks to be released. Delayed transaction durability reduces commit time and thus releases locks faster, which results in higher throughput.
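If you decide to try it, a minimal sketch of both ways to enable it (the transaction body is just a placeholder for your batch):

```sql
-- Permit delayed durability, then opt in on a per-transaction basis.
ALTER DATABASE CURRENT SET DELAYED_DURABILITY = ALLOWED;

BEGIN TRANSACTION;
    -- ... the batch of inserts goes here ...
COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);

-- Or force it for every transaction in the database:
-- ALTER DATABASE CURRENT SET DELAYED_DURABILITY = FORCED;
```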
The short answer to your "real question" is no, because contiguous keys of a disk-based b-tree index must be stored in the same page.
I've never used SQL Server, but your problem isn't specific to one database, so maybe this can still help.
When inserting a large number of rows per second, the bottlenecks are either going to be parsing overhead (which can be parallelized), index updates (which may or may not be parallelizable), primary key sequence generation, or other stuff like Postgres' large object support, but that depends on your column types and database quirks. At some point, any transactional database must also generate sequential transaction log entries, which is another concurrency bottleneck.
The first thing you should do is check whether the inserts are grouped into transactions (not one insert per transaction). Then make sure the IO is fast; look for bottlenecks there (iowait, etc.).
In a second test, I dropped all of the indexes and re-ran the job. The job completed 8 times as quickly, showing zero load on the SQL server
So that eliminates some of the candidates and hints that the problem is indices.
For example, if 50 threads each insert a row at the same time, and...
You have a high-cardinality index with each row hitting a different page in the index, then these inserts can be parallelized.
You have a low-cardinality index, most of the inserted rows have the same value in the indexed column, and all these threads are fighting for control of the same index page.
This can compound with index/table page splits if your fillfactor is too high: in that case all the threads want to insert into the same index page, which is already full, so one thread splits the page while all the others wait.
Unfortunately you didn't post the table info in the question, which you really should do. But you probably know whether your indices are low or high cardinality. The first thing you could do is run the same tests again, adding the indices back one by one, to see which one causes trouble.
You can also lower fillfactor so there is less chance the inserts end up in a page that is already full.
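For example, a sketch using the secondary index named in the question (70 is an arbitrary illustration, not a recommendation):

```sql
-- Rebuild leaving 30% free space on each leaf page to absorb future inserts.
ALTER INDEX mytable_ndx1 ON dbo.mytable
REBUILD WITH (FILLFACTOR = 70);
```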
If you find a problematic low-cardinality index, then you should first wonder whether it's actually useful for queries; maybe you can drop it. If you want to keep it, you can hack it into a high-cardinality index by adding a dummy column at the end. For example, if you have an index on (category) which has few distinct values and causes problems for inserts, you can turn it into (category, other_column), which will work just as well for selecting based on category and might provide some extra features, like sorting on other_column while selecting on category. However, other_column should not be the PK or a date or any other column whose values will end up in the same page in all your concurrent inserts, because that would be back to square one.
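A sketch of that rewrite, where ix_category, category, and other_column are the hypothetical names from the paragraph above:

```sql
-- Replace the low-cardinality index with a wider, higher-cardinality one.
DROP INDEX IF EXISTS ix_category ON dbo.mytable;
CREATE INDEX ix_category_wide ON dbo.mytable (category, other_column);
```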
Next, you can try single-threading, or a low number of threads. Back to this:
In a second test, I dropped all of the indexes and re-ran the job. The job completed 8 times as quickly, showing zero load on the SQL server and bottlenecking on CPU on the application which is very good from the SQL Server perspective.
This may look nice at first glance, but there's a problem here. Basically your application is doing the easy things (processing rows) and delegating the hard things (i.e., concurrency) to the database. That's fine until it exceeds the database's capabilities, and then it breaks down. Databases are excellent at handling concurrency correctly, but doing it fast is a very hard problem: coordinating several cores on a lock has a hard performance limit, set by the latency of communication between the cores, which is ultimately the speed of information propagation, in other words the speed of light, which cannot be negotiated with.
Locks are just memory held as cache lines in CPU caches. So a side effect of the way multicore systems work is, it's much faster for the same core to reacquire a lock it just released, because the line is still in its cache, so there is no slow inter-core communication involved. Likewise, several cores attempting to modify different parts of the same index page will result in cache line exchanges between them and lots of communication to determine what core owns what byte in that page. And that is surprisingly slow, it can take microseconds instead of nanoseconds.
In addition you have 50 client threads, and thus 50 server threads, but only 16 cores, so the OS on the database server will multitask the 50 threads across the 16 cores. This means it will sometimes put a thread to sleep while it's holding a lock, and when that happens, performance is destroyed.
So the next test you can do is to compare insertion time, with all your indices, between these two scenarios:
Your current one with 50 threads
Then stop it, copy the inserted data from your main table into a temp table, truncate the main table, and insert the exact same data again with:
INSERT INTO yourtable SELECT * FROM temptable
In the second case you're inserting the same data. For the test to be valid it should be in the same order, so you might want to add an ORDER BY primary key while copying the rows into the temp table, so they come out in the proper order. I don't know if the tables are clustered, but you'll find a way to get the order correct.
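A sketch of that test, assuming the table and key column names from the question's DDL:

```sql
-- Preserve the already-inserted rows, clear the table, then reinsert them
-- from a single session in primary key order.
SELECT * INTO dbo.mytable_copy FROM dbo.mytable;

TRUNCATE TABLE dbo.mytable;

INSERT INTO dbo.mytable
SELECT * FROM dbo.mytable_copy
ORDER BY keycolumn;  -- same order as the original load
-- Note: if the ID is an IDENTITY column, you'll also need
-- SET IDENTITY_INSERT dbo.mytable ON around the insert.
```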
You can also try various orders, one of the indices may be faster if data is inserted in an order that it likes.
If the second insert is much faster than the multi-threaded one, then that will give you a clue about what you need to do. In that case the answer is probably a funnel, i.e. a process that gathers rows generated by the many threads and inserts them using a low number of threads, maybe just one.
This can simply be all the threads inserting into a non-indexed table, with a separate task flushing that table into the main one every X milliseconds.
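A sketch of that funnel, with hypothetical names and only the question's three known columns shown:

```sql
-- All 50 threads insert into an index-free heap...
CREATE TABLE dbo.mytable_staging
(
    keycolumn BIGINT      NOT NULL,
    division  VARCHAR(50) NOT NULL,  -- lengths are hypothetical
    itemnum   VARCHAR(50) NOT NULL
    -- ... plus the ~300 cargo columns ...
);

-- ...and a single background task drains it every X milliseconds.
-- DELETE ... OUTPUT moves the rows atomically, so none are lost
-- between two statements (the target must have no triggers or FKs).
DELETE FROM dbo.mytable_staging
OUTPUT DELETED.keycolumn, DELETED.division, DELETED.itemnum
INTO dbo.mytable (keycolumn, division, itemnum);
```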
When my cluster has been running for some time (maybe a day, maybe two), some queries become very slow, taking about 2-10 minutes to finish. When this happens I need to restart the whole cluster and the queries return to normal, but after some time the very slow queries come back.
The query response time depends on multiple factors, including:
1. Table size: if the table size grows with time, then response time will also increase
2. If it is the open-source version, time spent in GC pauses, which in turn will depend on the number of objects/garbage present in the JVM heap
3. Number of concurrent queries being run
4. Amount of data overflowed to disk
You will need to describe your usage pattern of SnappyData in detail. Only then will it be possible to characterise the issue.
Some of the questions that should be answered are:
1. What is the cluster size?
2. What are the table sizes?
3. Are writes happening continuously on the tables, or are only queries being executed?
You can engage us on the Slack channel to provide information related to your clusters.
I am doing an IoT sensor based project, in which each sensor sends data to the server every minute. I am expecting a maximum of 100k sensors in the future.
I am logging the data sent by each sensor in a history table, but I also have a Live Information table in which the latest status of each sensor is updated.
So I want to update the row corresponding to each sensor in the Live table every minute.
Is there any problem with this? I read that frequent update operations are bad in Cassandra.
Is there a better way?
I am already using Redis in my project for storing session etc. Should I move this LIVE table to Redis?
This is what you're looking for: https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_memtable_thruput_c.html
How you tune memtable thresholds depends on your data and write load. Increase memtable throughput under either of these conditions:
The write load includes a high volume of updates on a smaller set of data.
A steady stream of continuous writes occurs. This action leads to more efficient compaction.
So increasing commitlog_total_space_in_mb will make Cassandra flush memtables to disk less often. This means most of your updates will happen in memory only and you will have fewer duplicates of data.
In C* there are consistency levels for reads and consistency levels for writes. If you are going to have only one node then this does not apply (zero problems), but if you are going to use more than one DC or racks, you need to increase the read consistency level to guarantee that what you are retrieving is the latest version of the updated row, or use a high consistency level when writing. In my case I'm using ANY to write and QUORUM to read. This allows writes to succeed with all nodes except one down, and reads to succeed with 51% of the nodes up. This is a trade-off in the CAP theorem. Please take a look at:
http://docs.datastax.com/en/cassandra/latest/cassandra/dml/dmlConfigConsistency.html
https://wiki.apache.org/cassandra/ArchitectureOverview
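For illustration, a cqlsh sketch of that split (keyspace, table, and column names are hypothetical; ANY is valid only for writes, so the session level is switched before the read):

```sql
-- cqlsh session: cheap writes, safer reads (this answer's own trade-off).
CONSISTENCY ANY;
UPDATE sensors.live_info SET status = 'OK' WHERE sensor_id = 42;

CONSISTENCY QUORUM;
SELECT status FROM sensors.live_info WHERE sensor_id = 42;
```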
From one of our partners, I receive about 10,000 small tab-delimited text files with +/- 30 records in each file. It is impossible for them to deliver it in one big file.
I process these files in a ForEach loop container. After reading a file, 4 column derivations are performed and the contents are finally stored in a SQL Server 2012 table.
This process can take up to two hours.
I already tried merging the small files into one big file and then importing that one into the same table. That process takes even more time.
Does anyone have any suggestions to speed up processing?
One thing that sounds counterintuitive is to replace your one Derived Column Transformation with 4 and have each one perform a single task. The reason this can provide a performance improvement is that the engine can better parallelize operations if it can determine that these changes are independent.
Investigation: Can different combinations of components affect Dataflow performance?
Increasing Throughput of Pipelines by Splitting Synchronous Transformations into Multiple Tasks
You might be running into network latency since you are referencing files on a remote server. Perhaps you can improve performance by copying those remote files to the local box before you begin processing. The performance counters you'd be interested in are:
Network Interface / Current Bandwidth
Network Interface / Bytes Total/sec
Network Interface / Transfers/sec
The other thing you can do is replace your destination and derived column with a Row Count transformation. Run the package a few times for all the files and that will determine your theoretical maximum speed. You won't be able to go any faster than that. Then add in your Derived Column and re-run. That should help you understand whether the drop in performance is due to the destination, the derived column operation, or whether the package is simply running as fast as the IO subsystem can go.
Do your files offer an easy way (i.e. their names) of subdividing them into even (or mostly even) groups? If so, you could run your loads in parallel.
For example, let's say you could divide them into 4 groups of 2,500 files each.
Create a Foreach Loop container for each group.
For your destination for each group, write your records to their own staging table.
Combine all records from all staging tables into your big table at the end.
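That final combine step might look like this (staging and destination table names are hypothetical):

```sql
-- One set-based move from the four staging tables into the big table.
INSERT INTO dbo.BigTable
SELECT * FROM dbo.Staging1
UNION ALL SELECT * FROM dbo.Staging2
UNION ALL SELECT * FROM dbo.Staging3
UNION ALL SELECT * FROM dbo.Staging4;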
If the files themselves don't offer an easy way to group them, consider pushing them into subfolders when your partner sends them over, or inserting the file paths into a database so you can write a query to subdivide them and use the file path field as a variable in the Data Flow task.
I have a large table which is both heavily read, and heavily written (append only actually).
I'd like to get an understanding of how the indexes are affecting write speed, ideally the duration spent updating them (vs the duration spent inserting), but otherwise some sort of feel for the resources used solely for the index maintenance.
Is this something that exists in SQL Server / Profiler somewhere?
Thanks.
Look at the various ...wait... columns under sys.dm_db_index_operational_stats. These will account for waits on locks and latches, however they will not account for log write times. For log writes you can do some simple math based on row size (i.e. a new index that is 10 bytes wide on a table whose rows are 100 bytes wide will add roughly 10% log write volume), since log write time is driven just by the number of bytes written. The Log Flush... counters under the Databases performance object will measure the current overall DB-wide log wait times.
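A sketch of such a query (the table name is hypothetical):

```sql
-- Per-index latch and lock waits accumulated since the instance last started.
SELECT  i.name AS index_name,
        os.leaf_insert_count,
        os.page_latch_wait_count,    os.page_latch_wait_in_ms,
        os.page_io_latch_wait_count, os.page_io_latch_wait_in_ms,
        os.row_lock_wait_in_ms
FROM    sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID(N'dbo.MyTable'), NULL, NULL) AS os
JOIN    sys.indexes AS i
          ON i.object_id = os.object_id
         AND i.index_id  = os.index_id;
```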
Ultimately, the best measurement is a baseline comparison under a well-controlled test load.
I don't believe there is a way to find out the duration of the update, but you can check the last user update on the index by querying sys.dm_db_index_usage_stats. This will give you some key information on how often the index is queried and updated, and the datetime stamp of this information.
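For example (the table name is hypothetical):

```sql
-- Reads vs. writes per index, plus the time of the last update.
SELECT  i.name AS index_name,
        us.user_seeks, us.user_scans, us.user_lookups,
        us.user_updates, us.last_user_update
FROM    sys.dm_db_index_usage_stats AS us
JOIN    sys.indexes AS i
          ON i.object_id = us.object_id
         AND i.index_id  = us.index_id
WHERE   us.database_id = DB_ID()
  AND   us.object_id   = OBJECT_ID(N'dbo.MyTable');
```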