I am new to SQL Server, though I have spent a fair amount of time with Oracle databases. The application I am currently managing contains a lot of denormalized staging tables that receive upstream data.
Views have been created on the staging tables, each consisting of about 40 tables and multiple joins. Each view loads a datamart table of the same name in another database.
These views take a long time, approximately 5 hours, to load the datamart tables. The logic is truncate-and-load: each day the whole database is truncated and data is loaded from source system files, using format files, into the staging area tables.
How can I tune these views to make the load faster, given that the truncate-and-load process is intentional?
You probably need to look into the normal stuff:
Turn on statistics io to see which table causes most of the I/O in the query
Check the leftmost node in the actual plan to make sure plan creation didn't end in a timeout (because of all the joins)
Look for fat arrows (= a lot of rows being handled) in the plan
Check any expensive operations (sorts, spools, key lookups with big row counts) in the plan
Check the plan for orders of magnitude difference in estimated vs actual number of rows
Don't pay that much attention to the cost percentages in the actual plan, those are just estimates and can be extremely misleading.
Without more details (create table and index statements, the actual query plan, etc.) it's quite difficult to give any more specific advice.
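As a rough illustration of the estimated-vs-actual check in the list above, here is a minimal sketch. It assumes you have copied operator names and row counts out of an actual plan by hand; the `plan` data and the function name `misestimated_operators` are hypothetical, and nothing here talks to SQL Server.

```python
def misestimated_operators(operators, threshold=10.0):
    """Return (name, ratio) for operators whose estimated and actual row
    counts differ by at least `threshold` times in either direction."""
    flagged = []
    for op in operators:
        # Clamp to 1 so a zero-row estimate or actual does not divide by zero.
        est = max(op["estimated_rows"], 1)
        act = max(op["actual_rows"], 1)
        ratio = max(est, act) / min(est, act)
        if ratio >= threshold:
            flagged.append((op["name"], ratio))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical numbers, as they might be read off an actual plan.
plan = [
    {"name": "Hash Match", "estimated_rows": 1_000, "actual_rows": 1_200},
    {"name": "Index Seek", "estimated_rows": 1, "actual_rows": 250_000},
    {"name": "Sort", "estimated_rows": 500_000, "actual_rows": 4_000},
]
flagged = misestimated_operators(plan)
for name, ratio in flagged:
    print(f"{name}: off by a factor of {ratio:,.0f}")
```

The operators that come out on top are usually the place to start looking at statistics and join predicates.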
I'm moving data from an ODBC source to an OLE DB Destination; records get inserted into different tables on the ODBC side every day. The packages get slower and slower: it takes about a day for a million records, sometimes more. The tables can have new rows inserted or existing rows updated, and loading and looking up the new data slows the process. Is there any way I can speed up the ETL process, or is there an open source platform I can use to load the data faster?
I tried counting the rows in the OLE DB Destination so I could insert only the newer records from the ODBC Source, but to my surprise the ROW_NUMBER() function isn't supported in OpenEdge ODBC.
Based on the limited information in your question, I'd design your packages like the following
SEQC PG to SQL
The point of these operations is to transfer data from our source system verbatim to the target. The target table should be brand new and the SQL Server equivalent of the PG table from a data type perspective. Clustered Key if one exists, otherwise, see how a heap performs. I am going to reference this as a staging table.
The Data Flow itself is going to be bang simple
By default, the destination will perform a fast load and lock the table.
Run the package and observe times.
Edit the OLE DB Destination and change the Maximum Commit Size to something less than 2147483647. Try 100000 - is it better, worse? Move up/down an order of magnitude until you have an idea of the fastest rate at which the package can move data.
There are a ton of variables at this stage of the game: how busy the source PG database is, what data types are involved, and how far the data needs to travel from the Source, to your computer, to the Destination. But this can at least help you answer "can I pull (insert large number here) rows from the source system within the expected tolerance?" If you can get the data moved from PG to SQL within the expected SLA and you still have processing time left, then move on to the next section.
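The "try a size, measure, move an order of magnitude" loop described above can also be sketched outside the SSIS designer. This is only an analogy: it uses an in-memory SQLite database as a stand-in for the destination, the table name `staging` and the function `load_in_batches` are hypothetical, and the timings it prints say nothing about how SQL Server will behave.

```python
import sqlite3
import time


def load_in_batches(rows, batch_size):
    """Insert `rows` committing every `batch_size` rows; return (row count, seconds)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT)")
    started = time.perf_counter()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO staging (id, payload) VALUES (?, ?)", batch)
        conn.commit()  # one commit per batch, like a commit size setting
    elapsed = time.perf_counter() - started
    loaded = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
    conn.close()
    return loaded, elapsed


rows = [(i, f"row {i}") for i in range(50_000)]
timings = {}
# Step through orders of magnitude and compare.
for batch_size in (100, 1_000, 10_000):
    timings[batch_size] = load_in_batches(rows, batch_size)
    loaded, elapsed = timings[batch_size]
    print(f"batch_size={batch_size}: {loaded} rows in {elapsed:.3f}s")
```

The point is the method rather than the numbers: change one setting at a time and keep the measurements.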
Otherwise, you have to rethink your strategy for what data gets brought over. Maybe there are reliable (system generated) insert/update times associated with the rows. Maybe it's a financial-like system where rows aren't updated, just new versions of the row are inserted, and the net values are all that matter. There are too many possibilities here, but you'll likely need to find a Subject Matter Expert on the system - someone who knows the logical business process the database models as well as how the data is stored in the database. Buy that person some tasty snacks because they are worth their weight in gold.
Now what?
At this point, we have transferred the data from PG to SQL Server and we need to figure out what to do with it. Four possibilities exist:
The data is brand new. We need to add the row into the target table
The data is unchanged. Do nothing
The data exists but is different. We need to change the existing row in the target table
There is data in the target table that isn't in the staging table. We're not going to do anything about this case either.
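The four cases above can be sketched in a few lines. This is an illustration only: it assumes each table fits in memory as a dict mapping a key to a tuple of the remaining column values, and the function name `classify` and the sample rows are hypothetical.

```python
def classify(staging, target):
    """Split staging rows into the four cases, comparing by key and then by value."""
    new = {k: v for k, v in staging.items() if k not in target}
    changed = {k: v for k, v in staging.items() if k in target and target[k] != v}
    unchanged = [k for k, v in staging.items() if k in target and target[k] == v]
    # Present in the target but missing from staging: reported, but left alone.
    target_only = [k for k in target if k not in staging]
    return new, changed, unchanged, target_only


# Hypothetical rows: key -> (name, amount)
staging = {1: ("alice", 10), 2: ("bob", 25), 3: ("carol", 30)}
target = {1: ("alice", 10), 2: ("bob", 20), 4: ("dave", 40)}

new, changed, unchanged, target_only = classify(staging, target)
print("insert:", new)
print("update:", changed)
print("skip:", unchanged)
print("ignore:", target_only)
```

Only the first two groups generate work; the rest of the answer is about how to apply them efficiently.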
Adding data, inserts, are easy and can be fast - it depends on table design.
Changing data, updates, are less easy in SSIS and are slower than adding new rows. Slower because behind the scenes, the database will delete and add the row back in.
Non-clustered indexes are also potential bottlenecks here, but they can also be beneficial. Welcome to the world of "it depends".
Option 1 is to just write the SQL statements to handle the insert and update. Yes, you have a lovely GUI tool for creating data flows but you need speed and this is how you get it (especially since we've already moved all the data from the external system to a central repository)
Option 2 is to use a Data Flow and potentially an Execute SQL Task to move the data. The idea is that the Data Flow will segment your data: New rows go to an OLE DB Destination that writes the inserts. For the updates, it depends on volume what makes the most sense from an efficiency perspective. If it's tens, hundreds, or thousands of rows to update, eh, take the performance penalty and use an OLE DB Command to update each row. Maybe it's hundreds of thousands and the package runs well enough; then keep it.
Otherwise, route your changed rows to yet another staging table and then do a mass update from the staged updates to the target table. But at this point, you just wrote half the query you needed for the first option, so just write the Insert and be done (and speed up performance because now everything is just SQL Engine "stuff").
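To make Option 1 concrete, here is a minimal sketch of the two set-based statements: one update for changed rows, one insert for new rows. It runs against an in-memory SQLite database purely so it is self-contained. The table and column names (`staging`, `target`, `id`, `name`, `amount`) are hypothetical, the columns are assumed to be NOT NULL (the `<>` comparison would need extra handling for NULLs), and SQL Server has its own syntax you may prefer for this, such as UPDATE ... FROM or MERGE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE target  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, amount INTEGER NOT NULL);
CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT NOT NULL, amount INTEGER NOT NULL);
INSERT INTO target  VALUES (1, 'alice', 10), (2, 'bob', 20), (4, 'dave', 40);
INSERT INTO staging VALUES (1, 'alice', 10), (2, 'bob', 25), (3, 'carol', 30);
""")

UPDATE_CHANGED = """
UPDATE target
SET name   = (SELECT s.name   FROM staging AS s WHERE s.id = target.id),
    amount = (SELECT s.amount FROM staging AS s WHERE s.id = target.id)
WHERE EXISTS (
    SELECT 1 FROM staging AS s
    WHERE s.id = target.id
      AND (s.name <> target.name OR s.amount <> target.amount)
)
"""

INSERT_NEW = """
INSERT INTO target (id, name, amount)
SELECT s.id, s.name, s.amount
FROM staging AS s
WHERE NOT EXISTS (SELECT 1 FROM target AS t WHERE t.id = s.id)
"""

# Both statements run in one transaction, so readers never see a half-applied load.
with conn:
    conn.execute(UPDATE_CHANGED)
    conn.execute(INSERT_NEW)

result = conn.execute("SELECT id, name, amount FROM target ORDER BY id").fetchall()
print(result)
```

Row 4 exists only in the target and is left untouched, matching the "do nothing" case described earlier.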
You might want to investigate Progress' Change Data Capture feature. If you have a modern release of OpenEdge (11.7 or better) and the proper licenses you can enable CDC policies to track changes. Your ETL process could then use that information to target its efforts.
Warning: it's complicated. There is a lot more to actually doing it than marketing would have you believe. But if your use case is straightforward it might not be too terrible.
Or you could implement Progress' "Pro2" product to do all the dirty work for you. (That's an extra-cost option.)
I have a SQL Server table with 160+ million records that receives continuous CRUD operations from multiple sources (the UI, batch jobs, etc.).
Currently I have partitioned the table on a column to get better performance.
I came across In-Memory tables, which can be used for tables with frequent updates. When updates come from multiple sources they don't take locks; instead they maintain row versions, so concurrent updates behave better with this approach.
So what are my options in this case?
Partition the table, or create an In-Memory table?
From what I have read, SQL Server does not support In-Memory tables when the table is partitioned.
Which is the better option in this case: an In-Memory table or a partitioned table?
It depends.
In-memory tables look great in theory, but you really need to spend time learning the details in order to implement them correctly. You may find some details disturbing. For example:
there are no parallel inserts into in-memory tables, which makes creating rows slower than a parallel insert into a traditional table stored on SSD
not all index operations supported by disk-based indexes are available for in-memory table indexes
not all data types are supported
there are both unsupported features and T-SQL constructs
you may need more RAM than you think
If you are ready to pay the price for using Hekaton, you may start with reading its white-paper.
The partitioning itself comes with benefits but there is no guarantee it will heal your system. Only particular queries and case-scenarios can benefit from it. For example, if 99% of your workload is touching the data in one partition you may see no optimization at all. On the other hand, if your reports are based on historical data and your inserts/updates/deletes touch another partition it will be better.
Both technologies are good, but they need to be examined in detail and applied carefully. Often, folks believe that using some new tech will solve their problems, when the problems can be solved just by applying some basic concepts.
For example, you said that you are performing CRUD over 160+ million records. Ask yourself:
is my table normalized - when data is stored in a normalized way you gain two things: you perform CRUD only on part of the data, and the engine may read only the data that is needed for a particular query (without the need to support an index)
are my T-SQL statements written well - row by agonizing row processing, calling stored procedures in loops, or not processing the data in batches are common sources of slow queries
which are the blocking and deadlocked queries - for example, one long-running query may be blocking all your inserts - identify these types of issues first and try to resolve them with data pre-calculation (an indexed view) or by creating covering indexes (which can be filtered and have include columns, too)
are readers and writers being blocked - you can try different isolation levels to solve this type of issue; read committed snapshot isolation (RCSI) is the Azure default. You may need to add more RAM to the RAM disk used by your TempDB, but since you are looking at Hekaton anyway, this will be easier to test (and roll back) than Hekaton or partitioning
I am writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem I am running into is the volume of data. The resulting data set that I need to commit has about 132.5 million rows. The commit itself takes almost 2 hours. I can cut that by changing the logging to simple, but it's still quite substantial (given that generating the 132.5 million rows into a temp table only takes 9 minutes).
I have been reading about the best methods for migrating large data sets, but most of the solutions implicitly assume that the source data already resides in a single file or table (which is not the case here). Some solutions, like using the SSMS task option, make it difficult to embed some of the logic that I need.
I am wondering if anyone here has a solution.
Assuming you're on SQL Server 2014 or later, the temp table is not flushed to disk immediately. So the difference is probably just disk speed.
Try making the target table a Clustered Columnstore to optimize for compression and minimize IO.
I have a Spotfire project that references several large SQL Server based tables (one has 700,000 rows with 200 columns, another has 80,000,000 rows with 10 columns, and a few others are much smaller by comparison). Currently I use information links with prompts to narrow down the data before loading it into Spotfire. I still sometimes have issues with RAM usage creeping up and random CPU spikes after the data has been loaded.
My questions, if I add indexes to the SQL tables:
Will the amount of RAM/CPU used by Spotfire get better (lower)?
Will it help speed up the initial data load time?
Should I even bother?
I'm using SQL Server 2016 and Tibco Spotfire Analyst 7.7.0 (build version 7.7.0.39)
Thanks
If you add indexes without a good reason, it actually makes your system slower, because every index has to be updated after each INSERT, UPDATE, and DELETE. You can ignore this if your DB holds static data whose content rarely changes.
You need to understand which parts of your queries consume the most resources, then create indexes accordingly.
The following URLs will help you:
https://www.liquidweb.com/kb/mysql-performance-identifying-long-queries/
https://www.eversql.com/choosing-the-best-indexes-for-mysql-query-optimization/
We have a web service that pumps data into 3 database tables and a web application that reads that data in aggregated format in a SQL Server + ASP.Net environment.
There is so much data arriving to the database tables and so much data read from them and at such high velocity, that the system started to fail.
The tables have indexes on them, one of them is unique. One of the tables has billions of records and occupies a few hundred gigabytes of disk space; the other table is a smaller one, with only a few million records. It is emptied daily.
What options do I have to eliminate the obvious problem of simultaneously reading from and writing to multiple database tables?
I am interested in every optimization trick, although we have tried every trick we came across.
We don't have the option to install SQL Server Enterprise edition to be able to use partitions and in-memory-optimized tables.
Edit:
The system is used to collect fitness tracker data from tens of thousands of devices and to display data to thousands of them on their dashboard in real-time.
The requirements and specifics are way too broad to give a concrete answer. But one suggestion would be to set up a second database and do log shipping over to it. The original db would be the "write" database and the new db would be the "read" database.
Cons
- Disk space
- The read db would be out of date by the length of time the log transfer takes
Pro
- You could possibly drop some of the indexes on the "write" db, which would/could increase performance
- You could then summarize the table in the "read" database in order to increase query performance
https://msdn.microsoft.com/en-us/library/ms187103.aspx
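The "summarize the table in the read database" idea from the list above can be sketched as a pre-aggregated table that dashboards query instead of the raw rows. The example below is an assumption-laden illustration: it uses an in-memory SQLite database as a stand-in for the read database, and the tables `readings` and `daily_summary` and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (device_id INTEGER, day TEXT, steps INTEGER);
CREATE TABLE daily_summary (
    device_id     INTEGER,
    day           TEXT,
    total_steps   INTEGER,
    reading_count INTEGER,
    PRIMARY KEY (device_id, day)
);
""")

# Hypothetical raw rows as they might arrive from the devices.
conn.executemany(
    "INSERT INTO readings (device_id, day, steps) VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 100),
        (1, "2024-01-01", 250),
        (1, "2024-01-02", 400),
        (2, "2024-01-01", 50),
    ],
)

# Rebuild the summary in one transaction; a real job would more likely
# refresh only the most recent days rather than the whole table.
with conn:
    conn.execute("DELETE FROM daily_summary")
    conn.execute("""
        INSERT INTO daily_summary (device_id, day, total_steps, reading_count)
        SELECT device_id, day, SUM(steps), COUNT(*)
        FROM readings
        GROUP BY device_id, day
    """)

summary = conn.execute(
    "SELECT device_id, day, total_steps, reading_count "
    "FROM daily_summary ORDER BY device_id, day"
).fetchall()
print(summary)
```

Dashboard queries then touch a handful of summary rows per device instead of scanning the raw readings.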
Here are some ideas, some more complicated than others; their usefulness depends heavily on the usage, which isn't fully described in the question. Disclaimer: I am not a DBA, but I have worked with some great ones on my DB projects.
[Simple] More system memory always helps
[Simple] Use multiple files for tempdb (one filegroup, 1 file for each core on your system. Even if the query is being done entirely in memory, it can still block on the number of I/O threads)
[Simple] Transaction logs on SIMPLE rather than FULL recovery
[Simple] Transaction logs written to separate spindle from the rest of data.
[Complicated] Split your data into separate tables yourself, then union them in your queries.
[Complicated] Try and put data which is not updated into a separate table so static data indices don't need to be rebuilt.
[Complicated] If possible, make sure you are doing append-only inserts (auto-incrementing PK/clustered index should already be doing this). Avoid updates if possible, obviously.
[Complicated] If queries don't need the absolute latest data, change read queries to use WITH (NOLOCK) on tables and remove row and page locks from indices. You won't get incomplete rows, but because NOLOCK reads uncommitted data you might miss a few rows, or see rows that are later rolled back, if they are being written at the same time you are reading.
[Complicated] Create separate filegroups for table data and index data. Place those filegroups on separate disk spindles if possible. SQL Server has separate I/O threads for each file so you can parallelize reads/writes to a certain extent.
Also, make sure all of your large tables are in separate filegroups, on different spindles as well.
[Complicated] Remove inserts with transactional locks
[Complicated] Use bulk-insert for data
[Complicated] Remove unnecessary indices
Prefer included columns over indexed columns if sorting isn't required on them
That's kind of a generic list of things I've done in the past on various DB projects I've worked on. Database optimizations tend to be highly specific to your situation... which is why DBAs have jobs. Some of the 'complicated' answers could be simple if your architecture supports it already.