I'm currently trying to develop a data-logging solution for a SCADA application, using SQL Server 2012 Express. The SCADA application is configured to execute a stored procedure on SQL Server to push data into the DB.
The data flow is, in my opinion, quite heavy (1.4-1.9 million rows per day, averaging 43 bytes per row after some tweaks). The table which stores the data has one clustered index on three columns. For now our focus is to store this data as compactly as possible and without generating too much fragmentation (SELECTs are not a major concern right now).
Currently the DB occupies ~250 MB (I have pre-allocated 5120 MB for it), and holds only this data table, one other table which can be ignored, and the transaction log.
My questions are:
How can I set up index maintenance on this DB? Being the Express edition, I can't use SQL Server Agent, so I'll use Task Scheduler instead. Should I rebuild or reorganize? Is it advisable to use a fill factor under 100? Should I schedule the task at intervals short enough that fragmentation stays under 30% and a reorganize is sufficient? Does rebuilding become increasingly expensive once the database reaches its maximum size (if the index is rebuilt on day x, will the rebuild on day x+1 take less time than if I rebuild only once every 2 days)?
SQL Server Express also limits the database size to 10 GB, and I'm trying to squeeze as much as I can into that amount. I'm planning to build a ring buffer: once the event log shows that the database could not grow any further (the ALTER DATABASE / expand failure message), can I have the stored procedure switch to UPDATEing the oldest rows as its way of inserting new data? My fear is that even updates will take some new space, and at that point I'll have to somehow aggressively shrink the DB.
I have also considered storing the DB files on a compressed Windows partition, or using a free, unlimited DB such as MySQL for the actual storage with SQL Server only as a front end (the SCADA app must be configured against SQL Server). Is either of those worth considering?
To optimize inserts I'm using a global temp table which holds up to 1k rows (counted with a sequence) as a form of buffer, then pushing the data into the main table and truncating the temp table. Is that efficient? Should I use transactions for efficiency instead? I've tried beginning a named transaction in the stored procedure if one doesn't exist and committing it when the sequence reaches 1k. Would increasing the threshold to 10k rows lead to less fragmentation?
If you're thinking I'm unfamiliar with databases, you are right. At the moment there is only one SCADA application using SQL Server, but the actual application is set up redundantly, so in the end everything will take twice the resources (and each instance of the SCADA application will get its own storage). I should also mention that I can't just upgrade to a higher edition of SQL Server, but I do have the freedom to use any piece of free software.
Most of the answers cut across your four numbered questions, so I've just put my responses in bullets to help:
Indexes should probably be maintained, but in your case they can be prohibitive. Beyond the clustered index on the table, additional (nonclustered) indexes are generally there for querying the data.
To let a scheduled task do the work, since no Agent is available, use the sqlcmd utility (https://technet.microsoft.com/en-us/library/ms165702%28v=sql.105%29.aspx). The command-line tools may have to be installed, but they let you write a batch script that runs the SQL commands.
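For example, a minimal sketch of a maintenance script that Task Scheduler could run nightly via sqlcmd (the instance, database, and table names are placeholders; choose REORGANIZE or REBUILD based on the fragmentation you measure):

    -- maintenance.sql, invoked by Task Scheduler with something like:
    --   sqlcmd -S .\SQLEXPRESS -E -d ScadaLog -i maintenance.sql
    ALTER INDEX ALL ON dbo.LogData REORGANIZE;

    -- For heavier fragmentation, rebuild instead and leave some free space in the pages:
    -- ALTER INDEX ALL ON dbo.LogData REBUILD WITH (FILLFACTOR = 90);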
With an app doing as much inserting as you describe, I would design a 2-step process. First, a basic table with no nonclustered indexes to accept the inserts. Second, a table you'd query the data from. Then use a scheduled task to call a stored proc that transfers the transactional data from table 1 to table 2, perhaps hourly or daily depending on your query needs, and removes the original data from table 1 after it has been copied to table 2; this should definitely be done in a transaction (a sketch follows below).
Otherwise, every insert has to not only insert the raw data for the table, but also insert the records for the indexes.
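A minimal sketch of such a transfer procedure, assuming a staging table dbo.LogData_Staging and a reporting table dbo.LogData_Reporting with a TagId/LoggedAt/Value layout (all of these names are placeholders for your actual schema):

    CREATE PROCEDURE dbo.usp_TransferLoggedRows
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Only move rows that existed when the procedure started,
        -- so concurrent inserts are left untouched.
        DECLARE @watermark datetime2(3);
        SELECT @watermark = MAX(LoggedAt) FROM dbo.LogData_Staging;
        IF @watermark IS NULL RETURN;

        BEGIN TRANSACTION;

            INSERT INTO dbo.LogData_Reporting (TagId, LoggedAt, Value)
            SELECT TagId, LoggedAt, Value
            FROM dbo.LogData_Staging
            WHERE LoggedAt <= @watermark;

            DELETE FROM dbo.LogData_Staging
            WHERE LoggedAt <= @watermark;

        COMMIT TRANSACTION;
    END;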
Due to the quantity of your inserts, a 100% fill factor should probably be avoided (set it to less than 50%, say). A 100% fill factor means the nonclustered index pages have no free space left to actually insert records, so every record you insert can force pages to split and be reorganized. A lower fill factor leaves space in each page so new records can be inserted into the indexes without having to reorganize them.
To optimize your inserts, I would use the 2-step process above to insert records straight into your first table. If you can have your app use SQL Bulk Copy, I would explore that as well.
To optimize space, you can explore a few things:
Do you need all the records accessible in real time? Perhaps you can work with the business to create a data retention policy in which you keep every record in the database for 24 hours, then a summary by minute or something for 1 week, hourly for 2 weeks, daily for 6 months, etc. You could enhance this with a daily backup so that you could restore any particular day in its entirety if needed.
Consider changing the database recovery model from full to simple or bulk-logged. This can keep your transaction log under control with the bulk inserts you may be doing.
More Info: https://technet.microsoft.com/en-us/library/ms190692%28v=sql.105%29.aspx
You'll have to work hard to manage your transaction log. Take frequent checkpoints and transaction log backups.
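For example (the database name and backup path are placeholders; simple recovery gives up point-in-time restores in exchange for a log that largely manages itself):

    -- Keep the log small if point-in-time recovery is not required:
    ALTER DATABASE ScadaLog SET RECOVERY SIMPLE;

    -- Otherwise stay in FULL (or BULK_LOGGED) and back the log up frequently:
    -- BACKUP LOG ScadaLog TO DISK = N'D:\Backups\ScadaLog.trn';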
Related
I am writing some processes to pre-format certain data for another downstream process to consume. The pre-formatting essentially involves gathering data from several permanent tables in one DB, applying some logic, and saving the results into another DB.
The problem I am running into is the volume of data. The resulting data set that I need to commit has about 132.5 million rows, and the commit itself takes almost 2 hours. I can cut that by switching the logging to simple recovery, but it's still quite substantial (seeing as generating the 132.5 million rows into a temp table only takes 9 minutes).
I have been reading up on the best methods to migrate large amounts of data, but most of the solutions implicitly assume that the source data already resides in a single file/data table (which is not the case here). Some solutions, like the SSMS task option, make it difficult to embed the logic I need to apply.
I am wondering if anyone here has some solutions.
Assuming you're on SQL Server 2014 or later the temp table is not flushed to disk immediately. So the difference is probably just disk speed.
Try making the target table a Clustered Columnstore to optimize for compression and minimize IO.
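A sketch, assuming SQL Server 2014+ and a hypothetical target table that is currently a heap with no other indexes:

    -- Store the target table as a clustered columnstore to get heavy compression.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_TargetTable ON dbo.TargetTable;

    -- Tip: batched inserts of 102,400 rows or more load straight into compressed
    -- rowgroups, bypassing the delta store.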
The primary DB receives all the raw data every 10 minutes, but it only stores one week of it. I would like to keep all the raw data for one year in another DB on a different server. How can I do that?
I have created a T-SQL query to select the required data from the primary DB. How can I keep pulling the new data from the primary DB and inserting it into the secondary DB? The table has a datetime column; can I use it to insert only the rows with the latest datetime?
Notes: the source database is SQL Server 2012, the secondary database is SQL Server 2005.
If you are on SQL 2008 or higher, the MERGE command (ms docs) may be very useful in your actual update process. Be sure you understand it.
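A minimal sketch of the upsert step (the table and column names are assumptions, and MERGE needs SQL Server 2008+, so it would run against the newer server or be replaced by separate UPDATE/INSERT statements on the 2005 side):

    MERGE dbo.RawDataArchive AS target
    USING dbo.RawDataStaging AS source
        ON  target.SensorId    = source.SensorId
        AND target.ReadingTime = source.ReadingTime
    WHEN MATCHED THEN
        UPDATE SET target.Value = source.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (SensorId, ReadingTime, Value)
        VALUES (source.SensorId, source.ReadingTime, source.Value);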
Your table containing the full year of data sounds like it could be OLAP, so I refer to it that way occasionally (if you don't know what OLAP is, look it up sometime, but it does not matter for this answer).
If you are only updating 1 or 2 tables, log shipping, replication, and failover may not work well for you, especially since you are not replicating the table as-is, due to the different retention policies if nothing else. So make sure you understand how replication and the like work before you go down that path. If these tables make up more than perhaps 50% of the total database, log-shipping-style methods might still be your best option. They work well and handle downtime for you: you just replicate the source database to the OLAP server and then update your OLAP database from the duplicate.
Doing an update like this every second is an unusual requirement. However, if you create a linked server, you should be able to insert your selected rows into a staging table on the remote server and then update your OLAP table(s) from them. If you can reliably update your OLAP table(s) on the remote server within 1 second, you have a potentially useful method. If not, you may fall behind on posting data to your OLAP tables. If you can update once a minute instead, you are much less likely to fall behind on the update cycle (at the cost of being slightly less current at all times).
You want to consider putting AFTER triggers on the source table(s) that copy the changes into staging table(s) (still on the source database), with an identity column on the staging table plus a flag to indicate insert, update, or delete. With that in place you are well positioned to ship updates for one or a few tables instead of the whole database. You don't need to re-query your source database repeatedly to determine what data needs to be transmitted; just select the top 1000 rows from your staging table(s) (ordered by the staging id) and move them to the remote staging table.
If you fall behind, the top-1000 loop keeps any single cross-server call from trying to post too much data.
Depending on your data, you may be able to optimize storage and reduce log churn by not copying all columns to your staging table, just the staging id and the primary key of the source table, and pretending that whatever data is in the source record at the time you post it to the OLAP database accurately reflects the data at the time the record was staged. It won't be 100% accurate in your OLAP table at all times, but it will be accurate eventually.
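A rough sketch of the staging table, the AFTER trigger, and the batched pull (all object names are hypothetical, and the trigger assumes the source table has a single-column primary key called Id):

    CREATE TABLE dbo.RawData_Staging (
        StagingId  bigint  IDENTITY(1,1) PRIMARY KEY,
        SourceId   int     NOT NULL,   -- primary key of the source row
        ChangeType char(1) NOT NULL    -- 'I' = insert, 'U' = update, 'D' = delete
    );
    GO

    CREATE TRIGGER dbo.trg_RawData_Capture ON dbo.RawData
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Inserts and updates: rows present in the "inserted" pseudo-table.
        INSERT INTO dbo.RawData_Staging (SourceId, ChangeType)
        SELECT i.Id,
               CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
        FROM inserted AS i;

        -- Deletes: rows present only in the "deleted" pseudo-table.
        INSERT INTO dbo.RawData_Staging (SourceId, ChangeType)
        SELECT d.Id, 'D'
        FROM deleted AS d
        WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.Id = d.Id);
    END;
    GO

    -- Ship changes to the remote staging table in batches, oldest first:
    SELECT TOP (1000) StagingId, SourceId, ChangeType
    FROM dbo.RawData_Staging
    ORDER BY StagingId;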
I cannot overemphasize that you need to accommodate downtime in your design, unless you can live with data loss or just wrong data. Even reliable connections are not 100% reliable.
Background:
I am developing an application that allows users to generate lots of different reports. The data is stored in PostgreSQL and has a natural unique group key, so the data with one group key is totally independent of the data with other group keys. Reports are built using only one group key at a time, so all of the queries use a "WHERE groupKey = X" clause. The data in PostgreSQL is updated intensively via parallel processes which add data to different groups, but I don't need a realtime report; one update per 30 minutes is fine.
Problem:
There are about 4 GB of data already, and I have found that some reports take significant time to generate (up to 15 seconds) because they need to query not a single table but 3-4 of them.
What I want to do is to reduce the time it takes to create a report without significantly changing the technologies or schemes of the solution.
Possible solutions
What I have been thinking of:
Splitting the one database into several databases, one per group key. Then I would get rid of WHERE groupKey = X (though I have an index on that column in each table), and the number of rows to process each time would be significantly smaller.
Creating a slave database for reads only. Then I would have to sync the data with PostgreSQL's replication mechanism, for example once per 15 minutes. (Can I actually do that, or do I have to write custom code?)
I don't want to move to a NoSQL database, because I would have to rewrite all the SQL queries and I don't want to. I might switch to another SQL database with column-store support if it is free and runs on Windows (sorry, I don't have a Linux server, but I might get one if I have to).
Your ideas
What would you recommend as the first simple steps?
Two thoughts immediately come to mind for reporting:
1). Set up some summary (aka "aggregate") tables that hold precomputed results of the queries your users are likely to run, e.g. a table containing the counts and sums grouped by the various dimensions. This can be an automated process: a db function (or script) gets run via your job scheduler of choice to refresh the data every N minutes (a sketch follows after item 2).
2). Regarding replication, if you use streaming replication (PostgreSQL 9+), the changes in the master db are replicated to the slave databases (hot standby = read only), which you can then use for reporting.
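For item 1, a minimal sketch using a materialized view as the summary table (this assumes PostgreSQL 9.4+ for REFRESH ... CONCURRENTLY; the table and column names are made up):

    -- Precomputed counts/sums per group key and hour.
    CREATE MATERIALIZED VIEW report_summary AS
    SELECT group_key,
           date_trunc('hour', created_at) AS bucket,
           count(*)                       AS row_count,
           sum(amount)                    AS total_amount
    FROM   events
    GROUP  BY group_key, date_trunc('hour', created_at);

    -- A unique index is required for CONCURRENTLY refreshes.
    CREATE UNIQUE INDEX report_summary_key ON report_summary (group_key, bucket);

    -- Run every 30 minutes from cron / Task Scheduler:
    REFRESH MATERIALIZED VIEW CONCURRENTLY report_summary;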
Tune the report queries. Use EXPLAIN (a sketch follows below) and avoid procedural code when you could do it in pure SQL.
Tune the server: memory, disk, processor. Take a look at the server config.
Upgrade your PostgreSQL version.
Run VACUUM (and ANALYZE).
Out of these four, only one will require significant changes in the application.
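For the query-tuning point, a sketch of the EXPLAIN step (hypothetical table and group key):

    -- Show the actual plan, timings, and buffer usage of a slow report query.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT date_trunc('day', created_at) AS day, count(*), sum(amount)
    FROM   events
    WHERE  group_key = 42
    GROUP  BY date_trunc('day', created_at);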
We have an audit database (Oracle) that holds monitoring information for all activities performed by the services (about 100 of them) deployed on our application servers. As you may imagine, the audit database is really huge because of the volume of requests the services process, and the only write activity on this database is the services inserting audit information in real time.
As the audit database grew (to more than a million records per day), querying the required data (for example, all errors that occurred with service A for requests between a start date and an end date) quickly became nearly impossible.
To address this, some "smart kids" decided to devise a batch job that copies data from the database over to another database (say, audit_archives) and deletes records so that only 2 days' worth of audit data is retained in the audit database.
This initially looked neat, but whenever the batch process runs, the audit process that inserts data into the audit database becomes very slow, and sometimes the batch process itself fails due to database contention.
What is a better way to design this archival so that it runs as efficiently as possible with the least impact on both the audit process and the batch?
You might want to look into partitioning your base table.
Create a mirror table (as the target of the "historic" data) and create the same partitioning scheme on that one (most probably on a per-date basis).
Then you can simply exchange the "old" partitions from one table to the other using ALTER TABLE ... EXCHANGE PARTITION. It should only take a few seconds to "move" a partition; the actual performance depends on the indexes defined (local vs. global).
This technique is usually used the other way round (to prepare new data to be fed into a reporting table in a data warehouse environment), but it should work for archiving as well.
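A sketch of the swap, assuming daily RANGE partitions and with all names made up. EXCHANGE PARTITION swaps a partition with a non-partitioned table of the same structure, so a plain intermediate table carries the rows from the audit table's partition into the archive table's matching partition:

    -- 1) Move the old partition out of the live audit table.
    ALTER TABLE audit_log
      EXCHANGE PARTITION p_20150101 WITH TABLE audit_exchange_stage
      INCLUDING INDEXES WITHOUT VALIDATION;

    -- 2) Swap those rows into the matching partition of the archive table.
    ALTER TABLE audit_archive
      EXCHANGE PARTITION p_20150101 WITH TABLE audit_exchange_stage
      INCLUDING INDEXES WITHOUT VALIDATION;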
I. The easy way:
delete old records in batches, ideally with a FORALL statement
copy the data in batches, again with FORALL (a sketch follows below)
add partitioning based on the day of the week
II. Queues:
delete old records in batches, ideally with a FORALL statement
fill audit_archives with a trigger on the audit table; in the trigger, use a queue to avoid long-running DML
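A minimal PL/SQL sketch of the batched copy-and-delete with FORALL (the table, column, and retention values are assumptions):

    DECLARE
        TYPE t_id_tab IS TABLE OF audit_log.id%TYPE;
        v_ids t_id_tab;
    BEGIN
        LOOP
            -- Grab a manageable batch of old audit rows.
            SELECT id BULK COLLECT INTO v_ids
            FROM   audit_log
            WHERE  created_at < SYSDATE - 2
            AND    ROWNUM <= 10000;

            EXIT WHEN v_ids.COUNT = 0;

            -- Copy the batch into the archive...
            FORALL i IN 1 .. v_ids.COUNT
                INSERT INTO audit_archives
                SELECT * FROM audit_log WHERE id = v_ids(i);

            -- ...then remove it from the live table.
            FORALL i IN 1 .. v_ids.COUNT
                DELETE FROM audit_log WHERE id = v_ids(i);

            COMMIT;
        END LOOP;
    END;
    /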
There is a project in flight at my organization to move customer data and all the associated records (billing transactions, etc) from one database to another, if the customer has not had account activity within a certain timeframe.
The total number of rows across all the tables is in the millions, perhaps 100 million with all the various tables combined. The schema is more or less normalized. The project's designers have decided on SSIS to execute this, and initial analysis is showing 5 months of execution time.
Basically, the process:
Fills an "archive" database that has the same schema as the database of origin
Deletes the original rows from the source database
I can provide more detail if necessary. What I'm wondering is, is SSIS the correct approach? Is there some sort of canonical way to move very large quantities of data around? Are there common performance pitfalls to avoid?
I just can't believe that this is going to take months to run and I'd like to know if there's something else that we should be looking into.
SSIS is just a tool. You can write a 100M-row transfer in SSIS that takes 24 hours, or you can write one that takes 5 months. The problem is what you write (i.e. the workflow, in the SSIS case), not SSIS.
There isn't anything specific to SSIS that would dictate that the transfer cannot be done faster than 5 months.
The guiding principles for such a task (logically partition the data, process each logical partition in parallel, eliminate access and update contention between processes, batch-commit changes, don't transfer more data than necessary over the wire, use set-based processing as much as possible, be able to suspend and resume, etc.) can be implemented in SSIS just as well as in any other technology (if not better).
For the record, the ETL world speed record stands at about 2 TB per hour, using SSIS. And as a matter of fact, I just finished a transfer of 130M rows, roughly 200 GB of data, that took some 24 hours (I'm lazy and wasn't shooting for the ETL record).
I would understand 5 months for development, testing, and deployment, but not 5 months of actual processing. That works out to about 7 rows a second, which is really, really slow.
SSIS is probably not the right choice if you are simply deleting records.
This might be of interest: Performing fast SQL Server delete operations
UPDATE: as Remus correctly points out, SSIS can perform well or badly depending on how the flows are written, and there have been some huge benchmarks (on high-end systems). But for plain deletes there are simpler ways, such as a SQL Agent job running a T-SQL delete in batches.
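For reference, a sketch of such a batched delete (the table names, filter, and batch size are assumptions):

    -- Delete in small batches so each transaction, and its log usage, stays small.
    DECLARE @rows int = 1;
    WHILE @rows > 0
    BEGIN
        DELETE TOP (50000) t
        FROM dbo.CustomerTransactions AS t
        WHERE t.CustomerId IN (SELECT CustomerId FROM dbo.InactiveCustomers);

        SET @rows = @@ROWCOUNT;
    END;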