I have an .exe that compares a vbTab delimited .txt file with an SQL table.
Updates to the table's existing records goes very fast. Inserts into the table for new records is quite slow.
As I'm new to SQL, I'm wondering if my idea is crazy talk:
I thought that maybe a solution would be to "pre populate" the database with 10,000 empty rows (minus the primary key) and somehow have this speed up the process?
Any suggestions would be greatly appreciated.
There is no straightforward answer to your question as many things are unknown to us (DB configuration, HW, existing data etc.)
But you can try below things,
Try using DB export-import functionality
Instead of fetching records from DB with an iterator and comparing them with a record from a file and then do insert of modification you can directly import those records into DB using upsert (update if present or insert if not) strategy. Believe me this works lot faster than previous.
If you have indexes on that table, while import or insert drop the current indexes on that table and do the operation. After operation re-apply those indexes again. Indexes slows down the performance of inserts.
If import strategy is not good for you (If you are doing with those records first before insertion) then probably go for stored procedure for modification and insert new rows after dropping indexes.
During this activity check for DB configuration as well. Use proper tuning for buffers, paging, locking.
Hope this helps :)
To answer your question we may need more information.
How many rows does your table have?
I guess it may be a lack of indexes.
Related
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is it deletes all previous records for a given user, then inserts the new set. There is no PK on the table, only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is I cannot do this in batch because the way the script works, it's examining one user at a time, so it looks at a user, deletes the previous 10-20 records, and inserts a new set of 10-20 records. It does this over and over. I am worried that the transaction log will run out of space or other performance issues could occur. I would like to configure the table to now worry about data preservation or other items that could slow it down. I cannot drop the indexes and all that because people are accessing the table concurrently to it being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
This is called Bulk Insert, you have to drop all indexes in destination table and send insert commands in large packs (hundreds of insert statements) separated by ;
Another way is to use BULK INSERT statement http://msdn.microsoft.com/en-us/library/ms188365.aspx
but it involves dumping data to file.
See also: Bulk Insert Sql Server millions of record
It really depends upon many things
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
Partitioning the table may see performance benefits if you partition by active and inactive users (although the data set may be a little small for this)
Ensure you test how long each tweak adds or reduces your load and work from there.
I would like to ask couple question how to handle a huge 100 million of data in 1 single table.
The table will perform INSERT, SELECT & UPDATE.
I have got some advise that to Index the table and Archive the table into couple table.
Any other suggestion that can help to tweak the SQL Performance.
Case:
SQL Server 2008.
Most of the time the update regarding decimal value, and status of tiny int.
The INSERT statement will not using BULK INSERT since I'm assuming that per min that there'r a lot of users let said 10000-500000 performing INSERT statement and Update the table.
You should consider what kind of columns you have.
The more nvarchar/text/etc columns you have included in the different indexes, the slower the index will be.
Also what RDBMS are you going to use? You have different options based on SQL Server, Oracle and MySQL...
But the crucial thing is differently to build the right index's that you would use...
One other thing, you could use BULK INSERT on SQL Server to speed up the inserts.
But ask away, i have dealt with databases being populated with 70 mill data rows pr day ;)
EDIT ----- After more information has come
I'll try to take a little other approach to the case and compare it to data scraping.
There are no doubt that INSERTs are faster than UPDATEs. And you might want to make a table that acts as a "collect" table. What I mean is that it only get inserts all the time. No updates, all is handle with inserts.
Then you use a trigger/event/scheduler to handle what has come into that table and populate what you need to another(s) table(s).
This way you will be able to apply a little business logic to the "cleanup" (update) and keep the performance on the DB Server and not hold up a connection while these things are done.
This of course also have something to do with what the "final" data are to be used for...
\T
Clearly SQL 2008 is capable of 100 million records but a lot of details to look at that just do not come into play at 100 thousand. Pick a good primary key. Fill factor. Other indexes (will slow down insert but speed select). Concurrency (locking). If you can accept dirty reads then that will help performance. This question needs a lot more detail. You need to post the table design and your select, update, and insert TSQL statements. I did not vote your question down but if you don't provide more detail it will get voted down.
For insert be aware you can insert multiple rows at once and is much faster than multiple insert statements if BULK INSERT is not an option.
INSERT INTO Production.UnitMeasure
VALUES (N'FT2', N'Square Feet ', '20080923'), (N'Y', N'Yards', '20080923'), (N'Y3', N'Cubic Yards', '20080923');
I have a huge data base with complicated relations, how can I delete all tables contents without violating foreign key constraints,is there a a such way to do that?
note that I am writing a SQL script file to delete tables such as the following example:
delete from A
delete from B
delete from C
delete from D
delete from E
but I don't know what table should I start with.
In SQL Server, there is no native way to do what you're asking. You do have a few options depending on your particular environment limitations:
Figure out the relationships between the tables and start deleting rows out in the appropriate order from foreigns to parents. This may be time-consuming for a large number of objects, but is the "safest" in terms of least destruction.
Disable the foreign key constraints and TRUNCATE TABLE. This will be a bit faster if you're dealing with lots of data, but you still have to to know where all your relationships are. Not too terrible if you're working with fewer tables, though option 1 becomes just as viable
Script out the database objects and DROP DATABASE/CREATE DATABASE. If you don't care about a raw teardown of the database, this is another option, however, you'll still need to be aware of object precedence for creation. SQL Server—as well as third-party tools— offer ways to script object DROP/CREATE. If you decide to go this route, the upside is that you have a scripted backup of all the objects (which I like to keep "just in case") and future tear-downs are nearly instantaneous as long as you keep your scripts synchronized with any changes.
As you can see, it's not a terribly simple process because you're trying to subvert the very reason for the existence of the constraints.
Steps can be:
disable all the constraint in all the tables
delete all the records from all the tables
enable the constraint back again.
Also see this discussion: SQL: delete all the data from all available tables
TRUNCATE TABLE tableName
Removes all rows from a table without
logging the individual row deletions.
TRUNCATE TABLE is similar to the
DELETE statement with no WHERE clause;
however, TRUNCATE TABLE is faster and
uses fewer system and transaction log
resources.
TRUNCATE TABLE (Transact-SQL)
Dude, taking your question at face value... that you want to COMPLETELY recreate the schema with NO data... forget the individual queries (too slow)... just destroydb, and then createdb (or whatever your RDBM's equivalent is)... and you might want to hire a competent DBA.
Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)
We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
I have a table with about 45 columns and as more data goes in, the longer it takes for the inserts to happen. I have increased the size of the data and log files, reduced the fill factor on all the indexes on that table, and still slower and slower insert times. Any ideas would be GREATLY appreciated.
For inserts, you want to DECREASE the fillfactor on the indexes on the table in order to reduce page splitting.
It is somewhat expected that it will take longer to insert as more data goes in, because your indexes just plain get bigger.
Try putting in data in batches instead of row-by-row. SQL Server is more efficient that way.
Make sure you don't have too many indexes on your tables.
Consider using SQL Server 2005's INCLUDE statement on your indexes if you are just including columns in your indexes because you want them covered in your queries.
How big is the table?
What is the context? Is this a batch of many new records?
Can you post the schema including index definition?
Can you SET STATISTICS IO ON, SET STATISTICS TIME ON, and post the display for one iteration?
Is there anything pathological about the data, or the context? Is this on a server or a laptop (testing)?
Why dont you drop index before inserting and recreate index back on to table so you no need to do update statistics
You could also ensure that the indexes on that table are defragmented