I would like to ask a couple of questions about how to handle roughly 100 million rows in a single table.
The table will receive INSERT, SELECT & UPDATE statements.
I have been advised to index the table and to archive it into a couple of smaller tables.
Are there any other suggestions that could help tweak the SQL performance?
Case:
SQL Server 2008.
Most of the time the updates touch a decimal value and a tinyint status column.
The inserts will not use BULK INSERT, since I'm assuming a large number of users (let's say 10,000-500,000 per minute) performing individual INSERT and UPDATE statements against the table.
You should consider what kind of columns you have.
The more nvarchar/text/etc columns you have included in the different indexes, the slower the index will be.
Also what RDBMS are you going to use? You have different options based on SQL Server, Oracle and MySQL...
But the crucial thing is definitely to build the right indexes for the queries you will actually run...
One other thing, you could use BULK INSERT on SQL Server to speed up the inserts.
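A minimal BULK INSERT sketch, assuming a comma-delimited file and a staging table (both the path and the table name here are placeholders):

BULK INSERT dbo.MyStagingTable
FROM 'C:\import\data.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);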
But ask away, I have dealt with databases being populated with 70 million data rows per day ;)
EDIT ----- after more information came in
I'll try to take a slightly different approach to the case and compare it to data scraping.
There is no doubt that INSERTs are faster than UPDATEs, so you might want to create a table that acts as a "collect" table. By that I mean a table that only ever receives inserts. No updates; everything is handled with inserts.
Then you use a trigger/event/scheduled job to process what has arrived in that table and populate whatever you need into another table (or tables).
This way you can apply a little business logic during the "cleanup" (update), keep the load on the DB server, and avoid holding up a connection while these things are done.
This of course also depends on what the "final" data are to be used for...
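As a rough sketch of that collect-table pattern (the table and column names below are placeholders, picked to match the decimal/tinyint updates mentioned in the question):

-- Insert-only "collect" table
CREATE TABLE dbo.PriceCollect (
    CollectId   bigint IDENTITY(1,1) PRIMARY KEY,
    ItemId      int           NOT NULL,
    NewPrice    decimal(18,2) NOT NULL,
    NewStatus   tinyint       NOT NULL,
    CollectedAt datetime      NOT NULL DEFAULT GETDATE()
);

-- Run periodically by a job: fold the collected rows into the main table
-- (simplified; a real job would pick the latest row per ItemId),
-- then clear the collect table.
UPDATE i
SET    i.Price  = c.NewPrice,
       i.Status = c.NewStatus
FROM   dbo.Items AS i
JOIN   dbo.PriceCollect AS c ON c.ItemId = i.ItemId;

DELETE FROM dbo.PriceCollect;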
Clearly SQL Server 2008 is capable of 100 million records, but there are a lot of details to look at that just do not come into play at 100 thousand. Pick a good primary key. Consider the fill factor. Other indexes will slow down inserts but speed up selects. Think about concurrency (locking); if you can accept dirty reads, that will help performance. This question needs a lot more detail: you need to post the table design and your SELECT, UPDATE, and INSERT TSQL statements. I did not vote your question down, but if you don't provide more detail it will get voted down.
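If dirty reads are acceptable, the usual ways to allow them are the READ UNCOMMITTED isolation level or a NOLOCK hint; a small sketch (table and column names are placeholders):

-- For the whole session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Or per query
SELECT Price, Status
FROM dbo.Items WITH (NOLOCK)
WHERE ItemId = 42;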
For inserts, be aware that you can insert multiple rows at once, which is much faster than multiple INSERT statements if BULK INSERT is not an option.
INSERT INTO Production.UnitMeasure
VALUES (N'FT2', N'Square Feet ', '20080923'), (N'Y', N'Yards', '20080923'), (N'Y3', N'Cubic Yards', '20080923');
Related
The following statement takes at least 4 seconds:
INSERT INTO [SomeSmallTable]
SELECT * FROM ComplexView
WHERE [Date] = convert(datetime, '23/09/2020',103)
However, if we only run the SELECT part without the INSERT INTO, it takes less than half a second:
SELECT *
FROM ComplexView
WHERE [Date] = convert(datetime, '23/09/2020',103)
The view selects fewer than 200 rows, and the table called "SomeSmallTable" holds only a few rows. I think this issue started when we updated the view called "ComplexView". ComplexView is based on other views (and some of those views are themselves based on other views), as well as some tables.
I tried to refresh all views using sp_refreshview, but to no avail.
How can we determine the cause of this issue and hopefully solve it?
[EDIT]
My reply to some comments:
#Dale K: I can't post the execution plans; I think they are way too complex, and not relevant as they are identical for both statements, with or without the INSERT part, except for the Table Insert operator. But I did see that the INSERT costs 100%. For some reason SQL Server has trouble inserting the view results into the table.
#Panagiotis Kanavos: Nobody but me is using the database. It's a copy of our client's database and I'm working on it on my local machine.
#gotqn: SomeSmallTable is a regular table, so no table variable or temporary table. However, it is created when a user opens a specific form in our application, and deleted when the user closes that form.
#Arvo: SomeSmallTable has no keys and no triggers. The view returns fewer than 200 rows, which are inserted into this table, and before these are inserted the table is empty.
I followed the steps in the accepted answer, and eventually compared the current "ComplexView" with the previous version, and found out what caused this issue.
Checking the execution plan is the first step, as others have said. Given that the INSERT (rather than the query) is causing the delay, you could troubleshoot that further. Here are some things you can try:
Try using STATISTICS IO to find out more, as answered here (see the sketch after this list).
Attempt an INSERT using static data (e.g. INSERT INTO [SomeSmallTable] VALUES (1, 2, '...etc');). This will tell you whether the issue affects any INSERT statement, or only inserting from the view.
Check how much data the view is returning. 4s may or may not be reasonable, depending on how many rows are being inserted.
Check the table design to see how it is using primary keys, foreign keys, composite keys, indexes, triggers, etc. Some of these features optimise a table's design for selecting, but make insertion slower as a trade-off. A good answer about this can be found here.
If you know it's not a load issue (because you're the only one using this database), check whether something else might be restricting resources on the machine you're using (other resource-intensive tasks, other queries running at the same time, scheduled jobs within SQL Server, etc.). You can use SQL Server Profiler to watch the queries in real time.
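A rough sketch of the first two steps above (the static values in the last statement are made up and must be adapted to the columns of [SomeSmallTable]):

-- Compare logical reads and CPU time with and without the INSERT
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

INSERT INTO [SomeSmallTable]
SELECT * FROM ComplexView
WHERE [Date] = convert(datetime, '23/09/2020', 103);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

-- Check whether any INSERT into the table is slow, or only the one fed by the view
INSERT INTO [SomeSmallTable] VALUES (1, 2, 'test');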
If slow performance is not limited to this particular query, then there are other general design considerations you can look into.
I need to know what the impact on a production DB might be of creating triggers on ~30 production tables that capture every UPDATE, DELETE and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test this, as I have read-only permissions on both the Prod and Test environments (and I can get one working day with 10 end users to test it).
I have estimated that the number of records inserted by those triggers will be around 150-200k daily.
Background:
I have a project to deploy a Data Warehouse for a database that is heavily customized, plus there are jobs running every day that manipulate the data. The "Updated On" date column is not being maintained (a customization), plus there are hard deletes occurring on tables. We decided to ask the DEV team to add triggers like:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
-- rows that were inserted or updated (new values)
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted
-- rows that were deleted or updated (old values)
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted
on ~30 core production tables.
Based on this table we will pull delta from last 24 hours and push it to Data Warehouse staging tables.
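The delta pull itself could then be as simple as this sketch (assuming Insert_Date is the GETDATE() column filled by the triggers):

SELECT DISTINCT Table_Name, Regular_PK
FROM For_ETL_Warehouse
WHERE Insert_Date >= DATEADD(HOUR, -24, GETDATE());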
If anyone has had a similar issue and can help me estimate how this might impact performance on the production database, I would really appreciate it. (If it works, I am saved; if not, I need to propose another solution. Currently mirroring or replication might be hard to get, as the local DEVs have no idea how to set them up...)
Other ideas on how to handle this situation or perform tests are welcome (my deadline is Friday 26-01).
First of all, I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second, you need to understand how big the payload you are going to write is, and how it gets written:
if you choose a correct clustered index (the date column), then the server just needs to write the data row by row in sequence. That is a trivially easy job even if you put in all 200k rows at once.
if you encode the table name as a tinyint, then basically it has to write:
1 byte (table name) + PK size (hopefully numeric, so <= 8 bytes) + 8 bytes datetime - roughly 17 bytes on the data page, plus indexes (if any) and the log file. This is very lightweight and again will put no "real" pressure on SQL Server (see the sketch below).
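A minimal sketch of that layout (the lookup table and the revised audit table below are illustrations, not your existing schema):

-- Lookup for the ~30 table codes
CREATE TABLE dbo.ETL_Table_Codes (
    Table_Code tinyint NOT NULL PRIMARY KEY,
    Table_Name sysname NOT NULL
);

-- Audit table keyed on the 1-byte code instead of a character table name
CREATE TABLE dbo.For_ETL_Warehouse (
    Table_Code  tinyint  NOT NULL,  -- 1 byte
    Regular_PK  bigint   NOT NULL,  -- <= 8 bytes
    Insert_Date datetime NOT NULL   -- 8 bytes
);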
The trigger itself will add a small overhead, but with the amount of rows you are talking about, it is negligible.
I have seen systems that do similar things on a much larger scale with close to zero effect on the workload, so I would say it's a safe bet. The only problem with this approach is that it will not work in some cases (e.g. DML statements that OUTPUT to temp tables). But if you do not have that kind of blocker, then go for it.
I hope it helps.
I have to insert one record per table across 30 tables. The data comes from some other system. I have to insert the data into the tables the first time; then, if any update happens, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp for individual table rows and update if the incoming timestamp is newer.
b) Every time, I can straight away delete the records and insert the data.
Which one will be faster in a SQL Server database? Is there any other option to address the situation?
If you are not changing the indexed fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If you are using SQL Server 2008+, you should be using the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
ADDED
I should also add that if you know the usage pattern rarely involves updates (i.e., 90% inserts), you may have a case where drop/insert is faster than update/insert -- it depends on a lot of details. Regardless, MERGE is the clear winner if you're on 2008+.
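A minimal MERGE sketch along those lines (table, column, and variable names are placeholders for one of the 30 tables; the timestamp check mirrors option a) from the question):

DECLARE @Id int = 1,
        @Value decimal(18,2) = 9.99,
        @Ts datetime = GETDATE();

MERGE dbo.TargetTable AS t
USING (SELECT @Id AS Id, @Value AS Value, @Ts AS RowTimestamp) AS s
    ON t.Id = s.Id
WHEN MATCHED AND s.RowTimestamp > t.RowTimestamp THEN
    UPDATE SET Value = s.Value, RowTimestamp = s.RowTimestamp
WHEN NOT MATCHED THEN
    INSERT (Id, Value, RowTimestamp) VALUES (s.Id, s.Value, s.RowTimestamp);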
I generally like drop and re-insert. I find it cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option a).
Also, another thing to factor in is how often the timestamp check fails (where you don't have to insert or update at all). If 99% of the data is redundant/outdated, you're probably better off with option a) regardless.
I have a very large database, a little over 60 GB, with many tables containing millions of rows. I am getting some timeout errors, so I am rethinking some of my code design.
Currently, my pseudo code is like this:
delete from table where person=123 (deletes about 200 rows)
Then I re-insert the updated data (again, 200 rows). The data is always different, as it's time sensitive.
If I was to do an update, instead of insert, I'd have to select the row first (I'm using an ORM in c#).
tl;dr
I am just wondering, simple question, what is more cost effective.
Select / Update or Delete/Insert?
If you update any column that is part of the clustered index key, then your update is handled internally as a delete/insert anyway.
How would you handle the difference in cardinality with an UPDATE? I.e. person=123 has 200 rows to delete, but only 199 to insert; an UPDATE alone cannot handle this.
Your best approach is to use a MERGE statement and a table-valued parameter with the new values. Of course, no ORM can handle this, but you mention 'performance', and the terms 'performance' and 'ORM' cannot be used in the same sentence...
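A rough sketch of that approach (the table type, procedure, and column names are invented for illustration; the NOT MATCHED BY SOURCE branch handles the 200-vs-199 cardinality point above):

-- Table type carrying the new rows for one person
CREATE TYPE dbo.PersonRowList AS TABLE (
    RowId   int           NOT NULL PRIMARY KEY,
    Payload nvarchar(200) NULL
);
GO

CREATE PROCEDURE dbo.UpsertPersonRows
    @person int,
    @rows   dbo.PersonRowList READONLY
AS
BEGIN
    MERGE dbo.BigTable AS t
    USING @rows AS s
        ON t.RowId = s.RowId
    WHEN MATCHED THEN
        UPDATE SET Payload = s.Payload
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (RowId, person, Payload) VALUES (s.RowId, @person, s.Payload)
    -- remove this person's rows that are no longer in the incoming set
    WHEN NOT MATCHED BY SOURCE AND t.person = @person THEN
        DELETE;
END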
With Delete/Insert, you will be writing to the database twice. One time to delete and one time to insert. You will also be logging both of those transactions separately, unless you are properly wrapping the entire process in a single transaction.
You could test both methods and watch the results in SQL Profiler, but 9 times out of 10 the UPDATE will be quicker.
Couple of caveats: I'd make sure the person key is indexed so that you are not doing a complete table scan to find the affected records.
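For the index caveat, a one-line sketch (the table name is a placeholder):

CREATE NONCLUSTERED INDEX IX_MyTable_person ON dbo.MyTable (person);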
Finally, as #Mundu says, you may want to do this using a parametrized query via ADO.NET instead of the ORM.
Our team needs to insert a huge amount of data into our SQL Server 2008 database. We're looking for a good solution. We came up with one, but I have doubts about it, simply because it doesn't feel right. So I'm asking here whether this seems like a good solution. An extra challenge is that it's a peer-to-peer replicated database across 4 servers! :)
Imagine we have 1 million rows to insert
Start transaction
Increase the current identity value on the table by 1 million (see the sketch after this list)
Have a DataSet/DataTable ready with 1 million rows and the correct ids
BulkCopy the data into the database
Commit transaction
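A sketch of step 2, the identity reservation (assuming the target column is an IDENTITY; the table name is a placeholder):

-- Reserve the next 1 million identity values up front
DECLARE @newSeed bigint = IDENT_CURRENT('dbo.TargetTable') + 1000000;
DBCC CHECKIDENT ('dbo.TargetTable', RESEED, @newSeed);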
Is this a good solution? Might we run into concurrency issues, transactions that are too large, etc.?
You'll only get problems (as far as I can see, so there might be things I've overlooked!) if the database is online and users can insert rows into that table. Increasing the identity value for new rows at the meta level simply means that the next row inserted by the system will use that number, so if you bump it by 1 million, it means you have reserved those numbers up front.
Identity columns are 'nice' but have the side effect that they're not transferable. So if you have to migrate the data to another DB, realize that you will likely have to adjust the data being inserted to match the target database (as that's the scope of the data, which means identity fields could collide with rows already in the table).
If this is a one-time affair, it might work out. If you're planning to do this regularly, I'd look into a higher-level migration system where you migrate the data to new identity values, or use GUIDs with NEWSEQUENTIALID() so you get well-behaved indexes and also unique, transferable IDs.
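A minimal sketch of the NEWSEQUENTIALID() option (table and column names are invented):

CREATE TABLE dbo.BulkTarget (
    Id uniqueidentifier NOT NULL
        CONSTRAINT DF_BulkTarget_Id DEFAULT NEWSEQUENTIALID(),
    Payload nvarchar(200) NOT NULL,
    CONSTRAINT PK_BulkTarget PRIMARY KEY CLUSTERED (Id)
);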