This pertains to a project I am inheriting and cannot change table structure or data access model. I have been asked to optimize the algorithm being used to insert data into the database.
We have a dataset in table T. From that, we pull a set we will call A. We also query an XML feed and get a set we will call X.
If a value from X is in A, record in A must be updated to reflect data for X.record
If a value from X is not in A, X.record must be inserted into A
If a value from A is not in X, A.record must be preserved in A
X must be fully iterated through for all records, and A must be updated
All these changes need to be insert back into the database.
The algo as set up does the following:
Query XML into a LIST
foreach over the XML LIST
look up foreach.item in A via LINQ (i.e. query = from record in A where
record.GUID == foreach.item.GUID
select record)
if query.Count() == 0
insert into A (via context.AddToTableName(newTableNameObject)
else
var currentRecord = query.First()
set all properties on currentRecord = properties from foreach.item
context.SaveChanges()
I know this to be suboptimal. I tried to get A into a object (call it queryA) outside of the foreach loop in an effort to move the query to memory and not hitting the disk, but after thinking that through, I realized the database was already in memory.
Having added timer objects into the algo, it's clear that what is costing the most time is the SaveChanges() function call. In some cases it's 20ms, and in some others, in explicably, it will jump to 100ms.
I would prefer to only call the SaveChanges() one time. I can't figure out how to do that given my depth of knowledge of EF (which is thin at best) and the constraints of not being able to change the table structures and having to preserve data from A which is not in X.
Suggestions?
I don't think that you will improve the performance when using Entity framework:
The query
Loading each record by separate query is not good
You can improve the performance by loading multiple records in the same query. For example you can load small batch of records by using either || in condition or Contains (like IN in SQL). Contains is supported only by .NET 4.0.
Another improvement can be replacing the query with stored procedure and table valued parameter to pass all guids to SQL server join A with X.Guids and get results. Table valued parameters are only supported on SQL 2008 and newer.
The data modification
You don't have to should not call SaveChanges after each modification. You can call it after foreach loop and it will still work. It will pass all modifications in single transaction but you will not get any performance boost by such operation and according to this answer it can give you a significant boost.
EF doesn't support command batching and because of that each update or insert always takes separate round trip to the database. There is no way around this when using EF to modify data except implementing whole new EF ADO.NET provider (it is like starting a new project).
Again solution is reducing roundtrips by using stored procedure with table valued parameter
If your DB also uses that GUID as primary key and clustered index you have another performance decrease by reordering index after each insert = modifying data on disk.
The problem is not in algorithm but in the way you process the data and technology used to process the data. Entity framework is not a good choice for data pumps. You should go with these information to your boss because improving performance means more complicated change in your application. It is not your fault and it is not the fault of the programmer who did the application. This is a feature of EF which is not very well known and as I know it is not clearly stated in any MS best practices.
Related
I use SSMS 2016. I have a view that has a few millions of records. The view is not indexed and should not be as it's being updated (insert, delete, update) every 5 minutes by a job on the server to then display update data sets in to the client calling application in GUI.
The view does a very heavy volume of conversion INT values to VARCHAR appending to them some string values.
The view also does some CAST operations on the NULL assigning them column names aliases. And the worst performance hit is that the view uses FOR XML PATH('') function on 20 columns.
Also the view uses two CTEs as the source as well as Subsidiaries to define a single column value.
I made sure I created the right indexes (Clustered, nonclustered,composite and Covering) that are used in the view Select,JOIN,and WHERE clauses.
Database Tuning Advisor also have not suggested anything that could substantialy improve performance.
AS a workaround I decided to create two identical physical tables with clustered indexes on each and using the Merge statement (further converted into a SP and then Into as SQL Server Agent Job) maintain them updated. And to assure there is no long locking of the view. I will then swap(rename) the tables names immediately after each merge finishes. So in this case all the heavy workload falls onto a SQL Server Agent Job keep the tables updated.
The problem is that the merge will take roughly 15 minutes considering current size of the data, which may increase in the future. So, I need to have a real time design to assure that the view has the most up-to-date information.
Any ideas?
I have an API that i'm trying to read that gives me just the updated field. I'm trying to take that and update my tables using a stored procedure. So far the only way I have been able to figure out how to do this is with dynamic SQL but i would prefer to not do that if there is a way not to.
If it was just a couple columns, I'd just write a proc for each but we are talking about 100 fields and any of them could be updated together. One ticket might just need a timestamp updated at this time, but the next ticket might be a timestamp and who modified it while the next one might just be a note.
Everything I've read and have been taught have told me that dynamic SQL is bad and while I'll write it if I have too, I'd prefer to have a proc.
YOU CAN PERHAPS DO SOMETHING LIKE THIS:::
IF EXISTS (SELECT * FROM NEWTABLE NOT IN (SELECT * FROM OLDTABLE))
BEGIN
UPDATE OLDTABLE
SET OLDTABLE.OLDRECORDS = NEWTABLE.NEWRECORDS
WHERE OLDTABLE.PRIMARYKEY= NEWTABLE.PRIMARYKEY
END
The best way to solve your problem is using MERGE:
Performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
As you can see your update could be more complex but more efficient as well. Using MERGE requires some proficiency, but when you start to use it you'll use it with pleasure again and again.
I am not sure how your business logic works that determines what columns are updated at what time. If there are separate business functions that require updating different but consistent columns per function, you will probably want to have individual update statements for each function. This will ensure that each process updates only the columns that it needs to update.
On the other hand, if your API is such that you really don't know ahead of time what needs to be updated, then building a dynamic SQL query is a good idea.
Another option is to build a save proc that sets every user-configurable field. As long as the calling process has all of that data, it can call the save procedure and pass every updateable column. There is no harm in having a UPDATE MyTable SET MyCol = #MyCol with the same values on each side.
Note that even if all of the values are the same, the rowversion (or timestampcolumns) will still be updated, if present.
With our software, the tables that users can edit have a widely varying range of columns. We chose to create a single save procedure for each table that has all of the update-able columns as parameters. The calling processes (our web servers) have all the required columns in memory. They pass all of the columns on every call. This performs fine for our purposes.
I needed to create a function today which will always return the exact same value on the specific database it's executed on. It may / may not be the same across databases which is why it has to be able to load it from a table the first time it's required.
CREATE FUNCTION [dbo].[PAGECODEGET] ()
RETURNS nvarchar(6)
AS
BEGIN
DECLARE #PageCode nvarchar(6) = ( SELECT PCO_IDENTITY FROM PAGECODES WHERE PCO_PAGE = 'SWTE' AND PCO_TAB = 'RECORD' )
RETURN #PageCode
END
The PCO_IDENTITY field is a sql identity field, so once the record is inserted for the first time, it's always going to return the same result thereafter.
My question is, is there any way to persist this value to something equivalent to a C# readonly variable?
From a perfomance point of view I know sql will optimise the plan etc, but from a best practice point of view I'm thinking there may possibly be a better way of doing it.
We use a mix of SQL Servers, but the lowest is 2008 R2 in case there's a version specific solution.
I'm afraid there's no such thing as a global variable like you suggest in SQL Server.
As you've pointed out, the function will potentially return different results on another database, depending on a variety of factors, such as when the row was inserted, what other values exist in the table already etc. - basically, the PCO_IDENTITY value for this row cannot be relied upon to be consistent.
A few observations:
I don't see how getting this value occasionally is really going to be a performance bottleneck. I don't think best practices cover this, as selecting a value from a table is as basic as you can get.
If this is part of another larger query, you will probably get better performance by using a join to the PAGECODES table directly, rather than potentially running this function for every row
However, if you are really worried:
There are objects in the database which are persistant - tables. When you first insert this value, retrieve the PCO_IDENTITY value, and create a new table with just that in, that you may join to in your queries. Seems a bit of a waste for one value, doesn't it? (Note you could also make a view, but how would that be any better performing than the function you started with?)
You could force these values into a row with a specific PCO_IDENTITY value, using IDENTITY_INSERT. That way the value is consistent, and you know what it is - you could hard code it in your queries. (NB: Turn IDENTITY_INSERT off again afterwards, and other rows inserted into this table will continue to be automatically generated again)
TL;DR: How you are doing it is probably fine. I suspect you are trying to optimise something that isn't a problem. As always - if in doubt, try out a few approaches and measure.
I am creating a Java function that needs to use a SQL query with a lot of joins before doing a full scan of its result. Instead of hard-coding a lot of joins I decided to create a view with this complex query. Then the Java function just uses the following query to get this result:
SELECT * FROM VW_####
So the program is working fine but I want to make it faster since this SELECT command is taking a lot of time. After taking a look on its plan execution plan I created some indexes and made it +-30% faster but I want to make it faster.
The problem is that every operation in the execution plan have cost between 0% and 4% except one operation, a clustered-index insert that has +-50% of the execution cost. I think that the system is using a temporary table to store the view's data, but an index in this view isn't useful for me because I need all rows from it.
So what can I do to optimize that insert in the CWT_PrimaryKey? I think that I can't turn off that index because it seems to be part of the SQL Server's internals. I read somewhere that this operation could appear when you use cursors but I think that I am not using (or does the view use it?).
The command to create the view is something simple (no T-SQL, no OPTION, etc) like:
create view VW_#### as SELECTS AND JOINS HERE
And here is a picture of the problematic part from the execution plan: http://imgur.com/PO0ZnBU
EDIT: More details:
Well the query to create the problematic view is a big query that join a lot of tables. Based on a single parameter the Java-Client modifies the query string before creating it. This view represents a "data unit" from a legacy Database migrated to the SQLServer that didn't had any Foreign or Primary Key, so our team choose to follow this strategy. Because of that the view have more than 50 columns and it is made from the join of other seven views.
Main view's query (with a lot of Portuguese words): http://pastebin.com/Jh5vQxzA
The other views (from VW_Sintese1 until VW_Sintese7) are created like this one but without using extra views, they just use joins with the tables that contain the data requested by the main view.
Then the Java Client create a prepared Statement with the query "Select * from VW_Sintese####" and execute it using the function "ExecuteQuery", something like:
String query = "Select * from VW_Sintese####";
PreparedStatement ps = myConn.prepareStatement(query,ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = ps.executeQuery();
And then the program goes on until the end.
Thanks for the attention.
First: you should post the code of the view along with whatever is using the views because of the rest of this answer.
Second: the definition of a view in SQL Server is later used to substitute in querying. In other words, you created a view, but since (I'm assuming) it isn't an indexed view, it is the same as writing the original, long SELECT statement. SQL Server kind of just swaps it out in the DML statement.
From Microsoft's 'Querying Microsoft SQL Server 2012': T-SQL supports the following table expressions: derived tables, common table expressions (CTEs), views, inline table-valued functions.
And a direct quote:
It’s important to note that, from a performance standpoint, when SQL Server optimizes
queries involving table expressions, it first unnests the table expression’s logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression’s result in an internal work table and then interact with that work table. This means that table expressions don’t have a performance side to them—neither good nor
bad—just no side.
This is a long way of reinforcing the first statement: please include the SQL code in the view and what you're actually using as the SELECT statement. Otherwise, we can't help much :) Cheers!
Edit: Okay, so you've created a view (no performance gain there) that does 4-5 LEFT JOIN on to the main view (again, you're not helping yourself out much here by eliminating rows, etc.). If there are search arguments you can use to filter down the resultset to fewer rows, you should have those in here. And lastly, you're ordering all of this at the top, so your query engine will have to get those views, join them up to a massive SELECT statement, figure out the correct order, and (I'm guessing here) the result count is HUGE and SQL's db engine is ordering it in some kind of temporary table.
The short answer: get less data (fewer columns and only the rows you need); don't order the results if the resultset is very large, just get the data to the client and then sort it there.
Again, if you want more help, you'll need to post table schemas and index strategies for all tables that are in the query (including the views that are joined) and you'll need to include all view definitions (including the views that are joined).
A few months back, I started using a CRUD script generator for SQL Server. The default insert statement that this generator produces, SELECTs the inserted row at the end of the stored procedure. It does the same for the UPDATE too.
The previous way (and the only other way I have seen online) is to just return the newly inserted Id back to the business object, and then have the business object update the Id of the record.
Having an extra SELECT is obviously an additional database call, and more data is being returned to the application. However, it allows additional flexibility within the stored procedure, and allows the application to reflect the actual data in the table.
The additional SELECT also increases the complexity when wanting to wrap the insert/update statements in a transaction.
I am wondering what people think is better way to do it, and I don't mean the implementation of either method. Just which is better, return just the Id, or return the whole row?
We always return the whole row on both an Insert and Update. We always want to make sure our client apps have a fresh copy of the row that was just inserted or updated. Since triggers and other processes might modify values in columns outside of the actual insert/update statement, and since the client usually needs the new primary key value (assuming it was auto generated), we've found it's best to return the whole row.
The select statement will have some sort of an advantage only if the data is generated in the procedure. Otherwise the data that you have inserted is generally available to you already so no point in selecting and returning again, IMHO. if its for the id then you can have it with SCOPE_IDENTITY(), that will return the last identity value created in the current session for the insert.
Based on my prior experience, my knee-jerk reaction is to just return the freshly generated identity value. Everything else the application is inserting, it already knows--names, dollars, whatever. But a few minutes reflection and reading the prior 6 (hmm, make that 5) replies, leads to a number of “it depends” situations:
At the most basic level, what you inserted is what you’d get – you pass in values, they get written to a row in the table, and you’re done.
Slightly more complex that that is when there are simple default values assigned during an insert statement. “DateCreated” columns that default to the current datetime, or “CreatedBy” that default to the current SQL login, are a prime example. I’d include identity columns here, since not every table will (or should) contain them. These values are generated by the database upon table insertion, so the calling application cannot know what they are. (It is not unknown for web server clocks to not be synchronized with database server clocks. Fun times…) If the application needs to know the values just generated, then yes, you’d need to pass those back.
And then there are are situations where additional processing is done within the database before data is inserted into the table. Such work might be done within stored procedures or triggers. Once again, if the application needs to know the results of such calculations, then the data would need to be returned.
With that said, it seems to me the main issue underlying your decision is: how much control/understanding do you have over the database? You say you are using a tool to automatically generate your CRUD procedures. Ok, that means that you do not have any elaborate processing going on within them, you’re just taking data and loading it on in. Next question: are there triggers (of any kind) present that might modify the data as it is being written to the tables? Extend that to: do you know whether or not such triggers exists? If they’re there and they matter, plan accordingly; if you do not or cannot know, then you might need to “follow up” on the insert to see if changes occurred. Lastly: does the application care? Does it need to be informed of the results of the insert action it just requested, and if so, how much does it need to know? (New identity value, date time it was added, whether or not something changed the Name from “Widget” to “Widget_201001270901”.)
If you have complete understanding and control over the system you are building, I would only put in as much as you need, as extra code that performs no useful function impacts performance and maintainability. On the flip side, if I were writing a tool to be used by others, I’d try to build something that did everything (so as to increase my market share). And if you are building code where you don't really know how and why it will be used (application purpose), or what it will in turn be working with (database design), then I guess you'd have to be paranoid and try to program for everything. (I strongly recommend not doing that. Pare down to do only what needs to be done.)
Quite often the database will have a property that gives you the ID of the last inserted item without having to do an additional select. For example, MS SQL Server has the ##Identity property (see here). You can pass this back to your application as an output parameter of your stored procedure and use it to update your data with the new ID. MySQL has something similar.
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
With this clause, returning the whole row does not require an extra SELECT and performance-wise is the same as returning only the id.
"Which is better" totally depends on your application needs. If you need the whole row, return the whole row, if you need only the id, return only the id.
You may add an extra setting to your business object which can trigger this option and return the whole row only if the object needs it:
IF #return_whole_row = 1
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
ELSE
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.id
VALUES ('value1', 'value2')
FI
I don't think I would in general return an entire row, but it could be a useful technique.
If you are code-generating, you could generate two procs (one which calls the other, perhaps) or parametrize a single proc to determine whther to return it over the wire or not. I doubt the DB overhead is significant (single-row, got to have a PK lookup), but the data on the wire from DB to client could be significant when all added up and if it's just discarded in 99% of the cases, I see little value. Having an SP which returns different things with different parameters is a potential problem for clients, of course.
I can see where it would be useful if you have logic in triggers or calculated columns which are managed by the database, in which case, a SELECT is really the only way to get that data back without duplicating the logic in your client or the SP itself. Of course, the place to put any logic should be well thought out.
Putting ANY logic in the database is usually a carefully-thought-out tradeoff which starts with the minimally invasive and maximally useful things like constraints, unique constraints, referential integrity, etc and growing to the more invasive and marginally useful tools like triggers.
Typically, I like logic in the database when you have multi-modal access to the database itself, and you can't force people through your client assemblies, say. In this case, I would still try to force people through views or SPs which minimize the chance of errors, duplication, logic sync issues or misinterpretation of data, thereby providing as clean, consistent and coherent a perimeter as possible.