JPA: preserve order in database after update

I use JPA 2 with Hibernate 4 to persist/update objects to the database. To view these objects, a web page with JSF 2 is used, which all works great.
I'm having the following problem: when I update any of the objects, the database (PostgreSQL) puts the updated object on the last row of the table. I'm wondering why this is happening. Can't the database (or is it JPA or Hibernate?) just update the row and leave it where it is? The id is of course never changed.
INSERT INTO sector (id, code) VALUES(1, 'PRIVAAT');
INSERT INTO sector (id, code) VALUES(2, 'PUBLIEK');
Select * from sector;
returns the following in every database query tool:
1 privaat
2 publiek
Now I update the first row:
update sector set code = 'privaat' where id = 1;
Now every tool returns:
2 publiek
1 privaat
This is also what happens in our application, which uses PostgreSQL as its database.
However, our regression tests, which use the in-memory database HSQLDB, always return:
1 privaat
2 publiek
The regression tests show what we expect from a database, and we never had this behavior with other databases before.
The problem is that in the web application we want to preserve the order so that users don't have to search through the data: every record keeps a fixed position. Especially when working with a lot of referenced data this is very annoying for the user. The tests confirm the expected behavior, but the application (which uses a different DB) behaves differently.
The only solution we have now is to put ORDER BY clauses on the id everywhere, but we would like a cleaner way of preserving the order. Using ORDER BY everywhere also has a small performance hit...
So my question is: is this due to how specific databases work (vendor-dependent)? Or is it due to JPA mapping queries differently depending on the dialect? If it is the latter, is there a property which can preserve the order in JPA (or maybe Hibernate)?

There is no order there. You're just getting the semi-random order that rows happen to be read from the heap in.
Use ORDER BY if you want an ordering.
Some database implementations have a natural ordering because of how the data is structured (say, as a b-tree table). PostgreSQL uses a heap, so there is no order unless you ask for one.
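For example, continuing the sector table from the question, an explicit ORDER BY makes the result deterministic regardless of where PostgreSQL happens to store the rows:
SELECT id, code
FROM sector
ORDER BY id;
In JPA/JPQL the equivalent is an ORDER BY clause on the query (or an @OrderBy annotation on a mapped collection).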

Related

Add DATE column to store when last read

We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive. (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the SELECT, are you? What about rollbacks? So you need some manager that first runs the SELECT and then writes the tracking information. And what is the point of saving tracking information together with the entity data by sending it back to the DB? Save it in a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table, i.e. the id of the record, a datetime, a userid (maybe IP address, browser version, etc.), and just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides s/he wants to know not only which records were used the most, but also which users are using the most records, or what time of day that usage happens, etc.)
This type of information can be useful for things you've never even thought of down the road, and if it starts to grow large you can always roll up and prune the table to a smaller one if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never regret having it, and it is impossible to re-create historically if you never captured it.
In terms of making sure the application doesn't slow down, you may want to SELECT the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert).
Alternatively, if this is a web application, you could use an async AJAX call to issue the logging action, which wouldn't slow down the user's experience at all.
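A rough T-SQL sketch of the single-round-trip stored procedure idea above (all object names here are made up for illustration):
CREATE PROCEDURE dbo.GetWidgetAndLogRead
    @WidgetId INT,
    @UserId   INT
AS
BEGIN
    SET NOCOUNT ON;

    -- write the tracking row to a separate log table, leaving the main table untouched
    INSERT INTO dbo.WidgetReadLog (WidgetId, UserId, ReadAt)
    VALUES (@WidgetId, @UserId, GETDATE());

    -- then return the data the client actually asked for, all in one round trip
    SELECT WidgetId, Name, Description
    FROM dbo.Widget
    WHERE WidgetId = @WidgetId;
END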
Adding a new column to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
For more information, search for "database auditing for SELECT statements".
Use another table as a key/value pair with two columns (e.g. id_selected, times) to store the ids of the records you select from your standard table, and increment the times value by 1 every time a record is selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. E.g., as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1='somevalue';
INSERT INTO countTable(id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1='somevalue' # or just build a list of ids as values from your last result
ON DUPLICATE KEY
UPDATE times=times+1
The ON DUPLICATE KEY syntax is off the top of my head and is MySQL-specific. For conditionally inserting or updating in MSSQL you would need to use MERGE instead; a sketch follows below.
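A hedged sketch of that MERGE for SQL Server, reusing the countTable(id_selected, times) idea from above:
MERGE dbo.countTable AS tgt
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
    ON tgt.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET times = tgt.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);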

Entity Framework Algorithm For Combining Data

This pertains to a project I am inheriting and cannot change table structure or data access model. I have been asked to optimize the algorithm being used to insert data into the database.
We have a dataset in table T. From that, we pull a set we will call A. We also query an XML feed and get a set we will call X.
If a value from X is in A, record in A must be updated to reflect data for X.record
If a value from X is not in A, X.record must be inserted into A
If a value from A is not in X, A.record must be preserved in A
X must be fully iterated through for all records, and A must be updated
All these changes need to be inserted back into the database.
The algorithm as currently set up does the following:
Query XML into a LIST
foreach over the XML LIST
    look up foreach.item in A via LINQ
        (i.e. query = from record in A
              where record.GUID == foreach.item.GUID
              select record)
    if query.Count() == 0
        insert into A (via context.AddToTableName(newTableNameObject))
    else
        var currentRecord = query.First()
        set all properties on currentRecord = properties from foreach.item
    context.SaveChanges()
I know this is suboptimal. I tried to pull A into an object (call it queryA) outside of the foreach loop in an effort to move the query into memory instead of hitting the disk, but after thinking that through, I realized the database was already in memory.
Having added timer objects to the algorithm, it's clear that what costs the most time is the SaveChanges() call. In some cases it's 20 ms, and in others, inexplicably, it jumps to 100 ms.
I would prefer to only call the SaveChanges() one time. I can't figure out how to do that given my depth of knowledge of EF (which is thin at best) and the constraints of not being able to change the table structures and having to preserve data from A which is not in X.
Suggestions?
I don't think that you will improve the performance much while using Entity Framework:
The query
Loading each record with a separate query is not good.
You can improve the performance by loading multiple records in the same query. For example you can load a small batch of records by using either || in the condition or Contains (like IN in SQL). Contains is supported only in .NET 4.0.
Another improvement can be replacing the query with a stored procedure and a table-valued parameter: pass all the GUIDs to SQL Server, join A with X's GUIDs and get the results back. Table-valued parameters are only supported on SQL Server 2008 and newer.
The data modification
You should not call SaveChanges after each modification. You can call it once after the foreach loop and it will still work. It will pass all modifications in a single transaction and, according to this answer, it can give you a significant performance boost.
EF doesn't support command batching, and because of that each update or insert always takes a separate round trip to the database. There is no way around this when using EF to modify data, short of implementing a whole new EF ADO.NET provider (which is like starting a new project).
Again, the solution is to reduce round trips by using a stored procedure with a table-valued parameter.
If your DB also uses that GUID as the primary key and clustered index, you get another performance hit from reordering the index after each insert (= moving data around on disk).
The problem is not the algorithm but the way you process the data and the technology used to process it. Entity Framework is not a good choice for data pumps. You should take this information to your boss, because improving the performance means a more complicated change to your application. It is not your fault, and it is not the fault of the programmer who wrote the application. This is a characteristic of EF which is not very well known and, as far as I know, is not clearly stated in any MS best practices.
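To make the stored procedure + table-valued parameter suggestion concrete, here is a rough T-SQL sketch (the table A, its columns and the type name are illustrative assumptions; requires SQL Server 2008+):
CREATE TYPE dbo.XRecordList AS TABLE
(
    RecordGuid UNIQUEIDENTIFIER PRIMARY KEY,
    SomeValue  NVARCHAR(100)
);
GO

CREATE PROCEDURE dbo.UpsertFromX
    @Records dbo.XRecordList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- one round trip: update matches, insert the rest,
    -- and leave rows of A that are not in X untouched
    MERGE dbo.A AS tgt
    USING @Records AS src
        ON tgt.RecordGuid = src.RecordGuid
    WHEN MATCHED THEN
        UPDATE SET SomeValue = src.SomeValue
    WHEN NOT MATCHED THEN
        INSERT (RecordGuid, SomeValue) VALUES (src.RecordGuid, src.SomeValue);
END
The application then streams the whole X set into @Records (e.g. as a DataTable) and calls the procedure once, instead of issuing one statement per record.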

SQL 2005/2008 Database track field change

I do not want Auditing or History tracking.
I have an application that pulls its data from a external source.
We mirror the external tables in our own DB. We allow our users to update the data in our DB.
Next time the system syncs with the external data, we only overwrite fields that we have not changed locally.
Off the top of my head I can think of 2 ways to do this:
1) Store 2 rows for each object. The first is the external version; the 2nd row links to the external version but only has data in a field if that field has been changed.
e.g.
id | parentId | field1 | field2
1 | null | foo | bar
2 | 1 | new foo | null
This illustrates what the data would look like when a local user changed field1.
If no change occurred there would only be the first row.
2) Double the number of columns in the table.
e.g
name_external
name_internal
I like option 1 better as it seems like it would provide better separation and be easier to query and to do in-code comparisons between the 2 objects. The only downside is that it will result in more rows, but the DB will be pretty small.
Are there any other patterns I could use? Or a reason I shouldn't go with option 1?
I will be using .NET 4 WCF services
Solution
I will go with the two-table answer provided below, and use the following SQL to return a row in which the fields that have changed locally are merged with the original values:
SELECT
    a.[ID],
    ISNULL(b.[firstName], a.[firstName]) AS firstName,
    ISNULL(b.[lastName], a.[lastName]) AS lastName,
    ISNULL(b.[dob], a.[dob]) AS dob,
    ISNULL(b.active, a.active) AS active
FROM tbl1 a
LEFT JOIN tbl2 b ON a.[ID] = b.[ID]
Since in my case the DB will only ever be updated via the UI, I can ensure users are not allowed to enter NULL as a value; instead I force them to enter a blank string. This lets me get around the issue of what happens if a user updates a value to NULL.
There are two issues I can think of if you choose option 1.
Allowing users to update the data means you will either have to write procedures that perform the insert/update/delete statements for them, managing the double-row structure, or you will have to train all the users to update the table correctly.
The other problem is modelling fields in your table which can be NULL. If you are using NULL to represent "this field has not changed", how can you represent a field changing to NULL?
The second option of doubling the number of columns avoids the complexity of updates and allows you to store NULL values, but you may see a performance decrease due to the increased row size and therefore the amount of data the server has to move around (without testing it, I realise this claim is unsubstantiated, but I thought it worth mentioning).
Another suggestion would be to duplicate the tables, perhaps putting them in another schema, to hold a snapshot of the data taken just after the sync with the external data source is performed.
I know you do not want a full audit; this would just be a copy of the data at a point in time. You can then allow users to modify the data as they wish, without the complexity of the double row/column, and when you come to re-sync with the external source you can find out which fields have changed.
I found a great article: The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION! describing such a comparison.
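The gist of that UNION technique, sketched against the column names used in the question's solution (snapshot_tbl1 stands in for the hypothetical snapshot copy):
SELECT MIN(src) AS src, ID, firstName, lastName, dob, active
FROM (
    SELECT 'snapshot' AS src, ID, firstName, lastName, dob, active FROM snapshot_tbl1
    UNION ALL
    SELECT 'live' AS src, ID, firstName, lastName, dob, active FROM tbl1
) AS combined
GROUP BY ID, firstName, lastName, dob, active
HAVING COUNT(*) = 1
ORDER BY ID;
Rows that are identical in both tables collapse into a group of two and are filtered out; anything returned differs between the snapshot and the live data (or exists in only one of them).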
Consider having an update trigger that maintains an UpdatedBy field. We have a dedicated user for our imports that no other process is allowed to use, so the field identifies rows that were last updated by the import. If anything other than user -1 was the last to update a row, you can easily tell it was changed locally. Then you can use the MERGE statement to insert/update/delete.
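A minimal sketch of that trigger (it assumes an UpdatedBy column on the table; this variant stamps the login name rather than a numeric user id like -1, but the idea is the same):
CREATE TRIGGER dbo.trg_tbl1_UpdatedBy
ON dbo.tbl1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- stamp each updated row with the login that changed it;
    -- rows last touched by the dedicated import login count as "not changed locally"
    UPDATE t
    SET UpdatedBy = SUSER_SNAME()
    FROM dbo.tbl1 AS t
    INNER JOIN inserted AS i ON i.ID = t.ID;
END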

What should be returned when inserting into SQL?

A few months back, I started using a CRUD script generator for SQL Server. The default insert statement that this generator produces, SELECTs the inserted row at the end of the stored procedure. It does the same for the UPDATE too.
The previous way (and the only other way I have seen online) is to just return the newly inserted Id back to the business object, and then have the business object update the Id of the record.
Having an extra SELECT is obviously an additional database call, and more data is being returned to the application. However, it allows additional flexibility within the stored procedure, and allows the application to reflect the actual data in the table.
The additional SELECT also increases the complexity when wanting to wrap the insert/update statements in a transaction.
I am wondering what people think is the better way to do it, and I don't mean the implementation of either method. Just which is better: return just the Id, or return the whole row?
We always return the whole row on both an Insert and Update. We always want to make sure our client apps have a fresh copy of the row that was just inserted or updated. Since triggers and other processes might modify values in columns outside of the actual insert/update statement, and since the client usually needs the new primary key value (assuming it was auto generated), we've found it's best to return the whole row.
The SELECT statement has an advantage only if the data is generated in the procedure. Otherwise the data you have inserted is generally already available to you, so there is no point in selecting and returning it again, IMHO. If it's just for the id, you can get it with SCOPE_IDENTITY(), which returns the last identity value created in the current scope.
Based on my prior experience, my knee-jerk reaction is to just return the freshly generated identity value. Everything else the application is inserting, it already knows--names, dollars, whatever. But a few minutes reflection and reading the prior 6 (hmm, make that 5) replies, leads to a number of “it depends” situations:
At the most basic level, what you inserted is what you’d get – you pass in values, they get written to a row in the table, and you’re done.
Slightly more complex than that is when there are simple default values assigned during an insert statement. "DateCreated" columns that default to the current datetime, or "CreatedBy" columns that default to the current SQL login, are prime examples. I'd include identity columns here, since not every table will (or should) contain them. These values are generated by the database upon table insertion, so the calling application cannot know what they are. (It is not unknown for web server clocks to be out of sync with database server clocks. Fun times…) If the application needs to know the values just generated, then yes, you'd need to pass those back.
And then there are situations where additional processing is done within the database before data is inserted into the table. Such work might be done within stored procedures or triggers. Once again, if the application needs to know the results of such calculations, then the data would need to be returned.
With that said, it seems to me the main issue underlying your decision is: how much control/understanding do you have over the database? You say you are using a tool to automatically generate your CRUD procedures. Ok, that means that you do not have any elaborate processing going on within them, you’re just taking data and loading it on in. Next question: are there triggers (of any kind) present that might modify the data as it is being written to the tables? Extend that to: do you know whether or not such triggers exists? If they’re there and they matter, plan accordingly; if you do not or cannot know, then you might need to “follow up” on the insert to see if changes occurred. Lastly: does the application care? Does it need to be informed of the results of the insert action it just requested, and if so, how much does it need to know? (New identity value, date time it was added, whether or not something changed the Name from “Widget” to “Widget_201001270901”.)
If you have complete understanding and control over the system you are building, I would only put in as much as you need, as extra code that performs no useful function impacts performance and maintainability. On the flip side, if I were writing a tool to be used by others, I’d try to build something that did everything (so as to increase my market share). And if you are building code where you don't really know how and why it will be used (application purpose), or what it will in turn be working with (database design), then I guess you'd have to be paranoid and try to program for everything. (I strongly recommend not doing that. Pare down to do only what needs to be done.)
Quite often the database will have a built-in way to give you the ID of the last inserted item without having to do an additional select. For example, MS SQL Server has @@IDENTITY. You can pass this back to your application as an output parameter of your stored procedure and use it to update your data with the new ID. MySQL has something similar.
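A minimal sketch of that output-parameter pattern (names are illustrative; SCOPE_IDENTITY() is used here because, unlike @@IDENTITY, it ignores identity values generated by triggers):
CREATE PROCEDURE dbo.InsertWidget
    @Name  NVARCHAR(100),
    @NewId INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.Widget (Name)
    VALUES (@Name);

    -- hand the generated identity back to the caller without an extra result set
    SET @NewId = SCOPE_IDENTITY();
END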
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
With this clause, returning the whole row does not require an extra SELECT and performance-wise is the same as returning only the id.
"Which is better" totally depends on your application needs. If you need the whole row, return the whole row, if you need only the id, return only the id.
You may add an extra setting to your business object which can trigger this option and return the whole row only if the object needs it:
IF @return_whole_row = 1
BEGIN
    INSERT INTO mytable (col1, col2)
    OUTPUT INSERTED.*
    VALUES ('value1', 'value2');
END
ELSE
BEGIN
    INSERT INTO mytable (col1, col2)
    OUTPUT INSERTED.id
    VALUES ('value1', 'value2');
END
I don't think I would in general return an entire row, but it could be a useful technique.
If you are code-generating, you could generate two procs (one which calls the other, perhaps) or parameterize a single proc to determine whether to return the row over the wire or not. I doubt the DB overhead is significant (single row, PK lookup), but the data on the wire from DB to client could be significant when all added up, and if it's just discarded in 99% of cases, I see little value. Having an SP which returns different things with different parameters is a potential problem for clients, of course.
I can see where it would be useful if you have logic in triggers or calculated columns which are managed by the database, in which case, a SELECT is really the only way to get that data back without duplicating the logic in your client or the SP itself. Of course, the place to put any logic should be well thought out.
Putting ANY logic in the database is usually a carefully-thought-out tradeoff which starts with the minimally invasive and maximally useful things like constraints, unique constraints and referential integrity, and grows toward the more invasive and marginally useful tools like triggers.
Typically, I like logic in the database when you have multi-modal access to the database itself, and you can't force people through your client assemblies, say. In this case, I would still try to force people through views or SPs which minimize the chance of errors, duplication, logic sync issues or misinterpretation of data, thereby providing as clean, consistent and coherent a perimeter as possible.

What is a maintainable way to store large text fields without sacrificing performance?

I have been dancing around this issue for a while, but it keeps coming up. We have a system where many of our tables start with a description that is originally stored as NVARCHAR(150), and then we get a ticket asking to expand the field size to 250, then 1000, etc., etc...
This cycle is repeated on every "note" and/or "description" field we add to most tables. Of course my concern is performance and breaking the 8k page limit. However, my other concern is making the system less maintainable by breaking these fields out of EVERY table in the system into a lazy-loaded reference.
So here I am, faced with the same 2 options that have been staring me in the face (others are welcome); please lend me your opinions.
Change all my notes and/or descriptions to NVARCHAR(MAX) and make sure we exclude these fields in all listings. Basically never do SELECT * FROM [TableName] unless it is only retrieving one record.
Remove all notes and/or description fields and replace them with a foreign key reference to a [Notes] table.
CREATE TABLE [dbo].[Notes] (
    [NoteId]   [int] NOT NULL,
    [NoteText] [NVARCHAR](MAX) NOT NULL
)
Obviously I would prefer to use option 1, because so much in our system would have to change if we go with option 2. However, if option 2 is really the only good way to proceed, then at least I can say these changes are necessary and that I have done my homework.
UPDATE:
I ran several tests on a sample database with 100,000 records in it. What I found is that, because of clustered index scans, the IO required for option 1 is roughly twice that of option 2. If I select a large number of records (1000 or more), option 1 is twice as slow even if I do not include the large text field in the select. As I request fewer rows, the lines blur. In a web app where page sizes of 50 or so are the norm, option 1 will work, but I will be converting all instances to option 2 in the (very) near future for scalability.
Option 2 is better for several reasons:
1) When querying your tables, the large text fields fill up pages quickly, forcing the database to scan more pages to retrieve data. This is especially taxing when you don't actually need to return the text data.
2) As you mentioned, it gives you a clean break to change the data type in one swoop. Microsoft has deprecated TEXT in SQL Server 2008, so you should stick with VARCHAR/VARBINARY.
3) Separate filegroups. Having all your text data in a slower, cheaper storage location might be something you decide to pursue in the future. If not, no harm, no foul.
While Option 1 is easier for now, Option 2 will give you more flexibility in the long-term. My suggestion would be to implement a simple proof-of-concept with the "notes" information separated from the main table and perform some of your queries on both examples. Compare the execution plans, client statistics and logical I/O reads (SET STATISTICS IO ON) for some of your queries against these tables.
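For instance, something along these lines, where Item_Wide is the hypothetical option-1 table and Item_Narrow the option-2 table with the notes split out:
SET STATISTICS IO ON;

-- option 1: wide rows, text column simply excluded from the select
SELECT TOP (50) Id, Title FROM dbo.Item_Wide ORDER BY Id;

-- option 2: narrow rows, notes live in a separate table
SELECT TOP (50) Id, Title FROM dbo.Item_Narrow ORDER BY Id;

SET STATISTICS IO OFF;
The logical reads reported for each statement show how much extra page I/O the in-row text is costing.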
A quick note to those suggesting the use of TEXT/NTEXT, from MSDN:
"This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature. Use varchar(max), nvarchar(max) and varbinary(max) data types instead. For more information, see Using Large-Value Data Types."
I'd go with Option 2.
You can create a view that joins the two tables to make the transition easier on everyone, and then go through a clean-up process that removes the view and uses the single table wherever possible.
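A sketch of what that transitional view might look like under option 2 (the main table name, its columns and the NoteId foreign key are assumptions for illustration):
CREATE VIEW dbo.ItemWithNotes
AS
SELECT i.ItemId,
       i.Title,
       n.NoteText
FROM dbo.Item AS i
LEFT JOIN dbo.Notes AS n ON n.NoteId = i.NoteId;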
You want to use a TEXT field. TEXT fields aren't stored directly in the row; instead, the row stores a pointer to the text data. This is transparent to queries, though - if you ask for a TEXT field, it returns the actual text, not the pointer.
Essentially, using a TEXT field is somewhat between your two solutions. It keeps your table rows much smaller than using a varchar, but you'll still want to avoid asking for them in your queries if possible.
The TEXT/NTEXT data type has practically unlimited length while taking up next to nothing in your record.
It comes with a few strings attached, like special behavior with string functions, but for a secondary "notes/description" type of field these may be less of a problem.
Just to expand on Option 2
You could:
Rename existing MyTable to MyTable_V2
Move the Notes column into a joined Notes table (with 1:1 joining ID)
Create a VIEW called MyTable that joins MyTable_V2 and Notes tables
Create an INSTEAD OF trigger on MyTable view which saves the Notes column into the Notes table (IF NULL then delete any existing Notes row, if NOT NULL then Insert if not found, otherwise Update). Perform appropriate action on MyTable_V2 table
Note: We've had trouble doing this where there is a Computed column in MyTable_V2 (I think that was the problem, either way we've hit snags when doing this with "unusual" tables)
All new Insert/Update/Delete code should be written to operate directly on MyTable_V2 and Notes tables
Optionally: Have the INSTEAD OF trigger on the MyTable view log the fact that it was called (it can do this minimally, e.g. UPDATE a pre-existing log table row with GetDate() only if the existing row's date is > 24 hours old, so it only does one update per day).
When you are no longer getting any log records you can drop the INSTEAD OF trigger on MyTable view and you are now fully MyTable_V2 compliant!
Huge amount of hassle to implement, as you surmised.
Alternatively, trawl the code for all references to MyTable and change them to MyTable_V2, put a VIEW in place of MyTable for SELECT only, and don't create the INSTEAD OF trigger.
My plan would be to fix all Insert/Update/Delete statements referencing the now-deprecated MyTable. For me this would be made somewhat easier because we use unique names for all tables and columns in the database, and we use the same names in all application code, so confidence that a simple FIND had located every instance would be high.
P.S. Option 2 is also preferable if you have any SELECT * lying around. We have had clients whose application performance went downhill fast when they added large Text/Blob columns to existing tables, because of "lazy" SELECT * statements. Hopefully that isn't the case in your shop though!
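For reference, a rough sketch of the view + INSTEAD OF trigger steps described above, covering only the UPDATE path (column names are illustrative, and the MERGE needs SQL Server 2008+; on 2005 use an IF EXISTS update/insert instead):
-- the view that emulates the old MyTable
CREATE VIEW dbo.MyTable
AS
SELECT v2.Id, v2.SomeColumn, n.NoteText AS Notes
FROM dbo.MyTable_V2 AS v2
LEFT JOIN dbo.Notes AS n ON n.Id = v2.Id;
GO

CREATE TRIGGER dbo.trg_MyTable_InsteadOfUpdate
ON dbo.MyTable
INSTEAD OF UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- apply the non-notes columns to the base table
    UPDATE v2
    SET SomeColumn = i.SomeColumn
    FROM dbo.MyTable_V2 AS v2
    INNER JOIN inserted AS i ON i.Id = v2.Id;

    -- IF NULL: delete any existing notes row
    DELETE n
    FROM dbo.Notes AS n
    INNER JOIN inserted AS i ON i.Id = n.Id
    WHERE i.Notes IS NULL;

    -- IF NOT NULL: insert if not found, otherwise update
    MERGE dbo.Notes AS tgt
    USING (SELECT Id, Notes FROM inserted WHERE Notes IS NOT NULL) AS src
        ON tgt.Id = src.Id
    WHEN MATCHED THEN
        UPDATE SET NoteText = src.Notes
    WHEN NOT MATCHED THEN
        INSERT (Id, NoteText) VALUES (src.Id, src.Notes);
END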
