My JPA entity:
import java.io.Serializable;
import javax.persistence.*;

@Entity
public class Test implements Serializable {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private long id;

    @Lob
    public byte[] data;
}
Now, let’s say I store 100 entries in my database and each entry contains 3 MB.
SELECT x FROM Test x returns 100 entries and the database (on the file system) has a size of about 300 MB (as expected).
The next step is deleting all 100 entries by calling: entityManager.remove(test) for each entry.
SELECT x FROM Test x now returns an empty table, BUT the database still has a size of 300 MB! Only when I drop the table does the database shrink back to its initial size.
What's going wrong here? When I delete entries, are they not really removed?
I tried this with JavaDB and Oracle XE, and I'm using EclipseLink.
First, I would check whether the JPA transaction was committed successfully and count the entities from the database console to double-check.
If the table records are not really gone, you might have a problem deleting them; try committing the transaction or flushing the entity manager.
If the table records are gone but the disk space is still occupied, that is probably down to the database's space-management policies. Check your database configuration to see how to get the space released once the records are gone.
The database does not delete the data physically (obviously). When and how this is done depends on the database and its setup, e.g. it could be triggered by certain file-size thresholds, manual compact commands, or scheduled maintenance tasks. This is completely independent of JPA.
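For reference, reclaiming the space usually takes an explicit compaction step on the database side. A minimal sketch for the two databases mentioned, assuming the default table name TEST and, for JavaDB, the default APP schema:

-- JavaDB / Apache Derby: compress the table so free pages are returned to the OS
CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('APP', 'TEST', 1);

-- Oracle: shrink the table segment (row movement must be enabled first);
-- note that the datafile itself may still need to be resized afterwards
ALTER TABLE test ENABLE ROW MOVEMENT;
ALTER TABLE test SHRINK SPACE;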
We would like to synchronize data (insert, update) from Oracle (11g) to PostgreSQL (10). Our approach was the following:
A trigger on the table in Oracle updates a column with nextval from a sequence before insert and update.
PostgreSQL knows the last sequence number processed and fetches the rows from Oracle > lastSequenceNumberFetched.
We now have the following problem:
Session 1 in Oracle inserts a row, sequence number (let's say 45) is written but no COMMIT is done in Oracle.
Session 2 in Oracle inserts a row, sequence number is written (let's say 49 (because sequences in Oracle can have gaps)) and a COMMIT is done in Oracle.
Session in PostgreSQL fetches rows from Oracle with sequenceNumber > 44 (because the lastSequenceNumberFetched is 44) and gets the row with sequenceNumber 49. So this is the new lastSequenceNumberFetched.
Session 1 in Oracle makes a commit.
Session in PostgreSQL fetches rows from Oracle with sequenceNumber > 49. Problem is that the row with sequenceNumber 45 is never fetched.
Are there any better approaches for our use case avoiding our problem with missing data?
In case you don't have delete operations in your tables and the tables are not very big, then I suggest using the Oracle System Change Number (SCN) on the row level, which is returned by the pseudo column ORA_ROWSCN (link). This is the commit time represented as a number. By default the SCN is tracked per data block, but you can enable tracking on the row level (keyword ROWDEPENDENCIES), so you have to recreate your table with this keyword. At the start of the sync procedure you get the current SCN by calling dbms_flashback.get_system_change_number, then scan all tables where ora_rowscn is between _last_scn_value_ and _current_scn_value_. The disadvantage is that this pseudo column is not indexed, so you will get full table scans, which are slow for big tables.
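A minimal sketch of that approach; the table and column names are illustrative, and calling dbms_flashback requires the EXECUTE privilege on that package:

-- Recreate the table with row-level SCN tracking
CREATE TABLE orders (
    id      NUMBER PRIMARY KEY,
    payload VARCHAR2(4000)
) ROWDEPENDENCIES;

-- At the start of each sync run, capture the current SCN
SELECT dbms_flashback.get_system_change_number FROM dual;

-- Fetch everything committed since the previous run
SELECT id, payload
FROM   orders
WHERE  ora_rowscn >  :last_scn_value
AND    ora_rowscn <= :current_scn_value;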
If you use delete statements then you also have to track the records which were deleted. For this purpose you can use one log table with the following columns: table_name, table_id_value, operation (insert/update/delete). The table is filled by triggers on the base tables. So for your case, when session 1 commits data in a base table, you then have a record in the log table to process, and you don't see it until the session commits. So there are no issues with the sequence numbers you described.
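A minimal sketch of such a log table and trigger; all names are illustrative:

CREATE TABLE sync_log (
    table_name     VARCHAR2(30),
    table_id_value NUMBER,
    operation      VARCHAR2(6),   -- 'INSERT', 'UPDATE' or 'DELETE'
    changed_at     TIMESTAMP DEFAULT systimestamp
);

CREATE OR REPLACE TRIGGER orders_sync_trg
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        INSERT INTO sync_log (table_name, table_id_value, operation)
        VALUES ('ORDERS', :NEW.id, 'INSERT');
    ELSIF UPDATING THEN
        INSERT INTO sync_log (table_name, table_id_value, operation)
        VALUES ('ORDERS', :NEW.id, 'UPDATE');
    ELSE
        INSERT INTO sync_log (table_name, table_id_value, operation)
        VALUES ('ORDERS', :OLD.id, 'DELETE');
    END IF;
END;
/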
Hope that helps.
Is this purely a data project, or is there a client application involved? If you do have a middle tier, you could use an ORM to abstract some of this and write to both databases. Do you care whether the sequences are the same? It would also be possible to collect all the data to synchronize since a particular timestamp (every table would have to have a UTC timestamp column), then take a hash of that data and compare it with what is in Postgres.
It might be useful to know some more of your requirements for the synchronization of data and the reasoning behind them, e.g.:
Do the keys need to be the same in both environments? Why?
Who views the data? Is the same consumer looking at both sources?
Why wouldn't you just use an ORM to target only one db? Why do you need both Oracle and Postgres?
I have seen a similar setup: an application on Postgres, mostly for reporting and other secondary tasks, while the main app was on Oracle.
Some of the main app tables are cached in Postgres for convenience. But this setup brings the sync problem with it.
The compromise solution was a mix of incremental sequence-based sync during the daytime and a full table copy overnight.
Regarding other solutions proposed here:
Postgres FDW is slow for complex queries, and it puts extra load on the foreign db, especially when the WHERE clause refers to both local and foreign tables.
The same query will run much faster if the foreign table is cached in Postgres.
Incremental/differential sync using sequence numbers: tried this and it works acceptably for small tables, but the nightmare starts with child relations; maybe an ORM can help here.
The ideal solution in my opinion would probably be to stream Oracle changes to Postgres, or to an intermediary process that replicates the changes to Postgres.
I have no clue how to do this; as I understand it, it requires Oracle GoldenGate (plus a licence).
We have a system that stores tens of thousands of databases, each holding the data of a customer subscribing to our services. Each of those databases has its own user that is known and used by the customer through the distributed, thick application front-ends we provide.
Now I would like to add a trigger to one of the tables in each one of those dbs, that should update one common db with some of the data from the inserted rows. Sort of a many-to-one db scenario if you will...
This new common db can be set up pretty much however we like as far as users/permissions and so on, and the trigger we insert in the old dbs can be written pretty freely. But we cannot change the users/permissions of those customer dbs.
Currently I'm experimenting with using the guest user, which has been given write access, on the common db (named "foo" in the example below), but it is not working (I think because the guest user of the common db is not allowed to access the customer db table, named Bf in the example below, that the trigger fires on?). It may also be that I'm using EXECUTE AS where I should be using EXECUTE AS LOGIN =? I'm having a hard time finding a comprehensible place that describes the difference; a small sketch contrasting the two forms is included after the question below.
This is the trigger we would like to get working, and inserted in every customer db:
ALTER TRIGGER [dbo].[trgAfterInsert] ON [dbo].[Bf]
WITH EXECUTE AS 'guest'
FOR INSERT
AS
BEGIN
    INSERT INTO [foo].[dbo].[baar] (publicKey, fileId)
    SELECT b.publicKey, a.autoId
    FROM [client_db_1].[dbo].[Integrationer] AS b, inserted AS a
    WHERE a.autoId IN (SELECT TOP 1 autoId FROM inserted ORDER BY autoId)
END
As you may guess, I'm not an experienced user of triggers or SQL permissions/access work. But the info we want to collect is harmless, it should take close to zero time to execute, and it runs in a secured, non-exposed environment, so I'm very willing to read/learn if anyone has advice.
/Henry
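For what it's worth, a minimal sketch of the two forms mentioned in the question; all names are illustrative, and whether either one solves the cross-database permission problem depends on how the dbs and logins are set up:

-- Module clause: the trigger body runs as a DATABASE user
-- (a user in the database the trigger lives in)
CREATE TRIGGER dbo.trgExample ON dbo.SomeTable
WITH EXECUTE AS 'some_db_user'
FOR INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- trigger body here
END;
GO

-- Stand-alone statement: the current session impersonates a SERVER login
EXECUTE AS LOGIN = 'some_login';
-- statements here run with that login's permissions
REVERT;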
I'm copying 99 million rows from one SQL Server instance to another using the right-click "Tasks" > "Import Data" method. It's just a straight copy into a new, empty table on a new and empty NDF file. I'm using identity insert when doing the copy so that the IDs stay intact. It was going very slowly (30 million records after 12 hours), so my boss told me to cancel it, remove all indexes from the new empty table, then run again.
Will removing indexes on the new table really speed up the transfer of records, and why? I imagine I can create indexes after the table is filled.
What is the underlying process behind right-click "Import Data"? Is it using SqlBulkCopy, is it logging tons of stuff? I know it's not in a transaction because cancelling it stopped it immediately and the already inserted rows were there.
My file growth on the NDF file that holds the table is 20 MB. Will increasing this speed up the process when inserting 99 million records? It's just an idea I had.
Yes, it should. Each new row being inserted will cause each index to be updated with the new data. It's worth noting that if you remove the indexes, import, then re-add the indexes, those indexes will take a very long time to build anyway.
It essentially runs as a very simple SSIS package. It reads rows from the source and inserts in chunks as a transaction. If your recovery model is set to Full, you could switch it to Bulk Logged for the import. This should be done if you're bulk-moving data when other updates to the database won't be happening, though.
I would try to size the MDF/NDF close to what you'd expect the end result to be. The autogrowth can take time, especially if you have it set low.
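A minimal sketch of those two suggestions, assuming the database is named MyDb and the logical name of the NDF file is MyDb_Data2 (both illustrative):

-- Switch to bulk-logged recovery for the duration of the import
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

-- Pre-size the secondary data file close to the expected end size
-- and use a larger growth increment than 20 MB
ALTER DATABASE MyDb
MODIFY FILE (NAME = MyDb_Data2, SIZE = 50GB, FILEGROWTH = 1GB);

-- ... run the import, then rebuild the indexes ...

ALTER DATABASE MyDb SET RECOVERY FULL;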
I'm writing an application which must log information pretty frequently, say, twice per second. I wish to save the information to an SQLite database, but I don't mind committing changes to disk only once every ten minutes.
Executing my queries against a file database takes too long and makes the computer lag.
One possible solution is to use an in-memory database (it will fit, no worries) and synchronize it to disk from time to time.
Is this possible? Is there a better way to achieve it (can you tell SQLite to commit to disk only after X queries)?
Can I solve this with Qt's SQL wrapper?
Let's assume you have an on-disk database called 'disk_logs' with a table called 'events'. You could attach an in-memory database to your existing database:
ATTACH DATABASE ':memory:' AS mem_logs;
Create a table in that database (which would be entirely in-memory) to receive the incoming log events:
CREATE TABLE mem_logs.events(a, b, c);
Then transfer the data from the in-memory table to the on-disk table during application downtime:
INSERT INTO disk_logs.events SELECT * FROM mem_logs.events;
And then delete the contents of the existing in-memory table. Repeat.
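The last step might look like this, assuming the same table names as above:

DELETE FROM mem_logs.events;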
This is pretty complicated though... If your records span multiple tables and are linked together with foreign keys, it might be a pain to keep these in sync as you copy from the in-memory tables to the on-disk tables.
Before attempting something (uncomfortably over-engineered) like this, I'd also suggest trying to make SQLite go as fast as possible. SQLite should easily be able to handle more than 50K record inserts per second. A few log entries twice a second should not cause significant slowdown.
If you're executing each insert within its own transaction, that could be a significant contributor to the slow-downs you're seeing. Perhaps you could (see the sketch after this list):
Count the number of records inserted so far
Begin a transaction
Insert your record
Increment count
Commit/end transaction when N records have been inserted
Repeat
The downside is that if the system crashes during that period you risk losing the un-committed records (but if you were willing to use an in-memory database, then it sounds like you're OK with that risk).
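A minimal sketch of that batching idea in plain SQLite SQL, reusing the events table from above (in Qt you would issue these statements through its SQL wrapper):

-- Optional pragma that trades some durability for speed (set once per connection)
PRAGMA synchronous = NORMAL;

-- Group many inserts into one transaction instead of one transaction per insert
BEGIN TRANSACTION;
INSERT INTO events(a, b, c) VALUES (1, 'x', 'y');
INSERT INTO events(a, b, c) VALUES (2, 'x', 'y');
-- ... keep inserting until N records have accumulated ...
COMMIT;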
A brief search of the SQLite documentation turned up nothing useful (it wasn't likely and I didn't expect it).
Why not use a background thread that wakes up every 10 minutes and copies all of the log rows from the in-memory database to the external database (and deletes them from the in-memory database)? When your program is ready to end, wake up the background thread one last time to save the last logs, then close all of the connections.
I have a requirement to take a "snapshot" of a current database and clone it into the same database, with new Primary Keys.
The schema in question consists of about 10 tables, but a few of the tables will potentially contain hundreds of thousands to 1 million records that need to be duplicated.
What are my options here?
I'm afraid that writing a sproc will require locking the database rows in question (for concurrency) for the entire duration of the operation, which is quite annoying to other users. How long would such an operation take, assuming that we can optimize it to the full extent SQL Server allows? Is it going to be 30 seconds to 1 minute to perform this many inserts? I'm not able to lock the whole table(s) and do a bulk insert, because there are other users under other accounts that are using the same tables independently.
Depending on performance expectations, an alternative would be to dump the current db into an xml file and then asynchronously clone the db from this xml file at leisure in the background. The obvious advantage of this is that the db is only locked for the time it takes to do the xml dump, and the inserts can run in the background.
If a good DBA can get the "clone" operation to execute start to finish in under 10 seconds, then it's probably not worth the complexity of the xmldump/webservice solution. But if it's a lost cause, and inserting potentially millions of rows is likely to balloon out in time, then I'd rather start out with the xml approach right away.
Or maybe there's an entirely better approach altogether?
Thanks a lot for any insights you can provide.
I would suggest backing up the database and then restoring it as a new db on your server. You can use that new DB as your source.
I would definitely recommend against the XML dump idea.
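A minimal sketch of that approach; the database name, logical file names, and paths are all illustrative:

BACKUP DATABASE SourceDb
TO DISK = 'C:\Backups\SourceDb.bak'
WITH COPY_ONLY, INIT;

RESTORE DATABASE SourceDb_Snapshot
FROM DISK = 'C:\Backups\SourceDb.bak'
WITH MOVE 'SourceDb'     TO 'C:\Data\SourceDb_Snapshot.mdf',
     MOVE 'SourceDb_log' TO 'C:\Data\SourceDb_Snapshot_log.ldf';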
Does it need to be in the exact same tables? You could make a set of "snapshots" tables where all these records go; you would only need a single INSERT + SELECT, like:
insert into snapshots_source1 (user,col1, col2, ..., colN)
select 'john', col1, col2, ..., colN from source1
and so on.
You can give the snapshots_* tables an IDENTITY column that will create the 'new PK', and they can also preserve the old one if you so wish.
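A minimal sketch of such a snapshot table; the column names and types are illustrative:

CREATE TABLE snapshots_source1 (
    snapshotId INT IDENTITY(1,1) PRIMARY KEY,  -- the 'new PK'
    originalId INT,                            -- preserves the old PK if wanted
    [user]     VARCHAR(100),
    col1       INT,
    col2       VARCHAR(100)
    -- ..., colN
);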
This has (almost) no locking issues and looks a lot saner.
It does require a change in the code, but it shouldn't be too hard to make the app point to the snapshots tables when appropriate.
This also eases cleaning and maintenance issues.
---8<------8<------8<---outdated answer---8<---8<------8<------8<------8<---
Why don't you just take a live backup and do the data manipulation (key changing) on the destination clone?
Now, in general, this snapshot-with-new-primary-keys idea sounds suspect. If you want a replica, you have log shipping and cluster service; if you want a copy of the data to generate a 'new app instance', a backup/restore/manipulate process should be enough.
You don't say how much your DB will occupy, but you can certainly back up 20 million rows (800MB?) in about 10 seconds, depending on how fast your disk subsystem is...