I need to synchronize a large XML file (containing 6 million records sorted by ID) against a SAP MaxDB database table.
This is my present strategy:
read XML object from file and convert to Java bean
load object with the same primary key from database (see below)
if object does not exist, insert object
if object exists and is changed, update object
at the end of the process, scan the database table for (deleted) objects that are not contained in the file
For efficiency reasons, the "load database object" method keeps a cache of the next 1000 records and issues statements to load the next bunch of objects. It works like this:
List<?> objects = (List<?>)transactionTemplate.execute(new TransactionCallback() {
public Object doInTransaction(TransactionStatus status) {
ht.setMaxResults(BUNCH_SIZE);
ht.setFetchSize(BUNCH_SIZE);
List<?> objects = ht.find("FROM " + simpleName + " WHERE " +
primaryKeyName + " >= ? ORDER BY " + primaryKeyName, primaryValue);
return objects;
}
});
Unfortunately, for some constant values of BUNCH_SIZE (10,000) I get a SQL exception "result table space exhausted".
How can I better optimize the process?
How can I avoid the SQL exception / the bunch size problem?
The code that saves the changed objects is the following:
if (prevObject == null) {
ht.execute(new HibernateCallback(){
public Object doInHibernate(Session session)
throws HibernateException, SQLException {
session.save(saveObject);
session.flush();
session.evict(saveObject);
return null;
}
});
newObjects++;
} else {
List<String> changes = ObjectUtil.getChangedProperties(prevObject, currentObject);
if (hasImportantChanges(changes)) {
changedObjects++;
ht.merge(currentObject);
} else
unchangedObjects++;
}
}
While this code works in principle, it produces masses of database log entries (we are talking about more than 50 GB of log backups) if there are a lot of new or changed objects in the source file.
Can I improve this code by using a lower transaction isolation level?
Can I reduce the amount of database log data written?
Maybe there is a problem with the database configuration?
I am very grateful for any help. Thanks a lot, Matthias
Related
To give a simplified example:
I have a database with one table: names, which has 1 million records each containing a common boy or girl's name, and more added every day.
I have an application server that takes as input an http request from parents using my website 'Name Chooser' . With each request, I need to pick up a name from the db and return it, and then NOT give that name to another parent. The server is concurrent so can handle a high volume of requests, and yet have to respect "unique name per request" and still be high available.
What are the major components and strategies for an architecture of this use case?
From what I understand, you have two operations: Adding a name and Choosing a name.
I have couple of questions:
Qustion 1: Do parents choose names only or do they also add names?
Question 2 If they add names, doest that mean that when a name is added it should also be marked as already chosen?
Assuming that you don't want to make all name selection requests to wait for one another (by locking of queueing them):
One solution to resolve concurrency in case of choosing a name only is to use Optimistic offline lock.
The most common implementation to this is to add a version field to your table and increment this version when you mark a name as chosen. You will need DB support for this, but most databases offer a mechanism for this. MongoDB adds a version field to the documents by default. For a RDBMS (like SQL) you have to add this field yourself.
You havent specified what technology you are using, so I will give an example using pseudo code for an SQL DB. For MongoDB you can check how the DB makes these checks for you.
NameRecord {
id,
name,
parentID,
version,
isChosen,
function chooseForParent(parentID) {
if(this.isChosen){
throw Error/Exception;
}
this.parentID = parentID
this.isChosen = true;
this.version++;
}
}
NameRecordRepository {
function getByName(name) { ... }
function save(record) {
var oldVersion = record.version - 1;
var query = "UPDATE records SET .....
WHERE id = {record.id} AND version = {oldVersion}";
var rowsCount = db.execute(query);
if(rowsCount == 0) {
throw ConcurrencyViolation
}
}
}
// somewhere else in an object or module or whatever...
function chooseName(parentID, name) {
var record = NameRecordRepository.getByName(name);
record.chooseForParent(parentID);
NameRecordRepository.save(record);
}
Before whis object is saved to the DB a version comparison must be performed. SQL provides a way to execute a query based on some condition and return the row count of affected rows. In our case we check if the version in the Database is the same as the old one before update. If it's not, that means that someone else has updated the record.
In this simple case you can even remove the version field and use the isChosen flag in your SQL query like this:
var query = "UPDATE records SET .....
WHERE id = {record.id} AND isChosend = false";
When adding a new name to the database you will need a Unique constrant that will solve concurrenty issues.
I am using GAE for my server where I have all my entities in Datastore. One of the entity has more than 2000 records, and it is taking almost 30 secs to read whole entity. So I wanted to use cache to improve performance.
I have tried Datastore objectify #cache annotation, but not finding
how to read from the stored cache. I have declared entity as below:
#Entity
#Cache
public class Devices{
}
Second thing I tried is memcache. I am storing whole List s
in key, but this is not storing, I couldn't see in console memcache,
but at the same time not showing any errors or exceptions while
storing objects.
putvalue("temp", List<Devices>)
public void putValue(String key, Object value) {
Cache cache = getCache();
logger.info(TAG + "getCache() :: storing memcache for key : " + key);
try {
if (cache != null) {
cache.put(key, value);
}
}catch (Exception e) {
logger.info(TAG + "getCache() :: exception : " + e);
}
}
When I tried to retrieve using getValue("temp"), it is returning
null or empty.
Object object = cache.get(key);
My main object is to limit the time to 5secs to get all the records of entity.
Can anyone suggest what I am doing wrong here? Or any better solution to retrieve the records fast from Datastore.
Datastore Objectify actually uses the App Engine Memcache service to cache your entity data globally when you use the #Cache annotation. However, as explained in the doc here, only get-by-key, save(), and delete() interact with the cache. Query operations are not cached.
Regarding the App Engine Memcache method, you may be hitting the limit for the maximum size of a cached data value which is 1 MiB, although I believe this raise an exception indeed.
Regarding the query itself, you may be better off using a keys_only query and then doing a key.get() on each returned key. That way, Memcache will be used for each record.
I have an endpoint method that first uses a query to see if an entity with certain params exists, and if it does not it will create it. If it exists, I want to increase a counter in the variable:
Report report = ofy().load().type(Report.class)
.filter("userID", userID)
.filter("state", state).first().now();
if (report == null) {
//write new entity to datastore with userID and state
} else {
//increase counter in entity +1
report.setCount(report.count+1)
//save entity to datastore
}
My question is, what if someone clicks a button to execute the above endpoint with the same params very rapidly, what will happen? Will two identical Report entities get written to the datastore? I only want to make sure one is written.
By itself this code is not safe and has a race condition that will allow multiple Reports to be created.
To make this safe, you need to run the code in a transaction. Which means you must have an ancestor query (or convert it to a simple primary key lookup). One option is to give Report a #Parent of the User. Then you can so something like this:
ofy().transact(() -> {
Report report = ofy().load().type(Report.class)
.ancestor(user)
.filter("state", state).first().now();
if (report == null) {
//write new entity to datastore with userID and state
} else {
//increase counter in entity +1
report.setCount(report.count+1)
//save entity to datastore
}
});
I insert alot of data into a table with a autogenerated key using the batchUpdate functionality of JDBC. Because JDBC doesn't say anything about batchUpdate and getAutogeneratedKeys I need some database independant workaround.
My ideas:
Somehow pull the next handed out sequences from the database before inserting and then using the keys manually. But JDBC hasn't got a getTheNextFutureKeys(howMany). So how can this be done? Is pulling keys e.g. in Oracle also transaction save? So only one transaction can ever pull the same set of future keys.
Add an extra column with a fake id that is only valid during the transaction.
Use all the other columns as secondary key to fetch the generated key. This isn't really 3NF conform...
Are there better ideas or how can I use idea 1 in a generalized way?
Partial answer
Is pulling keys e.g. in Oracle also transaction save?
Yes, getting values from a sequence is transaction safe, by which I mean even if you roll back your transaction, a sequence value returned by the DB won't be returned again under any circumstances.
So you can prefetch the id-s from a sequence and use them in the batch insert.
Never run into this, so I dived into it a little.
First of all, there is a way to retrieve the generated ids from a JDBC statement:
String sql = "INSERT INTO AUTHORS (LAST, FIRST, HOME) VALUES " +
"'PARKER', 'DOROTHY', 'USA', keyColumn";
int rows = stmt.executeUpdate(sql,
Statement.RETURN_GENERATED_KEYS);
ResultSet rs = stmt.getGeneratedKeys();
if (rs.next()) {
ResultSetMetaData rsmd = rs.getMetaData();
int colCount = rsmd.getColumnCount();
do {
for (int i = 1; i <= colCount; i++) {
String key = rs.getString(i);
System.out.println("key " + i + "is " + key);
}
}
while (rs.next();)
}
else {
System.out.println("There are no generated keys.");
}
see this http://download.oracle.com/javase/1.4.2/docs/guide/jdbc/getstart/statement.html#1000569
Also, theoretically it could be combined with the JDBC batchUpdate
Although, this combination seems to be rather non-trivial, on this pls refer to this thread.
I sugest to try this, and if you do not succeed, fall back to pre-fetching from sequence.
getAutogeneratedKeys() will also work with a batch update as far as I remember.
It returns a ResultSet with all newly created ids - not just a single value.
But that requires that the ID is populated through a trigger during the INSERT operation.
In the code below, pathToNonDatabase is the path to a simple text file, not a real sqlite database. I was hoping for sqlite3_open to detect that, but it doesn't (db is not NULL, and result is SQLITE_OK). So, how to detect that a file is not a valid sqlite database?
sqlite3 *db = NULL;
int result = sqlite3_open(pathToNonDatabase, &db);
if((NULL==db) || (result!=SQLITE_OK)) {
// invalid database
}
sqlite opens databases lazily. Just do something immediately after opening that requires it to be a database.
The best is probably pragma schema_version;.
This will report 0 if the database hasn't been created (for instance, an empty file). In this case, it's safe work with (and run CREATE TABLE, etc)
If the database has been created, it will return how many revisions the schema has gone through. This value might not be interesting, but that it's not zero is.
If the file exists and isn't a database (or empty), you'll get an error.
If you want a somewhat more thorough check, you can use pragma quick_check;. This is a lighter-weight integrity check, which skips checking that the contents of the tables line up with the indexes. It can still be very slow.
Avoid integrity_check. It not only checks every page, but then verifies the contents of the tables against the indexes. This is positively glacial on a large database.
For anyone needing to do this in C# with System.Data.SQLite you can start a transaction, and then immediately roll it back as follows:-
private bool DatabaseIsValid(string filename)
{
using (SQLiteConnection db = new SQLiteConnection(#"Data Source=" + filename + ";FailIfMissing=True;"))
{
try
{
db.Open();
using (var transaction = db.BeginTransaction())
{
transaction.Rollback();
}
}
catch (Exception ex)
{
log.Debug(ex.Message, ex);
return false;
}
}
return true;
}
If the file is not a valid database the following SQLiteException is thrown - file is encrypted or is not a database (System.Data.SQLite.SQLiteErrorCode.NotADb). If you aren't using encrypted databases then this solution should be sufficient.
(Only the 'db.Open()' was required for version 1.0.81.0 of System.Data.SQLite but when I upgraded to version 1.0.91.0 I had to insert the inner using block to get it to work).
I think a pragma integrity_check; could do it.
If you want only to check if the file is a valid sqlite database then you can check with this function:
private bool CheckIfValidSQLiteDatabase(string databaseFilePath)
{
byte[] bytes = new byte[16];
using (FileStream fileStream = new FileStream(databaseFilePath, FileMode.Open, FileAccess.Read))
{
fileStream.Read(bytes, 0, 16);
}
string gg = System.Text.ASCIIEncoding.ASCII.GetString(bytes);
return gg.Contains("SQLite format");
}
as stated in the documentation:
sqlite database header