Huge tables with simultaneous read and write - database

I have a couple of million rows in a PostgreSQL table. I have up to 20 processes writing to that table (a few hundred inserts/updates per second) and a few processes reading from it at the same time (once in a while a big select). This results in many failures (Stream Closed, Input/Output Error) on both sides, reading and writing.
I am now thinking about splitting that table into multiple tables. I would split by the "type" of the object, which is basically a field that has only 20 possible values that are roughly equally distributed.
The question is: should I use multiple tables, multiple schemas or multiple databases to guarantee non-blocking access to the data? Or maybe I should use a completely different setup. Another database, maybe? Maybe HTable?
The integrity of the data is not that important. It has to be there in the end, but I do not really need a strict isolation level or transactions. I just need a fast system that can write and read from multiple processes without a performance hit and that allows queries based on field values.
Right now I use JDBC with Isolation Level TRANSACTION_READ_UNCOMMITTED and a connection per process.
UPDATE:
The schema looks as follows:
CREATE TABLE rev
(
  id integer NOT NULL,
  source text,
  date timestamp with time zone,
  title text,
  summary text,
  md5sum text,
  author text,
  content text,
  CONSTRAINT rev_id_pk PRIMARY KEY (id),
  CONSTRAINT rev_md5sum_un UNIQUE (md5sum)
);
CREATE TABLE resp
(
  id integer NOT NULL,
  parent_id integer,  -- references rev.id; used by the sample query below
  source text,
  date timestamp with time zone,
  title text,
  summary text,
  md5sum text,
  author text,
  content text,
  CONSTRAINT resp_id_pk PRIMARY KEY (id),
  CONSTRAINT resp_md5sum_un UNIQUE (md5sum)
);
And I have a few indexes on some of the fields.
A sample query looks like:
SELECT * FROM rev LEFT JOIN resp ON rev.id = resp.parent_id WHERE rev.date > ? LIMIT 1000 OFFSET ?
The resp table is much smaller, but it too gets updates and is queried in the joins.

This results in many failures on both sides, reading and writing.
What kind of failures? Reading and writing to the same table should not be a problem at all in PostgreSQL; MVCC handles that fine.
It's hard to say how to fix your problems without any information about the system and what the processes are doing. Could you tell us more about it, and show a database schema?
Right now I use JDBC with Isolation Level TRANSACTION_READ_UNCOMMITTED
READ UNCOMMITTED doesn't exist in PostgreSQL; it's treated as Read Committed:
In PostgreSQL, you can request any of the four standard transaction isolation levels. But internally, there are only two distinct isolation levels, which correspond to the levels Read Committed and Serializable. When you select the level Read Uncommitted you really get Read Committed, and when you select Repeatable Read you really get Serializable, so the actual isolation level might be stricter than what you select.
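For example, the request is accepted without complaint but runs with Read Committed semantics; a minimal sketch against the rev table from the question:
BEGIN;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;  -- accepted, silently behaves as Read Committed
SELECT count(*) FROM rev WHERE date > now() - interval '1 hour';
COMMIT;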

I'm not sure how much of a delay in getting readable data is acceptable for you, but you might want to look into Slony replication. Essentially, you have a master database and a slave database. All of your inserts/writes go into the master database, and Slony then replicates those new entries into the slave database (this takes a little bit of time, but nothing monumental; a few minutes, perhaps). Then you can have your applications read exclusively from the slave database.
If Slony doesn't seem right for you, you can look at some "multi-master" alternatives here. These will let you have multiple writeable machines and have all their contents replicated onto the read machine.
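On the original idea of splitting the table by "type": newer PostgreSQL versions support declarative list partitioning, which keeps a single logical table while physically separating the rows. A minimal sketch, assuming a hypothetical type column holding the ~20 values mentioned in the question:
CREATE TABLE rev_by_type
(
  id integer NOT NULL,
  type text NOT NULL,              -- hypothetical discriminator with ~20 distinct values
  date timestamp with time zone,
  content text,
  PRIMARY KEY (id, type)           -- the partition key has to be part of the primary key
) PARTITION BY LIST (type);

-- one partition per type; INSERTs are routed automatically
CREATE TABLE rev_type_a PARTITION OF rev_by_type FOR VALUES IN ('type_a');
CREATE TABLE rev_type_b PARTITION OF rev_by_type FOR VALUES IN ('type_b');
-- ... and so on for the remaining types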

Related

How does SQL Server lock rows?

I have a question about how to optimize a database. We have a SQL Server database in which the main data is stored vertically, so a record has these columns:
ID version fieldindex fieldvalue
and (ID, version, fieldindex) is the primary key.
So if you want to load a logical record set you have to load all rows for an ID + version. One "record" can consist of around 60 rows. I'm afraid this causes performance problems in the database.
There are around 10 users working in parallel in the application and we are getting deadlocks very often. We even get deadlocks when inserting new rows, even though a row that isn't persisted yet normally can't be locked.
So my question is: how does SQL Server lock records? Is it possible for a record to be locked even if I am not selecting that particular record?
I would be glad if someone could explain how the database works here.
You've got EAV, which is generally considered bad.
To make EAV work acceptably, you'll need the right indexing structure and possibly some care with lock hints and transactions.
Generally you'll want your clustered index to be (EntityID, AttributeId), so all the attributes for an entity are co-located. But to avoid deadlocks you may need to X-lock the main Entity row when modifying its attributes, because SQL Server will otherwise use row locking on the AttributeValue table, which can lead to deadlocks and logical inconsistencies. You can X-lock it by modifying the row as the first operation in your transaction, or by reading it with an XLOCK hint.
Depending on the role of "version" in your system, it will go somewhere in the clustered index too: if attributes are individually versioned, at the end; if individual entities are versioned, in the middle; and if a version contains multiple entities, as the leading column.
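A minimal T-SQL sketch of that pattern, with made-up table and column names (here the version is assumed to be per-attribute, so it sits at the end of the key):
-- clustered index co-locates all attributes of an entity
CREATE TABLE AttributeValue
(
    EntityID    int NOT NULL,
    AttributeID int NOT NULL,
    Version     int NOT NULL,
    FieldValue  nvarchar(400) NULL,
    CONSTRAINT PK_AttributeValue
        PRIMARY KEY CLUSTERED (EntityID, AttributeID, Version)
);

DECLARE @EntityID int = 42;

BEGIN TRANSACTION;
    -- X-lock the parent Entity row first so writers to the same entity serialize here
    -- (assumes a parent Entity(EntityID) header table)
    SELECT EntityID FROM Entity WITH (XLOCK, ROWLOCK) WHERE EntityID = @EntityID;

    UPDATE AttributeValue
    SET    FieldValue = N'new value'
    WHERE  EntityID = @EntityID AND AttributeID = 7 AND Version = 1;
COMMIT;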

Read Committed Snapshot isolation with LOBs

I have a table in a SQL Server 2017 database that is used by a lot of long-running transactions originating from multiple threads. This causes deadlocks several times a day, so I am considering implementing read committed snapshot isolation. The catch is that this table has 3 VARBINARY(MAX) columns, each holding between 10 and 1000 MB of data (with a mean of around 20 MB), besides several int and bit columns.
Now the questions:
Q1: Will SQL Server copy the entire row (including the VARBINARY(MAX) columns) into tempdb?
Q2: If so, would the performance benefit from moving the VARBINARY(MAX) columns into a separate table with a 1:1 relationship to the original table?
SQL Server has to present you with a consistent view of your data (e.g. T2 sees your row, including the LOBs, as it was before T1's mutating transaction started). Which means: yes, it has no choice but to copy the LOBs along with the rest of the row data. Which makes me think that yes, performance may well benefit from having a separate table for the LOBs.
As usual, I would recommend running a simple experiment that measures performance with both configurations. Please post your results here.
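For reference, the switch itself and the 1:1 split from Q2 would look roughly like this (database, table and column names are placeholders):
-- turn on read committed snapshot for the whole database
ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;

-- optional 1:1 side table so the wide LOBs live apart from the frequently updated columns
CREATE TABLE dbo.RowPayload
(
    RowId int NOT NULL PRIMARY KEY
        REFERENCES dbo.MainRow (RowId),   -- placeholder parent table
    Blob1 varbinary(max) NULL,
    Blob2 varbinary(max) NULL,
    Blob3 varbinary(max) NULL
);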

Avoiding Locking Contention on DB2 zOS

I want to place DB2 triggers for INSERT, UPDATE and DELETE on DB2 tables that are heavily used in parallel online transactions. The tables are shared by several members of a Sysplex, DB2 version 10.
In each of the DB2 triggers I want to insert a row into a central table, and have one background process call a stored procedure every second that reads this table and processes the newly inserted rows, ordered by insert sequence (sequence number or timestamp).
I'm very concerned about DB2 index locking contention and want to make sure that these triggers do not introduce deadlocks/timeouts into the applications.
Obviously I would take advantage of DB2 features that reduce locking, such as row-level locking, but I still see no really good approach to avoiding index contention.
I see three different options for selecting the newly inserted rows.
Option 1: Put a sequence number in the table and store the last processed sequence number in the background process. I would use the following SELECT statement:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE SEQ_NO > 'last-seq-number'
ORDER BY SEQ_NO;
The locking level must be CS to avoid selecting uncommitted rows that will later be rolled back.
I think I need an index on the table with SEQ_NO ASC.
Pro: the background process only reads rows and makes no updates/deletes (only shared locks).
Con: index contention because of the ascending key.
I can clean up processed records later (e.g. by rolling partitions).
Option 2: Put a status field in the table (processed/unprocessed) and change the SELECT as follows:
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
WHERE STATUS = 'unprocessed'
ORDER BY TIMESTAMP;
Later I would update the STATUS of the selected rows to 'processed'.
I think I need an index on STATUS.
Pro: no ascending sequence number in the index and no direct deletes.
Con: concurrent updates by the online transactions and the background process.
Clean-up would happen during off-hours.
Option 3: DELETE the processed records instead of updating a status field.
SELECT COLUMN_1, .... Column_n
FROM CENTRAL_TABLE
ORDER BY TIMESTAMP;
Since the table contains very few records, no index is required that could create a hot spot.
Also, I think I could SELECT with isolation level UR, because I would detect any uncommitted data on the later delete of that row.
For the primary key index I could use GENERATE_UNIQUE, which is random and not ascending.
Pro: no index hot spot, and the inserts can be spread across the tablespace by the random UNIQUE_ID.
Con: a tablespace scan and sort on every call of the stored procedure, and records are deleted in parallel with the online inserts.
I'm looking forward to what the community thinks about this problem. This must be a pretty common problem; SAP, for example, should have a similar issue with their batch input tables.
I tend to favour Option 3, because it avoids index contention.
Maybe there is yet another solution out there.
I think you are going to have numerous performance problems with your various solutions.
(I know premature optimization is a sin, but experience tells us that some things are just not going to work in a busy system.)
You should be able to use DB2's identity/auto-increment feature to get your sequence number, with little or no performance impact.
For the rest, perhaps you should look at a queue-based solution.
Have your trigger drop the operation (INSERT/UPDATE/DELETE) and the keys of the row into an MQ queue.
Then have a long-running background task (in CICS?) do your post-processing. As it processes one update at a time, you should not trip over yourself. Having a single loaded and active task with the ability to batch up units of work should give you a throughput on the order of 300 to 500 updates a second.
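A minimal sketch of the identity-column suggestion (table and column names are made up, and DB2 for z/OS syntax details may differ slightly):
CREATE TABLE CENTRAL_TABLE
(
    SEQ_NO     BIGINT NOT NULL
               GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, CACHE 50),
    OPERATION  CHAR(1)      NOT NULL,                        -- 'I', 'U' or 'D'
    ROW_KEY    VARCHAR(100) NOT NULL,                        -- key of the changed row
    CREATED_TS TIMESTAMP    NOT NULL DEFAULT CURRENT TIMESTAMP,
    PRIMARY KEY (SEQ_NO)
);

-- the background task remembers the last sequence number it processed
SELECT SEQ_NO, OPERATION, ROW_KEY
FROM   CENTRAL_TABLE
WHERE  SEQ_NO > ?                -- last processed sequence number
ORDER  BY SEQ_NO
WITH   CS;                       -- CS so uncommitted inserts are not picked up (see Option 1)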

Empty the contents of a database

I have a huge database with complicated relations. How can I delete the contents of all the tables without violating foreign key constraints? Is there a way to do that?
Note that I am writing a SQL script file to empty the tables, as in the following example:
delete from A
delete from B
delete from C
delete from D
delete from E
but I don't know which table I should start with.
In SQL Server, there is no native way to do what you're asking. You do have a few options depending on your particular environment limitations:
1. Figure out the relationships between the tables and start deleting rows in the appropriate order, from child tables to parent tables. This may be time-consuming for a large number of objects, but it is the "safest" in terms of least destruction.
2. Disable the foreign key constraints and TRUNCATE TABLE. This will be a bit faster if you're dealing with lots of data, but you still have to know where all your relationships are (and note that TRUNCATE TABLE fails on a table that is referenced by a foreign key constraint, so referencing constraints may need to be dropped and recreated rather than merely disabled). Not too terrible if you're working with fewer tables, though option 1 becomes just as viable.
3. Script out the database objects and DROP DATABASE/CREATE DATABASE. If you don't care about a raw teardown of the database, this is another option; however, you'll still need to be aware of object precedence for creation. SQL Server, as well as third-party tools, offers ways to script object DROP/CREATE. If you decide to go this route, the upside is that you have a scripted backup of all the objects (which I like to keep "just in case") and future tear-downs are nearly instantaneous as long as you keep your scripts synchronized with any changes.
As you can see, it's not a terribly simple process because you're trying to subvert the very reason for the existence of the constraints.
Steps can be:
disable all the constraints in all the tables
delete all the records from all the tables
re-enable the constraints.
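In SQL Server, those three steps can be scripted with the undocumented but widely used sp_MSforeachtable procedure, which runs a command against every user table; a minimal sketch:
EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';            -- 1. disable constraints
EXEC sp_MSforeachtable 'DELETE FROM ?';                                   -- 2. delete all rows
EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';   -- 3. re-enable and re-validate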
Also see this discussion: SQL: delete all the data from all available tables
TRUNCATE TABLE tableName
Removes all rows from a table without logging the individual row deletions. TRUNCATE TABLE is similar to the DELETE statement with no WHERE clause; however, TRUNCATE TABLE is faster and uses fewer system and transaction log resources.
TRUNCATE TABLE (Transact-SQL)
Dude, taking your question at face value... that you want to COMPLETELY recreate the schema with NO data... forget the individual queries (too slow)... just drop the database and then recreate it (destroydb and createdb, or whatever your RDBMS's equivalent is)... and you might want to hire a competent DBA.
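If you do go the teardown route, the whole thing boils down to something like this (database name is a placeholder; exact syntax and required privileges vary by RDBMS):
DROP DATABASE app_db;     -- throws away every table, row and constraint in one go
CREATE DATABASE app_db;   -- then re-run your schema creation scripts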

Synchronizing sqlite database from memory to file

I'm writing an application which must log information pretty frequently, say, twice a second. I wish to save the information to an SQLite database, but I don't mind committing changes to the disk only once every ten minutes.
Executing my queries against a file database takes too long and makes the computer lag.
A possible solution is to use an in-memory database (it will fit, no worries) and synchronize it to disk from time to time.
Is this possible? Is there a better way to achieve it (can you tell SQLite to commit to disk only after X queries)?
Can I solve this with Qt's SQL wrapper?
Let's assume you have an on-disk database called 'disk_logs' with a table called 'events'. You could attach an in-memory database to your existing database:
ATTACH DATABASE ':memory:' AS mem_logs;
Create a table in that database (which would be entirely in-memory) to receive the incoming log events:
CREATE TABLE mem_logs.events(a, b, c);
Then transfer the data from the in-memory table to the on-disk table during application downtime:
INSERT INTO main.events SELECT * FROM mem_logs.events;  -- 'main' is the on-disk disk_logs database
And then delete the contents of the existing in-memory table. Repeat.
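The transfer-and-clear step can be wrapped in one transaction so nothing is lost between the copy and the delete (using the names from the statements above):
BEGIN;
INSERT INTO main.events SELECT * FROM mem_logs.events;   -- copy the buffered rows to disk
DELETE FROM mem_logs.events;                              -- clear the in-memory buffer
COMMIT;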
This is pretty complicated, though... If your records span multiple tables and are linked together with foreign keys, it might be a pain to keep these in sync as you copy from the in-memory tables to the on-disk tables.
Before attempting something (uncomfortably over-engineered) like this, I'd also suggest trying to make SQLite go as fast as possible first. SQLite should easily be able to handle more than 50K record inserts per second. A few log entries twice a second should not cause a significant slowdown.
If you're executing each insert within its own transaction, that could be a significant contributor to the slow-downs you're seeing. Perhaps you could:
Count the number of records inserted so far
Begin a transaction
Insert your record
Increment count
Commit/end transaction when N records have been inserted
Repeat
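A plain-SQL sketch of that batching idea (the values are made up, and the counting itself happens in application code):
BEGIN TRANSACTION;
INSERT INTO events(a, b, c) VALUES (1, 'sensor', 'started');
INSERT INTO events(a, b, c) VALUES (2, 'sensor', 'reading 42');
-- ... keep appending rows until N have accumulated ...
COMMIT;   -- one disk sync for the whole batch instead of one per insert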
The downside is that if the system crashes during that period you risk losing the uncommitted records (but if you were willing to use an in-memory database, then it sounds like you're OK with that risk).
A brief search of the SQLite documentation turned up nothing useful (it wasn't likely and I didn't expect it).
Why not use a background thread that wakes up every 10 minutes, copies all of the log rows from the in-memory database to the on-disk database, and deletes them from the in-memory database? When your program is ready to exit, wake the background thread one last time to save the last logs, then close all of the connections.
