Caching objects to improve performance - database

I'm storing a set of messages in a SQL table. Each message has a size and there's a column in the table which contains the size of the message. These messages are connected to accounts. When a new message arrives, I need to check that the current account size + the new message size is less than the quota for the account (which is just a "maxaccountsize" column in a row in the accounts table). If not, I need to report back to the sender that the message does not fit in the account.
To simplify:
Table messages:
ID int
AccountID int
Size int
Table accounts:
ID int
MaxSize int
To calculate the total size of each account, I execute statements similar to SELECT SUM(Size) from messages WHERE AccountID = 12345.
In large user databases where accounts hold hundreds of thousands of messages, this operation is heavy and becomes a big bottleneck when receiving a message. My software uses Microsoft SQL Server, MySQL and PostgreSQL as backends.
To solve this, I've added some in-memory caching of the value. This is cumbersome to me since I need to implement thread-safe updates of the cache, and I need to make sure that the cache is always up to date. Also, it doesn't work if someone manually edits the database.
An alternative solution would be to store the current account size in the accounts table. However, this would mean I have somewhat redundant data (of course, one could say that this is already the case today with my in-memory cache). If I choose this solution, I need to make sure I always update the account size when creating or deleting messages. This is also a bit cumbersome, and I can bet there will be times when SUM(Size) does not equal the CurrentAccountSize value in the Accounts row. With the in-memory cache, at least it will be reset to its correct value when the server is restarted.
Does anyone have an opinion on what should be done in situations like these?

In this use case I would definitely store redundant data in your database.
Do you think your bank account calculates the sum of all transactions in its history when calculating your balance?
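For what it's worth, here is a minimal sketch of keeping a redundant size column in sync with triggers, assuming a new CurrentSize column on accounts and the messages table from the question (PostgreSQL 11+ syntax; SQL Server and MySQL would need their own trigger dialects, and updates to an existing message's Size are not handled here):

ALTER TABLE accounts ADD COLUMN CurrentSize int NOT NULL DEFAULT 0;

CREATE FUNCTION update_account_size() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE accounts SET CurrentSize = CurrentSize + NEW.Size WHERE ID = NEW.AccountID;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE accounts SET CurrentSize = CurrentSize - OLD.Size WHERE ID = OLD.AccountID;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER messages_size_sync
AFTER INSERT OR DELETE ON messages
FOR EACH ROW EXECUTE FUNCTION update_account_size();

-- The quota check then becomes a single-row lookup, roughly:
-- SELECT CurrentSize + <new message size> <= MaxSize FROM accounts WHERE ID = 12345;

If you are worried about drift (for example after manual edits), an occasional reconciliation job that compares CurrentSize against SUM(Size) per account can catch and correct it.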

Related

DB table with an n:n helper table - read twice or duplicate rows?

This is a question about database access performance vs code simplicity and best practices.
Let's say I have a Users table and an Addresses table. Every user can have more than one address, which will be stored in the Addresses table with a foreign key to the Users table.
What would be the best way to read users from the database, assuming that I always want to get the addresses along with the users?
First option would be to query the user, say by his username, and once I have the object, use the user's id to query the Addresses table for all the user's addresses.
Pros:
Simple code
No duplicate data is transferred
Cons:
Requires two queries to the database
Second option would be to write a query that joins Users with Addresses and returns a user result line for every address the user has. All the columns, except for the address column, would be exactly the same for every line. I would then aggregate all the lines into a single user object with a list of addresses.
Pros:
Requires a single query to the database
Cons:
Relatively complicated code (aggregating the users)
A lot of the data transferred is redundant
Those are the two ways I could think of, both have their pros and cons. Which of the options would you suggest?
Maybe another solution altogether?
My first rule of thumb is usually to let the database engine do what it is good at. Joining of tables is a basic function that the database performs with maximum efficiency. A join by the DB will always be faster than what you can do by making multiple calls.
The point you make about the fact that it fetches a lot of user data is true only if you have real problems with data transfer or the data is really massive.
In exchange, you are making just one call to the database instead of multiple calls. That saving can well outweigh the possible downside of data size.
I'm not quite sure what you meant by "aggregating the user data" since you just take it from the first entry of that user and skip the rest.
At the end of the day, let the database do its work unless there is a really good reason not to do so.
In really serious cases there are ways to return NULLs in the user columns for all but the first row. However, this complicates the SQL query greatly and, once again, is generally not worth the overhead.
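As a rough sketch of that idea (assuming Users has id, username, email and Addresses has id, user_id, address; this works on SQL Server, PostgreSQL and MySQL 8+):

SELECT
    CASE WHEN rn = 1 THEN username END AS username,
    CASE WHEN rn = 1 THEN email    END AS email,
    address
FROM (
    SELECT u.id AS user_id, u.username, u.email, a.address,
           ROW_NUMBER() OVER (PARTITION BY u.id ORDER BY a.id) AS rn
    FROM Users u
    JOIN Addresses a ON a.user_id = u.id
) t
ORDER BY user_id, rn;

It saves a little bandwidth at the cost of a harder-to-read query, which is rarely a good trade.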
I just had a long debate about this with Microsoft over GitHub, and a discussion with an MS-SQL MVP.
Summarizing that thread (from my perspective):
To SQL Server it doesn't matter whether you have a single query or 10; the redundant fields returned have zero impact on SQL Server.
Splitting the queries is what SQL Server does internally anyway, and when people try to optimize it themselves, it's usually for the worse; SQL Server does better when you don't force it to act in a specific way.
Having multiple queries has overhead on SQL Server.
The only thing that splitting queries actually solves is bandwidth on the network, as fewer bytes are transferred over the wire, and he says that is negligible compared to the overhead of having multiple queries.
When you have massive returned rows, you'll want to split the queries because of Table Spools and because of the bandwidth.
In the end, I decided to use
GROUP_CONCAT(DISTINCT addresses.address SEPARATOR ' | ') addresses
...
GROUP BY userId
I then split the addresses into a list on the client (specifically, in my custom BeanPropertyRowMapper).
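Spelled out a bit more fully, the query looked roughly like the following (column names are my guesses, and GROUP_CONCAT is MySQL-specific; SQL Server 2017+ has STRING_AGG, PostgreSQL has string_agg):

SELECT u.userId,
       u.username,
       GROUP_CONCAT(DISTINCT a.address SEPARATOR ' | ') AS addresses
FROM Users u
LEFT JOIN Addresses a ON a.userId = u.userId
WHERE u.username = ?
GROUP BY u.userId, u.username;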

looking for a better way to sync databases

I have an app (vb.net) which collects data from users and stores the data locally on their laptop until they sync it up with a central SQLServer 2008 database. The sync needs to be in both directions. So right now, I have a timestamp on each record that gets set when that record gets updated. Then I compare times on the records to see which is more recent. If a record on the laptop is more recent than the one on the central DB, the laptop record gets sent up. And if the record on the central DB is more recent than the laptop, that record gets sent down to the laptop.
I have several hundred thousand records spread over about 15 tables. It takes 3 to 4 minutes to run through all of them if you are local on the network. The problem gets much worse for remote users: it takes them 20 to 30 minutes to sync via VPN.
I have about 5 users doing this and they all need to maintain the same information with each other by way of the central database. They all sync to the central DB, not with each other.
Is there a better way to check every record other than comparing timestamps?
Note that only a handful of records (5%) change each time they sync, but I don't know which ones it may be. It could be any of them. So I have to check all of them.
Thanks.
In my opinion timestamps are not the way to go for determining which records to send to the other party.
Although they might be "ok" for conflict resolution, time differences between the synchronizing parties (computers) might cause records to be skipped when sending out, causing real problems.
Myself, I use an identity column (on the server side) on one specific table to generate sequence numbers; in every transaction, I get a new sequence number and assign it to all updated/inserted rows of the other tables that need synchronization.
Now when a client requests synchronization, it provides the server with the latest 'sequence' it received during last synchronization or 0 if it is the first time.
The server sends only those records that have a greater sequence number, then determines the highest sequence number among the records it actually sent to the client, and gives this number to the client for its next synchronization request.
In my scenario, conflict resolution is done on the client, because all the business logic is there anyway; this means that the client always receives updates first, before it starts to send its own.
Because you use one newly generated sequence number for every transaction, you maintain referential integrity during each synchronization. To make sure that's actually true,
you need to determine the currently highest sequence number before you start to send synchronization data, and never retrieve any records higher than this number, because otherwise you could break referential integrity.
This is because some other thread might have committed inserts of Orders and OrderItems after you already looked up the Orders but not the OrderItems, leaving you with OrderItems in your outgoing synchronization package without their Order.
For deletions, I use an IsDeleted column, and the server holds records for some period before they really get deleted.
When clients insert data, I give them feedback on which (primary) keys those records were given, etc., etc.
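A minimal sketch of the sequence idea in T-SQL, with hypothetical table and column names:

-- One row per synchronized transaction; the identity value is the sequence
CREATE TABLE SyncSequence (
    SequenceId int IDENTITY(1,1) PRIMARY KEY,
    CreatedAt  datetime NOT NULL DEFAULT GETDATE()
);

-- Every synchronized table carries the sequence of the transaction that last touched the row
ALTER TABLE Orders ADD SyncSequenceId int NULL;

-- Inside each transaction that modifies synchronized rows:
INSERT INTO SyncSequence DEFAULT VALUES;
DECLARE @seq int = SCOPE_IDENTITY();
UPDATE Orders SET SyncSequenceId = @seq WHERE OrderId = 12345;  -- alongside the actual changes

-- When a client syncs, it sends the highest sequence it received last time (@lastSeen);
-- take a snapshot of the current maximum first, and never read past it:
DECLARE @lastSeen int = 0;
DECLARE @maxSeq int = (SELECT ISNULL(MAX(SequenceId), 0) FROM SyncSequence);
SELECT * FROM Orders WHERE SyncSequenceId > @lastSeen AND SyncSequenceId <= @maxSeq;
-- @maxSeq is returned to the client for its next synchronization request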
Well, there is so much more to this than I can mention here, but here are some key points you should watch carefully:
How to prevent:
Missing records
Missing deletes
Double inserts
Unnecessary sending of records (I use a nullable field LastModifierId)
Input validation
Referential integrity
Conflict resolution
Performance costs (choose the right indexes; filtered unique indexes are great for keeping track of temporary client insert identities of records, which may also be null, and you need these to prevent double inserts; see the sketch below)
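For that last point, a filtered unique index in SQL Server 2008+ might look like this (the column names are made up):

-- Reject a second insert of the same client-side temporary identity,
-- while still allowing rows that never had one (NULL)
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_ClientInsertId
ON Orders (ClientId, ClientInsertId)
WHERE ClientInsertId IS NOT NULL;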
Well, good luck. I hope this gives you some food for thought.

hbase data modeling for activity feeds/news feeds/timeline

I decided to use HBase in a project to store users' activities in a social network. Despite the fact that HBase has a simple way to express data (column oriented), I'm facing some difficulties deciding how to represent the data.
So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I thought of basically two approaches with an Activity HBase table:
The key could be the user reference + timestamp of activity creation, and the value would be all the activity metadata (fixed size most of the time)
The key is the user reference, and then each activity would be stored as a new column inside a column family.
I saw examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact in the way I access the data for these 2 approaches?
In general you are asking whether your table should be wide or tall. HBase works with both, up to a point. Wide tables should never have a row that exceeds the region size (by default 256MB), so a really prolific user may crash the system if you store large chunks of data for their actions. If you are only storing a few bytes per action, then putting all user activity in one row will allow you to get their full history with a single get. However, you will be retrieving the full row, which could cause some slowdown for a lot of history (tens of seconds for rows over 100MB).
Going with a tall table and an inverse timestamp would allow you to get a user's recent activity very quickly (start a scan with the key = user id).
Using a timestamp at the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always go to the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
Another example to look at is OpenTSDB

How to auto remove an expired record from a database?

We are building a large stock and forex trading platform using a relational database. At any point during the day there will be thousands, if not millions, of records in our Orders table. Some orders, if not fulfilled immediately, expire and must be removed from this table, otherwise, the table grows very quickly. Each order has an expiration time. Once an order expires it must be deleted. Attempting to do this manually using a scheduled job that scans and deletes records is very slow and hinders the performance of the system. We need to force the record to basically delete itself.
Is there a way to configure any RDBMS database to automatically remove a record based on a date/time field if the time occurs in the past?
Since you most likely will have to implement complex order handling, e.g. limit orders, stop-limit orders, etc., you need a robust mechanism for monitoring and executing orders in real time. This process is not limited to expired orders. This is a core mechanism in a trading platform and you will have to design a robust solution that fulfills your needs.
To answer your question: Delete expired orders as part of your normal order handling.
Why must the row be deleted?
I think you are putting the cart before the horse here. If a row is expired, it can be made "invisible" to other parts of the system in many ways, including views which only show orders meeting certain criteria. Having extra expired rows around should not hamper performance if your database is appropriately indexed.
What level of auditing and tracking is necessary? Is no analysis ever done on expired orders?
Do fulfilled orders become some other kind of document/entity?
There are techniques in many databases which allow you to partition tables. Using the partition function, it is possible to regularly purge partitions (of like rows) much more easily.
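A small sketch of the "make them invisible" approach mentioned above, assuming an ExpiresAt column on Orders (all names are illustrative):

-- The rest of the system queries the view instead of the base table
CREATE VIEW ActiveOrders AS
SELECT OrderId, AccountId, Symbol, Quantity, Price, ExpiresAt
FROM Orders
WHERE ExpiresAt > GETDATE();

-- An index on the expiration column keeps the predicate cheap
CREATE INDEX IX_Orders_ExpiresAt ON Orders (ExpiresAt);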
You have not specified which DB you are using, but let's assume you use MSSQL: you could create an agent job that runs periodically, but you are saying that that might not be a solution for you.
So what about having an insert trigger so that when a new record is inserted you delete all the records that have expired? This will keep the number of records relatively small.
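A rough sketch of that trigger in T-SQL (ExpiresAt is an assumed column name):

CREATE TRIGGER TR_Orders_PurgeExpired
ON Orders
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Piggyback on the insert: clear anything whose expiration has already passed
    DELETE FROM Orders WHERE ExpiresAt <= GETDATE();
END;

On a very busy table you may want to delete in small batches (DELETE TOP (n) ...) so the trigger does not hold locks for long on the insert path.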

Database design--billions of records in one table?

Let's say you're creating a database to store messages for a chat room application. There's an infinite number of chat rooms (they're created at run-time on-demand), and all messages need to be stored in the database.
Would it be a mistake to create one giant table to store messages for all chat rooms, knowing that there could eventually be billions of records in that one table?
Would it be more prudent to dynamically create a table for each room created, and store that room's messages only in that table?
It would be proper to have a single table. When you have n tables which grow with application usage, you're effectively using the database itself as a table of tables, which is not how an RDBMS is designed to work. Billions of records in a single table is trivial on a modern database. At that level, your only performance concerns are good indexes and how you do joins.
Billions of records?
Assuming you constantly have 1000 active users each posting 1 message per minute, this results in roughly 1.44 million messages per day, and approximately 500 million messages per year.
If you still need to store chat messages several years old (what for?), you could archive them into year-based tables.
I would definitely argue against dynamic creation of room-based tables.
Whilst a table per chat room could be done, each database has limits on the number of tables that may be created, so given an infinite number of chat rooms you would be required to create an infinite number of tables, which is not going to work.
You can, on the other hand, store billions of rows of data; storage is not normally the issue given the space. Retrieval of the information within a sensible time frame is, however, and requires careful planning.
You could partition the messages by a date range, and if planned out, you can use LUN migration to move older data onto slower storage, whilst leaving more recent data on the faster storage.
Strictly speaking, your design is right: a single table. Fields with low entropy (e.g. 'userid') should link to ID tables, i.e. following normal database normalization patterns.
You might want to think about range-based partitioning, e.g. 'copies' of your table with a year prefix, or maybe even just a 'current' table and an archive table.
Both of these approaches mean that your query semantics are more complex (consider someone doing a multi-year search); you would have to query multiple tables.
However, the upside is that your 'current' table will remain at a roughly constant size, and archiving is more straightforward (you can just drop table 2005_Chat when you want to archive the 2005 data).
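The same idea can also be expressed with declarative range partitioning (PostgreSQL syntax shown here; SQL Server and MySQL have their own partitioning schemes, and all names are illustrative):

CREATE TABLE chat_messages (
    id       bigint    NOT NULL,
    room_id  bigint    NOT NULL,
    sent_at  timestamp NOT NULL,
    body     text      NOT NULL
) PARTITION BY RANGE (sent_at);

CREATE TABLE chat_messages_2005 PARTITION OF chat_messages
    FOR VALUES FROM ('2005-01-01') TO ('2006-01-01');
CREATE TABLE chat_messages_2006 PARTITION OF chat_messages
    FOR VALUES FROM ('2006-01-01') TO ('2007-01-01');

-- Archiving a year is then just a metadata operation:
-- ALTER TABLE chat_messages DETACH PARTITION chat_messages_2005;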
-Ace
