Solution to handling data that frequently changes - sql-server

I'm currently trying to figure out a solution to best optimize for data that is going to frequently change. The server is running IIS/SQL Server and it is an ASP.NET Web API application. My table structure is something like the following:
User Table:
UserID PK
Status Table:
StatusID PK,
Title varchar
UserStatus Table:
UserID PK (CLUSTERED),
StatusID FK (NON-CLUSTERED),
Date DateTimeOffset (POSSIBLY INDEXED) - This would be used as an expiration. Old records become irrelevant.
There will be roughly 5000+ records in the users table. The status table will have roughly 500 records. The UserStatus table would have frequent changes (every 5-30 seconds) to the StatusID and Date fields, made by anywhere from 0 to 1000 users at any given time. The UserStatus table will also have frequent SELECT queries run against it, filtering out records with old/irrelevant dates.
I have considered populating the UserStatus table with a record for
each user and only performing updates. This would mean there would
always be the expected record present and it would limit the checks
for existence. My concern is performance and all of the fragmenting
of the indexes. I would then query against the table for records with
dates that fall within several minutes of the current time.
I have considered only inserting relevant records to the
UserStatus table, updating when they exist for a user, and running
a task that cleans old/irrelevant data out. This method would keep
the number of records down but I would have to check for the
existence of records before performing a task and indexes may inhibit
performance.
Finally I have considered a MemoryCache or something of the like. I
do not know much about caching in a Web API, but from what I have
read about it, I quickly decided against this because of potential
concurrency issues when iterating over the cache.
Does anyone have a recommendation for a scenario like this? Is there another methodology I am not considering?

Given the number of records you are talking about, I would use the T-SQL MERGE statement, which updates existing records and adds new ones in one efficient statement.
Given the design you mentioned, you should be able to run a periodic maintenance script that will fix any fragmentation issues.
The solution can be scaled. If the records got to the point where some slowdown was occurring, I would consider SSD storage, where fragmentation is much less of an issue.
If the disadvantages of SSD make that undesirable you can look into in-memory OLTP.
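As a rough illustration of the "update if present, insert if not" pattern, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for SQL Server. It uses SQLite's INSERT ... ON CONFLICT upsert (SQLite 3.24 or later) where SQL Server would use MERGE. The helper names, the epoch-seconds Date column, and the five-minute window are assumptions made for the example, not part of the original design.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_db():
    """Create an in-memory stand-in for the UserStatus table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE UserStatus ("
        " UserID   INTEGER PRIMARY KEY,"
        " StatusID INTEGER NOT NULL,"
        " Date     REAL    NOT NULL)"  # assumption: stored as epoch seconds
    )
    conn.execute("CREATE INDEX IX_UserStatus_Date ON UserStatus (Date)")
    return conn


def upsert_status(conn, user_id, status_id, when):
    """Insert the user's row, or update it in place if it already exists."""
    conn.execute(
        "INSERT INTO UserStatus (UserID, StatusID, Date) VALUES (?, ?, ?) "
        "ON CONFLICT(UserID) DO UPDATE SET "
        "StatusID = excluded.StatusID, Date = excluded.Date",
        (user_id, status_id, when.timestamp()),
    )


def current_statuses(conn, now, window=timedelta(minutes=5)):
    """Return (UserID, StatusID) pairs whose Date falls inside the window."""
    cutoff = (now - window).timestamp()
    return conn.execute(
        "SELECT UserID, StatusID FROM UserStatus "
        "WHERE Date >= ? ORDER BY UserID",
        (cutoff,),
    ).fetchall()
```

With this shape there is no separate existence check: one statement per change covers both cases, and stale rows are simply filtered out at read time.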

Related

Fast data retrieval without indexes on a table with data insertions at every 10 seconds (short time span)

I am fetching data from a table of about 20K rows that belongs to a third-party source; the way the table is filled can't be changed.
On the third-party side, the table is filled as follows:
New data arrives every 15 seconds, approximately 7K rows each time.
At any given time only the last three timestamps are available; the rest of the data is deleted.
There is no index on the table, and one can't be requested, both for unavoidable reasons and because it might slow down the inserts.
I am aware of the following:
Row locks, and other locks further up the hierarchy, are taken while data is being inserted.
The problem persists when selecting with NOLOCK.
There is no join with any other table while fetching; we join the tables once the data is local to us in a temp table.
When data insertion is stopped on the third-party side, the data comes back in 100 ms to 122 ms.
When the service is running, it takes 3 to 5 seconds.
Any help, suggestion, or approach is appreciated.
The following is a fairly high-end solution. Based on what you have said I believe it would work, but there'd be a lot of detail to work out.
Briefly: table partitions.
Set up a partition scheme on this table
Based on an article I read recently, this CAN be done with unindexed heaps
Data is loaded every 15 seconds? Then the partitions need to be based on those 15 second intervals
For a given dataload (i.e. once per 15 seconds):
Create the "next" partition
Load the data
SWITCH the new partition (new data) into the main table
SWITCH the oldest partition out (data for only three time periods is present at a time, right?)
Drop that "retired" partition
While potentially efficient and effective, this would be very messy. The big problem I see is, if they can't add a simple index, I don't see how they could possibly set up table partitioning.
Another similar trick is to set up partitioned views, which essentially is "roll your own partitioning". This would go something like:
Have a set of identically structured tables
Create a view UNION ALLing the tables
On dataload, create a new table, load data into that table, then ALTER VIEW to include that newest table and remove the oldest table.
This could have worse locking/blocking issues than the partitioning solution, though much depends on how heavy your read activity is. And, of course, it is much messier than just adding an index.
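The "roll your own partitioning" steps above can be sketched as follows. This is only an illustration in Python against an in-memory SQLite database: SQLite has no ALTER VIEW, so the sketch drops and recreates the view inside one transaction, whereas SQL Server would use ALTER VIEW. The table and view names (batch_N, feed), the two-column layout, and the helper name are assumptions made for the example.

```python
import sqlite3

KEEP = 3  # number of most recent batches that readers should see


def load_batch(conn, batch_id, payloads, live):
    """Load one batch into its own table, then repoint the view at the
    newest KEEP tables and drop whatever fell out of the window.

    `live` is a list of table names currently in the view, oldest first;
    it is updated in place.
    """
    table = f"batch_{int(batch_id)}"
    conn.execute(f"CREATE TABLE {table} (ts INTEGER, payload TEXT)")
    conn.executemany(
        f"INSERT INTO {table} (ts, payload) VALUES (?, ?)",
        [(batch_id, p) for p in payloads],
    )

    live.append(table)
    retired = live[:-KEEP]   # tables that are now too old
    del live[:-KEEP]         # keep only the newest KEEP names

    # Rebuild the view over the surviving tables.
    conn.execute("DROP VIEW IF EXISTS feed")
    conn.execute(
        "CREATE VIEW feed AS "
        + " UNION ALL ".join(f"SELECT ts, payload FROM {t}" for t in live)
    )
    for t in retired:
        conn.execute(f"DROP TABLE {t}")
    conn.commit()
```

Readers only ever query the view, so each load touches a brand-new table and the swap itself is a short metadata change.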

Which data store is best for my scenario

I'm working on an application that involves very high execution of update / select queries in the database.
I have a base table (A) which will have about 500 records for an entity for a day. And for every user in the system, a variation of this entity is created based on some of the preferences of the user and they are stored in another table (B). This is done by a cron job that runs at midnight everyday.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I always keep data for one day in these tables and at midnight I archive historical data to HBase. This setup is working fine and I'm having no performance issues so far.
There has been some change in the business requirements lately, and now some attributes in base table A (for 15 - 20 records) will change every 20 seconds; based on that, I have to recalculate some values for all of the variation records in table B for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds, and by then the next update occurs, eventually resulting in all SELECT queries getting queued up. I'm getting about 3 GET requests every 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I can buy more processing power and solve this situation but I'm interested in having a properly scaled system which can handle even a million users.
Can anybody here suggest a better alternative? Does nosql + relational database help me here ? Are there any platforms / datastores which will let me update data frequently without locking and at the same time give me the flexibility of running select queries on various fields in an entity ?
Cheers
Jugs
I recommend looking at an in memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application is currently using SQL, then there's no reason to move away from that to nosql. The performance requirements you describe can certainly be met by an in memory SQL-capable DBMS.
As I understand it, you are updating 200K records every 20 seconds, so within about 10 minutes you will have updated almost all of your data. In that case, why write that state to the database at all if it is updated so frequently? I don't know your full requirements, but why don't you just calculate it on demand using data from table A?
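The "calculate on demand" idea could look roughly like the sketch below: only table A and the user preferences are stored, and the per-user variation is derived when a request arrives. The record shapes, the markup rule, and every name here are hypothetical, since the question does not say how a variation is actually computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseRecord:
    """One row of table A (hypothetical shape)."""
    record_id: int
    price_cents: int


@dataclass(frozen=True)
class UserPreference:
    """Per-user settings (hypothetical shape)."""
    user_id: int
    markup_percent: int


def variation_for(base, pref):
    """Derive the user's variation from the base record (made-up rule)."""
    return base.price_cents * (100 + pref.markup_percent) // 100


def handle_request(table_a, prefs, user_id, record_id):
    """Answer an API request without reading or writing table B."""
    return variation_for(table_a[record_id], prefs[user_id])
```

When a master record changes, only its single entry in table_a is replaced; no per-user rows need to be rewritten, and the next request picks up the new value.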

Fact table partitioning: how to handle updates in ETL?

We are trying to implement table partitioning for a Data Warehouse Fact table which contains approximately 400M rows. Our ETL takes data from source system 50 days backwards (new rows, modified rows, based on source system timestamp) from the previous load. So in every ETL cycle there are new rows coming in, and also old rows which are updating the corresponding rows in the Fact table. The idea is to insert new rows into the Fact table and update modified rows.
The partition column would be date (int, YYYYMMDD) and we are considering partitioning by month.
As far as I understand, table partitioning would ease our inserts via fast partition switch operations. We could split the most recent partition to create a new free partition, load new rows into a staging table (using a date constraint, e.g. for the most recent month) and then use a partition switch operation to "move" new rows into the partitioned Fact table. But how can we handle the modified rows which should update the corresponding rows in the Fact table? Those rows can contain data from the previous month(s). Does partition switch help here? Usually INSERT and UPDATE rows are determined by an ETL tool (e.g. SSIS in our case) or by a MERGE statement. How does partitioning work in these kinds of situations?
I'd take another look at the design and try to figure out if there's a way around the updates. Here are a few implications of updating the fact table:
Performance: Updates are fully logged transactions. Big fact tables also have lots of data to read and write.
Cubes: Updating the fact table requires reprocessing the affected partitions. As your fact table continues to grow, the cube processing time will grow as well.
Budget: Fast storage is expensive. Updating big fact tables will require lots of fast reads and writes.
Purist theory: You should not change the fact table unless the initial value was an error (i.e. the user entered $15,000 instead of $1,500). Any non-error scenario will be changing the originally recorded transaction.
What is changing? Are the changing pieces really attributes of a dimension? If so, can they be moved to a dimension and have changes handled with a Slowly Changing Dimension type task?
Another possibility, can this be accomplished via offsetting transactions? Example:
The initial InvoiceAmount was $10.00. Accounting later added $1.25 for tax, then billed the customer for $11.25. Rather than updating the value to $11.25, insert a record for $1.25. The sum amount for the invoice will still be $11.25, and you can do a minimally logged insert rather than a fully logged update to accomplish this.
Not only is updating the fact table a bad idea in theory, it gets very expensive and non-scalable as the fact table grows. You'll be reading and writing more data, requiring more IOPS from the storage subsystem. When you get ready to do analytics, cube processing will then throw in more problems.
You'll also have to constantly justify to management why you need so many IOPS for the data warehouse. Is there business value/justification in needing all of those IOPS for your constantly changing "fact" table?
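The invoice example can be sketched like this, using Python with an in-memory SQLite table as a stand-in for the fact table. The table name, the column names, and the choice to store amounts as integer cents are assumptions made for the illustration.

```python
import sqlite3


def open_facts():
    """Create an append-only stand-in for the fact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE FactInvoice ("
        " InvoiceID   INTEGER NOT NULL,"
        " AmountCents INTEGER NOT NULL,"  # assumption: money kept as cents
        " Reason      TEXT    NOT NULL)"
    )
    return conn


def record(conn, invoice_id, amount_cents, reason):
    """Append a fact row; existing rows are never updated."""
    conn.execute(
        "INSERT INTO FactInvoice (InvoiceID, AmountCents, Reason) "
        "VALUES (?, ?, ?)",
        (invoice_id, amount_cents, reason),
    )


def invoice_total(conn, invoice_id):
    """The current invoice amount is the sum of all of its fact rows."""
    row = conn.execute(
        "SELECT COALESCE(SUM(AmountCents), 0) FROM FactInvoice "
        "WHERE InvoiceID = ?",
        (invoice_id,),
    ).fetchone()
    return row[0]
```

Recording 1000 cents for the original amount and then 125 cents for tax leaves two rows whose sum is 1125 cents, that is, $11.25, without ever touching the first row.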
If you can't find a way around updates on the fact table, at least establish a cut-off point where the data is determined read-only. Otherwise, you'll never be able to scale.
Switching does not help here.
Maybe you can execute updates concurrently using multiple threads on distinct ranges of rows. That might speed it up. Be careful not to trigger lock escalation so you get good concurrency.
Also make sure that you update the rows mostly in ascending sort order of the clustered index. This helps with disk IO (this technique might not work well with multi-threading).
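One way to picture the "distinct ranges, ascending order" advice is a small helper that splits the clustered key space into contiguous batches, each of which would become one UPDATE ... WHERE key BETWEEN low AND high. The sketch below is plain Python; the function name and the batch size in the usage note are assumptions, and the batch size would need tuning against the lock escalation behaviour of the actual server.

```python
def key_ranges(min_id, max_id, batch_size):
    """Yield contiguous, non-overlapping (low, high) key ranges in
    ascending order, each covering at most batch_size keys."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1
```

For example, list(key_ranges(1, 10, 4)) gives [(1, 4), (5, 8), (9, 10)]. Each range could be handed to a separate worker, or processed one after another to keep the writes in clustered-index order.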
There are as many reasons to update a fact record as there are non-identifying attributes in the fact. Unless you plan on a "first delete" then "insert", you simply cannot avoid updates. You cannot simply say "record the metric deltas as new facts".

How to auto remove an expired record from a database?

We are building a large stock and forex trading platform using a relational database. At any point during the day there will be thousands, if not millions, of records in our Orders table. Some orders, if not fulfilled immediately, expire and must be removed from this table, otherwise, the table grows very quickly. Each order has an expiration time. Once an order expires it must be deleted. Attempting to do this manually using a scheduled job that scans and deletes records is very slow and hinders the performance of the system. We need to force the record to basically delete itself.
Is there a way to configure any RDBMS to automatically remove a record based on a date/time field once that time is in the past?
Since you most likely will have to implement complex order handling, e.g. limit orders, stop-limit orders etc., you need a robust mechanism for monitoring and executing orders in real time. This process is not limited to expired orders. This is a core mechanism in a trading platform and you will have to design a robust solution that fulfills your needs.
To answer your question: Delete expired orders as part of your normal order handling.
Why must the row be deleted?
I think you are putting the cart before the horse here. If a row is expired, it can be made "invisible" to other parts of the system in many ways, including views which only show orders meeting certain criteria. Having extra deleted rows around should not hamper performance if your database is appropriately indexed.
What level of auditing and tracking is necessary? Is no analysis ever done on expired orders?
Do fulfilled orders become some other kind of document/entity?
There are techniques in many databases which allow you to partition tables. Using the partition function, it is possible to regularly purge partitions (of like rows) much more easily.
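Hiding expired rows instead of deleting them right away could look something like this sketch, written in Python against an in-memory SQLite database. The table layout, the integer epoch-seconds ExpiresAt column, the index name, and the helper names are assumptions made for the example.

```python
import sqlite3


def open_orders():
    """Create a stand-in Orders table with an index on the expiry time."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Orders ("
        " OrderID   INTEGER PRIMARY KEY,"
        " Symbol    TEXT    NOT NULL,"
        " ExpiresAt INTEGER NOT NULL)"  # assumption: epoch seconds
    )
    conn.execute("CREATE INDEX IX_Orders_ExpiresAt ON Orders (ExpiresAt)")
    return conn


def active_orders(conn, now):
    """Return only the orders that have not expired as of `now`.
    Expired rows stay in the table but are invisible to callers."""
    return conn.execute(
        "SELECT OrderID, Symbol FROM Orders "
        "WHERE ExpiresAt > ? ORDER BY OrderID",
        (now,),
    ).fetchall()
```

The rest of the system reads through active_orders (or an equivalent view), so the physical cleanup can happen later, in bulk, at a quiet time.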
You have not specified what DB you are using, but let's assume you use MSSQL: you could create an agent job that runs periodically, though you are saying that might not be a solution for you.
So what about having an insert trigger, so that when a new record is inserted you delete all the records that have expired? This will keep the number of records relatively small.
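A minimal sketch of the insert-trigger idea follows, using SQLite trigger syntax from Python as a stand-in; a SQL Server trigger would be written differently. The CreatedAt column is an assumption added so the trigger has a notion of "now" without calling a clock, and all names are made up for the example.

```python
import sqlite3


def open_orders_with_trigger():
    """Create an Orders table whose insert trigger purges expired rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Orders (
            OrderID   INTEGER PRIMARY KEY,
            CreatedAt INTEGER NOT NULL,  -- assumption: epoch seconds
            ExpiresAt INTEGER NOT NULL
        );
        CREATE INDEX IX_Orders_ExpiresAt ON Orders (ExpiresAt);

        -- After each insert, remove every row that had already expired
        -- by the time the new row was created.
        CREATE TRIGGER trg_Orders_PurgeExpired AFTER INSERT ON Orders
        BEGIN
            DELETE FROM Orders WHERE ExpiresAt <= NEW.CreatedAt;
        END;
        """
    )
    return conn


def remaining_ids(conn):
    """List the OrderIDs currently stored, in ascending order."""
    return [r[0] for r in conn.execute(
        "SELECT OrderID FROM Orders ORDER BY OrderID").fetchall()]
```

Note that this moves the cleanup cost onto every insert, so it would need measuring under realistic load before being relied on.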

Your first gut feeling on this SqlServer design question

We have two tables. One holds measurements; the other holds timestamps (one for every minute).
Every measurement holds an FK to a timestamp.
We have 8M (million) measurements and 2M timestamps.
We are creating a reporting database via replication, and my first solution was this: when a new measurement is received via the replication process, look up the right timestamp and add it to the measurement table.
Yes, it's duplication of data, but it is for reporting, and since we have measurements every 5 minutes and users can query for yearly data (105,000 measurements), we have to optimize for speed.
But a co-developer said: you don't have to do that, we'll just query with a join (on the two tables), SqlServer is so fast, you don't see the difference.
My first reaction was: a join on two tables with 8M and 2M records can't make 'no difference'.
What is your first feeling on this?
EDIT:
new measurements: 400 records per 5 minutes
EDIT 2:
maybe the question is not so clear:
the first solution is to get the data from the timestamp table and copy it to the measurement table when the measurement record is inserted.
In that case we have an action when the record is inserted AND an extra (duplicated) timestamp value. In this case we only query ONE table because it holds all the data.
The second solution is to join the two tables in a query.
With the proper index the join will make no difference*. My initial thought is that if the report is querying over the entire dataset, the join might actually be faster because there are 6 million fewer timestamps to read from disk.
*This is just a guess based on my experience with tables with millions of records. Your results will vary based on your queries.
I'd create an Indexed View (similar to a Materialized view in Oracle) which joins the tables using appropriate indexes.
If the query just retrieves the data for the given date ranges, there will be a merge join, that is, a range scan for each of the two tables. Since the timestamp table presumably contains only the timestamp, this shouldn't be expensive.
On the other hand, if you have only one table and an index on the date column, the index itself becomes larger and more expensive to scan.
So, with properly constructed indexes and queries, I wouldn't expect a significant difference in performance.
I'd suggest you keep a properly normalized design until you start having performance problems that force you to change it. Then you need to carefully analyze query plans and measure performance with different options; there are lots of things that could matter in your particular case.
Frankly, in this case your best bet is to try both solutions and see which one is better. Performance tuning is an art when you start talking about large data sets, and it is highly dependent not only on the database design you have but also on the hardware and on whether you are using partitioning, etc. Be sure to test both getting the data out and putting the data in. Since you have so many inserts, insert speed is critical, and the index you would need on the datetime field is critical to select performance, so you really need to test this thoroughly. Don't forget about dumping the cache when you test. And test multiple times and, if possible, under a typical query load.
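A "try both and measure" comparison might be set up along the lines of the sketch below. It uses Python with an in-memory SQLite database and tiny synthetic data, so the timings say nothing about SQL Server at 8M rows; it only shows the shape of such a test. The schema, names, and data volumes are all assumptions, and the sketch does not clear any cache between runs.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE Ts (TsID INTEGER PRIMARY KEY, At INTEGER NOT NULL);
CREATE INDEX IX_Ts_At ON Ts (At);
CREATE TABLE Measurement (
    MeasurementID INTEGER PRIMARY KEY,
    TsID  INTEGER NOT NULL REFERENCES Ts (TsID),
    Value REAL    NOT NULL);
CREATE INDEX IX_Measurement_TsID ON Measurement (TsID);
CREATE TABLE MeasurementFlat (
    MeasurementID INTEGER PRIMARY KEY,
    At    INTEGER NOT NULL,  -- duplicated timestamp value
    Value REAL    NOT NULL);
CREATE INDEX IX_MeasurementFlat_At ON MeasurementFlat (At);
"""

JOIN_SQL = (
    "SELECT m.MeasurementID, t.At, m.Value "
    "FROM Measurement m JOIN Ts t ON t.TsID = m.TsID "
    "WHERE t.At BETWEEN ? AND ? ORDER BY m.MeasurementID")
FLAT_SQL = (
    "SELECT MeasurementID, At, Value FROM MeasurementFlat "
    "WHERE At BETWEEN ? AND ? ORDER BY MeasurementID")


def build(conn, n_timestamps=2000, per_ts=4):
    """Populate both designs with the same synthetic measurements."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO Ts (TsID, At) VALUES (?, ?)",
        [(i, i * 60) for i in range(1, n_timestamps + 1)])
    rows, mid = [], 0
    for ts_id in range(1, n_timestamps + 1):
        for k in range(per_ts):
            mid += 1
            rows.append((mid, ts_id, ts_id * 60, float(k)))
    conn.executemany(
        "INSERT INTO Measurement VALUES (?, ?, ?)",
        [(m, t, v) for m, t, _, v in rows])
    conn.executemany(
        "INSERT INTO MeasurementFlat VALUES (?, ?, ?)",
        [(m, at, v) for m, _, at, v in rows])
    conn.commit()


def timed(conn, sql, params, repeats=5):
    """Run the query several times; return (best seconds, rows)."""
    best, rows = float("inf"), []
    for _ in range(repeats):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()
        best = min(best, time.perf_counter() - start)
    return best, rows
```

Running timed with JOIN_SQL and FLAT_SQL over the same range should return identical rows; the interesting part is then comparing the two timings, and the insert cost of each design, on the real server with realistic volumes.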
