Which data store is best for my scenario?

I'm working on an application that involves a very high volume of UPDATE/SELECT queries against the database.
I have a base table (A) which will hold about 500 records for an entity per day. For every user in the system, a variation of this entity is created based on some of the user's preferences and stored in another table (B). This is done by a cron job that runs at midnight every day.
So if there are 10,000 users and 500 records in table A, there will be 5M records in table B for that day. I only ever keep one day's data in these tables, and at midnight I archive the historical data to HBase. This setup has been working fine and I've had no performance issues so far.
There has been a change in the business requirements lately: some attributes in base table A (for 15-20 records) will now change every 20 seconds, and based on that I have to recalculate some values for all of the corresponding variation records in table B, for all users. Even though only 20 master records change, I need to recalculate and update 200,000 user records, which takes more than 20 seconds; by then the next update arrives, and eventually all SELECT queries end up queued. I'm getting about 3 GET requests every 5 seconds from online users, which results in 6-9 SELECT queries. To respond to an API request, I always use the fields in table B.
I could buy more processing power to get around this, but I'm interested in a properly scaled system that can handle even a million users.
Can anybody here suggest a better alternative? Would a NoSQL + relational database combination help me here? Is there any platform/datastore that will let me update data frequently without locking, while still giving me the flexibility of running SELECT queries on various fields of an entity?
Cheers
Jugs

I recommend looking at an in-memory DBMS that fully implements MVCC, to eliminate blocking issues. If your application currently uses SQL, there's no reason to move away from that to NoSQL. The performance requirements you describe can certainly be met by an in-memory SQL-capable DBMS.

As I understand it, you are updating 200K records every 20 seconds, so within about 10 minutes you will have rewritten almost all of your data. If the state changes that frequently, why write it to the database at all? I don't know the details of your requirements, but why not just calculate it on demand from the data in table A?
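If on-demand calculation is viable, a very rough sketch of what it could look like is below. Every table and column name here is hypothetical, since the real schema isn't shown; the idea is simply to replace table B with a view that derives each user's variation from table A and the stored preferences at query time.

    -- Hypothetical sketch: derive the per-user variations at query time
    -- instead of materializing them in table B. The CROSS JOIN pairs every
    -- user's preferences with every base record; adjusted_value is a
    -- stand-in for whatever the real recalculation is.
    CREATE VIEW user_entity_view AS
    SELECT p.user_id,
           a.record_id,
           a.base_value * p.preference_factor AS adjusted_value
    FROM base_table_a AS a
    CROSS JOIN user_preferences AS p;

With this shape, the 20-second updates only touch the 15-20 rows in table A, and each API request computes just the rows it actually needs.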

Related

Developing an optimal solution/design that sums many database rows for a reporting engine

Problem: I am developing a reporting engine that displays data about how many bees a farm detected (Bees is just an example here)
I have 100 devices that each minute count how many bees were detected on the farm. Here is what the DB looks like:
So there can be hundreds of thousands of rows in a given week.
The farmer wants a report that shows for a given day how many bees came each hour. I developed two ways to do this:
The server takes all 100,000 rows for that day from the DB and filters them down. The server uses a large amount of memory to do this, and I feel this is a brute-force solution.
I have a stored procedure that returns a temporarily created table with, for each hour, the total number of bees collected per device. The server takes this table and doesn't need to process 100,000 rows.
This returns 24 * 100 rows; however, it takes much longer than I expected.
What are some good candidate solutions for consolidating and summing this data without taking 30 seconds just to sum a day of data (and I may need a month's worth, divided between days)?
If performance is your primary concern here, there's probably quite a bit you can do directly in the database. I would try indexing the table on time_collected_bees so it can filter down to the 100K relevant rows faster. My guess is that that's where your slowdown is happening: the database is scanning the whole table to find the relevant entries.
If you're using SQL Server, you can try looking at your execution plan to see what's actually slowing things down.
Give database optimization more of a look before you architect something really complex and hard to maintain.
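As a concrete illustration of the indexing plus in-database aggregation suggested above (SQL Server flavored; time_collected_bees is taken from the answer, while bee_counts, device_id and bee_count are assumed names):

    -- Index so the one-day filter becomes a range seek instead of a scan.
    CREATE INDEX IX_bee_counts_time
        ON bee_counts (time_collected_bees)
        INCLUDE (device_id, bee_count);

    -- Hourly totals per device for one day, computed entirely in the DB
    -- (returns at most 24 * 100 rows).
    SELECT device_id,
           DATEPART(hour, time_collected_bees) AS hour_of_day,
           SUM(bee_count)                      AS bees_detected
    FROM bee_counts
    WHERE time_collected_bees >= @day_start
      AND time_collected_bees <  @day_end
    GROUP BY device_id, DATEPART(hour, time_collected_bees)
    ORDER BY device_id, hour_of_day;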

Solution to handling data that frequently changes

I'm currently trying to figure out a solution to best optimize for data that is going to frequently change. The server is running IIS/SQL Server and it is an ASP.NET Web API application. My table structure is something like the following:
User Table:
UserID PK
Status Table:
StatusID PK,
Title varchar
UserStatus Table:
UserID PK (CLUSTERED),
StatusID FK (NON-CLUSTERED),
Date DateTimeOffset (POSSIBLY INDEXED) - This would be used as an expiration. Old records become irrelevant.
There will be roughly 5,000+ records in the Users table. The Status table will have roughly 500 records. The UserStatus table will see frequent changes (every 5-30 seconds) to the StatusID and Date fields for anywhere from 0 to 1,000 users at any given time. This UserStatus table will also have frequent SELECT queries run against it, filtering out records with old/irrelevant dates.
I have considered populating the UserStatus table with a record for each user and only performing updates. This would mean there would always be the expected record present and it would limit the checks for existence. My concern is performance and all of the fragmenting of the indexes. I would then query against the table for records with dates that fall within several minutes of the current time.
I have considered only inserting relevant records to the UserStatus table, updating when they exist for a user, and running a task that cleans old/irrelevant data out. This method would keep the number of records down but I would have to check for the existence of records before performing a task and indexes may inhibit performance.
Finally I have considered a MemoryCache or something of the like. I do not know much about caching in a Web API, but from what I have read about it, I quickly decided against this because of potential concurrency issues when iterating over the cache.
Does anyone have a recommendation for a scenario like this? Is there another methodology I am not considering?
Given the number of records you are talking about, I would use the T-SQL MERGE statement, which updates existing records and inserts new ones in a single efficient statement.
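A minimal sketch of that MERGE, assuming the UserStatus columns from the question and a staging table variable @incoming that holds the batch of latest statuses (the staging table is an assumption):

    -- Staging table variable with the latest statuses to apply.
    DECLARE @incoming TABLE (UserID int, StatusID int, [Date] datetimeoffset);

    -- Upsert: update users that already have a row, insert the rest.
    MERGE dbo.UserStatus AS target
    USING @incoming AS src
        ON target.UserID = src.UserID
    WHEN MATCHED THEN
        UPDATE SET target.StatusID = src.StatusID,
                   target.[Date]   = src.[Date]
    WHEN NOT MATCHED THEN
        INSERT (UserID, StatusID, [Date])
        VALUES (src.UserID, src.StatusID, src.[Date]);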
Given the design you mentioned, you should be able to run a periodic maintenance script that fixes any fragmentation issues.
The solution can be scaled. If the record count got to the point where slowdown was occurring, I would consider SSDs, where fragmentation is not an issue.
If the disadvantages of SSD make that undesirable you can look into in-memory OLTP.
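For reference, a memory-optimized version of the UserStatus table might look roughly like this (SQL Server In-Memory OLTP; it requires a memory-optimized filegroup, and the bucket count is only a guess based on the ~5,000 users mentioned above):

    -- Memory-optimized UserStatus: latch-free updates, no index
    -- fragmentation to maintain.
    CREATE TABLE dbo.UserStatus
    (
        UserID   int            NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 8192),
        StatusID int            NOT NULL,
        [Date]   datetimeoffset NOT NULL,
        INDEX IX_UserStatus_Date NONCLUSTERED ([Date])
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);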

One large SQL Server table or several smaller ones?

I have a design question to ask.
Suppose I have a TBL_SESSIONS table where I keep every logged-in user's ID and a dirty flag (indicating whether the content the user has was changed since they got it).
Now, this table will be polled every few seconds by every signed-in user for that dirty flag, i.e., there are a lot of read calls on that table, and the dirty flag also changes a lot.
Suppose I'm expecting many users to be logged in at the same time. I was wondering whether there is any reason to create, say, 10 such tables and distribute the users (say, according to their user IDs) between those tables.
I'm asking this from two aspects. First, in terms of performance. Second, in terms of scalability.
I would love to hear your opinions
For 1m rows use a single table.
With correct indexing the access time should not be a problem. If you use multiple tables with users distributed across them you will have additional processing necessary to find which table to read/update.
Also, how many users will you allocate to each table? And what happens when the number of users grows... add more tables? I think you'll end up with a maintenance nightmare.
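To make the single-table point concrete, a minimal sketch (table and column names are assumptions based on the question) is just a primary key on the user ID, so every poll is a single-row seek:

    -- One row per signed-in user; the PK makes each poll a point lookup.
    CREATE TABLE TBL_SESSIONS (
        user_id    int NOT NULL PRIMARY KEY,
        dirty_flag bit NOT NULL DEFAULT 0
    );

    -- Poll issued every few seconds per user:
    SELECT dirty_flag FROM TBL_SESSIONS WHERE user_id = @user_id;

    -- Set when the user's content changes:
    UPDATE TBL_SESSIONS SET dirty_flag = 1 WHERE user_id = @user_id;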
Having several tables makes sense if you have a really large number of records, like billions, and a load your server cannot handle. In that case you can perform sharding: split one table into several on different servers, e.g. the first 100 million records on server A (IDs 1 to 100,000,000), the next 100 million on server B (100,000,001 to 200,000,000), etc. Another way is to have the same data on different servers and query it through some kind of balancer, which may be harder (replication isn't a key feature of RDBMSs, so it depends on the engine).

Best way to access averaged static data in a Database (Hibernate, Postgres)

Currently I have a project (written in Java) that reads sensor output from a microcontroller and writes it across several Postgres tables every second using Hibernate. In total I write about 130 columns' worth of data every second. Once the data is written it will stay static forever. This system seems to perform fine under the current conditions.
My question is regarding the best way to query and average this data in the future. There are several approaches I think would be viable but am looking for input as to which one would scale and perform best.
Being that we gather and write data every second, we end up generating more than 2.5 million rows per month. We currently plot this data via a JDBC select statement writing to a JChart2D (i.e. SELECT pressure, temperature, speed FROM data WHERE time_stamp BETWEEN startTime AND endTime). The user must be careful not to specify too long a time period (startTime and endTime delta < 1 day) or else they will have to wait several minutes (or longer) for the query to run.
The future goal is a user interface similar to the Google Visualization API that powers Google Finance with regard to time scaling, i.e. the longer the time period, the "smoother" (or more averaged) the data becomes.
Options I have considered are as follows:
Option A: Use the SQL avg function to return the averaged data points to the user. I think this option would get expensive if the user asks to see the data for, say, half a year. I imagine the interface in this scenario would scale the number of rows to average based on the user's request, i.e. if the user asks for a month of data the interface requests an avg of every 86,400 rows, which returns ~30 data points, whereas if the user asks for a day of data it requests an avg of every 2,880 rows, which also returns 30 data points but at finer granularity. (A rough sketch of such a query appears after this list.)
Option B: Use SQL to return all of the rows in a time interval and use the Java interface to average out the data. I have briefly tested this for kicks and I know it is expensive because I'm returning 86400 rows/day of interval time requested. I don't think this is a viable option unless there's something I'm not considering when performing the SQL select.
Option C: Since all this data is static once it is written, I have considered using the Java program (with Hibernate) to also write tables of averages along with the data it is currently writing. In this option, I have several java classes that "accumulate" data then average it and write it to a table at a specified interval (5 seconds, 30 seconds, 1 minute, 1 hour, 6 hours and so on). The future user interface plotting program would take the interval of time specified by the user and determine which table of averages to query. This option seems like it would create a lot of redundancy and take a lot more storage space but (in my mind) would yield the best performance?
Option D: Suggestions from the more experienced community?
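For what it's worth, here is a rough Postgres sketch of Option A, bucketing by hour with date_trunc rather than by a fixed row count (the data table and its columns come from the SELECT quoted above; :start_time and :end_time stand in for bind parameters):

    -- Average each measure into hourly buckets inside the database;
    -- a longer user-selected range would use a coarser date_trunc unit.
    SELECT date_trunc('hour', time_stamp) AS bucket,
           avg(pressure)    AS avg_pressure,
           avg(temperature) AS avg_temperature,
           avg(speed)       AS avg_speed
    FROM data
    WHERE time_stamp BETWEEN :start_time AND :end_time
    GROUP BY bucket
    ORDER BY bucket;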
Option A won't scale very well once you have large quantities of data to pass over; Option B will probably start out slower than A and scale even more poorly. Option C is a technique generally referred to as "materialized views", and you might want to implement it one way or another for best performance and scalability. While PostgreSQL doesn't yet support declarative materialized views (but I'm working on that this year, personally), there are ways to get there through triggers and/or scheduled jobs.
To keep the inserts fast, you probably don't want to try to maintain any views off of triggers on the primary table. What you might want to do is periodically summarize detail into summary tables from crontab jobs (or similar). You might also want to create views that present the summary data from the summary tables, combined with the detail table for ranges where summary rows don't yet exist.
The materialized view approach would probably work better for you if you partition your raw data by date range. That's probably a really good idea anyway.
http://www.postgresql.org/docs/current/static/ddl-partitioning.html
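A sketch of the cron-driven summary-table idea described above (the summary table and its columns are assumptions; the detail table is the data table from the question):

    -- Hourly summary table the plotting UI can read for long time ranges.
    CREATE TABLE data_hourly (
        bucket          timestamptz PRIMARY KEY,
        avg_pressure    double precision,
        avg_temperature double precision,
        avg_speed       double precision
    );

    -- Run from cron shortly after each hour boundary to summarize the
    -- hour that just completed.
    INSERT INTO data_hourly (bucket, avg_pressure, avg_temperature, avg_speed)
    SELECT date_trunc('hour', time_stamp),
           avg(pressure), avg(temperature), avg(speed)
    FROM data
    WHERE time_stamp >= date_trunc('hour', now()) - interval '1 hour'
      AND time_stamp <  date_trunc('hour', now())
    GROUP BY 1;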

How to auto remove an expired record from a database?

We are building a large stock and forex trading platform using a relational database. At any point during the day there will be thousands, if not millions, of records in our Orders table. Some orders, if not fulfilled immediately, expire and must be removed from this table, otherwise, the table grows very quickly. Each order has an expiration time. Once an order expires it must be deleted. Attempting to do this manually using a scheduled job that scans and deletes records is very slow and hinders the performance of the system. We need to force the record to basically delete itself.
Is there a way to configure any RDBMS to automatically remove a record based on a date/time field once that time is in the past?
Since you will most likely have to implement complex order handling, e.g. limit orders, stop-limit orders, etc., you need a robust mechanism for monitoring and executing orders in real time. This process is not limited to expired orders; it is a core mechanism in a trading platform, and you will have to design a robust solution that fulfills your needs.
To answer your question: Delete expired orders as part of your normal order handling.
Why must the row be deleted?
I think you are putting the cart before the horse here. If a row is expired, it can be made "invisible" to other parts of the system in many ways, including views which only show orders meeting certain criteria. Having extra expired rows around should not hamper performance if your database is appropriately indexed.
What level of auditing and tracking is necessary? Is no analysis ever done on expired orders?
Do fulfilled orders become some other kind of document/entity?
There are techniques in many databases which allow you to partition tables. Using the partition function, it is possible to regularly purge partitions (of like rows) much more easily.
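As a sketch of the "make them invisible" idea above (SQL Server syntax; the Orders columns are assumed names, and an index on ExpirationTime keeps the filter cheap):

    -- The rest of the system reads the view, so expired orders simply
    -- stop showing up without any deletes on the hot path.
    CREATE VIEW ActiveOrders AS
    SELECT OrderID, Symbol, Quantity, Price, ExpirationTime
    FROM Orders
    WHERE ExpirationTime > SYSUTCDATETIME();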
You have not specified which DB you are using, but let's assume you use MSSQL: you could create an Agent job that runs periodically, though you are saying that might not be a solution for you.
So what about having an INSERT trigger that, whenever a new record is inserted, deletes all the records that have expired? This would keep the number of records relatively small.
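A rough sketch of such a trigger (SQL Server syntax; the Orders table and ExpirationTime column are assumed names):

    -- Every insert into Orders also clears out orders whose expiration
    -- time has already passed; an index on ExpirationTime keeps this cheap.
    CREATE TRIGGER trg_Orders_PurgeExpired
    ON dbo.Orders
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        DELETE FROM dbo.Orders
        WHERE ExpirationTime < SYSUTCDATETIME();
    END;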

Resources