Performance implications of using crontab to automate numerous DB tasks? - database

I think I'm going to use crontab to run a bunch of scripts that will:
close all expired posts
accept uncontested disputes
add interest charges
email out invoices
send "about to expire" notifications
I want the expired stuff to be removed pretty shortly after the event occurs, so I'm thinking about writing one script that will run and check for all these various dates every 5 to 15 minutes. Can I expect any troubles doing this? At what number of "posts" might I start seeing performance issues? Are we talking thousands or millions?

If you're doing anything with the "posts" and there are only 15k per year, I wouldn't worry about performance issues on reasonably modern hardware for quite some time. Unless you're doing something pretty crazy with invoices, you shouldn't have issues with 250k per year for a good number of years either. This is shooting from the hip, though: it all depends on the hardware and on exactly what you're doing.
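As a rough sketch of the single cron-driven script described in the question, here is one of the five tasks (closing expired posts) written against SQLite. The table layout, column names, index, and crontab line are all assumptions made for illustration; the real schema and database are not given in the thread. The point is only that one indexed UPDATE per run stays cheap at the volumes discussed.

```python
import sqlite3

# Hypothetical crontab entry (path and interval are assumptions):
# */10 * * * * /usr/bin/python3 /path/to/maintenance.py


def create_schema(conn):
    """Create a hypothetical posts table and an index matching the cron query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        " id INTEGER PRIMARY KEY,"
        " expires_at TEXT NOT NULL,"  # ISO 8601 text, e.g. 2024-01-31T12:00:00
        " status TEXT NOT NULL DEFAULT 'open')"
    )
    # The index lets the UPDATE find open, expired rows without a full scan.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_posts_status_expires"
        " ON posts (status, expires_at)"
    )


def close_expired_posts(conn, now_iso):
    """Close every open post whose expiry is at or before now_iso.

    Returns the number of rows changed, so the cron job can log it.
    """
    cur = conn.execute(
        "UPDATE posts SET status = 'closed'"
        " WHERE status = 'open' AND expires_at <= ?",
        (now_iso,),
    )
    conn.commit()
    return cur.rowcount
```

Because the statement only touches rows that are still open, running it again a few minutes later does no extra work, which is what makes a 5 to 15 minute schedule safe to repeat.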


What amazon ec2 would be suggested for an ecommerce website with windows and sql server?

We are running an eCommerce venture that has around 2000 unique visitors in a day. The total data is around 6 GB as of now.
We are using SQL Server as our database and in the coming months the website may scale up to 10000 users per day.
From this link I gathered that it would be best to use an M1 instance, but could anyone help? I'm really clueless as to what to purchase from these options.
Note: Our budget is around 170 dollars per month.
EDIT: The number of concurrent users we have had is around 150
I'd try to fit everything in memory. If you can't due to budget, you need to make sure the disk response times are up to par for your expected load. Your application's behavior can vary widely. One visit to a homepage could generate many queries, or maybe you have application caching set up, so it's hard for anyone to just tell you. You should also get solid numbers on your peak number of concurrent users so you can plan for that. You don't mention your current environment, but you can get some numbers about CPU, disk MB read/written per second, and memory used to help you get the right size.
I'd look at the xlarge m1. That gives you 15GB of memory to play with. You'll be able to cache all the data you need and have some left over for the OS and also have some room to grow. CPU probably won't be your issue, but be sure to check out your current use.
If you have some time to spend on it, I'd try setting up JMeter to do some load testing and see how many concurrent users you can max out with one of the cheaper options.
This topic may be better suited to ServerFault.
I'd suggest you look at the reserved instances in the heavy category, where you can get interesting discounts if you plan to run the instance for a year or so.
But with that budget you should be thinking about an m1.medium instance, which might be a little tight for your requirements.

What is the recommended way to build functionality similar to Stackoverflow's "Inbox"?

I have an asp.net-mvc website where people manage a list of projects. Based on some algorithm, I can tell if a project is out of date. When a user logs in, I want it to show the number of stale projects (similar to the number of updates I see in the inbox).
The algorithm to calculate stale projects is kind of slow, so every time a user logs in, I would have to:
Run a query for all project where they are the owner
Run the IsStale() algorithm
Display the count where IsStale = true
My guess is that this will be really slow. Also, on every project write, I would have to recalculate the above to see if it changed.
Another idea I had was to create a table and run a job every minute to calculate stale projects and store the latest count in this metrics table. Then I would just query that when users log in. The issue there is that I still have to keep that table in sync, and if it only recalculates once every minute, then when people update projects the value won't change until up to a minute later.
Any ideas for a fast, scalable way to support this inbox concept to alert users of the number of items to review?
The first step is always proper requirement analysis. Let's assume I'm a Project Manager. I log in to the system and it displays my only project as on time. A developer comes to my office and tells me there is a delay in his activity. I select the developer's activity and change its duration. The system still displays my project as on time, so I happily leave work.
How do you think I would feel if I receive a phone call at 3:00 AM from the client asking me for an explanation of why the project is no longer on time? Obviously, quite surprised, because the system didn't warn me in any way. Why did that happen? Because I had to wait 30 seconds (why not only 1 second?) for the next run of a scheduled job to update the project status.
That just can't be a solution. A warning must be sent immediately to the user, even if it takes 30 seconds to run the IsStale() process. Show the user a loading... image or anything else, but make sure the user has accurate data.
Now, regarding the implementation, nothing can be done to run away from the previous issue: you will have to run that process when something that affects some due date changes. However, what you can do is not unnecessarily run that process. For example, you mentioned that you could run it whenever the user logs in. What if 2 or more users log in and see the same project and don't change anything? It would be unnecessary to run the process twice.
What's more, if you make sure the process is run when the user updates the project, you won't need to run the process at any other time. In conclusion, this schema has the following advantages and disadvantages compared to the "polling" solution:
Advantages
No scheduled job
No unneeded process runs (this is arguable because you could set a dirty flag on the project and only run it if it is true)
No unneeded queries of the dirty value
The user will always be informed of the current and real state of the project (which is by far, the most important item to address in any solution provided)
Disadvantages
If a user updates a project and then updates it again in a matter of seconds, the process would be run twice (in the polling schema the process might not even be run once in that period, depending on how frequently it has been scheduled)
The user who updates the project will have to wait for the process to finish
As for how to implement the notification system in a similar way to StackOverflow, that's quite a different question. I guess you have a many-to-many relationship between users and projects. The simplest solution would be adding a single attribute to the relationship between those entities (the middle table):
Cardinalities: A user has many projects. A project has many users
That way when you run the process you should update each user's Has_pending_notifications with the new result. For example, if a user updates a project and it is no longer on time then you should set to true all users Has_pending_notifications field so that they're aware of the situation. Similarly, set it to false when the project is on time (I understand you just want to make sure the notifications are displayed when the project is no longer on time).
Taking StackOverflow's example, when a user reads a notification you should set the flag to false. Make sure you don't use timestamps to guess if a user has read a notification: logging in doesn't mean reading notifications.
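The flag-on-the-middle-table idea above could look roughly like the following sketch. The table name `user_projects`, the column names, and the three helper functions are hypothetical; only the `Has_pending_notifications` attribute comes from the answer itself. SQLite is used just to keep the example self-contained.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical middle table for the user/project many-to-many relation."""
    conn.execute(
        "CREATE TABLE user_projects ("
        " user_id INTEGER NOT NULL,"
        " project_id INTEGER NOT NULL,"
        " has_pending_notifications INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (user_id, project_id))"
    )


def on_project_status_changed(conn, project_id, on_time):
    """Call after the slow check runs: flag (or clear) every user of the project."""
    conn.execute(
        "UPDATE user_projects SET has_pending_notifications = ?"
        " WHERE project_id = ?",
        (0 if on_time else 1, project_id),
    )
    conn.commit()


def pending_count(conn, user_id):
    """Cheap query for the inbox badge shown at login."""
    row = conn.execute(
        "SELECT COUNT(*) FROM user_projects"
        " WHERE user_id = ? AND has_pending_notifications = 1",
        (user_id,),
    ).fetchone()
    return row[0]


def mark_read(conn, user_id, project_id):
    """Clear the flag only when this user actually reads the notification."""
    conn.execute(
        "UPDATE user_projects SET has_pending_notifications = 0"
        " WHERE user_id = ? AND project_id = ?",
        (user_id, project_id),
    )
    conn.commit()
```

With this layout the login page only ever runs `pending_count`, and the expensive staleness check runs once per project update rather than once per login.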
Finally, if the notification itself is complex enough, you can move it away from the relationship between users and projects and go for something like this:
Cardinalities: A user has many projects. A project has many users. A user has many notifications. A notification has one user. A project has many notifications. A notification has one project.
I hope something I've said has made sense, or given you some other, better idea :)
You can do as follows:
To each user record, add a datetime field saying the last time the slow computation was done. Call it LastDate.
To each project, add a boolean to say whether it has to be listed. Call it Selected.
When you run the slow procedure, update the Selected fields.
Now when the user logs in, if LastDate is close enough to now, use the results of the last slow computation and just take all projects with Selected true. Otherwise, run the slow computation again.
The above procedure is optimal because it re-runs the slow procedure ONLY IF ACTUALLY NEEDED, while running a procedure at fixed intervals risks wasting time, because the user may never use the result of a computation.
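A minimal sketch of the LastDate/Selected procedure, assuming a SQLite database and a caller-supplied `slow_is_stale` function standing in for the expensive IsStale() check. The table layout, the `max_age` threshold, and the function name `selected_projects` are assumptions for illustration only.

```python
import sqlite3
from datetime import datetime, timedelta


def create_schema(conn):
    """Hypothetical tables: users carry LastDate, projects carry Selected."""
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_date TEXT)")
    conn.execute(
        "CREATE TABLE projects ("
        " id INTEGER PRIMARY KEY,"
        " owner_id INTEGER NOT NULL,"
        " selected INTEGER NOT NULL DEFAULT 0)"
    )


def selected_projects(conn, user_id, now, max_age, slow_is_stale):
    """Return the user's stale project ids, recomputing only if the cache is old."""
    row = conn.execute(
        "SELECT last_date FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    last = datetime.fromisoformat(row[0]) if row and row[0] else None

    if last is None or now - last > max_age:
        # Cache is missing or too old: run the slow check once per project.
        ids = [
            r[0]
            for r in conn.execute(
                "SELECT id FROM projects WHERE owner_id = ?", (user_id,)
            )
        ]
        for pid in ids:
            conn.execute(
                "UPDATE projects SET selected = ? WHERE id = ?",
                (1 if slow_is_stale(pid) else 0, pid),
            )
        conn.execute(
            "UPDATE users SET last_date = ? WHERE id = ?",
            (now.isoformat(), user_id),
        )
        conn.commit()

    # Fast path: read the stored flags.
    return [
        r[0]
        for r in conn.execute(
            "SELECT id FROM projects WHERE owner_id = ? AND selected = 1"
            " ORDER BY id",
            (user_id,),
        )
    ]
```

The trade-off is the one raised elsewhere in this thread: within `max_age` the user may see a slightly outdated list, so the threshold should match how much staleness is acceptable.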
Make a field "stale".
Run a SQL statement that updates stale=1 with all records where stale=0 AND (that algorithm returns true).
Then run a SQL statement that selects all records where stale=1.
The reason this will work fast is that SQL engines, like PHP, shouldn't evaluate the second half of the AND condition if the first half is false, making it a very fast run through the whole list, checking all the records, trying to make them stale IF NOT already stale. If a record is already stale, the algorithm won't be executed, saving you time. If it's not, the algorithm will be run to see if it has become stale, and then stale will be set to 1.
The second query then just returns all the stale records where stale=1.
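Here is one way the two statements might be sketched. Since SQL engines generally do not promise a particular evaluation order for the operands of AND, this version does not rely on short-circuiting: it first selects only the rows with stale = 0 and runs the check on those. The `projects` table, its `due_date` column, and the `is_stale` callback are hypothetical stand-ins for the real schema and algorithm.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical projects table with the extra 'stale' field."""
    conn.execute(
        "CREATE TABLE projects ("
        " id INTEGER PRIMARY KEY,"
        " due_date TEXT NOT NULL,"
        " stale INTEGER NOT NULL DEFAULT 0)"
    )


def refresh_stale(conn, is_stale):
    """Run the slow check only on rows not already marked stale.

    Returns how many rows were newly marked.
    """
    rows = conn.execute(
        "SELECT id, due_date FROM projects WHERE stale = 0"
    ).fetchall()
    newly_stale = [(pid,) for pid, due in rows if is_stale(due)]
    conn.executemany("UPDATE projects SET stale = 1 WHERE id = ?", newly_stale)
    conn.commit()
    return len(newly_stale)


def stale_ids(conn):
    """The second query: everything currently marked stale."""
    return [
        r[0]
        for r in conn.execute("SELECT id FROM projects WHERE stale = 1 ORDER BY id")
    ]
```

Note that this scheme only ever moves rows from 0 to 1; something else would have to reset the field when a project stops being stale.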
You can do this:
In the database change the timestamp every time a project is accessed by the user.
When the user logs in, pull all their projects. Check the timestamp and compare it with today's date; if it's older than n days, add it to the stale list. I don't believe that comparing dates will result in any slow logic.
I think the fundamental questions need to be resolved before you think about databases and code. The primary of these is: "Why is IsStale() slow?"
From comments elsewhere it is clear that the premise that this is slow is non-negotiable. Is this computation out of your hands? Are the results resistant to caching? What level of change triggers the re-computation?
Having written scheduling systems in the past, there are two types of changes: those that can happen within the slack and those that cause cascading schedule changes. Likewise, there are two types of rebuilds: total and local. Total rebuilds are obvious; local rebuilds try to minimize "damage" to other scheduled resources.
Here is the crux of the matter: if you have total rebuild on every update, you could be looking at 30 minute lags from the time of the change to the time that the schedule is stable. (I'm basing this on my experience with an ERP system's rebuild time with a very complex workload).
If the reality of your system is that such tasks take 30 minutes, having a design goal of instant gratification for your users is contrary to the ground truth of the matter. However, you may be able to detect schedule inconsistency far faster than the rebuild. In that case you could show the user "schedule has been overrun, recomputing new end times" or something similar... but I suspect that if you have a lot of schedule changes being entered by different users at the same time the system would degrade into one continuous display of that notice. However, you at least gain the advantage that you could batch changes happening over a period of time for the next rebuild.
It is for this reason that most of the scheduling problems I have seen don't actually do real time re-computations. In the context of the ERP situation there is a schedule master who is responsible for the scheduling of the shop floor and any changes get funneled through them. The "master" schedule was regenerated prior to each shift (shifts were 12 hours, so twice a day) and during the shift delays were worked in via "local" modifications that did not shuffle the master schedule until the next 12 hour block.
In a much simpler situation (software design) the schedule was updated once a day in response to the day's progress reporting. Bad news was delivered during the next morning's scrum, along with the updated schedule.
Making a long story short, I'm thinking that perhaps this is an "unask the question" moment, where the assumption needs to be challenged. If the re-computation is large enough that continuous updates are impractical, then aligning expectations with reality is in order. Either the algorithm needs work (optimizing for local changes), the hardware farm needs expansion or the timing of expectations of "truth" needs to be recalibrated.
A more refined answer would frankly require more details than "just assume an expensive process" because the proper points of attack on that process are impossible to know.

How stable is Cassandra?

I've been running a Cassandra instance without a reboot for a few days for a simple task: storing tweets, at 1-2 saves per second. After that, Cassandra got really slow and I had to kill and restart it.
Is this Cassandra's expected stability now? Would it be a good solution to write a daemon to kill/restart it every day or two?
No. Cassandra is widely expected to be more stable than that. If it is not stable, there is a substantial chance you have configured it wrong. It may be attempting to use more memory than you expect, for instance. If you have encountered a bug or defect in Cassandra, it is not one which is afflicting the majority of users.
As for your "restart daemon" plan, I'm going to go with "that's a horrible solution for pretty much everything, and especially so for something that you're trusting with any data you actually care about."
From https://cassandra.apache.org/
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more
companies that have large, active data sets. The largest known
Cassandra cluster has over 300 TB of data in over 400 machines.
It was (is?) largely used at Facebook too. I would say it's stable. :)
And btw, I don't think it's meant to be restarted every 1-2 days at all: you use it if you have huge data sets with high availability (HA) requirements, and going down every 2 days is not HA.

Performance Testing - How much data should I create

I'm very new to performance engineering, so I have a very basic question.
I'm working in a client-server system that uses SQL server backend. The application is a huge tax-related application that requires testing performance at peak load. Meaning that there should be like 10 million tax returns in the system when we run scenarios related to creating tax returns and submitting them. Then there will also be proportional number of users that need to be created.
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
When one talks about creating a smaller dataset and extrapolating for performance planning, a very common answer I hear is that we need 10 million records because we cannot tell from a smaller data set how the database or network will behave.
So how does one plan capacity and test performance on large enterprise application without creating peak level of data or running peak number of scenarios?
Thanks.
Personally, I would throw as much data and traffic at it as you can. Forget what traffic you "think you need to handle". And just see how much traffic you CAN handle and go from there. Knowing the limits of your system is more valuable than simply knowing it can handle 10 million records.
Maybe it does handle 10 million, but at 11 million it dies a horrible death. Or maybe it's well written and will scale to 100 million before it dies. There's a very distinct difference between the two even though both pass the "10 million test"
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
Why do you think so?
Of course you can (and should) test with limited amounts of data, but you also really, really need to test with a realistic load, which means testing with the amount (and type) of data that you will use in production.
This is just a special case of a general rule: for system or integration testing, you need to test in a scenario that is as close as possible to production; ideally you just copy/clone a live production system, data, config and all, and use that for testing. That is actually what we do (if we technically can and the client agrees). We just run a few SQL scripts to randomize personal data in the test data set, to prevent privacy concerns.
There are always issues that crop up because production data is somehow different from what you tested on, and this is the only way to prevent (or at least limit) these problems.
I've planned and implemented reporting and imports, and they invariably break or misbehave the first time they're exposed to real data, because there are always special cases or scaling problems you didn't expect. You want that breakage to happen during development, not in production :-).
In short:
Bite the bullet, and (after having done all the tests with "toy data"), get a realistic dataset to test on. If you don't have the hardware to handle that, then you don't have the right hardware for your tests :-).
I would take a look at Redgate's SQL Data Generator. It does a good job of generating representative data.
Have a peek at "The art of application performance testing / Ian Molyneaux, O’Reilly, 2009".
Your test data is ideally a realistic variety of records. But for first approximations you could have just a few unique records, and duplicate them until you have the desired size. Then use ApacheBench to roughly approximate the traffic.
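The "duplicate a few unique records until you reach the desired size" approach can be sketched as below. The `tax_returns` table and its columns are made up for the example, and SQLite stands in for SQL Server purely so the snippet is self-contained. Each pass copies existing rows back into the table, so the row count roughly doubles per pass until the target is reached.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical table holding the records to be multiplied."""
    conn.execute(
        "CREATE TABLE tax_returns ("
        " id INTEGER PRIMARY KEY,"
        " filer TEXT NOT NULL,"
        " amount REAL NOT NULL)"
    )


def grow_table(conn, target_rows):
    """Duplicate existing rows until the table holds target_rows rows."""
    count = conn.execute("SELECT COUNT(*) FROM tax_returns").fetchone()[0]
    if count == 0:
        raise ValueError("seed the table with at least one row first")
    while count < target_rows:
        # Copy at most as many rows as are still missing.
        batch = min(count, target_rows - count)
        conn.execute(
            "INSERT INTO tax_returns (filer, amount)"
            " SELECT filer, amount FROM tax_returns LIMIT ?",
            (batch,),
        )
        count += batch
    conn.commit()
    return count
```

As the answer says, this is only a first approximation: duplicated rows have unrealistically repetitive values, so indexes and query plans may behave differently than they would on a realistic variety of records.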
To help generate data, look at ruby faker and perl data faker. I have had good luck with them in generating large data sets for testing. SQL Data Generator from Redgate is good too.

If you were asked if a system could sustain double growth, what 3 things would you do to answer?

Let's say at your job your boss says,
That system over there, which has lost all institutional knowledge but seems to run pretty good right now, could we dump double the data in it and survive?
You're completely unfamiliar with the system.
It's in SQL Server 2000 (primarily a database app).
There's no test environment.
You might be able to hijack it on the weekends if you needed to run a benchmark.
What would be the 3 things you'd do to convince yourself, and then your manager, that you could take on that extra load? And if you couldn't do it on the same hardware, how much extra hardware (measured in dollars) would it take to satisfy that request?
To address the response from doofledorfer: your assumptions are almost all 180 degrees off. But that's my fault for an ambiguous question.
One of the main servers runs 7x24 at 70% base and spikes from there and no one knows what it is doing.
This isn't an issue of buy-in or whining... Our company may not have much of a choice in the matter.
Because this is being externally mandated, delays in implementation could result in huge fines. So large meetings to assess risk are almost impossible. There is one risk: that dumping double the data would take the system down for the existing customers.
I was hoping someone would say something like: see if you can take the system offline Sunday night at midnight and run SQLIO tests to see how close the storage subsystem is to saturation. Things like that.
Set up a test environment, even if I have to do it on my laptop.
Enable some kind of logging on the production system to get an idea of the volume of transactions in addition to the volume of data.
Read the source code as I run stress tests on my laptop with increasing amounts of data.
Having said that, I sympathize with this assignment, because it's unfair. It's like asking someone in a boat if the boat can float with twice the cargo -- but you can't get out of the boat or take it out of its regular service.
You've just described a typical Agile project. Your answer should be:
I don't know, and I won't be able to tell without testing.
In addition to data volume, there might be issues with usage patterns, application interactions, database and server tuning, etc.
So let's work through a basic list of risk factors, and how we might resolve them.
Once we've done that, let's work through them in inverse order of risk; and make a stop/continue decision as we develop the results.
etc.
Without management buy-in and participation at least at that level, any other answer you might give is high-risk wishing, and "3 most important" is a non sequitur.
I'd be optimistic unless your current system is substantially loaded already. Most servers should run at less than 50% capacity on all resources, or else be on life-support.
And I expect you wouldn't be having the conversation if the existing server were already dealing with load issues; although "seems to run pretty good right now" is imprecise enough to be worrisome.
It mostly depends on its current level. If doubling is going from 2GB to 4GB, just do it. If it's going from 1TB to 2TB, you've got some planning to do.
I'd collect some info using Performance Monitor and provide it to help make an educated decision.
It depends what you mean by "double the data".
If that is going to affect one table only (say, a product table), then you are probably safe, as most queries that refer to that one table are most likely to double their execution time (this assumes that you do not reference the same table twice in a query).
The problem will arise if you double the amount of data in all the tables, as the execution time may then grow in an exponential fashion, which can lead to some serious issues.
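To put rough numbers on the one-table versus all-tables distinction, the sketch below counts worst-case row combinations for a join, assuming a naive nested-loop join with no usable indexes. That assumption is mine, not the answer's; real query plans with good indexes scale far better. The function name is hypothetical.

```python
from math import prod


def worst_case_combinations(table_sizes):
    """Row combinations a naive nested-loop join would consider.

    table_sizes is a list with one row count per joined table.
    """
    return prod(table_sizes)


# Three joined tables (sizes are made up for illustration).
base = worst_case_combinations([1000, 500, 200])
one_doubled = worst_case_combinations([2000, 500, 200])
all_doubled = worst_case_combinations([2000, 1000, 400])

# Growth factors relative to the original data volume.
one_factor = one_doubled // base
all_factor = all_doubled // base
```

Doubling a single table doubles the work, while doubling all k joined tables multiplies it by 2 to the power k under this worst-case model, which is one reading of the "exponential" growth the answer warns about.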
But in general I would support the answer by doofledorfer
