I've been using Cassandra instance without reboot for a few days for a simple
task for storing tweets, 1-2 saves in a second. After then Cassandra got really slow and I had to kill a restart it.
Is this Cassandras's expected stability now? Would it be a good solution to write a daemon to kill/restart it every day or two?
No. Cassandra is widely expected to be more stable than that. If it is not stable, there is a substantial chance you have configured it wrong. It may be attempting to use more memory than you expect, for instance. If you have encountered a bug or defect in Cassandra, it is not one which is afflicting the majority of users.
As for your "restart daemon" plan, I'm going to go with "that's a horrible solution for pretty much everything, and especially so for something that you're trusting with any data you actually care about."
From https://cassandra.apache.org/
Cassandra is in use at Netflix, Twitter, Urban Airship, Constant
Contact, Reddit, Cisco, OpenX, Digg, CloudKick, Ooyala, and more
companies that have large, active data sets. The largest known
Cassandra cluster has over 300 TB of data in over 400 machines.
It was (is?) largely used at Facebook too. I would say it's stable. :)
And btw, I don't think it's meant to be restarted every 1-2 days at all: you use it if you have huge data sets with high availability (HA) requirements, and going down every 2 days is not HA.
Related
I am new in learning distributed systems and I read about the CAP theorem, I am interested in an AP system such as Cassandra.
My question is in what cases can you actually sacrifice consistency? Effectively what I am saying is sacrificing consistency means serving inaccurate data. In what cases would then you actually use an AP datastore like Cassandra? I can't think of any case where I wouldn't want my reads to be consistent.
By AP system, I assume you will at least target to ensure eventual consistency.
Imagine you're developing a social network where users have friends and their own news feeds. It doesn't matter if a particular user's feed has occasional five minutes lag (his feed list has eventual consistency). Missing 2/3 very recent updates in the news feed is okay in this scenario as long as those feeds will eventually appear. And in fact, Facebook built it's news feed using Cassandra.
Imagine a distributed key-value store cache system where update is very rare. If there is almost no update operations, ensuring strong consistency is un-necessary, so you can focus on availability. Occasional cache miss (the key-value entry is not populated yet) and request to database due to eventual consistency should be okay.
My question is in what cases can you actually sacrifice consistency?
One case would be when building a recommendation engine data set and serving it with Cassandra. These data sets are essentially the aggregation of many, many users to determine purchasing/viewing patterns.
For example: If I add a Rey Star Wars action figure to my shopping cart, the underlying recommendation engine runs a query for similar resulting purchasing patterns based on others who have also purchased an action figure of Rey. The query returns the top 5 product results, and puts them at the bottom of the page.
Those 5 products returned are the result of analysis and aggregation of several thousand prior purchases. Let's assume that some of that data isn't consistent, causing a variance in the 5 products returned. Is that really a big deal?
tl;dr; The real question to ask; is whether or not getting a somewhat-accurate list of 5 product recommendations in less than 10ms, is better than getting a 100% accurate list of 5 product recommendations in 100ms?
Both result sets will help drive sales. But the one which is returned fast enough that it doesn't hinder the user experience is much more preferred.
'C' in CAP refers to linearizability which is a very strong form of consistancy that you don't need most of the time.
Linearizability is a recency guarantee which makes it appear that there is a single copy of data. As soon as you make a change in the data, all subsequent reads will return the changed data. Such a level of consistency is expensive and doesn't scale well. Yet in certain scenarios we need linearizability, viz.
Leader election
Allowing end users to create their unique user id
Distributed locking etc.
When you have these usecases, you'd use something like ZooKeeper, etcd etc. Cassandra also has Light Weight Transaction (LWT) which uses an extension of the classic Paxos algorithm to implement linearizability. This feature can be used to address those rare use cases where you must have linearizability and serializability, but it is expensive. And in vast majority of cases you are just fine with a little weaker consistency to get better scalability and performance. You trade a little bit of consistency with scalability and performance.
Some eCommerce websites send apology letter to customers for not being able to fulfill their orders. That is because the last copy of the product has been sold to more than one customers due to lack and linearizability. They prefer to deal with that over not being able to scale with the customer base and not being able to respond to their requests within stringent SLAs.
Cassandra is said to have a tuneable consistency. You may want to record user clicks or activities for analysis. You are okay if some data are lost, but you cannot compromise with the performance. You'd probably use a write consistency level of ANY with hints enabled (sloppy quorum).
If you want a little more consistency, you'd use a QUORUM consistency level to read and write along with hints and read repair. In vast majority of case all nodes are updated instantaneously. Even if one or two nodes go down, a majority of nodes will have the data and failed nodes would be repaired when they come back using hints, read repair, anti entropy repair.
Cassandra is particularly useful for cases where you'd not have many concurrent updates on same data. The reason is, unlike the dynamo architecture, it does not use vector clocks for conflict resolution between replicas. Instead it uses Last Write Wins (LWW) based on timestamp. If timestamps are same, it uses lexicographical order. Since the time on nodes cannot be accurate even in the presence of NTPD, there is a possibility of data loss, although Cassandra has taken some steps to avoid that - for e.g. client side timestamp instead of server side timestamp.
The CAP theorem says that given partition tolerence, you can either choose availability or consistency in a distributed database (no one would want to give up partition tolerence in any case). So if you want to have maximum availability, you'll have to give up on the consistency. This depends of course, on how critical the business is.
You answered something on SO but the answer doesn't show up when you visit the page? Can be tolerated. SO being down? Can't be. Critical financial systems would rather have strong consistency than availability. Every once-in-a-while, my bank's servers would go offline when I try to make a payment.
Normally, you choose availability and eventual consistency. The answer you wrote into SO would eventually show up.
Apart from the above mentioned cases where inconsistent data is tolerable, there are also scenarios where we can defer to the user to solve the inconsistency.
For example, if we found two different versions of someone's address in the database, we can prompt the user to identity the correct address.
I want to scale an e-commerce portal based on LAMP. Recently we've seen huge traffic surge.
What would be steps (please mention in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? what could be potential problems in switching servers?
Do we need to redesign database? I read, Facebook switched to Cassandra from MySql. What kind of code changes are required if switched to Cassandra? Would Cassandra be better option than MySql?
Possibility of Hadoop, not even sure?
Any other things, which need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing - measure how many requests per second can serve you most-visited pages. For well-written PHP sites on average hardware it must be in 200-400 requests per second range. If you are not there - you have to optimize the code by reducing number of database requests, caching rarely changed data in memcached/shared memory, using PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighthttpd or nginx+apache2. Personally, I like the second option.
3) Then you move all your static data to separate server or CDN. Make sure it is served with "expires" headers, at least 24 hours.
4) Only after all these things you might start thinking about going to EC2/Hadoop, build multiple servers and balancing the load (nginx would also help you there)
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for single more powerfull server (8-16 cores, lots of RAM for caching & database).
With step 4 and multiple servers you are on your way to 0.1-1billion hits per day (but for significantly larger hardware & support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what is your biggest resource usage is important when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- higher than needed bandwidth use x user is something you want to address regardless of moving to ec2. It will cost you money either way, so its worth a shot at looking at things like this: http://developer.yahoo.com/yslow/
- don't invest into changing databases if that's a non issue. Find out first if that's really the problem, and even if you are having issues with the database it might be a code issue i.e. hitting the database lots of times per request.
- unless we are talking about v. big numbers, you shouldn't have high cpu usage issues, if you do find out where they are happening / optimization is worth it where specific code has a high impact in your overall resource usage.
- after making sure the above is reasonable, you might get big improvements with caching. In bandwith (making sure browsers/proxy can play their part on caching), local resources usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere in v. few months. Also enough to find out where are your biggest gains, and if you will get enough value from any scaling options. This will also allow you to come back and ask questions about specific problems, and how these scaling options relate to those.
You should prepare by choosing a flexible framework and be sure things are going to change along the way. In some situations it's difficult to predict your user's behavior.
If you have seen an explosion of traffic recently, analyze what are the slowest pages.
You can move to cloud, but EC2 is not the best performing one. Again, be sure there's no other optimization you can do.
Database might be redesigned, but I doubt all of it. Again, see the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.
I'm very new to performance engineering, so I have a very basic question.
I'm working in a client-server system that uses SQL server backend. The application is a huge tax-related application that requires testing performance at peak load. Meaning that there should be like 10 million tax returns in the system when we run scenarios related to creating tax returns and submitting them. Then there will also be proportional number of users that need to be created.
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
When one talks about creating a smaller dataset and extrapolating the performance planning, a very common answer I hear is that we need to 10 million records because we cannot tell from a smaller data set how the database or network will behave.
So how does one plan capacity and test performance on large enterprise application without creating peak level of data or running peak number of scenarios?
Thanks.
Personally, I would throw as much data and traffic at it as you can. Forget what traffic you "think you need to handle". And just see how much traffic you CAN handle and go from there. Knowing the limits of your system is more valuable than simply knowing it can handle 10 million records.
Maybe it does handle 10 million, but at 11 million it dies a horrible death. Or maybe it's well written and will scale to 100 million before it dies. There's a very distinct difference between the two even though both pass the "10 million test"
Now I'm hearing in meetings that we need to create 10 million records to test performance and run scenarios with 5000 users and I just don't think it is feasible.
Why do you think so?
Of course you can (and should) test with limited amounts of data, but you also really, really need to test with a realistic load, which means testing with the amount (and type) of data that you will use in production.
This is just a special case of a general rule: For system or integration testing, you need to test in a scenario that is as close as possible to production; ideally you just copy/clone a live production system, data, config and all and use that for testing. That is actually what we do (if we technically can and the client agrees). We just run a few SQL scripts to randomize personal data in the test data set, so prevent privacy concerns.
There are always issues that crop up because production data is somehow different from what you tested on, and this is the only way to prevent (or at least limit) these problems.
I've planned and implemented reporting and imports, and they invariably break or misbehave the first time they're exposed to real data, because there are always special cases or scaling problems you didn't expect. You want that breakage to happen during development, not in production :-).
In short:
Bite the bullet, and (after having done all the tests with "toy data"), get a realistic dataset to test on. If you don't have the hardware to handle that, then you don't have the right hardware for your tests :-).
I would take a look at Redgate's SQL Data Generator. It does a good job of generating representative data.
Have a peek at "The art of application performance testing / Ian Molyneaux, O’Reilly, 2009".
Your test data is ideally a realistic variety of records. But for first approximations you could have just a few unique records, and duplicate them until you have the desired size. Then use ApacheBench to roughly approximate the traffic.
To help generate data look at ruby faker and perl data faker. I have had good luck with it in generating large data sets for testing. SQL generator from redgate is good too.
I have written a PHP application which requires storage of millions of integers between 0 and 10,000,000 inclusive. Each number is incremented by one very frequently (on average 100 values are updated every second) and read very frequently (20,000 reads per second). The numbers are reset to 0 either nightly, weekly, monthly or annually.
I've got a fairly good handle on MySQL but it feels like overkill, and not very efficient in the process.
Has anyone had to deal with this before, and/or could shed some light on a suitable data storage system?
Depending on how much memory you have, and how long you have to keep them around - you could just use the APC cache, or memcache. memcache has a nice increment operation, so it's crazy efficient doing it.
BTW, you said you need 20K reads. And you are doing it in PHP? on one server or on a cluster? Sounds like you need some architectural guidance..... :)
If it's a web-app, you are going to be using a cluster. And if it's not a web app, I'd suggest you think about re-doing it in something which isn't PHP. Java, C++ and c# come to mind.
Let's say at your job your boss says,
That system over there, which has lost all institutional knowledge but seems to run pretty good right now, could we dump double the data in it and survive?
You're completely unfamiliar with the system.
It's in SQL Server 2000 (primarily a database app).
There's no test environment.
You might be able to hijack it on the weekends if you needed to run a benchmark.
What would be the 3 things you'd do to convince yourself and then your manager that you could take on that extra load. And if you couldn't do it, on the same hardware... the extra hardware (measured in dollars) it would take to satisfy that request.
To address the response from doofledorfer, you assumptions are almost all 180 degrees off. But that's my fault for an ambiguous question.
One of the main servers runs 7x24 at 70% base and spikes from there and no one knows what it is doing.
This isn't an issue of buy-in or whining... Our company may not have much of a choice in the matter.
Because this is being externally mandated, delays in implementation could result in huge fines. So large meeting to assess risk are almost impossible. There is one risk, that dumping double the data would take the system down for the existing customers.
I was hoping someone would say something like, see if you take the system off line Sunday night at midnight and run SQLIO tests to see how close the storage subsystem is to saturation. Things like that.
Set up a test environment, even if I have to do it on my laptop.
Enable some kind of logging on the production system to get an idea of the volume of transactions in addition to the volume of data.
Read the source code as I run stress tests on my laptop with increasing amounts of data.
Having said that, I sympathize with this assignment, because it's unfair. It's like asking someone in a boat if the boat can float with twice the cargo -- but you can't get out of the boat or take it out of its regular service.
You've just described a typical Agile project. Your answer should be:
I don't know, and I won't be able to tell without testing.
In addition to data volume, there might be issues with usage patterns, application interactions, database and server tuning, etc.
So let's work through a basic list of risk factors, and how we might resolve them.
Once we've done that, let's work through them in inverse order of risk; and make a stop/continue decision as we develop the results.
etc.
Without management buy-in and participation at least at that level, any other answer you might give is high-risk wishing, and "3 most important" is a non sequitur.
I'd be optimistic unless your current system is substantially loaded already. Most servers should run at less than 50% capacity on all resources, or else be on life-support.
And I expect you wouldn't be having the conversation if the existing server were already dealing with load issues; although "seems to run pretty good right now" is imprecise enough to be worrisome.
it mostly depends on its current level. If doubling is going from 2GB to 4GB just do it. If it's going from 1TB to 2TB you've got some planning to do.
I'd collect some info using Performance Monitor and provide it to help make an educated decision.
It depends what you mean by "double the data".
If that is going to affect one table only (say product table) then you are probably safe as most queries that are referring to that one table are most likely to double the time of execution (that assumes that you do not reference the same time twice in a query).
The problem will arise if you double the amount of data in all the tables as the execution time may grow in exponential fashion then and it can lead to some serious issues.
But in general I would support the answer by doofledorfer