simple Solr deployment with two servers for redundancy - solr

I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers,
to provide redundancy and improved availability. At this point, scalability is not a issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.

The simple approach to redundancy your considering seems reasonable but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:-
Make backups for disaster recovery and test those backups work as an index could conceivably have been corrupted as there are no checksums happening internally in SOLR/Lucene. An index could get wiped or some records could get deleted and merged away without you knowing it and backups can be useful for recovering those records/docs at a later time if you need to perform an investigation.
Before you re-route traffic to the second instance I would run some queries to load caches and also to test and confirm the current index works before it goes online.
Isolate the updates to one location and process and thread to ensure transactional integrity in the event of a cutover as it could be difficult to manage consistency as SOLR does not use a vector clock to synchronize updates like some databases. I personally would keep a copy of all updates in order separately from SOLR in some other store just in case a small time window needs to be repeated.
In general, my experience with SOLR has been excellent as long as you are not using cutting edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you wont have issues but gives you an idea of how stable it could be.

I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.

Related

How do I find the cause of an IIS/SQL timeout?

I have a web service sitting on IIS that has been quite happy for months but now I'm getting timeouts and I don't know how to diagnose what the problem is.
The client sends up basic information in a 'heartbeat' message to IIS which then updates this in a SQL database (on a different server). There are 250 clients in the wild, all sending up their heartbeat every 5 minutes ... so there's only 250 rows in the table, with appropriate indexing on the column being used for the update.
Ordinarily it only takes 50-100ms to do the update, but since last week you can see that the response time in the IIS log has increased and I'm also getting timeouts too.
Nothing has changed with the setup so I don't know what I'm looking for to determine the reason. The error I get back is:
System.ServiceModel.FaultException: An error occurred while updating
the entries. See the inner exception for details.An error occurred
while updating the entries. See the inner exception for
details.Execution Timeout Expired. The timeout period elapsed prior to
completion of the operation or the server is not responding. The
statement has been terminated.The wait operation timed out
Any advice on where to start looking? I did enable the failed request log trace in IIS but I don't know what it all means if I'm perfectly honest. The difference between a successful requiest and a failed one is that the request log stops after the 'AspNetStart' entry.
Thanks!
Mark
There are lots of reasons a service can gradually or suddenly become slow. Poor code structure can lead to things like memory leaks on the server, small enough they don't really show up or cause problems during testing, but when run over weeks/months start to stack up. Unauthorized requests could be targeting your server if this is a public-facing service, or has a link to public-facing services.
Things to look at:
Does this happen at certain times of the day or throughout the day?
Is this a load issue that starts occurring when multiple users are sending updates concurrently? 250 users isn't a lot. Has the # of users grown over the last few months or has it been relatively stable since the start?
What is the memory and CPU usage looking like on the Web server(s) and DB server?
This is the first clue to check to see if either server is under considerable load. From there you can investigate why it might be under load or if it possibly needs a bit more grunt to deal with the load. Look at the running processes. If these servers are managed by an IT department or such some culprits can include things like Virus Scanners hogging resources. (I.e. policy changes in the last few months have lead to additional load on the servers)
What recovery model is your database set up for?
What is the size of your Tx Log (.mdx file)
Do you have a regular scheduled database backup and index maintenance?
This is one that new projects tend to forget. An empty database is small and has no Tx Log history being recorded, but as it runs over time that Tx Log grows silently in the background, especially with Full recovery. Larger Tx Logs can lead to slower performance over time especially if the log file needs to be enlarged. A good thing to check is whether the log file is set to grow by a # of bytes or percentage. Percentage is I believe the default but this can cause exponential "grow" time/space issues so it's better to set it to a fixed size per grow. You'll want regular backups being done that allow the Tx Log to reset. Ideally don't shrink the file if the Log size between backups stays consistent.
How many records across all tables are being inserted or updated in a given day?
This is important to build a picture of how much the database will be tracking through the day between backups. You may have 250 clients, but every heartbeat is potentially updating a row and inserting others.
What are you using for PKs for inserted records? (Ints vs. UUIDs) If using UUIDs are you using NEWSEQUENTIALID() or NEWID()/Guid.New()?
GUIDs can be a time bomb for indexing if done poorly. A GUID combined with NEWID() or Guid.New() will lead to considerable index fragmentation when inserting rows. Provided the GUIDs are not visible to clients you should use NEWSEQUENTIALID(). If IDs are set via code then there are implementations you can find to generate sequential GUIDs. (It's a matter of re-arranging the parts that make up the GUID) Regular index maintenance is a requirement for using UUID columns in indexed fields.
Are you using Dependency Injection in your web service?
What is the lifetime scope of the DbContexts performing the updates?
This is a potential time bomb for web servers if the lifetime scope for a DbContext is set up incorrectly. You want a DbContext to be alive for no longer than it is needed. At a maximum the lifetime scope should be set to PerRequest. A DbContext set up for Singleton for instance would be tracking entities across requests. The more entities a DbContext is tracking, the slower read and update operations become. This would be a possible culprit if the web server memory usage is climbing.
Are you running an SQL Profiler?
In a test environment with nothing else touching the database, running scenarios through the application with an SQL Profiler can reveal potential issues such as unexpected queries being kicked off due to things like lazy loading. For one operation you might expect one or a small number of queries to be run, only to find dozens or even hundreds. Multiply this across concurrent requests and you have a recipe for the database server to say "Just sit down and wait, dammit!" :) Any queries you don't expect based on the code that is running should be investigated for either eager loading relationships or implementing projection. (Recommended for best performance)
Do the web servers get restarted periodically?
For some tricky to debug issues and memory leaks, sometimes the easiest "fix" is to schedule regular restarts of the web server. It's a hack, but compared to the considerable cost of trying to track down memory leaks or fix up inefficient code that slows down over time, it is a cheap and effective fix. (At least while you do research options to address the issues and optimize the code)
That should give you a start into things to check with the service & database.

Caching database table on a high-performance application

I have a high-performance application I'm considering making distributed (using rabbitMQ as the MQ). The application uses a database (currently SQLServer, but I can still switch to something else) and caches most of it in the RAM to increase performance.
This causes a problem because when one of the applications writes to the database, the others' cached database becomes out-of-date.
I figured it is something that happens a lot in the High-Availability community, however I couldn't find anything useful. I guess I'm not searching for the right thing.
Is there an out-of-the-box solution?
PS: I'm sorry if this belongs to serverfault - Since this a development issue I figured it belongs here
EDIT:
The application reads and writes to the database. Since I'm changing the application to be distributed - Now more than one application reads and writes to the database. The caching is done in each of the distributed applications, which are not aware to DB changes from another application.
I mean - How can one know if the DB was updated, if he wasn't the one to update it?
So you have one database and many applications on various servers. Each application has its own cache and all the applications are reading and writing to the database.
Look at a distributed cache instead of caching locally. Check out memcached or AppFabric. I've had success using AppFabric to cache things in a Microsoft stack. You can simply add new nodes to AppFabric and it will automatically distribute the objects for high availability.
If you move to a shared cache, then you can put expiration times on objects in the cache. Try to resist the temptation to proactively evict items when things change. It becomes a very difficult problem.
I would recommend isolating your critical items and only cache them once. As an example, when working on an auction site, we cached very aggressively. We only cached an auction listing's price once. That way when someone else bid on it, we only had to do one eviction. We didn't have to go through the entire cache and ask "Where does the price appear? Change it!"
For 95% of your data, the reads will expire on their own and writes won't affect them immediately. 5% of your data needs to be evicted when a new write comes in. This is what I called your "critical items". Things that always need to be up to date.
Hope that gives you ideas!

couchdb replication on a lot of servers

I am currently looking at CouchDB and I understand that I have to specify all the replications by hand. If I want to use it on 100 nodes how would I do the replication?
Doing 99 "replicate to" and 99 "replicate from" on each node
It feels like it would be overkill since a node replication includes all the other nodes replications to it
Doing 1 replicate to the next one to form a circle (like A -> B -> C -> A)
Would work until one crash, then all wait until it comes back
The latency would be big for replicating from the first to the last
Isn't there a way to say: "here are 3 IPs on the full network. Connect to them and share with everyone as you see fit like an independent P2P" ?
Thanks for your insight
BigCouch won't provide the cross data-center stuff out of the box. Cloudant DBaaS (based on BigCouch) does have this setup already across several data-centers.
BigCouch is a sharded "Dynamo-style" fork of Apache CouchDB--it is to be merged into the "mainline" Apache CouchDB in the future, fwiw. The shards live across nodes (servers) in the same data-center. "Classic" CouchDB-style Replication is used (afaik) to keep the BigCouches in the various data-centers insync.
CouchDB-style replication (n-master) is change-based, so replication only includes the latest changes.
You would need to setup to/from pairs of replication for each node/database combination. However, if all of your servers are intended to be identical, replication won't actually happen that often--it will only happen if needed.
If A gets a change, replication ships it to B and C (etc). However, if B--having just got that change--replicates it to C before A gets the chance too--due to network latency, etc--when A does finally try, it will realize the data is already there, and not bother sending the change again.
If this is a standard part of your setup (i.e., every time you make a db you want it replicated everywhere else), then I'd highly recommend automating the setup.
Also, checkout the _replicator database. It's much easier to manage what's going on:
https://gist.github.com/fdmanana/832610
Hope something in there is useful. :)

What are good algorithms to keep consistency across multiple files in a network?

What are good algorithms to keep consistency in multiple files?
This is a school project. I have to implement in C, some replication across a network.
I have 2 servers,
Server A1
Server A2
Both servers have their own file called "data.txt"
If I write something to one of them, I need the other to be updated.
I also have another scenario, with 3 Servers.
Server B1
Server B2
Server B3
I need these do do pretty much the same.
While this would be fairly simple to implement. If one, or two of the servers were to be down, When comming back up, they would have to update themselves.
I'm sure there are algorithms that solve this efficiently. I know what I want, I just don't know exactly what I'm looking for!
Can someone point me to the right direction please?
Thank you!
The fundamental issue here is known as the 'CAP theorem', which defines three properties that a distributed system can have:
Consistency: Reading data from the system always returns the most up-to-date data.
Availability: Every response either succeeds or fails (doesn't just keep waiting until things recover)
Partition tolerance: The system can operate when its servers are unable to communicate with each other (a server being down is one special case of this)
The CAP theorem states that you can only have two of these. If your system is consistent and partition tolerant, then it loses the availability condition - you might have to wait for a partition to heal before you get a response. If you have consistency and availability, you'll have downtime when there's a partition, or enough servers are down. If you have availability and partition tolerance, you might read stale data, or have to deal with conflicting writes.
Note that this applies separately between reads and writes - you can have an Available and Partition-Tolerant system for reads, but Consistent and Available system for writes. This is basically a master-slave system; in a partition, writes might fail (if they're on the wrong side of a partition), but reads will work (although they might return stale data).
So if you want to be Available and Partition Tolerant for reads, one easy option is to just designate one host as the only one that can do writes, and sync from it (eg, using rsync from a cron script or something - in your C project, you'd just copy the file over using some simple network code periodically, and do an extra copy just after modifying it).
If you need partition tolerance for writes, though, it's more complex. You can have two servers that can't talk to each other both doing writes, and later have to figure out what data wins. This basically means you'll need to compare the two versions when syncing and decide what wins. This can just be as simple as 'let the highest timestamp win', or you can use vector clocks as in Dynamo to implement a more complex policy - which is appropriate here depends on your application.
Check out rsync and how Dropbox works.
With every write on to server A, fork a process to write the same content to server B.
So that all the writes on to server A are replicated on to server B. If you have multiple servers, make the forked process to write across all the backup servers.

How to gear towards scalability for a start up e-commerce portal?

I want to scale an e-commerce portal based on LAMP. Recently we've seen huge traffic surge.
What would be steps (please mention in order) in scaling it:
Should I consider moving onto Amazon EC2 or similar? what could be potential problems in switching servers?
Do we need to redesign database? I read, Facebook switched to Cassandra from MySql. What kind of code changes are required if switched to Cassandra? Would Cassandra be better option than MySql?
Possibility of Hadoop, not even sure?
Any other things, which need to be thought of?
Found this post helpful. This blog has nice articles as well. What I want to know is list of steps I should consider in scaling this app.
First, I would suggest making sure every resource served by your server sets appropriate cache control headers. The goal is to make sure truly dynamic content gets served fresh every time and any stable or static content gets served from somebody else's cache as much as possible. Why deliver a product image to every AOL customer when you can deliver it to the first and let AOL deliver it to all the others?
If you currently run your webserver and dbms on the same box, you can look into moving the dbms onto a dedicated database server.
Once you have done the above, you need to start measuring the specifics. What resource will hit its capacity first?
For example, if the webserver is running at or near capacity while the database server sits mostly idle, it makes no sense to switch databases or to implement replication etc.
If the webserver sits mostly idle while the dbms chugs away constantly, it makes no sense to look into switching to a cluster of load-balanced webservers.
Take care of the simple things first.
If the dbms is the likely bottle-neck, make sure your database has the right indexes so that it gets fast access times during lookup and doesn't waste unnecessary time during updates. Make sure the dbms logs to a different physical medium from the tables themselves. Make sure the application isn't issuing any wasteful queries etc. Make sure you do not run any expensive analytical queries against your transactional database.
If the webserver is the likely bottle-neck, profile it to see where it spends most of its time and reduce the work by changing your application or implementing new caching strategies etc. Make sure you are not doing anything that will prevent you from moving from a single server to multiple servers with a load balancer.
If you have taken care of the above, you will be much better prepared for making the move to multiple webservers or database servers. You will be much better informed for deciding whether to scale your database with replication or to switch to a completely different data model etc.
1) First thing - measure how many requests per second can serve you most-visited pages. For well-written PHP sites on average hardware it must be in 200-400 requests per second range. If you are not there - you have to optimize the code by reducing number of database requests, caching rarely changed data in memcached/shared memory, using PHP accelerator. If you are at some 10-20 requests per second, you need to get rid of your bulky framework.
2) Second - if you are still on Apache2, you have to switch to lighthttpd or nginx+apache2. Personally, I like the second option.
3) Then you move all your static data to separate server or CDN. Make sure it is served with "expires" headers, at least 24 hours.
4) Only after all these things you might start thinking about going to EC2/Hadoop, build multiple servers and balancing the load (nginx would also help you there)
After steps 1-3 you should be able to serve some 10'000'000 hits per day easily.
If you need just 1.5-3 times more, I would go for single more powerfull server (8-16 cores, lots of RAM for caching & database).
With step 4 and multiple servers you are on your way to 0.1-1billion hits per day (but for significantly larger hardware & support expenses).
Find out where issues are happening (or are likely to happen if you don't have them now). Knowing what is your biggest resource usage is important when evaluating any solution. Stick to solutions that will give you the biggest improvement.
Consider:
- higher than needed bandwidth use x user is something you want to address regardless of moving to ec2. It will cost you money either way, so its worth a shot at looking at things like this: http://developer.yahoo.com/yslow/
- don't invest into changing databases if that's a non issue. Find out first if that's really the problem, and even if you are having issues with the database it might be a code issue i.e. hitting the database lots of times per request.
- unless we are talking about v. big numbers, you shouldn't have high cpu usage issues, if you do find out where they are happening / optimization is worth it where specific code has a high impact in your overall resource usage.
- after making sure the above is reasonable, you might get big improvements with caching. In bandwith (making sure browsers/proxy can play their part on caching), local resources usage (avoiding re-processing/re-retrieving the same info all the time).
I'm not saying you should go all out with the above, just enough to make sure you won't get the same issues elsewhere in v. few months. Also enough to find out where are your biggest gains, and if you will get enough value from any scaling options. This will also allow you to come back and ask questions about specific problems, and how these scaling options relate to those.
You should prepare by choosing a flexible framework and be sure things are going to change along the way. In some situations it's difficult to predict your user's behavior.
If you have seen an explosion of traffic recently, analyze what are the slowest pages.
You can move to cloud, but EC2 is not the best performing one. Again, be sure there's no other optimization you can do.
Database might be redesigned, but I doubt all of it. Again, see the problem points.
Both Hadoop and Cassandra are pretty nifty, but they might be overkill.

Resources