Import script that maintains data integrity w/ zero downtime - database

Background
I'm writing an import script which is fairly computationally expensive and results in many insert and update database queries. My intention is to store the database on an EBS volume and use EC2's command-line tools to launch a c1.xlarge instance, perform the import (writing to the EBS volume) and have it self-destruct on completion (to save $).
On instance termination, the EBS volume (that contains all the imported data) is then programmatically attached and mounted to the machine that contains my webserver.
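For illustration, here is a minimal sketch of that volume hand-off using boto3 rather than the older EC2 command-line tools the post mentions; the instance IDs, volume ID, and device name are placeholders, not values from the post.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

IMPORT_INSTANCE_ID = "i-0aaaaaaaaaaaaaaaa"     # c1.xlarge import instance (hypothetical)
WEBSERVER_INSTANCE_ID = "i-0bbbbbbbbbbbbbbbb"  # webserver instance (hypothetical)
VOLUME_ID = "vol-0cccccccccccccccc"            # EBS volume holding the imported data (hypothetical)

# 1. Terminate the import instance once the import script has finished.
ec2.terminate_instances(InstanceIds=[IMPORT_INSTANCE_ID])
ec2.get_waiter("instance_terminated").wait(InstanceIds=[IMPORT_INSTANCE_ID])

# 2. A non-root data volume is detached (not deleted) on termination by default;
#    wait until it reports as available.
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# 3. Attach it to the webserver; mounting the filesystem then happens on that host.
ec2.attach_volume(VolumeId=VOLUME_ID,
                  InstanceId=WEBSERVER_INSTANCE_ID,
                  Device="/dev/sdf")
```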
By using this scheme, the webserver machine can continue to respond to HTTP requests without:
being overloaded on CPU and RAM,
serving incomplete data while the import is running, or
wasting resources (i.e. being an expensive instance type).
Question
Is this a sound approach? Is it essentially how companies that manage large amounts of data are able to do so without downtime, whilst keeping up-to-date? Good books or blog posts on the subject?

I would posit that if you have a single webserver and are concerned about the cost of a c1.xlarge for a short period of time, you can tolerate slightly more than zero downtime. The setup you're describing sounds fine, just realize that once you turn off the DB on the c1.xlarge, your downtime starts ticking until the DB is up and your app is reconfigured to point to the local instance of the DB. Figure that'll be a few minutes, so plan accordingly - perhaps with a maintenance page, or, if your app can work off cached data / in read-only mode, do that.
Or if you're using a supported DB, just use RDS. That'll probably get you a lot closer to zero downtime with less work. You'll pay something for it, but the multi-AZ failover is worth the price of admission.
And no, this is not how a large company with lots of data would do it. They'd most likely use replication, ensure the data is on a few different servers, and then failover the master. That's also what you get with RDS.

Related

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel like a waste to pay 24/7 for a bigger EC2 instance; I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second, temporary big instance whenever analysis is needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive, and given my limited knowledge I'm not 100% sure it is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
The answers by previous users are great. Let's break them down into options. It sounds to me like your initial stack is a custom SQL database you installed on EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This gives you a lot of goodies, but the main one we're after is read replicas: if your read volume grows, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and requires few code changes.
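As a rough illustration of the read/write split (not from the original answer), the sketch below assumes a MySQL-flavoured RDS instance and hypothetical endpoint names, with the 5-minute scraper writing to the primary and the occasional analysis job reading from the replica:

```python
# Sketch only: endpoints, credentials, and schema are placeholders.
import mysql.connector

PRIMARY_ENDPOINT = "scrapes.xxxxxxxx.us-east-1.rds.amazonaws.com"          # writer (hypothetical)
REPLICA_ENDPOINT = "scrapes-replica.xxxxxxxx.us-east-1.rds.amazonaws.com"  # read replica (hypothetical)

def get_conn(read_only: bool):
    host = REPLICA_ENDPOINT if read_only else PRIMARY_ENDPOINT
    return mysql.connector.connect(host=host, user="scraper",
                                   password="secret", database="scrapes")

# The scraper keeps writing to the primary as before.
conn = get_conn(read_only=False)
cur = conn.cursor()
cur.execute("INSERT INTO samples (ts, value) VALUES (NOW(), %s)", (42,))
conn.commit()
conn.close()

# The heavy analysis reads from the replica, so it can't slow the writer down.
conn = get_conn(read_only=True)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM samples")
print(cur.fetchone())
conn.close()
```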
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the drive. EFS is a network service and will add some lag to every read/write operation. Depending on how your database distribution is installed, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 -> PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in 'operation cost,' but it means that you have to be acquainted with S3 and, more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet to lower cost. (For more on why that saves money, see https://aws.amazon.com/athena/pricing/)
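As a sketch of what this option could look like with boto3 (the bucket, key layout, and the Athena database and table are all assumptions for illustration):

```python
import json
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Each scrape writes one object; PutObject on an existing key simply overwrites it.
s3.put_object(
    Bucket="my-scrape-bucket",                        # placeholder bucket
    Key="scrapes/2024/05/01/12-05.json",              # placeholder key layout
    Body=json.dumps({"ts": "2024-05-01T12:05:00Z", "value": 42}),
)

# Analysis is then a pay-per-query Athena call over the bucket.
athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM scrapes",       # assumes a table defined over the bucket
    QueryExecutionContext={"Database": "scrape_db"},  # placeholder Athena/Glue database
    ResultConfiguration={"OutputLocation": "s3://my-scrape-bucket/athena-results/"},
)
```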
Option 4 - RedShift
RedShift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query) and then start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates RedShift. The reason is that RedShift scales with your querying needs, and you can easily spin up more machines to make querying faster.
As far as I can see, S3 and Athena are a good option for this. I'm not sure about your concern about NOT using S3, but once you can save the scraped data in S3, you can analyse it with Athena (a pay-per-query model).
Alternatively, you can use RedShift to store and analyse the data; it has an on-demand offering with a pricing model similar to EC2 on-demand.
Also, you could use Kinesis Firehose, which can be used to analyse data in near real time as you ingest it.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (e.g. what has been scraped, where to scrape next) in a database such as DynamoDB.
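A minimal sketch of that split, assuming a hypothetical bucket and a DynamoDB table named scrape-state keyed on url:

```python
import boto3

s3 = boto3.resource("s3")
dynamodb = boto3.resource("dynamodb")
state = dynamodb.Table("scrape-state")   # hypothetical table with partition key "url"

def save_scrape(url: str, key: str, payload: bytes) -> None:
    # Durable scrape output goes to S3, so worker instances stay disposable.
    s3.Object("my-scrape-bucket", key).put(Body=payload)
    # Bookkeeping (what was scraped, where to resume) goes to DynamoDB.
    state.put_item(Item={"url": url, "last_key": key, "status": "done"})

save_scrape("https://example.com/page/1", "scrapes/page-1.json", b'{"value": 42}')
```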
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc.).
However, if you need to read unstructured data, your application can access the files directly from S3, either by downloading and using them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (It means your system will need to cope with potentially being stopped mid-way.)
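One way the analysis job could cope with a Spot interruption is to watch the instance metadata endpoint for the interruption notice; a sketch (shown IMDSv1-style for brevity, with the actual checkpointing left as a comment):

```python
import requests

# This metadata path returns 404 until AWS schedules an interruption for the instance.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        return requests.get(NOTICE_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

def run_analysis(chunks):
    for chunk in chunks:
        if interruption_pending():
            # ~2 minutes of notice: flush partial results to S3, record progress, exit cleanly.
            print("Spot interruption pending; checkpointing and stopping.")
            return
        # ... run one unit of the real analysis on `chunk` and persist its result ...
```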

Web scraping vs Cloud Storage with AWS

My team has run into a design conflict. We are working on a project that involves scraping historical data from Yahoo for all stocks for the last year to run some ML analysis on it. The scraping is unbearably slow; I'm not sure if it's the network or the web scraper. I proposed we use AWS RDS to store the data so we can access it more quickly. However, a team member said that storing the data in the cloud would not solve our latency issue. I countered that the data would be organized and stored in a way that lets us access it significantly faster. He came back with something else, and this went on. Is it true that a cloud DB won't offer any additional speed compared to a scraper? If so, does AWS have a service that allows us to access the stored data faster through another service, almost as if the database were on our own server?
I am not all that familiar with cloud services, but I do understand databases pretty well. So please dumb down the AWS stuff if you wish, and feel free to point me to any duplicates or links that may help me understand this more.
Lots of good reasons to use RDS as a database, but speeding up your scraping isn't one of them - it likely isn't your bottleneck.
I have written lots of scrapers over the years, and by far the biggest performance boost will be to have a fast network connection between the scraper machine(s) and the host you are scraping, and even then, using a multi-threaded scraper for each scraping machine will give you another HUGE speed improvement.
Most time spent scraping is waiting on the host to return results to you, not parsing the page and not saving the data to a database.
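To illustrate the multi-threaded point (URLs and worker count are arbitrary): while one thread waits on a slow host, the others keep fetching, so total wall-clock time stops being the sum of every round trip.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url: str) -> tuple[str, int]:
    # Almost all of this call is spent waiting on the remote host.
    resp = requests.get(url, timeout=10)
    return url, len(resp.content)

urls = [f"https://example.com/page/{i}" for i in range(100)]   # placeholder URLs

# 20 concurrent fetches overlap the waiting; parsing and saving stay cheap by comparison.
with ThreadPoolExecutor(max_workers=20) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)
```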
A MySQL DB on AWS RDS would be the same as the one that you'd install yourself on some machine. So, it isn't going to be different or slower just because it is in the cloud.
If you scrape some data and process it only once, then there is no point in introducing a DB in between. But if your scraper is slow and you process scraped data multiple times, then storing it in a DB should improve latencies. That is because the latency of a DB read will be much lower than that of scraping (assuming you design your DB schema properly, your hosts are in the same availability zones, or at least regions, as your DB, etc.).
For example, if scraping a webpage takes ~10s and you process the scraped data twice, it would take you ~20s without a DB. With a DB whose read latency is ~500ms, it would only take ~11s.

simple Solr deployment with two servers for redundancy

I'm deploying the Apache Solr web app in two redundant Tomcat 6 servers,
to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now-active instance is told to reload the index core(s) (sketched below)
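For reference, the reload step can be a single call to Solr's CoreAdmin RELOAD action; a rough sketch (host, port, and core name are placeholders for this deployment):

```python
import requests

def reload_core(host: str, core: str) -> None:
    resp = requests.get(
        f"http://{host}:8080/solr/admin/cores",   # Tomcat port assumed; adjust to your setup
        params={"action": "RELOAD", "core": core},
        timeout=30,
    )
    resp.raise_for_status()

# Before the load balancer routes traffic to the standby, ask it to reload its core(s).
reload_core("solr-standby.example.internal", "collection1")
```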
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery and test that those backups work, since an index could conceivably become corrupted; there is no internal checksumming in SOLR/Lucene. An index could get wiped, or records could get deleted and merged away without you knowing it, and backups are useful for recovering those records/docs later if you need to perform an investigation.
Before you re-route traffic to the second instance, I would run some queries to warm the caches and also to test and confirm that the current index works before it goes online.
Isolate updates to a single location, process, and thread to ensure transactional integrity in the event of a cutover; consistency can be difficult to manage because SOLR does not use a vector clock to synchronize updates the way some databases do. I personally would keep an ordered copy of all updates outside SOLR in some other store, in case a small time window needs to be replayed.
In general, my experience with SOLR has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently has 40 million docs and an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.

Caching in Google App Engine/Cloud Based Hosting

I am curious as to how caching works in Google App Engine or any cloud based application. Since there is no guarantee that requests are sent to the same server, does that mean that if data is cached on the 1st request on Server A, then on the 2nd request, which is processed by Server B, it will not be able to access the cache?
If that's the case (cache only local to a server), won't it be unlikely (depending on the number of users) that a request hits the cache? E.g. Google probably has thousands of servers.
With App Engine you cache using memcached. This means that a cache server will hold the data in memory (rather than each application server). The application servers (for a given application) all talk to the same cache server (conceptually; there could be sharding or replication going on under the hood).
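A minimal sketch of that pattern, using the legacy App Engine Python runtime's memcache API (the key scheme, expiry, and placeholder datastore load are assumptions):

```python
from google.appengine.api import memcache

def get_profile(user_id):
    key = "profile:%s" % user_id
    profile = memcache.get(key)            # same shared cache, whichever instance serves the request
    if profile is None:
        profile = {"user_id": user_id}     # placeholder for a real datastore load
        memcache.set(key, profile, time=300)   # cache for 5 minutes
    return profile
```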
In-memory caching on the application server itself will potentially not be very effective, because there is more than one of those (although for your given application there are only a few instances active, it is not spread out over all of Google's servers), and also because Google is free to shut them down all the time (which is a real problem for Java apps that take some time to boot up again, so now you can pay to keep idle instances alive).
In addition to these performance/effectiveness issues, in-memory caching on the application server could lead to consistency problems (every refresh shows different data when the caches are not in sync).
It depends on the type of caching you want to achieve.
Caching on the application server itself can be interesting if you have a complex in-memory object structure that takes time to rebuild from data loaded from the database. In that specific case, you may want to cache the result of the computation. If the structure is large, loading it from a local cache will be faster than from a shared memcache.
If having consistent values between memory and the database is paramount, you can do a checksum/timestamp check against a value stored in the datastore every time you use the cached value. Storing the checksum/timestamp on a small object or in a global cache will speed this up.
One big issue with using the global memcache is ensuring proper synchronization when "refilling" it, when a value is not yet present or has been flushed. If you have multiple servers doing the check at the exact same time and refilling the value in the cache, you may end up with several distinct servers doing the refill at the same time. If the operation is idempotent, this is not a problem; if not, it's a potential and very hard to trace bug.
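One common way to serialize that refill (not from the original answer) is to use memcache.add(), which only succeeds for the first caller, as a short-lived lock; a sketch with placeholder keys, timings, and a stand-in recomputation:

```python
import time
from google.appengine.api import memcache

def expensive_rebuild():
    # Placeholder for the real recomputation (datastore queries, aggregation, ...).
    return {"built_at": time.time()}

def get_report(key):
    value = memcache.get(key)
    if value is not None:
        return value
    # add() is atomic and fails if the lock key already exists,
    # so only one instance performs the refill.
    if memcache.add(key + ":lock", True, time=30):
        value = expensive_rebuild()
        memcache.set(key, value, time=600)
        memcache.delete(key + ":lock")
        return value
    # Someone else is refilling; recompute (or read the source of truth) this time.
    return expensive_rebuild()
```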

Scaling out SQL Server for the web (Single Writer Multiple Readers)

Has anyone had any experience scaling out SQL Server in a multi-reader, single-writer fashion? If not, can anyone suggest a suitable alternative, that they have experience with, for a read-intensive web application?
It probably depends on 2 things:
How big is each single write?
Do readers need real-time data?
A write will block readers when writing, but if each write is small and fast then readers won't notice.
If you offload, say, end-of-day reporting, then you batch your load onto a separate server because readers do not require real-time data. This makes sense.
A write on your primary server must still be synced to your offload secondary server, which will block there as part of the sync process anyway, plus you add overhead to manage the sync.
Most apps are 95%+ reads anyway. For example, an update or delete is a read followed by a write.
My choice (probably, given the low write volume and that it's a web app) would be to scale up: stuff as much RAM as I could into the DB server, with separate disk paths for the database's data and log files.
I don't have any experience with scaling out SQL Server for your scenario.
However, for a read-intensive application, I would look at reducing the load on the database by employing a cache strategy using something like Memcached or MS Velocity.
There are two approaches that I'm aware of:
Have the entire database loaded into the cache and manage adding and updating of items in the cache.
Add items to the cache only when they are requested and remove them when a write operation is performed (sketched below).
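A rough sketch of that second approach (cache-aside with invalidation on writes), written in Python with pymemcache purely for illustration; the same pattern applies from .NET against Memcached or Velocity, and the data-access object here is hypothetical:

```python
import json
from pymemcache.client.base import Client

cache = Client(("cache-host", 11211))   # placeholder memcached host

def get_product(db, product_id):
    key = "product:%d" % product_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)             # served from cache, no DB read
    row = db.fetch_product(product_id)        # hypothetical data-access call to SQL Server
    cache.set(key, json.dumps(row), expire=300)
    return row

def update_product(db, product_id, fields):
    db.update_product(product_id, fields)     # the single writer updates SQL Server
    cache.delete("product:%d" % product_id)   # evict so the next read refills from the DB
```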
Some kind of replication would do the trick.
http://msdn.microsoft.com/en-us/library/ms151827.aspx
You of course need to change your app code.
Some people use partitioned tables, with different row ranges being stored on different servers, united with views. This would be invisible to the app. I think the term for this practice is federation.
By carefully designing your database, application, and server configuration (SQL Server particulars: the location of data, log, system, and SQL binaries, and tempdb), you should be able to handle a pretty good load. Try not to complicate things if you don't have to.
