MongoDB for a small production project with very low traffic but the opportunity to scale fast

I've run into a problem with databases (specifically with MongoDB).
At first I thought of deploying it to MongoDB Atlas, but the production clusters (M30 and higher) are quite expensive for a small project like mine.
Now I am thinking about deploying a MongoDB replica set on a Heroku dyno or maybe an AWS instance.
Could you please suggest a production-ready way/solution that would not cost much?
As far as I can see, AWS offers free tiers for databases (Amazon RDS, Amazon DynamoDB, and others), but is it really free or is it a trap? I've heard that AWS pricing is not as honey-sweet in reality as it looks on the pricing page.
Any advice or help is welcome!

Your least expensive way to experiment with MongoDB that still offers a path to scaling to something much larger is to get a free-tier EC2 instance like a t2.micro and install MongoDB Community on it yourself. The Linux 64-bit distribution is available here: https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-amazon-5.0.5.tgz. Simply run tar xf on that tarball and the mongod server and mongo CLI executables will be extracted. No other configuration is required, and you would pretty much start it like this:
$ mkdir -p /path/to/dbfiles
$ mongod --dbpath /path/to/dbfiles --logpath /path/to/dbfiles/mongo.log --replSet "rs0" --fork
Note that aggregation works even on a standalone server; multi-document transactions require a replica set, but the single-member replica set started with --replSet above is enough once you initiate it (see the sketch below).
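For example, here is a minimal sketch of initiating that single mongod as a one-member replica set from the mongo shell, assuming it listens on the default port 27017:
$ mongo --eval 'rs.initiate({_id: "rs0", members: [{_id: 0, host: "localhost:27017"}]})'
$ mongo --eval 'rs.status().ok'   # should print 1 once the set is initiated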
This solution will also expose you to a lot of important things like network security, database security, log management, driver compatibility, key vaults, and using the AWS Console and perhaps the AWS CLI to manipulate the environment. You can graduate to a bigger machine and add members to a replica set later -- or you can take the plunge and go with MongoDB Atlas. But at that point you'll be comfortable with the features, especially the aggregation pipeline, the document model, drivers, etc., all learned on an essentially zero-cost platform.
Performance is not really an issue here, but using the handy load generator POCDriver (https://github.com/johnlpage/POCDriver) on a t2.micro instance with no special setup, a document size of about 0.28 KB, a 90/10 read/insert mix on the default 4 threads, and bulk writes disabled (-b 1), we get about 450 inserts/sec and 4,200 reads/sec on the primary key _id.
java -jar POCDriver.jar --host "mongodb://localhost:27017" -k 90 -i 10 -b 1
Of course, the load generator is competing with the DB engine itself for resources, but as a first cut it is good to know the performance independent of network considerations.
Going from launching the EC2 instance to running the test, including installing Java (sudo yum -y install java-1.8.0-openjdk-devel.x86_64), took about 5 minutes.

Related

DC/OS: Service vs Marathon App

I have the following two questions:
1) Are DC/OS Services just marathon apps? (Or: What's the difference between a DC/OS Service (like Cassandra) and a Cassandra app installed via Marathon?)
2) Scaling: Do DC/OS services like Cassandra automatically scale to all available nodes in the cluster (given sufficient workload)?
Thank you for your help :)
1) To answer the first part of your question, let me add one more concept, the DC/OS package, so we have DC/OS package vs. DC/OS service vs. Marathon app.
a) DC/OS service vs Marathon App
They are the same: a long-running service that Marathon will automatically keep running. You see this, for example, when creating a new DC/OS service, which you can do with a Marathon app definition.
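For illustration, here is a minimal sketch of a Marathon app definition deployed through the DC/OS CLI (the app id and command are hypothetical):
$ cat > hello.json <<'EOF'
{
  "id": "/hello",
  "cmd": "while true; do echo hello; sleep 10; done",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1
}
EOF
$ dcos marathon app add hello.json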
b) DC/OS packages (and I believe the core of your question)
dcos package install cassandra will deploy the DC/OS Apache Cassandra package.
The interesting piece of the Cassandra package is the scheduler, a piece of software that manages your Cassandra cluster (e.g., by bootstrapping the cluster or automatically restarting failed tasks) and also provides endpoints for scaling, upgrading, and so on.
If you like, it is an automated version of an administrator for your Cassandra cluster.
Now we also have to ensure that this administrator is always available (i.e., what happens if the administrator/scheduler task or node fails?).
This is why the scheduler itself is deployed via Marathon, so it will be automatically restarted.
Marathon
|
Cassandra Scheduler
|
Cassandra Cluster
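If the Cassandra package above is installed, you can see its scheduler running as an ordinary Marathon app, e.g.:
$ dcos package list
$ dcos marathon app list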
2) Second part of your question: Autoscaling
The package provides endpoints for scaling, so the typical pattern is to provide a script (e.g., marathon-autoscale) to scale the Cassandra cluster. The reason you need your own script is that scaling is very specific to each user, especially scaling down.
Keep in mind that you are scaling a persistent service, so how do you select the node you want to remove? Do you first drain traffic from that node? Do you migrate data to other nodes?
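As a rough sketch (with a hypothetical app id): for a stateless Marathon app, such a script could simply bump the instance count via the DC/OS CLI, whereas a stateful package like Cassandra has to be scaled through its scheduler's endpoints, with the data questions above in mind.
$ dcos marathon app update /my-stateless-app instances=5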

How to quickly transfer data between servers using Sunspot / Websolr?

Since I suspect my setup is rather conventional, I'd like to start by providing a little context. Our Solr setup involves three environments:
Production - Solr server hosted on Websolr.
Staging - Also a Solr server hosted on Websolr.
Development - Supported via the sunspot_solr gem which allows us to easily set up our own local Solr server for development.
For the most part, this is working well. We have a lot of records so doing a full reindex takes a few hours (despite eager loading and using background jobs to parallelize the work). But that's not too terrible since we don't need to completely reindex very often.
But there's another scenario which is starting to become very annoying... We very frequently need to populate our local machine (or staging environment) with production data (i.e. basically grab a SQL dump from production and pipe it into our local database). We do this all the time for bugfixes and whatnot.
At this point, because our data has changed, our local Solr index is out of date. So, if we want our search to work correctly, we also need to reindex our local Solr server and that takes a really long time.
So now the question: Rather than doing a full reindex, I would like to simply copy the production index down on to my machine (i.e. conceptually similar to a SQL dump but for a Solr server rather than a database). I've Googled around enough to know that this is possible but have not seen any solutions specific to Websolr / Sunspot. These are such common tools that I figured someone else must have figured this out already.
Thanks in advance for any help!
One of the better-kept secrets of Solr (and Websolr): you can use the Solr Replication API to copy the data between two indices.
If you're making a copy of the production index "prod54321" into the QA index "qa12345", then you'd initiate the replication with the fetchindex command on the QA index's replication handler. Here's a quick command to approximate that, using cURL.
curl -X POST https://index.websolr.com/solr/qa12345/replication \
-d command=fetchindex \
-d masterUrl=https://index.websolr.com/solr/prod54321/replication
(Note the references to the replication request handler on both URLs.)
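To check on the copy, the same replication handler can report progress via the details command (using the same hypothetical index name as above):
curl "https://index.websolr.com/solr/qa12345/replication?command=details"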

Fastest Open Source Content Management System for Cloud/Cluster deployment

Clouds are currently mushrooming like crazy and people are starting to deploy everything to the cloud, including CMS systems, but so far I have not seen anyone succeed in deploying a popular CMS system to a load-balanced cluster in the cloud. Some performance hurdles seem to prevent standard open-source CMS systems from being deployed to the cloud like this.
CLOUD: A cloud, or better, a load-balanced cluster, has at least one frontend server, one network-connected(!) database server and one cloud-storage server. This maps well to Amazon Elastic Beanstalk and Google App Engine. (This specifically excludes a CMS on a single computer or Linux server with MySQL on the same "CPU".)
Deploying a standard CMS in such a load-balanced cluster requires a cloud-ready CMS with the following characteristics:
The CMS must deal with the latency of queries to still be responsive and render pages in less than a second to be cached (or use a precaching strategy)
The filesystem probably must be connected to a remote storage (Amazon S3, Google cloudstorage, etc.)
Currently I know of Python/Django and WordPress having middleware modules or plugins that can connect to cloud storage instead of a filesystem, but there might be other cloud-ready CMS implementations (Java, PHP, ...?) and systems.
I myself have failed to deploy django-CMS to the cloud, ultimately due to the query latency of the remote DB. So here is my question:
Did you deploy an open-source CMS that still performs well in rendering pages and backend admin? Please post your average page rendering access stats in microseconds for uncached pages.
IMPORTANT: Please describe your configuration, the problems you have encountered, which modules had to be optimized in the CMS to make it work, don't post simple "this works", contribute your experience and knowledge.
Such a CMS probably has to make fewer than 10 queries per page (if more, the queries must be made in parallel) and deal with filesystem access times of 100 ms for a stat and query delays of 40 ms.
Related:
Slow MySQL Remote Connection
Have you tried Umbraco?
It relies on a database, but it keeps layers of cache so you aren't doing selects on every request.
http://umbraco.com/azure
It works great on Azure too!
I have found an excellent performance test of WordPress on App Engine. It appears that Google has spent some time optimizing this system for load-balanced cluster and remote-DB deployment:
http://www.syseleven.de/blog/4118/google-app-engine-php/
Scaling test from the report.
parallel hits    GAE     1&1     Sys11
1                1.5     2.6     8.5
10               9.8     8.5     69.4
100              14.9    -       146.1
Conclusion from the report: the system is slower than on traditional hosting but scales much better.
http://developers.google.com/appengine/articles/wordpress
We have managed to deploy the Python django-CMS (www.django-cms.org) on Google App Engine with Cloud SQL as the DB and Cloud Storage as the filesystem. Cloud Storage was attached by forking and fixing a django.storage module by Christos Kopanos (http://github.com/locandy/django-google-cloud-storage).
After that, a second set of problems came up when we discovered access times of up to 17 s for a single page. We investigated and found that easy-thumbnails 1.4 accessed the normal filesystem for mod_time requests while writing results to the store (re-rendering all thumbnail images on every request). We switched to the development version, where that was already fixed.
Then we worked with SmileyChris to fix the unnecessary access of mod_times (stat-ing the file) on every request for every image, by tracing the problem and posting issues to http://github.com/SmileyChris/easy-thumbnails
This reduced access times from 12-17 s to 4-6 s per public page on the CMS, basically eliminating all storage/"file"-system access. Once that was fixed, easy-thumbnails replaced (by design) filesystem accesses with queries to the DB, to check on every request whether a thumbnail's source image has changed.
One thing for the web designer: if she uses an image.width statement in the template, this forces an ugly, slow read on the "filesystem", because image widths are not cached.
Further investigation led to the conclusion that DB accesses are very costly too, taking about 40 ms per round trip.
Up to now the deployment has been unsuccessful, mostly due to DB access times in the cloud leading to 4-5 s delays when rendering a page before caching it.

Will using a Cloud PaaS automatically solve scalability issues?

I'm currently looking for a cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 million+ users. I've never worked on anything this big, and the big question I can't seem to get a clear answer to is this: if you develop, let's say, a standard application with a relational database and SOAP web services, will this application scale automatically when deployed on a PaaS solution, or do you still need to build the application with failover, redundancy and all those things in mind?
Let's say I deploy a Spring/Hibernate application to Amazon EC2 and I create a single instance of Ubuntu Server with Tomcat installed: will this application just scale indefinitely, or do I need more Ubuntu instances? If more than one Ubuntu instance is needed, does Amazon take care of running the application over both instances, or is this the developer's responsibility? What about database storage: can I install a database on EC2 that will scale as the database grows, or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and deploy straight to their PaaS, but since it's in beta, there's a limit on the amount of resources you can use and databases are limited to 128 MB if I remember correctly, so this is a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2; how does it scale, and how is the database layer handled then?
GAE (Google App Engine): will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appear to be some limitations on what you can and can't run on GAE, and their recent price increase upset quite a large number of developers; is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
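As a rough sketch with the AWS CLI (all names and sizes here are made up, and a launch configuration for your AMI is assumed to exist already):
# create an Auto Scaling group that keeps 2-10 instances behind a classic load balancer
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name my-app-asg \
  --launch-configuration-name my-app-lc \
  --min-size 2 --max-size 10 \
  --availability-zones us-east-1a us-east-1b \
  --load-balancer-names my-app-elb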
SQL databases are hard to scale horizontally. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instance storage is impractical, as instance-store volumes are ephemeral (their data is lost when the instance is stopped or fails); you would need to manage persistent EBS volumes, backups and replication yourself.
GAE Datastore is actually one big database for all applications running on it, so it's pretty scalable: your millions of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes, App Engine scales automatically, both frontend instances and the database. There is nothing special you need to do to make it scale; just use their APIs.
There are limitations on what you can do with App Engine:
A. No local storage (filesystem) - you need to use Datastore or Blobstore.
B. Comet is only supported via their proprietary Channel API
C. Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
The cost of GAE is not bad. We do 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE (but you do need one for EC2). Compared to the cost of manpower, GAE is incredibly cheap.
Some hints to save money on (and speed up) GAE:
A. Use get instead of query in the Datastore (this requires carefully crafted natural keys).
B. Use memcache to cache data you got from the Datastore. This can be done automatically with Objectify and its @Cached annotation.
C. Denormalize data, meaning you write data redundantly in various places in order to get to it in as few operations as possible.
D. If you have a lot of REST requests from devices, where you do not use cookies, then switch off session support (or roll your own, as we did). Sessions use the Datastore under the hood and do a get and a put for every request.
E. Read about adjusting app settings. Try different settings (depending on how tolerant your app is to request delay and on your traffic patterns/spikes). We were able to cut down frontend instances by 70%.

MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data?
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP; you can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.
Besides that, you can also try Google BigQuery, for which you will have to move your data into Google's proprietary storage first and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster due to its column-based processing.
Another option is Mortar Data: they use Python and Pig intelligently to let you write jobs easily and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good.
Full-on distributions
Apache Hadoop
Cloudera’s Distribution including Apache Hadoop (that’s the official name)
IBM Distribution of Apache Hadoop
DataStax Brisk
Amazon Elastic MapReduce
HDFS alternatives
Mapr
Appistry CloudIQ Storage Hadoop Edition
IBM Global Parallel File System (GPFS)
CloudStore
Hadoop MapReduce alternatives
Pervasive DataRush
Cascading
Hive (an Apache subproject, included in Cloudera’s distribution)
Pig (a Yahoo-developed language, included in Cloudera’s distribution)
Refer to: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If you want to process large amounts of data in real time (Twitter feeds, clickstreams from a website, etc.) using a cluster of machines, then check out Storm, which was recently open-sourced by Twitter.
Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.
Brisk from DataStax, as mentioned above, is quite unique in that you can use MapReduce parallel processing on live data.
There are other efforts like Hadoop Online, which allows processing via pipelining.
Google BigQuery is obviously another option: if you have CSV (delimited records), you can slice and dice without any setup. It's extremely simple to use, but it is a premium service where you have to pay by the number of bytes processed (the first 100 GB per month is free, though).
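To give a feel for how simple it is, here is a rough sketch with the bq command-line tool (dataset, table, bucket and schema are hypothetical):
# load a CSV file from Cloud Storage into a table with a simple inline schema
bq load --source_format=CSV mydataset.clicks gs://my-bucket/clicks.csv ts:timestamp,url:string,bytes:integer
# slice and dice with SQL; billing is by bytes processed
bq query "SELECT url, COUNT(*) AS hits FROM mydataset.clicks GROUP BY url ORDER BY hits DESC LIMIT 10"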
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.
However, this option is less cost-effective than Amazon Elastic MapReduce unless you have lots of jobs to run through the day, keeping your cluster fairly busy.
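For reference, with today's AWS CLI an Elastic MapReduce cluster can be brought up with a single command (instance types, counts and the release label here are only illustrative):
aws emr create-cluster --name "ad-hoc-hadoop" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Hive Name=Pig \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles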
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogeneous hardware into a cluster with decent computing power, the kind that can live in a rack in your server room. Considering that older hardware that's lying around is already paid for, the only costs of getting such a cluster going are new drives and perhaps enough memory sticks to maximize the capacity of those boxes. The cost-effectiveness of such an approach is then much better than Amazon's. The only caveat is whether you have the bandwidth necessary for pulling all the data down into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/
