MapReduce and similar methods to accelerate indexing of Apache Solr - solr

I am building a web application in which users' content is indexed by Solr and then presented in a dashboard-type interface. Eventually, the requirements should allow users to upload a file of say 50 megs (some day 1 gig) and have their content indexed and ready to browse as soon as possible. Arbitrary benchmark: 1 minute. Preferred benchmark: 50 ms.
I am interested in methods to scale up the capacity of the Solr indexing operation. I have been reading up approaches of doing this here, here, here, here. It seems that AWS elastic map reduce, elastic search, and SolrCloud are all potential approaches for solving this. I'm wondering whether there are accepted methods for scaling up indexing with on demand server resources, especially solutions that can be easily integrated into an existing AWS architecture.
Note, the application uses Solr 4.0.

Related

When to definitely use SOLR over Lucene in a Sitecore 7 build?

My client does not have the budget to setup and maintain a SOLR server to use in their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a SOLR server can be swapped in later.
The site build has
faceted search page
listing components on landing and on other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components not including media library items. Are there any concerns about simply using Lucene?
The main question is, when, during your architecture or design phase do you know that you should definitely choose SOLR over Lucene? What are the major signs that lead you recommend that?
I think if you are dealing with a customer on a limited budget then Lucene will work perfectly well and perform excellently for the scale of things you are doing. All the things you mention are fully supported by the implementation in Lucene.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - id say 50 thousand upwards - Lucene is happy with these sorts of number but Solr has improved query caching and is designed for these large numbers of items.
The resilience of the search tier is of maximum business importance (ie the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing of the search tier in other application is important (non Sitecore) - Solr is a search application so can be accessed over HTTP with XML/JSON etc which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say if you want swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same your configuration will be slightly different and will need attention to port across.
The understanding of how Solr works as an application and how the schema works is important to know but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. Im sure there are more points here and others can chime in if they think of them. Hope this helps :)
Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.
I would recommend planning an escape plan from Lucene as early as you start thinking about multiple CDs and here is why:
A) Each server has to maintain its own index copy:
Any unexpected restart might cause a few documents not to be added to the index on the one box, making indexes different from server to server.
That would lead to same page showing differently by CDs
Each server must perform index updates - use CPU & disk space; response rate drops after publish operation is over =/
According to security guide, CDs should have Sitecore Shell UI removed, so index cannot be easily rebuilt from Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation does roughly following:
Create an array with size equal to total number of documents in the index
If document matches search, set flag in the array
While this works like a charm for low sized indexes (~10K elements), huge performance degradation is produced once the volume of content grows.
The allocated array ends in Large Object Heap that is not compacted by default, thereby gets fragmented fast.
Scenario:
Perform search for 100K documents -> huge array created in memory
Perform one more search in another thread -> one more huge array created
Update index -> now 100K + 10 documents
The first operation was completed; LOH has space for 100K array
Seach triggered again -> 100K+10 array is to be created; freed memory 'hole' is not large enough, so more RAM is requested.
w3wp.exe process keeps on consuming more and more RAM
This is the common case for Analytics Aggregation as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) Last Lucene.NET release was done 5 years ago.
Whereas SOLR is actively being developed.
The sooner you'll make the switch to SOLR, the easier it would be.

Improving database record retrieval throughput with appengine

Using AppEngine with Python and the HRD retrieving records sequentially (via an indexed field which is an incrementing integer timestamp) we get 15,000 records returned in 30-45 seconds. (Batching and limiting is used.) I did experiment with doing queries on two instances in parallel but still achieved the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.
Try splitting the records on different entity groups. That might force them to go to different physical servers. Read entity groups in parallel from multiple threads or instances.
Using cache mght not work well for large tables.
Maybe you can cache your records, like use Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application access. I don't think that App Engine Datastore is designed for speed but for scalability. Memcache however is.
BTW, if you are conscious about the performance that GAE gives as per what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have an active community support. I'm using CapeDwarf in my local environment it is still in BETA but it works.
Move to any of the in-memory databases. If you have Oracle Database, using TimesTen will improve the throughput multifold.

Performance expectations for an Amazon RDS instance

I am using Google App Engine for an app, and the app is currently hitting the datastore at a rate of around 2.5 million row writes, and 4.5 million row reads per day.
I am currently porting the app to Amazon Elastic Beanstalk and Amazon RDS due to the very high costs of running an application on GAE.
Based on the values above, how can I find out / estimate what type of RDS instance I will need for my requirements? Is the above a considerable amount of processing for, lets say a Small or Micro MySQL RDS instance to process in a day?
Totally depends on a number of factors:
Row size.
Field types and sizes.
Complexity of your queries (joins, etc).
Proper use of indexes.
Row contention and other possible bottlenecks.
Really hard to tell. But from experience, if you don't need fancy replication or sharding, the costs of the GAE datastore are usually higher as it offers total redundancy, distribution, scalability, etc.
My suggestion would be to write a quick program to benchmark a load on RDS that replicates what you are expecting. Should be easy to write if you forgo all the business rules and such and just do fake but randomized reads and writes.

Is GAE optimized for database-heavy applications?

I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up-to-the-hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously requiring 100% availability and fast downloads) and itermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
stores about 10-20k user-submitted articles (typically 500-700 words)
Maximum entity size in App Engine is 1 MB, so as long as the total size of the article is lower than that, it should not be a problem. Also, the cost for reading data in is not tied to the size of the entity but to the number of entities being read.
At any time, any user should be able to perform searches on tags and keywords.
Again, as long as the search on the tags and keywords are not full text searches, App Engine's datastore queries could handle these kind of searches efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index for both fields. This could increase your write cost.
download a copy of the entire database that is recent up-to-the-hour.
You could use cron/scheduled task to schedule a hourly dump to the blobstore. The cron could be targeted to a backend instance if your dump takes more than 60 seconds to be finished. Do remember that with each dump, you would need to read all entities in the database, and this means 10-20k read ops per hour. You could add a timestamp field to your entity, and have your dump servlet query for anything newer than the last dump instead to save up read ops.
Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously requiring 100% availability and fast downloads) and itermittent weeks of low activity.
This is where GAE shines, you could have very efficient instance usages with GAE in this case.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a textproperty on an entity, with tags in a listproperty. For searching text you could use the search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
Not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as pdf?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.

MapReduce in the cloud

Except for Amazon MapReduce, what other options do I have to process a large amount of data?
Microsoft also has Hadoop/MapReduce running on Windows Azure but it is under limited CTP, however you can provide your information and request for CTP access at link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Besides that you can also try Google BigQuery in which you will have to move your data to Google propitiatory Storage first and then run BigQuery on it. Remember BigQuery is based on Dremel which is similar to MapReduce however faster due to column based search processing.
There is another option is to use Mortar Data, as they have used python and pig, intelligently to write jobs easily and visualize the results. I found it very interesting, please have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good.
Full-on distributions
Apache Hadoop
Cloudera’s Distribution including Apache Hadoop (that’s the official name)
IBM Distribution of Apache Hadoop
DataStax Brisk
Amazon Elastic MapReduce
HDFS alternatives
Mapr
Appistry CloudIQ Storage Hadoop Edition
IBM Global Parallel File System (GPFS)
CloudStore
Hadoop MapReduce alternatives
Pervasive DataRush
Cascading
Hive (an Apache subproject, included in Cloudera’s distribution)
Pig (a Yahoo-developed language, included in Cloudera’s distribution)
Refer : http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If want to process large amount of data in real-time ( twitter feed, click stream from website) etc using cluster of machines then check out "storm" which was opensource'd from twitter recently
Standard Apache Hadoop is good for processing in batch with petabytes of data where latency is not a problem.
Brisk from DataStax as mentioned above is quite unique in that you can use MapReduce Parallel processing on live data.
There are other efforts like Hadoop Online which allows to process using pipeline.
Google BigQuery obviously another option where you have csv (delimited records) and you can slice and dice without any setting up. It's extremely simple to use ,but is a premium service where you have to pay by no. of bytes processed ( first 100GB / month is free though).
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.
However, this option is less cost effective than Amazon Elastic Mapreduce, unless you have lots of jobs to run through the day, keeping your cluster fairly busy.
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble heterogenous hardware into a cluster with decent computing power. The kind that can live in a rack in your server room. Considering that older hardware that's laying around is already paid for, the only costs to getting such a cluster going is new drives, and perhaps enough memory sticks to maximize the capacity of those boxes. Then cost effectiveness of such an approach is much better than Amazon. The only caveat would be whether you have the bandwidth necessary for pulling down all the data into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/

Resources