Using MongoDB's allowDiskUse configuration in a production environment

We are using MongoDB 4.4 and have started seeing memory-limit-exceeded errors while sorting on our collections. According to the documentation, allowDiskUse is the recommended option for queries that need to process more than 100 MB of data in a single pipeline stage, so we have decided to follow MongoDB's recommendation.
With that in mind, we need to know the following:
We could not find any internal implementation details about the temporary storage used when queries run with allowDiskUse=true. What is the size limit of this temporary storage? How and when is it cleaned up?
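For reference, here is a minimal sketch of how the option is passed from a Python client. It assumes PyMongo, whose `Collection.aggregate` accepts `allowDiskUse` as a keyword argument; the helper name `sort_with_disk_use` and the database, collection, and field names in the usage comment are hypothetical.

```python
def sort_with_disk_use(collection, sort_field, descending=False):
    """Run a $sort stage that is allowed to spill to temporary files.

    `collection` is assumed to be a pymongo.collection.Collection
    (any object exposing a compatible `aggregate` method also works).
    """
    direction = -1 if descending else 1
    pipeline = [{"$sort": {sort_field: direction}}]
    # allowDiskUse=True lets a stage that exceeds the in-memory limit
    # write temporary files instead of failing with an error.
    return collection.aggregate(pipeline, allowDiskUse=True)


# Usage (hypothetical names, requires a running MongoDB server):
#   from pymongo import MongoClient
#   events = MongoClient("mongodb://localhost:27017")["appdb"]["events"]
#   for doc in sort_with_disk_use(events, "createdAt"):
#       ...
```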

Related

Migrating Solr Cloud cluster over new cloud vendor

We need to move our SolrCloud cluster from one cloud vendor to another. The cluster is composed of 8 shards with a replication factor of 2, spread across 8 servers, with roughly 500 GB of data in total.
What are the common approaches to migrating the cluster, and especially its data, with the least impact on availability and performance?
I was thinking of doing an initial dump and copy, then synchronizing the two clusters so the new one catches up on the diff (which could be huge), and, once they are in sync, switching over whenever everything is ready on the other side.
Is that doable? What tools should or could I use?
Thanks!

You have multiple choices depending on your existing setup and Solr version:
As mentioned earlier, make use of the backup and restore APIs from the Collections API.
If you have Solr 6 or above, I would recommend exploring CDCR, Solr's native Cross Data Center Replication.
Reindex onto the new cluster and then leverage Solr collection aliasing to switch your application endpoints to the target provider once reindexing is complete.
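As a rough illustration of the backup/restore and aliasing options, the sketch below only builds the Collections API URLs (BACKUP, RESTORE, CREATEALIAS) and sends nothing. The host names, collection names, backup name, and backup location are all hypothetical, and the helper `collections_api_url` is not part of Solr.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL; nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


OLD = "http://old-solr:8983/solr"  # hypothetical source cluster
NEW = "http://new-solr:8983/solr"  # hypothetical target cluster

# 1. Back up the collection on the old cluster.
backup_url = collections_api_url(
    OLD, "BACKUP", name="products-bak", collection="products",
    location="/mnt/shared/backups")

# 2. Restore it under a new collection name on the new cluster.
restore_url = collections_api_url(
    NEW, "RESTORE", name="products-bak", collection="products_v2",
    location="/mnt/shared/backups")

# 3. Point a stable alias at the restored (or reindexed) collection.
alias_url = collections_api_url(
    NEW, "CREATEALIAS", name="products", collections="products_v2")

for url in (backup_url, restore_url, alias_url):
    print(url)
```

If the application always queries the alias rather than the concrete collection, the final cut-over becomes a single CREATEALIAS call.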

Ensuring local caching with Ignite on YARN

I have a stream processing application written in Flink, and I want to use its internal key-value store from the state backend to compute streaming aggregates. Because I am dealing with a lot of aggregates, I would like to avoid maintaining them on-heap inside the Flink application, as the memory-backed and file-backed implementations currently do. Instead, I would like to maintain a cache of the state in Apache Ignite, which in turn could use its write-through and read-through features to provide a more reliable backup in HBase.
Ideally, I would have a single local Ignite cache on every physical node that handles the state for all long-running Flink operators on that node. E.g. each node has a single Ignite node in an 8 GB container available, whether it is running 1 or 10 Flink operators.
The problem is that I want both Flink and Ignite to run on YARN. Through consistent partitioning, I can ensure that the data in general is sent to the correct cache, and in case of failures etc., it can be refilled from HBase. The problem I'm facing, though, is that Ignite seems to request containers from YARN randomly, meaning I have no guarantee that a local cache is in fact available, even if I set the number of Ignite nodes to exactly the number of physical nodes.
Any suggestions on how to achieve a setup with one Ignite node per physical node?

There is a ticket created to enhance resource allocation on YARN: https://issues.apache.org/jira/browse/IGNITE-3214. Someone in the community will pick it up and fix it.

When to definitely use SOLR over Lucene in a Sitecore 7 build?

My client does not have the budget to set up and maintain a Solr server for their production environment. If I understand the Sitecore 7 Content Search API correctly, it is not a big deal to configure things to use Lucene instead. For the most part the configuration will be similar and the code will be the same, and a Solr server can be swapped in later.
The site build has:
a faceted search page
listing components on landing and other pages that will leverage the Content Search API
buckets with custom facets
The site has around 5,000 pages and components, not including media library items. Are there any concerns about simply using Lucene?
The main question is: when, during your architecture or design phase, do you know that you should definitely choose Solr over Lucene? What are the major signs that lead you to recommend that?

I think if you are dealing with a customer on a limited budget, then Lucene will work perfectly well and perform excellently at the scale you are describing. Everything you mention is fully supported by the Lucene implementation.
In a Sitecore scenario I would begin to consider Solr if:
You need to index a large number of items - I'd say 50 thousand upwards. Lucene is happy with these sorts of numbers, but Solr has improved query caching and is designed for large numbers of items.
The resilience of the search tier is of maximum business importance (i.e. the site is purely driven by search) - Solr provides a more robust replication/sharding and failover system with SolrCloud.
Re-purposing the search tier in other (non-Sitecore) applications is important - Solr is a search application, so it can be accessed over HTTP with XML/JSON etc., which makes integration with external systems easier.
You need some specific additional feature of Solr that Lucene doesn't have.
.. but as you say, if you want to swap out Lucene for Solr at a later phase, we have worked hard to make sure that the process is as simple as possible. Worth noting a few points here:
While your LINQ queries will stay the same, your configuration will be slightly different and will need attention to port across.
An understanding of how Solr works as an application and how the schema works is important, but there are some great books and a wealth of knowledge out there.
Solr has slightly different (newer) analyzers and scoring mechanisms, so your search results may be slightly different (sometimes customers can get alarmed by this :P)
.. but I think these are things you can build up to over time and assess with the customer. I'm sure there are more points here and others can chime in if they think of them. Hope this helps :)

Stephen pretty much covered the question - but I just wanted to add another scenario. You need to take into account the server setup in your production environment. If you are going to be using multiple content delivery servers behind a load balancer I would consider Solr from the start, as trying to make sure that the Lucene index on each delivery server is synchronized 100% of the time can be painful.

I would recommend planning an escape route from Lucene as soon as you start thinking about multiple CDs, and here is why:
A) Each server has to maintain its own copy of the index:
Any unexpected restart might cause a few documents not to be added to the index on one box, making the indexes differ from server to server.
That would lead to the same page being rendered differently by different CDs.
Each server must perform index updates, which uses CPU and disk space; the response rate drops after a publish operation is over =/
According to the security guide, CDs should have the Sitecore Shell UI removed, so the index cannot be easily rebuilt from the Control Panel =\
B) Lucene is not designed for large volumes of content. Each search operation roughly does the following:
Create an array with a size equal to the total number of documents in the index
If a document matches the search, set its flag in the array
While this works like a charm for small indexes (~10K documents), performance degrades heavily once the volume of content grows.
The allocated array ends up in the Large Object Heap, which is not compacted by default and therefore fragments quickly.
Scenario:
Perform a search over 100K documents -> a huge array is created in memory
Perform one more search in another thread -> one more huge array is created
Update the index -> now 100K + 10 documents
The first operation completes; the LOH now has a free hole that fits a 100K array
Search is triggered again -> a 100K+10 array must be created; the freed memory 'hole' is not large enough, so more RAM is requested.
The w3wp.exe process keeps consuming more and more RAM
This is the common case for Analytics Aggregation, as an index is being populated by multiple threads at once.
You'll see a lot of RAM used after a while on the processing instance.
C) The last Lucene.NET release was 5 years ago,
whereas Solr is actively being developed.
The sooner you make the switch to Solr, the easier it will be.
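The fragmentation scenario in point B can be reproduced with a toy model. This is not how the .NET Large Object Heap is implemented; it is a simplified first-fit heap that never compacts, written in Python purely for illustration, and the class `ToyHeap` is hypothetical. Sizes are counted in array elements rather than bytes.

```python
class ToyHeap:
    """First-fit heap that never compacts (a rough stand-in for the LOH)."""

    def __init__(self):
        self.blocks = []  # [size, in_use] entries in address order

    def allocate(self, size):
        for block in self.blocks:
            if not block[1] and block[0] >= size:
                block[1] = True  # reuse a free hole (no splitting, for simplicity)
                return block
        block = [size, True]     # no hole is large enough: grow the heap
        self.blocks.append(block)
        return block

    def free(self, block):
        block[1] = False

    @property
    def total_size(self):
        return sum(size for size, _ in self.blocks)

    @property
    def used_size(self):
        return sum(size for size, in_use in self.blocks if in_use)


heap = ToyHeap()
first = heap.allocate(100_000)   # search over 100K documents
second = heap.allocate(100_000)  # concurrent search in another thread
heap.free(first)                 # first search completes, leaving a 100K hole
third = heap.allocate(100_010)   # index grew by 10 documents: hole is too small

print(heap.used_size)   # 200010 elements actually in use
print(heap.total_size)  # 300010 elements reserved, including the unusable hole
```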

Improving database record retrieval throughput with appengine

Using App Engine with Python and the HRD, retrieving records sequentially (via an indexed field that is an incrementing integer timestamp), we get 15,000 records returned in 30-45 seconds. (Batching and limiting are used.) I experimented with running queries on two instances in parallel but still got the same overall throughput.
Is there a way to improve this overall number without changing any code? I'm hoping we can just pay some more and get better database throughput. (You can pay more for bigger frontends but that didn't affect database throughput.)
We will be changing our code to store multiple underlying data items in one database record, but hopefully there is a short term workaround.
Edit: These are log records being downloaded to another system. We will fix it in the future and know how to do so, but I'd rather work on more important things first.

Try splitting the records across different entity groups. That might force them onto different physical servers. Then read the entity groups in parallel from multiple threads or instances.
Using a cache might not work well for large tables.
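A minimal sketch of the parallel-read idea, using only the Python standard library. The callable `fetch_group` is hypothetical: it stands for whatever datastore query returns the records of one entity group, and is assumed here to return a list of (timestamp, record) tuples.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_groups(group_ids, fetch_group, max_workers=8):
    """Fetch every entity group concurrently and merge results by timestamp.

    `fetch_group(group_id)` is a hypothetical callable that returns a list
    of (timestamp, record) tuples for a single entity group.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_group = list(pool.map(fetch_group, group_ids))
    merged = [row for rows in per_group for row in rows]
    merged.sort(key=lambda row: row[0])  # restore the global timestamp order
    return merged
```

Whether this actually raises throughput depends on the groups ending up on different servers, which the answer above only suggests might happen.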

Maybe you can cache your records, for example with Memcache:
https://developers.google.com/appengine/docs/python/memcache/
This could definitely speed up your application's data access. I don't think the App Engine Datastore is designed for speed, but for scalability. Memcache, however, is.
BTW, if you are concerned about the performance GAE gives you for what you pay, then maybe you can try setting up your own App Engine cloud with:
AppScale
JBoss CapeDwarf
Both have active community support. I'm using CapeDwarf in my local environment; it is still in beta, but it works.
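The caching suggestion at the top of this answer amounts to a read-through pattern, sketched below. The `cache` argument is assumed to offer `get(key)` and `set(key, value, time=...)`, which is the shape of the App Engine `memcache` module; `load_from_datastore` is a hypothetical callable standing in for the real datastore read.

```python
def get_record(key, cache, load_from_datastore, ttl_seconds=300):
    """Return a record, consulting the cache before the datastore.

    `cache` needs get(key) and set(key, value, time=seconds); on App Engine
    the `google.appengine.api.memcache` module could be passed in directly.
    `load_from_datastore(key)` is a hypothetical loader for cache misses.
    """
    value = cache.get(key)
    if value is None:                # cache miss
        value = load_from_datastore(key)
        if value is not None:
            cache.set(key, value, time=ttl_seconds)
    return value
```

For a one-off sequential export of log records, each key is read only once, so a cache like this may help little; it pays off when the same records are read repeatedly.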

Move to any of the in-memory databases. If you have Oracle Database, using TimesTen will improve the throughput multifold.

Running solr index on hadoop

I have a huge amount of data that needs to be indexed, and it took more than 10 hours to get the job done. Is there a way I can do this on Hadoop? Has anyone done this before? Thanks a lot!

You haven't explained where the 10 hours are spent. Is it in extracting the data, or just in indexing it?
If most of the time goes to extraction, then Hadoop may help. Solr supports bulk inserts, so in your map function you could accumulate thousands of records and send them to Solr for indexing in one shot. That will improve your performance a lot.
Also, what size is your data?
You could collect a large number of records in the reduce function of a map/reduce job. You have to generate proper keys in your map so that a large number of records go to a single reduce call. In your custom reduce class, initialize the Solr client object in the setup/configure method (depending on your Hadoop version) and close it in the cleanup method. You will have to create a document collection object (in SolrNet or SolrJ) and commit all of the documents in one single shot.
If you are using Hadoop, there is another option called Katta. You can look into it as well.
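The batching idea described above can be sketched independently of Hadoop. The example below groups records into batches and builds, without sending, one JSON update request per batch for Solr's `/update` handler. The URL, core name, and batch size are assumptions, and both helper functions are hypothetical.

```python
import json
from urllib.request import Request


def batches(records, size=1000):
    """Yield lists of at most `size` records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, partially filled batch
        yield batch


def build_update_request(solr_update_url, docs):
    """Build (but do not send) one JSON update request for a batch of docs."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (hypothetical URL; urllib.request.urlopen(req) would send each batch):
#   url = "http://localhost:8983/solr/mycore/update?commit=true"
#   for docs in batches(all_docs, size=1000):
#       req = build_update_request(url, docs)
```

In a reducer, the same loop would run over the values grouped under one key, so each reduce call produces a handful of large requests instead of one request per record.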

You can write a MapReduce job on your Hadoop cluster that simply takes each record and sends it to Solr over HTTP for indexing. AFAIK Solr currently doesn't support indexing across a cluster of machines, so it would be worth looking into Elasticsearch if you also want to distribute your index over multiple nodes.

There is a Solr Hadoop output format that creates a new index in each reducer, so you distribute your keys according to the indices you want and then copy the HDFS files into your Solr instance after the fact.
http://www.datasalt.com/2011/10/front-end-view-generation-with-hadoop/