I currently have a single collection with 40 million documents and an index size of 25 GB. The collection gets updated every n minutes, and as a result the number of deleted documents is constantly growing.
The data in the collection is an amalgamation of more than 1,000 customers' records. The number of documents per customer is around 100,000 on average.
That being said, I'm trying to get a handle on the growing number of deleted documents. Because of the growing index size, both disk space and memory are being used up, and I would like to reduce it to a manageable size.
I have been thinking of splitting the data into multiple cores, one for each customer. This would let me manage the smaller collections more easily and create/update them faster. My concern is that the number of collections might become an issue. Any suggestions on how to address this problem?
Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million
Thanks
I had a similar sort of issue, with multiple customers and a large index.
I implemented it with version 3.4 by creating a separate core per customer.
i.e. one core per customer. Creating a core is somewhat like creating an index or splitting the data, as we do in the case of sharding...
Here you are splitting the large index into smaller segments.
Any search will run against the smaller index segment, so the response time will be faster.
I have almost 700 cores created as of now and it's running fine for me.
So far I have not faced any issue with managing the cores.
I would suggest going with a combination of cores and sharding.
It will help you achieve the following:
A different configuration for each core, with different behavior, that has no impact on other cores.
The ability to perform actions like update, load, etc. on each core independently.
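For example, a new per-customer core can be created on the fly through the CoreAdmin API (the core name and directories below are just placeholders), and each core can then be reloaded, updated or unloaded without touching the others:

http://localhost:8983/solr/admin/cores?action=CREATE&name=customer_1234&instanceDir=customer_1234&config=solrconfig.xml&schema=schema.xml&dataDir=data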
I'm using Solr to handle search on a very large set of documents, and I'm starting to have performance issues with complex queries that use facets and filters.
This is the Solr query used to get some data:
Solr full request: http://host/solr/discovery/select?q=&fq=domain%3Acom+OR+host%3Acom+OR+public_suffix%3Acom&fq=crawl_date%3A%5B2000-01-01T00%3A00%3A00Z+TO+2000-12-31T23%3A59%3A59Z%5D&fq=%7B%21tag%3Dcrawl_year%7Dcrawl_year%3A%282000%29&fq=%7B%21tag%3Dpublic_suffix%7Dpublic_suffix%3A%28com%29&start=0&rows=10&sort=score+desc&fl=%2Cscore&hl=true&hl.fragsize=200&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E&hl.snippets=10&hl.fl=content&hl.mergeContiguous=false&hl.maxAnalyzedChars=100000&hl.usePhraseHighlighter=true&facet=true&facet.mincount=1&facet.limit=11&facet.field=%7B%21ex%3Dcrawl_year%7Dcrawl_year&facet.field=%7B%21ex%3Ddomain%7Ddomain&facet.field=%7B%21ex%3Dpublic_suffix%7Dpublic_suffix&facet.field=%7B%21ex%3Dcontent_language%7Dcontent_language&facet.field=%7B%21ex%3Dcontent_type_norm%7Dcontent_type_norm&shards=shard1
When this query is used locally with about 50,000 documents, it takes about 10 seconds, but when I try it on the host with 200 million documents it takes about 4 minutes. I know it will naturally take much longer on the host, but I wonder if anyone has had the same issue and was able to get faster results, knowing that I'm using two shards.
Waiting for your responses.
You're doing a number of complicated things at once: date ranges, highlighting, faceting, and distributed search (non-SolrCloud, by the looks of it).
Still, 10 seconds for a 50k-doc index seems really slow to me. Try selectively removing aspects of your search to see if you can isolate which part is slowing things down and then focus on that. I'd expect that you can find simpler queries that are fast, even if they match a lot of documents.
Either way, check out https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
There are a lot of useful tips there, but the #1 performance issue is usually not having enough memory, especially for large indexes.
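For example, a stripped-down version of the query above is a reasonable starting point (shown URL-decoded for readability, with highlighting and faceting removed but the filters kept, and q=*:* standing in for the empty q); add the other features back one at a time and watch where the time goes:

http://host/solr/discovery/select?q=*:*&fq=domain:com OR host:com OR public_suffix:com&fq=crawl_date:[2000-01-01T00:00:00Z TO 2000-12-31T23:59:59Z]&start=0&rows=10&shards=shard1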
Check how many segments you have in Solr;
the more segments there are, the worse the query response.
If you have not set the merge factor in your solrconfig.xml, then you probably have close to 40 segments, which is bad for query response time.
Set your merge factor accordingly.
If no new documents are to be added, set it to 2.
mergeFactor
The mergeFactor roughly determines the number of segments.
The mergeFactor value tells Lucene how many segments of equal size to build before merging them into a single segment. It can be thought of as the base of a number system.
For example, if you set mergeFactor to 10, a new segment will be created on the disk for every 1000 (or maxBufferedDocs) documents added to the index. When the 10th segment of size 1000 is added, all 10 will be merged into a single segment of size 10,000. When 10 such segments of size 10,000 have been added, they will be merged into a single segment containing 100,000 documents, and so on. Therefore, at any time, there will be no more than 9 segments at each level of index size.
These values are set in the mainIndex section of solrconfig.xml (disregard the indexDefaults section):
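For example (a minimal sketch using the old default values; in Solr 4.x the same settings can also live in the indexConfig section):

<mainIndex>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <mergeFactor>10</mergeFactor>
</mainIndex>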
mergeFactor Tradeoffs
High value merge factor (e.g., 25):
Pro: Generally improves indexing speed
Con: Less frequent merges, resulting in a collection with more index files which may slow searching
Low value merge factor (e.g., 2):
Pro: Smaller number of index files, which speeds up searching.
Con: More segment merges slow down indexing.
Brief overview of the setup:
5 x SolrCloud (Solr 4.6.1) node instances (separate machines).
The setup is intended to store the last 48 hours of webapp logs (which are pretty intense... ~3 MB/sec).
"logs" collection has 5 shards (one per node instance).
One log line represents one document in the "logs" collection.
If I keep storing log documents to this "logs" collection, cores on shards start getting really big and CPU graphs show that instances spend more and more time waiting for disk I/O.
So, my idea is to create a new collection every 15 minutes and name it "logs-201402051400", with shards spread across the 5 instances. Document writers will start writing to the new collection as soon as it is created. At some point I will have a list of collections like this:
...
logs-201402051400
logs-201402051415
logs-201402051430
logs-201402051445
logs-201402051500
...
Since there will be a maximum of 192 collections (~1000 cores) in the SolrCloud at any given time, it seems that search performance would degrade drastically.
So, I would like to merge collections that are not currently being written to into one large collection (but still sharded across the 5 instances). I have found information on how to merge cores, but how can I merge collections?
This might NOT be a complete answer to your query - but something tells me that you need to redo the design of your collection.
This is a classic debate between using a Single Collection with Multiple Shards versus Multiple Collections.
I think you ought to set up a single collection - and then use SolrCloud's dynamic sharding capability (the implicit router) to add new shards (for newer 15-minute intervals) and delete old shards (for older 15-minute intervals).
Managing a single collection means that you will have a single endpoint, and it will save you from the complexity of querying multiple collections.
Take a look at one of the answers on this link that talks about using the implicit router for dynamic sharding in SolrCloud.
How to add shards dynamically to collection in solr?
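A rough sketch of what that looks like with the Collections API (the names below are placeholders; CREATESHARD/DELETESHARD only apply to collections using the implicit router, and CREATESHARD needs a fairly recent 4.x release):

http://host:8983/solr/admin/collections?action=CREATE&name=logs&router.name=implicit&shards=logs-201402051400&maxShardsPerNode=40
http://host:8983/solr/admin/collections?action=CREATESHARD&collection=logs&shard=logs-201402051415
http://host:8983/solr/admin/collections?action=DELETESHARD&collection=logs&shard=logs-201402030000

Writers then direct each document to the right shard, either with the _route_ request parameter or by configuring router.field when the collection is created.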
I'm planning to split up one schema into multiple schemas. This will allow me to run multiple cores with different document types. Then I will use joins to get the related documents if needed.
At the moment I have multiple document types, distinguished by a type field.
How will this change affect the performance?
As far as I know, when you join between cores, you will be able to get information from only one core (not the other).
In my opinion, Solr works best when it has to pull data from one location only. Joining may add overhead, slowing the whole operation.
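For example (the core and field names here are made up), querying a products core while joining against a reviews core looks roughly like:

http://localhost:8983/solr/products/select?q={!join from=product_id to=id fromIndex=reviews}rating:5

This returns the products whose id matches the product_id of five-star reviews, but only fields stored in the products core can be returned; nothing from the reviews core comes back.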
However, consider the following situation: a user has 20 million records in one core and Solr has to search every record in it. If the user is able to separate them into two cores, one having 1 million records and the other having 20 records, then joining might be efficient in such a case.
Summary: it depends on how much data you have now and how much data you will have once you have multiple cores. If your situation is not like the above, then I suggest you look for another alternative.
I have an application which needs to store a huge volume of data (around 200,000 transactions per day), each record around 100 KB to 200 KB in size. The format of the data is going to be JSON/XML.
The application should be highly available, so we plan to store the data on S3 or AWS DynamoDB.
We have use cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on a few common attributes, but there may be some arbitrary queries for certain operational use cases.
I researched ways to search non-relational data and so far have found two approaches used by most technologies:
1) Build an index (Solr/CloudSearch, etc.)
2) Run a MapReduce job (Hive/HBase, etc.)
Our requirement is for the search results to be reliable (consistent with the data in S3/DB - something like an Oracle query; it is okay for it to be slow, but when we get the data, everything that matched the query should be returned, or at least we should be told that some results were skipped).
At the outset it looks like the index-based approach would be faster than MR, but I am not sure if it is reliable - the index may be stale. (Is there a way to know the index was stale when we run the search so that we can correct it? Is there a way to keep the index always consistent with the values in the DB/S3, something similar to the indexes on Oracle DBs?)
The MR job seems to always be reliable (as it fetches the data from S3 for each query); is that assumption right? Is there any way to speed up this query - maybe partition the data in S3 and run multiple MR jobs, one per partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query took under a minute). I just asked a former coworker and it's still doing fine a year later.
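For reference, both can be triggered with a simple request to the update handler (the core name here is a placeholder), though optimize is expensive on a large index and should be run sparingly:

http://localhost:8983/solr/mycore/update?commit=true
http://localhost:8983/solr/mycore/update?optimize=true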
I can't speak to the MapReduce software, though.
You should think about having one Solr core per week or month, for instance; this way older cores will be read-only, easier to manage, and very easy to spread over several Solr instances. If 200k docs are to be added per day forever, you need either that or Solr sharding; a single core will not be enough forever.
What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. E.g.: could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage; you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
My 2 cents.
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since Salesforce has upwards of a dozen server clusters, each with its own performance characteristics, and there's probably a lot of dynamic performance management that occurs regularly. But in the past I've seen performance issues start to creep in at around 2M records. One possible remedy: you can ask Salesforce to index fields that you plan to filter on.
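As a rough illustration (the object and field names below are made up), a filter on an indexed field stays selective even with millions of rows, while a leading-wildcard filter on an unindexed field forces a scan of the whole object:

Selective (indexed field):
SELECT Id, Status__c FROM Transaction__c WHERE External_Key__c = 'TXN-0001'

Non-selective (full scan at volume):
SELECT Id, Status__c FROM Transaction__c WHERE Description__c LIKE '%refund%'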