Solr indexing increasing speed? - solr

I'm trying to figure out whats holding the index speed back.
I'm extracting text from pdf's to index each page seperatly to solr to get page hit results.
I was using commit after every "document". Then I noticed its spend loads of time rebuilding the index euch time I used commit.
Now I use this:
<autoCommit> <maxDocs>10000</maxDocs> <maxTime>60000</maxTime> </autoCommit>
To get a commit every minute.
But then I was calculating and found out it indexed around 30 'documents'(pages as solrDoc)/sec or 10 real documents/sec. This seems pretty slow compared to other setups.
How could I increase my speed?
Extra info:(request if needed)
My documents contain 7 fields.(1 content field with the text on the page)
I use Solrj to add documents to solr.
I'm using the example config since I have no advanced knowledge of Solr
pc intel core i7 2600+16Gb ram+ssd (this is a dev computer not the
final server but it should be pretty fast) Not much of the cpu and ram is used.
I get the files from an external storage. (but its fast I could easily get 12MB/s)
I extract the text using pdfbox
It took 390 Minutes to make a 650Mb index(455600 solrdocuments )

one aspect is whether your process is multithreaded or not, if not, test by having several threads extracting text from pdf and then hand over to solr for indexing.

Related

Frequency of Full reindex on SolrCloud

How often do I need to run full reindex on SolrCloud?
It takes more than 12 hours for full reindex to run and we run it every night but is it really necessary to do it as delta runs correctly.
New data comes in at the rate of 2000 documents on every delta per 30 seconds.
Total index size : 20GB
Solr: 6.5.2
If delta runs correctly, there should be no need to run a reindex at all. The exception might be if you do not have disabled any merging while the index is operative; in that case you might end up with a very fragmented index file wise, and the reindex ends up building a complete set as a single index file instead, but isn't usually how Solr is configured, and if it is - it's done for a reason.
So - if your delta is working correctly and you run Solr with fairly standard settings, you can safely skip reindexing unless you're starting over with an empty index (or have a situation where the schema has changed). But be sure that this also includes deletions - a reindex would probably not include deleted elements, so the question then becomes whether your delta import handles deletions as well.
None of our Solr based services reindex at all - everything is done with live updates and a decent merge factor.

When to optimize a Solr Index [duplicate]

I have a classifieds website. Users may put ads, edit ads, view ads etc.
Whenever a user puts an ad, I am adding a document to Solr.
I don't know, however, when to commit it. Commit slows things down from what I have read.
How should I do it? Autocommit every 12 hours or so?
Also, how should I do it with optimize?
A little more detail on Commit/Optimize:
Commit: When you are indexing documents to solr none of the changes you are making will appear until you run the commit command. So timing when to run the commit command really depends on the speed at which you want the changes to appear on your site through the search engine. However it is a heavy operation and so should be done in batches not after every update.
Optimize: This is similar to a defrag command on a hard drive. It will reorganize the index into segments (increasing search speed) and remove any deleted (replaced) documents. Solr is a read only data store so every time you index a document it will mark the old document as deleted and then create a brand new document to replace the deleted one. Optimize will remove these deleted documents. You can see the search document vs. deleted document count by going to the Solr Statistics page and looking at the numDocs vs. maxDocs numbers. The difference between the two numbers is the amount of deleted (non-search able) documents in the index.
Also Optimize builds a whole NEW index from the old one and then switches to the new index when complete. Therefore the command requires double the space to perform the action. So you will need to make sure that the size of your index does not exceed %50 of your available hard drive space. (This is a rule of thumb, it usually needs less then %50 because of deleted documents)
Index Server / Search Server:
Paul Brown was right in that the best design for solr is to have a server dedicated and tuned to indexing, and then replicate the changes to the searching servers. You can tune the index server to have multiple index end points.
eg: http://solrindex01/index1; http://solrindex01/index2
And since the index server is not searching for content you can have it set up with different memory footprints and index warming commands etc.
Hope this is useful info for everyone.
Actually, committing often and optimizing makes things really slow. It's too heavy.
After a day of searching and reading stuff, I found out this:
1- Optimize causes the index to double in size while beeing optimized, and makes things really slow.
2- Committing after each add is NOT a good idea, it's better to commit a couple of times a day, and then make an optimize only once a day at most.
3- Commit should be set to "autoCommit" in the solrconfig.xml file, and there it should be tuned according to your needs.
The way that this sort of thing is usually done is to perform commit/optimize operations on a Solr node located out of the request path for your users. This requires additional hardware, but it ensures that the performance penalty of the indexing operations doesn't impact your users. Replication is used to periodically shuttle optimized index files from the master node to the nodes that perform search queries for users.
Try it first. It would be really bad if you avoided a simple and elegant solution just because you read that it might cause a performance problem. In other words, avoid premature optimization.

How to configure Solr for improved indexing speed

I have a client program which generates a 1-50 millions Solr documents and add them to Solr.
I'm using ConcurrentUpdateSolrServer for pushing the documents from the client, 1000 documents per request.
The documents are relatively small (few small text fields).
I want to improve the indexing speed.
I've tried to increase the "ramBufferSizeMB" to 1G and the "mergeFactor" to 25 but didn't see any change.
I was wondering if there is some other recommended settings for improving Solr indexing speed.
Any links to relevant materials will be appreciated.
It looks like you are doing a bulk import of data into Solr, so you don't need to search any data right away.
First, you can increase the number of documents per request. Since your documents are small, I would even increase it to 100K docs per request or more and try.
Second, you want to reduce the number of times commits happen when you are bulk indexing. In your solrconfig.xml look for:
<!-- AutoCommit
Perform a hard commit automatically under certain conditions.
Instead of enabling autoCommit, consider using "commitWithin"
when adding documents.
http://wiki.apache.org/solr/UpdateXmlMessages
maxDocs - Maximum number of documents to add since the last
commit before automatically triggering a new commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new commit.
openSearcher - if false, the commit causes recent index changes
to be flushed to stable storage, but does not cause a new
searcher to be opened to make those changes visible.
-->
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
You can disable autoCommit altogether and then call a commit after all your documents are posted. Otherwise you can tweak the numbers as follows:
The default maxTime is 15 secs so an auto commit happens every 15 secs if there are uncommitted docs, so you can set this to something large, say 3 hours (i.e. 3*60*60*1000). You can also add <maxDocs>50000000</maxDocs> which means an auto commit happens only after 50 million documents are added. After you post all your documents, call commit once manually or from SolrJ - it will take a while to commit, but this will be much faster overall.
Also after you are done with your bulk import, reduce maxTime and maxDocs, so that any incremental posts you will do to Solr will get committed much sooner. Or use commitWithin as mentioned in solrconfig.
In addition to what was written above, when using SolrCloud, you may want to consider using the CloudSolrClient when using SolrJ. The CloudSolrClient client class is Zookeeper aware and is able to directly connect to the leader shard speeding up the indexing in some cases.

Search using Solr vs Map Reduce on Files - which is reliable?

I have an application which needs to store a huge volume of data (around 200,000 txns per day), each record around 100 kb to 200 kb size. The format of the data is going to be JSON/XML.
The application should be highly available , so we plan to store the data on S3 or AWS DynamoDB.
We have use-cases where we may need to search the data based on a few attributes (date ranges, status, etc.). Most searches will be on few common attributes but there may be some arbitrary queries for certain operational use cases.
I researched the ways to search non-relational data and so far found two ways being used by most technologies
1) Build an index (Solr/CloudSearch,etc.)
2) Run a Map Reduce job (Hive/Hbase, etc.)
Our requirement is for the search results to be reliable (consistent with data in S3/DB - something like a oracle query, it is okay to be slow but when we get the data, we should have everything that matched the query returned or atleast let us know that some results were skipped)
At the outset it looks like the index based approach would be faster than the MR. But I am not sure if it is reliable - index may be stale? (is there a way to know the index was stale when we do the search so that we can correct it? is there a way to have the index always consistent with the values in the DB/S3? Something similar to the indexes on Oracle DBs).
The MR job seems to be reliable always (as it fetches data from S3 for each query), is that assumption right? Is there anyway to speed this query - may be partition data in S3 and run multiple MR jobs based on each partition?
You can <commit /> and <optimize /> the Solr index after you add documents, so I'm not sure a stale index is a concern. I set up a Solr instance that handled maybe 100,000 additional documents per day. At the time I left the job we had 1.4 million documents in the index. It was used for internal reporting and it was performant (the most complex query too under a minute). I just asked a former coworker and it's still doing fine a year later.
I can't speak to the map reduce software, though.
You should think about having one Solr core per week/month for instance, this way older cores will be read only, and easier to manager and very easy to spread over several Solr instances. If 200k docs are to be added per day for ever you need either that or Solr sharding, a single core will not be enough for ever.

Can Apache Solr Handle TeraByte Large Data

I am an apache solr user about a year. I used solr for simple search tools but now I want to use solr with 5TB of data. I assume that 5TB data will be 7TB when solr index it according to filter that I use. And then I will add nearly 50MB of data per hour to the same index.
1- Are there any problem using single solr server with 5TB data. (without shards)
a- Can solr server answers the queries in an acceptable time
b- what is the expected time for commiting of 50MB data on 7TB index.
c- Is there an upper limit for index size.
2- what are the suggestions that you offer
a- How many shards should I use
b- Should I use solr cores
c- What is the committing frequency you offered. (is 1 hour OK)
3- are there any test results for this kind of large data
There is no available 5TB data, I just want to estimate what will be the result.
Note: You can assume that hardware resourses are not a problem.
if your sizes are for text, rather than binary files (whose text would be usually much less), then I don't think you can pretend to do this in a single machine.
This sounds a lot like Logly and they use SolrCloud to handle such amount of data.
ok if all are rich documents then total text size to index will be much smaller (for me its about 7% of my starting size). Anyway, even with that decreased amount, you still have too much data for a single instance I think.

Resources