Can someone explain to me what segments are in Solr?
I have not found a good description online.
I have also seen various segment files in Solr. What are they there for?
What happens if I delete one segment file? Will that corrupt the index?
I am using Solr 5.3 (if that makes any difference).
Also, what are tlogs and what is their role?
The segment files in Solr are parts of the underlying Lucene index. You can read about the index format in the Lucene index docs.
In principle, each segment contains a part of the index. New segment files get created as you add documents, and you can usually ignore them. Only if you run into problems, such as too many open file handles, might you want to merge some of them together with the index OPTIMIZE command.
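If you do want to force such a merge, you can trigger the optimize from SolrJ as well as from the update handler URL. A minimal sketch, assuming a local Solr 5.x install and a hypothetical core name "mycore" on the default port:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OptimizeCore {
        public static void main(String[] args) throws Exception {
            // Hypothetical base URL and core name - adjust to your setup.
            try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
                // Asks Lucene to merge the segments of this core down to one.
                client.optimize();
            }
        }
    }

Keep in mind that an optimize rewrites the whole index, so it is expensive on large cores and rarely needed in day-to-day operation.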
And yes, deleting one of the files will corrupt the index.
The tlog files are transaction logs where every index-changing operation (ADD, UPDATE, DELETE) is written down. If anything happens to your Solr server while there is an open segment with transactions still in flight, that segment file will be corrupt. Solr then uses the tlog to replay the already received transactions and restore the index to its best guess of the latest state. You can read more on this in this nice post on the Lucidworks blog.
I have a text file containing over 10 million records of web pages.
I want to build a Solr index from this file every day (because the file is updated daily).
Is there an effective way to fully build the Solr index in one go, such as using a MapReduce model to accelerate the build?
I think using the Solr API to add documents one at a time is a little bit slow.
It is not clear how much content is in those 10 million records, but it may actually be simple enough to index them in bulk. Just check your solrconfig.xml for your commit settings; you may, for example, have autoCommit configured with a low maxDocs setting. In your case, you may want to disable autoCommit completely and just commit manually at the end.
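For the indexing itself, batching the documents and committing once at the end already helps a lot. A rough SolrJ sketch (the core name, field names and batch size are made up, adjust them to your schema):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/webpages")) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 10_000_000; i++) {      // stand-in for reading your daily file
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", "page-" + i);
                    doc.addField("url", "http://example.com/page-" + i);
                    batch.add(doc);
                    if (batch.size() == 10_000) {           // send documents in chunks, not one by one
                        client.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) {
                    client.add(batch);
                }
                client.commit();                            // one explicit commit at the very end
            }
        }
    }

If that is still too slow, ConcurrentUpdateSolrClient (which streams batches over several threads) is usually the next thing to try before reaching for MapReduce.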
However, if it is still a bit slow, before going to map-reduce, you could think about building a separate index and then swapping it with the current index.
This way, you actually have the previous collection to roll back to and/or to compare against if needed. The new collection can even be built on a different machine and/or closer to the data.
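The swap itself is a single CoreAdmin call (action=SWAP with the core and other parameters). A minimal sketch in plain Java, with made-up host and core names ("webpages_new" being the freshly built index and "webpages" the core currently serving queries):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SwapCores {
        public static void main(String[] args) throws Exception {
            // Hypothetical host and core names - adjust to your installation.
            URL url = new URL("http://localhost:8983/solr/admin/cores"
                    + "?action=SWAP&core=webpages_new&other=webpages");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (InputStream in = conn.getInputStream()) {
                System.out.println("SWAP returned HTTP " + conn.getResponseCode());
            }
        }
    }

After the swap, queries against "webpages" hit the new index, and the old one is still available under "webpages_new" in case you need to roll back.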
I noticed that during each Nutch crawl, the indexes sent to Solr were not consistent. Sometimes the latest changes to the webpages were shown, sometimes older changes were shown instead.
Cause
I noticed that Nutch was sending documents from an older segment to Solr.
Current Solution
Deleting all old segments before fetching seemed to solve the problem.
Question
I would like to know if there are any implications of such an approach, or whether my understanding of this is incorrect. I would also like to know why Nutch does not automatically remove older segments during a crawl.
Thanks.
If multiple segments are indexed (again) and the same document is contained in two or more segments, there is no guarantee that the most recent version ends up in the index. It's a known problem (NUTCH-1416). The easiest solution is to send only the recently fetched segments to the indexer. The script bin/crawl does this: the index step is done at the end of each cycle, for the segment fetched in that cycle.
I have a Solr core with hundreds of millions of documents.
I want to create 100 duplicates of this core where I only change 2-3 fields (time and ID) on the original docs and save them to the new cores (so each core contains different time data for testing).
I need it to work as fast as possible.
I was thinking of opening the core files with Lucene and reading the entire content while writing the altered documents to a new index, but I realized I would need to configure all the analyzers of the destination core, which may be complex, and in addition not all my fields are stored.
If there were a low-level API in Lucene to alter documents/indexes, I could copy the index files and change the documents at the lowest level.
Is anyone familiar with such an API?
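Lucene segments are essentially write-once; apart from doc-values updates there is no API for editing field values in place. The closest low-level approach is the one you describe: read each document's stored fields with an IndexReader, swap the ID/time values, and add the result to a new IndexWriter. A rough sketch against the Lucene 5.x API (paths, field names and the analyzer are placeholders, and it only round-trips stored fields, so any non-stored field is lost):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class CopyWithNewIdAndTime {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/source/index")));
                 IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/copy/index")),
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                for (int i = 0; i < reader.maxDoc(); i++) {        // note: does not skip deleted docs
                    Document doc = reader.document(i);             // returns STORED fields only
                    doc.removeField("id");
                    doc.removeField("time");
                    doc.add(new StringField("id", "copy1-" + i, Field.Store.YES));
                    doc.add(new StringField("time", "2016-01-01T00:00:00Z", Field.Store.YES));
                    // The copied fields come back as stored-only values, so they are not re-analyzed
                    // and indexed the way your Solr schema defines - exactly the analyzer problem
                    // you mention.
                    writer.addDocument(doc);
                }
                writer.commit();
            }
        }
    }

Given that not all of your fields are stored, a straight copy like this cannot reproduce the original index, so for 100 near-identical copies it is probably faster overall to reindex each variant from the source data.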
We're using a training server to create Solr indexes, and we upload them to another Solr server via rsync.
Until now everything has been fine. Now the index size on one core has increased drastically, and our Solr instances are refusing to read the indexes on that core. They simply ignore those indexes without throwing any exceptions. (We are, of course, reloading the cores or restarting Tomcat after each rsync.)
That is, in the Solr stats numDocs is 0, and /select?q=*:* does not return any results.
To rule out the possibility that those indexes are corrupted, we have regenerated them a couple of times, but nothing has changed. When we try smaller indexes, they are read fine.
Our solrconfig.xml for this core looks like this: https://gist.github.com/983ebb13c895c9cccbfb
Copying your index using rsync is a bad idea. Your Solr server may not have completed writing files to disc when you initiate the copy operation, and you could end up with corruption. The only safe way to do this is to shut down the master (source index), shut down the slave (destination index), remove the entire content of the slave's index directory, copy the master's index across, and then restart everything.
A better approach is what was suggested by Peer Allan above - use Solr's built-in replication support. See http://wiki.apache.org/solr/SolrReplication.
I have implemented a search engine with Solr that imports a minimum of 2 million documents per day.
Users must be able to search the imported documents ASAP (near real-time).
I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr shard mode). Each server indexes about 120 million documents, roughly 220 GB (500 GB in total).
I want to take incremental backups of the Solr index files while updates or searches are running.
After searching, I found the rsync tool for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get an error (files vanished) during updates.
How can I solve this problem?
Note 1: Copying the files is really slow when they are very large, so I can't use that approach.
Note 2: Can I prevent corrupted index files during updates if Windows crashes, the hardware resets, or any other problem occurs?
You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:
http://host:8080/solr/replication?command=backup&location=/home/jboss/backup
Obviously you could script that with wget+cron.
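If you would rather trigger it from code than from wget, the same handler can be reached through SolrJ. A small sketch (host, core name and backup location are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class TriggerBackup {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient("http://localhost:8080/solr/mycore")) {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("command", "backup");
                params.set("location", "/home/jboss/backup");   // where the snapshot should be written
                QueryRequest request = new QueryRequest(params);
                request.setPath("/replication");                // hit the ReplicationHandler, not /select
                System.out.println(client.request(request));    // returns immediately; the snapshot is
                                                                // written in the background
            }
        }
    }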
More details can be found here:
http://wiki.apache.org/solr/SolrReplication
The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.
Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.
Some ideas to work around it:
Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
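For the second option, the ReplicationHandler also lets you pause and resume the passive core's polling around the backup, so the active core is never touched. A sketch of that sequence, with a hypothetical passive-core URL (the wait-for-the-snapshot step is omitted; you would normally poll command=details or watch the backup directory before re-enabling polling):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class BackupPassiveCore {
        static Object replicationCommand(SolrClient client, String command) throws Exception {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", command);
            QueryRequest request = new QueryRequest(params);
            request.setPath("/replication");
            return client.request(request);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical URL of the passive (slave) core.
            try (SolrClient passive = new HttpSolrClient("http://localhost:8080/solr/mycore_passive")) {
                replicationCommand(passive, "disablepoll");   // stop pulling changes from the active core
                replicationCommand(passive, "backup");        // snapshot the now-static index
                // ... wait until the snapshot has been written ...
                replicationCommand(passive, "enablepoll");    // resume replication
            }
        }
    }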