Multiple simultaneous updates on the same document in Solr

I have a question about Solr document updates. For example, when two requests to update the same document arrive at the same time, how does Solr handle them?
Does it pick one request at random and lock writes before the next request comes in?
Thanks in advance.

Lucene provides several locking mechanisms, described in the Lucene LockFactory docs. By default NativeFSLockFactory is used, which acquires an OS-level file lock on the index directory (the write.lock file) so that only one writer has the index open at a time. The lock is taken for the index as a whole, not for the individual document being indexed. A different locking mechanism can be selected in solrconfig.xml.
Here is a snippet from solrconfig.xml:
<!-- LockFactory
This option specifies which Lucene LockFactory implementation
to use.
single = SingleInstanceLockFactory - suggested for a
read-only index or when there is no possibility of
another process trying to modify the index.
native = NativeFSLockFactory - uses OS native file locking.
Do not use when multiple solr webapps in the same
JVM are attempting to share a single index.
simple = SimpleFSLockFactory - uses a plain file for locking
Defaults: 'native' is default for Solr3.6 and later, otherwise
'simple' is the default
More details on the nuances of each LockFactory...
http://wiki.apache.org/lucene-java/AvailableLockFactories
-->
<lockType>${solr.lock.type:native}</lockType>

Are you asking about physical locks or logical version control? For logical version control, Solr 4+ supports optimistic concurrency using the _version_ field.
You can read about it here:
Official documentation
Detailed writeup
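To make the optimistic concurrency idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Solr's implementation: the ToyIndex class, the VersionConflict exception and the fake version counter are all made up for illustration. It only mirrors the documented meaning of the _version_ value sent with an update (greater than 1: must match the stored version; 1: the document must exist; negative: the document must not exist; 0: no check).

```python
import itertools


class VersionConflict(Exception):
    """Stands in for the HTTP 409 Conflict that Solr returns on a mismatch."""


class ToyIndex:
    """Hypothetical in-memory stand-in for a Solr core; models only _version_ checks."""

    def __init__(self):
        self._docs = {}                     # id -> stored document
        self._clock = itertools.count(100)  # fake, strictly increasing versions

    def update(self, doc):
        doc_id = doc["id"]
        sent = doc.get("_version_", 0)
        current = self._docs.get(doc_id)

        if sent > 1 and (current is None or current["_version_"] != sent):
            raise VersionConflict(f"{doc_id}: expected version {sent}")
        if sent == 1 and current is None:
            raise VersionConflict(f"{doc_id}: document does not exist")
        if sent < 0 and current is not None:
            raise VersionConflict(f"{doc_id}: document already exists")
        # sent == 0: no check at all, the last write wins.

        stored = dict(doc, _version_=next(self._clock))
        self._docs[doc_id] = stored
        return stored["_version_"]


index = ToyIndex()
v1 = index.update({"id": "doc1", "title": "first", "_version_": -1})  # create only

# Two clients read the same version, then both try to write.
index.update({"id": "doc1", "title": "from client A", "_version_": v1})
try:
    index.update({"id": "doc1", "title": "from client B", "_version_": v1})
    conflict = False
except VersionConflict:
    conflict = True
print(conflict)  # True: client B must re-read the document and retry
```

Without a _version_ value (or with 0), neither request is rejected and whichever update is applied last simply overwrites the other.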

Related

Solr Language Detection using DataImportHandler

In my Solr configuration files I have defined a DataImportHandler that fetches data from a MySQL database and also processes the contents of PDF files related to records in the SQL database. The data import works fine.
I'm trying to detect the language of the text contained in the files during the data import phase. I have specified a TikaLanguageIdentifierUpdateProcessorFactory in my solrconfig.xml, as explained in https://wiki.apache.org/solr/LanguageDetection, and have defined the language fields in my document schema. Nevertheless, after I run the indexing from the Solr admin, I cannot see any language field on my documents.
In all the examples I have seen, language detection is done by posting a document to Solr with the post command. Is it possible to do language detection with a DataImportHandler?
Once you have defined the UpdateRequestProcessor chain, you need to actually specify it in the request handler (the DataImportHandler's in this case). You do that with the update.chain parameter.
Also, ensure that the chain includes the LogUpdate and RunUpdate processors, otherwise nothing gets indexed at all.
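As a rough illustration of where update.chain goes, here is a small Python sketch that builds a full-import request with the chain name attached. The core name (mycore), the handler path (/dataimport) and the chain name (langid) are assumptions, so substitute whatever your solrconfig.xml actually defines. It also assumes the handler picks up update.chain from the request parameters; the more common setup is to put the same parameter in the handler's defaults in solrconfig.xml.

```python
from urllib.parse import urlencode


def dataimport_url(core_url, chain, command="full-import"):
    """Build a DataImportHandler request that names an update processor chain.

    core_url and chain are placeholders; nothing is sent over the network.
    """
    params = {"command": command, "update.chain": chain}
    return f"{core_url}/dataimport?{urlencode(params)}"


url = dataimport_url("http://localhost:8983/solr/mycore", "langid")
print(url)
# http://localhost:8983/solr/mycore/dataimport?command=full-import&update.chain=langid
```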

Does SOLR support percolation

ElasticSearch has percolator for prospective search. Does SOLR have a similar feature where you define your query upfront? If not, is there an effective way of implementing this myself on top of the existing SOLR features?
Besides what BunkerMentality said, it is not hard to build your own percolator. What you need:
Are the queries you want to run easy to model in Lucene-only syntax? If so, you are good; if not, you need to convert them to Lucene-only syntax. Build them, and keep them in memory as Lucene queries.
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions of docs a day and it worked fine.
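The flow described above (register queries up front, index each incoming doc on its own, run every query against it) can be sketched in plain Python. This is a stand-in, not the Lucene approach itself: a Counter of tokens plays the role of MemoryIndex, and the term, all_of and any_of helpers are hypothetical substitutes for real Lucene Query objects.

```python
import re
from collections import Counter


def build_single_doc_index(text):
    """Tokenize one document into a term -> frequency map.

    Plays the role of Lucene's MemoryIndex in this sketch.
    """
    return Counter(re.findall(r"\w+", text.lower()))


# Tiny query builders; each query is a function from index to bool.
def term(word):
    return lambda index: word in index


def all_of(*queries):
    return lambda index: all(q(index) for q in queries)


def any_of(*queries):
    return lambda index: any(q(index) for q in queries)


# Queries are registered up front and kept in memory.
REGISTERED = {
    "solr-locking": all_of(term("solr"), term("lock")),
    "search-engines": any_of(term("solr"), term("elasticsearch")),
    "databases": term("mysql"),
}


def percolate(text, registered=REGISTERED):
    """Return the names of all registered queries that match this one doc."""
    index = build_single_doc_index(text)
    return sorted(name for name, query in registered.items() if query(index))


print(percolate("Solr takes a write lock on the index directory"))
# ['search-engines', 'solr-locking']
```

Because the index only ever holds one document, the cost per document grows with the number of registered queries, which is why keeping the queries pre-built in memory matters.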
It's listed as an open new feature, SOLR-4587, on Solr JIRA but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to percolator.
If it is still relevant, you can use this.
It's a Solr update processor based on Luwak.

updating Solr from Lucene Index

I'm currently working on a web archiving project. Basically, what we try to do is archive a collection of websites (using heritrix crawler) and provide access to the archived contents through a web interface.
We also offer full-text search throughout the archives. Currently, the index is generated using nutchwax (a customised version of Apache Nutch, tailored to index .warc files, as generated by heritrix). Nutchwax dumps out a Lucene index, and to use it in Solr, all that has to be done is to generate a correct schema.
This is all done and it's running like it should. However, the archive is not static, and new .warc files are generated periodically.
What I can do now is generate a new index, merge it with the existing one and import it back into Solr. However, to do that Solr has to be restarted.
It would be great if the index could be updated "on the fly", as is usually the case when updating the index via HTTP requests.
Does anyone have an idea how this can be done? My first shot at this was generating .xml files out of the Lucene index file and posting them to Solr. Is this worth a try, or are there more elegant solutions?
You could probably use multiple cores to accomplish what you need. See the Solr Wiki page on CoreAdmin for more details. The MergeIndexes capability, or the ability to swap cores, should give you a better experience in your scenario, since both work on a running Solr instance.
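For reference, here is a hedged Python sketch that builds the two CoreAdmin requests mentioned above. The host, the core names (live, staging) and the index path are placeholders, and the code only constructs the URLs; it does not call Solr.

```python
from urllib.parse import urlencode


def merge_indexes_url(solr_url, target_core, index_dirs):
    """CoreAdmin request that merges on-disk Lucene indexes into target_core."""
    params = [("action", "MERGEINDEXES"), ("core", target_core)]
    params += [("indexDir", path) for path in index_dirs]  # may repeat
    return f"{solr_url}/admin/cores?{urlencode(params, safe='/')}"


def swap_cores_url(solr_url, core, other):
    """CoreAdmin request that atomically swaps the names of two cores."""
    params = {"action": "SWAP", "core": core, "other": other}
    return f"{solr_url}/admin/cores?{urlencode(params)}"


solr = "http://localhost:8983/solr"  # placeholder host
print(merge_indexes_url(solr, "live", ["/data/new-index"]))
# http://localhost:8983/solr/admin/cores?action=MERGEINDEXES&core=live&indexDir=/data/new-index
print(swap_cores_url(solr, "live", "staging"))
# http://localhost:8983/solr/admin/cores?action=SWAP&core=live&other=staging
```

With the swap approach you would build the merged index in a second core and swap it in once it is ready. With the merge approach, a merge generally needs to be followed by a commit on the target core before the new documents become visible.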

Upgrade solr 1.4 index to solr 3.3?

I have an existing index built using Apache Solr 1.4.
I want to use this existing index in version 3.3. As you know, the index format changed in 3.x, so how is it possible to do this?
I have exported the existing index (which is in the 1.4 format) to XML using Luke.
There are two ways to do this:
If your index is unoptimized, simply optimize it. This upgrades the file format along the way.
If your index is already optimized, you can't do this. Instead, use the command-line tool supplied with Solr (your path may differ from mine):
java -cp work/Jetty_0_0_0_0_8983_solr.war__solr__k1kf17/webapp/WEB-INF/lib/lucene-core-3.3.0.jar org.apache.lucene.index.IndexUpgrader -verbose /path/to/index/directory
However, note that this only changes the file format. It won't stop the deprecation warnings because, unless you tell it otherwise, solrconfig.xml still assumes you're using an old index format. See http://www.mail-archive.com/dev#lucene.apache.org/msg23233.html
You may still get lots of lines like this in your logfile:
WARNING: LowerCaseFilterFactory is using deprecated LUCENE_24 emulation. You should at some point declare and reindex to at least 3.0, because 2.x emulation is deprecated and will be removed in 4.0
until you tell solrconfig.xml that you're ready to use all the features of the new index format. You do this by adding the following to solrconfig.xml (at the top level, just after the abortOnConfigurationError setting).
<!-- Controls what version of Lucene various components of Solr
adhere to. Generally, you want to use the latest version to
get all bug fixes and improvements. It is highly recommended
that you fully re-index after changing this setting as it can
affect both how text is indexed and queried.
-->
<luceneMatchVersion>LUCENE_33</luceneMatchVersion>
If you still have the source data, the best way is to reindex all of it in Solr 3.3.
You can use the data import handler to index your exported XML files.
If building a new index is not a solution for you, there are other possibilities:
As far as I know, Solr 3.3 can read old indexes.
So one idea could be to use shards: one shard for the old data (read-only) and another shard for the new data. Unfortunately, with this solution you will be unable to modify the old data.
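A query across the two shards would then pass both of them in the shards request parameter. The sketch below only builds such a URL in Python; the host and the core names (old, new) are assumptions, and distributed search additionally expects both shards to share a compatible schema with a unique key.

```python
from urllib.parse import urlencode


def sharded_query_url(entry_point, shards, q):
    """Build a distributed search URL; entry_point and shards are host/path strings."""
    params = {"q": q, "shards": ",".join(shards)}
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    return f"http://{entry_point}/select?{urlencode(params, safe=':/,')}"


shard_list = ["localhost:8983/solr/old", "localhost:8983/solr/new"]  # placeholders
sharded = sharded_query_url("localhost:8983/solr/new", shard_list, "title:archive")
print(sharded)
# http://localhost:8983/solr/new/select?q=title:archive&shards=localhost:8983/solr/old,localhost:8983/solr/new
```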

Running Solr in read-only mode

I think I'm missing something obvious here. I have to imagine a lot of people open up their Solr servers to other developers and don't want them to be able to modify the index.
Is there something in solrconfig.xml that can be set to effectively make the index read-only?
Update for clarification:
My goal is to use Solr with an existing Lucene index managed by another application. This works just fine, but I want to be sure Solr never tries to write to this index.
Exposing a Solr instance to the public internet is a bad idea. Even though you can strip some components to make it read-only, it wasn't designed with security in mind. It's meant to be used as an internal service, and you shouldn't expose it any more than you would expose an RDBMS.
From the Solr Security wiki page:
First and foremost, Solr does not
concern itself with security either at
the document level or the
communication level. It is strongly
recommended that the application
server containing Solr be firewalled
such the only clients with access to
Solr are your own. A default/example
installation of Solr allows any client
with access to it to add, update, and
delete documents (and of course
search/read too), including access to
the Solr configuration and schema
files and the administrative user
interface.
Even ajax-solr, a Solr client for JavaScript meant to run in a browser, recommends talking to Solr through a proxy.
Take for example guardian.co.uk: it's well-known that they use Solr for searching, but they built an API to let others access their content. This way they can define and control exactly what and how they want people to search for things.
Otherwise, any script kiddie can write a trivial loop to DoS your Solr instance and therefore bring down your site.
You can probably just remove the line that defines your solr.XmlUpdateRequestHandler in solrconfig.xml.
Replication is a nice way to set up a read-only instance while still being able to index. Just set up a master with restricted access and a slave that is read-only (by removing the XmlUpdateRequestHandler from its config). The slave will replicate from the master but won't accept any indexing requests directly.
UPDATE
I just read that in Solr 1.4 you can disable a component. I just tried it on the /update requestHandler and I was no longer able to index.
