Is data importing automatic in solr? - solr

I'm using Solr 4.10.4 with MySQL on Windows.
Solr recommends setting the HTTP cache max-age to half of the interval at which the index gets updated.
So, the question is: does Solr automatically perform full/delta imports? If so, how do I control that?
If not, I assume it's up to me to update the index regularly?

#Howie
Solr can be configured to pull data using a DataImportHandler.
You should look at this documentation for details: https://wiki.apache.org/solr/DataImportHandler
There is some documentation on scheduling the data pull, but it appears that scheduling is not a configurable feature and requires some additional changes. The section http://wiki.apache.org/solr/DataImportHandler#Scheduling discusses this. There is also a Stack Overflow question on the same topic: How can I Schedule data imports in Solr
Alternatively, you can post data to Solr as needed from your own system.
Both strategies will work. Which one is better depends entirely on your system. I would recommend going through the docs on indexing at https://wiki.apache.org/solr/FrontPage#Search_and_Indexing and then deciding which strategy works better for your application.
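Since the import is not triggered on its own, one way to run it regularly is to have an external scheduler (for example Windows Task Scheduler) call the DataImportHandler over HTTP. The following is a minimal sketch, not a definitive setup: it assumes the handler is registered at /dataimport in solrconfig.xml, and the host, port and core name `mycore` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, core, command="delta-import", clean=False, commit=True):
    """Build the URL that asks the DataImportHandler to run an import.

    Assumes the handler is mapped to /dataimport (this depends on solrconfig.xml).
    """
    if command not in ("full-import", "delta-import"):
        raise ValueError("command must be 'full-import' or 'delta-import'")
    params = {
        "command": command,
        "clean": str(clean).lower(),    # clean=true would wipe the index first
        "commit": str(commit).lower(),  # commit so the changes become searchable
    }
    return f"{base_url.rstrip('/')}/{core}/dataimport?{urlencode(params)}"


def trigger_import(url, timeout=30):
    """Send the request and return the raw response body from Solr."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical host and core name; a scheduled task would run this script.
    print(dataimport_url("http://localhost:8983/solr", "mycore"))
```

Running the script from a scheduled task at a fixed interval gives you a known update frequency, which is the figure the max-age recommendation in the question is based on.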

Related

Versioning document changes in Vespa

I would like to allow for versioning of text in Vespa. If a user changes certain fields over time the changes would be tracked and versions could be restored.
I imagine a solution running in parallel to Vespa would be the way to go, with version numbers being stored in the vespa doc as unindexed data.
Any recommendations on a solution to use to do this? Something like http://jsonpatch.com?
I would just store each version as a separate document by including the version in the document id.
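As a rough illustration of that idea, the version can be appended to the user-specified part of the Vespa document id. This is only a sketch: the namespace, document type and the `-v<number>` suffix are hypothetical conventions, not something Vespa prescribes.

```python
def versioned_doc_id(namespace, doc_type, key, version):
    """Return a Vespa document id with the version embedded in the last part.

    The "-v<version>" suffix is an assumed naming convention for this sketch.
    """
    if version < 1:
        raise ValueError("version must be >= 1")
    return f"id:{namespace}:{doc_type}::{key}-v{version}"


if __name__ == "__main__":
    # Each edit is fed as a new document; older versions stay retrievable by id.
    for version in (1, 2, 3):
        print(versioned_doc_id("mynamespace", "article", "doc42", version))
```

Restoring an old version then amounts to reading the document with the older id and feeding its fields again under the next version number.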

Solr or Lucene like single application

Hello, I already have a working application for searching in a database, with about 50M indexed documents. Is there a way to run everything together? I mean, I don't want to talk to Solr over HTTP. What should I do? Is it better to use Lucene or EmbeddedSolrServer, or is there another solution?
I currently have something like the first diagram, and I want to do this in a single process.
And if I go with Lucene, can I use my existing indexes from Solr?
solr-5.2.1
Tomcat v8.0
It is not recommended to deploy both the application and Solr in the same Tomcat.
If Solr crashes, there is a chance of downtime for the application, so it's always better to run Solr independently. Embedding Solr is also not recommended.
The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances.
For reference: http://wiki.apache.org/solr/EmbeddedSolr
It depends. If you want to use parts of the Solr feature set (Solr adds quite a few features on top of Lucene), you'll reimplement features that you otherwise would get for free.
You can use EmbeddedSolr to have Solr internal to your application, and then use the EmbeddedSolrServer client in SolrJ to talk to it. The rest of your application would still use Solr as if it were a remote instance.
The problem with EmbeddedSolr is that you'll run into scalability issues as the index size grows, since you'll have a harder time scaling onto multiple servers and to separate concerns.

Manipulate Solr index with lucene

I have a solr core with 100K-1000k documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such a task with the Lucene library, accessing the Solr index directly (with less overhead).
If needed, I can shut down the core, run my code and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear whether someone has already done such a thing and what the major pitfalls are.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence in Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Yet the number of documents you mention isn't something major and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
"I have a scenario where I need to add or set a field value on most documents."
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
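To give a feel for the ExternalFileField route, here is a minimal sketch that generates the external file. It assumes the usual `key=value` line format with float values, a file named `external_<fieldname>` placed in the index data directory, and a hypothetical field name `popularity`; check these against the documentation for your Solr version.

```python
import os


def external_file_lines(values):
    """Turn {doc_key: number} into sorted "key=value" lines.

    Values are written as floats; keys are sorted because lookups are
    reported to be faster when the file is sorted by key.
    """
    return [f"{doc_key}={float(value)}" for doc_key, value in sorted(values.items())]


def write_external_file(data_dir, field_name, values):
    """Write the lines to <data_dir>/external_<field_name> and return the path."""
    path = os.path.join(data_dir, f"external_{field_name}")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(external_file_lines(values)) + "\n")
    return path


if __name__ == "__main__":
    # Hypothetical document keys and field name.
    print(external_file_lines({"doc2": 3, "doc1": 1.5}))
```

The values live outside the index, so changing them means rewriting this file rather than reindexing every document, which is the trade-off against the limitations mentioned above.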

solr - can I use it for this?

Is solr just for searching ie it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at SOLR as an alt option, I see you make your queries through http requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr is mainly for full-text search and should not be used as a persistent store.
Solr stores its data in files and does not provide the features of a relational database (ACID, nested entities, etc.).
The usual model is to use a relational database for your data management
and replicate the data into Solr for full-text search.
You can always control insert/update access to Solr by securing the URLs.

building in support for future Solr sharding

We are building an application. Right now we have one Solr server, but we would like to design the app so that it can support multiple Solr shards in the future if we outgrow our indexing capacity.
What are the key things to keep in mind when developing an application that can support multiple shards in the future?
We store the Solr URL /solr/ in a DB, and it is used to execute queries against Solr. There is one URL for updates and one URL for searches in the DB.
If we add shards to the Solr environment at a future date, will the process for using the shards be as simple as updating the URLs in the DB, or are there other things that need to be updated? We are using SolrJ.
e.g. change the SolrSearchBaseURL in DB to:
https://solr2/solr/select?shards=solr1/solr,solr2/solr&indent=true&q={search_query}
And updating the SolrUpdateBaseURL in DB to
https://solr2/solr/
?
Basically, what you are describing has already been implemented in SolrCloud. There, ZooKeeper maintains the state of your search cluster (which shards are in which collections, shard replicas, leader and slave nodes and more). It can handle the load on both the indexing and querying sides by using hashing.
You could, in principle, get by (at least at the beginning of your cluster growth) with the system you have developed. But think about replication, adding load balancers and external cache servers (e.g. Varnish): in the long run you would end up implementing something like SolrCloud yourself.
Having said that, there are some caveats to using hash-based indexing and hence searching. If you want to implement logical partitioning of your data (say, by date), at this point there is no way to do this other than writing custom code. There is some work planned around this, though.
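For the manual, non-SolrCloud approach from the question, the search URL stored in the DB could be generated from a list of shards instead of being edited by hand. A minimal sketch, assuming the standard `shards` request parameter; the host names are the ones from the question and the helper name is hypothetical:

```python
from urllib.parse import urlencode


def distributed_select_url(coordinator_url, shards, query):
    """Build a /select URL that fans the query out to every listed shard."""
    if not shards:
        raise ValueError("at least one shard is required")
    params = {
        "shards": ",".join(shards),  # comma-separated host/path entries
        "q": query,
    }
    return f"{coordinator_url.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    print(distributed_select_url(
        "https://solr2/solr",
        ["solr1/solr", "solr2/solr"],
        "title:foo",
    ))
```

Note that this only covers searching. With manual sharding the application still has to decide which shard each update goes to, which is the part SolrCloud's hash-based routing takes over.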
