I'm using Alfresco 4.1.6 and SOLR 1.4.
For search, I use the fts-alfresco query language and the searchService.query method.
In my query I search by PATH, TYPE and some custom properties like direction, telephone, mail, or similar.
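Roughly, the queries look something like this (the folder, type and property names here are only illustrative, not my exact model):

```java
import org.alfresco.service.ServiceRegistry;
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchService;

public class DocumentSearch {
    private final SearchService searchService;

    public DocumentSearch(ServiceRegistry serviceRegistry) {
        this.searchService = serviceRegistry.getSearchService();
    }

    public ResultSet findDocuments() {
        // Illustrative query: folder, type and property names stand in for my real model.
        String query =
            "PATH:\"/app:company_home/cm:SomeMainFolder//*\"" +
            " AND TYPE:\"my:document\"" +
            " AND my:direction:\"inbound\"" +
            " AND my:telephone:\"5551234\"";

        return searchService.query(
            StoreRef.STORE_REF_WORKSPACE_SPACESSTORE,   // the default workspace store
            SearchService.LANGUAGE_FTS_ALFRESCO,        // the fts-alfresco query language
            query);
    }
}
```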
I now have over 2 million documents, and search performance is noticeably worse than it was at the beginning.
I have read that in Solr 1.4 using PATH in the query is a bad idea, and that it is better to avoid it and use only TYPE plus the property keys and values.
But I have 2 questions...
Why does PATH increase the response time? Shouldn't it help? I have over 1,000 main folders at the root of the repository. If I specify the folder Solr should search in, why doesn't this filter the results, and why is the response time worse than when I don't specify it? Or is there another way to tell Solr the main folder, so the result set is reduced before the rest of the query runs?
When I search by custom properties, I combine 3 or 4 of them, all indexed. Do these combined lookups have a higher overhead than a single one? Would it be better to search by only one property rather than three? Or would using ORs instead of ANDs return results faster? How does Solr handle this?
Thanks!
First let me start with this: I'm not sure what you want from this question because it's vague. You're not asking how to make your query better; you're asking why a known bad practice (with known bad performance) is performing badly for you.
Do some research on how to structure your ECM system; the first thing that makes your ECM any good is a proper Content Model. There are books out there that will help you.
If you're structuring your content with folders (PATH) and these are important to you, then you need to add that information as metadata on your content. If you haven't done that, you should start there.
A good Content Model will let you find content wherever it's placed within your ECM system.
Sure it's easy to migrate a filesystem to an ECM system and just leave it there, but you've done only half the work.
PATH queries are slow in general because they use a looping pattern, which is expensive. This has been greatly improved in the newer SOLR, but it still isn't as fast as normal metadata querying.
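As a rough sketch of what that means in practice, assuming you add a hypothetical my:mainFolder property to your content model and fill it when the document is filed:

```
Path-based:      PATH:"/app:company_home/cm:Invoices//*" AND TYPE:"my:document" AND my:direction:"inbound"
Metadata-based:  TYPE:"my:document" AND my:mainFolder:"Invoices" AND my:direction:"inbound"
```

The second form is a plain property lookup and avoids the expensive PATH evaluation entirely.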
Related
For statistical purposes I have to save and analyse all search queries made to a server running Solr (version 8.3.1). Maybe it's just because I haven't worked with Solr until today, but I couldn't find a simpler way to access these queries other than crawling the logs.
I've only found one article to help me, in which the following is stated:
I think that solr by him self doesn't store the queries (correct me if I'm wrong, about this) but you can accomplish what you want by processing the solr log (its the only way I think).
(source: https://lucene.472066.n3.nabble.com/is-it-possible-to-save-the-search-query-td4018925.html)
Is there any more convenient way to do this?
I actually found a good way to achieve this in another SO-Question. Well, at least kind of.
Note: It is only useful if you have enough resources on the same server or another server to properly handle a second Solr-Core.
Link to original answer
That SO question is about Elasticsearch, but its methodology can also be applied to this case with a second Solr-Core that indexes the queries made. (One can also add additional fields, like when a query was last searched, its total search count, ...)
The functionalities of a search auto-complete are also achievable with this solution.
In short:
The basic idea is to use a second Solr instance to provide the means necessary for quickly saving the queries (instead of a DB for instance).
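As a sketch of the idea (the core name and fields here are my own assumptions, not anything Solr ships with): every incoming query gets written as a document into the second core, e.g. via SolrJ:

```java
import java.util.Date;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class QueryLogger {
    // Assumed second core named "queries" with fields "query_text" and "last_searched".
    private final SolrClient queryLog =
        new HttpSolrClient.Builder("http://localhost:8983/solr/queries").build();

    public void log(String userQuery) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", userQuery);              // using the query string as id dedupes repeats
        doc.addField("query_text", userQuery);
        doc.addField("last_searched", new Date());
        queryLog.add(doc);                          // rely on autoCommit, or call queryLog.commit()
    }
}
```

Searching or faceting on that core then gives you the statistics, and the same index can back a search auto-complete.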
Remark: I'm not going to accept this as the best answer because it is a rather specific solution to the question I originally asked. But I nonetheless felt it could be useful for any programmer trying to achieve this while also thinking about search auto-completion.
I am new to Solr and am using it in my project, where I have a large number of products, each with a number of properties, so the indexing takes a whole lot of time. But if I don't index all the properties, then the results will have to be populated via a separate DB hit, and that kind of loses the point of Solr, doesn't it? Since we are hitting the DB anyway, doesn't that make the query slower? Kindly guide me on the right approach: indexing all properties, or getting the remaining properties from the DB?
A hybrid choice is not necessarily evil. Basically, that choice depends on what kind of search features and services you want to offer to your users. For instance:
if you want to facet over a "category" field, you need to put that field in Solr
if you want to have some data in real time (e.g. price), I would go with the database
In general you should experiment and try, because all your thoughts make sense, but my suggestion is: don't optimize things in advance. Write down your (search and view) requirements and, on top of that, try to get a good compromise between the two extremes (only Solr / only the database); a rough sketch of such a compromise follows.
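For instance (the core, field names and the DAO call are placeholders): match and facet in Solr, then fetch volatile data such as the price from the database using the stored key.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HybridProductSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "products" core holding only the searchable/facetable fields.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

        SolrQuery q = new SolrQuery("name:laptop");
        q.addFacetField("category");   // "category" lives in Solr because we facet on it
        q.setFields("id", "name");     // keep the index lean: only what search needs

        QueryResponse rsp = solr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            String productId = (String) doc.getFieldValue("id");
            // Volatile data such as the current price comes fresh from the database at this point,
            // e.g. productDao.findPrice(productId) -- a hypothetical DAO call, not part of Solr.
            System.out.println(productId + " : " + doc.getFieldValue("name"));
        }
        solr.close();
    }
}
```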
I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such task with Lucene library and access the Solr index directly (with less overhead).
If needed, I can shutdown the core, run my code and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear whether someone has already done such a thing and what the major pitfalls are along the way.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analyzing, merging index segments, etc. Yet the specified number of documents isn't anything major and should not take days (have you tried updating Solr with multiple threads?).
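For example, a sketch of such a bulk rewrite with SolrJ's concurrent client (the URL, core, field names and sizes are assumptions, and it requires a reasonably recent SolrJ):

```java
import java.util.Map;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkFieldSetter {
    public static void run(Iterable<Map<String, Object>> allDocuments) throws Exception {
        // Buffers documents and flushes them with several background threads.
        ConcurrentUpdateSolrClient solr =
            new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycore")
                .withQueueSize(10_000)
                .withThreadCount(4)
                .build();

        for (Map<String, Object> source : allDocuments) {   // whatever your source of record is
            SolrInputDocument doc = new SolrInputDocument();
            source.forEach(doc::addField);                  // re-add every existing field...
            doc.setField("my_new_field", "value");          // ...plus the field you want to add/set
            solr.add(doc);                                  // the old doc with the same id is replaced
        }
        solr.commit();
        solr.close();
    }
}
```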
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.
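For reference, a minimal sketch of what that could look like in the schema (field names are examples):

```xml
<!-- schema.xml sketch. Values for an ExternalFileField live outside the index, in a file
     named external_popularity in the core's data directory, one "id=value" line per document;
     that file can be replaced and reloaded without reindexing. -->
<fieldType name="externalFile" class="solr.ExternalFileField" keyField="id" defVal="0"
           stored="false" indexed="false"/>
<field name="popularity" type="externalFile"/>
```

The main limitation is that such a field can only be used in function queries (sorting, boosting), not searched or returned like a normal field.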
I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present, there's just a quick interim solution in place which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein distance, removes stop words, does character replacements, etc. Clearly this will slow down as the volume of data increases, so it's not viable to keep: it is effectively a select * from x, y, z followed by mining through the data.
The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to send the result off to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword had been found in the title, for example.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar, so it's unlikely that people will be able to suggest the 'right' one without inciting discussion or debate, but for the record: from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone, so if anyone has any advice they'd like to share, or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted I haven't done as extensive reading on this as I would have typically because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you to find relative information, e.g. if you run a DVD rent shop and user has misspelled a movie name, Lucene will figure this out for you and (unlike with DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
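A small sketch of that idea (the core and field names are made up): store the database key alongside the indexed text and let a fuzzy query absorb the misspelling.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class MovieSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/movies").build();

        SolrQuery q = new SolrQuery("title:shawshenk~2");   // fuzzy: tolerates up to 2 edits
        q.setFields("id", "db_id", "title");                // db_id was stored at index time

        for (SolrDocument doc : solr.query(q).getResults()) {
            // db_id points straight back at the database row your own models know how to load
            System.out.println(doc.getFieldValue("title") + " -> row " + doc.getFieldValue("db_id"));
        }
        solr.close();
    }
}
```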
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?
Lucene is a low level library and will have to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server. You can call it from as many clients as you want. More scaling options explained here.
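On the virtualhost point: one Solr server can be shared by many applications, each pointing at its own core over HTTP, much like separate databases on a single DBMS. A minimal sketch (URLs and core names are illustrative):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SearchClients {
    // One shared Solr server; each virtual host / module talks to its own core over HTTP.
    static final SolrClient SITE_A =
        new HttpSolrClient.Builder("http://solr.internal:8983/solr/site_a").build();
    static final SolrClient SITE_B =
        new HttpSolrClient.Builder("http://solr.internal:8983/solr/site_b").build();
}
```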
Currently I am using Thinking Sphinx for search. Now I'm considering using Sunspot or Tire because they automatically index new content.
Are there any performance differences between the two? Is there anything else I should be concerned with?
Obviously the first difference is that you want to decide which search engine you think is best for your purposes: SOLR or Elasticsearch. We're using SOLR via Sunspot right now, but we're thinking seriously about moving to Elasticsearch because it feels like a better match for the sorts of web app functionality we want. It was incredibly easy to set up Tire, install the attachments plugin, and get search operating against data both in the database and in PDF attachments, with highlighting (now working thanks to another answer here on SO). Also, from a development/debugging point of view being able to use curl to test queries and see results is just great.
From the point of view of coding in a Rails app, you're right that both Sunspot and Tire are very similar. They both use the idea of a searchable/mapping block that defines what fields to index and how, and then performing a search is quite similar. As far as performance goes, I might give a bit of advantage to Tire, partly because the way it paginates and indexes in bulk is pretty slick (via the rake tire:import task). The ability in tire to control the indexing contents via to_json is very flexible as well.
Ultimately I think probably Sunspot and Tire are close enough that the choice between SOLR vs Elasticsearch is where you'll really end up making your decision.