Solr indexing time vs performance - solr

I am new to Solr and am using it in a project with a large number of products, each with many properties, so indexing takes a very long time. If I don't index all the properties, the remaining ones have to be fetched with a separate database hit. But doesn't that defeat the purpose of Solr? If we hit the database anyway, doesn't that make the query slower? Which is the right approach: indexing all properties, or fetching the remaining properties from the database?

A hybrid approach is not necessarily evil. The choice depends on which search features and services you want to offer your users. For instance:
if you want to facet on a "category" field, that field has to be in Solr
if you need some data in real time (e.g. price), I would get it from the database
In general you should experiment, because all your concerns make sense, but my suggestion is: don't optimize prematurely. Write down your search and view requirements, and from those work out a good compromise between the two extremes (only Solr / only database).
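A minimal sketch of the hybrid approach described above, assuming Solr returns only the indexed fields plus a product id, and the live price comes from the database in one extra query per result page. The `fake_solr_search` function, the `products` table and its columns are hypothetical stand-ins, not part of the original question.

```python
import sqlite3


def fake_solr_search(query):
    """Stand-in for a Solr response: only indexed fields come back."""
    return [
        {"id": 1, "name": "Cordless drill", "category": "tools"},
        {"id": 2, "name": "Drill bit set", "category": "tools"},
    ]


def attach_live_prices(hits, conn):
    """Fetch prices for all hits in a single query and merge them in."""
    ids = [hit["id"] for hit in hits]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, price FROM products WHERE id IN ({placeholders})", ids
    ).fetchall()
    price_by_id = dict(rows)
    # Keep Solr's ranking order; only add the real-time field.
    return [{**hit, "price": price_by_id.get(hit["id"])} for hit in hits]


# Hypothetical database holding the frequently changing data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [(1, 89.5), (2, 12.0)])

results = attach_live_prices(fake_solr_search("drill"), conn)
```

The database is hit once per page of results rather than once per product, so the extra cost stays small compared with scanning the database for the search itself.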

Related

About PATH in FTS alfresco queries

I'm using Alfresco 4.1.6 and SOLR 1.4.
For search, I use fts_alfresco_language and the searchService.query method.
In my query I search by PATH, TYPE and some custom properties such as direction, telephone, mail, or similar.
I now have over 2 million documents, and search performance is noticeably worse than it was at the beginning.
I read that with Solr 1.4, using PATH in the query is a bad idea, and that it is better to avoid it and use only TYPE and property key/value pairs.
But I have two questions:
Why does PATH increase the response time? Shouldn't it help? I have over 1000 main folders at the root of the repository. If I specify the folder that Solr should search, why doesn't that narrow the results, instead of giving me a worse response time than when I don't specify it? Or is there another way to tell Solr the main folder, so that it reduces the results before running the rest of the query?
When I search by custom properties, I use 3 or 4 properties, all indexed. Do these combined lookups have a higher overhead than a single one? Would it be better to search by only one property instead of three? Or to use ORs instead of ANDs to get results faster? How does Solr handle this?
Thanks!
First, let me start with this: I'm not sure what you want from this question, because it's vague. You're not asking how to make your query better; you're asking why a bad practice (bad for performance) is performing badly for you.
Do some research on how to structure your ECM system. The first thing that makes an ECM system any good is a proper content model. There are books out there which will help you.
If you're structuring your content with folders (PATH) and these are important to you, then you need to add them as metadata to your content. If you haven't done that, start there.
With a good content model you can find content wherever it is placed within your ECM system.
Sure, it's easy to migrate a filesystem to an ECM system and just leave it there, but then you've done only half the work.
Path queries are slow in general because they use a loop pattern, which is expensive. This has been greatly improved in the new SOLR, but it still isn't as fast as normal metadata querying.
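As a rough sketch of what querying by metadata instead of PATH could look like, the helper below builds a query string from a TYPE clause and property clauses only. The type `my:document` and the properties `my:rootFolder` and `my:telephone` are hypothetical names for a custom content model; the exact property syntax should be checked against the Alfresco FTS documentation for your version.

```python
def fts_term(field, value):
    """Format one field:"value" clause, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def build_query(content_type, properties):
    """AND together a TYPE clause and one clause per property (no PATH)."""
    clauses = [fts_term("TYPE", content_type)]
    clauses += [fts_term(name, value) for name, value in properties.items()]
    return " AND ".join(clauses)


# The folder is stored as an ordinary property (hypothetical my:rootFolder)
# instead of being expressed as a PATH clause.
query = build_query(
    "my:document",
    {"my:rootFolder": "Clients", "my:telephone": "555-0100"},
)
```

The resulting string would then be passed to searchService.query with the fts-alfresco language, as in the question.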

Spatial Search Objectify, appengine

I want to use Objectify for spatial search. I have entities with a latitude and longitude associated with them, and that location information is dynamic, e.g. service providers (electricians, carpenters) in a city. I want to implement a query that returns the providers of a specific service within a 1 km radius. Searching on Google turns up the following options:
Use Objectify with geohashes - not sure how accurate and scalable this solution is
Use Google Search - entities (or parts of them) would need to be duplicated as documents, and I'm not sure it can support dynamically updated locations
Use another database, like MongoDB
Assuming a few million entities with dynamically updated latitude/longitude, please suggest an appropriate option.
thanks
Ittium
I've used geohashes. It works, although you end up selecting more data than the exact bounds you are looking for and then filtering out the extra. This might or might not be a good solution depending on your specific application. It requires writing more code but has fewer moving parts (all in the datastore).
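A minimal sketch of that "over-select, then filter" pattern, written in plain Python rather than against the datastore. It assumes each provider stores a geohash computed at write time; the prefix match stands in for the datastore query, and the haversine check removes the extra results. The provider dictionaries and function names are illustrative assumptions.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard geohash: interleave lon/lat bits, 5 bits per character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    value, bit_count, use_lon = 0, 0, True
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid
        else:
            value = value << 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[value])
            value, bit_count = 0, 0
    return "".join(chars)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def make_provider(name, lat, lon):
    """Compute the geohash once, when the location is written."""
    return {"name": name, "lat": lat, "lon": lon, "geohash": geohash_encode(lat, lon, 9)}


def nearby(providers, lat, lon, radius_km=1.0, prefix_len=5):
    """Coarse prefix match first, then exact distance filtering."""
    prefix = geohash_encode(lat, lon, prefix_len)
    candidates = [p for p in providers if p["geohash"].startswith(prefix)]
    return [
        p for p in candidates
        if haversine_km(lat, lon, p["lat"], p["lon"]) <= radius_km
    ]
```

This single-prefix version misses providers that sit just across a cell boundary, so a real implementation would normally also query the neighbouring cells before applying the distance filter.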
Google search and "other database" are basically the same architectural pattern: use the task queue to replicate updates to an external index. If you want a quick solution, the search service is probably the easiest to wrap your head around.
Just pick one solution and run with it for a while. You can always reindex the data into a different solution.
It really depends on your query rate, but I usually prefer to use Google search. Building and maintaining the documents is pretty simple, and you get a separate quota to handle these queries.

How can I relate search indexes to models in MVC?

I have an MVC application which I need to be able to search. The application is modular so it needs to be easy for modules to register data to index with the search module.
At present, there's just a quick interim solution in place which is fine for flexibility, but speed was always going to be a problem. Modules register models (and relationships and columns) which they'd like to be searchable. Upon search, the search functionality queries data using those relationships and applies Levenshtein, removes stop words, does character replacements etc. Clearly this will slow down as the volume of data increases so it's not viable to keep as it is effectively select * from x,y,z and then mine through the data.
The benefit of the above is that there is a direct relation to the model which found the data. For example, if Model_Product finds something, I know that in my code I can use Model_Product::url() to link the result to the relevant location, or Model_Product::find(other data) to show, say, the image or description if the keyword was found in the title.
Another benefit of the above is it's already database specific, and therefore can just be thrown up onto a virtualhost and it works.
I have read about the various options, and they all seem very similar so it's unlikely that people are going to be able to suggest the 'right' one without inciting discussion or debate, but for the record; from the following options, Solr seems to be the one I'm leaning toward. I'm not set in stone so if anyone has any advice they'd like to share or other options I could look at, that'd be great.
Sphinx
Lucene
Solr - appears to just run Lucene as a service?
Xapian
ElasticSearch
Looking through various tutorials and guides they all seem relatively easy to set up and configure. In the case above I can have modules register the path of config files/search index models and have the searcher run them all through search program x. This will build my indexes, and provide the means by which to query data. Fine.
What I don't understand is how any of these indexes relate to my other code. If I index data, search and in turn find a result with, say, Solr, how do I know how to get all of the other information related to the bit it found?
Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost? This is something which I can't seem to find much information on. I would assume that I can just connect to a single instance and tell it what data is relevant? Much like connecting to a single DBMS server, with credentials x to database y.
Granted I haven't done as extensive reading on this as I would have typically because I'm a bit stuck in terms of direction at the moment and I'd rather not read everything about everything in favour of seeking some advice from those who know before I take a particular route.
Edit: This question seems to have swayed me more towards Solr. There's also a similar thread here with a fair amount of insight into Sphinx.
DISCLAIMER: I can only speak about Lucene/Solr and, I believe, ElasticSearch as I know it is based on Lucene. Others might or might not work in the same way.
"If I index data, search and in turn find a result with say Solr, how do I know how to get all of the other information related to the bit it found?"
You can store any extra data you want, e.g. a database key pointing to a particular row in the database. Lucene/Solr can also help you find related information, e.g. if you run a DVD rental shop and a user has misspelled a movie name, Lucene will figure this out for you and (unlike a DB) still list the closest alternatives. You can also provide hints by boosting certain fields during indexing or querying. There are special extensions for geospatial search, etc. And obviously you can provide your own if you need to.
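One way to picture the "store a database key" idea is the sketch below: each indexed document carries the name of the model that registered it plus the row's primary key, and a search hit is resolved back through a registry. The `Product` class, the `model` and `db_id` field names and the in-memory rows are hypothetical; the original application uses its own Model_Product class.

```python
class Product:
    """Hypothetical model standing in for the application's Model_Product."""

    _rows = {7: {"title": "Blade Runner", "slug": "blade-runner"}}

    @classmethod
    def find(cls, pk):
        return cls._rows[pk]

    @classmethod
    def url(cls, pk):
        return "/products/" + cls._rows[pk]["slug"]


# Modules register their models under a short name.
MODELS = {"product": Product}


def to_search_doc(model_name, pk, text):
    """Document sent to the index: searchable text plus the way back."""
    return {
        "id": f"{model_name}:{pk}",  # unique across all models
        "model": model_name,
        "db_id": pk,
        "text": text,
    }


def resolve(hit):
    """Turn a search hit back into the full record and its URL."""
    model = MODELS[hit["model"]]
    return model.find(hit["db_id"]), model.url(hit["db_id"])


doc = to_search_doc("product", 7, "Blade Runner")
record, link = resolve(doc)
```

The index only needs the searchable text and these two stored fields; everything else about the record can still come from the model.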
"Also is someone able to confirm whether or not I will need to have an instance of any of the above per virtualhost?"
Lucene is a low-level library and has to be present in every JVM you run. Solr (built on top of Lucene) is an HTTP server, so you can call it from as many clients as you want. More scaling options are explained here.

Multiple index locations Solr

I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufacturers, and each manufacturer has a different catalog per country. Each catalog for each manufacturer per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacturer per country and have some way to tell Solr in the URL which index to search.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multi-core setup. This keeps each index physically separated from the others, and allows you to use different properties (language, etc.) and configuration for each core. This might be practical, but it will add quite a bit of overhead if you plan on searching through all the cores at the same time. It'll be easy to move the different cores onto different servers later: simply spin the cores up on a different server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.
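To make the two options concrete, here is a small sketch that only builds the request URLs: one core per catalog (the core name appears in the URL path) versus a single core narrowed with filter queries on the two extra fields. The host, the core names `acme_de` and `catalog`, and the field names `manufacturer` and `country` are assumptions for illustration.

```python
from urllib.parse import urlencode


def select_url(base, core, q, filters=None):
    """Build a Solr select URL; each filter becomes one fq parameter."""
    params = [("q", q), ("wt", "json")]
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base}/{core}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr"  # assumed local Solr instance

# Option 1: one core per manufacturer and country, chosen in the URL path.
per_core = select_url(BASE, "acme_de", "brake pad")

# Option 2: a single core, restricted with filter queries.
single_core = select_url(
    BASE, "catalog", "brake pad",
    {"manufacturer": "acme", "country": "de"},
)
```

With the single-core layout, dropping the filters searches every catalog at once, which is the main convenience of that option.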

Is it feasible to use SOLR as non-relational database when plugged in with Zoie?

The title says it all. We have roughly 10 million items that change constantly. Each has roughly 30 - 50 fields that we may search on. Some of the items might have a few additional fields which are not generic enough to apply to all items, but in general this is something we could live with.
I understand that Zoie is a solution for near-real-time search and indexing based on Lucene. It also has a plugin for SOLR. Upon first try, we had some minor issues, but overall I don't see any problem with using it as a non-relational database solution. Of course, we have to sacrifice many features, e.g. constraints, transactions, unique key generation, etc. But compared with what we get, these can also be compensated or patched, one way or another.
So, I guess my question is: has anyone run into real problems using such a combination as a database in a serious app?
Does "constantly" mean a real-time scenario? What are the requirements?
Take a look at: http://wiki.apache.org/solr/NearRealtimeSearchTuning
I would also take a look at Solandra and ElasticSearch!
