What is the best practice to implement Solr in e-commerce applications?

I am a new user of Solr. I am working on an e-commerce web application that has a SQL database. I want to implement Solr for the "category page" of the application, where we show the products of that category with specific information such as available stock, price and a few more details. We also want to restrict which products are displayed based on stock availability: if there is no stock, we won't display those products.
I am trying to implement Solr with delta-import queries to make my category pages faster. My concern is the performance of the page while getting data from Solr, and the accuracy of real-time data such as stock and price.
My database queries to get product data are a little complicated and have several joins, so I have to model several related entities in Solr. Because of this, uploading data to Solr is slow, which makes it difficult to upload data frequently (even with delta-import queries), so my application lacks real-time data such as product stock.
Basically I want to know the best-practice approach to implementing Solr. I am confused between:
1. Should I export all my data to Solr and then get all the details from Solr?
(I am worried about performance and real-time data.)
2. Should I get only index data (say, product IDs) from Solr and then get the other details from my SQL database? (I am not sure about the performance of this approach.)
So please help me and suggest how I can implement Solr for my app in the best way.
All help is appreciated!

One good practice I encountered while developing an e-commerce solution with Solr as the search provider is to retrieve only the IDs from Solr and get the data from SQL Server.
I created a single schema that was updated every time new products were added to the database. (In my case products were added by a user in an admin console; in your case you can use @Mauricio Scheffer's comment to get the latest updates.)
One field in the schema was the ID, representing the ID of the product in the database.
After querying Solr I received N documents matching the query, and with the ID field I fetched all the information from my database and displayed it to the user.
So the good part is that the user will always get the data from your database (real-time data) and will have the search results displayed very fast because of Solr.
You can add to the schema different fields that you can use to filter your results, such as:
category
date
price
size
etc...
And also different fields that you need to query on:
headline
description
name
etc...
And also add the product ID (a small indexing sketch follows below).
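For illustration only (this is not from the original answer), a minimal SolrJ sketch of how such a product document might be populated; the core URL and field names are assumptions, not the poster's actual schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ProductIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URL; field names mirror the lists above.
            SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", 12345L);                  // database primary key
            doc.addField("category", "shoes");           // filter fields
            doc.addField("date", "2015-06-01T00:00:00Z");
            doc.addField("price", 59.99);
            doc.addField("size", "42");
            doc.addField("headline", "Lightweight running shoe");   // query fields
            doc.addField("name", "Runner 3000");
            doc.addField("description", "A lightweight shoe for daily training.");

            solr.add(doc);
            solr.commit();
            solr.close();
        }
    }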
So after making the query to Solr you have a list of product IDs; now the only thing you need to do is implement a function that fetches the products from the database by those IDs and displays them on the search results page.
The performance of this approach is quite good, because selecting rows from the database by primary key is the fastest way to retrieve data from SQL.
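A rough sketch of that two-step flow with SolrJ and JDBC; the core URL, connection string, table and field names are made up, and the out-of-stock filter is applied on the database side because that is where the stock figure is always current:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class CategoryPageSearch {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

            // 1. Ask Solr only for the matching product IDs.
            SolrQuery q = new SolrQuery("category:shoes");
            q.setFields("id");
            q.setRows(50);

            List<Long> ids = new ArrayList<>();
            for (SolrDocument doc : solr.query(q).getResults()) {
                ids.add(Long.valueOf(doc.getFieldValue("id").toString()));
            }
            solr.close();

            // 2. Load the real-time rows by primary key; out-of-stock products
            //    are filtered here, where the data is always fresh.
            try (Connection db = DriverManager.getConnection(
                         "jdbc:sqlserver://localhost;databaseName=shop");
                 PreparedStatement ps = db.prepareStatement(
                         "SELECT id, name, price, stock FROM products WHERE id = ? AND stock > 0")) {
                for (Long id : ids) {
                    ps.setLong(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            System.out.printf("%d  %s  %.2f  (%d in stock)%n",
                                    rs.getLong("id"), rs.getString("name"),
                                    rs.getDouble("price"), rs.getInt("stock"));
                        }
                    }
                }
            }
        }
    }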
I worked on a website with 1,000,000 products and the search time was always under 0.2 seconds,
and the page load time in the browser was under 0.6 seconds for single-user queries. (As I remember, the server running Solr and SQL had 8 GB of RAM and a 1.8 GHz quad core.)
Hope this type of implementation is useful for you.

You can set up a database trigger or data access layer event to send data to Solr whenever there's a change, and configure autoCommit to control freshness.
See also:
http://wiki.apache.org/solr/NearRealtimeSearch
http://java.dzone.com/articles/tuning-solr-near-real-time
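A minimal sketch of what such a data-access-layer hook could look like with SolrJ; the class, method and field names are hypothetical, and commitWithin is used here as one way to control freshness:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ProductDao {

        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

        // Called by the data access layer after a product row is inserted or updated.
        public void onProductSaved(long id, String name, double price, int stock) throws Exception {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", id);
            doc.addField("name", name);
            doc.addField("price", price);
            doc.addField("stock", stock);
            // Ask Solr to make the change visible within 1 second at the latest;
            // alternatively configure autoCommit/autoSoftCommit in solrconfig.xml.
            solr.add(doc, 1000);
        }

        // Called after a product row is deleted.
        public void onProductDeleted(long id) throws Exception {
            solr.deleteById(String.valueOf(id));
        }
    }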

Related

Full Text Search in ERP application - is Apache Lucene/Solr the right choice?

I'm currently investigating the tools necessary to add fast, full-text search to our ERP SaaS application, with the aim of providing a single search entry point in the application that can search over the many different kinds of objects that compose the domain of the software.
The application (a Spring Java web application) is backed by a SQL Server RDBMS (using Hibernate as the ORM); there are hundreds of different tables, dozens of which (but maybe more) should be searchable (usually there are one or more varchar columns in every table that should be indexed/searched).
Think, for example, of a single search bar where I can search customers, contracts, employees, articles, and so on; this data is also updated very often (new inserts, deletes, updates...).
I found this article (www.chrisumbel.com/article/lucene_solr_sql_server) that shows how to connect a SQL Server DB with Solr, posting an example query on the database that extracts the data used by Solr during the data import.
Since we have dozens (and more) of tables containing the searchable data, does that mean the first step is to integrate all the SQL queries that extract this data with Solr, in order to build the index?
Second question: not all the data is searchable by everyone (permissions and ad hoc filters), so how could we complement the full-text search provided by Solr with the more complex queries (joins on other tables, for example) that this data needs?
Thanks
You are nearly asking for a full-blown consulting project :-) But a few suggestions are possible.
Define Search Result Types: Search engines use denormalized data, i.e. you won't do any joins while querying (if you think you do, stick to your DB :-) That means you need to do the necessary joins while filling the index. This defines what you can search for. Most people "just" index documents or log lines, so there is just one type of result. Sometimes people's profiles are included, sometimes a distinction is made between results from the different source systems the documents come from, but in the end there is a limited number of types of search results. And even then, they are nevertheless indexed into one and the same schema (schemas are very malleable for search engines).
Index: You know the SQL statements that extract your data. Converting the rows to JSON and shoveling them into a search engine is not difficult. One thing to watch out for: your DB keeps changing while you index, and whether you crawl incrementally or do full re-indexes depends on how much logic you want to add. The trickiest part is getting deletes on the DB side into the index. Once a row is gone, it's gone: how do you know there was something that needs to be purged from the index? :-)
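Purely as an illustration of the delete problem, one common workaround is to diff the primary keys in the DB against the ids in the index; a rough SolrJ sketch, assuming the Solr document id equals the DB primary key and using made-up core and table names:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class IndexCleaner {

        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/erp").build();

        // Remove documents whose source rows no longer exist in the database.
        public void purgeDeletedRows(Connection db) throws Exception {
            // 1. Collect all primary keys still present in the DB (table name is made up).
            Set<String> dbIds = new HashSet<>();
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id FROM customers")) {
                while (rs.next()) {
                    dbIds.add(rs.getString("id"));
                }
            }

            // 2. Walk the indexed ids and delete whatever the DB no longer knows about.
            SolrQuery q = new SolrQuery("*:*");
            q.setFields("id");
            q.setRows(10000);   // for large indexes use cursorMark paging instead
            List<String> toDelete = new ArrayList<>();
            for (SolrDocument doc : solr.query(q).getResults()) {
                String id = doc.getFieldValue("id").toString();
                if (!dbIds.contains(id)) {
                    toDelete.add(id);
                }
            }
            if (!toDelete.isEmpty()) {
                solr.deleteById(toDelete);
                solr.commit();
            }
        }
    }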
Secure Search: Since you don't really join, applying access rights at query time requires two steps. During indexing, write the principal (group, user) names of those who may read a document into each search result. At query time, get the user ID and expand it, recursively, to get all the groups of the user. Add this as a query filter. Make sure to cache the filter, or even pre-compute it for all users quite regularly and store it in a fast store (the search index is one place, a DB would do too :-) Obviously you need to re-index if access rights change. The good thing is: as long as things only change in LDAP/AD, you don't need to re-index the data, only the expanded groups of the affected users.
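A minimal sketch of the query-time half, assuming a multi-valued "acl" field was written at index time and the user's groups have already been expanded (the core URL and field name are assumptions):

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class SecureSearch {

        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/erp").build();

        // userAndGroups = the user id plus all groups obtained by recursive expansion.
        public QueryResponse search(String userQuery, List<String> userAndGroups) throws Exception {
            StringBuilder fq = new StringBuilder("acl:(");
            for (int i = 0; i < userAndGroups.size(); i++) {
                if (i > 0) {
                    fq.append(" OR ");
                }
                // Escape principal names so they cannot break out of the filter query.
                fq.append(ClientUtils.escapeQueryChars(userAndGroups.get(i)));
            }
            fq.append(")");

            SolrQuery q = new SolrQuery(userQuery);
            q.addFilterQuery(fq.toString());   // only documents the caller may read
            return solr.query(q);
        }
    }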
Ad Hoc Filters: If you want to filter on X, put X as a field into the index. At query time, apply the filter.

What is a good web application SQL Server data mart implementation in ElasticSearch?

Coming from an RDBMS background and trying to wrap my head around Elasticsearch data storage patterns...
Currently in SQL Server we have a star-schema data mart, RecordData. Rows are organized by user ID, a geographic location that pertains to the rest of the searchable record, and a title and description (which are free-text search fields).
I would like to move this over to ElasticSearch, and have read about creating a separate index per user. If I understand this correctly, with this suggestion, I would be creating a RecordData type in each user index, correct? What is a recommended naming convention for user indices that will be simple for Kibana analysis?
One issue I have with this recommendation is: how would you organize multiple web applications on the ES server? You wouldn't want all those user indices all over the place.
Is it so bad to have one index per application and one type per SQL Server table?
Since in SQL Server we have other tables for user configuration, keyed by user ID, I take it that I could then create new ES types in the user indices for configuration. Is this a recommended pattern? I would rather not have two database systems for this web application.
Suggestions welcome, thank you.
I went through the same thing, and there are a few things to take into account.
Data Modeling
You say you use a star schema today. Elasticsearch is typically appropriate for denormalized data, where the totality of the information resides in each document, unlike with a star schema. If you can live with denormalization, that is fine, but I assume that since you already have a star schema, denormalized data is not an option, because you don't want to go and update millions of documents each time the location name changes, for example (if I understand the use case). At least in my use case that wasn't an option.
What are Elasticsearch options for normalized data?
This leads us to think about how to put star-schema-like data into a system like Elasticsearch. There are a few options in the documentation; the main ones I focused on were:
Nested Objects - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html. With nested objects the entire information is kept in a single document, meaning one location and its related users would be in a single document. That may make it suboptimal, because the document will be huge and, again, a change to the location name will require updating the entire document. So this is better, but still not optimal.
Parent-Child Relationship - more details at https://www.elastic.co/guide/en/elasticsearch/guide/current/parent-child.html. In this case the location and the user records would be kept as separate document types (within the same index), similarly to tables in a relational database. This seems to be the right modeling for what we need. The only major issue with this option is that Kibana 4 does not provide ways to manipulate/aggregate documents based on the parent/child relationship as of this writing. So if your main driver for using Elasticsearch is Kibana (this was mine), that pretty much eliminates the option. If you want to benefit from Elasticsearch's speed as an engine, this seems to be the desired option for your use case (a rough mapping sketch follows below).
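For illustration only, a plain-Java sketch of creating an index with a parent/child mapping over the REST API; the index, type and field names are made up, and the mapping syntax is the one used by the pre-5.x Elasticsearch releases that Kibana 4 targets:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class CreateParentChildIndex {
        public static void main(String[] args) throws Exception {
            // Parent type "location", child type "recorddata"; the child mapping
            // declares its parent via the _parent meta-field.
            String body = "{"
                    + "\"mappings\": {"
                    + "  \"location\": {"
                    + "    \"properties\": { \"name\": { \"type\": \"string\" } }"
                    + "  },"
                    + "  \"recorddata\": {"
                    + "    \"_parent\": { \"type\": \"location\" },"
                    + "    \"properties\": {"
                    + "      \"user_id\":     { \"type\": \"string\", \"index\": \"not_analyzed\" },"
                    + "      \"title\":       { \"type\": \"string\" },"
                    + "      \"description\": { \"type\": \"string\" }"
                    + "    }"
                    + "  }"
                    + "}"
                    + "}";

            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://localhost:9200/records").openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes(StandardCharsets.UTF_8));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }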
In my opinion, once you get the data modeling right, all of your questions will be easier to answer.
Regarding the organization of the servers themselves, the way we organize it is by having a separate cluster of 3 Elasticsearch nodes behind a load balancer (all of it hosted in a cloud) and then having all the web applications connect to that cluster using the Elasticsearch API.
Hope that helps.

Is it possible to retrieve Hbase data along with Solr data?

I have a pipeline of HBase, Lily, Solr and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, except that I cannot view all the required data, since I do not have all the fields from HBase stored in Solr. I am not planning on storing all of the data either.
So is there a way of retrieving those fields from HBase along with the Solr response, for visualizing the data in Hue?
From what I know, I believe it is possible to set up a Solr search handler to do this, but I have not been able to find a concrete example to help me understand it better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question. But I am unable to comment there for further information.
Current solution, thanks to the suggestion by Romain:
Used an HTML widget to provide a link from each record on the Hue search page back to the HBase record in the HBase Browser.
One approach is to fetch the required IDs from Solr and then get the actual data from HBase. Solr gives you the count based on your query and also some faceting features; once those are fetched, you always have the full data in HBase. Solr is best for index search, so given the speed/space trade-off this design can help. Another main reason is that HBase gives you good fetch times for an entire row when you look it up by row key, so the overall performance also depends on your HBase row key design.
I think you are using the Lily HBase Indexer, if I am not wrong, so by default the doc id is the HBase row key, which might make things easier.
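A rough sketch of that flow with SolrJ and the HBase Java client, assuming the Lily indexer default where the Solr document id is the HBase row key; the core URL, table, column family and column names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class SolrHBaseLookup {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/records").build();
            Configuration conf = HBaseConfiguration.create();

            try (Connection hbase = ConnectionFactory.createConnection(conf);
                 Table table = hbase.getTable(TableName.valueOf("records"))) {

                // 1. Search Solr; only the id (= HBase row key) is needed.
                SolrQuery q = new SolrQuery("some search terms");
                q.setFields("id");
                for (SolrDocument doc : solr.query(q).getResults()) {
                    String rowKey = doc.getFieldValue("id").toString();

                    // 2. Fetch the full row, including fields never stored in Solr.
                    Result row = table.get(new Get(Bytes.toBytes(rowKey)));
                    byte[] details = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("details"));
                    System.out.println(rowKey + " -> "
                            + (details == null ? "" : Bytes.toString(details)));
                }
            }
            solr.close();
        }
    }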

Fast, reliable nosql db for indexing single docs but retrieve all at once

I need a document-based database where I can update single documents, but when I fetch them I will get all documents at once.
In my app there is no need to search, and there will be up to 100k documents.
The DB should be hosted on a dedicated machine or VM and have a web API.
What is the best choice?
This is probably not such a smart thing to do. If you only update one doc, why would you want to retrieve the other 99,999 documents when you know they didn't change?
Unless you are making a computation based on them, in which case you probably want the DB to do that for you and fetch only the result?

Solr schema and workflow

I am thinking of implementing Solr for our e-commerce project. What I am trying to achieve is to allow users to search within the product title and to display a results page showing the product title, product price, availability and a short description. What I am wondering is: do I store the information I am going to show on the search results page in Solr (i.e. price, description), or do I just get the product IDs from Solr and retrieve the products from the DB?
You can do both.
It may be easier to implement the solution if you store the values in Solr and simply retrieve them with the search results (although the other option is also very simple to implement).
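A minimal sketch of the first option with SolrJ, assuming the title, price, availability and description are stored fields in your schema (the core URL and field names are illustrative):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class TitleSearch {
        public static void main(String[] args) throws Exception {
            SolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

            SolrQuery q = new SolrQuery("title:shoes");   // search within the product title
            // Stored fields come straight back with the results, no DB round trip.
            q.setFields("title", "price", "availability", "description");

            for (SolrDocument doc : solr.query(q).getResults()) {
                System.out.printf("%s | %s | %s%n",
                        doc.getFieldValue("title"),
                        doc.getFieldValue("price"),
                        doc.getFieldValue("availability"));
            }
            solr.close();
        }
    }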
Whichever way you choose, you will have to think about synchronization between your DB and Solr, so check out these:
http://wiki.apache.org/solr/DataImportHandler#Scheduling
http://wiki.apache.org/solr/NearRealtimeSearch
