I am thinking of implementing Solr for our e-commerce project. What I am trying to achieve is to allow users to search within product titles and display a result page showing the product title, price, availability, and a short description. What I am wondering is: do I store the information I am going to show on the search result page in Solr (i.e., price, description), or do I just get the product ID from Solr and retrieve the products from the DB?
You can do both.
It may be easier for you to implement the solution if you store the values in Solr and simply retrieve them with the search results (although the other option is also very simple to implement; see the sketch after the links below).
Whichever way you choose you'll have to think about synchronization between your db and Solr, so check out these:
http://wiki.apache.org/solr/DataImportHandler#Scheduling
http://wiki.apache.org/solr/NearRealtimeSearch
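For the first option, a minimal sketch of the query side in Python (the core name "products", the field names, and the localhost URL are all assumptions, not your actual schema):

import requests

# Ask Solr for the stored display fields directly; the fl parameter returns
# them with each hit, so no extra DB round trip is needed to render the page.
resp = requests.get(
    "http://localhost:8983/solr/products/select",
    params={
        "q": "title:shoes",
        "fl": "id,title,price,availability,short_description",
        "wt": "json",
    },
)
for doc in resp.json()["response"]["docs"]:
    print(doc["title"], doc.get("price"), doc.get("availability"))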
I have a pipeline of HBase, Lily, Solr, and Hue set up for search and visualization. I am able to search the data indexed in Solr using Hue, except I cannot view all the required data, since I do not have all the fields from HBase stored in Solr. I do not plan to store all of the data in Solr either.
So is there a way of retrieving those fields from HBase along with the Solr response, for visualizing the data with Hue?
From what I know, I believe it is possible to set up a Solr SearchHandler to perform this, but I haven't been able to find a concrete example to help me understand better (I am very new to both Solr and HBase, so examples help).
My question is similar to this question, but I am unable to comment there for further information.
Current solution, thanks to a suggestion by Romain:
Used the HTML widget to provide a link from each record on the Hue search page back to the HBase record in the HBase Browser.
One approach is to fetch the required ID from Solr and then get the actual data from HBase. Solr gives you the count based on your query, along with faceting features, and you always have the full data in HBase. Solr is best for index search, so given the speed/space trade-off, this design can help. Another main reason is that HBase gives you good fetch times for an entire row when it is looked up by row key, so overall performance also depends on your HBase row-key design.
I think you are using the Lily HBase indexer, if I am not wrong; by default the doc ID is the HBase row key, which might make things easy.
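A rough sketch of that flow in Python, assuming the default Lily setup where the Solr doc ID equals the HBase row key (the collection, table, host, and query values are placeholders):

import happybase
import requests

# Step 1: ask Solr only for the doc IDs, which Lily maps to HBase row keys.
resp = requests.get(
    "http://localhost:8983/solr/collection1/select",
    params={"q": "text:searchterm", "fl": "id", "wt": "json"},
)
row_keys = [doc["id"] for doc in resp.json()["response"]["docs"]]

# Step 2: fetch the full rows from HBase; whole-row reads by key are fast.
connection = happybase.Connection("hbase-host")  # Thrift server assumed
table = connection.table("mytable")
rows = [table.row(key.encode("utf-8")) for key in row_keys]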
I'm trying to understand the concept of Documents on Google App Engine's Search API. The concept I'm having trouble with is the idea behind storing documents. So for example, say in my database I have this:
class Business(ndb.Model):
    name = ndb.StringProperty()        # property types assumed for illustration
    description = ndb.TextProperty()
For each business, I am storing a document so I can do full-text searches on the name and description.
My questions are:
Is this right? Does this mean we are essentially storing each entity TWICE, in two different places, just to make it searchable?
If the answer to above is yes, is there a better way to do it?
And again if the answer to number 1 is yes, where do the documents get stored? To the high-rep DS?
I just want to make sure I am thinking about this concept correctly. Storing entities in docs means I have to maintain each entity in two separate places... doesn't seem very optimal just to keep it searchable.
You have it worked out already.
Full Text Search Overview
The Search API allows your application to perform Google-like searches over structured data. You can search across several different types of data (plain text, HTML, atom, numbers, dates, and geographic locations). Searches return a sorted list of matching text. You can customize the sorting and presentation of results.
As you don't get to search "inside" the contents of the models in the datastore, the Search API provides the ability to do that for text and HTML.
So, to link a searchable text document (e.g. a product description) to a model in the datastore (e.g. that product's price), you have to make that link between the documents and the datastore objects they relate to "manually". The Search API and the datastore can also be used totally independently of each other, so you have to build that link in yourself; AFAIK there is no automatic linkage between them. A sketch of one way to do it follows.
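One common way to make that manual link is to set the search document's doc_id to the datastore entity's key ID; here is a minimal sketch under that assumption (the index name "businesses" is illustrative):

from google.appengine.api import search
from google.appengine.ext import ndb

def index_business(business):
    # Use the entity's key ID as doc_id so results map back to entities.
    doc = search.Document(
        doc_id=str(business.key.id()),
        fields=[
            search.TextField(name='name', value=business.name),
            search.TextField(name='description', value=business.description),
        ])
    search.Index(name='businesses').put(doc)

def search_businesses(query):
    results = search.Index(name='businesses').search(query)
    keys = [ndb.Key(Business, int(d.doc_id)) for d in results]
    return ndb.get_multi(keys)  # fetch the real entities from the datastore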
I have a set of pages crawled using Nutch, and I understand that these crawled pages are saved as segments. I want to extract certain key values from these pages and feed them to Solr as XML.
A sample situation is that I have crawled a shopping website with many product listings. I want to extract key info like the name, price, and specs of each product and ignore the rest of the data, so that I can provide Solr with some XML like:
<add>
  <doc>
    <field name="name">qwerty</field>
    <field name="price">123</field>
    <field name="specs">qwerty</field>
  </doc>
</add>
This is so that, using Solr, I can sort the different product listings by price.
Now, how can this extraction part be done? Does MapReduce come into the picture anywhere?
Turning raw web pages into information is not a trivial task. One tool used for this job is Boilerpipe. However, it won't give you a solution on a plate.
If you are working on a fixed target, you might just write your own procedural code to find the data you need. If you need to find this sort of thing in arbitrary HTML, you are facing a very hard problem with no off-the-shelf solutions.
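For the fixed-target case, a sketch of such procedural extraction with Python and BeautifulSoup (the CSS selectors are purely hypothetical; you would adapt them to the site's actual markup):

from bs4 import BeautifulSoup

def extract_product(html):
    # Hypothetical selectors for one specific site's product pages.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "specs": soup.select_one("div.specs").get_text(strip=True),
    }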
I am using Solr to index reports from a DB, and I have been successful in doing that. However, I also need to track user activity, to report whether a document has been read by a user or not. I am aware that Solr is not built to index or track user activity, but is there a good approach to going about this?
Any suggestions?
No, as you say, there is no support for this in Solr. From a Solr perspective it's more related to how you build your web application. I would recommend that you ask yourself this:
When tracking the reading statistics of my users do I need to index that information into Solr too?
The answer to this question depends on whether you need the information for faceting, searching, or the relevance model. Say, for example, you want a facet that allows your users to filter on read or unread documents; then of course you need to index this into Solr.
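In that case, one option is a Solr atomic update when a user opens a document; a sketch, assuming a core named "reports", a multivalued "read_by" field, and that your fields are stored (atomic updates require stored fields and an <updateLog> in solrconfig.xml):

import requests

def mark_as_read(doc_id, user_id):
    # Atomic update: only the "read_by" field changes on the document.
    payload = [{"id": doc_id, "read_by": {"add": user_id}}]
    requests.post(
        "http://localhost:8983/solr/reports/update?commit=true",
        json=payload,
    )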
If you only want to present whether or not a document has been read (in the web interface), you might as well store this information in a SQL database and fetch it when presenting the results.
I am a new user to Solr. I am working on an e-commerce web application that has a SQL database. I want to implement Solr for the "category page" in the application, where we will show the products of that category with specific information like available stock, price, and a few more details. We also want to restrict which products are displayed based on stock availability: if there is no stock, we won't display those products.
I am trying to implement Solr with delta-import queries to make my category pages faster. My concerns are the performance of the page while getting data from Solr, and the accuracy of real-time data like stock and price.
Actually, my database queries to get product data are a little complicated and have several joins, so I have to create several related entities in Solr. Because of this, the data upload to Solr is slow, which makes it difficult to upload data frequently (even with delta-import queries), so my application lacks real-time data like product stock.
Basically, I want to know the best-practice approach to implementing Solr. I am confused between:
1. Should I export all my data to Solr and then get all the details from Solr?
(I am worried about performance and real-time data.)
2. Should I get only index data (say, the IDs of products) from Solr and then get the other details from my SQL database? (I am not sure about this approach's performance.)
So please help me and suggest how I can implement Solr for my app in the best way.
All help is appreciated!!
One good practice I encountered while developing an e-commerce solution with Solr as the search provider is to retrieve only the IDs from Solr and get the data from the SQL server.
I created a single schema that was updated every time new products were added to the database. (In my case products were added by a user in an admin console; in your case you can use @Mauricio Scheffer's comment to get the latest updates.)
One field in the schema was the ID, representing the ID of the product in the database.
After querying Solr I would receive the N documents matching the query, and with the ID field I would get all the information from my database and display it to the user.
So the good part is that the user always gets the data from your database (real-time data), and has their search results displayed very fast because of Solr.
You can add different fields to the schema that you can use to filter your results, such as:
category
date
price
size
etc...
And also different fields that you need to query upon:
headline
description
name
etc...
And also add the product ID.
So after making the query to Solr you have a list of product IDs; now the only thing you need to do is implement a function that gets the products from the database based on those IDs and displays them on the search results page (a minimal sketch follows below).
This approach performs pretty well, because selecting elements from the database by primary key is the fastest way to retrieve data from SQL.
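A minimal sketch of that two-step lookup in Python (the core name, field names, and the SQLite database are illustrative assumptions; it also assumes the Solr IDs match the SQL primary keys):

import requests
import sqlite3

def search_products(query, limit=20):
    # Step 1: ask Solr only for the matching product IDs.
    resp = requests.get(
        "http://localhost:8983/solr/products/select",
        params={"q": query, "fl": "id", "rows": limit, "wt": "json"},
    )
    ids = [doc["id"] for doc in resp.json()["response"]["docs"]]
    if not ids:
        return []
    # Step 2: primary-key lookup in SQL is fast and returns real-time data.
    conn = sqlite3.connect("shop.db")
    placeholders = ",".join("?" * len(ids))
    cur = conn.execute(
        "SELECT id, name, price, stock FROM products WHERE id IN (%s)"
        % placeholders,
        ids,
    )
    by_id = {str(row[0]): row for row in cur.fetchall()}
    conn.close()
    # Preserve Solr's relevance ordering on the result page.
    return [by_id[i] for i in ids if i in by_id]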
I worked on a website with 1,000,000 products, and search time was always under 0.2 seconds, with page load time in the browser under 0.6 seconds for single-user queries. (The server running Solr and SQL had 8 GB of RAM and a quad-core 1.8 GHz CPU, as I remember.)
Hope this type of implementation is useful for you.
You can set up a database trigger or a data-access-layer event to send data to Solr whenever there's a change, and configure autoCommit to control freshness (a sketch follows after the links below).
See also:
http://wiki.apache.org/solr/NearRealtimeSearch
http://java.dzone.com/articles/tuning-solr-near-real-time
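For example, a data-access-layer hook could push each change straight to Solr; a sketch (the core name and fields are assumptions, and commitWithin defers the actual commit to Solr's settings):

import requests

def on_product_saved(product):
    # Push the changed row to Solr right after the DB write; commitWithin=1000
    # asks Solr to make the change visible within about one second.
    doc = {"id": product["id"], "title": product["title"], "price": product["price"]}
    requests.post(
        "http://localhost:8983/solr/products/update?commitWithin=1000",
        json=[doc],
    )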