Solr - Indexing products with attributes as Key / Value pair - solr

I’m currently looking at developing a solr application to index products on our e-commerce website.
Some example fields in the schema are:
ProductID
ProductName
Description
Price
Categories (multi-value)
Attributes
Attributes are a list of key-value pairs.
For example:
Type = Rose
Position = Full Sun
Position = Shade
Colour = Red
I am going to store the fields, so that my pages can be generated from a search result.
How is it best to represent these?
I was thinking of maybe having some dynamic fields for indexing:
attribute_* for example (attribute_position)
And then a multi-valued “attribute” field for the stored values (for returning and display).
The value of an “attribute” field could be (for example) Position|Full Sun - then let the client handle the displaying?
Are there any better ways of doing this?
As a footnote- I will be using Solrnet as a client for querying (probably not relevant)

First, I would not recommend storing your entire document in your search engine. The only things you should store in Solr are those you wish to search on. Yes, it supports storing more; however, taking advantage of this can cause issues down the road with index size, master/slave replication time, etc. Ideally, the only things in Solr are the fields you wish to search/sort on, plus a document ID unique enough to fetch the document data from another source that is optimized for storing documents.
However, if you decide to ignore this advice, you can easily store your name-value pairs in a single field. If your name-value pairs have a limited character set, you can concatenate them into a single string, then parse them on the way out when forming your web page for display. There's no need for a more complex schema to support this: multiple fields for storing these will only increase your index overhead, which buys you nothing.
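For illustration, the pipe-delimited encoding suggested in the question could be handled like this (a Python sketch; the `|` delimiter and the attribute names are assumptions from the question's example):

```python
def encode_attributes(pairs):
    """Encode (name, value) pairs as 'Name|Value' strings for a
    multi-valued stored Solr field."""
    return [f"{name}|{value}" for name, value in pairs]

def decode_attributes(stored):
    """Parse the stored strings back into (name, value) pairs for display."""
    return [tuple(s.split("|", 1)) for s in stored]

attrs = [("Type", "Rose"), ("Position", "Full Sun"),
         ("Position", "Shade"), ("Colour", "Red")]
stored = encode_attributes(attrs)
# stored == ['Type|Rose', 'Position|Full Sun', 'Position|Shade', 'Colour|Red']
assert decode_attributes(stored) == attrs
```

Keeping the pairs in one multi-valued field means the client does a trivial split on display, and the index carries no extra stored-field overhead.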

Related

how to boost the score in azure search for unstructured blob data?

I am using Azure Search with default indexing on imported unstructured data (pdf, doc, text, image files, etc.).
I didn't create any scoring profile on the default available fields.
Almost every setting in the portal is the default. If I search for any text through the search explorer, I get a JSON result with a very low search score.
I read about score boosting using a scoring profile. However, the terms I want to find can be in any document at any place, so how can I decide which field to weight more?
How can I generate more custom fields on these input files? Do I need to write a document parser?
I am using SDK 4.0 and C# in my bot.
Please suggest.
To use scoring profile, the fields you are trying to boost need to be part of the index definition, otherwise the scoring mechanism won't know about them.
You mentioned using unstructured data as your source; I assume this means your data does not have a stable or predictable structure. If that's the case, you probably won't be able to update your index definition to match the structure of every document, since different documents will likely have different and unpredictable structures. If you know which fields you want to boost, and you know how to retrieve those fields from your documents, then you can update your index definition with only the fields you care about, and then use the "merge" document API to populate those fields for each document.
https://learn.microsoft.com/en-us/rest/api/searchservice/addupdate-or-delete-documents
This would require you to retrieve all documents from the index, parse the data to extract the field you want to boost, and then use the merge API to update the index data with the data you extracted. Once you have this, you will be able to use that field as part of a scoring profile.
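As a sketch of that flow, the merge request body could be built like this (Python; the `id` key name and the `keyPhrases` field are hypothetical placeholders for whatever key field and extracted field your index actually uses):

```python
import json

def build_merge_payload(doc_updates):
    """Build a request body for the Azure Search documents index endpoint,
    merging one extracted field into existing documents.
    'id' and 'keyPhrases' are placeholder field names for illustration."""
    return {
        "value": [
            {"@search.action": "merge", "id": doc_id, "keyPhrases": phrases}
            for doc_id, phrases in doc_updates.items()
        ]
    }

payload = build_merge_payload({"doc-1": "azure search scoring"})
body = json.dumps(payload)  # POST this to the index's docs/index endpoint
```

The "merge" action only touches the fields you send, leaving the rest of each document intact, which is what makes this back-fill approach practical.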

Is there a better way to represent provenance on a field level in SOLR

I have documents in SOLR which consist of fields where the values come from different source systems. The reason why I am doing this is because this document is what I want returned from the SOLR search, including functionality like hit highlighting. As far as I know, if I use join with multiple SOLR documents, there is no way to get what matched in the related documents. My document has fields like:
id => unique entity id
type => entity type
name => entity name
field_1_s => dynamic field from system A
field_2_s => dynamic field from system B
...
Now, my problem comes when data is updated in one of the source systems. I need to update or remove only the fields that correspond to that source system and keep the other fields untouched. My thought is to encode the dynamic field name so that the first part of the field name is an 8-character hash representing the source system. This way the sources can have common field names outside of the unique source hash, and I can easily clear out all fields that start with the source prefix, if needed.
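The prefix scheme I have in mind would look something like this (a Python sketch; the hash choice and the `_s` dynamic-field suffix are just my assumptions):

```python
import hashlib

def source_prefix(source_name):
    """First 8 hex chars of a hash identify the source system."""
    return hashlib.sha1(source_name.encode("utf-8")).hexdigest()[:8]

def prefixed_field(source_name, field_name):
    """e.g. 'a1b2c3d4_color_s' for field 'color' from one source."""
    return f"{source_prefix(source_name)}_{field_name}_s"

def strip_source_fields(doc, source_name):
    """Return a copy of the doc with all fields from one source removed."""
    prefix = source_prefix(source_name)
    return {k: v for k, v in doc.items() if not k.startswith(prefix)}

doc = {
    "id": "entity-1",
    prefixed_field("system_a", "color"): "red",
    prefixed_field("system_b", "color"): "blue",
}
cleaned = strip_source_fields(doc, "system_a")
# only "id" and system_b's field remain
```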
Does this sound like something I should be doing, or is there some other way that others have attempted?
In our experience the easiest and least error-prone way of implementing something like this is to have a straightforward way to build the resulting document, and then reindex the complete document with data from both subsystems retrieved at the time of reindexing. Tracking field names and field removal tends to involve a lot of business rules that live outside of where you'd normally work with them.
By focusing on making the task of indexing a specific document easy and performant, you'll make the system more flexible regarding other issues in the future as well (retrieving all documents with a certain value from Solr, then triggering a reindex for those documents from a utility script, etc.).
That way you'll also have the same indexing flow for your application and primary indexing code, so that you don't have to maintain several sets of indexing code to do different stuff.
If the systems you're querying aren't able to perform well when retrieving the number of documents you need, you can add a local cache (in SQL, memcached or something similar) to speed up the process, but that code can be specific to the indexing process. Usually the subsystems will be performant enough (at least when doing batch retrieval based on the documents being updated).
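The "always rebuild the whole document" flow amounts to something like this (a sketch; the fetcher functions stand in for whatever retrieval code each subsystem needs):

```python
def build_document(entity_id, fetch_a, fetch_b):
    """Rebuild the complete Solr document from both source systems at
    reindex time, instead of patching individual fields in the index."""
    doc = {"id": entity_id}
    doc.update(fetch_a(entity_id))  # fields from system A
    doc.update(fetch_b(entity_id))  # fields from system B
    return doc

# stub fetchers standing in for real subsystem lookups
fetch_a = lambda _id: {"a_title_s": "from A"}
fetch_b = lambda _id: {"b_title_s": "from B"}
doc = build_document("entity-1", fetch_a, fetch_b)
```

When one subsystem changes, you call the same function and reindex the full document; no field-removal bookkeeping is needed.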

Storing Inverted Index

I know that inverted indexing is a good way to index words, but what I'm confused about is how search engines actually store them. For example, if the word "google" appears in documents 2, 4, 6, 8 with different frequencies, where should they be stored? Would a database table with a one-to-many relation do any good for storing them?
It is highly unlikely that full-fledged SQL-like databases are used for this purpose. It is called an inverted index because it is just an index: each entry is just a reference. Non-relational databases and key-value stores are the more usual fit here, which is part of why they became such a favourite topic in web technology.
You only ever have one way of accessing the data (by query word). That is why it's called an index.
Each entry is a list/array/vector of references to documents, so each element of that list is very small. The only other information besides the document ID worth storing would be a tf-idf score for each element.
How to use it:
If you have a single query word ("google") then you look up in the inverted index in which documents this word turns up (2, 4, 6, 8 in your example). If you have tf-idf scores, you can sort the results to report the best-matching document first. You then look up which documents the document IDs 2, 4, 6, 8 refer to, and report their URL, a snippet, etc. URLs, snippets, etc. are probably best stored in another table or key-value store.
If you have multiple query words ("google" and "altavista"), you look into the II for both query words and you get two lists of document IDs (2,4,6,8 and 3,7,8,11,19). You take the intersection of both lists, which in this case is (8), which is the list of documents in which both query words occur.
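The lookup-and-intersect step above can be sketched in a few lines (Python; the postings lists are taken from the question's example):

```python
# term -> sorted list of document IDs containing the term
index = {
    "google":    [2, 4, 6, 8],
    "altavista": [3, 7, 8, 11, 19],
}

def search(index, *terms):
    """Intersect the postings lists of all query terms."""
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

search(index, "google")               # [2, 4, 6, 8]
search(index, "google", "altavista")  # [8]
```

Real engines keep the postings sorted and walk both lists in lockstep rather than building sets, but the result is the same intersection.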
It's a fair bet that each of the major search engines has its own technology for handling inverted indexes. It's also a moderately good bet that they're not based on standard relational database technology.
In the specific case of Google, it is a reasonable guess that the current technology used is derived from the BigTable technology described in 2006 by Fay Chang et al in Bigtable: A Distributed Storage System for Structured Data. There's little doubt that the system has evolved since then, though.
Traditionally, an inverted index is written directly to file and stored on disk somewhere. If you want to do boolean retrieval querying (either a file contains all the words in the query or not), postings might look like the following, stored contiguously in a file.
Term_ID_1:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_2:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N.Term_ID_N:Frequency_N:Doc_ID_1,Doc_ID_2,Doc_ID_N
The term id is the id of a term, the frequency is the number of docs the term appears in (in other words, how long the postings list is), and the doc id is a document that contained the term.
Along with the index, you need to know where everything is in the file, so mappings also have to be stored somewhere in another file. For instance, given a term_id, the map needs to return the file position of that term's postings, which you can then seek to. Since the frequency is recorded in the postings, you know how many doc_ids to read from the file. In addition, there will need to be mappings from the IDs to the actual term/doc names.
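A parser for that contiguous layout might look like this (a sketch with small made-up ids; real on-disk indexes use binary encodings, not text):

```python
def parse_postings(raw):
    """Parse the 'term:freq:doc,doc.term:freq:doc' layout into a dict
    mapping term id -> list of doc ids."""
    postings = {}
    for entry in raw.strip(".").split("."):
        term_id, freq, docs = entry.split(":")
        doc_ids = [int(d) for d in docs.split(",")]
        assert len(doc_ids) == int(freq)  # freq says how many ids to read
        postings[int(term_id)] = doc_ids
    return postings

raw = "1:3:2,4,8.2:2:3,7"
parse_postings(raw)  # {1: [2, 4, 8], 2: [3, 7]}
```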
If you have a small use case, you may be able to pull this off with SQL by using blobs for the postings list and handling the intersection yourself when querying.
Another strategy for a very small use case is to use a term document matrix.
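A term-document matrix for the running example would look like this (a toy sketch; rows are terms, columns are docs, and a boolean AND query is just a bitwise AND of rows):

```python
terms = ["google", "altavista"]
docs = [2, 3, 4, 6, 7, 8]
# 1 means the term occurs in that doc
matrix = [
    [1, 0, 1, 1, 0, 1],  # google    -> docs 2, 4, 6, 8
    [0, 1, 0, 0, 1, 1],  # altavista -> docs 3, 7, 8
]
both = [a & b for a, b in zip(matrix[0], matrix[1])]
hits = [d for d, hit in zip(docs, both) if hit]  # [8]
```

The matrix is mostly zeros for any realistic corpus, which is exactly why inverted indexes (which store only the ones) scale and this does not.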
Possible Solution
One possible solution would be to use a positional index. It's basically an inverted index, but we augment it by adding more information. You can read more about it at Stanford NLP.
Example
Say a word "hello" appeared in docs 1 and 3, in positions (3,5,6,200) and (9,10) respectively.
Basic Inverted Index (note there's no way to find word frequencies nor their positions)
"hello" => [1,3]
Positional Index (note that we not only have frequencies for each doc, we also know exactly where the term appeared in the doc)
"hello" => [1:<3,5,6,200> , 3:<9,10>]
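Building such an index from tokenized documents is straightforward (a Python sketch; positions here are 0-based token offsets, unlike the illustrative numbers above):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps doc id -> token list.
    Returns term -> {doc id: [positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {1: ["hello", "world", "hello"], 3: ["say", "hello"]}
index = build_positional_index(docs)
index["hello"]  # {1: [0, 2], 3: [1]}
```

The per-doc frequency falls out for free as the length of each position list, and phrase queries become checks for adjacent positions.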
Heads Up
Will your index take up a lot more space now? You bet!
That's why it's a good idea to compress the index. There are multiple options to compress the postings list using gap encoding, and even more options to compress the dictionary, using general string compression algorithms.
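Gap encoding itself is simple: store the difference between consecutive sorted IDs instead of the IDs themselves, so the numbers stay small and compress well (a sketch, using the positions from the "hello" example):

```python
def gap_encode(ids):
    """Replace sorted ids with the first id followed by the gaps."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def gap_decode(gaps):
    """Rebuild the original ids by a running sum over the gaps."""
    out = [gaps[0]]
    for g in gaps[1:]:
        out.append(out[-1] + g)
    return out

gap_encode([3, 5, 6, 200])  # [3, 2, 1, 194]
```

The small gaps are then typically packed with a variable-length code so frequent terms (dense postings, tiny gaps) cost only a byte or so per entry.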
Related Readings
Index compression
Postings file compression
Dictionary compression

Index file content and custom metadata separately with Solr3.3

I am doing a POC on content/text search using Solr3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change the custom metadata of the documents. However, once a document is added to the index, its content cannot be updated. When a user updates the custom metadata, the document index has to be updated to reflect the metadata changes in the search.
But during the index update, even though the content of the file has not changed, it is also re-indexed, which causes delays in the metadata update.
So I wanted to check if there is a way to avoid content indexing and update just the metadata?
Or do I have to store the content and metadata in separate index files. i.e. documentId, content in index1 and documentId, custom metadata in another index. In that case how I can query onto these two different indexes and return the result?
"If there is a way to avoid content indexing and update just the metadata": this has been covered in solr indexing and reindexing, and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table. Think sparse matrices, think Amazon SimpleDB. Two Solr indexes are considered two databases, not two tables, if you had DB-like joins in mind. I just answered this on How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (a solr document = a DB row). Hence for a file on "watson":
id: docs_contents_watson
type:contents
text: text of the file
and the metadata as
id:docs_metadata_watson
type:metadata
author:A J Crown
year:1984
To search the contents of a document:
http://localhost:8080/app/select?q=text:"on a dark lonely night"&fq=type:contents
To do metadata searches:
http://localhost:8080/app/select?q=year:1984&fq=type:metadata
Note the type:xx.
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique this.
We did try this and it should work. Take a snapshot of what you have (basically the SolrInputDocument object) before you send it to Lucene. Compress and serialize the object, then assign it to one more field in your schema. Make that field a binary field.
So when you want to update one of the fields, just fetch the binary field, deserialize it, append/update the values of the fields you are interested in, and re-feed it to Lucene.
Never forget to store the XML as one of the fields inside the SolrInputDocument; it contains the text extracted by Tika, which is used for search/indexing.
The only negative: your index size will grow a little bit, but you will get what you want without re-feeding the data.
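The snapshot idea, independent of SolrJ, amounts to a compress/serialize round trip like this (a Python sketch of the concept; the answer's actual setup serializes a Java SolrInputDocument, and the field names here are made up):

```python
import base64
import json
import zlib

def snapshot_field(doc):
    """Compress and serialize the document dict so it can be stored
    in a binary field alongside the searchable fields."""
    raw = json.dumps(doc, sort_keys=True).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def restore_snapshot(blob):
    """Recover the original document to tweak fields and re-feed it."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))

doc = {"id": "docs_metadata_watson", "author": "A J Crown", "year": 1984}
blob = snapshot_field(doc)
restored = restore_snapshot(blob)
restored["year"] = 1985  # update metadata, then re-index the whole doc
```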

Limiting solr spellchecking based on attributes

I wonder if there is a way to limit spellchecking to just a part of the index.
For example, I have an index containing different products used in different countries.
When a search is performed I limit the Solr query to just return the results for COUNTRY_X; however, the suggestions that are returned are not limited to COUNTRY_X. Instead I receive results based on the whole index (since I only have one spellcheck index).
I believe you can create a separate dictionary for each country to solve this, but here is the twist: I sometimes do a query where I want results back from COUNTRY_X and COUNTRY_Y, and thus also want suggestions limited to those two countries. This would in turn require a dictionary index of its own, which seems a little too complicated, and the number of dictionary indexes would grow large.
I'd try splitting the index per country, i.e. one index for country X and another for country Y. You can easily do this with a multi-core setup. This way each index gets its own dictionary.
When you want to search on multiple countries at once you run a distributed query over the indexes. Distributed support for the spell checking component is only available in trunk as of this writing though.
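In outline, the requests would look something like this (the core names and host are hypothetical; the `shards` parameter is how Solr fans a query out over multiple cores):

```
# one core per country, each with its own spellcheck dictionary
http://localhost:8983/solr/products_x/select?q=rose&spellcheck=true
http://localhost:8983/solr/products_y/select?q=rose&spellcheck=true

# query both countries at once with a distributed request
http://localhost:8983/solr/products_x/select?q=rose
    &shards=localhost:8983/solr/products_x,localhost:8983/solr/products_y
```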