Solr schema design and performance

I have a books database with three entities: books, pages, and titles (titles are found within pages). I am confused about, and concerned with, the performance of two approaches to the schema design:
1. Treating books as documents, i.e. a book field, a multiValued pages field, and a multiValued titles field. In this approach all of a book's data is represented in one Solr document with very large fields.
2. Treating pages as documents, which leads to much smaller fields but a larger number of documents.
I tried to look at the official resource but was not able to find a clear answer to my question.

Assuming you are going to take Solr results and present them through another application, I would make the smallest item - Titles - the model for documents, which makes it much easier to present where a result appears. Doing it this way minimizes the amount of application code you need to write. If your users are querying Solr directly, I might use Page as my document instead; presumably you would then use Solr's highlighting feature to help your users see how their search terms matched.
For Title documents I would model the schema as follows:
Book ID + Page Number + Title [string - unique key]
Book ID [integer]
Book Name [tokenized text field]
Page Number [TrieIntField]
Title [tokenized text field]
Content for that book/title/page combination [tokenized text field]
There may be other attributes you want to capture, such as author, publication date, or publisher, but you do not say what other information you have, so I leave those out of this example.
Textual queries can then involve Book Name, Title, and Content. You may want to define a single field that is indexed but not stored, serving as the target of <copyField/> declarations in your schema.xml, to allow easy searching over all three at the same time.
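As a rough sketch, the schema.xml for such Title documents might look like the following; the field names and the text_general type are my own illustrative choices, not anything prescribed, and the id field holds the concatenated Book ID + Page Number + Title key:

    <!-- Illustrative field definitions for the Title-as-document model.
         "text_general" is assumed to be a tokenized text type defined
         elsewhere in schema.xml; "tint" is backed by solr.TrieIntField. -->
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"/>

    <field name="id"          type="string"       indexed="true" stored="true" required="true"/>
    <field name="book_id"     type="tint"         indexed="true" stored="true"/>
    <field name="book_name"   type="text_general" indexed="true" stored="true"/>
    <field name="page_number" type="tint"         indexed="true" stored="true"/>
    <field name="title"       type="text_general" indexed="true" stored="true"/>
    <field name="content"     type="text_general" indexed="true" stored="true"/>

    <!-- Indexed-but-not-stored catch-all for searching all three text fields at once -->
    <field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="book_name" dest="text_all"/>
    <copyField source="title"     dest="text_all"/>
    <copyField source="content"   dest="text_all"/>

    <uniqueKey>id</uniqueKey>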
For indexing, without knowing more about the data, I would use the ICU tokenizer and the Snowball Porter stemming filter with a language specification on the text fields to handle non-English data, assuming all the books are in the same language. If the books are in English, use the standard tokenizer instead of the ICU one.
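To illustrate, the text_general type assumed above could be defined along these lines; this is a sketch that assumes Solr's analysis-extras module (which provides the ICU tokenizer) is on the classpath, and the language value is only a placeholder:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- ICU tokenizer for non-English text; swap in
             solr.StandardTokenizerFactory if the books are in English -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- "French" is a placeholder; set it to the books' actual language -->
        <filter class="solr.SnowballPorterFilterFactory" language="French"/>
      </analyzer>
    </fieldType>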

Related

Apache Solr Querying by search term from multiple tables and in all columns

I am new to Apache Solr; so far I have worked with a single table, importing it into Solr and querying it.
Now I want to do the following:
Query across multiple tables: if I search for a word, it should return all of its occurrences across the tables.
Search in all fields of a table: a word query should also match every field within a single table.
Do I need to create a single document by importing data from multiple tables using joins in data-config.xml, and then query over it?
Any leads and guidance is welcome.
TIA.
Do I need to create a single document by importing data from multiple tables using joins in data-config.xml, and then query over it?
Yes. Solr uses a document model (rather than a relational model) and the general approach is to index a single document with the fields that you need for searching.
From the Apache Solr guide:
Solr’s basic unit of information is a document, which is a set of data that describes something. A recipe document would contain the ingredients, the instructions, the preparation time, the cooking time, the tools needed, and so on. A document about a person, for example, might contain the person’s name, biography, favorite color, and shoe size. A document about a book could contain the title, author, year of publication, number of pages, and so on.
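To make that concrete, here is a minimal, hypothetical data-config.xml for the DataImportHandler that joins two invented tables, books and authors, into one document per book; all table, column, and connection details are placeholders:

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb"
                  user="user" password="pass"/>
      <document>
        <!-- One Solr document per row of the parent entity -->
        <entity name="book" query="SELECT id, title, year FROM books">
          <field column="id"    name="id"/>
          <field column="title" name="title"/>
          <field column="year"  name="year"/>
          <!-- Child entity joined on the parent id; its columns become
               additional (here multiValued) fields on the same document -->
          <entity name="author"
                  query="SELECT name FROM authors WHERE book_id = '${book.id}'">
            <field column="name" name="author_name"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

A single query against the resulting index then searches across what were originally several tables.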

Querying Solr multiple indexes with different schema in single query

We have a situation where we are keeping two indexes with different schemas.
For example, suppose we have a seller index whose unique key is the seller ID, with the other attributes holding seller information, and a book index whose unique key is the book ID, holding book-related information.
Is it possible to query both of these indexes in a single query and get combined results?
I have checked Solr, and as per my findings this can be done through distributed search, but that works only with the same kind of schema distributed across at most three indexes.
I am a newbie to Solr, so please excuse me if this is a stupid question.
You need to think about what makes sense for a search query, but there are some rules.
The first requirement is that the unique keys need to have the same name and be unique across collections, or Solr cannot collate results.
If you then hope to get a sensible ranking of your results, you need some common fields. For example, I have two collections: one of product data and one of product-related documents. Both share a unique key, id, and common title and contents fields for when I want to query across the two collections. I also have an advanced search interface where I can query specific fields such as the product ID.
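For instance, both collections' schema.xml files might share declarations like these (the names are illustrative), so that results from the two can be collated and ranked together:

    <!-- Common to both collections: the shared unique key lets Solr
         collate results, the shared text fields let it rank them -->
    <field name="id"       type="string"       indexed="true" stored="true" required="true"/>
    <field name="title"    type="text_general" indexed="true" stored="true"/>
    <field name="contents" type="text_general" indexed="true" stored="true"/>
    <uniqueKey>id</uniqueKey>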
A "unification core" is a typical way of handling search across two or more cores, see this Stack Overflow answer on how to set that up
Query multiple collections with different fields in solr
Other techniques are to use federated search with something like Carrot, or to issue two queries and show the results in different tabs of the search results.

Please suggest a pattern for searching through a database

Imagine this kind of database:
Authors(id, author)
Publication(id, authorID, Title, Year....)
What is the best way to handle string search queries such as "2001 Smith Theory of Evolution"? I mean not this particular case, but in general: searching records by more than one column?
For a simple/quick solution:
Consider creating a new (fulltext-indexed) terms column on your Publication table that receives every text string of interest for search (e.g. author name, publication date, title).
Then add a MATCH/AGAINST clause to your query (or to_tsquery() for Postgres).
Postgres doc: http://www.postgresql.org/docs/9.4/static/textsearch-tables.html
MySQL doc: https://dev.mysql.com/doc/refman/5.7/en/fulltext-search.html
If you find that you need finer control over search relevance, or stock search features like facets and autocomplete, then consider deploying Solr or Elasticsearch as an external index to your database.

Denormalization of database tables for Lucene indexing

I am just starting out with Lucene, and I'm trying to index a database so I can perform searches on its content. There are 3 tables that I am interested in indexing:
1. Image table - each entry represents an image. Each image has a unique ID and some other info (title, description, etc.).
2. People table - each entry represents a person. Each person has a unique ID and other info (name, address, company, etc.).
3. Credited table - this table has 3 fields (image, person, and credit type). Its purpose is to associate people with an image as the credits for that image. Each image can have multiple credited people (the director, photographer, props artist, etc.), and a person can be credited in multiple images.
I'm trying to index these tables so I can perform some searching using Lucene but as I've read, I need to flatten the structure.
The first solution that came to me was to create a Lucene Document for each Image/Credited Person combination. I'm afraid this would create a lot of duplicate content in the index (all the details of an image/person would be duplicated in each Document, once for every person who worked on the image).
Is there anybody experienced with Lucene who can help me with this? I know there is no generic solution to denormalization, which is why I provided a specific example.
Thank you; I will gladly provide more info on the database if anybody needs it.
PS: Unfortunately, there is no way for me to change the structure of the database (it belongs to the client). I have to work with what I have.
You could create a Document for each person with all the associated images' descriptions concatenated (either appended to the person info or in a separate Field).
Or, you could create a minimal Document for each person, create a Document for each image, put the creators' names and credit info in a separate field of the image Document, and link them by putting the person ID (or person Document ID) in a third, non-indexed field. (Lucene is geared toward flat document indexing, not relational data, but relations can be defined manually.)
This is really a matter of what you want to search for, images or persons, and whether each contains enough keywords for search to function. Try several options, see if they work well enough and don't exceed the available space.
The credit table will probably not be a good candidate for Document construction, though.

Index document "linked" to multiple users

Hi, I want to index a Solr document and tag it with multiple associated users. I want to enable searches like "give me the documents associated with user IDs 1000, 1003, ..., 9300 that contain the word X". More users will be added to a document during its lifetime, and I potentially want to associate thousands of users with one document. There is no need to show the associated users in the results; they are just for search. Will indexing the user ID or the username be more performant and scalable? And which field type would be more performant and scalable: appending to a text field, a multiValued field, or some other approach?
I believe that using the user ID (as an integer) would be the most performant, at least from my experience so far. Using a multiValued field will also allow you to use a filter query on the userid field, which helps improve query response time.
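For illustration, such a field might be declared in schema.xml as follows; the field name and the assumption that an integer fieldType named int exists in the schema are mine:

    <!-- Hypothetical multiValued integer field holding the associated user IDs;
         stored="false" because the users never need to appear in results -->
    <field name="userid" type="int" indexed="true" stored="false" multiValued="true"/>

A request could then pair the text query with a filter query such as fq=userid:(1000 OR 1003), which Solr caches independently of the main query.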
