Forgive me, I am just a newbie to Solr and I am trying to understand some basic concepts. I quoted the following from something I read about inverted indexes:
This is like retrieving pages in a book related to a keyword by
scanning the index at the back of a book, as opposed to searching
every word of every page of the book.
This type of index is called an inverted index, because it inverts a
page-centric data structure (page->words) to a keyword-centric data
structure (word->pages).
In my understanding, the index indicates which documents a specific token (term) points to. But I can't understand what the fields of a document are used for during indexing and querying.
In my understanding, at query time Solr just searches the index and finds the documents; it has nothing to do with the fields. Right? Thanks.
Documents (which can have one or more fields) are the I/O entities exchanged between client and server during the index and query phases. The inverted index is a low-level concept (hidden from the client): it is the immutable, underlying data structure that Solr uses to organize its data.
Solr uses fields for searching and indexing. A document, instead, is a logical grouping of them. Speaking (improperly) in RDBMS terminology:
Document = record
Field = column values belonging to that record
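To make the relationship concrete, here is a toy sketch (not Solr's actual implementation, and with made-up document and field names) of an inverted index keyed by (field, term) rather than by term alone. It shows why fields matter at query time: a field-qualified query like title:solr consults only that field's postings.

```python
from collections import defaultdict

# Two toy documents, each a set of named fields.
docs = {
    "doc1": {"title": "solr in action", "body": "search with an inverted index"},
    "doc2": {"title": "lucene basics", "body": "lucene powers solr search"},
}

# Build an inverted index keyed by (field, term) -> set of document ids.
index = defaultdict(set)
for doc_id, fields in docs.items():
    for field, text in fields.items():
        for term in text.split():
            index[(field, term)].add(doc_id)

# A field-qualified query only consults that field's postings:
print(index[("title", "solr")])  # matches doc1 only
print(index[("body", "solr")])   # matches doc2 only
```

So the query is not "nothing to do with the field": the same term can hit different documents depending on which field you search.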
Related
I need to write some metadata to an index in Lucene. This metadata describes the relationship between indexes, which helps me do cross-index queries.
The metadata is structured as key-value pairs. The key may be an Integer or a String, and the value is a list of Integers or Strings.
In the beginning, I tried to extend the Codec, but obviously these key-value pairs do not belong to any of the existing Formats. Then I turned to writing them by adding a field, but they do not belong in the index either, and a field is hard to change.
How can I extend the index with this metadata? Thank you.
I am new to Apache Solr and have worked with a single table, importing it into Solr and retrieving data using queries.
Now I want to do following.
Query from multiple tables, e.g. if I search for a word, it should return all occurrences across multiple tables.
Search in all fields of a table, like querying for a word across all fields of a single table.
Do I need to create single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Any leads and guidance are welcome.
TIA.
Do I need to create single document by importing data from multiple tables using joins in data-config.xml? And then querying over it?
Yes. Solr uses a document model (rather than a relational model) and the general approach is to index a single document with the fields that you need for searching.
From the Apache Solr guide:
Solr’s basic unit of information is a document, which is a set of data
that describes something. A recipe document would contain the
ingredients, the instructions, the preparation time, the cooking time,
the tools needed, and so on. A document about a person, for example,
might contain the person’s name, biography, favorite color, and shoe
size. A document about a book could contain the title, author, year of
publication, number of pages, and so on.
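To illustrate what the join in data-config.xml effectively produces, here is a toy Python sketch of that denormalization: rows from two tables are flattened into one document per parent row, with the child values collected into a multi-valued field. Table and field names are invented for illustration.

```python
# Rows as they might come back from two relational tables.
books = [{"id": 1, "title": "Solr in Action"}]
authors = [
    {"book_id": 1, "name": "Trey Grainger"},
    {"book_id": 1, "name": "Timothy Potter"},
]

# Denormalize: one document per book, authors folded into a multi-valued field.
documents = []
for book in books:
    documents.append({
        "id": book["id"],
        "title": book["title"],
        "author": [a["name"] for a in authors if a["book_id"] == book["id"]],
    })

print(documents[0])
```

Once indexed this way, a search for an author's name and a search for a title both hit the same document, which is what makes "find a word anywhere across the tables" a single Solr query.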
We have a situation where we are keeping two indexes with different schemas.
For example: suppose we have an index for seller where the key value is seller id and other attributes are seller information. Now another index is book where book id is unique key and it keeps book related information.
Is it possible to query both these indexes in a single query and get collective results?
I have checked Solr, and as per my findings we can do this through distributed search, but that works on the same kind of schema distributed across at most 3 indexes.
I am a newbie to Solr so please ignore if this is a stupid question.
You need to think about what makes sense for a search query but there are some rules.
The first requirement is that the unique keys need to have the same name and be unique across collections or Solr cannot collate results.
If you are then hoping to get some kind of sensible ranking of your results, you need some common fields. For example, I have two collections: one of product data and one containing product-related documents. I have a unique key, id, and common title and contents fields for when I want to query across the two collections. I also have an advanced search interface where I can query on specific fields like product id.
A "unification core" is a typical way of handling search across two or more cores; see this Stack Overflow answer on how to set that up:
Query multiple collections with different fields in solr
Other techniques are to use federated search with something like Carrot, or to issue two queries and show the results in different tabs of the search results.
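The two-query technique above can be sketched in a few lines of Python: query each collection separately, then merge the hits by score. The ids and scores here are invented, and note that in practice scores from different collections are not directly comparable unless the collections share common fields and analysis, which is why the common-fields requirement matters.

```python
# Illustrative hits from two separate queries, one per collection.
products = [{"id": "p1", "score": 2.4}, {"id": "p2", "score": 1.1}]
documents = [{"id": "d1", "score": 1.8}]

# Naive merge: interleave the two result lists by descending score.
merged = sorted(products + documents, key=lambda hit: hit["score"], reverse=True)
print([hit["id"] for hit in merged])  # p1, d1, p2
```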
I have indexed two JSON documents into Solr, and when I get the response I am receiving both documents. How do I differentiate the two documents and store them separately?
You need to define a (unique) key when indexing the JSON documents, this key being either mandatory or not. This can be done in schema.xml or managed-schema, if not already done. Further on, you would have to search for this key in the query to fetch the wanted document.
This can be compared with querying for a unique primary key in SQL and traditional databases. A tuple/record, uniquely identified by its primary key, would in this scenario be equivalent to a JSON document.
Assuming two documents with unique ids 1 and 2, you can fetch document 1 by searching for q=id:1 in the Solr Admin UI. I'm afraid I don't know how to do this in SolrJ or via QueryResponse.
Management of where documents are stored in Solr is not supported; it is more or less a black box. This should, however, not be a problem in your situation, as long as you specify the query correctly.
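Conceptually, the unique-key lookup works like this toy sketch (field names and data are illustrative, not Solr's internals): the query id:1 narrows the result set to exactly one document, much like a primary-key lookup in SQL.

```python
# Two indexed documents, each carrying a unique "id" key.
indexed = [
    {"id": "1", "title": "first json document"},
    {"id": "2", "title": "second json document"},
]

def query(field, value):
    """Return only the documents whose field matches the value, as q=field:value would."""
    return [doc for doc in indexed if doc.get(field) == value]

print(query("id", "1"))  # only the first document
```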
Here is a link that explains how to use Solr 6 as a JDBC data source. It is better to use Solr 6 if you want to utilize Solr more as a data source than as an index source, as it has enhanced SQL-level features and hence serves that purpose best: https://sematext.com/blog/2016/04/26/solr-6-as-jdbc-data-source/. Let me know if that helps you :).
I have a books database with three entities: books, pages, and titles (titles found on a page). I am confused and concerned about the performance of two approaches to the schema design:
1. Dealing with books as documents, i.e. a book field, a multiValued pages field, and a multiValued titles field. In this approach all of the book's data is represented in one Solr document with very large fields.
2. Dealing with pages as documents, which leads to much smaller fields but a larger number of documents.
I tried to look at this official resource but I could not find a clear answer to my question.
Assuming you are going to take Solr results and present them through another application, I would make the smallest item, titles, the model for documents; that makes it much easier to present where a result appears and minimizes the amount of application code you need to write. If your users are querying Solr directly, I might use a page as my document instead; presumably you are then using Solr's highlighting feature to help your users identify how their search term(s) matched.
For Title documents I would model the schema as follows:
Book ID + Page Number + Title [string - unique key]
Book ID [integer]
Book Name [tokenized text field]
Page Number [TrieIntField]
Title [tokenized text field]
Content for that book/title/page combination [tokenized text field]
There may be other attributes you want to capture, such as author, publication date, publisher, but you do not explain above what other information you have so I leave that out of this example.
Textual queries then can involve Book Name, Title and Content where you may want to define a single field that's indexed, but not stored, that serves as a target for <copyField/> declarations in your schema.xml to allow for easy searching over all three at the same time.
For indexing, without knowing more about the data being indexed, I would use the ICU Tokenizer and Snowball Porter Stemming Filter with a language specification on the text fields to handle non-English data - assuming all the books are in the same language. And if English, the Standard Tokenizer instead of ICU.
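The Title-document model above can be sketched as follows; the sample data and exact field names are invented for illustration, but the composite unique key (Book ID + Page Number + Title) matches the schema proposed above.

```python
# Source rows: one entry per title occurrence within a book's pages.
titles = [
    {"book_id": 7, "book_name": "A Sample Book", "page": 12,
     "title": "Introduction", "content": "Opening words of the chapter."},
]

# Build one Solr-style document per title, with a composite unique key.
docs = []
for t in titles:
    docs.append({
        "id": f'{t["book_id"]}_{t["page"]}_{t["title"]}',  # Book ID + Page Number + Title
        "book_id": t["book_id"],
        "book_name": t["book_name"],
        "page_number": t["page"],
        "title": t["title"],
        "content": t["content"],
    })

print(docs[0]["id"])  # 7_12_Introduction
```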