Solr field alias for indexing and querying

I have a set of documents in a Solr index that have the fields exact_title and alternative_title. I want to be able to search them using a single field, title.
In other words, the query title:Hello World should return documents that have an exact_title or an alternative_title of "Hello World".
Is it possible to define an alias for a field at indexing time?

I solved this by defining copy fields in the schema.xml file.
Example:
<field name="title_txt" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="exact_title_txt" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="alternative_title_txt" type="text_general" indexed="true" stored="true" multiValued="false"/>
<copyField source="exact_title_txt" dest="title_txt"/>
<copyField source="alternative_title_txt" dest="title_txt"/>

Related

Solr indexing some of fields in a Tuple

I am new to Solr and have a question regarding Solr indexing. Currently we have the configuration below to index all the fields in a Tuple.
<!--contact fields -->
<field indexed="true" multiValued="false" name="contact" stored="false" type="TupleField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.first_name" stored="false" type="TextField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.last_name" stored="false" type="TextField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.email" stored="false" type="TextField"/>
I am trying to avoid indexing unwanted fields. In the above config I want to remove indexing for first_name and last_name; basically, I want an index on the email field only.
Do I need to remove the fields first_name and last_name from the above config and declare only
<field indexed="true" multiValued="false" name="contact" stored="false" type="TupleField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.email" stored="false" type="TextField"/>
or do I need to declare all the fields and set docValues and indexed to false? I guess both are the same, but can someone confirm that the above change is good?
In production usage you should always mention all fields, so that you don't suddenly get weird behavior from fields being added by the schemaless mode.
Keep the configuration and set indexed and docValues explicitly to false if you don't need them.
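A sketch of what that advice could look like, reusing the field names and types from the question. All three sub-fields stay declared, and only contact.email remains indexed. The attribute values simply mirror the original configuration; they are not verified against any particular Solr distribution.

```xml
<!-- contact fields: every sub-field stays declared, only email is indexed -->
<field indexed="true" multiValued="false" name="contact" stored="false" type="TupleField"/>
<!-- first_name and last_name are kept in the schema but explicitly switched off -->
<field docValues="false" indexed="false" multiValued="false" name="contact.first_name" stored="false" type="TextField"/>
<field docValues="false" indexed="false" multiValued="false" name="contact.last_name" stored="false" type="TextField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.email" stored="false" type="TextField"/>
```

Existing documents generally need to be reindexed before a change like this is fully reflected in the index.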

Solr - Not Getting Results When searching

I am having trouble with Solr 8.5.2 when providing a word in a query. It's fine when the query is *:*. But when I put in a word, it does not hit any document.
Here is my schema.xml config.
<field name="quoteid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="quotenumber" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="formdata" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="creationtimeintickssinceepoch" type="plong" indexed="true" stored="true"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
Here is a sample document (the formdata field is actually a JSON string, as you will notice).
{
"quoteid":"466f4dea-XXXX-443c-b1e4-XXXXXXX",
"quotenumber":"NAAAAA",
"creationtimeintickssinceepoch":15927195449809739,
"formdata":"{\"formModel\": {\"SomeProperty0\":\"somevalue\",\"SomeProperty1\":\"somevalue\",\"SomeProperty2\":\"somevalue\"}"...blahblahblah here,
"_version_":1670089165635584000}
I tried entering NAAAAA, no results. I tried 'SomeProperty1', no results too.
If your query does not name a field, and you are not using dismax or edismax with the qf argument, the default search field is used. This is usually named _text_; it could be configured in your schema, but is usually given as df with the default query handler.
To query any other field you need to include its name: quotenumber:NAAAAA will get hits in the quotenumber field.
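If unfielded queries should also find these values, one option is to copy the fields into the default search field. This is a sketch under the assumption that df is _text_ in this setup; the copyField lines below are not part of the schema shown in the question.

```xml
<!-- Assumed addition to schema.xml: copy searchable fields into the catch-all
     _text_ field so that a query without a field name can match them. -->
<copyField source="quotenumber" dest="_text_"/>
<copyField source="formdata" dest="_text_"/>
```

Copy fields are applied at index time, so documents indexed before the change would need to be reindexed.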

Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'

I am new to Solr and just trying to index a couple of PDF files. Starting with an empty field list in schema.xml, I keep getting the error message:
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'
(#docid and #fieldname are placeholders for real values here)
Is there a way to find out all the fields in my PDF files? Adding them one by one is not much fun :)
And what is the best way to filter these before they are loaded into Solr? schema.xml seems to be the last option. Are there any config files where I could get rid of the garbage fields sooner, possibly improving performance?
My environment: Cloudera Quickstart VM with CDH 5
Thanks for your help in advance.
You'll want to look at the ExtractingRequestHandler (aka SolrCell) and its configuration. There's an example there of how you can use uprefix to ignore all fields that are not known by the schema:
Example: uprefix=ignored_ would effectively ignore all unknown fields
generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
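A sketch of how those pieces fit together, modelled on the configuration shipped with the Solr example. The handler name /update/extract and the lowernames default come from that example and are assumptions here; a Cloudera setup may name or configure things differently.

```xml
<!-- schema.xml: a field type that neither indexes nor stores its values,
     plus a dynamic field that routes every ignored_* name to it -->
<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>

<!-- solrconfig.xml: prefix every field unknown to the schema with ignored_,
     so it falls into the dynamic field above instead of raising an error -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
  </lst>
</requestHandler>
```

With this in place, only the metadata fields you declare explicitly are kept; everything else Tika produces is silently discarded.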
There is also a list of fields defined in the example schema that lists all expected values from SolrCell and their types:
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them. Some metadata is parsed from the documents,
but there are some which come from the client context:
"content_type": From the HTTP headers of incoming stream
"resourcename": From SolrCell request param resource.name
-->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

Index and query Unique Key URL Solr

I indexed a collection of archived websites for querying using Solr. As the unique key I use the URLs of the sites. What I would like to do is use the url field in filter queries to limit the search to a certain domain when needed. For example, I want to query for "Barack Obama" but limit the results to the "whitehouse.gov" domain. It sounds like a pretty basic use case to me; however, searches on the url field do not return any results at all. Here is my config (schema.xml):
.
.
.
<field name="collection" type="string" indexed="true" stored="true"/>
<field name="content" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="string" indexed="true" stored="true"/>
<field name="digest" type="string" indexed="true" stored="true"/>
<field name="length" type="string" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="true" stored="true"/>
<field name="site" type="string" indexed="true" stored="true"/>
<field name="title" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="url" type="text_en_splitting" indexed="true" stored="true"/>
.
.
.
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>url</uniqueKey>
And here is my query (simplified):
http://mysolrserver.com:8983/solr/select/?q=content:Barack+Obama&fq=url:whitehouse.gov
The query analyzer tells me that my query should match.
Does anyone have an idea why this is not working? I highly appreciate any hints I can get. Thanks a lot!
The fq=url:whitehouse.gov filtering should work.
However, I see a problem with the query q=content:Barack+Obama.
What is your default search field?
Does removing the query component and using q=*:* return results for you?
q=content:Barack+Obama would actually be parsed as content:barack defaultsearchfield:obama, because only the first term is bound to the content field.
As the default search field would not contain obama, this would not return any results. To search for both terms in content, group them as content:(Barack Obama) or use the phrase query content:"Barack Obama".

Solr Schema Design

I have some questions regarding Solr schema design. Basically, I'm setting up a search engine for a product catalogue website, and my table relationships are as follows.
Product Belongs to Merchant
Product Belongs to Brand
Product has and belongs to many Categories
Category has many Sub Categories
Sub Category has many Types
Type has many Sub Types
So far my schema.xml looks like this.
<field name="product_id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="string" indexed="true" stored="true"/>
<field name="merchant" type="string" indexed="true" stored="true"/>
<field name="merchant_id" type="string" indexed="true" stored="true"/>
<field name="brand" type="string" indexed="true" stored="true"/>
<field name="brand_id" type="string" indexed="true" stored="true"/>
<field name="categories" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="sub_categories" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="types" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="sub_types" type="string" multiValued="true" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="description" type="text" indexed="true" stored="true"/>
<field name="image" type="text" indexed="true" stored="true"/>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<uniqueKey>product_id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
<copyField source="name" dest="text"/>
<copyField source="merchant" dest="text"/>
<copyField source="brand" dest="text"/>
<copyField source="categories" dest="text"/>
<copyField source="sub_categories" dest="text"/>
<copyField source="types" dest="text"/>
<copyField source="sub_types" dest="text"/>
So my questions now:
1) Is the schema correct?
2) Let's assume I need to find products for Category XYZ. My senior programmer doesn't like querying Solr by category name; instead he wants to use the category ID.
He suggests storing CategoryID_CategoryName (1001_Category XYZ) and sending the ID from the web front end (assuming that names with white space don't work properly).
So to find the products I would then do a partial match on categories and identify the category ID from the string (i.e. fetch 1001 from 1001_Category XYZ)
or
What if I keep the names in the categories field and set up another field for category_ids? That seems a better option to me.
or
Is there any Solr multi-valued field type to store CategoryID and CategoryName together?
Let me know your thoughts, thanks.
Answers to your questions:
1) Maybe - it depends on how you plan to structure your queries, what you intend to search, and what you intend to retrieve in search results. In your schema you're storing and indexing everything, which can be quite inefficient. Index what you intend to query; store what you intend to retrieve or display. If you are looking for optimizations, I would review the data types used in the schema and try to stay as close to the source type as you can.
2) Querying by CategoryId - your programmer is correct, you want to query by category ID. Your approach of storing IDs and names in separate fields is sound as well. Presuming your ID-based fields are integers/longs, you don't want to define them as strings but rather as integers/longs.
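A minimal sketch of the separate-field approach. The field name category_ids is taken from the question; the int field type is an assumption that the schema defines one, as the stock example schema of that era does alongside float.

```xml
<!-- Names stay in their own field for display and faceting -->
<field name="categories" type="string" multiValued="true" indexed="true" stored="true"/>
<!-- Hypothetical companion field: numeric category IDs used for filtering -->
<field name="category_ids" type="int" multiValued="true" indexed="true" stored="true"/>
```

The front end can then filter with something like fq=category_ids:1001 and never needs to parse an ID out of a combined string.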
