Solr indexes documents but does not search in them

I am a novice with Solr and was trying the example that ships in the example folder of the Solr 3.6 package (apache-solr-3.6.0.tgz). I started the server and posted the sample XML files in example/exampledocs; searches then returned matches as expected. Next I tried posting another XML file containing more than 10,000 documents. I modified example/solr/conf/schema.xml to add the fields used in my XML file, restarted the server, and posted the file. The statistics page in the Solr admin panel (http://localhost:8983/solr/admin/stats.jsp) shows numDocs : 10020, which means the documents were posted successfully. But when I search for anything that appears in my posted documents (from the 10,000-document XML file), Solr returns 0 results. It still returns results for searches that match content in the default documents from the example/exampledocs folder. I am clueless about what has happened here; the value of numDocs clearly suggests that the documents I posted were indexed.
Is there anything else I can inspect to see what's wrong?
The schema that comes with the example in the Solr package looks like this:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_en_splitting" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
and more....
The XML file I posted had some fields in common with the above schema, such as title, description, and price, so I added the remaining fields to schema.xml like this:
<field name="cid" type="int" indexed="false" stored="false"/>
<field name="discount" type="float" indexed="true" stored="true"/>
<field name="link" type="string" indexed="true" stored="true"/>
<field name="status" type="string" indexed="true" stored="true"/>
<field name="pubDate" type="string" indexed="true" stored="true"/>
<field name="image" type="string" indexed="false" stored="false"/>

If you are using the default settings from the Solr example, the df setting for the /select request handler in solrconfig.xml sets the default search field to the text field.
<requestHandler name="/select" class="solr.SearchHandler">
<!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
-->
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
....
</requestHandler>
If you look in the schema.xml file just below the field definitions, you will see several copyField settings that copy the values from certain fields into the text field, which makes them searchable via the default field setting. In your example of searching for Sony in the title field, the copyField statements show that the title field is not being copied to the default search field, text. Therefore, the documents with Sony in the title are not returned by your query.
I would suggest the following:
Try a query that names the field explicitly: title:Sony. That should return what you are expecting.
If you want the title field to be included in the default search field, add the following copyField statement to schema.xml, restart Solr, and re-post your 10,000-document file.
<copyField source="title" dest="text"/>
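The two ways of reaching the title field without changing the schema can be sketched in Python. This is only an illustration of how the request URLs differ: the base URL is an assumption taken from the admin URL in the question, and select_url is a hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode

# Assumed base URL of the example server; adjust host, port and path as needed.
SOLR_SELECT = "http://localhost:8983/solr/select"


def select_url(query, default_field=None):
    """Build a /select URL; default_field overrides df for this request only."""
    params = {"q": query}
    if default_field is not None:
        params["df"] = default_field
    return SOLR_SELECT + "?" + urlencode(params)


# Option 1: name the field inside the query itself.
fielded = select_url("title:Sony")

# Option 2: keep the bare term but point the default field at title.
overridden = select_url("Sony", default_field="title")

print(fielded)
print(overridden)
```

Either request leaves the configured default (text) untouched, which makes it a quick way to check whether the documents are searchable before editing schema.xml.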
I hope this helps.

Related

Solr - Not Getting Results When searching

I am having trouble with Solr 8.5.2 when providing a word in a query. It works fine when the query is *:*, but when I put in a word, it does not match any document.
Here is my schema.xml config.
<field name="quoteid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="quotenumber" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="formdata" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="creationtimeintickssinceepoch" type="plong" indexed="true" stored="true"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
Here is a sample document. (The formdata field is actually a JSON string, as you can see.)
{
"quoteid":"466f4dea-XXXX-443c-b1e4-XXXXXXX",
"quotenumber":"NAAAAA",
"creationtimeintickssinceepoch":15927195449809739,
"formdata":"{\"formModel\": {\"SomeProperty0\":\"somevalue\",\"SomeProperty1\":\"somevalue\",\"SomeProperty2\":\"somevalue\"}"...blahblahblah here,
"_version_":1670089165635584000}
I tried entering NAAAAA: no results. I tried 'SomeProperty1': no results either.
If you don't give any field names in your query, and you aren't using dismax or edismax with the qf argument, the default search field is used. It is usually named _text_; this can be configured, but it is usually given as df on the default query handler.
You'll need to include the field name when querying other fields: quotenumber:NAAAAA gets hits in the quotenumber field.
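As a small sketch of what a field-qualified query looks like when the same term should be tried against several named fields, the Python helper below ORs one term across the fields from the question's schema. fielded_query is a hypothetical helper written for this illustration; the escaping only covers backslashes and double quotes inside a quoted term.

```python
def fielded_query(term, fields):
    """OR the same quoted term across explicitly named fields."""
    # Escape the two characters that would break a double-quoted term.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join(f'{field}:"{escaped}"' for field in fields)


# Field names come from the schema in the question.
q = fielded_query("NAAAAA", ["quotenumber", "formdata"])
print(q)
```

The resulting string would be sent as the q parameter. Because quotenumber is a string field, it only matches on the complete stored value, while formdata is analyzed as text_general.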

How to return results only if four mandatory fields exist in Solr

In our scenario we receive an unknown postal address in a string format with an unknown address format. Our need is to run the search with the given postal address over all the fields and find the best match for the query.
However, if we don't have an exact match for the 4 mandatory fields - meaning SOLR returns similar results (for at least 1 mandatory field), then NO results should be displayed.
The 4 mandatory fields are BuildingNumber, LocPressName, County and PostalDistrict defined with the other search fields in the schema.xml file as follows -
<field name="uid" stored="true" indexed="true" type="uuid" default="NEW"/>
<field name="UnitNumber" stored="true" indexed="true" type="text_general"/>
<field name="UnitName" stored="true" indexed="true" type="text_general"/>
<field name="BuildingNumber" stored="true" indexed="true" type="exactish"/>
<field name="BuildingName" stored="true" indexed="true" type="text_general"/>
<field name="LocPressName" stored="true" indexed="true" type="exactish"/>
<field name="PostalDistrict" stored="true" indexed="true" type="exactish"/>
<field name="County" stored="true" indexed="true" type="exactish"/>
<field name="AddressId" stored="true" indexed="true" type="text_general"/>
<field name="ExchangeCode" stored="true" indexed="true" type="text_general"/>
<field name="PreviousCustomerName" stored="true" indexed="true" type="text_general"/>
<field name="Eircode" stored="true" indexed="true" type="text_general"/>
I am fairly new to Solr and I am not sure how to generate this query that produces the best results only if it finds a match for ALL FOUR mandatory fields.
Without the exact definition of your exactish field type it's hard to say, but assuming it behaves like a StrField, the basic, explicit version is:
q=(BuildingNumber:18 AND LocPressName:Foo AND County:Forthershire AND PostalDistrict:Bar) AND searchField:Query
Here searchField is a text_general field into which everything you want to search has been copied. You can replace it with all the other fields if needed.
Another option:
q=Query&defType=edismax&qf=UnitNumber UnitName .. etc&fq=BuildingNumber:18 AND LocPressName:Foo AND County:Forthershire AND PostalDistrict:Bar
This works the same way but allows free-form querying through the edismax query parser. The fq parameter applies a filter to the result set: a document has to match the filter to be considered at all. It does not, however, affect how a document is scored.
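The edismax variant can be assembled programmatically once the four mandatory values have been extracted from the incoming address. The Python sketch below assumes that extraction has already happened; address_params is a hypothetical helper, and the list of qf fields is only a sample of the schema's text fields.

```python
from urllib.parse import urlencode


def address_params(free_text, mandatory, search_fields):
    """Free-text edismax query, restricted by one fq that ANDs the mandatory fields."""
    # Quote each value so that spaces inside a value stay within its field clause.
    fq = " AND ".join(f'{name}:"{value}"' for name, value in mandatory.items())
    return {
        "q": free_text,
        "defType": "edismax",
        "qf": " ".join(search_fields),
        "fq": fq,
    }


params = address_params(
    "Query",
    {
        "BuildingNumber": "18",
        "LocPressName": "Foo",
        "County": "Forthershire",
        "PostalDistrict": "Bar",
    },
    ["UnitNumber", "UnitName", "BuildingName"],
)
print(urlencode(params))
```

Since the filter requires all four clauses, a document that matches only some of the mandatory fields is dropped before scoring, which is the "no results unless all four match" behaviour asked for.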

Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'

I am new to Solr and just trying to index a couple of PDF files. Starting with an empty field list in schema.xml, I keep getting the error message:
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'
(#docid and #fieldname are placeholders for real values here)
Is there a way to find out all the fields in my PDF files? Adding them one by one is not much fun :)
And what is the best way to filter these before they are loaded into Solr? schema.xml seems to be the last option. Are there any config files where I could get rid of the garbage fields sooner, possibly improving performance?
My environment: Cloudera Quickstart VM with CDH 5
Thanks in advance for your help.
You'll want to look at the ExtractingRequestHandler (aka SolrCell) and its configuration. Its documentation includes an example of how you can use uprefix to ignore all fields that are not known by the schema:
Example: uprefix=ignored_ would effectively ignore all unknown fields
generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
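As a rough sketch of how these parameters end up in a request, the Python snippet below builds two URLs for the extracting handler: one that indexes a document while routing unknown Tika fields to the ignored_ prefix, and one that uses extractOnly=true to return the extracted metadata without indexing, which is one way to see which fields a PDF produces. The host, port, and /update/extract path are assumptions based on the stock example configuration, and both helper functions are hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the extracting handler; check solrconfig.xml for the real path.
EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def index_url(doc_id, unknown_prefix="ignored_"):
    """URL for indexing a file; unknown fields are renamed with unknown_prefix."""
    params = {"literal.id": doc_id, "uprefix": unknown_prefix, "commit": "true"}
    return EXTRACT_URL + "?" + urlencode(params)


def inspect_url():
    """URL that only returns what Tika extracted, without indexing anything."""
    return EXTRACT_URL + "?" + urlencode({"extractOnly": "true"})


print(index_url("doc1"))
print(inspect_url())
```

The file itself would be sent as the body of a POST to either URL. The uprefix route only works if the schema defines the matching ignored_* dynamic field shown in the quoted example.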
There is also a list of fields defined in the example schema that lists all expected values from SolrCell and their types:
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them. Some metadata is parsed from the documents,
but there are some which come from the client context:
"content_type": From the HTTP headers of incoming stream
"resourcename": From SolrCell request param resource.name
-->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

SOLR Related facet search

I am using Solr and I have a schema something like this:
<fields>
<field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Username" type="text_general" indexed="true" stored="true" omitNorms="true" multiValued="false"/>
<field name="ServerName" type="text_general" indexed="true" stored="true" multiValued="false" />
</fields>
I want to use faceting to get the number of users per server. How can I do that?
Desired result:
server 1 : 200 (userNumber)
server 2 : 300
and so on...
Thank you.
This is not a complete solution, as I do not have your data and schema, but I think what you need is pivot faceting: http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting .
So you need to do something like this (again, you will need to adjust it to make it work for you):
http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=Username,ServerName
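If each document represents one user, a plain field facet on ServerName may already give the per-server counts, without pivoting. The Python sketch below is written under that assumption: it builds the request parameters and converts the flat [value, count, value, count, ...] list that Solr's default JSON facet output uses into a dictionary. The sample list is made up from the numbers in the question and is not real output, and both helpers are hypothetical.

```python
from urllib.parse import urlencode


def facet_params(field):
    """Parameters for a facet-only request (rows=0 suppresses the documents)."""
    return {"q": "*:*", "rows": 0, "facet": "true", "facet.field": field}


def flat_counts_to_dict(flat):
    """Turn [value, count, value, count, ...] into {value: count}."""
    return dict(zip(flat[0::2], flat[1::2]))


query_string = urlencode(facet_params("ServerName"))
print(query_string)

# Illustrative data only, shaped like facet_counts.facet_fields.ServerName.
sample = ["server1", 200, "server2", 300]
counts = flat_counts_to_dict(sample)
print(counts)
```

Note that ServerName is a text_general field in the schema above, so facet values would be individual analyzed tokens; faceting on a string-typed copy of the field is the usual way to get whole server names.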

Index and query Unique Key URL Solr

I indexed a collection of archived websites for querying with Solr. As the unique key I use the URLs of the sites. What I would like to do is use the url field in filter queries to limit the search to a certain domain when needed. For example, I want to query for "Barack Obama" but limit the results to the "whitehouse.gov" domain. This sounds like a pretty basic use case to me; however, searches on the url field do not return any results at all. Here is my config (schema.xml):
.
.
.
<field name="collection" type="string" indexed="true" stored="true"/>
<field name="content" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="string" indexed="true" stored="true"/>
<field name="digest" type="string" indexed="true" stored="true"/>
<field name="length" type="string" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="true" stored="true"/>
<field name="site" type="string" indexed="true" stored="true"/>
<field name="title" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="url" type="text_en_splitting" indexed="true" stored="true"/>
.
.
.
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>url</uniqueKey>
And here is my query (simplified):
http://mysolrserver.com:8983/solr/select/?q=content:Barack+Obama&fq=url:whitehouse.gov
The query analyzer tells me that my query should match.
Does anyone have an idea why this is not working? I would highly appreciate any hints. Thanks a lot!
The fq=url:whitehouse.gov filter should work.
However, I see a problem with the query q=content:Barack+Obama.
What is your default search field?
Does removing the query component and using q=*:* return results for you?
The query q=content:Barack+Obama is actually parsed as something like content:barack defaultsearchfield:obama.
Since the default search field would not contain obama, the query would not return any results.
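One way to keep both words bound to the content field is to group them in parentheses or to quote them as a phrase. The Python sketch below builds both forms; field_all_terms and field_phrase are hypothetical helpers, and the select URL is the one from the question.

```python
from urllib.parse import urlencode


def field_all_terms(field, text):
    """Require every whitespace-separated word in the same field."""
    return f"{field}:(" + " AND ".join(text.split()) + ")"


def field_phrase(field, text):
    """Require the words as an exact phrase in the field."""
    return f'{field}:"{text}"'


grouped = field_all_terms("content", "Barack Obama")
phrase = field_phrase("content", "Barack Obama")

# Base URL taken from the question.
url = "http://mysolrserver.com:8983/solr/select/?" + urlencode(
    {"q": grouped, "fq": "url:whitehouse.gov"}
)
print(grouped)
print(phrase)
print(url)
```

With either form no clause falls through to the default search field, so the result no longer depends on what that field contains.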
