I indexed a collection of archived websites for querying with Solr. As the unique key I use the URLs of the sites. I would like to use the url field in filter queries to limit the search to a certain domain when needed. For example, I want to query for "Barack Obama" but limit the results to the "whitehouse.gov" domain. This sounds like a pretty basic use case to me; however, searches on the url field do not return any results at all. Here is my config (schema.xml):
.
.
.
<field name="collection" type="string" indexed="true" stored="true"/>
<field name="content" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="string" indexed="true" stored="true"/>
<field name="digest" type="string" indexed="true" stored="true"/>
<field name="length" type="string" indexed="true" stored="true"/>
<field name="segment" type="string" indexed="true" stored="true"/>
<field name="site" type="string" indexed="true" stored="true"/>
<field name="title" type="text_de" indexed="true" stored="true" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="true"/>
<field name="url" type="text_en_splitting" indexed="true" stored="true"/>
.
.
.
<!-- Field to use to determine and enforce document uniqueness.
Unless this field is marked with required="false", it will be a required field
-->
<uniqueKey>url</uniqueKey>
And here is my query (simplified):
http://mysolrserver.com:8983/solr/select/?q=content:Barack+Obama&fq=url:whitehouse.gov
The query analyzer tells me that my query should match.
Does anyone have an idea why this is not working? I would highly appreciate any hints I can get. Thanks a lot!
The fq=url:whitehouse.gov filter should work.
However, I see a problem with the query q=content:Barack+Obama.
What is your default search field?
Does removing the query component and using q=*:* return results for you?
The query q=content:Barack+Obama is actually parsed into something like content:barack defaultsearchfield:obama, because only the first term is bound to the content field.
As the default search field would not contain obama, this would not return any results.
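To make the point above concrete, here is a minimal sketch of building the request so that both terms are bound to the content field by grouping them in parentheses. The helper name build_params is made up for illustration, and whether both terms must match still depends on the default operator configured for your handler.

```python
from urllib.parse import urlencode

# Host and path are taken from the question.
BASE_URL = "http://mysolrserver.com:8983/solr/select"


def build_params(terms, domain):
    """Build q/fq so that every term is searched in `content`.

    Without the parentheses only the first term is bound to `content`;
    the remaining terms fall through to the default search field.
    """
    q = "content:(" + " ".join(terms) + ")"
    return {"q": q, "fq": "url:" + domain}


params = build_params(["Barack", "Obama"], "whitehouse.gov")
url = BASE_URL + "?" + urlencode(params)
print(url)
```

If this returns hits while the original query does not, the default search field was the culprit rather than the fq on url.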
I am new to Solr and have a question regarding Solr indexing. Currently we have the configuration below to index all the fields in a tuple.
<!--contact fields -->
<field indexed="true" multiValued="false" name="contact" stored="false" type="TupleField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.first_name" stored="false" type="TextField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.last_name" stored="false" type="TextField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.email" stored="false" type="TextField"/>
I am trying to avoid indexing unwanted fields. In the above config I want to remove the indexing for first_name and last_name. Basically, I want an index on the email field only.
Do I need to remove the fields (first_name and last_name) from the above config and declare only
<field indexed="true" multiValued="false" name="contact" stored="false" type="TupleField"/>
<field docValues="true" indexed="true" multiValued="false" name="contact.email" stored="false" type="TextField"/>
or do I need to declare all the fields and set docValues and indexed to false? I guess both have the same effect, but can someone confirm that the above change is good?
In production usage you should always mention all fields, so that you don't suddenly get weird behavior from fields being added by the schemaless mode.
Keep the configuration and set indexed and docValues explicitly to false if you don't need them.
I am having trouble with Solr 8.5.2 when providing a word in a query. It's fine when the query is *:*. But when I put in a word, it does not hit any document.
Here is my schema.xml config.
<field name="quoteid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="quotenumber" type="string" indexed="true" stored="true" multiValued="false"/>
<field name="formdata" type="text_general" indexed="true" stored="true" multiValued="false"/>
<field name="creationtimeintickssinceepoch" type="plong" indexed="true" stored="true"/>
<field name="_version_" type="plong" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
Here is a sample document. (The formdata field is actually a JSON string, as you can see.)
{
"quoteid":"466f4dea-XXXX-443c-b1e4-XXXXXXX",
"quotenumber":"NAAAAA",
"creationtimeintickssinceepoch":15927195449809739,
"formdata":"{\"formModel\": {\"SomeProperty0\":\"somevalue\",\"SomeProperty1\":\"somevalue\",\"SomeProperty2\":\"somevalue\"}"...blahblahblah here,
"_version_":1670089165635584000}
I tried entering NAAAAA, no results. I tried 'SomeProperty1', no results either.
If you're not giving any field names in your query or using dismax or edismax with the qf argument, the default search field is used (usually named _text_ - this could be configured in your schema, but is usually given as df with the default query handler).
You'll need to include your field name when you're querying other fields - quotenumber:NAAAAA to get hits in the quotenumber field.
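As a rough sketch of the two options above, the snippet below builds one query string that names the field explicitly and one that keeps the bare term but passes df to pick the default field. The helper field_query is hypothetical; q and df are standard Solr request parameters.

```python
from urllib.parse import urlencode


def field_query(field, value):
    """Return field:"value" with quotes and backslashes in the value escaped."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


# Option 1: name the field explicitly in the query.
explicit_q = field_query("quotenumber", "NAAAAA")
explicit_qs = urlencode({"q": explicit_q})

# Option 2: keep the bare term, but tell Solr which field is the default.
with_df_qs = urlencode({"q": "SomeProperty1", "df": "formdata"})

print(explicit_qs)
print(with_df_qs)
```

Either query string can be appended to your core's /select URL. Note that quotenumber is a string field, so it only matches the exact stored value.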
In our scenario we receive an unknown postal address in a string format with an unknown address format. Our need is to run the search with the given postal address over all the fields and find the best match for the query.
However, if we don't have an exact match for all 4 mandatory fields, meaning Solr only returns similar results that fail to match at least one mandatory field exactly, then no results should be displayed.
The 4 mandatory fields are BuildingNumber, LocPressName, County and PostalDistrict, defined with the other search fields in the schema.xml file as follows:
<field name="uid" stored="true" indexed="true" type="uuid" default="NEW"/>
<field name="UnitNumber" stored="true" indexed="true" type="text_general"/>
<field name="UnitName" stored="true" indexed="true" type="text_general"/>
<field name="BuildingNumber" stored="true" indexed="true" type="exactish"/>
<field name="BuildingName" stored="true" indexed="true" type="text_general"/>
<field name="LocPressName" stored="true" indexed="true" type="exactish"/>
<field name="PostalDistrict" stored="true" indexed="true" type="exactish"/>
<field name="County" stored="true" indexed="true" type="exactish"/>
<field name="AddressId" stored="true" indexed="true" type="text_general"/>
<field name="ExchangeCode" stored="true" indexed="true" type="text_general"/>
<field name="PreviousCustomerName" stored="true" indexed="true" type="text_general"/>
<field name="Eircode" stored="true" indexed="true" type="text_general"/>
I am fairly new to Solr and I am not sure how to generate this query that produces the best results only if it finds a match for ALL FOUR mandatory fields.
Without knowing the exact type of your exactish field it's hard to say, but assuming that it's a StrField, the basic, explicit version is:
q=(BuildingNumber:18 AND LocPressName:Foo AND
County:Forthershire AND PostalDistrict:Bar) AND searchField:Query
where searchField is a field into which everything you want to search as text_general has been copied. You can replace this with all the other fields if needed.
Another option:
q=Query&defType=edismax&qf=UnitNumber UnitName .. etc&fq=BuildingNumber:18 AND
LocPressName:Foo AND County:Forthershire AND PostalDistrict:Bar
This works the same way, but allows free-form querying by using the edismax query parser. The fq applies a filter to your result set: documents have to match the filter to be considered at all. It does not, however, affect how a document is scored.
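A minimal sketch of the edismax variant, assuming the four mandatory values have already been extracted from the incoming address (how to do that is not covered here). It sends one fq per mandatory field, which Solr intersects, so the effect should match the single AND-ed fq above. The qf field list and the helper name are illustrative choices only.

```python
from urllib.parse import urlencode

MANDATORY = ["BuildingNumber", "LocPressName", "County", "PostalDistrict"]
# Assumed subset of the text_general fields from the schema; adjust as needed.
SEARCH_FIELDS = ["UnitNumber", "UnitName", "BuildingName", "Eircode"]


def build_address_params(free_text, mandatory_values):
    """Return request parameters: edismax query plus one filter per mandatory field."""
    missing = [f for f in MANDATORY if f not in mandatory_values]
    if missing:
        # Without all four values the "all or nothing" rule cannot be enforced.
        raise ValueError("missing mandatory fields: " + ", ".join(missing))
    params = [
        ("q", free_text),
        ("defType", "edismax"),
        ("qf", " ".join(SEARCH_FIELDS)),
    ]
    for field in MANDATORY:
        value = mandatory_values[field].replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'{field}:"{value}"'))
    return params


params = build_address_params(
    "Query",
    {"BuildingNumber": "18", "LocPressName": "Foo",
     "County": "Forthershire", "PostalDistrict": "Bar"},
)
print(urlencode(params))
```

Separate fq parameters are also cached independently by Solr, which can help when the same county or district filter recurs across requests.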
I am new to Solr and just trying to index a couple of PDF files. Starting with an empty field list in schema.xml, I keep getting the error message:
Caused by: org.apache.solr.common.SolrException: ERROR: [doc=#docid] unknown field '#fieldname'
(#docid and #fieldname are placeholders for real values here)
Is there a way to find out all the fields in my PDF files? Adding them one by one is just not much fun :)
And what is the best way to filter these before they are loaded into Solr? schema.xml seems to be the last option. Are there any config files where I could get rid of the garbage fields sooner, possibly improving performance?
My environment: Cloudera Quickstart VM with CDH 5
Thanks for your help in advance.
You'll want to look at the ExtractingRequestHandler (aka SolrCell) and its configuration. There's an example there of how you can use uprefix to ignore all fields that are not known by the schema:
Example: uprefix=ignored_ would effectively ignore all unknown fields
generated by Tika given the example schema contains <dynamicField name="ignored_*" type="ignored"/>
There is also a list of fields defined in the example schema that lists all expected values from SolrCell and their types:
<!-- Common metadata fields, named specifically to match up with
SolrCell metadata when parsing rich documents such as Word, PDF.
Some fields are multiValued only because Tika currently may return
multiple values for them. Some metadata is parsed from the documents,
but there are some which come from the client context:
"content_type": From the HTTP headers of incoming stream
"resourcename": From SolrCell request param resource.name
-->
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- Main body of document extracted by SolrCell.
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content. -->
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
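For the uprefix suggestion earlier in this answer, here is a small sketch of how the parameters for an extract request might be assembled. It assumes the handler is mounted at /update/extract, that the unique key field is named id, and that the schema has the ignored_* dynamic field; the host, core name and document id are placeholders, and a Cloudera setup may route extraction differently.

```python
from urllib.parse import urlencode

# Placeholder core URL; replace with your own.
CORE_URL = "http://localhost:8983/solr/collection1"


def build_extract_url(core_url, doc_id):
    """Build an /update/extract URL that maps unknown Tika fields to ignored_*."""
    params = {
        "literal.id": doc_id,   # sets the unique key for the extracted document
        "uprefix": "ignored_",  # prefix applied to fields not defined in the schema
        "commit": "true",
    }
    return core_url + "/update/extract?" + urlencode(params)


extract_url = build_extract_url(CORE_URL, "doc1")
print(extract_url)
```

The PDF itself would then be sent as the body of a POST to this URL.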
I am using Solr and I have a schema something like this:
<fields>
<field name="Id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Username" type="text_general" indexed="true" stored="true" omitNorms="true" multiValued="false"/>
<field name="ServerName" type="text_general" indexed="true" stored="true" multiValued="false" />
</fields>
I want to use faceting to get the number of users per server.
How can I do that?
Desired result:
server 1 : 200 (userNumber)
server 2: 300
and so on...
thank you
This is not a complete solution, as I do not have your data and schema. But what I think you need is pivot faceting: http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting
So you need to do something like this (again, you will need to adjust it to make it work for you):
http://ip:port/solr/collection1/select?q=*:*&rows=0&facet=true&facet.pivot=Username,ServerName
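As a sketch of what to do with the pivot result, the snippet below counts how many distinct Username values appear under each ServerName. The sample data is made up, and the response shape (a list of entries with field, value, count and a nested pivot list under facet_counts.facet_pivot) is my understanding of Solr's JSON output, so check it against a real response. Since both fields are text_general, the facet values will be analyzed tokens rather than the raw strings.

```python
from collections import Counter
from urllib.parse import urlencode

# Same parameters as the URL above.
pivot_query = urlencode({
    "q": "*:*",
    "rows": 0,
    "facet": "true",
    "facet.pivot": "Username,ServerName",
})


def users_per_server(pivot_entries):
    """Count distinct usernames per server from a Username,ServerName pivot."""
    counts = Counter()
    for user_entry in pivot_entries:
        # Each nested entry is one server this username occurs on.
        for server_entry in user_entry.get("pivot", []):
            counts[server_entry["value"]] += 1
    return counts


# Hypothetical pivot data, for illustration only.
sample_pivot = [
    {"field": "Username", "value": "alice", "count": 2,
     "pivot": [{"field": "ServerName", "value": "server1", "count": 2}]},
    {"field": "Username", "value": "bob", "count": 3,
     "pivot": [{"field": "ServerName", "value": "server1", "count": 1},
               {"field": "ServerName", "value": "server2", "count": 2}]},
]

result = users_per_server(sample_pivot)
print(dict(result))
```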