Solr 1.4 - Spatial search, missing last query object


I recently upgraded Solr from 1.3 to 1.4 and I'm happy with it. Now I've run into a strange problem, and I would like to know whether you see the same thing or whether I'm missing something.
I indexed a PLACE document with a latitude and longitude so that I could retrieve it with a spatial search (which already works). If I query by ID, I get the PLACE back with the schema info, and the latitude and longitude are correct. But when I run a spatial query (with the latitude and longitude of the PLACE), my place does not appear in the XML result.
The PLACE's XML:
<add>
<doc>
<field name="id">PLC||77173</field>
<field name="document_type">PLACE</field>
<field name="document_type_content"><![CDATA[POI]]></field>
<field name="latitude">45.07475</field>
<field name="longitude">7.680215</field>
</doc>
</add>
If I query Solr with "id:PLC||77173" (the primary key), this is the XML I get:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">139</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:PLC||77173
</str>
<str name="rows">10</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="document_type">PLACE</str>
<str name="document_type_content">POI</str>
<str name="id">PLC||77173</str>
<double name="latitude">45.07475</double>
<double name="longitude">7.680215</double>
</doc>
</result>
</response>
Now I run the following query:
qt=geo&lat=45.07475&long=7.680215&q=(document_type:PLACE)&radius=10&unit=km&wt=json
In the JSON/XML response (just remove json from the query to get XML) there is no trace of my PLACE (PLC||77173). I'd prefer not to paste the XML response because it is too big.

Related

How to get the file name of index Word documents in Apache Solr?

I upload and index Word documents using the following command:
java -Durl=http://localhost:8983/solr/update/extract?literal.id=1 -Dtype=application/word -jar post.jar microfost_det.doc
When I query the Solr index with
http://localhost:8983/solr/collection1/select?q=microfost&wt=xml&indent=true
the response is:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="indent">true</str>
<str name="q">microfost</str>
<str name="_">1389196238897</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
<date name="last_modified">1601-01-01T00:00:00Z</date>
<str name="author">fazlan </str>
<str name="author_s">fazlan </str>
<arr name="content_type">
<str>application/msword</str>
</arr>
<arr name="content">
<str>
This is a MSWord document. Microfost.
</str>
</arr>
<long name="_version_">1456677821213573120</long></doc>
</result>
</response>
Now my problem is that I need the name of the document that contains the queried text "microfost", that is, microfost_det.doc.
Is it possible to get the name of the Word file (filename.doc) that contains the queried text?

In Solr, the default search field is "content". That's why you get a result: the query matches on the content. To search by file name, first add a custom string field (e.g. docname) to your schema.xml.
Then restart your Solr instance and run the following command to update the existing Solr document:
curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '[{"id":"1","docname":{"set":"microfost_det.doc"}}]'
After that, run the following query and you'll get the result:
http://localhost:8983/solr/collection1/select?q=docname:microfost*&wt=xml&indent=true
Alternatively, pass the file name as a literal field when extracting the document:
java -Durl="http://localhost:8983/solr/update/extract?literal.id=1&literal.docname=microfost_det.doc" -Dtype=application/word -jar post.jar microfost_det.doc
Either way, you have to store the document name in a separate field.
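As a rough sketch of the two options above, the following Python builds the atomic-update body and the extract URL. The field name docname and the base URL come from this thread; the helper names are made up for illustration, and the script only constructs the request data, it does not contact Solr.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr"  # base URL taken from the question


def atomic_set_payload(doc_id, filename):
    """JSON body for an atomic update that sets docname on an existing doc."""
    return json.dumps([{"id": doc_id, "docname": {"set": filename}}])


def extract_url(doc_id, filename):
    """URL for /update/extract that passes the file name as a literal field."""
    params = {
        "literal.id": doc_id,
        "literal.docname": filename,
        "commit": "true",
    }
    # urlencode escapes characters such as spaces in the file name.
    return SOLR_BASE + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    print(atomic_set_payload("1", "microfost_det.doc"))
    print(extract_url("1", "microfost_det.doc"))
```

Building the URL with urlencode avoids the shell-quoting problems that an unquoted & or a space in the file name would cause on the command line.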

How to boost repeated values in a multiValue field on Solr

I have some repeated data (identical strings) in a multiValued field in my Solr index, and I want to boost documents by the number of matches in that field. For example:
doc1 : { locales : ['en_US', 'de_DE', 'fr_FR', 'en_US'] }
doc2 : { locales : ['en_US'] }
When I run the query q=locales:en_US, I would like to see doc1 at the top because it has two "en_US" values. What is the proper way to boost this kind of data?
Should I use a special tokenizer?
The Solr version is 4.5.

Disclaimer
To use either of the following solutions, you need to make one of the following changes:
Create a copyField for locales:
<field name="locales" type="string" indexed="true" stored="true" multiValued="true"/>
<!-- No need to store(stored="false") locales_text as it will only be used for searching/sorting/boosting -->
<field name="locales_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="locales" dest="locales_text"/>
Change the type of locales to "text_general" (the type is provided in the standard Solr collection1 example)
First solution (Ordering):
Results can be ordered by a function, so we can order by the number of occurrences in the field (the termfreq function):
If copyField is used, then sort query will be: termfreq(locales_text,'en_US') DESC
If locales is of text_general type, then sort query will be: termfreq(locales,'en_US') DESC
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">*,score</str>
<str name="sort">termfreq(locales_text,'en_US') DESC</str>
<str name="indent">true</str>
<str name="q">locales:en_US</str>
<str name="_">1383598933337</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="0.5945348">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">0.4203996</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can also use fl=*,termfreq(locales_text,'en_US') to see the number of matches.
One thing to keep in mind: this is a sort function, not a boost function. If you would rather boost the score based on multiple matches, you will probably be more interested in the second solution.
I included the score in the results to demonstrate what @arun was talking about. You can see that the scores differ (probably due to field length). It was quite unexpected (for me) that it is the same for a multivalued string field.
Second solution (Boosting):
If copyField is used, then the query will be : {!boost b=termfreq(locales_text,'en_US')}locales:en_US
If locales is of text_general type, then the query will be: {!boost b=termfreq(locales,'en_US')}locales:en_US
Example response for copyField option (the result is the same for text_general type):
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="lowercaseOperators">true</str>
<str name="fl">*,score</str>
<str name="indent">true</str>
<str name="q">{!boost b=termfreq(locales_text,'en_US')}locales:en_US</str>
<str name="_">1383599910386</str>
<str name="stopwords">true</str>
<str name="wt">xml</str>
<str name="defType">edismax</str>
</lst>
</lst>
<result name="response" numFound="2" start="0" maxScore="1.1890696">
<doc>
<arr name="locales">
<str>en_US</str>
<str>de_DE</str>
<str>fr_FR</str>
<str>en_US</str>
</arr>
<str name="id">4f9f71f6-7811-4c22-b5d6-c62887983d08</str>
<long name="_version_">1450808563062538240</long>
<float name="score">1.1890696</float></doc>
<doc>
<arr name="locales">
<str>en_US</str>
</arr>
<str name="id">7f93e620-cf7b-4b90-b741-f6edc9db77c9</str>
<long name="_version_">1450808391856291840</long>
<float name="score">0.5945348</float></doc>
</result>
</response>
You can see that the score changed significantly. The first document scores twice as high as the second (because there were two matches, each scored as 0.5945348).
Third solution (omitNorms=true)
Based on the answer from @arun, I figured that there is also a third option.
If you convert your field to (for example) text_general AND set omitNorms=true for that field, it should have the same result.

The default standard request handler in Solr does not use term frequency alone to compute scores. Along with term frequency, it also uses the length of the field. See the Lucene scoring algorithm, where it says:
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score.
Since doc2 has a shorter field it might have scored higher. Check the score for the results with fl=*,score in your query. To know how Solr arrived at the score, use fl=*,score&wt=xml&debugQuery=on (then right click on your browser and view page-source to see a properly indented score calculation). I believe you will see the lengthNorm contributing to a lower score for doc1.
To have length of the field not contribute to the score, you need to disable it. Set omitNorms=true for that field. (Ref: http://wiki.apache.org/solr/SchemaXml) Then see what the scores are.
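To make the effect of the length norm concrete, here is a small worked sketch. It assumes Lucene's classic TF-IDF similarity, where tf is the square root of the term frequency and lengthNorm is 1/sqrt(number of tokens in the field); idf and queryNorm are left out because they are the same for both documents. The function name is hypothetical, and real Lucene stores norms with reduced precision, so actual scores can differ slightly.

```python
import math


def classic_field_score(term_freq, field_length, omit_norms=False):
    """Per-field part of a classic TF-IDF score, ignoring idf and queryNorm.

    tf is sqrt(term frequency); lengthNorm is 1/sqrt(number of tokens).
    With omit_norms=True the length norm is treated as 1.
    """
    tf = math.sqrt(term_freq)
    length_norm = 1.0 if omit_norms else 1.0 / math.sqrt(field_length)
    return tf * length_norm


# doc1 has 4 tokens, 2 of them en_US; doc2 has 1 token, which is en_US.
doc1 = classic_field_score(term_freq=2, field_length=4)
doc2 = classic_field_score(term_freq=1, field_length=1)
print("with norms, doc1/doc2 =", doc1 / doc2)

# With norms omitted, only the term frequency separates the two documents.
doc1_no_norms = classic_field_score(2, 4, omit_norms=True)
doc2_no_norms = classic_field_score(1, 1, omit_norms=True)
print("without norms, doc1/doc2 =", doc1_no_norms / doc2_no_norms)
```

With norms, the ratio is about 0.707, so doc2 ranks first; this is the same ratio as the scores 0.4203996 and 0.5945348 reported in the other answer. Without norms, the ratio is about 1.414 and doc1 ranks first.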

How to expose the Solr DataImportHandler dataSource name in the result doc

I am importing data into Solr 4.3.0 from two different dataSources. This all works fine except that the search results do not indicate the original dataSource for each result document.
Is there a "proper" way to get the dataSource (or entity name) into the result document?
My data-config.xml looks like this (based on example given in http://wiki.apache.org/solr/DataImportHandler#Multiple_DataSources ):
<dataConfig>
<dataSource name="ds1" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB1" user="SCHEMA1" password="Passw0rd1"/>
<dataSource name="ds2" driver="oracle.jdbc.driver.OracleDriver" url="jdbc:oracle:thin:@//oracle-1:1521/DB2" user="SCHEMA2" password="Passw0rd2"/>
<document>
<entity name="apples" dataSource="ds1" pk="id" query="select id,name,color from apples" />
<entity name="bananas" dataSource="ds2" pk="id" query="select id,name,desc from bananas" />
</document>
</dataConfig>
Sample XML result set from a search looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">3</int>
<lst name="params">
<str name="indent">true</str>
<str name="q">yellow</str>
<str name="_">1370321809357</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="2" start="0">
<doc>
<str name="id">12</str>
<str name="name">Golden Delicious</str>
<str name="color">yellow</str></doc>
<doc>
<str name="id">5</str>
<str name="name">Cavendish group</str>
<str name="desc">Cavendish group is the common name for the triploid AAA group of Musa acuminata, by far the most popular cultivar by export volume. Cavendish bananas have a yellow skin and pale yellow inside when ripe.</str></doc>
</result>
</response>
Note the reason I want to know the dataSource for a given result is that the result entities have different schemas and thus need to be parsed/handled/rendered differently by the client application. Happy to see other answers that address this root problem in a different way.

Instead of storing the dataSource, why not just add an entity identifier column to each document?
This identifier field would be a fixed-value column, probably embedded in the query itself.
For example, use an alias in the SQL: SELECT 'APPLE' AS ENTITY_TYPE
You can use this field to determine what type of parsing is needed for the respective entity.
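On the client side, the identifier can then drive the parsing. The sketch below assumes the alias ends up in a stored field called entity_type (a hypothetical name that would also need to be declared in schema.xml) and uses made-up renderer functions for the two entities from the question.

```python
def render_apple(doc):
    """Apples carry a color field."""
    return "{name} ({color})".format(**doc)


def render_banana(doc):
    """Bananas carry a free-text desc field."""
    return "{name}: {desc}".format(**doc)


# Maps the fixed value selected in SQL to the matching renderer.
RENDERERS = {"APPLE": render_apple, "BANANA": render_banana}


def render(doc):
    """Pick a renderer based on the entity_type field of a result document."""
    entity_type = doc.get("entity_type")
    if entity_type not in RENDERERS:
        raise ValueError("unknown or missing entity_type: %r" % (entity_type,))
    return RENDERERS[entity_type](doc)


if __name__ == "__main__":
    apple = {"id": "12", "name": "Golden Delicious", "color": "yellow",
             "entity_type": "APPLE"}
    print(render(apple))
```

Failing loudly on an unknown entity_type makes it obvious when a new entity is added to data-config.xml without a matching renderer.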

solr group count giving wrong count

I am using Solr 4 and I have an issue with grouping. Here is the query I used for grouping:
http://****/solr.war/collection1/select?q=name%3Awhat%26a%26girl%26wants&fl=name%2Cprice%2Cupc&wt=xml&indent=true&group=true&group.ngroups=true&group.facet=true&group.field=upc&group.sort=price+asc
This is the output:
<lst>
<str name="groupValue">085391170112</str>
<result name="doclist" numFound="1" start="0">
<doc>
<str name="name">What a Girl Wants/Chasing Liberty - DVD</str>
<str name="upc">085391170112</str>
<float name="price">9.99</float></doc>
</result>
</lst>
<lst>
The 'numFound' is 1 here, but when I copy that 'upc' and search for it using the following query
http://*****/solr.war/collection1/select?q=upc%3A085391170112&fl=name%2Cupc&wt=xml&indent=true
I get:
<result name="response" numFound="2" start="0">
<doc>
<str name="name">What a Girl Wants/Chasing Liberty - DVD</str>
<str name="upc">085391170112</str></doc>
<doc>
<str name="upc">085391170112</str>
<str name="name">Sergio Vitier - Visiones Temas Para Cine</str></doc>
</result>
The 'numFound' is 2 in the upc search.
My schema is:
<field name="upc" type="string" indexed="true" stored="true" multiValued="false"/>

For the first query,
http://****/solr.war/collection1/select?q=name%3Awhat%26a%26girl%26wants&fl=name%2Cprice%2Cupc&wt=xml&indent=true&group=true&group.ngroups=true&group.facet=true&group.field=upc&group.sort=price+asc
you got numFound = 1 because your query
q=name%3Awhat%26a%26girl%26wants
matches only the following doc based on name (not based on "upc"):
<doc>
<str name="name">What a Girl Wants/Chasing Liberty - DVD</str>
<str name="upc">085391170112</str></doc>
On the other hand, in your second query,
http://*****/solr.war/collection1/select?q=upc%3A085391170112&fl=name%2Cupc&wt=xml&indent=true
you searched on "upc", which matches all documents with the given "upc"; it does not filter the results by name:what%26a%26girl%26wants.
So naturally the counts differ, because the two queries produce two different result sets.
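The difference can be reproduced with a toy model of the two queries. This is only an illustration in plain Python, with a crude substring match standing in for Solr's name query; the two documents are the ones shown in the question.

```python
from collections import defaultdict

DOCS = [
    {"name": "What a Girl Wants/Chasing Liberty - DVD", "upc": "085391170112"},
    {"name": "Sergio Vitier - Visiones Temas Para Cine", "upc": "085391170112"},
]


def matches_name(doc, words):
    """Crude stand-in for the name query: every word must occur in the name."""
    name = doc["name"].lower()
    return all(word in name for word in words)


def group_by_upc(docs):
    """Group already-matched documents by their upc value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["upc"]].append(doc)
    return groups


# Query 1: restrict by name first, then group the matches by upc.
name_hits = [d for d in DOCS if matches_name(d, ["what", "girl", "wants"])]
grouped = group_by_upc(name_hits)
print(len(grouped["085391170112"]))  # 1: only the DVD matched the name query

# Query 2: match on upc alone, with no name restriction.
upc_hits = [d for d in DOCS if d["upc"] == "085391170112"]
print(len(upc_hits))  # 2: both documents share the upc
```

Grouping only ever sees the documents that matched q, so the numFound inside a group counts matching documents with that upc, not all documents with that upc.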

Multiple cores join query

My Solr version is 4.0.
I have a multicore environment with a core for products and a core for availability records of these products.
The products core contains detailed descriptions and has about 10,000 documents.
The availabilities core contains up to 4 million documents.
I built a small test set and I'm trying to get results using the join syntax, meant to find all availabilities of products containing "disney".
http://localhost:8080/solr/product/select?q={!join%20from=productid%20to=id%20fromindex=availp}disney&fl=*
I get zero results.
Individual queries on each of the cores do yield results.
Questions:
1. How should I construct the query in order to get results?
2. When I refine my query to filter on a specific date, what would the syntax be?
For example: ?fq=period:"november 2012" AND country:France
country is a field from the product index; period is a field from the availp index.
Results from individual queries: product core
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="fl">id,productname</str>
<str name="indent">1</str>
<str name="q">disney</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="31" start="0">
<doc>
<str name="productname">DPAZ00 DPAZ00-02 DPAZ0002 Disneyland Parijs Hotel Disney's Santa Fe</str>
<str name="id">44044</str></doc>
</result>
</response>
other core: availp
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="fl">*</str>
<str name="indent">1</str>
<str name="q">productid:44044</str>
<str name="rows">1</str>
</lst>
</lst>
<result name="response" numFound="42" start="0">
<doc>
<date name="datefrom">2012-10-01T10:00:00Z</date>
<arr name="period">
<str>oktober 2012</str>
</arr>
<str name="productid">44044</str>
<double name="minpriceperperson">209.0</double>
<int name="durationcode">1</int>
<str name="id">3890</str>
<int name="budgetcode">2</int></doc>
</result>
</response>

1) You should query the inventory core (with product as the inner index).
The query should look like this:
http://localhost:8080/solr/product/select?q=*&fl={!join from=id to=id fromIndex=availp}productname:disney
2) You can use the same query syntax as above.
http://localhost:8080/solr/product/select?q=period:november&fl={!join from=id to=id fromIndex=availp}productname:disney AND country:France
You can remove productname from the above if it is not needed.

Have you tried by changing the fromindex to fromIndex (uppercase I)?

According to Adventures with Solr Join, the query looks like this:
http://localhost:8983/solr/parents/select?q=alive:yes AND _query_:"{!join fromIndex=children from=fatherid to=parentid v='childname:Tom'}"
It should work.
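For the cores in the question, a sketch of how such join URLs could be assembled is shown below. It relies on the join parser's convention that the from field belongs to the core named in fromIndex and the to field belongs to the core being queried. The core names product and availp and the field names are taken from the question; the helper names are hypothetical, and the code only builds URLs without contacting Solr.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080/solr"  # host and port from the question


def availabilities_for_products(product_query):
    """Query availp, keeping records whose productid joins to a matching product id."""
    join = "{!join from=id to=productid fromIndex=product}" + product_query
    return BASE + "/availp/select?" + urlencode({"q": join, "fl": "*"})


def products_with_availability(product_query, country, period):
    """Query product, filtered by country and by a join on availp.period."""
    params = [
        ("q", product_query),
        ("fq", "country:" + country),
        ("fq", '{!join from=productid to=id fromIndex=availp}period:"%s"' % period),
        ("fl", "id,productname"),
    ]
    return BASE + "/product/select?" + urlencode(params)


if __name__ == "__main__":
    print(availabilities_for_products("disney"))
    print(products_with_availability("disney", "France", "november 2012"))
```

A join returns documents from the core being queried only, so the first URL yields availability records and the second yields products. Filters on fields of the other core have to go inside the join query, as with period in the second helper.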
