Solr highlighting distinction

When I search for the word "Unternehmen" with the following query:
q="Unternehmen"&
hl=true&
hl.simple.pre=<em>&
hl.simple.post=</em>
I get this result if the word "Unternehmen" occurs twice in close proximity within the same field:
Um als <em>Unternehmen die Zukunft erfolgreich zu gestalten, brauchen Unternehmen</em> Innovationen
When "Unternehmen" occurs twice in the same field but not close together, Solr gives me:
olle in <em>Unternehmen</em>, hat eine lange Tradition in der Betriebswirtschaftslehre. Dieses Special Issue widmet sich in vielfältiger Weise dem Thema der geeigneten Corporate Governance in mittelständischen oder öffentlichen deutschen <em> Unternehmen </em>.
How do I prevent Solr from merging highlights when the matches are close together? I always want the second variant of highlighting.
I already tried hl.fragsize set to 0, 10, and 1000, and hl.snippets=2, with no visible effect.

I tried your example in Solr 5 using the standard /select request handler, and highlighting worked as expected (your second variation) on the first sentence.
Try the following:
What field type does the field you are querying and highlighting have? Even if the text is German, try "text_en" as the field type and re-index the document as a test.
Try querying and highlighting not on the general catch-all field but specifically on the field that holds your text (let's assume it is "mytext"), such as:
q=mytext:"Unternehmen"&hl=true&hl.fl=mytext&hl.simple.pre=<em>&hl.simple.post=</em>
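The same field-specific request can be assembled programmatically, which avoids hand-escaping the quotes and angle brackets. This is a minimal sketch, assuming the field is called "mytext" as in the example above; the helper name build_highlight_query is made up for illustration.

```python
from urllib.parse import urlencode

# Assumed field name, taken from the example query above; replace with your own.
FIELD = "mytext"


def build_highlight_query(term, field=FIELD):
    """Return a URL-encoded query string for a field-specific highlight request."""
    params = {
        "q": f'{field}:"{term}"',   # phrase query restricted to one field
        "hl": "true",
        "hl.fl": field,             # highlight only in that same field
        "hl.simple.pre": "<em>",
        "hl.simple.post": "</em>",
    }
    return urlencode(params)


query_string = build_highlight_query("Unternehmen")
print(query_string)
```

The resulting string can be appended to the /select URL of your core after a "?".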

Related

Solr - How to fix "Error adding field ... msg=For input string" when posting data to a core

I am new to Solr.
I created a Solr (8.1.0) core using SolrCloud for testing, and I am trying to post data as a JSON file.
When an object has a float value such as "spalte412": "35.5", or a value with special characters, it throws an error in the console:
SimplePostTool: WARNING: Response: {
"responseHeader":{
"rf":2,
"status":400,
"QTime":223},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","java.lang.NumberFormatException"],
"msg":"ERROR: [doc=52] Error adding field 'spalte421'='156.6' msg=For input string: \"156.6\"",
"code":400}}
I tried to edit the core's schema by adding the field in the Admin UI, without success.
Thanks for your help!
If you're not pre-defining your fields, the type chosen for a field depends on the first document submitted that contains that field. Solr uses the value in that document to guess the field's type, and in this case the guessed type doesn't match the format you're sending in later documents.
The schemaless mode is neat for prototyping, but when moving to production you should always add the fields up front with the correct types so you don't suddenly get any surprises (as above) when the documents are submitted in a different order (or different documents) than when developing.
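The order dependence described above can be illustrated with a toy model. This is only a sketch of the idea, not Solr's actual guessing logic: the functions guess_type and index_values are invented for the illustration, and Python's ValueError stands in for the java.lang.NumberFormatException from the error message.

```python
def guess_type(value):
    """Toy stand-in for type guessing: pick the narrowest type that parses."""
    for name, parser in (("long", int), ("double", float)):
        try:
            parser(value)
            return name
        except ValueError:
            continue
    return "string"


PARSERS = {"long": int, "double": float, "string": str}


def index_values(values):
    """Lock the field type on the first value, then parse every value with it."""
    field_type = guess_type(values[0])
    rejected = []
    for value in values:
        try:
            PARSERS[field_type](value)
        except ValueError:
            rejected.append(value)  # comparable to the 400 response from Solr
    return field_type, rejected


# Same two values, different order, different outcome.
print(index_values(["156", "156.6"]))
print(index_values(["156.6", "156"]))
```

In the first call the field is locked to an integer type and "156.6" is rejected; in the second call both values are accepted. Defining the field type up front removes this dependence on document order.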
You can define fields in schema.xml or through the Schema API.
You should post your schema.xml and a short description of what you did before.
"root-error-class","java.lang.NumberFormatException"
It sounds like Solr was unable to parse that number format when you tried to add a document containing a string (msg=For input string: \"156.6\").
It sounds like you have a mismatch between the delivered and the expected format.
Thanks guys.
Indeed, I solved it by deleting the fields in the Admin UI and defining them with:
curl -X POST -H 'Content-type:application/json' --data-binary '{"add-field": {"name":"name", "type":"text_general", "multiValued":false, "stored":true}}' http://localhost:8983/solr/films/schema

Error adding field 'field_name'-'field_value' msg=For input string: \"field_Value\"

We occasionally struggle to import certain files into Solr. It seems that some documents have odd metadata values; we are not sure whether they come from an unusual word processor or something else. Here are two examples:
Type: Solarium\Exception\HttpException
Message: Solr HTTP error: OK (400)
{"responseHeader":{"status":400,"QTime":49},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],"msg":"ERROR: [doc=3932487729] Error adding field 'brightness_value'='6.18' msg=For input string: \"6.18\"","code":400}}
And
Type: Solarium\Exception\HttpException
Severity: error --> Exception: Solr HTTP error: OK (400)
{"responseHeader":{"status":400,"QTime":72},"error":{"metadata":["error-class","org.apache.solr.common.SolrException","root-error-class","java.lang.NumberFormatException"],"msg":"ERROR: [doc=16996] Error adding field 'version'='5.3.1' msg=For input string: \"5.3.1\"","code":400}}
How do we prevent these issues? We are not in control of the documents, so we need to fix it on the server.
Define the field type explicitly in the schema instead of relying on Solr to create the field type for you. The first document that contains the field makes Solr guess the type of the field, and if later documents don't match that expected format, you'll get an error like this.
Always define the schema for a collection when using it in production or in an actual application - the schemaless mode is really neat for prototyping and experimenting, but in an actual application you want the types to be well defined.
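For the two fields in the errors above, explicit definitions could be prepared as Schema API add-field commands. This is a sketch under assumptions: the type names "pdouble" and "string" are the ones found in Solr's default configset in recent versions, so check which types your own schema defines, and note that a value like 5.3.1 is not a number at all and needs a string-like type. The helper name add_field_payload is invented here.

```python
import json

# Assumed type names; verify them against the field types in your own schema.
FIELDS = {
    "brightness_value": "pdouble",  # values like 6.18
    "version": "string",            # values like 5.3.1 are not numeric
}


def add_field_payload(name, field_type):
    """Build the JSON body for one Schema API add-field command."""
    return json.dumps(
        {"add-field": {"name": name, "type": field_type, "stored": True}}
    )


payloads = [add_field_payload(name, ftype) for name, ftype in FIELDS.items()]
for body in payloads:
    print(body)
```

Each body would be POSTed with Content-type application/json to the /schema endpoint of the collection. If a field was already auto-created with the wrong type, it has to be corrected in the schema first and the affected documents re-indexed.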

CakePHP 3 localization issue

I'm using the localization feature to translate my app into French, but I'm running into some constraints I was not expecting.
In the default.po file, when I set:
msgid "MainLoginAccountLocked"
msgstr "Votre compte a été verrouillé."
It works fine. I can see the message translated.
But when I set:
msgid "Main Login Account Locked"
msgstr "Votre compte a été verrouillé."
It's not working. I get the "Main Login Account Locked" key instead of the translation.
Are there limitations on msgid values, or restricted values?
I've found nothing in the docs that could help me.
There is no restriction on spaces in the msgid; I use them all the time.
@Nds's advice is pertinent: clear your cache.
In the past I also ran into a problem with the compiled version of the .po file, which needed to be regenerated.
Carefully check for spaces at the beginning and end of the string.
You should also verify the encodings involved. If you are using UTF-8 everywhere (as I hope), that shouldn't be the issue.
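The whitespace check suggested above can be automated. The following is a rough sketch only: it scans single-line msgid entries with a regular expression and ignores multi-line msgids and escape sequences, and the function name suspicious_msgids is made up for this example.

```python
import re

# Matches single-line entries such as: msgid "Some text"
MSGID_RE = re.compile(r'^msgid\s+"(.*)"\s*$')


def suspicious_msgids(po_text):
    """Return msgid strings that start or end with whitespace."""
    found = []
    for line in po_text.splitlines():
        match = MSGID_RE.match(line)
        if match and match.group(1) != match.group(1).strip():
            found.append(match.group(1))
    return found


# Sample with a trailing space inside the msgid, which is easy to miss by eye.
sample = (
    'msgid "Main Login Account Locked "\n'
    'msgstr "Votre compte a été verrouillé."\n'
)
print(suspicious_msgids(sample))
```

A msgid with a stray space will not match the key passed to the translation function, so the untranslated key is shown instead.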

Solr: how to limit the searched content in a Solr query

I want to search for words up to a particular line and not beyond it, using a Solr query. I have tried a proximity match, but it didn't work. My data looks like this:
Blockquote"Date: Thu, 24 Jul 2014 09:36:44 GMT\nCache-Control: private\nContent-Type: application/json; charset=utf-8\nContent-Encoding: gzip\nVary: Accept-Encoding\nP3P: CP=%20CURo TAIo IVAo IVDo ONL UNI COM NAV INT DEM STA OUR%20\nX-Powered-By: ASP.NET\nContent-Length: 570 \nKeep-Alive: timeout=120\nConnection: Keep-Alive\n\n[{%20rows%20:[],%20index%20:[],%20folders%20:[[%20Inbox%20,%20Inbox%20,%20%20,1,1,0,0,0,%20Inbox%20,0,0,%20none%20,0],[%20Drafts%20,%20Drafts%20,%20%20,1,1,0,0,0,%20Drafts%20,0,0,%20none%20,0],[%20Sent%20,%20Sent%20,%20%20,1,1,0,0,11,%20Sent%20,1,0,%20none%20,0],[%20Spam%20,%20Spam%20,%20%20,1,1,0,0,0,%20Spam%20,1,0,%20none%20,0],[%20Deleted%20,%20Trash%20,%20%20,1,1,0,7,9,%20Deleted%20,1,0,%20none%20,0],[%20Saved%20,%20Saved Mail%20,%20%20,1,1,0,0,0,%20Saved%20,1,0,%20none%20,0],[%20SavedIMs%20,%20Saved Chats%20,%20Saved%20,2,1,0,0,0,%20SavedIMs%20,1,0,%20none%20,0]],%20fcsupport%20:true,%20hasNewMsg%20:false,%20totalItems%20:0,%20isSuccess%20:true,%20foldersCanMoveTo%20:[%20Sent%20,%20Spam%20,%20Deleted%20,%20Saved%20,%20SavedIMs%20],%20indexStart%20:0}]POST /38664-816/aol-6/en-us/common/rpc/RPC.aspx?user=hl1lkgReIh&transport=xmlhttp&r=0.019667088333411797&a=GetMessageList&l=31211 HTTP/1.1\nHost: mail.aol.com\nUser-Agent: Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,/;q=0.8\nAccept-Language: en-US,en;q=0.5\nAccept-Encoding: gzip, deflate\nContent-Type: application/x-www-form-urlencoded; charset=UTF-8\nX-Requested-With: XMLHttpRequest\nReferer: http://mail.aol.com/38664-816/aol-6/en-us/Suite.aspx\nContent-Length: 452\nCookie: mbox=PC#1405514778803-136292.22_06#1407395182|session#1406185366924-436868#1406187442|check#true#1406185642; s_pers=%20s_fid%3D55C638B5F089E6FB-19ACDEED1644FD86%7C1469344726539%3B%20s_getnr%3D1406186326569-Repeat%7C1469258326569%3B%20s_nrgvo%3DRepeat%7C1469258326571%3B; s_vi=[CS]v1|29E33A0D051D366F-60000105200097FF[CE]; 
UNAUTHID=1.5efb4a11934a40b8b5272557263dadfe.88c5; RSP_COOKIE=type=30&name=YWxzaGFraWIyMDE0&sn=MzRb%2FjjHIe8odpr%2FfxZR2g%3D%3D&stype=0&agrp=M; LTState=ver:5&lav:22&un:*UQo5AwAnAytffwJSYg%3d%3d&sn:*UQo5AwAnAytffwJSYg%3d%3d&uv:AOL&lc:en-us&ud:aol.com&ea:*UQo5AwAnAytffwJSCAsnWWoJASZL&prmc:825345&mt:6&ams:1&cmai:365&snt:0&vnop:False&mh:core-mia002b.r1000.mail.aol.com&br:100&wm:mail.aol.com&ckd:.mail.aol.com&ckp:%2f&ha:1NGRuUTRRxGFF2s5A4JwkuCT43Q%3d&; aolweatherlocation=10003; DataLayer=cons%3D6.107%26coms%3D629; grvinsights=69f3a2bb86ed3cd31aa1d14a1ce9e845; CUNAUTHID=1.5efb4a11934a40b8b5272557263dadfe.88c5; s_sess=%20s_cc%3Dtrue%3B%20s_sq%3Daolcmp%253D%252526pid%25253Dcmp%2525253A%25252520Help%25252520%2525257C%25252520View%25252520Article%2525253A%25252520Clear%25252520cookies%2525252C%25252520cache%2525252C%25252520history%25252520and%25252520footprints%252526pidt%25253D1%252526oid%25253Dhttp%2525253A%2525252F%2525252Fwebmail.aol.com%2525252F%2525253F_AOLLOCAL%2525253Dmail%252526ot%25253DA%2526aolsnssignin%253D%252526pid%25253Dsso%25252520%2525253A%25252520login%252526pidt%25253D1%252526oid%25253DSign%25252520In%252526oidt%25253D3%252526ot%25253DSUBMIT%3B; L7Id=31211; Context=ver:3&sid:923f783b-bc6e-4edf-87c9-e52f19b3ce67&rt:STANDARD&i:f&ckd:.mail.aol.com&ckp:%2f&ha:X80Ku4ffRKsOVSwgmEVPCfpfxeU%3d&; IDP_A=s-1-V0c3QiuO6BzQ5S6_u3s0brfUqMCktezAz7sWlVfHD90omIijDXRrMJkSM-9-xcnUcSTnXbcZ1aUCgvfuToVeJihcftKY5KtsC_nB7Y9qf6P0xUnNfCIAmWVtRf4ctSQ9JwRIzHa40dhFuULwYLu3NUPTxckeFUFAzcSS4hrmb4grhEtyOGp0qV5rIKtjs4u8; MC_CMP_ESK=NonSense; SNS_AA=asrc=2&sst=1406185424&type=0; _utd=gd#MzRb%2FjjHIe8odpr%2FfxZR2g%3D%3D|pr#a|st#sns.webmail.aol.com|uid#; 
Auth=ver:22&uas:*UQo5AwAnAytffwJSZAskRiwLBSIDWVpVXxVTVwJCLFxdSnpHUWBbeV1jcikERgl6CEYLJUweGUhdFQQLW1h%2bBAZRcllWfVl8VH4DUmRaZARoPhw%2bBFBA&idl:0&un:*UQo5AwAnAytffwJSYg%3d%3d&at:SNS&sn:*UQo5AwAnAytffwJSYg%3d%3d&wim:%252FwQCAAAAAAAEk2ihy%252BE4MMebm4R1jvxY07zNZhFOHSz2EFBnsNdOAUsl8QyZceo54kWYZ4vwVayLFF7w&sty:0&ud:aol.com&uid:hl1lkgReIh&ss:635417678271359104&svs:SNS_AA%7c1406185424&la:635417687268954835&aat:A&act:M&br:100&cbr:AOL&mt:&pay:0&mbt:G&uv:AOL&lc:en-us&bid:1&acd:1403348988&pix:3829&prmc:825345&relm:aol&mah:%2\nConnection: keep-alive\n"
I want to search for Content-Type: application/json in the data, and not beyond this line. I have tried:
http://192.168.0.164:8983/solr/collection_with_all_details/select?q=Content%3AContent-Typejson*&wt=json&indent=true
but it searches the entire content. I need to limit the content that is searched.
I don't think it is possible in this case. You can use the highlighter to return the first 200 characters in the highlighting response.
Maybe you need to think about writing a custom response writer, which could help with this.
One more option, which will be more efficient, is to create an additional field with indexed="false" stored="true".
Define your original field with indexed="true" stored="false", which reduces your index size. The new copy field will be indexed="false" stored="true".
<copyField source="text" dest="textShort" maxChars="200"/>
Check if this works out for you.
You should really, really pre-process your data so that you only index the part you're going to use. Doing it after the fact will not be a good solution, as you'll have most of the content in the index already, and you're looking for a separator that isn't located at one fixed position (a fixed cut-off is all that maxChars can handle).
Depending on how you're indexing, you can do it either in the indexing step (RegexTransformer, your own code using SolrJ, etc.) or in the analysis step, using something like a PatternReplaceFilter. That would allow you to remove anything after the header you're looking for.
That way you should be able to index the content into one header field and one body field for example, depending on your need.
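The pre-processing step could look roughly like the following. It is a sketch under assumptions: the raw text is taken to contain real newline characters, the cut is made after the first line starting with "Content-Type:", and the field names header_part and body_part are hypothetical.

```python
def split_at_header(raw, header="Content-Type:"):
    """Split raw text after the first line that starts with `header`.

    Returns (head, rest). If no line matches, head is empty and
    everything goes into rest.
    """
    lines = raw.split("\n")
    for i, line in enumerate(lines):
        if line.startswith(header):
            return "\n".join(lines[: i + 1]), "\n".join(lines[i + 1:])
    return "", raw


sample = (
    "Date: Thu, 24 Jul 2014 09:36:44 GMT\n"
    "Cache-Control: private\n"
    "Content-Type: application/json; charset=utf-8\n"
    "Content-Encoding: gzip\n"
)

head, rest = split_at_header(sample)

# Hypothetical field names; the searchable part and the remainder are
# indexed into separate fields so queries can target only the first one.
doc = {"header_part": head, "body_part": rest}
print(doc["header_part"])
```

Queries that should stop at the Content-Type line then only need to target the first field.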

Solr 5: only first word gets highlighted

Search for: "test string"
Highlighting I get: <em>test</em> string
This has been reported as a bug and is allegedly fixed:
Solr: Multi Word Synonyms : Only first word is highlighting
However, here's my version of Lucene:
<luceneMatchVersion>5.0.0</luceneMatchVersion>
How is it possible that I'm still getting this behaviour?
EDIT:
There are no special settings related to highlighting in my solrconfig.xml
Here is the query I use:
hl=true
&hl.simple.pre=<em>
&hl.simple.post=</em>
&hl.fl=Comments,Summary
My problem was that the application was incorrectly parsing tags returned by the highlighter - not Solr's fault at all!
