solr highlighting in old and new versions

solr highlighting in old and new versions - solr

I am migrating a web site from an old version of solr (1.4.1) to the current release version (5.2.1) on a different machine and noticing some differences.
In the old version, I could get highlighting with a url like this:
http://localhost:8983/solr/select?indent=on&q=text:software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200
In the new version, one thing that's different is I need to specify a collection. Another difference is that the new version gives an error if I put text: in front of the value of q.
So, taking into account those differences, I end up with a URL like this:
http://localhost:8983/solr/default/select?indent=on&q=software/&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.snippets=1&hl.fl=%2a&hl.fragsize=200
That second URL does not give me highlighting fragments/snippets. That is to say, where the old URL would give something like this:
"highlighting":{
"document0_id":{"text":["The <em>software</em> is awesome"]}}
The new URL gives something like this:
"highlighting":{
"document0_id":{}}
What do I need to do to get highlighting fragments returned in solr 5.2.1?
[edited]
In addition, I tried selecting a single document by its id on both machines. On the old machine, a url like
http://localhost:8983/solr/select?wt=json&indent=true&q=id:thedocumentid
returns some JSON that includes a text field containing the full searchable text of the original HTML document. On the new machine a similar url (but one that includes the collection):
http://localhost:8983/solr/default/select?wt=json&indent=true&q=id:thedocumentid
...returns similar JSON that does not include the text field.
I note that searching returns the correct results; the problem is that on the new machine, the results do not include the highlighting fragments.
So it seems like maybe the issue is that I need to specify that these documents have a text field when I index them; how do I do that?

A colleague (not tempted by the bounty) noticed that my text field had stored="false" in my schema.xml and suggsted changing it to true. That did the trick.

In the first query you are specifically searching in the text field and in the second its not.
And in the second you have mentioned hl.fl which means "Specifies a list of fields to highlight. Accepts a comma- or space-delimited list of fields for which Solr should generate highlighted snippets. If left blank, highlights the defaultSearchField"
Try again by making the changes...
http://localhost:8983/solr/default/select?q=text:software&start=0&rows=10&fl=id,score,title&wt=json&hl=on&hl.fragsize=200

Related

Azure suggester returning all content

I'm trying to implement an Azure suggester feature into our pilot Azure search app and running into issues. The content I'm indexing are PDF files, so my suggester definition is based on the content field itself which can be thousands of lines of text. Following examples online, when I implement the suggester, I'm returned the entire content of the body of text from the PDF file. What I'd really like to do is return just a phrase found in the text.
For instance, suppose I'm indexing a Harry Potter book and I type into my search field "Dum", I'd like to see suggested results back like "Dumbledore", "Dementor", etc VS the whole book. Is this possible?
Tks

If we want to search for words sharing the same prefix, Autocomplete is the right API for this job. https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
In contrast, Suggester API helps users find the documents containing words with that prefix. It returns text snippets containing those worlds.
If you still believe suggester api does not behave as expected and autocomplete is not suitable, let me know your source document, query and expected results.

How to query solr field for a substring

My use case:
I have a single-valued field called cqpath. This is a textfield and has a values that look something like the following:
"/content/domain/en/path/to/some/page"
"/content/domain/en/path/to/another/page"
"/content/domain/en-us/path/to/some/page"
"/content/domain/en-us/path/to/another/page"
I wanted to form a query that would return me 1. and 2. I'd been trying along the lines of writing:
cqpath: "/content/domain/en"
which has been discovered to be erroneous, since it retrieves items 3. and 4. as well. Could any of you think of a way to write a query that returns only 1. and 2. and not 3. and 4.?
This is a normal textfield field-type. Really do appreciate your help.

Starting from Solr 4.0 you can use a regex query. You can find some useful examples here.
In your case, you can get the results that you're looking for using something like:
cqpath:/.*content/domain/en.*/

It looks like you are trying to match partial paths here with boundaries on path elements (slashes). The usual generic solution is to tokenize during index to generate all alternative completions and not tokenize during query. So, the field type declaration is not symmetric. There are examples of that in Solr distribution. And you would look at using something like (index-time only) EdgeNGramFilterFactory instead of much more expensive regex matching.
For your specific case, you may want to look at testPathHierarchyTokenizer which does that for you automatically.
And if your content were more like full URLs than just path, you could also be interested by a custom update request processor chain that includes URLClassify URP. It is not very documented, but mentions generating url parts, which is what I think you would want.

Generate highlighted snippets in Solr for PDFs

I'm new to solr. I've set up a solr server and have indexed a few thousand PDFs. I am trying to query solr via the rest API in a PHP page. I am trying to build something similar to the solritas interface included in the tutorial (solrserver/browse), but I don't know how to generate highlighted snippets. I found in the documentation "hl" is a query parameter and is by default set to false.
When I get http://solrserver/?q=search+term&hl=true I get back a response with a hightlighting section, but it only contains the document IDs, no generated snippets.
I am using the tutorial provided schema and config for solr 4.2.1. I believe that the configuration is fine because solritas is able to display highlighted snippets using the same indexed data. I've tried seeing how solritas is built but it's separated out in .vm template files and I haven't been able to find what I'm looking for yet.
I can see the full text of the PDF in the doc->content area, so it is stored. I think I just don't understand the proper way to generate snippets! Can someone please help!
Thanks :)

I would suggest, you should try using hl.fl parameter. So your query should be something like this:
?q=search+term&hl=true&hl.fl=field1,field2,field3
Where field1, field2 and field3 are three source fields you would like to generate highlights.
In your case, if the field name you want to use for highlighting is content, your query can be:
?q=search+term&hl=true&hl.fl=content
More details: http://docs.lucidworks.com/display/solr/Highlighting
With highlighting, you can even specify fragment size, HTML tags around highlighted text etc...

Solr search – Tika extracted text from PDF not return highlighting snippet

I have successfully indexed Pdf –using Tika- and pure text –fetched from database- in one single collection. Now I am trying to implement highlighting. When I querying Solr i placing in the url the following: http://myhost:8090/solr/ktm/select/?q=BlahBlah&start=0&rows=120&indent=on&hl=true&wt=json . Everything is OK. The received output has the original (not highlighted text) content under “docs” and the highlighted snippets under “highlighting”. But I had noticed the documents that have been extracted by Tika don’t have “highlighting” snippet. That kind of response, cause me many troubles (zero length rows). Is there any workaround in order to tackle it? I have already tried to copyField (at index time) but the response come out blank ({“highlighting”:{}}). I really need help on this.

Solr configuration

I'm very new with Solr,
And I really want a step by step to have my Solr search result like the google one.
To give you an idea, when you search 'PHP' in http://wiki.apache.org/solr/FindPage , the word 'php' shows up in bold .. This is the same result I want to have.
Showing only a parser even if the pdf is a very huge one.

You can use highlighting to show a matching snippet in the results.
http://wiki.apache.org/solr/HighlightingParameters
By default, it will wrap matching words with <em> tags, but you can change this by setting the hl.simple.pre/hl.simple.post parameters.

You may be looking at the wrong part of the returned data. Try looking at the 'highlighting' component of the returned data structure (i.e. don't look at the response docs). This should give you the snippets you want.