Get text snippet from search index generated by solr and nutch

I have just configured Nutch and Solr to crawl and index the text of a web site by following the getting-started tutorials. Now I am trying to build a search page by modifying the example Velocity templates.
My question: how can I tell Solr to return a relevant text snippet from the content of each hit? I only get the following fields with each hit:
score, boost, digest, id, segment, title, date, tstamp and url.
The content is definitely indexed, because I can search for words that I know occur only in the full text, but the full text is not returned with the hit.

Don't forget: indexed is not the same as stored.
You can search for words in a document if its fields are indexed, even if no field is stored.
To get the content of a specific field back in the results, the field must also be declared with stored="true" in schema.xml.
If your full-text field is stored, then the default field list probably does not include it.
You can add it with the fl parameter:
http://<solr-url>:port/select/?......&fl=mytext,*
(This example assumes your full text is stored in a field called mytext.)
Finally, if you want only a snippet of the text around the matched words rather than the whole text, look at the highlighting component in Solr/Lucene.
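As a rough illustration of the two suggestions above, the following Python sketch builds a select URL that asks for the stored text field via fl and for highlighted snippets via the hl parameters. The field name content, the base URL, and the snippet settings are assumptions for illustration; substitute whatever your schema and installation actually use.

```python
from urllib.parse import urlencode


def build_snippet_query(base_url, query, text_field="content"):
    """Build a select URL that requests the text field and highlighted snippets.

    text_field is an assumed name; it must be stored="true" in schema.xml,
    otherwise Solr has nothing to return or highlight.
    """
    params = {
        "q": query,
        "fl": f"{text_field},*",  # field list, as in the answer above
        "hl": "true",             # switch highlighting on
        "hl.fl": text_field,      # field to take snippets from
        "hl.snippets": 2,         # assumed: at most two snippets per hit
        "hl.fragsize": 150,       # assumed: roughly 150 characters each
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local installation.
url = build_snippet_query("http://localhost:8983/solr", "nutch")
print(url)
```

With highlighting enabled, the snippets come back in a separate highlighting section of the response, keyed by document id, rather than inside each document.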

Related

Apache solr search text search (among multiple fields)

I am studying and getting familiar with the Apache Solr database.
I created a simple document via the admin UI:
{
"company_name":["Rikotech inc"],
"id":"12345",
"full_title":["ft rikotech marinov"],
"_version_":1681062832169287680
}
Here is the document fetched:
But when I type rikotech in the standard query field, I get no result:
Both full_title and company_name are of type text_general .
I watched a YouTube tutorial, and the same query worked for the presenter ;|
What am I missing here?

Solr will not search all fields (under any configuration, really) without specifying the fields. However, the tutorial you watched probably had the default copyField rule enabled where everything is copied into a field named _text_, and then that field is configured as the default search field. This effectively means that everything is being copied into a specific field, and then that (single) field is being searched by default.
In your case it's probably better to use the edismax query parser (check the box in front of edismax in the user interface), and then give full_title company_name as the query fields (qf). That will allow you to adjust the weights between the fields as well. full_title company_name^5 will give 5x as much weight to any hits in company_name compared to those in full_title.
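A minimal sketch of how those request parameters could be assembled in Python, assuming the two field names from the question. The helper name and the weights dictionary are made up for illustration; only defType, q and qf are actual Solr parameters.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query, field_weights):
    """Return request parameters for an edismax query over weighted fields.

    field_weights maps field name -> boost; a boost of 1 is left implicit.
    """
    qf = " ".join(
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in field_weights.items()
    )
    return {"q": user_query, "defType": "edismax", "qf": qf}


params = build_edismax_params("rikotech", {"full_title": 1, "company_name": 5})
print(params["qf"])       # full_title company_name^5
print(urlencode(params))  # query string to append to the core's /select URL
```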

I found the problem.
It was that the fields I want to search through by default were copied to some strange fields like full_title_str, instead of text . This is the correct schema setting:

How can I view actually stored transformed Solr text field values?

When Solr returns a document, the field values match those that were passed to the Solr indexer.
However, especially for TextFields, Solr typically indexes a modified value to which (depending on the definition in schema.xml) various filters have been applied, typically:
conversion to lower case
replacing of synonyms
removal of stopwords
application of stemming
One can see the result of the conversion for specific texts by using Solr Admin > Some core > Analysis. There is a tool called Luke, and there is the LukeRequestHandler, but it seems they only let me view the values passed to Solr, not the transformed variant. One can also look at the index data on disk, but it seems to be stored in a binary format.
However, none of these seem to let me see the actual value as stored.
The reason for asking is that I've created a text field based on a certain filter chain which according to Solr Admin > Analysis transforms the text correctly. However when searching for a specific word in the transformed text it won't find it.
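To make the distinction in this question concrete, here is a toy Python sketch of a filter chain like the one listed above. It is not Solr's implementation: the stopword list, the synonym map and the one-line "stemmer" are all invented for illustration. The point is only that the stored value stays as submitted, while the terms that can actually be searched are the output of the chain.

```python
# Toy stand-ins for the filters configured in schema.xml (all assumed).
STOPWORDS = {"the", "a", "of"}
SYNONYMS = {"tv": "television"}


def toy_stem(token):
    """Very naive stemming: drop a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def analyze(text):
    """Lowercase, map synonyms, drop stopwords, then stem each token."""
    tokens = text.lower().split()
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [toy_stem(t) for t in tokens]


stored_value = "The Prices of a TV"        # what a stored field would return
indexed_terms = analyze(stored_value)      # what a search is matched against
print(indexed_terms)                       # ['price', 'television']
```

In this toy setup a search only matches if the query is run through the same chain, so that "Prices" at query time also becomes "price".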

Elements getting added in Solr index but not able to search elements as desired

I'm working with solr to store web crawling search results to be used in a search engine. The structure of my documents in solr is the following:
{
word: The word received after tokenizing the body obtained from the html.
url: The url where this word was found.
frequency: The no. of times the word was found in the url.
}
When I go to the Solr dashboard on my system, at http://localhost:8983/solr/#/CrawlerSearchResults/query, I am able to find a word, say "Amazon", with the query "word: Amazon", but when I search for Amazon directly I get no results. Could you please help me with this issue?
Image links below.
First case
Second case (No results)
Thanks,
Nilesh.

In your second example, the value is searched against the default search field (since you haven't provided a field name). This is by default a field named _text_.
To support just typing a query into the q parameter without field names, you can either set the default field to search with df=word in your URL, or use the edismax query parser (defType=edismax) together with the qf parameter (query fields). qf accepts multiple fields and lets you give each a weight, but in your case it would just be qf=word.
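The difference between the failing and working requests can be sketched in Python as follows. The select endpoint below is an assumption derived from the core name in the question's dashboard URL, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed select endpoint for the core named in the question.
SELECT_URL = "http://localhost:8983/solr/CrawlerSearchResults/select"


def bare_query_url(term, default_field=None):
    """Build a select URL; df is only sent when a default field is given."""
    params = {"q": term}
    if default_field is not None:
        params["df"] = default_field
    return f"{SELECT_URL}?{urlencode(params)}"


print(bare_query_url("Amazon"))                        # searched in the default field
print(bare_query_url("Amazon", default_field="word"))  # searched in word
print(bare_query_url("word:Amazon"))                   # field named in the query itself
```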
Second - what you're doing seems to replicate what Lucene is doing internally, so I'm not sure why you'd do it this way (each word is what's called a "token", and each count is what's called a term frequency). You can write a custom similarity to add custom scoring based on these parameters.

Solr doesn't index document's content

I have a small problem with Solr.
I've indexed about 1400 documents from an XML file using the post.jar command. In the XML file I included information such as the ID, TITLE and URL of each document.
When I search for a document without naming a field, it finds nothing, but if I specify a field, e.g. TITLE: IEEE, it finds the documents.
So I changed the default search field in schema.xml from text to title. Now it finds documents without my specifying the field.
Why doesn't it find the content? Did I mess up the indexing by changing the XML file?

Do a q=*:*. This fetches 10 (implicit default value for rows) documents with all fields and their values. Is all your data indexed properly?
Then do a q=fieldx:val with some known field and value. Do they show up in the results? Can you do more than string matches? If not, you need to choose data types (and storage/indexing options) in schema. Example: string allows only equality and prefix matches and text allows full text search.
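One way to do the first check is to look at which fields actually come back for q=*:*. The Python sketch below works on an already parsed JSON response; the sample response is invented data in the shape Solr returns with wt=json, and the helper name is hypothetical.

```python
from collections import Counter


def field_coverage(response):
    """Report, per field, how many of the returned documents contain it."""
    docs = response["response"]["docs"]
    counts = Counter(name for doc in docs for name in doc)
    return {name: f"{n}/{len(docs)}" for name, n in counts.items()}


# Invented sample in the shape of a Solr JSON response to q=*:*.
sample = {
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "1", "title": "IEEE paper", "url": "http://example.org/1"},
            {"id": "2", "title": "Another document"},
        ],
    },
}

coverage = field_coverage(sample)
print(coverage)  # {'id': '2/2', 'title': '2/2', 'url': '1/2'}
```

Only stored fields appear in the response, so a field missing here may still be indexed and searchable; a field query such as q=fieldx:val is the way to test that side.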

solr query not returning results

When I enter the search URL
http://localhost:8983/solr/select?qt=standard&rows=10&q=*:*
I get a response with 10 documents.
But when I test a specific query, nothing comes up. For example:
http://localhost:8983/solr/select?qt=standard&rows=10&q=white
Why is that happening? I can clearly see in the results that there is a document containing the word "White". So why doesn't Solr return that document as a result?

q=*:* matches every document regardless of field, which is why you get results back.
q=white searches for white in the default search field only, which is usually text if you have not modified schema.xml:
<defaultSearchField>text</defaultSearchField>
You can change the default field to the field you want to search on.
Or name the field explicitly in the query, e.g. q=title:white to search the title field.
If you want to search multiple fields, you can combine them into one field using copyField, or use the dismax request handler.
