Solr doesn't index document's content

I have a little problem with Solr.
I've indexed about 1,400 documents from an XML file with the post.jar command. Within the XML file I placed some information like the ID, TITLE and URL of each document.
When I search for a document it finds nothing, but if I specify an attribute, e.g. TITLE:IEEE, it finds the documents.
So I changed the default search field in schema.xml from text to title. That way it finds documents without specifying the attribute.
Why doesn't it find the content? Did I mess up the indexing by changing the XML file?

Do a q=*:*. This fetches 10 documents (the implicit default value for rows) with all fields and their values. Is all your data indexed properly?
Then do q=fieldx:val with some known field and value. Do the documents show up in the results? Can you do more than string matches? If not, you need to choose the appropriate data types (and storage/indexing options) in the schema. For example, string allows only equality and prefix matches, while text allows full-text search.
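For example, the two checks above as plain select requests (host, port, core name and the title field are placeholders for your setup):

```
http://localhost:8983/solr/mycore/select?q=*:*
http://localhost:8983/solr/mycore/select?q=title:IEEE
```

If the first returns documents but the second does not, the problem is with how that field is typed or analyzed, not with indexing as a whole.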

Related

Is it possible to use multiple words in a filter query in SOLRJ / SOLR?

I am using SOLRJ (with SOLR 7) and my index features some fields for the document contents named content_eng, content_ita, ...
It also features a field with the full path to the document (processed by a StandardTokenizer and a WordDelimiterGraphFilter).
The user is able to search in the content_xyz fields thanks to these lines:
final SolrQuery query = new SolrQuery();
query.setQuery(searchedText);
// searchFields is a generated String such as "content_eng content_ita"
// (field names separated by spaces)
query.set("qf", searchFields);
Now the user needs to be able to specify some words contained in the path (namely some subdirectories), so I added a filter query:
query.addFilterQuery("full_path_split:" + searchedPath);
If searchedPath contains only a single word from the document path, the document is correctly returned; however, if searchedPath contains several words from the path, the document is not returned. To sum up, the fq only works when searchedPath contains a single word.
For example doc1 is in /home/user/dir1/doc1.txt
If I search for all (* in searchedText) documents that are in user dir (fq=full_path_split%3Adir) doc1.txt is returned.
If I do the same search but for documents that are in user and dir1 (fq=full_path_split%3Auser+dir1), doc1.txt is not returned. I think that is because the fq is parsed as "+full_path_split:user +text:dir1", as debug=query shows. I don't know where text comes from; it may be a default field.
So, is it possible to use a filter query with several words to fulfill my needs?
Any help appreciated.
Your suspicion is correct - the _text_:dir1 part comes from you not providing a field name, and the default field name being used instead.
You can work around this by using the more general edismax (or the older dismax) parser as you're doing in your main query with qf:
fq={!type=edismax qf='full_path_split'}user dir1
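At the raw HTTP level that fq value must be URL-encoded; a small Python sketch of building the parameter string (the fq value mirrors the one above, with full_path_split from the question — the colon, braces and quotes all get percent-encoded, which is also why the %3A in the hand-written URLs matters):

```python
from urllib.parse import urlencode

# Build the query string for a Solr select request whose fq uses
# edismax local params, so every word must match in full_path_split.
params = urlencode({
    "q": "*:*",
    "fq": "{!type=edismax qf='full_path_split'}user dir1",
})
print(params)
# → q=%2A%3A%2A&fq=%7B%21type%3Dedismax+qf%3D%27full_path_split%27%7Duser+dir1
```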

Solr full text search for dynamically added data?

I'm trying to index data without defining a schema.xml. Is there any way to get full-text search without adding a schema.xml or updating the managed schema?
The default operation mode of Solr is schemaless mode. In this mode Solr guesses the field type based on the pattern the data matches the first time a field is seen. If it is numeric the first time, Solr will guess that it is going to be a numeric field every time.
If the field contains text, it will be indexed as a text field, with the processing defined in the default schema applied.
As long as you're using the default configuration, you can submit documents with just the field name and the associated text, then search against the field name as necessary.
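As a sketch (the document and field names here are made up), posting a JSON document like this to a core running the default, schemaless configset:

```json
{
  "id": "doc1",
  "title": "Introduction to IEEE formatting",
  "pages": 12
}
```

Solr would guess a text type for title and a numeric type for pages, after which a query like q=title:formatting works without any schema edits.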

How can I view actually stored transformed Solr text field values?

When Solr returns a document, the field values match those that were passed to the Solr indexer.
However, especially for TextFields, Solr typically indexes a modified value to which (depending on the definition in schema.xml) various filters have been applied, typically:
- conversion to lower case
- replacement of synonyms
- removal of stopwords
- application of stemming
One can see the result of this conversion for specific texts under Solr Admin > Some core > Analysis. There are also the tool Luke and the LukeRequestHandler, but with those it seems I can only view the values passed to Solr, not the transformed variant. One can also look at the index data on disk, but it seems to be stored in a binary format.
However, none of these seem to let me see the actual value as stored in the index.
The reason for asking is that I've created a text field with a certain filter chain which, according to Solr Admin > Analysis, transforms the text correctly. However, when I search for a specific word in the transformed text, it is not found.
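As a side note, the Analysis screen mentioned above is backed by a request handler that can be called directly; a sketch (host, core name, field name and sample value are placeholders):

```
http://localhost:8983/solr/mycore/analysis/field?analysis.fieldname=content_eng&analysis.fieldvalue=Running%20dogs&wt=json
```

The response lists the tokens after each stage of the field's analysis chain, i.e. the transformed form that actually ends up in the index.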

Solr Text field and String field - different search behaviour

I am working on Solr 4+.
I have several fields in my Solr schema with different field types.
Does searching on a text field differ from searching on a string field?
I am asking because I am trying to search on a string field (a copy field of a few facet fields) and it does not work as expected. The destination string field is both indexed and stored.
However, when I change the destination to a text field (only indexed), it works fine.
Can you suggest why this happens? What exactly is the difference between text and string fields in Solr with respect to searching?
TextFields usually have a tokenizer and text analysis attached, meaning that the indexed content is broken into separate tokens where there is no need for an exact match - each word / token can be matched separately to decide if the whole document should be included in the response.
StrFields cannot have any tokenization or analysis / filters applied, and will only give results for exact matches. If you need a StrField with analysis or filters applied, you can implement this using a TextField and a KeywordTokenizer.
A general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from case-insensitive "stopwords.txt" (empty by default), and down cases. At query time only, it also applies synonyms.
The StrField type is not analyzed, but indexed/stored verbatim.
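In schema.xml the two cases sketched above might look like this (the type names are illustrative; the KeywordTokenizer variant keeps the whole value as one token while still allowing filters such as lower-casing):

```xml
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Exact-match like a StrField, but analyzed: one token for the
     whole value, then lower-cased. -->
<fieldType name="string_lowercase" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```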

Get text snippet from search index generated by solr and nutch

I have just configured Nutch and Solr to successfully crawl and index text on a web site, following the getting-started tutorials. Now I am trying to build a search page by modifying the example Velocity templates.
Now to my question. How can I tell solr to provide a relevant text snippet of the content of the hits? I only get the following fields associated with each hit:
score, boost, digest, id, segment, title, date, tstamp and url.
The content really is indexed, because I can search for words that I know appear only in the full text, but I still don't get the full text back with the hit.
Don't forget: indexed is not the same as stored.
You can search for words in a document if all fields are indexed, even if no field is stored.
To get the content of a specific field back, it must also have stored=true in schema.xml.
If your fulltext field is stored, then probably the default field-list settings just do not include it.
You can add it by using the fl parameter:
http://<solr-url>:port/select/?......&fl=mytext,*
...this example assumes your full text is stored in a field called mytext.
Finally, if you would like only a snippet of the text containing the searched words (not the whole text), look at the highlighting component of Solr/Lucene.
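A minimal highlighting request might look like this (mytext stands in for your stored full-text field, as above):

```
http://<solr-url>:port/select/?q=myword&hl=true&hl.fl=mytext&hl.snippets=2
```

The highlighted snippets come back in a separate highlighting section of the response, keyed by document id; the field being highlighted must be stored.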
