Solr alphabetical search

I need to search data in Solr alphabetically, e.g. among the values "aaa,aba,abc,bba". This is my query: "q=&fields=viewername&viewername=a*".
I am not getting the proper results: I get every document that contains "a".
Ex Results:
1.abcd-terstttttttttttttttt
2.aaab
3.Iraq: India wastes Army's Special Forces resource
But I need only the documents that start with "a".
The dynamicField in schema.xml is:
<dynamicField name="*_string" type="lowerstring" indexed="true" stored="false" multiValued="false"/>
If I change the type from "lowerstring" to "string" and re-index, I get correct results. But I cannot re-index all the records, because there are hundreds of thousands of them.

You get the correct result with the string fieldType because a field of type string is not tokenized: the whole input value is indexed as one single term.
When you use any other fieldType that has a tokenizer and filters, the input is split into tokens according to that analysis chain, so a prefix query such as a* matches any document that has a token starting with "a".
You can check how a value is analysed on the Analysis page of the Solr admin UI.
Based on what you see there, you can choose the right fieldType for the field.
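To make the difference concrete, here is a minimal Python sketch using the example results from the question. It is only a toy model: lowercasing plus whitespace splitting is an assumption standing in for whatever analysis chain the "lowerstring" type really uses, and the function names are made up for illustration.

```python
# Toy model only: whitespace splitting and lowercasing stand in for the
# real (unknown) analysis chain of the "lowerstring" fieldType.
docs = [
    "abcd-terstttttttttttttttt",
    "aaab",
    "Iraq: India wastes Army's Special Forces resource",
]

def tokenized_terms(value):
    """Terms an analysed field would index: one per lowercased word."""
    return value.lower().split()

def string_terms(value):
    """Terms a plain string field would index: the whole value as one term."""
    return [value]

def prefix_search(prefix, term_fn):
    """Return documents with at least one indexed term starting with prefix."""
    return [d for d in docs if any(t.startswith(prefix) for t in term_fn(d))]

tokenized_hits = prefix_search("a", tokenized_terms)
string_hits = prefix_search("a", string_terms)
print(tokenized_hits)  # all three documents ("army's" matches in the third)
print(string_hits)     # only the two values that really start with "a"
```

In this model the third document matches the tokenized field only because one of its words starts with "a", which mirrors the behaviour described in the question.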

Related

Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It is for an older version of Solr, when the managed schema was not available. Before I fully understand what it is doing, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are there two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some query). But even if I define the fields as he said, I cannot see them as keys in the JSON search results.
The _text_ field seems a concatenation of other fields, does it contain the full text? Though it does not seem to be accessible by default.
To be brief, using The Elements of Statistical Learning as an example, how do I highlight the relevant text for the query "SVM"? And if I change the file name to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get to the questions, here is a brief summary of how Solr works. At its core Solr uses Lucene, which, simply put, is a matching engine. It builds an inverted index over the documents: for each term it keeps a list of the documents containing it, which is what makes lookups so fast. Now to your questions:
Solr itself does not convert your PDF to text; the update processor configured in the handler does that. This can be configured in solrconfig.xml, or you can write your own handler.
Coming back to why there are two fields: simply put, the first one (content) is a stored field that keeps the data as it is. The second one (text) is a copyField target, which receives data for each document according to the configuration in schema.xml.
We do this because we can then choose the indexing strategy. For example, we can add a lowercase filter factory to the text field so that everything is indexed in lower case, and searches for "Sam" and "sam" return the same results. Or we can remove commonly occurring words such as "a" and "the", which would otherwise increase the index size unnecessarily. An index uses a lot of memory when you are dealing with millions of records, so you want to be careful about which fields you index.
The "text" field is a copyField target: data from the fields named in the schema is copied into it. When searching in general, you then do not need to fire a separate query for each field, since everything is copied into "text". This is the reason it is multiValued: it can store an array of values. Content is stored and text is not, and the opposite holds for indexed, because when you return results to the end user you want to show what was originally saved, not the stripped-down data produced by the filters applied to the text field (stop-word removal, case folding, stemming, etc.).
This is the reason you do not see the "text" field in the search results: it is not stored and is used only by Solr for searching.
For highlighting see this.
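As a rough illustration of what a highlighting request can look like, the Python sketch below only builds the query URL. The host, port and core name ("mycore") are hypothetical, and it assumes the stored field is called "content" as in the question. hl, hl.fl and hl.snippets are standard Solr highlighting parameters.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; adjust to your own installation.
base_url = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "SVM",          # searched in the default field
    "hl": "true",        # turn highlighting on
    "hl.fl": "content",  # highlight inside the stored "content" field
    "hl.snippets": 3,    # ask for up to three snippets per document
    "fl": "id",          # return only the id next to the highlights
}
url = base_url + "?" + urlencode(params)
print(url)
```

Highlighting needs the original text, so the field named in hl.fl has to be stored.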
For more, the blogs by Yonik and Joel are great.
Hope this helps. :)

Solr query with wildcard searches (*)

The field is defined in schema.xml as:
<field name="typeDesc" type="text_general" indexed="true" stored="true"/>
The typeDesc field stores values like "公立", "公立,三甲" and "公立,二甲".
The question is: when I query typeDesc:*三甲*, nothing is found, but when I query typeDesc:*公立* or typeDesc:*三* or typeDesc:*甲* or typeDesc:三甲, each of them finds results like "公立,三甲". I want to know the reason.
While I'm not too familiar with word breaking rules for kanji, I'm going to guess that the reason is that when you're doing wildcard searches, analysis for the field isn't performed. If 三 and 甲 are split into separate tokens, the wild card match will not find any token matching your search.
You can confirm this by using the analysis tab of the admin page to see which tokens an indexed term is being broken into.
Possible solutions would be to index the terms in a single string field as well and do wildcard matches against that, or to use a KeywordTokenizer for your text field if you need further processing before storing the token (the keyword tokenizer keeps the text as one single token). You could also use an n-gram filter (NGramFilterFactory) and drop the wildcards.
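The guess above can be sketched in Python. The token list is an assumption chosen to be consistent with the behaviour reported in the question; the real split should be checked in the analysis tab. fnmatchcase is used here only as a stand-in for Solr's wildcard matching against individual indexed terms.

```python
from fnmatch import fnmatchcase

# Assumed tokens for the indexed value "公立,三甲" (not verified against Solr).
tokens = ["公立", "三", "甲"]

def wildcard_matches(pattern):
    """A wildcard query matches only if one single token matches the pattern."""
    return any(fnmatchcase(token, pattern) for token in tokens)

results = {p: wildcard_matches(p) for p in ["*三甲*", "*公立*", "*三*", "*甲*"]}
for pattern, matched in results.items():
    print(pattern, matched)
```

Under this assumption no single token contains both 三 and 甲, so *三甲* finds nothing, while the non-wildcard query 三甲 goes through analysis and is matched token by token.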

Solr: should I index large fields?

After a webpage has been crawled with Apache Nutch 2.2.1, the contents of that page are pushed to Solr. Solr stores the contents of entire webpages in the "content" field, so the data in that field is usually very large. So here are my concerns:
Should I index the "content" field in Solr? Indexing such a large field will increase index size. In Solr's schema.xml file I found the following recommendation:
NOTE: This field is not indexed by default, since it is also copied to "text"
using copyField below. This is to save space. Use this field for returning and
highlighting document content. Use the "text" field to search the content.
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
However, if I left this field unindexed, would it increase search response time significantly?
I'd greatly appreciate any information that will help me to understand benefits of not indexing this large field or benefits of indexing it.
If you're going to search against the field, it needs to be indexed. The example in the schema assumes that since you're going to search against text instead of content, there is no need to create the index twice. They do, however, want to keep the content by itself, so that it can be displayed in the application or used for highlighting (which requires the whole field content to be available).
If you don't see any situation where you'll need the field for querying, there is no need to create an index for the field.

Solr index vs stored

I am a little confused about the behaviour of the indexed and stored attributes of Solr fields.
For example, if I have the following in schema.xml:
<field name="test1" type="text" indexed="false"
stored="false" required="false" />
Will the field test1 not be stored in the Solr document even if I create a document with that field, set a value for it, and commit the document to Solr? Since I have the stored=false attribute, does it mean that the value of the field is lost and not persisted in Solr?
That is correct. Typically you will want your field to be either indexed or stored or both. If you set both to false, that field will not be available in your Solr docs (either for searching or for displaying). See Alexandre's answer for the special cases when you will want to set both to false.
As stated here: indexed=true makes a field searchable (and sortable and facetable). For example, if you have a field named test1 with indexed=true, then you can search it with q=test1:foo, where foo is the value you are searching for. If indexed=false for test1, that query will return no results, even if you have a document in Solr whose test1 value is foo.
stored=true means you can retrieve the field when you search. To explicitly retrieve the value of a field in your query, use the fl param, as in fl=test1 (the default is fl=*, meaning retrieve all stored fields). The value is returned only if stored=true for test1; otherwise it is not.
The main point of having both set to false is to explicitly skip that particular field.
For example, if you have a storing/indexing dynamicField mapping and you want to ignore one particular name that would otherwise fall under dynamicField's pattern.
Alternatively you could use dynamicField to ignore a whole set of fields with same prefix/suffix that comes from a 3rd party. For example, Tika will send you a whole bunch of metadata fields which you may just want to ignore. See this defined in Solr's example schema.xml and used in solrconfig.xml
In the later versions of Solr, you could also use IgnoreFieldUpdateProcessorFactory (see full list for others) instead, which will get rid of those fields even earlier in the indexing process.
Quoting from this response in the Solr's mail thread:
"indexed" and "stored" are independent, orthogonal attributes - you can use
any of the four combinations of true and false. "indexed" is used for search
or query, the "lookup" portion of processing a query request. Once the
search/query/lookup is complete and a set of documents is selected, "stored"
is the set of fields whose values are available for display or return with
the Solr response.
Part of the reason for the separation is that Solr/Lucene "analyzes" or
transforms the input data into a more efficient form for faster and more
relevant search/lookup. Unfortunately, that analyzed/transformed data is
frequently no longer suitable for display and human consumption. In other
words the analysis/transformation is not bidirectional/reversible. Setting
"stored=true" guarantees that the original data can be retrieved in its
original form.
If both are false, you lose the data in that field. If only indexed is true, the data is searchable but cannot be displayed. If only stored is true, you cannot search on that field but it can be displayed (in this case you can write a copyField rule to copy the value from that field into the default search field). If both are true, you can search and display.
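The four combinations can be illustrated with a small toy model in Python. This is not how Solr is implemented; FieldDef, ToyCore and the field names are all hypothetical, and the "analysis" is just lowercasing and whitespace splitting.

```python
from dataclasses import dataclass

@dataclass
class FieldDef:
    indexed: bool
    stored: bool

class ToyCore:
    """Toy stand-in for a Solr core: one inverted index plus stored values."""

    def __init__(self, schema):
        self.schema = schema
        self.index = {}   # (field, term) -> set of doc ids
        self.stored = {}  # doc id -> {field: original value}

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            fdef = self.schema[field]
            if fdef.indexed:  # searchable: terms go into the inverted index
                for term in value.lower().split():
                    self.index.setdefault((field, term), set()).add(doc_id)
            if fdef.stored:   # retrievable: keep the original value
                self.stored[doc_id][field] = value

    def search(self, field, term):
        ids = sorted(self.index.get((field, term.lower()), set()))
        return [(i, self.stored[i]) for i in ids]

schema = {
    "both": FieldDef(indexed=True, stored=True),
    "search_only": FieldDef(indexed=True, stored=False),
    "display_only": FieldDef(indexed=False, stored=True),
    "ignored": FieldDef(indexed=False, stored=False),
}
core = ToyCore(schema)
core.add(1, {"both": "Tennis news", "search_only": "Tennis racket",
             "display_only": "Tennis shoes", "ignored": "Tennis ball"})

hits = core.search("search_only", "tennis")
print(hits)                                    # found, but search_only is not returned
print(core.search("display_only", "tennis"))   # no hits: the field is not indexed
print(core.search("ignored", "tennis"))        # no hits, and the value is gone
```

The document is found through search_only, yet the returned fields contain only "both" and "display_only", which is the split between lookup and display described above.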
indexed=true means that this field can be used in a search.
For example, if I define the item field as follows and try to use the field in a search
<field name="item" type="text_general" uninvertible="true" indexed="false" stored="true"/>
fq=item:"Tennis" will produce an error.
stored=true means that this field can be retrieved in the list of fields returned by a query.
For example, if the item field is defined as follows
<field name="item" type="text_general" uninvertible="true" indexed="true" stored="false"/>
you will be able to search with fq=item:"Tennis" correctly, but the item field will not be returned in the results.
Regards

Solr Search not working after dataimport successful

I am new to Solr. I have tried DataImport using an Oracle database. The data is imported successfully. When I try to search with the query:
qt=standard
q=*
I get good results. But when I do a specific search, the results are empty showing no documents. The logger is empty and there are NO errors displayed.
OK, I got it.
I observed that searches on the pre-defined fields of schema.xml work fine, but searches on fields I defined myself still returned nothing.
Then I looked at the "/select" request handler in solrconfig.xml. There is a line
<str name="df">text</str>
which says that "text" is the field that is searched by default. But then how does it search the other fields?
The answer lies in the
<copyField>
tag in schema.xml. The fields present by default are copied into "text", which makes them searchable. So if you want your own field to be searchable, define it and add a copyField rule for it. ;)
TLDR Version: Define your fields as type="text" to start off. If you have a field called "product", add <field name="product" type="text" indexed="true" stored="true" /> to the default schema.xml inside the <fields> tag and you should be done. To search using the select request-handler, use q=<field_name>:<text_to_look_for> or q=*:* to show all documents.
There are a few mistakes you're making here. I'll be explaining using the 'select' request handler.
The format for a query is ?q=<field_name>:<text_to_look_for>. So if you want to return all the values matching all the fields, you'd say q=*:*
And if you were to look for the word "iPod" in the field "product" your query would be q=product:iPod
Another thing to keep in mind: if you define the field product in schema.xml as type="string", which maps to class="solr.StrField", the query (<text_to_look_for>) must match the value in the index exactly, since Solr doesn't tokenize a StrField. That is, ipod will not return results if your index holds the value as iPod. If you need it to match anyway, use type="text" in schema.xml (the fieldType definition is already present in the default schema.xml). The "text" fieldType has an analyzer made up of a tokenizer and several filters: the tokenizer splits the field into words and indexes them, and one of the filters ignores case, so a search for a particular word, say "ipod", would match the value "iPod 16GB White".
Regarding your own answer, the <str name="df">text</str> specifies the default field to search in, i.e, if you just said q=iPod, it would look in this field. The objective of this field called text is to hold all the other fields in the document, so that you could just search in this field and know that some or the other field in this document would match your query, thereby you wouldn't need to search in a specific field if you don't know what field you're expecting the value to be in.
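For illustration, the three query forms discussed above can be built like this in Python. The base URL and core name are hypothetical; only the q parameter and the field:value syntax come from the answer.

```python
from urllib.parse import urlencode

def select_url(base, query):
    """Build a /select URL with a properly URL-encoded q parameter."""
    return base + "/select?" + urlencode({"q": query})

# Hypothetical host and core name.
base = "http://localhost:8983/solr/mycore"

all_docs = select_url(base, "*:*")               # every document
product_ipod = select_url(base, "product:iPod")  # search a specific field
default_field = select_url(base, "iPod")         # falls back to the df field

print(all_docs)
print(product_ipod)
print(default_field)
```

Note that urlencode percent-encodes the colon and asterisks, so the URLs look different from the raw query strings even though Solr receives the same q value.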
