I have a Solr 4.4.0 core configured that contains about 630k documents with an original size of about 10 GB. Each of the fields gets copied to the text field for purposes of queries and highlighting. When I execute a search without highlight, the results come back in about 100 milliseconds, but when highlighting is turned on, the same query takes 10-11 seconds. I also noticed that subsequent queries for the same terms continued to take about the same 10-11 seconds.
My initial configuration of the field was as follows
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="true"
omitNorms="true"
termPositions="true"
termVectors="true"
termOffsets="true" />
The query that is sent is similar to the following
http://solrtest:8983/solr/Incidents/select?q=error+code&fl=id&wt=json&indent=true&hl=true&hl.useFastVectorHighlighter=true
All my research seemed to provide no clue as to why the highlight performance was so bad. On a whim, I decided to see if the omitNorms=true attribute could have an effect: I modified the text field, wiped out the data, and reloaded from scratch.
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="true"
termPositions="true"
termVectors="true"
termOffsets="true" />
Oddly enough, this seemed to fix things. The initial query with highlighting took 2-3 seconds with subsequent queries taking less than 100 milliseconds.
However, because we want omitNorms=true in place, my permanent solution was to have two copies of the "text" field, one with the attribute and one without. The idea was to perform queries against one field and highlighting against the other. So now the schema looks like
<field name="text" type="text_general" indexed="true" stored="true"
multiValued="true"
omitNorms="true"
termPositions="true"
termVectors="true"
termOffsets="true" />
<field name="text2" type="text_general" indexed="true" stored="true"
multiValued="true"
termPositions="true"
termVectors="true"
termOffsets="true" />
And the query is as follows
http://solrtest:8983/solr/Incidents/select?q=error+code&fl=id&wt=json&indent=true&hl=true&hl.fl=text2&hl.useFastVectorHighlighter=true
Again, the data was cleared and reloaded with the same 630k documents, but this time the index size is about 17 GB. (As expected, since the contents of the "text" field are duplicated.)
The problem is that the performance numbers are back to the original 10-11 seconds each run. Either the first removal of omitNorms was a fluke or something else is going on. I have no idea what...
Using jVisualVM to capture a CPU sample shows the following two methods using most of the CPU
org.apache.lucene.search.vectorhighlight.FieldPhraseList.<init>() 8202 ms (72.6%)
org.eclipse.jetty.util.BlockingArrayQueue.poll() 1902 ms (16.8%)
I have seen the init method as low as 54% and the poll number as high as 30%.
Any ideas? Any other places I can look to track down the bottleneck?
Thanks
Update
I have done a bunch of testing with the same dataset but different configurations and here is what I have found...although I do not understand my findings.
Speedy highlighting performance requires that omitNorms not be set to true. (I have no idea what omitNorms and highlighting have to do with one another.)
However, this only seems to be true if both the query and highlighting are executed against the same field (i.e. df = hl.fl). (Again, no idea why...)
And, another however: it only holds if done against the default text field that exists in the schema.
Here is how I tested -->
Test was against about 525,000 documents
Almost all of the fields were copied to the multi-valued text field
In some tests, almost all of the fields were also copied to a second multi-valued text2 field (this field was identical to text except it had the opposite omitNorms setting)
Each time the configuration was changed, the Solr instance was stopped, the data folder was deleted, and the instance was started back up
What I found -->
When just the text field was used and omitNorms = true was present, performance was bad (10 second response time)
When just the text field was used and omitNorms = true was not present, performance was great (sub-second response times)
When text did not have omitNorms = true and text2 did, queries with highlighting against text returned in sub-second times; all other combinations resulted in 10-30 second response times.
When text did have omitNorms = true and text2 did not, all combinations of queries with highlighting returned in 7-10 seconds.
I am soooo confused....
I know that this is a bit dated, but I've run into the same issue and wanted to chime in with our approach.
We are indexing text from a bunch of binary docs and need Solr to maintain some metadata about each document as well as its text. Users need to search for docs based on metadata, run full-text searches within the content, and see highlights and snippets of relevant content. The performance problem gets worse the further into each document the content for highlighting/snippets is located (e.g. page 50 instead of page 2).
Due to poor performance of highlighting, we had to break up each document into multiple solr records. Depending on the length of the content field, we will chop it up into smaller chunks, copy the metadata attributes to each record and assign a per-document unique id to each record. Then at query time, we will search the content field of all these records and group by that unique field we assigned. Since the content field is smaller, Solr will not have to go deep into each content field, plus from an end user standpoint, this is completely transparent; although it does add a bit of indexing overhead for us.
Additionally, if you choose this approach, you may want to consider overlapping the sections a little bit between each "sub document" to ensure that a phrase match at the boundary of two sections will still get properly returned.
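A minimal sketch of that chunking approach (the names chunk_content, to_solr_docs, and group_id are our own conventions, not anything Solr requires):

```python
def chunk_content(text, chunk_size=5000, overlap=200):
    """Split a document's content into overlapping chunks so each
    Solr record stays small enough for fast highlighting."""
    if len(text) <= chunk_size:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap  # step back so chunks overlap a little

    return chunks

def to_solr_docs(doc_id, metadata, text):
    """Build one Solr record per chunk, copying the metadata to each
    record and tagging all of them with the shared per-document id
    that query-time grouping runs on."""
    return [
        dict(metadata, id=f"{doc_id}_{i}", group_id=doc_id, content=chunk)
        for i, chunk in enumerate(chunk_content(text))
    ]
```

At query time you would then search the content field and group results on group_id, so the chunking stays invisible to the end user.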
Hope it helps.
Related
I am new to Solr and I need to implement a full-text search of some PDF files. The indexing part works out of the box by using bin/post. I can see search results in the admin UI given some queries, though without the matched texts and the context.
Now I am reading this post for the highlighting part. It is for an older version of Solr, when the managed schema was not available. Before fully understanding what it is doing, I have some questions:
He defined two fields:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="false"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
But why are two fields needed? Can I define a field
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
to capture the full text?
How are the fields filled? I don't see relevant information in TikaEntityProcessor's documentation. The current text extractor should already be Tika (I can see
"x_parsed_by":
["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.pdf.PDFParser"]
in the returned JSON of some queries). But even if I define the fields as he said, I cannot see them as keys in the JSON of the search results.
The _text_ field seems to be a concatenation of other fields; does it contain the full text? It does not seem to be accessible by default, though.
To be brief, using The Elements of Statistical Learning as an example, how do I highlight the relevant texts for the query "SVM"? And if I change the file name to "The Elements of Statistical Learning - Trevor Hastie.pdf" and post it, how do I highlight "Trevor Hastie" for the query "id:Trevor Hastie"?
Before I get started on the questions, let me briefly explain how Solr works. At its core, Solr uses Lucene, which, simply put, is a matching engine: it creates an inverted index of the documents and their terms, meaning that for each term it keeps a list of the documents containing it, and that is what makes it so fast.
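As a toy illustration of that inverted index (hypothetical Python, not Solr's actual implementation):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # crude tokenizer + lowercasing
            index[term].add(doc_id)
    return index

docs = {1: "Sam reads a book", 2: "the book of Sam", 3: "another text"}
index = build_inverted_index(docs)
# Looking up the documents for a term is now a single dictionary
# access, independent of how many documents there are.
```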
Solr does not convert your PDF to text itself; it is the update processor configured in the handler that does it. This can be configured in solrconfig.xml, or you can write your own handler here.
Coming back to why there are two fields: to put it simply, the first one (content) is a stored field which stores the data as-is. The second one is a copyField which copies the data for each document as per the configuration in schema.xml.
We do this because we can then choose the indexing strategy; for example, we add a lowercase filter factory to the text field so that everything is indexed in lower case, and then "Sam" and "sam" return the same results when searched. Or we remove certain commonly occurring words such as "a" and "the" which would unnecessarily increase your index size. Indexing uses a lot of memory when you are dealing with millions of records, so you want to be careful about which fields to index to better utilise the resources.
The field "text" is a copyField which copies data from certain fields, as specified in the schema, into the text field. When searching, one then does not need to fire multiple queries, one per field: everything is copied into the "text" field and you get the result from there. This is the reason it is multiValued, as it can store an array of data. Content is a stored field and text is not, and the opposite holds for indexed, because when you return the result to the end user you show them whatever you saved, not the stripped-down data produced for the text field by applying multiple filters (such as removing stop words, applying case filters, stemming, etc.).
This is the reason you do not see the "text" field in the search results: it is used internally by Solr.
For highlighting see this.
For more, there are some great blogs by yonik and joel.
Hope this helps. :)
Given: a list of consultants with a list of intervals when they are NOT available:
<consultant>
<id>1</id>
<not-available>
<interval><from>2013-01-01</from><to>2013-01-10</to></interval>
<interval><from>2013-01-20</from><to>2013-01-30</to></interval>
...
</not-available>
</consultant>
...
I'd like to search for consultants that are available (!) for at least X days in a specific interval from STARTDATE to ENDDATE.
Example: Show me all consultants that are available for at least 5 days in the range 2013-01-01 - 2013-02-01 (this would match consultant 1 because he is free from 2013-01-11 to 2013-01-19).
Question 1: What should my Solr document look like?
Question 2: What does the query have to look like?
As a general advice: precalculate as much as you can, store the data that you are querying for rather than the data you are getting as input.
Also, use several indexes based on different entities, if you have the liberty to do so and if the queries would become simpler and more straightforward.
Ok, generalities aside and on to your question.
From your example I take it that you currently store in the index when a consultant is not available, probably because that is what you get as input. But what you want to query is when they are available. So you should think about storing the availability rather than the non-availability.
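A sketch of that precalculation, assuming the not-available intervals come in as (from, to) date pairs, inclusive on both ends, and you pick a bounded planning horizon (function and field names here are illustrative):

```python
from datetime import date, timedelta

def available_intervals(not_available, horizon_start, horizon_end):
    """Invert a list of not-available (from, to) intervals into the
    available intervals within [horizon_start, horizon_end], each
    returned with its precalculated length in days."""
    busy = sorted(not_available)
    free = []
    cursor = horizon_start
    for start, end in busy:
        if start > cursor:
            # gap between the cursor and the next busy interval is free
            free.append((cursor, start - timedelta(days=1)))
        cursor = max(cursor, end + timedelta(days=1))
    if cursor <= horizon_end:
        free.append((cursor, horizon_end))
    # attach the inclusive length in days, ready to index as interval_length
    return [(s, e, (e - s).days + 1) for s, e in free]
```

For consultant 1 from the question (busy 2013-01-01 to 2013-01-10 and 2013-01-20 to 2013-01-30, horizon 2013-01-01 to 2013-02-01) this yields the free interval 2013-01-11 to 2013-01-19 of length 9, matching the example in the question.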
EDIT:
The most straightforward way to query this is to model the intervals as entities, so that you do not have to resort to special Solr features for querying the start and the end of an interval on two multi-valued fields.
Once you have stored the availability intervals you can also precalculate and store their lengths:
<!-- id of the interval -->
<field name="id" type="int" indexed="true" stored="true" multiValued="false" />
<field name="consultant_id" type="int" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_start" type="date" indexed="true" stored="true" multiValued="false" />
<!-- make sure that the time is set to 00:00:00 (*/DAY) -->
<field name="interval_end" type="date" indexed="true" stored="true" multiValued="false" />
<field name="interval_length" type="int" indexed="true" stored="true" multiValued="false" />
Your query:
(1.) Optionally, retrieve all intervals that have at least the requested length:
fq=interval_length:[5 TO *]
This is an optional step. You might want to benchmark whether it improves the query performance.
Additionally, you could also filter on certain consultant_ids.
(2.) The essential query is for the interval (use q.alt in case of dismax handler):
q=interval_start:[2013-01-01T00:00:00.000Z TO 2013-02-01T00:00:00.000Z-5DAYS]
interval_end:[2013-01-01T00:00:00.000Z+5DAYS TO 2013-02-01T00:00:00.000Z]
(added linebreak for readability, the two components of the query should be separated by regular space)
Make sure that you always set the time to the same value. Best is 00:00:00 because that is what /DAY does: http://lucene.apache.org/solr/4_4_0/solr-core/org/apache/solr/util/DateMathParser.html .
The less different values the better the caching.
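Putting the pieces together, the parameters can be assembled programmatically; this sketch uses the field names from the schema above and standard Solr date-math suffixes (+5DAYS/-5DAYS), and writes an explicit AND so the query does not depend on the default operator:

```python
def availability_query(start, end, min_days):
    """Build the q and fq parameters for the interval search: the
    interval must start early enough and end late enough to leave at
    least min_days inside [start, end]."""
    # start/end must be ISO timestamps at midnight, e.g. "2013-01-01T00:00:00Z",
    # so the date-math arithmetic lines up with /DAY-rounded indexed values
    fq = f"interval_length:[{min_days} TO *]"
    q = (f"interval_start:[{start} TO {end}-{min_days}DAYS] "
         f"AND interval_end:[{start}+{min_days}DAYS TO {end}]")
    return {"q": q, "fq": fq}
```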
More info:
http://wiki.apache.org/solr/SolrQuerySyntax - Solr Range Query
http://wiki.apache.org/solr/SolrCaching#filterCache - caching of fq filter results
EDIT:
More info on q and fq parameters:
http://wiki.apache.org/solr/CommonQueryParameters
They are handled differently when it comes to caching. That's why I added the other link (see above), in the first place. Use fq for filters that you expect to see often in your queries. You can combine multiple fq parameters while you can only specify q once per request.
How can I "use several indexes based on different entities"?
Have a look at the multicore feature: http://wiki.apache.org/solr/CoreAdmin
Would it be overkill to save for each available day: date;num_of_days_to_end_of_interval - should make querying much simpler?
Depends a bit on how much more data you are expecting in that case. I'm also not exactly sure that it would really help you for the query you posted. The date range queries are very flexible and fast. You don't need to avoid them. Just make sure you specify the time as broad as you can to allow for caching.
We are trying to execute a solr based search on the content of text files and the requirement is trying to return all the hits of the search term in each document along with the highlighted text around the hit.
We are able to return the number of documents found, along with the highlighted snippet around the first hit of the search term in each document. But it does not return the list of highlights across the document wherever the search term is found. We can get the term frequency reported as the correct number, but not the snippets around all these occurrences.
Relevant portion of the solr schema:
<field name="Content" type="text_general" indexed="false" stored="true" required="true"/>
<field name="ContentSearch" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="Content" dest="ContentSearch"/>
For example, if we have a.txt and b.pdf which are indexed, and the search term "case" exists in both documents multiple times (a.txt - 7 hits, b.pdf - 10 hits), then when executing a search for "case" against both documents, we get two documents returned with the correct term frequencies (7 and 10), but the highlight list contains only one record, which corresponds to the first hit in each file.
Does this have something to do with using TermVectorComponent for the content field? I have read about it but could not quite make out how the TVC works and in which situations it is helpful.
This is due to the default settings for highlighting. In order to achieve what you want, I would recommend changing the snippets and maxAnalyzedChars options. By default, snippets is set to return only one snippet, and maxAnalyzedChars looks at only the first 51200 characters. I would set snippets=20 (or some value larger than the expected max number of snippets) and maxAnalyzedChars=100000 (or some other value larger than the longest field value). This will ensure that the entire value is analyzed and that all highlights are returned.
Note: You may also need to work with the fragsize setting to get the appropriate size for the snippets (to include the line before and after the highlighted word), as the default fragment size is 100 characters.
Within SolrNet you would need to set the Snippets and MaxAnalyzedChars properties on the HighlightingParameters you are passing to your query, similar to the following:
var results = solr.Query(new SolrQueryByField("ContentSearch", "case"),
new QueryOptions {
Highlight = new HighlightingParameters {
Fields = new[] {"ContentSearch"},
Snippets = 20,
MaxAnalyzedChars = 100000,
}
});
I'm having a hard time pinning down why my Solr date range search is not working. I am building on an existing working search, adding two new fields to assist with accommodation search.
I add the following two fields to the schema - The first is effectively an array of dates, and the second is a single value:
<field name="available_checkin_dates" type="date" indexed="true" stored="false" multiValued="true" />
<field name="available_unit_count" type="int" indexed="true" stored="false" />
I verified that the index document was created and sent to Solr with the two fields populated, but the following search terms yield no results:
* AND available_checkin_dates:[* TO NOW]
* AND available_checkin_dates:[NOW TO *]
* AND available_checkin_dates:"2012-08-31T00:00:00.0000000Z"
* AND available_checkin_dates:"2012-08-31T00:00:00Z"
* AND available_unit_count:1
* AND available_unit_count:*
Either I'm using the wrong syntax, or the documents didn't get indexed. I'm having a hard time reading the catalina logs, and I can't find a tool that inspects the actual indexed documents.
Any ideas on how to help me nail this one down? I'm a relative Solr newbie.
Never mind; there was a problem with the auto-commit settings, so the buffer wasn't getting flushed. Documents were being added with commit set to false, but the auto-commit settings weren't in place to flush once the number of uncommitted documents reached a certain threshold.
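For anyone hitting the same thing, this is roughly what the missing piece looks like in solrconfig.xml (the values here are illustrative; tune them for your indexing load):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Commit automatically so buffered documents get flushed even
       when clients add them with commit=false -->
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many uncommitted docs -->
    <maxTime>60000</maxTime> <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```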
I'm new to Solr, so I really need someone to help me understand the fields below. What is the meaning of a field that has stored=false and indexed=false? See the two examples below; what are the differences? If the field is not stored, what is the use of it?
<field name="test1" type="text" indexed="false"
stored="false" required="false" />
How about this one?
<field name="test2" type="text" indexed="false"
stored="false" required="false" multiValued="true" />
Thanks a lot!
You can find the best explanation in the Solr wiki.
If you want a field to be searchable then you should set indexed attribute to true.
indexed=true : True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable, sortable, and facetable.
If you want to retrieve the field at the search result then you should set stored attribute to true.
stored=true : True if the value of the field should be retrievable during a search
If you want to store multiple values in a single field then you should set the multiValued attribute to true.
multivalued=true : True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document
It's easier than it seems:
indexed: you can search on it
stored: you can show it within your search results
In fact, there might be fields that you don't use for search, but you just want to show them within the results. On the other hand, there might be fields that you want to show within the results but you don't want to use for search. The stored=false is important when you don't need to show a certain field, since it improves performance. If you make all your fields stored and you have a lot of fields, Solr can become slow returning the results.
Of course, having both false doesn't make a lot of sense, since the field would become totally useless.
The only difference between your two fields is multiValued=true, which means that the second field can contain multiple values. That means that the content of the field is not just a single text entry but a list of text entries.