How to get paragraph result from solr keyword search after using tika to index some documents? - solr

I use Tika to index documents. I then want to get the whole paragraph, from paragraph start to paragraph end, that contains the keywords. I tried to use HighlightFragsize, but it does not work. For example, there is a document like the one below:
When I was very small, my parents took me to many places, because they wanted me to learn more about the world. Thanks to them, I
witnessed the variety of the world and a lot of beautiful scenery.
But no matter where I go, in my heart, the place with the most
beautiful scenery is my hometown.
There are two paragraphs above. If I search for 'my parents', I hope to get the whole paragraph "When I was very small, my parents....... a lot of beautiful scenery", not only part of it. I used HighlightFragsize to limit the sentence, but the result is not what I want. Please help. Thanks in advance.

You haven't provided a lot of information to go on, but I'm assuming that you're using a highlighter, so here are a couple of things you should check:
1. The field that holds your parsed data: is it stored? Can you see the entire contents?
2. If (1), is the text longer than 51200 chars? The default highlighter configuration has a setting maxAnalyzedChars that is set to 51200. This means that the highlighter will not process more than 51200 characters from a highlighted field in a matched document to look for highlights. If this is the case, increase this value until you get the desired results. Highlighting on extremely large fields may incur a significant performance penalty, which you should be mindful of before choosing a configuration.
See this for more details.
I don't think there's any parameter called HighlightFragsize but there's one called hl.fragsize which can do what you want when set to zero.
Try the following query and see if it works for you:
Additionally, you should, in any case, be mindful of the first 2 points I posted above.
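As a rough illustration of the parameters discussed above, here is a minimal Python sketch that builds such a request URL. The core URL (http://localhost:8983/solr/docs) and the field name (content) are assumptions for the example, not something taken from the question; only hl, hl.fl, hl.fragsize and hl.maxAnalyzedChars are real highlighting parameters.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, field, phrase, max_chars=51200):
    """Build a Solr select URL that asks for whole-field highlights.

    base_url and field are hypothetical; adjust them to your own core/schema.
    """
    params = {
        "q": f'{field}:"{phrase}"',
        "hl": "true",
        "hl.fl": field,
        # 0 means "do not fragment": the whole field value is returned.
        "hl.fragsize": 0,
        # Raise this if the stored text is longer than the default 51200.
        "hl.maxAnalyzedChars": max_chars,
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_highlight_url("http://localhost:8983/solr/docs", "content", "my parents")
print(url)
```

Note that with hl.fragsize=0 you get the whole field back, not a single paragraph, so this only helps if each field value is already one paragraph.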
I don’t think there’s a direct way to do what you’re looking for. You could possibly split up your field into a multi valued field with each paragraph being stored as a separate value.
You can then possibly use hl.preserveMulti, hl.maxMultiValuedToExamine and hl.maxMultiValuedToMatch to achieve what you need.
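To make the multi-valued idea concrete, here is a small Python sketch of the pre-processing step. It assumes paragraphs in the extracted text are separated by blank lines, which may not hold for every document Tika parses; the function names are made up for this example.

```python
import re


def split_paragraphs(text):
    """Split extracted text into paragraphs on blank lines (an assumption)."""
    parts = re.split(r"\n\s*\n", text.strip())
    # Collapse hard line wraps inside a paragraph into single spaces.
    return [" ".join(part.split()) for part in parts if part.strip()]


def paragraphs_containing(paragraphs, phrase):
    """Return the paragraphs that contain the phrase, ignoring case."""
    needle = phrase.lower()
    return [p for p in paragraphs if needle in p.lower()]


text = """When I was very small, my parents took me to many places, because they
wanted me to learn more about the world. Thanks to them, I witnessed the
variety of the world and a lot of beautiful scenery.

But no matter where I go, in my heart, the place with the most beautiful
scenery is my hometown."""

paragraphs = split_paragraphs(text)
hits = paragraphs_containing(paragraphs, "my parents")
print(hits)
```

Each element of paragraphs would then be sent as one value of the multi-valued field, so that a highlight on a single value corresponds to a single paragraph.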


Parsing text, then searching it: one entry per position, vs. 1 JSON column per text

I have a Rails application using Postgresql.
Texts are added to the application (ranging in size from a few words to, say, 5,000 words).
The texts get parsed, first automatically, and then with some manual revision, to associate each word/position in the text with specific information (verb/noun/etc, base word (running ==> run), definition_id, grammar tags)
Given a lemma (base word, ex. "run"), or a part of speech (verb/noun), or grammar tags, or a definition_id (or a combination), I need to be able to find all the other text positions in the database that contain the same information.
I can't do a full-text search because, for example, if I click "left" in "I left Nashville", I don't want "turn left at the light" to appear. I just want "leave" as a verb, as well as other forms of "leave" as a verb.
Also, I might want just "left" with a specific definition_id (eg "Left" used as "The political party", not used as "the opposite of the right").
In short, I am looking for some advice on which of the following 3 routes I should take (or if there's a 4th or 5th route that I haven't considered).
There are three options I can think of:
Option 1: TextPosition
A TextPosition table to store each word position, with columns for each of the above attributes.
This would make searching very easy, but there would be MANY records (one for each position). Maybe that's not a problem? Is storing this number of records a bad idea for some specific reason?
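For illustration, Option 1 could look roughly like the sketch below. It uses Python with an in-memory SQLite database as a stand-in for PostgreSQL, and the table and column names (text_positions, lemma, part_of_speech, and so on) are assumptions, not an existing schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per word position; all column names are hypothetical.
conn.execute(
    """
    CREATE TABLE text_positions (
        id INTEGER PRIMARY KEY,
        text_id INTEGER NOT NULL,
        position INTEGER NOT NULL,
        surface TEXT NOT NULL,
        lemma TEXT NOT NULL,
        part_of_speech TEXT NOT NULL,
        definition_id INTEGER
    )
    """
)
# A composite index keeps lemma + part-of-speech lookups cheap as rows grow.
conn.execute(
    "CREATE INDEX idx_lemma_pos ON text_positions (lemma, part_of_speech)"
)

rows = [
    (1, 1, "left", "leave", "verb", 10),     # "I left Nashville"
    (2, 1, "left", "left", "adverb", 20),    # "turn left at the light"
    (3, 0, "leaving", "leave", "verb", 10),  # another form of "leave"
]
conn.executemany(
    "INSERT INTO text_positions "
    "(text_id, position, surface, lemma, part_of_speech, definition_id) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)

matches = conn.execute(
    "SELECT text_id, position, surface FROM text_positions "
    "WHERE lemma = ? AND part_of_speech = ? ORDER BY text_id, position",
    ("leave", "verb"),
).fetchall()
print(matches)
```

The query returns only the positions where "leave" is used as a verb, which is the behaviour described above that a plain full-text search cannot give.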
Option 2: JSON on the Text object
A JSON column on the Text object, to store all word positions in a large array of hashes, or a hash of hashes.
This would add zero records, but, a) Building a query to search all texts with certain information would probably be difficult, b) That query would probably be slow, and c) It could take up more storage space than a separate table (TextPosition).
Option 3: TWO JSON columns: one on the Text object, and one on each dictionary object
A JSON in each text object, as in option 2, but only to render the text (not to search), containing all the information about each position in that same text.
Another JSON in each "dictionary object" (definition, base word, grammar concept, grammar tag), just for searching (not to render the text). This column would track the matches of this particular object across ALL texts. It would be an array of hashes, where each hash would be {text_id: x, text_index: y}.
With this option, the search would be "easier", but it would still not be ideal: to find all the text positions that contain a certain attribute, I would have to do the following:
Find the record for that attribute
Extract the text_ids / indexes from the record
Find the texts with those IDs
Extract the matching line from each text, using the index that comes with each text_id within the JSON.
If I were looking for a combination of attributes, I would have to do those 4 steps for each attribute, and then find the intersection between the sets of matches for each attribute (to end up with only the positions that contain all of them).
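The intersection step could be sketched in Python as below. The record shape ({"matches": [{text_id, text_index}, ...]}) follows the description of Option 3; the function names and sample data are made up for the example.

```python
def positions_for(record):
    """Turn one dictionary object's JSON matches into a set of positions."""
    return {(m["text_id"], m["text_index"]) for m in record["matches"]}


def positions_matching_all(records):
    """Positions that appear in the matches of every given record."""
    sets = [positions_for(record) for record in records]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical JSON columns of two dictionary objects.
lemma_leave = {
    "matches": [
        {"text_id": 1, "text_index": 1},
        {"text_id": 2, "text_index": 4},
    ]
}
pos_verb = {
    "matches": [
        {"text_id": 1, "text_index": 1},
        {"text_id": 3, "text_index": 0},
    ]
}

common = positions_matching_all([lemma_leave, pos_verb])
print(common)
```

Note that all of this happens in application code after loading the full JSON of each record, which is part of why this option is less convenient than a single indexed query.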
Furthermore, when updating a position (for example, if a person indicates that an attribute is wrongly associated and that it should actually be another), I would have to update both JSONs.
Also, will storing 2 JSON columns actually bring any tangible benefit over a TextPosition table? It would probably take up MORE storage space than using a TextPosition table, and for what benefit?
In sum, I am looking for some advice on which of those 3 routes I should follow. I hope the answer is "option 1", but if so, I would love to know what drawbacks/obstacles could come up later when there are a ton of entries.
Thanks, Michael King
Text parsing and searching make my brain hurt. But anytime I have something with the complexity of what you are talking about, ElasticSearch is my tool of choice. You can do some amazingly complex indexing and searching with it.
So my answer is 4) ElasticSearch.

solr fuzzy vs wildcard vs stemmer

I have couple of questions here.
I want to search a term jumps
With Fuzzy search, I can do jump~
With wild card search, I can do jump*
With stemmer I can do, jump
My understanding is that, fuzzy search gives pump. Wildcard search gives jumping as well. Stemmer gives "jumper" also.
I totally agree with the results.
What is the performance of these three?
Wildcard is not recommended if it is at the beginning of the term. My understanding is that it then has to match against all the tokens in the index. But in this case, it would only be the tokens that start with jump.
Fuzzy search gives me unpredicted results - It has to do something kind of spellcheck I assume.
Stemmer suits only particular scenarios; for example, it can't match pumps.
How should I use these things which can give more relevant results?
I am probably more confused about all of this because of this section. Any suggestions please?
Question 1
Wildcard queries are (generally) not analysed (i.e. they're not tokenized or run through filters), meaning that anything that depends on filters processing the input/output tokens will give weird results (for example if the input string is broken into multiple strings).
The matching happens on the tokens, so what you've input is almost (lowercasing still works) matched directly against the prefix / postfix of the tokens in the index. Generally you'd want to avoid wildcard queries for general search queries, since they're rather limited for natural search and can give weird results (as shown).
Fuzzy search is based on "edit distance" - i.e. a number that tells Solr how many characters can be removed/inserted/changed to get to the resulting token. This will give your users OK-ish results, but might be hard to decipher in the sense of "why did this give me a hit" when the allowed distance is larger (Lucene/Solr supports up to 2 in edit distance which is also the default if no edit distance is given).
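To make "edit distance" concrete, here is a small Python sketch of the plain Levenshtein distance. It is only an illustration of the idea; Lucene's own implementation differs in details (for example, it may count a transposition of two adjacent characters as a single edit).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


for candidate in ["pump", "jumps", "jumping"]:
    print(candidate, edit_distance("jump", candidate))
```

With a maximum distance of 2, "pump" and "jumps" are within reach of jump~, while "jumping" (three insertions) is not.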
Stemming is usually the way to go, as it's the actual "formal" process of taking a term and reducing it down to its stem, the actual "meaning" of the word. (It doesn't really know anything about meaning in the natural language processing sense; it applies a set of static rules and exceptions for the configured language.) It can be adjusted per language with rules suitable for that language, which neither of the other two options can.
For your downside regarding stemming ("Since it can't match pumps") - that might actually be a good thing. It'll be clearer to your users what the search results are based on, and instead of including pumps in your search result, include it as a spelling correction ("Did you mean pump / pumps instead?"). It'll give a far better experience for any user, where the search results will more closely match what they're searching for.
The requirements might differ based on what your actual use case is; i.e. if it's just for programmatic attempts to find terms that look similar.
Question 2
Present those results you deem more relevant as the first hits - if you're doing wildcard or fuzzy searches you can't do this through scoring alone, so you'll have to make several queries and then present them after each other. I usually suggest making that an explicit action by the user of the search when discussing this in projects.
Instead, as the main search, you can use an NGramFilter in a separate field and use a copyField instruction to get the same content into both fields, and then score hits in the ngram field far lower than hits in the more "exact" field. Usually you want three fields in that case: one for exact hits (non-stemmed), one for stemmed hits and one for ngram hits, scored appropriately with the qf parameter to edismax. This is usually the quickest and easiest way to get decent search results for your users, but make sure to give them decent ways of either filtering the result set (facets) or changing their queries into something more meaningful (did you mean, also see xyz, etc.).
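A minimal Python sketch of what the request parameters for such a three-field setup might look like. The field names (title_exact, title_stemmed, title_ngram) and the boost values are assumptions for illustration; only defType, q and qf are actual parameters.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query):
    """Parameters for an edismax query over three hypothetical fields."""
    return {
        "defType": "edismax",
        "q": user_query,
        # Exact hits score highest, stemmed hits lower, ngram hits lowest.
        "qf": "title_exact^10 title_stemmed^4 title_ngram^0.5",
    }


params = build_edismax_params("jumps")
print(urlencode(params))
```

The exact boosts need tuning against your own data; the point is only that the ngram field is weighted well below the other two.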
Guessing the user's intent is usually very hard unless you have invested a lot of time and resources into personalisation (think Google), so leave that for later - most users are happy as long as they have a clear and distinct way of solving their own problems, even if you don't get it perfect for the first result.
For question 2 you can go strict to permissive.
Option one: Give only strict search results. If no result is found, give stemmer results. Continue with fuzzy or wildcard search if still nothing is found.
Option two: Give all results but rank them by level (i.e. first exact matches, then stemmer results, ...)
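Option one could be sketched in Python roughly as follows. The field names (text_exact, text_stemmed) are hypothetical, and run_query stands in for whatever function actually sends the query to Solr and returns a list of hits.

```python
def search_strict_to_permissive(term, run_query):
    """Try progressively looser queries and stop at the first with hits.

    run_query is a caller-supplied function (an assumption here) that takes
    a query string and returns a list of results.
    """
    stages = [
        ("exact", f"text_exact:{term}"),
        ("stemmed", f"text_stemmed:{term}"),
        ("fuzzy", f"text_exact:{term}~"),
        ("wildcard", f"text_exact:{term}*"),
    ]
    for label, query in stages:
        results = run_query(query)
        if results:
            return label, results
    return None, []


# A fake backend for demonstration: only the fuzzy query finds something.
fake_index = {"text_exact:jump~": ["pump", "jump"]}
label, results = search_strict_to_permissive(
    "jump", lambda query: fake_index.get(query, [])
)
print(label, results)
```

The trade-off is one extra round trip per stage that comes back empty, in exchange for results whose origin is easy to explain to the user.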

If possible, what is the Solr query syntax to filter by doc size?

Solr 4.3.0
I want to find the larger size documents.
I'm trying to build some test data for testing memory usage, but I keep getting the smaller sized documents. So, if I could add a doc size clause to my query it would help me find more suitable documents.
I'm not aware of this possibility, most likely there is no support for it.
I can see one possible approach: during indexing, store the size of the document in a separate field, which you can later filter on.
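A minimal Python sketch of that approach, assuming the text lives in a field called content and the size is stored in a numeric field called content_length (both names are made up for this example):

```python
def with_size_field(doc, text_field="content", size_field="content_length"):
    """Return a copy of the document with its text size in bytes added."""
    enriched = dict(doc)
    enriched[size_field] = len(doc.get(text_field, "").encode("utf-8"))
    return enriched


def min_size_filter(min_bytes, size_field="content_length"):
    """Build a filter query (fq) matching documents of at least min_bytes."""
    return f"{size_field}:[{min_bytes} TO *]"


doc = with_size_field({"id": "1", "content": "hello"})
print(doc)
print(min_size_filter(10000))
```

The enriched document is what you would send to Solr at index time, and the returned string goes into the fq parameter at query time.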
Another possibility is to use TermVectorComponent, which can return term vectors for matched documents and so give some idea of how big each document is. It is not easy or simple, though.
Example of the possibly useful output:
A third possible option (kudos to MatsLindh for the idea) is to sort with the function norm() on a specific field. There are some limitations:
You need to use a classic similarity
The field you're sorting on should contain norms
Example of the sorting function: sort=norm(field_name) desc

Using Topic Model, how should we set up a "stop words" list?

There are some standard stop lists, giving words like "a the of not" to be removed from corpus. However, I'm wondering, should the stop list change case by case?
For example, I have 10K articles from a journal. Because of the structure of an article, you will see words like "introduction, review, conclusion, page" in basically every article. My concern is: should we remove these words (the words that every document has) from our corpus? Thanks for every comment and suggestion.
I am working on a similar problem, but for text categorization. From my experience, it is good to have a domain-specific stop word list along with the standard list. Otherwise, words like "introduction", "review", etc. will come up in the term frequency matrix, as you will see if you analyse it. They can mislead your models by giving more weight to these domain-specific keywords.
It is worth considering that the stop words might not affect your model as much as you fear. Have you tried not removing them and compared the results?
See also this 2017 paper: "Pulling Out the Stops: Rethinking Stopword Removal for Topic Models."
In their conclusion they say (paraphrasing) that leaving stopwords in had no real negative effect on the quality of the LDA model, and that if needed they could still be removed afterwards without impacting the model.
Alternatively, you can always remove words with a high document frequency automatically, i.e. set a threshold on the share of documents a word can appear in (e.g. 50%) and treat all words that are more frequent than that as stopwords.
I don't think this will meaningfully impact the model itself, but I'm sure it'll speed up the computations of the model, by virtue of there being fewer words to compute.
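The document-frequency threshold can be sketched in plain Python like this. The function names and the tiny corpus are made up for illustration; documents are assumed to be already tokenized.

```python
from collections import Counter


def high_df_words(docs, max_df=0.5):
    """Words that appear in more than max_df (a fraction) of the documents."""
    document_frequency = Counter()
    for tokens in docs:
        # Count each word once per document, however often it occurs there.
        document_frequency.update(set(tokens))
    limit = max_df * len(docs)
    return {word for word, count in document_frequency.items() if count > limit}


def remove_words(docs, words):
    """Drop the given words from every tokenized document."""
    return [[token for token in tokens if token not in words] for tokens in docs]


docs = [
    ["introduction", "solr", "search"],
    ["introduction", "topic", "model"],
    ["introduction", "search", "index"],
    ["conclusion", "lda"],
]
stopwords = high_df_words(docs, max_df=0.5)
filtered = remove_words(docs, stopwords)
print(stopwords)
print(filtered)
```

Here "introduction" appears in 3 of 4 documents and is removed, while "search" appears in exactly half and is kept because the threshold is strict.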

Getting facet count 0 in solr

I am using Solr search with faceting in my application. In my use case, the index files in the data dir keep changing.
The problem is that when I facet on a particular field, I get values from the indices that were previously in the data dir (and are not present currently). However, they are returned with a count of 0. I don't understand where the values from the previous indices are persisted, and why they are returned during a completely new search.
Though I can simply skip the facets with count 0, I suspect that this could seriously hurt my scalability. Any pointers on how to not include the facets from previous searchers?
[Edit 1]: The current workaround I am using is to add facet.mincount=1 to my URL. But still, I guess this could hurt my performance.
I couldn't find a comment option and I don't have enough reputation to vote up!
I have the same exact problem.
We are using atomic updates with solr 4.2.
I found some explanation here:
To efficiently handle facets for multi-valued fields (like tags), Solr
builds an "uninverted index" (which you think would just be called an
"index", but I suppose that's even more confusing), which maps
internal document IDs to the list of terms they contain. Calculating
facets from this data structure just requires walking over every
document in the result set, looking up the terms it contains in the
uninverted index, and adding them to the tally for all documents.
However, there's a sneaky optimisation here that causes the zero
counts we're seeing. For terms that appear in more than 5% of
documents, Solr doesn't include them in the uninverted index (leaving
them out helps to keep the size in memory down, I guess), and instead
gets the count for these terms using a regular query against the
Lucene index. Since the set of "common" terms isn't specific to your
result set, and since any given result set won't necessarily contain
all of these terms, you can get back counts of zero.
It may not be from old index values but just terms that exist in more than 5% of documents?
I think facet.mincount=n is not a workaround; it is the parameter you should use to get only the facets with a non-zero count.
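For reference, a small Python sketch of both ways of dropping zero counts: asking Solr for them with facet.mincount, or filtering on the client. The field name (tags) and sample data are assumptions, and the client-side function assumes the flat [term, count, term, count, ...] list layout of facet_fields in the JSON response.

```python
def facet_params(field, mincount=1):
    """Request parameters that ask Solr to omit facets below mincount."""
    return {
        "q": "*:*",
        "facet": "true",
        "facet.field": field,
        "facet.mincount": mincount,
    }


def nonzero_facets(flat_counts):
    """Drop zero-count entries from a flat [term, count, ...] facet list."""
    pairs = zip(flat_counts[0::2], flat_counts[1::2])
    return {term: count for term, count in pairs if count > 0}


params = facet_params("tags")
counts = nonzero_facets(["solr", 12, "lucene", 0, "tika", 3])
print(params)
print(counts)
```

Letting Solr do the filtering keeps the response smaller, so the server-side parameter is usually the simpler choice.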
