Long queries on very short documents

Long queries on very short documents - solr

I'm fresh out of the nursery as far as Lucene/Solr are concerned, so I may be trying to utilize it completely wrong, but I hope someone can point me in the right direction.
My documents (less than 3,000) are short statements from a taxonomy. All are single sentences, with some having no more than 4-6 words long. There is only one field for each document, so searching across multiple fields is not a route I would be looking into. What I would like to do is query the contents of a work related document and have the taxonomy statements that are relevant returned.
Currently I am using the default example setup that came with Solr with added verb synonyms from Wordnet since performed actions are what I am trying to identify (i.e. taxonomy statement of 'Alter garments to specifications').
Basic word matching works as expected, but I would like to make things somewhat more sophisticated. Since the queries are so long I never end up with a high relevancy scores when searching against the tiny documents. I'm sure this can be resolved by normalizing scores in some fashion so I am not real concerned about the scores coming out, but the actual statements (documents) that are being identified.
Would I be better off indexing the documents (currently the long queries) on the fly and querying each taxonomy statement and compiling/sorting the results or can I perform these long queries on the tiny documents effectively in some other fashion? I presume this may present it's own difficulties.

I see no end to what are you trying to do here, i mean your short documents index will definitely suffer from lake of information, and a long query will make every result almost flat in front of it, even expanding the document by adding every term with Wordnet synonyms will be confusing and misleading i think, my advice is to chack other possible forms of the query.

Related

Solr near real time search: impact of reindexing frequently the same documents

We want to use SolR in a Near Real Time scenario. Say for example we want to filter / rank our results by number of views.
SolR SoftCommit was made for this use case but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know each update, even partial are implemented as a full delete and full addition of the document in lucene.
It seems to me having many times the same docs in the Tlog is inefficient and might also be problematic during the merge process (is the doc marked n times as deleted and added?)
Any advice / good practice?

Two things you could use for supporting this scenario:
In place updates: only that field is udpated, not the whole doc. Check out the conditions you need to be able to use them.
ExternalFileFieldType you keep the values in an external file
if the scenario is critical, I would test both in reald world conditions if possible, and asses.

Solr 4.5: When is Solr facet query better than simple query?

I'm working with Apache Solr and would like to get more detailed information about some query options. I discovered facet queries and was wondering, when exactly do they bring essential advantages; especially in case of the following example:
There is a stock of books that is saved on a Solr server. Despite the common attributes a book ought to have, they have an ISBN. Data about books is provided by third parties and so it's important to check that there are no doubled ISBNs within the system. In order to check if a book's ISBN is a duplicate, it has to go through a routed path, were - unfortunately - every book is processed indiviually without any information about preceeding or following processes.
The question is:
a) Should you simply query Solr with the current book ISBN and check the total results, or
b) should you send a facet query with a f.isbn.facet.mincount=2 and check if the result contains the current book ISBN?
In both cases, caching results is not possible. So the number of queries would always equal the number of books processed. I simply don't know how Solr works within and therefore can't make this decision without further information, especially because the number of queries won't be reduced by either of above possibilities.

If you're going to do a query - do a query. Lucene is highly optimized for doing queries, so that's what you should do. A facet query is for creating facets (counts) from arbitrary queries - so internally it does the same thing. If you generate a facet and then iterate through that one, Lucene has to look at far more documents than if you're just querying for one single value.
The best strategy to get a performance boost would be to perform these operations in batch - check 500 books in the same batch (i.e. isbn:(123 OR 321 OR 567 OR 765)), and then handle that in your code. If these updates can arrive from many systems in parallel without going through one single source, you'll have to decide how much time you can spend before any duplicates might appear in the streams (this race condition can happen with just one book as well, as two streams can query for a single isbn and get a negative result before adding it separately from both streams).

Find similar results with Lucene / SOLR index

We have an application for tagging user selections over a large corpus of MS Word documents. We tag these selections with one or more keyword tags, and usually a title tag. We want to add a feature where the selected text is instantly analyzed, and the tagger is presented with a list of most-likely keyword and title tags (based on the existing tagged text selections)
We are using a SOLR index. I have been told that we can simply issue the selected text as the query itself to return similar selections. However, the selected text could be anywhere between 200 and 6000 words long. A 6000 word query may be a problem in terms of memory usage!
I thought we could do some very aggressive stopword removal to significantly reduce the number of words in the queries, leaving only the very meaningful words. We have been working with this corpus for the last 10 years and we are very familiar with the subject matter and the vocabulary used, so this would be easy for us to do. But the problem is that we also use the same index for allowing the normal users to search the index, and if we remove too many common words, then their normal queries may not work properly (especially phrase queries).
We would also like to boost the results that contain the text of the query within a smaller range, rather than just spread arbitrarily throughout the document.
Another issue is that we allow nested selections. The outer selection may be more general in nature and be around 5000 words long, and the inner selections will be shorter and topically more specific. However, since both selections contain the same text, SOLR ranks them both highly, when the outer selection may not be so relevant
I have spent the last few days going through the SOLR query parser documentation, and it looks like this should be doable, but I'm still not sure exactly what I need to do to make this work. Any suggestions would be much appreciated.

Solr have multi-core facility. So if you can have one core for your internal work and you can reveal the other core for public domain, it may solve your issue.
You can refer this section
http://wiki.apache.org/solr/Solr.xml%20(supported%20through%204.x)
or you can refer Solr cores and solr.xml section in solr reference manual.

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say based on your content you're most like Record X as apposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!

Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regex expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.

I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated use also word bigrams and tigrams, even higher order n-grams. A bigram is just a sequence of consecutive words, in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log prepare queries in a similar way, the search engine may give you the most similar results. You may need to tweek the similarity function a bit to obtain best results but I believe this is a good start.

Two main strategies come to my mind here:
the ad-hoc one. Use an information retrieval approach. Build an index for the log entries, eventually using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard people do this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually the "information retrieval" approach is however only interested in finding the 10 most similar results.
the clustering approach. You will usually need to turn the data into numerical vectors (that may however be sparse) e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast, however it will always just return you some similar existing log lines, without much quantities on how common this line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.

It sounds like you could take the lucene approach mentioned above, then use that as a source for input vectors into the machine learning library Mahout (http://mahout.apache.org/). Once there you can train a classifier, or just use one of their clustering algorithms.

If your DBMS has it, take a look at SOUNDEX().

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort_order=sort_path+desc&rows=N&start:12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that solr ranges combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path:desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.

This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot less documents (therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight