I have a set of keywords defined by client requirements stored in a SOLR field. I also have a never ending stream of sentences entering the system.
By using the sentence as the query against the keywords I am able to find those sentences that match the keywords. This is working well and I am pleased. What I have essentially done is reverse the way in which SOLR is normally used by storing the query in Solr and passing the text in as the query.
Now I would like to be able to extend the idea of having just a keyword in a field to having a more fully formed SOLR query in a field. Doing so would allow proximity searching etc. But, of course, this is where life becomes awkward. Placing SOLR query operators into a field will not work as they need to be escaped.
Does anyone know if it might be possible to use the SOLR "query" function or perhaps write a java class that would enable such functionality? Or is the idea blowing just a bit too much against the SOLR winds?
Thanks in advance.
ES has percolate for this - for Solr you'll usually index the document as a single document in a memory based core / index and then run the queries against that (which is what ES at least used to do internally, IIRC).
I would check out the percolate api with ElasticSearch. It would sure be easier using this api than having to write your own in Solr.
Related
Is there any parameter like (edismax or dismax or any other) that i can set for stemming to work in Solr or i need to make changes in schema.xml of Solr to implement the stemming ?
Problem is if i change schema.xml by default stemming/phoentic work which i dont want ? I am using Solr from third party application and in UI we have checkbox for stemming to check/uncheck , i pass these paramaters to Solr and get the data from Solr, i cant pass this UI parameter to SOlr, so if there is any parameter at Solr side i can pass that for stemming to work ?
Please let me know ?
Stemming is performed as part of the analysis chain, and therefor is part of how the schema for that particular field is defined.
The reason for this becomes apparent when you consider how stemming works - for stemming to make sense, the term has to be stemmed when it's being indexed, as well as when being queried.
Lucene takes your input string, runs it through your analysis chain and saves the generated tokens to its index. Giving it what are you asking will probably end up as what, are, you, ask after tokenizing by whitespace and applying stemming.
The same operation happens when querying, so if someone searches for asks, the token gets stemmed to ask - and then compared against what's in the index. If stemming hadn't taken place when indexing, you'd end up with asking in the index, and ask when querying - and that isn't a match, since the tokens aren't the same.
In your third party application the stemming option probably performs stemming inside the application before sending the content to Solr.
You can also use the Schema API to dynamically update and change field type definitions.
I am using Solrj to build queries for Solr server.
So I have some pretty short free-form texts that can contain various special characters - like Mr. John's New-Wall, "Hotels & Food".
A phrase query for text like this would not produce enough matches. So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food. (It probably would be good to somehow consider the term proximity, but I have to start with something).
The field that I am searching is the default text_general field. I started with replacing some special characters with spaces and splitting them up to extract the terms. But it feels kind of redundant.
Isn't there an easier way to extract terms from text using Solrj and Solr? Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index.
I am not sure exactly what your question is, however here is a bit of info that you may find helpful:
Basically I would like to extract terms from text similarly to how it is done by Solr when it creates its index.
You can configure indexing and query field processing in your schema. I would suggest you take a look in here. This gives you a bit of flexibility to normalize your data.
So from this text I would like to extract terms for building a simple query, something like content:Mr OR content:John's OR content:Hotels OR content:Food.
This is the default way that solr queries under the hood. I would suggest you look up edismax query parser and qf and tie parameters.
Hope it helps
I am pretty new to Lucene, so would like to get some help from you guys :)
BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.
Q1) In this case, in order to do the keyword search on the documents, should I insert all of those documents to the Lucene index? Does this mean there will be data duplication (one in SQL Server and the other one in the Lucene index?) It could be a matter since we have a massive amount of documents (about 100GB). Is it inevitable?
Q2) Also, each documents has a set of tags (up to 3). Lucene is also good choice for the tag search? If so, how to do it?
Thanks,
Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here, for a brief introduction. A typical implementation would be to index anything you wish to be able to support searching on, and store only a unique identifier in the Lucene index, and pull any records founds by a search from the database, based on the ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
As for saving on space, there will be some measure of duplication. This is true even if you only Lucene, though. Lucene stores the inverted index used for searching entirely separately from stored data. For saving on space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient, in most cases.
Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call is "tags", which seems to make sense), while building the document, such as:
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));
and I could simply add a required term to any query to search only within a particular tag. For instance, if I was to search for "some stuff", but only with the tag "forkids", I could write a query like:
some stuff +tags:forkids
Documents can also be stored in Lucene, you can retrieve and reference them using the document ID.
I would suggest using Solr http://lucene.apache.org/solr/ on top of Lucene, is more user friendly and has multiValued fields (for the tags) available by default.
http://wiki.apache.org/solr/SchemaXml
With the Levenshtein implementation of Lucene 4 claiming to be 100 times faster than before (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html) I would like to do fuzzy matching of all terms in a query. The idea is that a search for 'gren hose' should be able to find the document 'green house' (I don't really care about phrases at this point, the quotes are just here to make this more readable).
I am using Lucene 4 + Solr 4. As I'm doing some pre- and post-processing there is a small wrapper servlet around Solr, the servlet is using SolrJ to eventually access Solr
I'm currently a little lost on what would be the right way to achieve this. My basic approach is to break the search query down into terms and append the tilde / fuzzy operator to each term. Thus 'gren hose' would become 'gren~ hose~' . Now the question is how to properly do this. I can see several ways:
Brute force: Assume that the terms are delimited by whitespace, so just parse the query and append a tilde before each whitespace (ie. after each term)
Two steps: Send the query to Solr with query debugging turned on. This will give me a list of query terms as parsed by Solr. I can then extract the terms from the debug output, append the tilde operator and re-run the query with the added tilde operators
Internally: Hook into the search request handler and append the tilde operator after the query has been parsed into terms
Method 1 stinks a lot, as it circumvents Solr's query parsing entirely, so I would rather not do that. Method 2 sounds quite doable if the cost of parsing the query twice is not too high. Method 3 sound just right, but I have yet to figure out where I have to hook into the processing chain.
Maybe there is a completely different way to achieve what I want to do, or maybe it's just a stupid idea on my part. Anyway, I would really appreciate a few pointers, maybe someone else has already done something like this. Thanks!
I would propose the following methods:
Implement a query handler module in your application where you can build solr query from the input user query. This way nothing changes in the SOLR side and your application has all the control on what goes into SOLR.
Implement your own query parser , you can start from Standard SOLR query parser (org.apache.solr.search.QParser) and make your changes. Your application just needs to select your custom query parser and rest your implementation should take care.
I would prefer method 1 as this makes the system completely agnostic to SOLR upgrades, any new release of Solr will not require me to update the custom qparser and you will not have to update/build and setup your custom qparser in the new version.
If you dont have any control on the app and dont want to go through the qparser route , then you can implement a Servlet filter that transforms the solr query before it is dispatched to solr request filter.
I'm not entirely sure on the vocabulary, but what I'd like to do is send a document (or just a string really) and a bunch of keywords to a Solr server (using Solrnet), and have a return that tells me if the document is a match for the keywords or not, without having the document being stored or indexed to the server.
Is this possible, and if so, how do I do it?
If not, any suggestions of a better way? The idea is to check if a document is a match before storing it. Could it work to store it first with just a soft commit, and if it is not a match delete it again? How would this affect the index?
Index a document - send it to Solr to be tokenized and analyzed and the resulting strings stored
Store a document - send it to Solr to be stored as-is, without any modifications
So if you want a document to be searchable you need to index it first.
If you want a document (fields) to be retrievable in its original form, you need to store a document.
What exactly are you trying to accomplish? Avoid duplicate documents? Can you expand a little bit on your case...