Let's say we have a list of articles that are indexed by Sunspot/Solr/Lucene (or any other search engine).
How can the search engine be used to find articles similar to a given article?
Should this be done with a keyword-extraction tool, like:
http://www.wordsfinder.com/api_Keyword_Extractor.php, termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html ?
It seems you're looking for the MoreLikeThis feature.
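For the Solr case, a minimal SolrJ sketch of calling the MoreLikeThis handler might look like the following. The core name `articles`, the document id, and the field list are assumptions, and `mlt.fl` must name fields that are stored or indexed with term vectors:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MltExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core name "articles"; adjust to your setup.
        HttpSolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrQuery query = new SolrQuery("id:12345"); // the article to find neighbors for
        query.setRequestHandler("/mlt");             // MoreLikeThis request handler
        query.set("mlt.fl", "title,body");           // fields to mine for significant terms
        query.set("mlt.mintf", 1);                   // min term frequency in the source doc
        query.set("mlt.mindf", 1);                   // min doc frequency across the index

        QueryResponse response = client.query(query);
        System.out.println(response.getResults());   // the similar articles
        client.close();
    }
}
```

The same `mlt.*` parameters also work as plain URL parameters if you query the handler directly rather than through SolrJ.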
What you are trying to do is very similar to the task I outlined in this answer.
In brief, you need to generate a summary for each document that you can use as a query to compare it against every other document. A document summary can be as simple as the top N terms in that document (excluding stop words). You can generate the top N terms for a Lucene document fairly easily without any third-party tools; there are plenty of examples on Stack Overflow and around the web.
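As a rough sketch of that idea, assuming the field was indexed with term vectors enabled (the field name `body` is just a placeholder):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TopTerms {
    // Returns the top-n terms of one document, ranked by term frequency.
    static List<String> topTerms(IndexReader reader, int docId, String field, int n)
            throws Exception {
        Terms vector = reader.getTermVector(docId, field); // null if no term vectors
        if (vector == null) {
            throw new IllegalStateException("field must be indexed with term vectors");
        }
        Map<String, Long> freqs = new HashMap<>();
        TermsEnum te = vector.iterator();
        BytesRef term;
        while ((term = te.next()) != null) {
            freqs.put(term.utf8ToString(), te.totalTermFreq());
        }
        return freqs.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

The returned terms can then be OR-ed together into a single query and run against the rest of the index to rank similar documents.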
We are desperate to switch over to Lucene (via Solr), but one big issue we have is the syntax support.
dtSearch supports xfirstword, w/N, pre/N, and probably some others.
I think w/N can be ported to Lucene, but I have no idea how to port the others.
I did a search and found an article claiming they had made the switch while still supporting dtSearch syntax, but I have yet to get the source. I left a comment asking about getting the source, but no response yet.
What do you guys recommend?
We basically want Solr with dtSearch syntax.
Do you have any good articles on what specifically needs to be added at indexing time, etc., to support these operators?
Since I wasn't able to find a good solution to this, I wrote a dtSearch parser in Antlr4.
Many of you have asked for it, so I've posted it to GitHub.
Here's the link:
https://github.com/blmille1/dtsearchparser
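For anyone porting by hand rather than through a parser, dtSearch's proximity operators map reasonably well onto Lucene's span queries. A hedged sketch (the field and terms are placeholders, the span classes' package location varies across Lucene versions, and the slop semantics are close but not identical to dtSearch's, so verify against known results):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class DtSearchOps {
    public static void main(String[] args) {
        SpanQuery apple  = new SpanTermQuery(new Term("body", "apple"));
        SpanQuery banana = new SpanTermQuery(new Term("body", "banana"));

        // dtSearch "apple w/5 banana": within 5 words, either order
        Query wN = new SpanNearQuery(new SpanQuery[] {apple, banana}, 5, false);

        // dtSearch "apple pre/5 banana": within 5 words, apple must come first
        Query preN = new SpanNearQuery(new SpanQuery[] {apple, banana}, 5, true);

        // dtSearch "xfirstword" style constraints can be approximated with
        // SpanFirstQuery, which only matches spans ending before a given position.
        Query first = new SpanFirstQuery(apple, 10);

        System.out.println(wN + "\n" + preN + "\n" + first);
    }
}
```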
ElasticSearch has the percolator for prospective search. Does Solr have a similar feature where you define your queries upfront? If not, is there an effective way of implementing this myself on top of the existing Solr features?
Besides what BunkerMentality said, it is not hard to build your own percolator. What you need:
First, check whether the queries you want to run are easy to model in Lucene-only syntax. If so, you are good; if not, you need to convert them to Lucene-only. Build them and keep them in memory as Lucene Query objects.
When a doc arrives:
build a MemoryIndex containing only that single doc
run all your queries on the index
I have done this for a system ingesting millions of docs a day, and it worked fine.
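A minimal sketch of those steps, assuming the registered queries are already parsed into Lucene Query objects (the field name `body` is illustrative; MemoryIndex lives in the lucene-memory module):

```java
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

public class Percolator {
    private final Analyzer analyzer = new StandardAnalyzer();

    // registeredQueries: query-id -> pre-parsed Lucene query, kept in memory
    public void percolate(String docText, Map<String, Query> registeredQueries) {
        // 1. Build a throwaway in-memory index containing only this one document.
        MemoryIndex index = new MemoryIndex();
        index.addField("body", docText, analyzer);

        // 2. Run every registered query against the single-document index.
        for (Map.Entry<String, Query> e : registeredQueries.entrySet()) {
            float score = index.search(e.getValue());
            if (score > 0.0f) {
                System.out.println("Matched query: " + e.getKey());
            }
        }
    }
}
```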
It's listed as an open feature request, SOLR-4587, in the Solr JIRA, but it doesn't seem like any work has started on it yet.
There is a link in the comments there to a separate project called Luwak that seems to implement some features similar to percolator.
If it is still relevant, you can use this: it's a Solr update processor based on Luwak.
Can someone provide a link/blog/anything with a step-by-step tutorial on Solr autosuggest?
I want to understand the complete configuration: schema, field types, analyzers, and tokenizers.
There is an article specifically on autosuggesters, and another one on building a multi-field autosuggester.
There are also several ways to implement autocomplete. You can use the ngrams approach, which is what I use for search over the Solr/Lucene documentation; you can find the source code for that in the Solr-Javadoc repository.
There is another article on ngrams and edge-ngrams from a couple of years ago.
You could also use facets for some scenarios.
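To illustrate the ngrams approach mentioned above, here is a hedged Lucene sketch using CustomAnalyzer to build an edge-ngram filter chain; the equivalent chain can be declared as a Solr field type with EdgeNGramFilterFactory. The gram sizes are arbitrary choices:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class AutocompleteAnalyzer {
    public static void main(String[] args) throws Exception {
        // Index-time analyzer: "solr" -> s, so, sol, solr
        Analyzer indexAnalyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .addTokenFilter("edgeNGram", "minGramSize", "1", "maxGramSize", "20")
            .build();

        // Query-time analyzer: no ngrams, so the user's raw prefix is matched
        // against the stored grams.
        Analyzer queryAnalyzer = CustomAnalyzer.builder()
            .withTokenizer("standard")
            .addTokenFilter("lowercase")
            .build();
    }
}
```

The usual pattern is asymmetric analysis: edge-ngrams at index time only, so a partial input like "sol" matches the stored grams without itself being ngrammed.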
I am currently using Apache Solr to build a search engine. The queries in Solr are of the field:value format. Now I want to use a part-of-speech tagger to separate the subject, verb and predicate and search the values in each fields. For example, if I input "Who likes Starbucks" then I need some code to give me "q=subject:*&verb=likes&object=starbucks". Is there any library that can handle this job? Thank you!
I think several people have used UIMA for this; see the Solr wiki.
There are a number of POS taggers. Here is another StackOverflow posting about this: What is a good Java library for Parts-Of-Speech tagging?
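Neither answer names a concrete library, so as one hedged possibility, here is a sketch with Apache OpenNLP; the model file name and the naive tag-to-field mapping are assumptions for illustration, not something from the thread:

```java
import java.io.FileInputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class QuestionToSolr {
    public static void main(String[] args) throws Exception {
        // Pre-trained English POS model, downloaded separately from OpenNLP.
        POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));
        POSTaggerME tagger = new POSTaggerME(model);

        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize("Who likes Starbucks");
        String[] tags = tagger.tag(tokens); // Penn tags, e.g. WP, VBZ, NNP

        StringBuilder q = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (tags[i].startsWith("W")) {             // wh-word -> wildcard subject
                q.append("subject:* ");
            } else if (tags[i].startsWith("VB")) {     // verb
                q.append("verb:").append(tokens[i].toLowerCase()).append(" ");
            } else if (tags[i].startsWith("NN")) {     // noun -> treat as object
                q.append("object:").append(tokens[i].toLowerCase()).append(" ");
            }
        }
        System.out.println(q.toString().trim()); // subject:* verb:likes object:starbucks
    }
}
```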
I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.
I think searching for words and phrases can be done by looking up a dictionary of unique words that references the files containing each word, but is there a way to get reasonably fast regex matching?
I don't mind using existing software if such exists.
Consider Lucene: http://lucene.apache.org/java/docs/index.html
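One caveat: Lucene's RegexpQuery matches individual indexed terms rather than raw text, so a pattern cannot span word boundaries. A minimal sketch against an existing index (the index path and field names are placeholders):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class RegexSearch {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        // Matches any single term like "color" or "colour" in the "body" field.
        RegexpQuery query = new RegexpQuery(new Term("body", "colou?r"));
        TopDocs hits = searcher.search(query, 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            // "path" is a hypothetical stored field holding the file location.
            System.out.println(searcher.doc(sd.doc).get("path"));
        }
        reader.close();
    }
}
```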
There are quite a few options on the market that will help you achieve what you want; some are open source and some are paid:
Open source:
Elasticsearch - based on Lucene
Constellio - based on Lucene
Sphinx - written in C++
Solr - built on top of Lucene
You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx
This list of open-source search engines may also be helpful: http://blog.webdistortion.com/2011/05/29/open-source-search-engines/