Is it possible to get the keywords in the documents returned by Solr?

Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents by the keywords inside them. Or do I have to work out the key words myself and save them in a field?

Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
Before implementing it though, it is worth checking the Analysis screen of the Solr (4+) Admin WebUI, as it has a section that shows the terms/tokens a particular field actually generates.
If these are not quite the keywords that you are trying to produce, you may need to have a separate field that generates those keywords, possibly by using an UpdateRequestProcessor in the indexing pipeline.
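For illustration, a term-vector request over Solr's HTTP API might look roughly like the sketch below. The core name, field names and query are assumptions, and the request handler has to have the term vector component enabled (in default configs it is often exposed at /tvrh):

import requests

# A rough sketch, not a drop-in implementation: the core name "articles" and the
# fields "author", "created" and "body" are assumptions, and "body" would need to
# be indexed with termVectors="true" for the component to return anything useful.
url = "http://localhost:8983/solr/articles/select"
params = {
    "q": 'author:"Joe Blogs" AND created:[NOW-7DAYS TO NOW]',
    "fl": "id",
    "tv": "true",     # enable the TermVectorComponent
    "tv.fl": "body",  # field(s) to return term vectors for
    "tv.tf": "true",  # include term frequencies
    "wt": "json",
}

response = requests.get(url, params=params).json()
# Tokens per document come back under "termVectors"; the exact shape of this
# section varies between Solr versions.
print(response.get("termVectors"))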
Finally, if you are trying to get a feel for the contents in order to do some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.

What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this. However, there are other tools that you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Architecture). I won't bother typing about it; instead, here is a brilliant presentation.

Related

what is the difference between suggester component and carrot2 cluster in solr?

I need to know, in common language, what the difference between them is, what exactly they are used for, and how. I have a Solr project that suggests results based on queries as a personalization approach.
Which one can be used?
They're very different features. Carrot2 is a clusterer - i.e. it finds clusters of similar documents that belong together. That means that it attempts to determine which documents describe the same thing, and group them together based on these characteristics.
The suggester component is mainly used for autocomplete-like features, where you're giving the user suggestions on what to search for (i.e. trying to guess what the user wants to accomplish before he or she has typed all of their query).
Neither is intended for personalization. You might want to look at Learning to Rank to apply certain models based on what you know about the input from the user. You'll have to find out which features you have that describe your users and apply those as external feature information.
There's also a feature to examine semantic knowledge graphs (i.e. "this concept is positively related to this other concept"), but that's probably outside of what you're looking for.
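To make the suggester side concrete, an autocomplete request might look roughly like this sketch; the core name, the /suggest handler and the dictionary name are assumptions that depend on how the suggest component is configured in solrconfig.xml:

import requests

# Hypothetical core "products" with a suggester named "mySuggester"
# configured behind a /suggest request handler.
url = "http://localhost:8983/solr/products/suggest"
params = {
    "suggest": "true",
    "suggest.dictionary": "mySuggester",  # name of the configured suggester
    "suggest.q": "ipho",                  # the partial input typed by the user
    "wt": "json",
}

response = requests.get(url, params=params).json()
# Suggestions come back under "suggest" -> dictionary name -> typed prefix.
print(response.get("suggest"))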

Are there any NoSQL databases that can do search (like Lucene) on map/reduce?

I'm using Cloudant, where I can use MapReduce to project views of the data, and it can also search documents with Lucene.
But these two features are separate and cannot be used together.
Suppose I make a game with user data like this:
{
  "name": "",
  "items": []
}
Each user has items. Then I want to let a user find all swords with quality +10. With Cloudant I might project type and quality as the key and query with key=["sword",10].
But it cannot handle queries more complex than that, the way Lucene could. To use the Lucene search I would need to normalize all items into separate documents and reference them back to their owner.
I really wish I could do a Lucene search on the keys of a data projection. I mean, instead of normalizing, I could store nested documents as I want and use map/reduce to project the data inside each document, so I could search for items directly.
PS: If the database also supported partial updates by scripting and had an inherent transactional update feature, that would be the best.
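For reference, querying such a projected view in Cloudant looks roughly like the sketch below; the account, database, design document and view names are hypothetical, and credentials are trimmed to placeholders:

import json
import requests

# Hypothetical view that emits [item.type, item.quality] as the key
# for every item owned by a user document.
url = ("https://ACCOUNT.cloudant.com/gamedata"
       "/_design/items/_view/by_type_and_quality")
params = {
    "key": json.dumps(["sword", 10]),  # view keys must be JSON-encoded
    "include_docs": "true",            # also return the owning user documents
}

rows = requests.get(url, params=params, auth=("user", "password")).json()["rows"]
for row in rows:
    print(row["doc"]["name"])  # name of the user who owns a matching sword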
I'd suggest trying out Elasticsearch.
It seems like your use case should be covered by the search API.
If you need to do more complex analytics, Elasticsearch supports aggregations.
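As a rough sketch of how the example from the question could be expressed with Elasticsearch (the index name and field names are assumptions, and items would need to be mapped as a nested type for this to work):

import requests

# Hypothetical index "users" where "items" is mapped as a nested field.
url = "http://localhost:9200/users/_search"
body = {
    "query": {
        "nested": {
            "path": "items",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"items.type": "sword"}},
                        {"term": {"items.quality": 10}},
                    ]
                }
            },
        }
    }
}

hits = requests.post(url, json=body).json()["hits"]["hits"]
for hit in hits:
    print(hit["_source"]["name"])  # users owning a +10 sword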
I am not at all sure that I got the question correctly, but you may want to take a look at Riak. It offers a Solr-based search, which is quite well documented. I have used it in the past for distributed search over a distributed key-value index and it was quite fast.
If you use this, you will also need to take a look at the syntax of Solr queries, so I am adding it here to save you some time. However, keep in mind that not all of those Solr query features were available in Riak (at least that was the case when I used it).
There are several solutions that would do the job. I can give my 2 cents by proposing the well-established MongoDB. With MongoDB you can create a text index on a given field and then do a full-text search as explained here. The feature has been in MongoDB since version 2.4, and the syntax is well documented in the MongoDB docs.
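A minimal sketch of that with pymongo, using hypothetical database, collection and field names (note the $text query operator arrived in MongoDB 2.6; 2.4 exposed text search through the text command):

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
items = client["gamedb"]["items"]  # hypothetical database and collection

# Create a text index on the "name" field.
items.create_index([("name", pymongo.TEXT)])

# Full-text search via $text, combined with an ordinary filter.
for doc in items.find({"$text": {"$search": "sword"}, "quality": 10}):
    print(doc)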

Apache SOLR design, searching entire text field or just keywords?

For my SOLR setup, I can configure it so when a user searches for some articles in a database, the search engine does a full text search of the entire body text.
However, I also have the code which does a keyword extraction of the body text. Is it recommended to just allow SOLR to perform a full text search on the keywords of an article, or is it still better to let SOLR just perform a full text extraction on the article body itself?
I'd rather not do both; one or the other would be nice. I'm on limited RAM and can only keep one of the two fields, keywords or the article body.
Reasoning and an answer would be nice, thank you.
Is it recommended to just allow SOLR to perform a full text search on the keywords of an article, or is it still better to let SOLR just perform a full text extraction on the article body itself?
Yes, SOLR is great at full-text indexing. Rather than reinvent the wheel (search algorithms, stop word filtering, boosting, etc.), let SOLR index the content in your database. If you need to ignore certain words in the article text, you can configure stop words in stopwords.txt.

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about any one of them, you can often be sure that each of the other tools is going to be mentioned as well.
I don't expect you to explain every tool to me, of course not. If you could help me narrow this set down for my particular scenario, it would be great. So far I'm not sure which of the above will fit, and it looks like (as always) there is more than one way of doing what needs to be done.
The scenario is: 500 GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents is stored in a SQL db (sender, recipients, date, department etc.). The main source of documents will be ExchangeServer (emails and attachments), but not the only one. Now to the search: the user needs to be able to do complex full-text searches over those documents. Basically he'll be presented with a search-config panel (a Java desktop application, not a webapp), he'll set a date range, document types, senders/recipients, keywords etc., fire the search, and get the resulting list of documents (and for each document, info on why it's included in the search results, i.e. which keywords were found in the document).
Which tools should I take into consideration and which not? The point is to develop such a solution with only the minimal required "glue" code. I'm proficient with SQL databases but quite uncomfortable with Apache and related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadoop + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with Solr is a good option. I have used it for a scenario similar to the one you described above. You can use Solr for really huge data, as it is a distributed index server.
But to get the metadata out of all of these document formats you should be using some other tool. Basically your workflow will be this:
1) Use the Hadoop cluster to store the data.
2) Extract the data in the Hadoop cluster using map/reduce.
3) Do document identification (identify the document type).
4) Extract the metadata from these documents.
5) Index the metadata in the Solr server, and store other ingestion information in the database.
6) The Solr server is a distributed index server, so for each ingestion you could create a new shard or index.
7) When a search is required, search across all the indexes.
8) Solr supports all the complex searches, so you don't have to build your own search engine.
9) It also does paging for you.
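As a rough sketch of step 5, pushing the extracted text and metadata into Solr can be a plain HTTP update call; the core name and field names below are assumptions that mirror the metadata mentioned in the question:

import requests

# Hypothetical core "documents" with fields matching the question's metadata.
url = "http://localhost:8983/solr/documents/update"
docs = [
    {
        "id": "email-00042",
        "doc_type": "email",
        "sender": "joe.blogs@example.com",
        "recipients": ["jane.doe@example.com"],
        "date": "2012-06-01T10:15:00Z",
        "department": "Legal",
        "body": "Extracted full text of the message goes here...",
    }
]

# commit=true keeps the sketch simple; a real ingestion pipeline would batch
# documents and rely on commitWithin or autoCommit instead.
resp = requests.post(url, json=docs, params={"commit": "true"})
resp.raise_for_status()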
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase and then graft search on. It sounds like you know from the get-go that search is what you want, so you can probably embed the secondary indexing into the pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick-client GUI, all of this work is already done for you in the browser. And Solr comes with a free faceted search system out of the box: just follow along with the tutorial.
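For example, the search-config panel described in the question maps quite naturally onto a filtered, faceted query with highlighting; the core name and field names here are assumptions:

import requests

# Hypothetical core "documents" with the metadata fields from the question.
url = "http://localhost:8983/solr/documents/select"
params = {
    "q": "body:contract",  # the user's keywords
    "fq": [                # filters set in the search-config panel
        "date:[2012-01-01T00:00:00Z TO 2012-06-30T23:59:59Z]",
        "doc_type:(email OR pdf)",
        'sender:"joe.blogs@example.com"',
    ],
    "facet": "true",
    "facet.field": ["department", "doc_type"],  # counts to drive the UI
    "hl": "true",    # highlighting shows why a document matched
    "hl.fl": "body",
    "wt": "json",
}

response = requests.get(url, params=params).json()
print(response["facet_counts"]["facet_fields"])
print(response["highlighting"])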
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
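For illustration, a manually sharded query (pre-SolrCloud) lists its shards explicitly and the receiving node merges the results; the host names and core name here are placeholders:

import requests

url = "http://solr1.example.com:8983/solr/documents/select"
params = {
    "q": "body:contract",
    # The node that receives the request fans it out to every listed shard
    # and merges the partial results.
    "shards": "solr1.example.com:8983/solr/documents,"
              "solr2.example.com:8983/solr/documents",
    "wt": "json",
}

print(requests.get(url, params=params).json()["response"]["numFound"])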
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but I don't have much experience with it regarding scale. It uses the same Lucene engine, so I'd expect the search feature-set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data into a consistent, schema-friendly format for SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is that it is a little more complex to set up, but it will save you from the scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in alpha for Solr (Solr 4.x is alpha at the moment).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than I am, and it does have the advantage of clustering and rebalancing based on the number of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is that the current state of the ElasticSearch documentation leaves a lot to be desired.
As a side note, you can't really say the documents are stored in Hadoop; they are stored in a distributed file system (most probably HDFS, since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a Java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through web services. So you should also take a look at Solr, as it allows the handling of different types of documents (Lucene puts the responsibility of interpreting the document (PDF, Word, etc.) on your shoulders, but you probably can already do that).

Solr suggest - method for populating a dictionary file based on prior successful searches

I am working on providing auto-suggest functionality in our search application (which uses Solr) based on terms used in previous successful searches. In the Solr suggest documentation (http://wiki.apache.org/solr/Suggester), I see how to do this using a dictionary file. My question is: Does Solr have any utilities for populating a dictionary file, or do I need to write my own?
I haven't found utilities for doing this in the Solr documentation. But before I started to write my own job to build the dictionary file, I figured it's worth asking this question.
We had a similar requirement for providing autosuggest on popular searches.
Solr did not provide an out-of-the-box solution for suggestions based on previous searches; its suggester works off the index or off a dictionary.
You can use the Solr Suggester and build a new index with the successful searches and base suggestions on it.
Otherwise, you would need to build a custom solution for searches weighted by popularity.
What we tried was this:
Building an index with the searches and their counts, and providing autosuggest based on it.
We didn't want to maintain the count, and the overhead of maintaining and incrementing it, so we instead indexed each successful search term multiple times, once for every time it was searched for.
We used the Terms Component, as it provides a terms.prefix option to get the autosuggest feature, and since the terms are returned based on their frequency, that took care of the popularity.
As the field used for the terms was only marked as indexed and not stored, the index size did not grow considerably even with the same terms being pushed multiple times.
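As a rough sketch, a Terms Component request for that kind of setup might look like this; the core name "searchlog" and the field "query_terms" are assumptions:

import requests

# Hypothetical core "searchlog" with an indexed-only field "query_terms"
# to which every successful search phrase is appended.
url = "http://localhost:8983/solr/searchlog/terms"
params = {
    "terms": "true",
    "terms.fl": "query_terms",  # field to pull terms from
    "terms.prefix": "sol",      # what the user has typed so far
    "terms.sort": "count",      # most frequent (most popular) first
    "terms.limit": 10,
    "wt": "json",
}

response = requests.get(url, params=params).json()
# Older Solr versions return a flat list alternating term and count.
print(response["terms"]["query_terms"])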
