I would like to be able to search a CouchDB database using Solr. Are there any projects that provide such an integration?
I am also aware of CouchDB-Lucene. Is there a way to hook Solr into that?
Thanks!
It would make more sense to roll your own, given how wasy it easy. First you need to decide what kind of SOLR schema to use and how to map your CouchDB documents onto that schema. Then simple iterate through all the documents in a db Pagination in CouchDB? and generate SOLR <add> documents.
People do this all the time with all kinds of data sources. Since SOLR is essentially searching a single table, the hard work is often figuring out how to map your database format onto a single table. Read up on what you can do with the SOLR schema, and you may be surprised at how easy this is.
There is a CouchDB integration for ElasticSearch available, apart from feeding ElasticSearch with JSON on your own. Both work with schema-less JSON, so it's very easy to integrate them.
In terms of features, ElasticSearch would offer a comparable set to Solr (in addition to some unique features, of course.)
According to this
http://wiki.apache.org/couchdb/Related_Projects
there was a CouchDB-Solr2 project (scroll down to the end), which is no longer maintained.
Related
I am developing a web application where I want to use Solr for search only and keep my data on another Database.
I will be having 2 databases: one Relational (Sql Server) and the other will be a copy of it on the NoSQL Solr database.
I'll be searching for specific fields in the solr documents e.g(by id,name,type and join queries) i.e NOT full text search.
I know Solr strength is in full text search by creating inverted index on the documents data, now i want to know does it also helps in my case by creating another type of index on my documents which make normal searching faster than sql server index?
Yes, it will help you.
You need to consider what is your requirement. What is your preference?
If you have the solr as another additional option which will be used for the searching the application data, you need to consider that you have to constantly update the solr. You will need additional infrastructure and all.
If the performance is your main criteria and you don't want to put any search load on your RDBMS then you can add the solr to your system. Also consider how big your data is in the RDBMS. Because RDBMS system are also enough strong to support searching data.
Considering all the above aspects you can take the decision.
I am trying to compare data ingested into both Accumulo and Solr from the same source XML. The data ingested into Accumulo is legacy code while Solr is new code. I can easily extract out data from Solr using SolrCloud and choosing CSV or JSON, which is easily readable. But I'm at a loss for how to easily view the data in Accumulo. I used scan to view the data, but it is not easily readable. Is there a way to export the data in Accumulo to a CSV or something similar so it will be easy to read/compare with other datasets?
As I understand it, Apache Solr is a document store which uses Lucene indexes to make search fast via a web-based REST interface. On the other hand, Apache Accumulo is a massively scalable sorted key-value store, which stores arbitrary key-value pairs with cell-level security labels, in accordance with the user's application, queryable with a Java API. It makes no sense to compare the two. They are entirely different applications. Accumulo is a low-level infrastructure application, upon which you can build complex systems, such as a search engine comparable to Solr, but it is not directly comparable to Solr because Accumulo is not a search engine.
To answer your question about how to view data in Accumulo, the answer is to use its Java API. I recommend starting with the Tour on its web page, for some examples of how to query it. As for how the data is presented, and in what form, that depends on the application which ingested it in the first place. It can be arbitrary binary data in byte arrays and may not be directly viewable; that depends on the application. Accumulo is agnostic to the nature of the data stored in its key-value pairs.
What you were probably referring to in your question, when you said "I used scan to view the data", you were probably referring to the scan command in Accumulo's shell. You should probably be aware that the shell is not the primary interface for query. It is intended for system administration and triage of data ingest. The Java API is the primary means of querying.
The Accumulo open source community is pretty responsive to questions. If you're having trouble figuring out how best to use it for your needs, I would advise to ask on their community mailing lists, which can be found at their website. StackOverflow is more suitable for very specific questions than generalized "getting started" kinds of tutorials.
I'm using cloudant which I could use mapreduce to project view of data and also it could search document with lucene
But these 2 feature is separate and cannot be used together
Suppose I make a game with userdata like this
{
name: ""
items:[]
}
Each user has item. Then I want to let user find all swords with quality +10. With cloudant I might project type and quality as key and use query key=["sword",10]
But it cannot make query more complex than that like lucene could do. To do lucene I need to normalize all items to be document and reference it with owner
I really wish I could do a lucene search on a key of data projection. I mean, instead of normalization, I could store nested document as I want and use map/reduce to project data inside document so I could search for items directly
PS. If that database has partial update by scripting and inherently has transaction update feature that would be the best
I'd suggest trying out elasticsearch.
Seems like your use case should be covered by the search api
If you need to do more complex analytics elasticsearch supports aggregations.
I am not at all sure that I got the question correctly, but you may want to take a look at riak. It offers a solr-based search, which is quite well documented. I have used it in the past for distributed search over a distributed key-value index and it was quite fast.
If you use this, you will also need to take a look at the syntax of solr queries, so I add it here to save you some time. However, keep in mind that not all of those solr query functionalities were available in riak (at least that was when I used it).
There are several solutions that would do the job. I can give my 2 cents proposing the well established MongoDB. With MongoDB you can create a text-Index on a given field and then do a full text Search as explained here. The feature is in MongoDb since version 2.4 and the syntax is well documented on MongoDB docs.
Is solr just for searching ie it's not for 'updating' or 'inserting' data?
My site is currently MySQL based, and on looking at SOLR as an alt option, I see you make your queries through http requests.
My first thought was - how do you stop someone from making a query that updates or inserts data?
Obviously, I'm not understanding SOLR, hence my question here.
Cheers
Solr mainly is for Full Text search, and rather should not be used as a Persistent store.
Solr stores its data in the File store and does not provide the features of Relational database (ACID or Nested Entities etc )
Usually, the model followed is use Relationship database for you data management.
Replicate the data into Solr for Full Text search.
You can always control the Insert/Update access for Solr by securing the urls.
I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about the one you can be often sure that each of the others tools is going to be mentioned.
I don't expect you to explain every tool to me - sure not. If you could help me to narrow this set for my particular scenario it would be great. So far I'm not sure which of the above will fit and it looks like (as always) there are more then one way of doing what's to be done.
The scenario is: 500GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents stored in SQL db (sender, recipients, date, department etc.) Main source of documents will be ExchangeServer (emails and attachments), but not only. Now to the search: User needs to be able to do complex full-text searches over those documents. Basicaly he'll be presented with some search-config panel (java desktop application, not webapp) - he'll set date range, document types, senders/recipients, keywords etc. - fire the search and get the resulting list of the documents (and for each document info why its included in search results i.e. which keywords are found in document).
Which tools I should take into consideration and which not? The point is to develop such solution with only minimal required "glue"-code. I'm proficient in SQLdbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadopp + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with solr is a good option. I have used it for similar scenario you described above. You can use solr for real huge data as its a distributed index server.
But to get the meta data about all of these documents formats you should be using some other tool. Basically your workflow will be this.
1) Use hadoop cluster to store data.
2) Extract data in hadoop cluster using map/redcue
3) Do document identification( identify document type)
4) Extract meta data from these document.
5) Index metadata in solr server, store other ingestion information in database
6) Solr server is distributed index server, so for each ingestion you could create a new shard or index.
7) When search is required search on all the indexs.
8) Solr supports all the complex searches , so you don't have to make your own search engine.
9) It also does paging for you as well.
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but i don't have much experience with it regarding scale. It uses the same Lucene engine so i'd expect the search feature-set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is it is a little more complex to setup, but will save you with the scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in Alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't say the documents are stored in Hadoop, they are stored in a distributed file system (most probably HDFS since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through WebServices. So you should also take a look at Solr as it allows the handling of different types of documents (Lucene puts the responsability of interpreting the document (PDF, Word, etc) on your shoulders but you, probably, can already do that)