Amazon CloudSearch API compatibility with Solr API

With Amazon's CloudSearch being powered by Solr, I have certain questions before we proceed. Maybe someone who has experience with both can guide us.
How compatible is Amazon CloudSearch's API with the Solr API? Are they the same, or radically different?
Is it compatible with queries performed via SolrNet?
How different is it from Solr?
The reason we're asking: we need to migrate an application from Solr to Amazon CloudSearch, and before we proceed we need some idea of how this works.
I checked the Amazon CloudSearch documentation but was unable to find any details on this.

Amazon CloudSearch is based on Solr, so conceptually they work the same way, but Amazon has written its own wrapper on top of the Solr API. Answers to the questions, in the same order, below:
Amazon CloudSearch has two endpoints, one for search and the other for indexing documents. It provides multiple ways of indexing, such as S3 documents, JSON, XML, etc.
Amazon CloudSearch has four different query parsers (simple, structured, lucene, and dismax); if you issue queries using lucene as the query parser, the query syntax and functionality are the same as Solr's (see the sketch after this list).
The only functional difference I observed is that CloudSearch doesn't support hierarchy in field data; it provides only text and literal datatypes, plus array variants of each for multi-valued fields.
There are migration tools available, such as http://www.8kmiles.com/blog/apache-solr-to-amazon-cloudsearch-migration-tool/, which support migration from Solr to CloudSearch.
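To illustrate the query-parser point, here is a minimal sketch of hitting a CloudSearch search endpoint with q.parser=lucene. The domain host name, field name, and query below are hypothetical; every CloudSearch domain gets its own search endpoint (plus a separate doc endpoint for indexing).

    // Minimal sketch of a CloudSearch query using the lucene query parser.
    // The domain host name and the "title" field are made up for illustration.
    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class CloudSearchLuceneQuery {
        public static void main(String[] args) throws Exception {
            String searchEndpoint =
                "https://search-mydomain-abc123.us-east-1.cloudsearch.amazonaws.com";
            // Familiar Lucene/Solr query syntax keeps working with q.parser=lucene
            String q = URLEncoder.encode("title:\"star wars\"", StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(searchEndpoint + "/2013-01-01/search"
                    + "?q=" + q + "&q.parser=lucene&size=10"))
                .GET()
                .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            // The response is CloudSearch's own JSON shape, not Solr's format
            System.out.println(response.body());
        }
    }

Note that the response body is CloudSearch's own JSON shape rather than Solr's response format, so a Solr client library such as SolrNet would likely not work against CloudSearch unmodified.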

Related

Pros and cons using Lucidworks Fusion instead of regular Solr

I want to know the pros and cons of using Fusion instead of regular Solr. Can you give some examples (like a problem that can be solved easily using Fusion)?
First of all, I should disclose that I am the Product Manager for Lucidworks Fusion.
You seem to already be aware that Fusion works with Solr (or one or more Solr clusters or instances), using Solr for data storage and querying. The purpose of Fusion is to make it easier to use Solr, integrate Solr, and to build complex solutions that make use of Solr. Some of the things that Fusion provides that many people find helpful for this include:
Connectors and a connector framework. Bare Solr gives you a good API and the ability to push certain types of files at the command line. Fusion comes with several pre-built data source connectors that fetch data from various types of systems, process them as appropriate (including parsing, transformation, and field mapping), and send the results to Solr. These connectors include common document stores (cloud and on-premise), relational databases, NoSQL data stores, HDFS, enterprise applications, and a very powerful and configurable web crawler.
Security integration. Solr does not have any authentication or authorizations (though as of version 5.2 this week, it does have a pluggable API and a basic implementation of Kerberos for authentication). Fusion wraps the Solr APIs with a secured version. Fusion has clean integrations into LDAP, Active Directory, and Kerberos for authentication. It also has a fine-grained authorizations model for managing and configuring Fusion and Solr. And, the Fusion authorizations model can automatically link group memberships from LDAP/AD with access control lists from the Fusion Connectors data sources so that you get document-level access control mirrored from your source systems when you run search queries.
Pipelines processing model. Fusion provides a pipeline model with modular stages (in both API and GUI form) to make it easier to define and edit transformations of data and documents. It is analogous to unix shell pipes (a generic sketch of this staged model appears at the end of this answer). For example, while indexing you can include stages to define mappings of fields, compute new fields, aggregate documents, pull in data from other sources, etc. before writing to Solr. When querying, you could do the same, along with transforming the query, running and returning the results of other analytics, and applying security filtering.
Admin GUI. Fusion has a web UI for viewing and configuring the above (as well as the base Solr config). We think this is convenient for people who want to use Solr, but don't use it regularly enough to remember how to use the APIs, config files, and command line tools.
Sophisticated search-based features: Using the pipelines model described above, Fusion includes (and makes easy to use) some richer search-based components, including: natural language processing and entity extraction modules; real-time signals-driven relevancy adjustment. We intend to provide more of these in the future.
Analytics processing: Fusion includes and integrates Apache Spark for running deep analytics against data stored in Solr (or on its way into Solr). While Solr implicitly includes certain data analytics capabilities, that is not its main purpose. We use Apache Spark to drive Fusion's signals extraction and relevancy tuning, and expect to expose APIs so users can easily run other processing there.
Other: many useful miscellaneous features, like a dashboarding UI; a basic search UI with manual relevancy tuning; easier monitoring; job management and scheduling; real-time alerting with email integration; and more.
A lot of the above can of course be built or written against Solr, without Fusion, but we think that providing these kinds of enterprise integrations will be valuable to many people.
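To make the pipeline idea above concrete, here is a generic illustration of the staged model. This is not Fusion's actual API; the stage logic and field names are invented for illustration.

    // Generic illustration of a staged document pipeline (not Fusion's API).
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.UnaryOperator;

    public class PipelineSketch {
        // Run a document through an ordered list of stages, unix-pipe style
        static Map<String, Object> run(Map<String, Object> doc,
                List<UnaryOperator<Map<String, Object>>> stages) {
            for (UnaryOperator<Map<String, Object>> stage : stages) {
                doc = stage.apply(doc);
            }
            return doc;
        }

        public static void main(String[] args) {
            // Invented stages: a field mapping and a computed field
            UnaryOperator<Map<String, Object>> mapFields = d -> {
                d.put("title_s", d.remove("title"));
                return d;
            };
            UnaryOperator<Map<String, Object>> computeField = d -> {
                d.put("title_len_i", d.get("title_s").toString().length());
                return d;
            };
            Map<String, Object> doc = new HashMap<>();
            doc.put("title", "hello");
            // A real index pipeline would end with a stage that writes to Solr
            System.out.println(run(doc, List.of(mapFields, computeField)));
        }
    }

The point is that each stage is a small, reorderable transformation, in the spirit of unix pipes.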
Pros:
Connectors: Lucidworks provides a wide range of connectors, with which you can connect to data sources and pull data from them.
Reusability: In Lucidworks you can create pipelines for data ingestion and data retrieval. You can create pipelines with common logic so that they can be reused in other pipelines.
Security: You can apply restrictions over data, i.e., security trimming. Lucidworks provides built-in query-pipeline stages for security trimming, or you can write a custom pipeline stage for your use case.
Troubleshooting: Lucidworks comes with discrete services (API, connectors, Solr). You can troubleshoot issues per service, since each service has its own logs. You can also configure JVM properties for each service.
Support: Lucidworks support is available 24/7. You can create a support case according to severity, and they will schedule a call for you.
Cons:
Not much, but it keeps you away from your normal development; you don't get much chance to open your IDE and start coding.

Can GSA use data indexed by Apache Solr for search as a combined solution?

It is observed that Google does not provide good indexing through its enterprise search solution, the Google Search Appliance (GSA), but Apache Solr has good indexing capability. Can we use Apache Solr to index documents and then have those documents searched through the GSA server, so that we get the best of both worlds? Kindly give your thoughts.
Can you please provide more details on why you think the GSA "does not provide good indexing"?
The GSA is generally recognised as being the best, or at least one of the best, when it comes to result relevancy. When it comes to non-web content, Google supplies multiple connectors to allow you to index this content in the GSA, and if you have a content source that is neither web based nor covered by one of the Google connectors, it is not difficult to write your own.
So I'm not sure why you think the indexing is not good, it would be really helpful if you could elaborate.
Mohan is incorrect when he says that you cannot serve Solr content via a GSA; you certainly can do this. What you will need to do is create a OneBox module so that you can federate Solr results in real time; they will be presented to the right of the main GSA results.
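Here is a rough sketch of such a OneBox provider: a small HTTP endpoint that the GSA calls at query time, which fetches matching documents from Solr and returns XML. The port, Solr core name, and response body below are placeholders; the actual results schema the GSA expects is defined in Google's OneBox Developer's Guide.

    // Rough sketch of a OneBox provider endpoint (core name and response
    // body are placeholders; see Google's OneBox Developer's Guide for the
    // real results XML schema the GSA expects).
    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SolrOneBoxProvider {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
            server.createContext("/onebox", exchange -> {
                String query = exchange.getRequestURI().getQuery(); // e.g. "q=foo"
                String solrJson;
                try {
                    // Federate the user's query to Solr in real time
                    solrJson = HttpClient.newHttpClient().send(
                        HttpRequest.newBuilder()
                            .uri(URI.create("http://localhost:8983/solr/mycore/select?"
                                + query + "&rows=3"))
                            .build(),
                        HttpResponse.BodyHandlers.ofString()).body();
                } catch (Exception e) {
                    solrJson = "";
                }
                // TODO: map solrJson into the OneBox results XML schema here
                byte[] body = ("<!-- placeholder; real schema per the OneBox guide -->"
                    + "<results>" + solrJson.length() + " bytes from Solr</results>")
                    .getBytes();
                exchange.getResponseHeaders().add("Content-Type", "text/xml");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(body);
                }
            });
            server.start();
        }
    }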
What is your data source?
If it is a website crawl, then to my (limited) knowledge the GSA provides more sophisticated crawling/indexing capability for websites than Solr does, because Solr needs an external toolkit such as Tika or Nutch for crawling web resources. The GSA, on the other hand, has its own crawler, which makes crawling simple and effective.
Regarding your question on indexing through Solr and serving through the GSA: it is possible through a OneBox module (refer to BigMikeW's answer).
If you can provide some information about your data sources, it might help people suggest the best solution for increasing indexing capability in the GSA.

Using Fuseki to federate different SPARQL endpoints

I work for a large organization with many different relational datastores containing overlapping information. We are looking for a solution for integrated querying on all our data at once. We are considering using Semantic Web technology for this purpose. Specifically, we plan to:
Create a unified ontology
Map each database to this ontology
Create a SPARQL endpoint for each database
Use a federation engine to unify them to one endpoint.
I am now in search of appropriate tools for the last stage of this plan. I have heard that Fuseki is appropriate for this case, but have been unable to find any relevant documentation.
Can you please give your opinion on the appropriateness of Fuseki for this task, or even better, point me at some proper documentation?
Thanks in advance!
Oren
http://jena.apache.org/
You want to read about Fuseki, but also about SPARQL basic federated query. Fuseki itself is a query server; the query engine is ARQ.
This is the simplest example of a federated query that is supported by the ARQ engine (the backend to Fuseki):
https://jena.apache.org/documentation/query/service.html
You submit this query to your Fuseki endpoint, and it will go off and query the endpoint named inside the SERVICE <...> clause. This certainly works for me, using ARQ 2.9.4 with Fuseki 0.2.6.
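For reference, here is a minimal sketch of issuing such a query from Java with the Jena ARQ API. The endpoint URLs are hypothetical, and the package names are those of current Jena releases (older versions used the com.hp.hpl.jena packages).

    // Minimal federated-query sketch with Jena ARQ (hypothetical endpoints).
    import org.apache.jena.query.Query;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QueryFactory;
    import org.apache.jena.query.ResultSetFormatter;

    public class FederatedQueryDemo {
        public static void main(String[] args) {
            // The SERVICE clause sends that graph pattern to a remote SPARQL
            // endpoint; the bindings are merged back into the outer query.
            String sparql =
                "SELECT ?s ?p ?o WHERE {\n" +
                "  SERVICE <http://db1.example.org/sparql> { ?s ?p ?o }\n" +
                "} LIMIT 10";
            Query query = QueryFactory.create(sparql);
            // Submit to the local Fuseki endpoint; its ARQ engine resolves
            // the SERVICE call against the remote endpoint.
            try (QueryExecution qe = QueryExecutionFactory.sparqlService(
                    "http://localhost:3030/ds/query", query)) {
                ResultSetFormatter.out(System.out, qe.execSelect());
            }
        }
    }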

Is there a project that integrates CouchDb and Solr?

I would like to be able to search a CouchDB database using Solr. Are there any projects that provide such an integration?
I am also aware of CouchDB-Lucene. Is there a way to hook Solr into that?
Thanks!
It would make more sense to roll your own, given how easy it is. First you need to decide what kind of SOLR schema to use and how to map your CouchDB documents onto that schema. Then simply iterate through all the documents in the database (see "Pagination in CouchDB?") and generate SOLR <add> documents.
People do this all the time with all kinds of data sources. Since SOLR is essentially searching a single table, the hard work is often figuring out how to map your database format onto a single table. Read up on what you can do with the SOLR schema, and you may be surprised at how easy this is.
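As a rough sketch of that roll-your-own loop (the database name, core name, and field mapping below are assumptions), this reads _all_docs with include_docs=true and lets SolrJ build the <add> requests:

    // Rough sketch: read CouchDB docs and index them into Solr with SolrJ
    // (database "mydb", core "mycore", and the field names are assumptions).
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.json.JSONArray;
    import org.json.JSONObject;

    public class CouchToSolr {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                String body = http.send(
                    HttpRequest.newBuilder().uri(URI.create(
                        "http://localhost:5984/mydb/_all_docs?include_docs=true"))
                        .build(),
                    HttpResponse.BodyHandlers.ofString()).body();
                JSONArray rows = new JSONObject(body).getJSONArray("rows");
                for (int i = 0; i < rows.length(); i++) {
                    JSONObject doc = rows.getJSONObject(i).getJSONObject("doc");
                    SolrInputDocument sdoc = new SolrInputDocument();
                    // Map CouchDB fields onto your Solr schema here
                    sdoc.addField("id", doc.getString("_id"));
                    sdoc.addField("body_t", doc.toString());
                    solr.add(sdoc); // SolrJ builds the <add> document for you
                }
                solr.commit();
            }
        }
    }

For a large database you would page through _all_docs in batches (limit/startkey) rather than fetching everything in one request, and batch the add calls.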
There is a CouchDB integration for ElasticSearch available, apart from feeding ElasticSearch with JSON on your own. Both work with schema-less JSON, so it's very easy to integrate them.
In terms of features, ElasticSearch would offer a comparable set to Solr (in addition to some unique features, of course.)
According to http://wiki.apache.org/couchdb/Related_Projects, there was a CouchDB-Solr2 project (scroll down to the end), which is no longer maintained.

Hadoop to create an index and Add() it to distributed Solr... is this possible? Should I use Nutch? ...Cloudera?

Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If the above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc...), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
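As a sketch of that tokenize/parse step (assuming, as one possible layout, that the input is a SequenceFile of file name to raw bytes), a mapper could use Apache Tika to turn the binary formats into plain text before the index-building phase:

    // Sketch of the document-preparation step: Tika extracts plain text from
    // binary formats; a later phase would feed this into a Lucene IndexWriter.
    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;

    public class ExtractTextMapper extends Mapper<Text, BytesWritable, Text, Text> {
        private final Tika tika = new Tika();

        @Override
        protected void map(Text fileName, BytesWritable raw, Context ctx)
                throws IOException, InterruptedException {
            try {
                // Tika auto-detects PDF, DOC, DOCX, OLE2, etc.
                String text = tika.parseToString(
                        new ByteArrayInputStream(raw.copyBytes()));
                ctx.write(fileName, new Text(text));
            } catch (TikaException e) {
                ctx.getCounter("indexing", "parse_failures").increment(1);
            }
        }
    }

A later phase (for example, a reducer) would write this text into a Lucene IndexWriter on local disk and copy the finished index files out to the Solr instances; that copy-and-activate step is the part this sketch leaves out.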
The reason I bring .NET into the picture is because we are mostly a .NET shop. The only Unix/Java we will have is Solr, and we have a front end that leverages the REST interface via SolrNet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I do so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index-merging, and query-answering toolkit that's based on Hadoop core.
You shouldn't mix up Cloudera, Hadoop, Nutch, and Lucene; they are different layers. You'll most likely end up using all of them:
Nutch is the name of the indexing / query-answering machinery (like Solr).
Nutch itself runs using a Hadoop cluster (which heavily uses its own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query answering frontend, which you can use, or you can attach a Solr frontend and use Lucene indexes from there.
Finally, the Cloudera Hadoop Distribution (CDH) is just a Hadoop distribution with several dozen patches applied to make it more stable and to backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding-edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with them, you can do the front ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I managed such a project, I'd consider alternatives, especially if your task of crawling and indexing is limited (i.e., you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra (https://github.com/tjake/Lucandra), a Cassandra-based back end for Lucene/Solr? You can use Hadoop to populate the Cassandra store with the index of your data.
