Apache Nutch REST API to retrieve data from server running Nutch?

I am using the Nutch REST API to run Nutch crawls on a separate server. I would like to retrieve the crawled data back to my local machine. Is there a way I can use the Nutch dump functionality to dump the data and retrieve it via the API, or am I better off indexing the data into Solr and retrieving it from there?
Thanks for your help.

Currently, the REST API doesn't provide such functionality. Its main purpose is to configure and launch your crawl jobs. At its core, it allows you to set the configuration of a new crawl job and manage it (to some extent).
The transfer of the crawled data is up to you. That being said, I do have a couple of recommendations:
If you're sending the data into Solr/ES (or any other indexer), I would recommend getting the data directly from there. Both Solr and ES already provide a REST API, with the additional benefit that you can filter which data to "copy over" (see the sketch after these recommendations).
If you're running Nutch in distributed mode (i.e., in a Hadoop cluster), try to use the Hadoop libraries to copy the data to the destination.
If none of this applies, then relying on something like rsync might be worth considering.
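For the first option, here is a minimal sketch of pulling the indexed documents back out of Solr over its REST API, written in Python with the requests library. The host, the core name (nutch) and the page size are assumptions about your setup, and cursorMark paging requires Solr 4.7 or later:

import json
import requests

# Host and core name are placeholders; adjust to where Nutch indexes its documents.
SOLR_URL = "http://solr-host:8983/solr/nutch/select"

def fetch_all(rows=500):
    """Page through the whole index with cursorMark so no document is fetched twice."""
    cursor = "*"
    while True:
        params = {
            "q": "*:*",
            "sort": "id asc",        # cursorMark requires a sort on the uniqueKey field
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        }
        resp = requests.get(SOLR_URL, params=params).json()
        docs = resp["response"]["docs"]
        if not docs:
            break
        yield from docs
        next_cursor = resp["nextCursorMark"]
        if next_cursor == cursor:    # last page reached
            break
        cursor = next_cursor

if __name__ == "__main__":
    with open("crawl_dump.jsonl", "w") as out:
        for doc in fetch_all():
            out.write(json.dumps(doc) + "\n")

Swapping the query or adding a field list lets you copy over only the parts of the crawl you actually need.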

Related

How to stream data from database via REST API?

I have a large amount of data stored in a Postgres database and I need to send it to the client via a REST API using Django. The requirement is to send the data in chunks and not load the entire content into memory at once. I understand that there is a StreamingHttpResponse class in Django, which I will explore. But are there any other, better options? I've heard about Kafka and Spark for streaming applications, but the tutorials I've checked tend to involve streaming live data, like interacting with Twitter data, etc. Is it possible to stream data from a database using either of these two? If yes, how do I then integrate it with REST so that clients can interact with it? Any leads would be appreciated. Thanks.
You could use Debezium or Apache Kafka Connect to bulk-load your database into Kafka.
Once the data is there, you can put a Kafka consumer either within your Django application or outside of it, and make REST requests as messages are consumed. Spark isn't really necessary here, and shouldn't be used within Django.
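As an illustration of the consumer side, here is a minimal sketch assuming the kafka-python client, a Debezium-style topic name, and a hypothetical downstream REST endpoint; all of those names are placeholders rather than anything prescribed by Debezium or Kafka Connect:

import json
import requests
from kafka import KafkaConsumer

# Topic name and downstream endpoint are placeholders for your own setup.
consumer = KafkaConsumer(
    "dbserver1.public.records",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")) if raw else None,
)

for message in consumer:
    if message.value is None:   # tombstone/delete markers carry no payload
        continue
    # Forward each change event to the REST endpoint as it is consumed.
    requests.post("http://localhost:8000/api/records/", json=message.value)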

Validating Solr queries against a schema

I would like to validate queries against a schema before actually executing them.
Is there an official API which will give me access to a schema, or will I have to parse the Solr configuration XML myself?
The usual trick for finding these resources is to open the Admin interface with the developer network tool running in your browser, then navigate to the resource you're looking for while watching which requests your browser performs. Since the frontend is purely JavaScript-based and runs in your browser, it accesses everything through the API exposed by Solr.
You'll have to parse something, probably in JSON or XML format. For my older 4.10.2 installation, the schema is available at:
/solr/corename/admin/file?file=schema.xml&contentType=text/xml;charset=utf-8
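A minimal sketch of fetching and inspecting that file from Python; the host and core name are placeholders, and the path is the 4.10.2-era admin file API mentioned above:

import requests
import xml.etree.ElementTree as ET

# Host and core name are placeholders; adjust to your installation.
URL = ("http://localhost:8983/solr/corename/admin/file"
       "?file=schema.xml&contentType=text/xml;charset=utf-8")

schema = ET.fromstring(requests.get(URL).text)

# Collect declared fields and dynamic fields so queries can be checked against them.
fields = {f.get("name"): f.get("type") for f in schema.iter("field")}
dynamic_fields = [f.get("name") for f in schema.iter("dynamicField")]

print(fields)
print(dynamic_fields)

Newer Solr releases also expose a structured Schema API under /solr/corename/schema (for example /schema/fields), which avoids parsing the XML at all.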

How to automate solr indexing?

Normally we do indexing in Solr from a browser. Can we do it automatically by writing a batch job or Java code?
Please give me some ideas on whether this is possible.
You can use the DataImportHandler, which can import from a lot of different sources such as databases or XML files: https://wiki.apache.org/solr/DataImportHandler
If you have specific requirements which are not satisfied by the DataImportHandler, you may implement your own indexer using a Solr client API:
https://cwiki.apache.org/confluence/display/solr/Client+APIs
If you want to work with Solr programmatically, take a look at SolrJ, a Java client API that will do what you're asking for.
You can use a web debugging proxy such as Fiddler to view the HTTP request that is generated when you trigger the data import via a web browser. Then send the same request from your Java code.
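Putting the suggestions together: the DataImportHandler is itself triggered over HTTP, so a batch job only needs to send the same requests the admin UI sends. Here is a minimal sketch in Python, where the host, core name and /dataimport handler path are assumptions about a typical DIH setup:

import time
import requests

# Host, core name and handler path are placeholders for a typical DIH configuration.
DIH_URL = "http://localhost:8983/solr/corename/dataimport"

# Kick off a full import; use command=delta-import for incremental runs.
requests.get(DIH_URL, params={"command": "full-import", "clean": "true", "wt": "json"})

# Poll the handler until the import has finished.
while True:
    status = requests.get(DIH_URL, params={"command": "status", "wt": "json"}).json()
    if status.get("status") != "busy":
        break
    time.sleep(5)

print(status.get("statusMessages", {}))

Run from cron or any scheduler, this covers the "batch job" part of the question; the equivalent requests can of course be sent from Java as well.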

Solr Cloud Managed Resources

I am implementing SolrCloud for the first time. I've worked with normal Solr and have that down pretty well, but I'm not finding a lot on what you can and can't do with SolrCloud. So my question is about Managed Resources. I know you can CRUD stop words and synonyms using the new RESTful API in Solr. However, with the cloud, do I need to CRUD my changes on each individual Solr server, or do I send them to a single URL that passes them through to each server? I'm new to the cloud setup and ZooKeeper. I have not found anything in the Solr wiki about working with managed resources in a cloud setup. Any advice would be helpful.
In SolrCloud, configuration and other files like stopwords are stored and maintained by ZooKeeper, which means you do not need to send updates to each server individually.
Once you have SolrCloud running, before putting in any data, you will create a collection. Each collection has its own set of resources (its own config folder).
So, for example, if you have a collection called techproducts on two servers, localhost1 and localhost2, the commands below work on the same resource from either server.
curl "http://localhost1:8983/solr/techproducts/schema/analysis/synonyms/english"
curl "http://localhost2:8983/solr/techproducts/schema/analysis/synonyms/english"

Replicating data from GAE data store

We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE data store to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google-provided bulkloader.py to export the data to CSV, somehow transfer the CSV to Amazon, and process it there.
Create a Java app that runs on GAE, reads the data from the data store, and sends it to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration or backup tools do:
1. Mark modified entities with a child entity marker.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those entities using a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to steps 3 and 4, you could make multiple urlfetch(POST) calls to send each serialized entity to the remote host directly, but this is more fragile, as a single failure could compromise the integrity of your data import (a sketch of this variant appears below).
You could look at the datastore admin source code for inspiration.
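A minimal sketch of that urlfetch alternative to steps 3 and 4, using the classic db API; the Record model, the boolean modified flag (standing in for the child-entity marker) and the remote URL are all hypothetical placeholders:

from google.appengine.api import urlfetch
from google.appengine.ext import db

# Hypothetical entity kind; 'modified' stands in for the child-entity marker above.
class Record(db.Model):
    payload = db.TextProperty()
    modified = db.BooleanProperty(default=False)

REMOTE_IMPORT_URL = "https://importer.example.com/import"   # placeholder endpoint

def push_modified_entities():
    for entity in Record.all().filter("modified =", True):
        blob = db.model_to_protobuf(entity).Encode()     # serialized entity bytes
        result = urlfetch.fetch(
            REMOTE_IMPORT_URL,
            payload=blob,
            method=urlfetch.POST,
            headers={"Content-Type": "application/octet-stream"},
            deadline=30,
        )
        if result.status_code == 200:
            entity.modified = False    # clear the marker only after a successful push
            entity.put()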
