Solr backup and restore on Google Cloud

I have been struggling to find a solution for the scenario below:
I have 2 instances of SolrCloud on Compute Engine.
I have 2 instances of an application REST API which calls the above Solr cluster for data.
From my application I want to take a backup of Solr, copy the zipped backup file to Google Cloud Storage automatically, and restore it automatically via a URL endpoint.
For that I am trying to make an API endpoint in my application that will call the Solr API below to take a backup:
/admin/collections?action=BACKUP
And another endpoint that calls the URL below to restore:
/admin/collections?action=RESTORE
However, after taking a backup my application doesn't have access to the backup files, as they get saved on the Solr instances. So I am not able to save them to a Google Cloud Storage bucket.
Please guide me to a simpler way to achieve this, i.e. automatically backing up and restoring Solr from another GCP instance.

Have you considered something like gcs-fuse? It'll allow you to mount a GCS bucket directly on the file system.
You can then point the BACKUP command's location directly at the gcs-fuse mount point on your Solr Compute Engine VMs. The upload to GCS is then handled transparently by how the VM is configured, instead of the backup having to be uploaded manually with a separate tool after a local copy has been made.
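A minimal sketch of that setup; the bucket, mount point, collection name, and port below are placeholders, not anything from your cluster:

#!/bin/sh
# Hypothetical names -- adjust for your own bucket, collection, and Solr port.
BUCKET=my-solr-backups
MOUNT=/mnt/solr-backups
COLLECTION=mycollection

# Mount the GCS bucket on every Solr node so the backup location is shared
# (the VM's service account needs access to the bucket).
mkdir -p "$MOUNT"
gcsfuse "$BUCKET" "$MOUNT"

# Trigger the backup through the Collections API, writing into the mount.
curl "http://localhost:8983/solr/admin/collections?action=BACKUP&name=backup-$(date +%Y%m%d)&collection=$COLLECTION&location=$MOUNT"

# Later, restore from the same location into a (new) collection, e.g.:
# curl "http://localhost:8983/solr/admin/collections?action=RESTORE&name=backup-20190101&collection=${COLLECTION}_restored&location=$MOUNT"

Because the location has to be reachable from every node that hosts a shard, you would typically put the gcsfuse mount in the VM's startup configuration rather than run it by hand.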

I found GCSFuse a bit unreliable, and decided to write a wrapper script which first detects the master (shard leader) for the given collection and then executes the backup directly on that node.
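For reference, a rough sketch of that kind of wrapper, assuming jq is installed and using placeholder names: it asks CLUSTERSTATUS for a leader replica's base_url and then issues the BACKUP against that node.

#!/bin/sh
# Hypothetical names -- adjust for your cluster.
COLLECTION=mycollection
BACKUP_LOCATION=/var/solr/backups
SOLR=http://localhost:8983/solr

# Find the base_url of a leader replica for the collection.
LEADER=$(curl -s "$SOLR/admin/collections?action=CLUSTERSTATUS&collection=$COLLECTION" \
  | jq -r ".cluster.collections[\"$COLLECTION\"].shards[].replicas[] | select(.leader==\"true\") | .base_url" \
  | head -n 1)

# Run the backup directly against that node.
curl "$LEADER/admin/collections?action=BACKUP&name=backup-$(date +%Y%m%d)&collection=$COLLECTION&location=$BACKUP_LOCATION"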

Related

How to prevent files from being deleted on idling in GAE?

I have a flask server running on GAE (flexible env). The application generates certain files at runtime based on the API requests received. But after the instance restarts following an idle period, the files are lost. How do I prevent this?
Your application seems to write those files to the local filesystem. However, whatever is written to the filesystem will not be persisted and will be lost when the instance dies. Also, what you write on one instance won't be available to the other instances serving your app.
The solution to this is to write your files to Google Cloud Storage. Any instance can write and retrieve its files there and they'll be available for any instance of your service.
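Inside the app you would use the Cloud Storage client library for your language; the same idea from the shell, just to illustrate that any instance can read what another one wrote (the bucket and file names are placeholders, and gsutil is assumed to be installed):

# One instance uploads a generated file to the shared bucket...
gsutil cp /tmp/generated-report.pdf gs://my-app-files/reports/generated-report.pdf

# ...and any other instance (or machine) can fetch it later.
gsutil cp gs://my-app-files/reports/generated-report.pdf /tmp/generated-report.pdf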

How can I export data from Google App Engine High Replication datastore?

I am looking into using Google App Engine for a project and would like to make sure I have a way to export all my data if I ever decide to leave GAE (or GAE shuts down).
Everything I search about exporting data from GAE points to https://developers.google.com/appengine/docs/python/tools/uploadingdata. However, that page contains this note:
Note: This document applies to apps that use the master/slave
datastore. If your app uses the High Replication datastore, it is
possible to copy data from the app, but Google does not currently
support this use case. If you attempt to copy from a High Replication
datastore, you'll see a high_replication_warning error in the Admin
Console, and the downloaded data might not include recently saved
entities.
The problem is that the master/slave datastore was recently deprecated in favor of the High Replication datastore. I understand that the master/slave datastore is still supported for a little while, but I don't feel comfortable using something that has officially been deprecated and is on its way out. So that leaves me with the High Replication datastore, and the only way to export the data seems to be the method above, which is not officially supported (and thus does not give me any guarantee that I can get my data out).
Is there any other (officially supported) way of exporting data from the High Replication datastore? I don't feel comfortable using Google App Engine if it means my data could be locked in there forever.
It took me quite a long time to set up the download of data from GAE, as the documentation is not as clear as it should be.
If you are extracting data from a Unix server, you may be able to reuse the script below.
Also, if you do not provide the "config_file" parameter, it will extract all your data for the given kind, but in a proprietary format which can only be used for restoring the data afterwards.
#!/bin/sh
#------------------------------------------------------------------
#-- Param 1 : Namespace
#-- Param 2 : Kind (table id)
#-- Param 3 : Directory in which the csv file should be stored
#-- Param 4 : output file name
#------------------------------------------------------------------
appcfg.py download_data --secure --email=$BACKUP_USERID \
    --config_file=configClientExtract.yml \
    --filename=$3/$4.csv --kind=$2 \
    --url=$BACKUP_WEBSITE/remote_api --namespace=$1 --passin <<-EOF
$BACKUP_PASSWORD
EOF
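A hedged usage example, assuming the script above is saved as download_kind.sh and that the environment variables it references are exported first (all values are placeholders):

export BACKUP_USERID=admin@example.com
export BACKUP_PASSWORD=your-password
export BACKUP_WEBSITE=https://your-app-id.appspot.com

# Namespace, kind, output directory, output file name
./download_kind.sh myns Album /tmp/gae-backup album_export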
Currently the App Engine datastore supports another option as well. The datastore backup feature can be used to copy selected data into the blobstore or Google Cloud Storage. This function is available under the Datastore Admin area in the App Engine console. If required, the backed-up data can then be downloaded from the blob viewer or from Cloud Storage. When backing up a High Replication datastore, it is recommended that datastore writes are disabled before taking the backup.
You need to configure a builtin called remote_api. This article has all the information and guidance you need to be able to download all your data today and in the future.
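For a Python runtime, enabling that builtin is a small addition to app.yaml; a sketch (skip the append if you already have a builtins section):

# Enable the remote_api builtin (Python runtime assumed).
cat >> app.yaml <<EOF
builtins:
- remote_api: on
EOF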

Move database from local datastore to another local datastore

My friend and I are working on a GWT / Google App Engine project, using TortoiseSVN and Google Code to synchronize the code.
We also synchronize the local_db.bin file in the appengine-generated folder, but we can't get it to work. After synchronizing the db file, our local datastore is not updated as we expected.
That is a pain. I'm worried about the future, when our database gets bigger and more complicated.
Can anyone please give me some advice? What should I do to synchronize our local datastores?
I have two suggestions:
1) Use the remote API (https://developers.google.com/appengine/articles/remote_api) to share a GAE-hosted db locally.
2) Maybe you can use Google Drive to sync the folders.
This is a really bad idea. Even if you weren't having trouble making both ends read from the same datastore file, the local datastore is in a binary format, and thus you won't both be able to work on the app at the same time, or you'll get merge conflicts you will be unable to resolve.
Instead, both for collaboration purposes and for testing and deployment, you should provide a set of test data you can easily load into the datastore. Store the test data in version control, and load it in using bulkloader or your own code.
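A rough sketch of loading version-controlled test data into the local dev server over remote_api; the file names, kind, and port are assumptions, and the remote_api builtin must be enabled in app.yaml:

# Run from the application directory so appcfg.py can find app.yaml.
# Assumes the dev server is running on port 8080 with remote_api enabled.
appcfg.py upload_data \
  --url=http://localhost:8080/_ah/remote_api \
  --config_file=bulkloader.yaml \
  --filename=test_data.csv \
  --kind=MyKind \
  .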

Replicating data from GAE data store

We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE data store to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google-provided bulkloader.py to export the data to CSV and somehow transfer the CSV to Amazon and process it there
Create a Java app that runs on GAE, reads the data from the data store and sends the data to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration and backup tools are doing:
1. Mark modified entities with a child entity marker.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those entities with a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to 3 and 4, you could make multiple urlfetch(POST) calls to send each serialized entity to the remote host directly, but this is more fragile, as a single failure could compromise the integrity of your data import.
You could look at the datastore admin source code for inspiration.

How can I change my app-id in GAE and still access the same permanent datastore?

I am developing an app locally in Google App Engine. I have built a small datastore for development purposes. Rebuilding it after every power cycle on my Mac got tedious so I made it permanent. Now I run my app locally with the following command:
/usr/local/bin/dev_appserver.py "--datastore_path=./permanent.datastore" appengine_prototype
Life is good. I have decided to deploy my app so I can test http post commands from a different machine. When I tried to register my current application id (example), I found that it was unavailable (shocker!). So I registered a different application id and planned to change my local application id to match. However, when I changed the
application: app-id
line in my app.yaml file, my app stopped recognizing my permanent datastore.
So, how can I change my application id to the one I registered, maintain the connection to the permanent datastore and then push the whole shebang online? I tried running the app twice locally, first with the permanent datastore referenced in the command and then without, hoping that the default temporary datastore would inherit from the previous permanent datastore. That didn't work. Do I need to start by copying the permanent datastore to the default temporary datastore? How would I do that? Any help would be much appreciated.
Thanks,
Dessie
If your intention is to eventually push your local data to your live environment anyway, then your best bet is to:
use bulkloader.py to back up your local data (while still using the old id in your config)
then change your config to the new id
then use bulkloader.py to push your data to your new development server (ran with --datastore_path=./permanent.datastore2 or something)
then use bulkloader.py to push your data to the GAE production server
Details of bulkloader.py can be found in the docs, along with an example.
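A hedged sketch of that sequence in dump/restore mode; the app ids, file names, and ports are placeholders, and the exact flags can differ between SDK versions, so check bulkloader.py --help:

# 1. Dump the local datastore while app.yaml still contains the old id
#    (dev server running with --datastore_path=./permanent.datastore).
bulkloader.py --dump \
  --url=http://localhost:8080/_ah/remote_api \
  --filename=local_backup.dump

# 2. Change app.yaml to the new application id, restart the dev server with
#    --datastore_path=./permanent.datastore2, and restore into it.
bulkloader.py --restore \
  --url=http://localhost:8080/_ah/remote_api \
  --filename=local_backup.dump \
  --app_id=dev~new-app-id

# 3. Finally, push the same dump to the GAE production server.
bulkloader.py --restore \
  --url=https://new-app-id.appspot.com/_ah/remote_api \
  --filename=local_backup.dump \
  --app_id=new-app-id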
