Migrating data to GCP persistent disk from local linux machine - database

I am trying to find commands to copy data from a local Linux VM to a GCP persistent disk.
I think gsutil cp may be what I am looking for, however I cannot find documentation for this case, nor do I know the path I would need to copy the data to.
I have seen how to copy to a Google Cloud Storage bucket, but not to a persistent disk.
Thanks,
K

The following seems to have worked:
gcloud compute scp --recurse db/ machineName:/dir/path/
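A note on mechanics: a persistent disk is only reachable through a VM that has it attached and mounted, so the copy really goes from the local machine to the VM and onto the mounted disk. A minimal sketch of the full flow, assuming a hypothetical disk named data-disk, an instance named machineName in zone us-central1-a, and a mount point of /mnt/disks/data:
# Attach the persistent disk to the instance (skip if it is already attached)
gcloud compute instances attach-disk machineName --disk=data-disk --zone=us-central1-a
# On the VM: format the disk (first use only; this erases it) and mount it
sudo mkfs.ext4 -m 0 /dev/disk/by-id/google-data-disk
sudo mkdir -p /mnt/disks/data
sudo mount -o discard,defaults /dev/disk/by-id/google-data-disk /mnt/disks/data
# From the local machine: copy the database directory onto the mounted disk
gcloud compute scp --recurse db/ machineName:/mnt/disks/data/ --zone=us-central1-a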

Related

Google App Engine (GAE) - do instances use a shared disk?

I'm using Google App Engine (GAE) and my app.yaml looks like this:
runtime: custom # uses Dockerfile
env: flex
manual_scaling:
  instances: 2
resources:
  cpu: 2
  memory_gb: 12
  disk_size_gb: 50
The 50GB disk, is it shared between instances? The docs are silent on this issue. I'm downloading files to the disk and each instance will need to be able to access the files I am downloading.
If the disk is not shared, how can I share files between instances?
I know I could download them from Google Cloud Storage on demand, but these are video files and instant access is needed for every instance. Downloading the video files on demand would be too slow.
Optional Reading
The reason instant access is needed is because I am using ffmpeg to produce a photo from the video at frame X (or time X). When a photo of the video is taken, these photos need to be made available to the user as quickly as possible.
Whether you are using the standard or flexible environment, the 50 GB disk in your GAE app cannot be shared between instances; the storage is dedicated to each instance.
You have already considered GCS, but since video processing is involved and GCS is object-based storage, on-demand downloads are too slow for your use case.
The natural alternative would be Filestore, but it is not yet supported for GAE Flex, even though you can SSH into the underlying fully managed machine.
There is a way if you use the /tmp folder. However, it will store files in the RAM of the instance, so note that it will take up memory and that it is temporary (as the folder's name suggests).
For more details, see the App Engine documentation.
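If the frames only need to be produced on one instance and then served by the others, one workable pattern is to extract the frame into /tmp and push the (small) image to a shared bucket, which is much cheaper to fetch than the whole video. A rough sketch, assuming a hypothetical bucket named my-frames-bucket and a video already present on the instance:
# Extract a single frame at 00:01:23 from the video into /tmp
ffmpeg -ss 00:01:23 -i /tmp/video.mp4 -frames:v 1 -q:v 2 /tmp/frame_000123.jpg
# Publish the frame so every instance (and the user) can fetch it immediately
gsutil cp /tmp/frame_000123.jpg gs://my-frames-bucket/frames/frame_000123.jpg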

What is the backend file system used for OpenEBS Jiva and cStor volumes?

I just got to know about the OpenEBS project and I have a few doubts regarding the storage engines' backing filesystems. What is the backend file system used for OpenEBS Jiva and cStor volumes?
Jiva is entirely sparse-file based. cStor is copy-on-write (COW) based and can be backed by either sparse files or real disks.
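One way to see the sparse-file backing in practice (a rough sketch, assuming Jiva's default hostPath of /var/openebs on the node, which may differ in your installation) is to compare apparent size with the blocks actually allocated:
# Apparent size of the Jiva volume data vs. the space actually allocated on disk
du -sh --apparent-size /var/openebs/pvc-*/
du -sh /var/openebs/pvc-*/    # much smaller until real data is written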

OpenShift 3 Webapp With Access to File System

I have a Tomcat 8 web app running on OpenShift 3.
I want to be able to read and write files on 'the file system'.
I have been wading through documentation and looking for examples of how to achieve this.
I see that there are many types of persistent storage, for example NFS, EBS, GlusterFS etc.
So, my first question is:
What is the best file system to use for simple read/write access to text-based XML files?
Preferably something that behaves like a *nix file system.
Any example would be much appreciated...
The free OpenShift 3 Starter service only offers 'filesystem storage' backed by EBS (Amazon Elastic Block Storage), which can only be mounted for writing by one node at a time (ReadWriteOnce).
To get access to GlusterFS or NFS you have to go to the paid service, which starts at $50 per month. Those are the only offered filesystems that allow multiple writers.
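Once some persistent storage is available on your plan, attaching a writable volume to the Tomcat deployment can be done with the oc CLI. A minimal sketch, with hypothetical names (deployment config tomcat-app, claim app-data, mount path /opt/app-data):
# Create a 1Gi PersistentVolumeClaim and mount it into the Tomcat deployment config
oc set volume dc/tomcat-app --add --name=app-data --type=persistentVolumeClaim \
    --claim-name=app-data --claim-size=1Gi --mount-path=/opt/app-data
# The app can then read and write its XML files under /opt/app-data like a normal directory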

What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

I would like to write data from flume-ng to Google Cloud Storage.
It is a little bit complicated, because I observed some very strange behavior. Let me explain:
Introduction
I've launched a Hadoop cluster on Google Cloud (one-click deployment) set up to use a bucket.
When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket:
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I add a file and then list from my own computer, it seems to use some other HDFS. Here I added a file called jp.txt, and the listing doesn't show my previous file test.txt:
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.
I read Hadoop cannot connect to Google Cloud Storage and I configured my hadoop client accordingly (pretty hard stuff) and now I can see items in my bucket. But for that, I need to use a gs:// URI
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
Observation / Intermediate conclusion
So it seems there are two different storage engines in the same cluster: "traditional HDFS" (paths starting with hdfs://) and a Google Cloud Storage bucket (paths starting with gs://).
Users and rights are different, depending on where you are listing files from.
Question(s)
The main question is: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume ?
Related questions
Do I need to launch a Hadoop cluster on Google Cloud to achieve my goal, or not?
Is it possible to write directly to a Google Cloud Storage bucket? If yes, how can I configure Flume? (adding jars, redefining the classpath...)
How come there are two storage engines in the same cluster (classical HDFS / GCS bucket)?
My flume configuration
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs:// path?
What setup would it need in that case (additional jars, classpath)?
Thanks
As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme:// of the URI you use with hadoop fs. The cluster you deployed on Google Compute Engine also has both filesystems available, it just happens to have the "default" set to gs://your-configbucket.
The reason you had to include the gs://configbucket/file instead of just plain /file on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml, setting fs.default.name to be gs://configbucket/. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml for a reference of what you might carry over to your local setup.
To explain the internals of Hadoop a bit, the reason hdfs:// paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml file, which by default sets:
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
Similarly, you may have noticed that to get gs:// to work on your local cluster, you provided fs.gs.impl. This is because DistributedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface FileSystem, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs:// you should be able to use gs://.
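For reference, the analogous wiring for gs:// on a self-managed cluster looks roughly like this (a sketch: it assumes the GCS connector jar is already on Hadoop's classpath, and my-bucket is a placeholder):
<!-- core-site.xml: route gs:// URIs to the GCS connector -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: uris.</description>
</property>
<!-- optional: make schemeless paths resolve to the bucket, as in the one-click cluster -->
<property>
  <name>fs.default.name</name>
  <value>gs://my-bucket/</value>
</property>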
So, to answer your questions:
The same minimal setup you'd use to get Flume working with a typical HDFS-based setup should work for using GCS as a sink.
You don't need to launch the cluster on Google Compute Engine, though it'd be easier, as you experienced with the more difficult manual instructions for using the GCS connector on your local setup. But since you already have a local setup running, it's up to you whether Google Compute Engine will be an easier place to run your Hadoop/Flume cluster.
Yes, as mentioned above, you should experiment with replacing hdfs:// paths with gs:// paths instead, and/or setting fs.default.name to be your root gs://configbucket path.
Having the two storage engines allows you to more easily switch between the two in case of incompatibilities. There are some minor differences in supported features, for example GCS won't have the same kinds of posix-style permissions you have in HDFS. Also, it doesn't support appends to existing files or symlinks.
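Concretely, the change on the Flume side should then be limited to the sink path, roughly like this (a sketch: it assumes the GCS connector jar is also on Flume's classpath, e.g. via FLUME_CLASSPATH in flume-env.sh, and reuses the placeholder bucket name):
# Point the HDFS sink at the bucket instead of the NameNode
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my-bucket/%{env}/%{tenant}/%{type}/%y-%m-%d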

local GAE datastore does not keep data after computer shuts down

On my local machine (i.e. http://localhost:8080/), I have entered data into my GAE datastore for some entity called Article. After turning off my computer and then restarting the next day, I find the datastore empty: no entities. Is there a way to prevent this in the future?
How do I make a copy of the data in my local datastore? Also, will I be able to upload said data later into both localhost and production?
My model is ndb.
I am using Mac OS X and Python 2.7, if these matter.
I have experienced the same problem. Declaring the datastore path when running dev_appserver.py should fix it. These are the arguments I use when starting the dev_appserver:
python dev_appserver.py --high_replication --use_sqlite --datastore_path=myapp.datastore --blobstore_path=myapp_blobs
This will use sqlite and save the data in the file myapp.datastore. If you want to save it in a different directory, use --datastore_path=/path/to/myapp/myapp.datastore
I also use --blobstore_path to save my blobs in a specific directory. I have found that it is more reliable to declare which directory to save my blobs. Again, that is --blobstore_path=/path/to/myapp/blobs or whatever you would like.
Since declaring blob and datastore paths, I haven't lost any data locally. More info can be found in the App Engine documentation here:
https://developers.google.com/appengine/docs/python/tools/devserver#Using_the_Datastore
Data in the local datastore is preserved unless you start it with the -c flag to clear it, at least on the PC. You therefore probably have a different issue with temp files or permissions or something.
The local data is stored using a different method to the actual production servers, so not sure if you can make a direct backup as such. If you want to upload data to both the local and deployed servers you can use the Upload tool suite: uploading data
The bulk loader tool can upload and download data to and from your application's datastore. With just a little bit of setup, you can upload new datastore entities from CSV and XML files, and download entity data into CSV, XML, and text files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your app. You can customize the upload and download logic to use different kinds of files, or do other data processing.
So you can 'backup' by downloading the data to a file.
To load/pull data into the local development server just give it the local URL.
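As a rough sketch of that workflow with the legacy appcfg.py bulk loader (the app id, URLs and file name are placeholders, and remote_api has to be enabled in the app's configuration):
# Back up Article entities from the deployed app to a local dump file
appcfg.py download_data --url=http://my-app-id.appspot.com/_ah/remote_api --filename=articles.dump
# Restore the same dump into the local development server
appcfg.py upload_data --url=http://localhost:8080/_ah/remote_api --filename=articles.dump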
The datastore typically saves to disk when you shut down. If you turned off your computer without shutting down the server, I could see this happening.
