Google App Engine (GAE) - do instances use a shared disk? - google-app-engine

I'm using Google App Engine (GAE) and my app.yaml looks like this:
runtime: custom # uses Dockerfile
env: flex
manual_scaling:
  instances: 2
resources:
  cpu: 2
  memory_gb: 12
  disk_size_gb: 50
Is the 50 GB disk shared between instances? The docs are silent on this. I'm downloading files to the disk, and each instance needs to be able to access the files I am downloading.
If the disk is not shared, how can I share files between instances?
I know I could download them from Google Cloud Storage on demand, but these are video files and every instance needs instant access. Downloading the video files on demand would be too slow.
Optional Reading
The reason instant access is needed is that I am using ffmpeg to produce a photo from the video at frame X (or time X). When a photo is taken from the video, it needs to be made available to the user as quickly as possible.
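For context, the kind of command involved is roughly the following (the timestamp and file paths are placeholders):
# Grab a single frame at time X (here 00:01:23) from a locally stored video.
ffmpeg -ss 00:01:23 -i /srv/videos/clip.mp4 -frames:v 1 /srv/photos/clip-frame.jpg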

The 50 GB disk is not shared: whether you use the standard or the flexible environment, each App Engine instance gets its own dedicated storage.
You have already considered GCS, but since video processing is involved, object-based storage alone doesn't give you the instant local access you need.
The alternative would be Filestore, but it is not yet supported for GAE Flex, even though you can SSH into the underlying fully managed machine.
One workaround is the /tmp folder. However, it stores files in the instance's RAM, so note that it takes up memory and is temporary (as the folder's name suggests).
For more details, see the App Engine documentation.
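One workaround that fits those constraints, sketched below with placeholder bucket and path names, is to pre-fetch the videos from Cloud Storage onto each instance's own disk when the instance starts (for example from the Docker image's entrypoint script), assuming the Cloud SDK / gsutil is installed in the image:
#!/bin/sh
# entrypoint.sh (hypothetical): copy the video library to local disk, then start the app.
gsutil -m rsync -r gs://my-video-bucket/videos /srv/videos
exec java -jar /app/server.jar   # placeholder for however the app is normally started
Each instance still ends up with its own copy rather than a true shared disk, but the files are already local by the time ffmpeg needs them.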

Related

How to deploy a web app that needs regular access to large data files

I am trying to deploy a web app I have written, but I am stuck with one element. The bulk of it is just an Angular application that interacts with a MongoDB database; that's all fine. Where I am stuck is that I need local read access to around 10 GB of files (GeoTIFF digital elevation models). These don't change and are broken down into 500 or so files. Each time my app needs geographic elevations, it needs to find the right file, read the right part of it and return the data, the quicker the better. To reiterate, I am not serving these files, just reading data from them.
In development these files are on my machine and I have no problems, but the files seem to be too large to bundle in the Angular app (it runs out of memory), and too large to include in any backend assets folder. I've looked at two serverless cloud hosting platforms (GCP and Heroku), both of which limit the size of the deployed files to around 1 GB (if I remember right). I have considered using cloud storage for the files, but I'm worried about a performance hit, as each time I need a file it would have to be downloaded from the cloud to the application. The only solution I can think of is to use a VM-based service like Google Compute Engine and an API service to receive requests from the app and deliver back the required data, but I had hoped it could be more co-located (not least because that solution costs more $$)...
I'm new to deployment so any advice welcome.
Load your data into a GIS database, like PostGIS. Then have your app query this database instead of the local raster files.
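A minimal sketch of that approach, assuming PostGIS with raster support and placeholder database/table names: load the GeoTIFF tiles once with raster2pgsql, then look elevations up by point with ST_Value:
# One-time import of the ~500 GeoTIFF tiles into a raster table (SRID 4326 assumed).
raster2pgsql -s 4326 -I -C -t 256x256 *.tif public.dem | psql -d elevations
# Look up the elevation at a given lon/lat.
psql -d elevations -c "SELECT ST_Value(rast, ST_SetSRID(ST_MakePoint(-3.20, 55.95), 4326)) FROM public.dem WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(-3.20, 55.95), 4326));"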

Kubernetes: How to manage data with multiple replicas?

I am currently learning Kubernetes and I'm stuck on how to handle the following situation:
I have a Spring Boot application which handles files (photos, PDFs, etc.) uploaded by users; users can also download these files. The application also produces logs, which are spread across 6 different files. To make my life easier I decided to have a root directory containing 2 subdirectories (one for user data and one for logs), so the application works with only one directory (appData):
.appData
|__ usersData
|__ logsFile
I would like to use GKE (Google Kubernetes Engine) to deploy this application but I have these problems:
How do I handle multiple replicas that concurrently read/write data and logs in the appData directory?
Regarding logs, is it possible to have multiple Pods writing to the same file?
Say we have 3 replicas (Pod-A, Pod-B and Pod-C): if user A uploads a file that is handled by Pod-B, how will Pod-A and Pod-C find this file if the same user requests it later?
Should each replica have its own volume? (I would like to avoid this, which seems to be the case when using a StatefulSet.)
Should I have only one replica? (Using Kubernetes would be pointless in that case.)
The same questions apply to the database's replicas.
I use PostgreSQL and I have the same questions: if we have multiple replicas, and requests are sent to replicas at random, how can I be sure that a query will return the data?
I know there are a lot of questions. Thanks a lot for your clarifications.
I'd do two separate solutions for logs and for shared files.
For logs, look at a log aggregator like fluentd.
For shared file system, you want an NFS. Take a look at this example: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs. The NFS will use a persistent volume from GKE, Azure, or AWS. It's not cloud agnostic per se, but the only thing you change is your provisioner if you want to work in a different cloud.
You can use a persistent volume backed by NFS in GKE (Google Kubernetes Engine) to share files across pods.
https://cloud.google.com/filestore/docs/accessing-fileshares
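For illustration, a minimal sketch of the PersistentVolume/PersistentVolumeClaim pair behind that example (server address, sizes and names are placeholders); every replica can mount the claim read-write:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: appdata-nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany          # many pods can mount it read-write at once
  nfs:
    server: 10.0.0.2         # e.g. a Filestore instance or your own NFS server
    path: /appData
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: appdata-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""       # bind to the pre-created PV above, not a dynamic class
  resources:
    requests:
      storage: 10Gi
Each replica of the Deployment then mounts appdata-claim at /appData, so a file uploaded through Pod-B is immediately visible to Pod-A and Pod-C.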

App Engine Flexible running out of disk space

Stopping and deleting old versions and instances in my project doesn't seem to free up disk space. After stopping and deleting a working instance and then spinning up a new instance, I get error messages related to disk space (health_check returns unhealthy, and I see logs from vm_check_disk_space.sh). I know this is related to disk space because I can resolve the issue by raising resources: disk_size_gb in my app.yaml and redeploying.
My project is 15 GB, so it's essential that deleted versions and instances don't bloat it. How can I go about freeing up unused space?
For reference this is my app.yaml (with a project size of 15 GB this should be more than enough?):
runtime: custom
env: flex
manual_scaling:
  instances: 1
resources:
  cpu: 1
  memory_gb: 1.5
  disk_size_gb: 40
The Docker image used for a specific version is built at deployment time and doesn't normally include other versions of your app (unless they are also present in your deployment directory), so stopping instances of, or deleting, other versions in the developer console has no impact on the already-built Docker image.
Increase the deployment verbosity (see --verbosity in gcloud) to see exactly what is included in the image being built, then re-deploy while looking for unwanted files/directories. Then use the skip_files configuration option in app.yaml (see General settings) to skip them, if any; a typical example would be the app's .git directory. Repeat until you're happy with what's included in the Docker image.
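For example, something along these lines (the skip_files patterns are only illustrations; adjust them to whatever the verbose output reveals):
# Re-deploy with more verbose output to inspect what is sent for the image build.
gcloud app deploy --verbosity=info

# app.yaml: exclude files that shouldn't end up in the image.
skip_files:
- ^\.git/.*
- ^node_modules/.*
- ^test-data/.*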
If you still encounter the problem after skipping unwanted files it could mean that your custom runtime is simply too big for the app's disk size configuration, so you'll have to increase it.
Note that the disk may be used for storing data generated at runtime as well, not only for storing your app and environment code, so you may need to investigate runtime usage as well, see Debugging an Instance.

OpenShift 3 Webapp With Access to File System

I have a Tomcat 8 web app running on OpenShift 3.
I want to be able to read and write files on 'the file system'.
I have been wading through documentation and looking for examples of how to achieve this.
I see that there are many types of persistent storage, for example NFS, EBS, GlusterFS etc.
So, my first question is: what is the best file system to use for simple read/write access to text-based XML files? Preferably something like a *nix file system.
Any example would be much appreciated...
The free OpenShift 3 Starter service ONLY allows 'filesystem storage' on EBS (Amazon Elastic Block Store), which can only be mounted for writing by one consumer at a time.
To get access to GlusterFS or NFS you have to move to the paid service, which starts at $50 per month. Those are the only offered filesystems that allow multiple writers.
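For reference, that difference shows up in the access mode you request on the PersistentVolumeClaim; a minimal sketch (names and size are placeholders):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xml-data
spec:
  accessModes:
    - ReadWriteOnce    # what EBS-backed block storage supports: a single writer
    # - ReadWriteMany  # what NFS/GlusterFS-backed storage can provide instead
  resources:
    requests:
      storage: 1Gi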

What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

I would like to write data from flume-ng to Google Cloud Storage.
It is a little bit complicated, because I observed some very strange behavior. Let me explain:
Introduction
I've launched a Hadoop cluster on Google Cloud (one-click deployment) set up to use a bucket.
When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket:
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I try to add and then read files from my own computer, it seems to use some other HDFS. Here I added a file called jp.txt, and it doesn't show my previous file test.txt:
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.
I read Hadoop cannot connect to Google Cloud Storage and configured my Hadoop client accordingly (pretty hard stuff), and now I can see items in my bucket. But for that, I need to use a gs:// URI:
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
Observation / Intermediate conclusion
So it seems there are 2 different storage engines in the same cluster: "traditional HDFS" (paths starting with hdfs://) and a Google Cloud Storage bucket (paths starting with gs://).
Users and permissions differ depending on where you are listing files from.
Question(s)
The main question is: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume ?
Related questions
Do I need to launch a Hadoop cluster on Google Cloud to achieve my goal, or not?
Is it possible to write directly to a Google Cloud Storage bucket? If yes, how do I configure flume? (adding jars, redefining the classpath...)
How come there are 2 storage engines in the same cluster (classical HDFS / GCS bucket)?
My flume configuration
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the a1.sinks.hdfs_sink.hdfs.path line accept a gs:// path?
What setup would that require (additional jars, classpath)?
Thanks
As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme:// of the URI you use with hadoop fs. The cluster you deployed on Google Compute Engine also has both filesystems available, it just happens to have the "default" set to gs://your-configbucket.
The reason you had to include the gs://configbucket/file instead of just plain /file on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml, setting fs.default.name to be gs://configbucket/. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml for a reference of what you might carry over to your local setup.
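For reference, the override in question would look roughly like this (the bucket name is a placeholder):
<property>
  <name>fs.default.name</name>
  <value>gs://configbucket/</value>
</property>
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>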
To explain the internals of Hadoop a bit, the reason hdfs:// paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml file, which by default sets:
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
Similarly, you may have noticed that to get gs:// to work on your local cluster, you provided fs.gs.impl. This is because DistributedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface, FileSystem, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs:// you should be able to use gs://.
So, to answer your questions:
The same minimal setup you'd use to get Flume working with a typical HDFS-based setup should work for using GCS as a sink.
You don't need to launch the cluster on Google Compute Engine, though it'd be easier, as you experienced with the more difficult manual instructions for using the GCS connector on your local setup. But since you already have a local setup running, it's up to you whether Google Compute Engine will be an easier place to run your Hadoop/Flume cluster.
Yes, as mentioned above, you should experiment with replacing hdfs:// paths with gs:// paths instead, and/or setting fs.default.name to be your root gs://configbucket path.
Having the two storage engines allows you to more easily switch between the two in case of incompatibilities. There are some minor differences in supported features, for example GCS won't have the same kinds of posix-style permissions you have in HDFS. Also, it doesn't support appends to existing files or symlinks.
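Coming back to the Flume configuration itself: a sketch of the only line that should need to change, assuming the GCS connector jar and its core-site.xml settings are visible on Flume's classpath (the bucket name is a placeholder):
a1.sinks.hdfs_sink.hdfs.path = gs://my-bucket/%{env}/%{tenant}/%{type}/%y-%m-%d
The rest of the sink definition stays as it is; the HDFS sink simply hands the path to whichever FileSystem implementation matches the URI scheme.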
