I have a Tomcat 8 web app running on OpenShift 3.
I want to be able to read and write files on 'the file system'.
I have been wading through documentation and looking for examples of how to achieve this.
I see that there are many types of persistent storage, for example NFS, EBS, GlusterFS etc.
So, my first question is: what is the best type of persistent storage to use for simple read/write access to text-based XML files?
Preferably something that behaves like a *nix file system.
Any example would be much appreciated...
The free OpenShift 3 Starter service only offers 'filesystem storage' backed by EBS (Amazon Elastic Block Store), which supports only single-node (ReadWriteOnce) access, so only one node can mount and write to the volume at a time.
To get access to GlusterFS or NFS you have to go to the paid service, which starts at $50 per month. Those are the only filesystems on offer that allow shared (ReadWriteMany) access, i.e. multiple writers.
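Once the storage is provisioned and mounted into your pod, reading and writing the XML files from the Tomcat app is just ordinary Java I/O against the mount path. A rough sketch, assuming the volume is mounted at /data (a made-up path; use whatever mount point you configure for your deployment):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class XmlStore {
    // Hypothetical mount point of the persistent volume inside the container.
    private static final Path DATA_DIR = Paths.get("/data");

    public static void write(String fileName, String xml) throws IOException {
        Files.write(DATA_DIR.resolve(fileName), xml.getBytes(StandardCharsets.UTF_8));
    }

    public static String read(String fileName) throws IOException {
        return new String(Files.readAllBytes(DATA_DIR.resolve(fileName)), StandardCharsets.UTF_8);
    }
}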
I am currently learning Kubernetes and I'm stuck on how to handle the following situation:
I have a Spring Boot application which handles files (photos, PDFs, etc.) uploaded by users; users can also download these files. The application also produces logs, which are spread across 6 different files. To make my life easier I decided to have a root directory containing 2 subdirectories (one for user data and one for logs), so the application works with only one directory (appData):
.appData
|__ usersData
|__ logsFile
I would like to use GKE (Google Kubernetes Engine) to deploy this application but I have these problems:
How do I handle multiple replicas that will concurrently read/write data and logs in the appData directory?
Regarding logs, is it possible to have multiple Pods writing to the same file?
Say we have 3 replicas (Pod-A, Pod-B and Pod-C): if user A uploads a file that is handled by Pod-B, how will Pod-A and Pod-C find this file when the same user requests it later?
Should each replica have its own volume? (I would like to avoid this situation, which seems the case when using StatefulSet)
Should I have only one replica? (using Kubernetes will be useless in that case)
The same questions apply to database replicas.
I use PostgreSQL: if we have multiple replicas and requests are sent to them at random, how can we be sure that a request for data will return a result?
I know these are a lot of questions. Thanks a lot for your clarifications.
I'd use two separate solutions, one for logs and one for shared files.
For logs, look at a log aggregator like fluentd.
For the shared file system, you want NFS. Take a look at this example: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs. The NFS server will use a persistent volume from GKE, Azure, or AWS. It's not cloud-agnostic per se, but the only thing you change when moving to a different cloud is the provisioner.
You can use a persistent volume backed by NFS in GKE (Google Kubernetes Engine) to share files across pods:
https://cloud.google.com/filestore/docs/accessing-fileshares
I would like to write data from flume-ng to Google Cloud Storage.
It is a little bit complicated, because I observed a very strange behavior. Let me explain:
Introduction
I've launched a Hadoop cluster on Google Cloud (one-click deployment), set up to use a bucket.
When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket:
$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------ 3 hadoop hadoop 40 2014-11-27 13:45 /test.txt
But when I try to add and then read files from my own computer, it seems to use some other HDFS. Here I added a file called jp.txt, and the listing doesn't show my previous file test.txt:
$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r-- 3 jp supergroup 0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt
That's also the only file I see when I explore HDFS on http://ip.to.my.cluster:50070/explorer.html#/
When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.
I read Hadoop cannot connect to Google Cloud Storage and configured my Hadoop client accordingly (pretty hard stuff), and now I can see items in my bucket. But for that, I need to use a gs:// URI:
$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------ 3 jp jp 40 2014-11-27 14:45 gs://my-bucket/test.txt
Observation / Intermediate conclusion
So it seems there are two different storage engines in the same cluster: "traditional HDFS" (paths starting with hdfs://) and a Google Cloud Storage bucket (paths starting with gs://).
Users and permissions differ depending on where you are listing files from.
Question(s)
The main question is: what is the minimal setup needed to write to HDFS/GCS on Google Cloud Storage with Flume?
Related questions
Do I need to launch a Hadoop cluster on Google Cloud to achieve my goal, or not?
Is it possible to write directly to a Google Cloud Storage bucket? If yes, how can I configure Flume (adding jars, redefining the classpath, ...)?
How come there are two storage engines in the same cluster (classical HDFS / GCS bucket)?
My flume configuration
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Does the line a1.sinks.hdfs_sink.hdfs.path accept a gs:// path?
What setup would it need in that case (additional jars, classpath) ?
Thanks
As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme:// of the URI you use with hadoop fs. The cluster you deployed on Google Compute Engine also has both filesystems available; it just happens to have the "default" set to gs://your-configbucket.
The reason you had to include the gs://configbucket/file instead of just plain /file on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml, setting fs.default.name to be gs://configbucket/. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml for a reference of what you might carry over to your local setup.
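For reference, the fs.default.name entry looks roughly like this (the bucket name is a placeholder; check your one-click cluster's core-site.xml for the exact value):

<property>
  <name>fs.default.name</name>
  <value>gs://configbucket/</value>
  <description>Use the GCS bucket for schemeless paths.</description>
</property>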
To explain the internals of Hadoop a bit, the reason hdfs:// paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml file, which by default sets:
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
Similarly, you may have noticed that to get gs:// to work on your local cluster, you provided fs.gs.impl. This is because DistributedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface, FileSystem, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs:// you should be able to use gs://.
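To make that concrete, here is a small sketch (reusing the cluster address and bucket name from your question) showing that the exact same client code can list either filesystem; Hadoop picks the implementation from the URI scheme:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRoots {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml, including fs.hdfs.impl and fs.gs.impl.
        Configuration conf = new Configuration();
        for (String root : new String[] {"hdfs://ip.to.my.cluster/", "gs://my-bucket/"}) {
            // The FileSystem implementation is chosen based on the URI scheme.
            FileSystem fs = FileSystem.get(URI.create(root), conf);
            for (FileStatus status : fs.listStatus(new Path(root))) {
                System.out.println(status.getPath());
            }
        }
    }
}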
So, to answer your questions:
The same minimal setup you'd use to get Flume working with a typical HDFS-based setup should work for using GCS as a sink.
You don't need to launch the cluster on Google Compute Engine, though it'd be easier; as you experienced, the manual instructions for using the GCS connector on your local setup are more difficult. But since you already have a local setup running, it's up to you whether Google Compute Engine will be an easier place to run your Hadoop/Flume cluster.
Yes, as mentioned above, you should experiment with replacing hdfs:// paths with gs:// paths instead, and/or setting fs.default.name to be your root gs://configbucket path.
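For example, assuming the GCS connector jar and a suitable core-site.xml are on Flume's classpath, that would just mean changing the sink path in your existing configuration to something like (my-bucket being a placeholder for your actual bucket):

a1.sinks.hdfs_sink.hdfs.path = gs://my-bucket/%{env}/%{tenant}/%{type}/%y-%m-%d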
Having the two storage engines allows you to more easily switch between the two in case of incompatibilities. There are some minor differences in supported features; for example, GCS won't have the same kinds of POSIX-style permissions you have in HDFS. Also, it doesn't support appends to existing files or symlinks.
I have a GAE/Python application that is an admin program that allows people all over the world to translate template files for a large GAE application into their local language (we cannot use auto translation because many idioms and the like are involved). The template files have been tokenized and text snippets in the various languages are stored in a GAE datastore (there are thousands of template files involved).
I therefore need to be able to write files to a folder.
I have the following code:
with open("my_new_file.html", "wb") as fh:
    fh.write(output)
I have read that GAE blocks the writing of files. If that is true, is there some way to get around it?
If I cannot write the file directly, does anyone have a suggestion for how to accomplish the same thing (e.g. something with a web service that does some kind of round trip to download and then upload the file)?
I am a newbie to GAE/Python, so please be specific.
Thanks for any suggestions.
You could use the Google App Engine blobstore or a BlobProperty in the datastore to store blobs/files.
For using the blobstore (up to 2GB):
https://developers.google.com/appengine/docs/python/blobstore/
For using datastore blobs (only up to 1MB):
https://developers.google.com/appengine/docs/python/datastore/typesandpropertyclasses#Blob
The filesystem is read-only in many cloud systems, and GAE is no exception. In a virtual world, where the OS and machine are virtual, the filesystem is the least reliable place to store anything.
I would suggest using any of a blob, Google Cloud Storage or Google Drive, or even going a step further and storing the data with an external provider like Amazon S3, etc.
Use the files API:
https://developers.google.com/appengine/docs/python/googlestorage/overview
With some extra code you can use it like the normal Python file API:
from google.appengine.api import files
# writable_file_name comes from e.g. files.blobstore.create()
with files.open(writable_file_name, 'a') as f:
    f.write('Hello World!')
While this particular link describes it in relation to Google Cloud Storage (GCS), you can easily replace the GCS-specific pieces and use the blobstore as a storage backend.
The code can be found here:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/files/
My Java JSP application needs to store permanent files on the Tomcat web server. At the moment I save the files in the system's "/temp" folder, but this folder gets cleared from time to time. Further, the current solution is hard-coded, which makes it less flexible (e.g. when moving to another server).
I would like to know if there is a best practice for defining and accessing a permanent directory in this configuration. In detail, where is the best place to define the app file directory, and how would I access it from within my Java application? The goal of this setup would be to cause the least effort when (a) updating the application (i.e. placing a new war file), and (b) moving from one server and OS to another (e.g. Unix, Windows, macOS).
The research I have done on this topic so far revealed that the following would be solutions (possibly amongst others):
1.) Use of a custom subdirectory in the Tomcat installation directory.
What happens to the files if I deploy a new version to Tomcat via a war file?
Where do I define this directory so that it can be accessed from within my Java application?
2.) In a separate directory in the file system.
Which locations are good, and how can I determine a location without knowing the system?
Where do I define this directory so that it can be accessed from within my Java application?
Thank you for your advice!
Essentially, you are creating 'a database' in the form of some files. In the world of Java EE and servlet containers, the only really general approach to this is to configure such a resource via JNDI. Tomcat and other containers have no concept of 'a place for persistent storage for webapps'. If a webapp needs persistent storage, the location needs to be configured via JNDI, a -D system property, or something you tell the app at runtime (e.g. by posting a configuration value to it). There's no convention or standard practice you can borrow.
You can pick a file system pathname by convention and document that convention (e.g. /var/something on Linux, something similar on Windows), but you won't necessarily be aligned with what anyone else is doing.
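As a minimal sketch of the JNDI approach (the entry name app/dataDir and the example path are made up for illustration), you declare the directory in Tomcat's context configuration and look it up from your code:

import java.nio.file.Path;
import java.nio.file.Paths;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class AppStorage {
    // Declared in Tomcat's conf/context.xml (or the app's own context file), e.g.:
    // <Environment name="app/dataDir" value="/var/myapp/data" type="java.lang.String"/>
    public static Path dataDir() throws NamingException {
        String dir = (String) new InitialContext().lookup("java:comp/env/app/dataDir");
        return Paths.get(dir);
    }
}

Because the path lives in the server configuration rather than in the war, it survives redeployments and can differ per server and operating system.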
Our app is a sort-of self-service website builder for a particular industry. We need to be able to store the HTML and image files for each customer's site so that users can easily access and edit them. I'd really like to be able to store the files on S3, but potentially other places like Box.net, Google Docs, Dropbox, and Rackspace Cloud Files.
It would be easiest if there were some common file system API that I could use over these repositories, but unfortunately everything is proprietary, so I've got to implement something. FTP or SFTP is the obvious choice, but it's a lot of work. WebDAV would also be a pain.
Our server-side code is Java.
Please someone give me a magic solution which is fast, easy, standards-based, and will solve all my problems perfectly without any effort on my part. Please?
Not sure if this is exactly what you're looking for, but we built http://mover.io to address this kind of thing. We currently support 13 different endpoints, and we provide both a GUI and an API for interfacing with all these cloud storage providers.