How to download an ephemeral file from Heroku Cedar

I have a Rails project hosted on Heroku Cedar that does the following:
crawls daily news feeds and stores them in the database
lets me manually judge the feeds and classify them into categories
uses the judgments to build a classifier that automatically classifies new incoming feeds
iteratively improves the classification with additional judgments
The problem is that the classifier requires writing to a file. However, when I run the scripts on Heroku Cedar, the file they create is ephemeral and does not persist.
My questions are:
Is there a way to download the ephemeral file I created by running a script on Heroku?
What's a better way to handle a situation like this?

In short, no. Store any generated data in a persistent file or data store rather than on the dyno's filesystem. Look at pushing these files to S3 or a similar service.
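As a rough sketch of the "push it to S3" idea, here it is in Python rather than Ruby. It assumes the boto3 library and AWS credentials supplied through config vars. The function names, bucket and key are made up for illustration. The client is passed in so the same code can be tried with a stand-in object:

```python
def persist_file(local_path, bucket, key, client=None):
    """Copy a file from the ephemeral disk to S3 and return its S3 URI."""
    if client is None:
        # Assumption: boto3 is installed and credentials come from the environment.
        import boto3
        client = boto3.client("s3")
    # boto3 S3 client call: upload_file(Filename, Bucket, Key)
    client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


def restore_file(local_path, bucket, key, client=None):
    """Fetch the file back onto local disk, e.g. when a new dyno starts."""
    if client is None:
        import boto3
        client = boto3.client("s3")
    # boto3 S3 client call: download_file(Bucket, Key, Filename)
    client.download_file(bucket, key, local_path)
    return local_path
```

The script would call persist_file right after the classifier writes its file, and restore_file before the next run that needs it.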

Related

Heroku: where can I save files?

I have a Telegram bot that saves users' audio messages and photos in the repository and stores only their paths in the DB. I deployed it on PythonAnywhere and everything works.
But before that, I tried to deploy it on Heroku and ran into the problem that you can't store files there and everything has to go through databases.
Do I understand correctly that I need to create a field in the database that stores the file itself, or are there other ways?
You can use, for example, Cloudinary. They provide 25GB of bandwidth for free. The service is intended for pictures but works well with other files too, and it has a good API for many programming languages (not sponsored).
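For the "field in the database that stores the file itself" option from the question, a minimal sketch could look like this. It uses Python's built-in sqlite3 module only so that it runs anywhere. On Heroku the database would have to be the hosted Postgres, with a bytea column, because a SQLite file would sit on the same ephemeral disk. The table and function names are invented for the example:

```python
import sqlite3


def create_store(conn):
    # One row per uploaded file; the bytes live in the BLOB column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS media ("
        "id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL, "
        "filename TEXT NOT NULL, content BLOB NOT NULL)"
    )


def save_media(conn, user_id, filename, content):
    """Store the raw bytes and return the new row id."""
    cur = conn.execute(
        "INSERT INTO media (user_id, filename, content) VALUES (?, ?, ?)",
        (user_id, filename, sqlite3.Binary(content)),
    )
    conn.commit()
    return cur.lastrowid


def load_media(conn, media_id):
    """Return (filename, bytes) for a stored file, or None if it is missing."""
    row = conn.execute(
        "SELECT filename, content FROM media WHERE id = ?", (media_id,)
    ).fetchone()
    if row is None:
        return None
    return row[0], bytes(row[1])
```

This is fine for small files. For large audio or many photos, an external store with only the URL kept in the DB is usually the lighter option.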

Kubernetes: How to manage data with multiple replicas?

I am currently learning Kubernetes and I'm stuck on how to handle the following situation:
I have a Spring Boot application which handles files (photos, PDFs, etc.) uploaded by users; users can also download these files. This application also produces logs, which are spread across 6 different files. To make my life easier I decided to have a root directory containing 2 subdirectories (1 for users' data and 1 for logs), so the application works with only 1 directory (appData):
.appData
|__ usersData
|__ logsFile
I would like to use GKE (Google Kubernetes Engine) to deploy this application but I have these problems:
How to handle multiple replicas which will concurrently read and write data and logs in the appData directory?
Regarding logs, is it possible to have multiple Pods writing to the same file?
Say we have 3 replicas (Pod-A, Pod-B and Pod-C). If user A uploads a file handled by Pod-B, how will Pod-A and Pod-C discover this file if the same user later requests it?
Should each replica have its own volume? (I would like to avoid this situation, which seems the case when using StatefulSet)
Should I have only one replica? (using Kubernetes will be useless in that case)
Same questions about database replicas.
I use PostgreSQL and I have the same questions. If we have multiple replicas, and requests are randomly sent to them, how can I be sure that requesting data will return a result?
I know there are a lot of questions. Thanks a lot for your clarifications.
I'd use two separate solutions for logs and for shared files.
For logs, look at a log aggregator like fluentd.
For the shared file system, you want NFS. Take a look at this example: https://github.com/kubernetes/examples/tree/master/staging/volumes/nfs. The NFS server will use a persistent volume from GKE, Azure, or AWS. It's not cloud agnostic per se, but the only thing you change is your provisioner if you want to work in a different cloud.
You can use an NFS-backed persistent volume in GKE (Google Kubernetes Engine) to share files across pods.
https://cloud.google.com/filestore/docs/accessing-fileshares
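Once every replica mounts the same volume, the application side can stay simple. Below is a sketch in Python rather than Java, just to show the idea. It assumes the shared volume is mounted at the same path in every pod. The usersData layout follows the question, and the function names are made up. Writing to a temporary file and renaming it keeps readers on a local filesystem from seeing a half-written upload. How strictly that holds over NFS depends on the server and client, so treat it as an assumption:

```python
import os
import tempfile


def save_upload(shared_root, user_id, filename, data):
    """Write an upload under <shared_root>/usersData/<user_id>/ and return its path."""
    user_dir = os.path.join(shared_root, "usersData", str(user_id))
    os.makedirs(user_dir, exist_ok=True)
    # Drop any directory part so a crafted name cannot escape user_dir.
    final_path = os.path.join(user_dir, os.path.basename(filename))
    fd, tmp_path = tempfile.mkstemp(dir=user_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Rename into place once the content is complete.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def load_upload(shared_root, user_id, filename):
    """Any replica can read the file back through the same shared path."""
    path = os.path.join(
        shared_root, "usersData", str(user_id), os.path.basename(filename)
    )
    with open(path, "rb") as f:
        return f.read()
```

With this layout Pod-A and Pod-C find a file written by Pod-B simply because they all look in the same directory.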

How does GitHub store your repository files?

I'm feeling stupid, but I want to know how GitHub and Dropbox store user files, because I have a similar problem and I need to store users' project files.
Is it just a matter of storing project files somewhere on the server and referring to the location in a database field, or are there better methods?
Thanks.
GitHub uses Git to store repositories, and accesses those repos from their Ruby application. They used to do this with Grit, a Ruby library. Grit was written to implement Git in Ruby but has been replaced with Rugged. There are Git reimplementations in other languages like JGit for Java and Dulwich for Python. This presentation gives some details about how GitHub has changed over the years and is worth watching/browsing the slides.
If you wanted to store Git repositories, what you'd want to do is store them on a filesystem (or a cluster thereof), keep a pointer in your database to where each repository is located on that filesystem, and then use a library like Rugged or JGit or Dulwich to read stuff from the Git repository.
Dropbox stores files on Amazon's S3 service and then implements some wrappers around that for security and so on. This paper describes the protocol that Dropbox uses.
The actual question you've asked is how do you store user files. The simple answer is... on the filesystem. There are plugins for a lot of popular web frameworks for doing user file uploads and file management. Django has Django-Filer for instance. The difficulty you'll encounter in rolling your own file upload management system is building a sensible way to do permissions (so users can only download the files they are entitled to download), so it is worth looking into how the various framework plugins do it.
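To make the "pointer in the database plus permissions" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema (a files table with an owner and a shares table for extra readers) and all function names are assumptions made up for the example. It is not how any particular framework plugin does it:

```python
import sqlite3


def init_db(conn):
    # files: where each file lives on disk and who owns it.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "id INTEGER PRIMARY KEY, owner_id INTEGER NOT NULL, path TEXT NOT NULL)"
    )
    # shares: extra users who are allowed to download a file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shares ("
        "file_id INTEGER NOT NULL, user_id INTEGER NOT NULL, "
        "PRIMARY KEY (file_id, user_id))"
    )


def register_file(conn, owner_id, path):
    cur = conn.execute(
        "INSERT INTO files (owner_id, path) VALUES (?, ?)", (owner_id, path)
    )
    conn.commit()
    return cur.lastrowid


def share_file(conn, file_id, user_id):
    conn.execute(
        "INSERT OR IGNORE INTO shares (file_id, user_id) VALUES (?, ?)",
        (file_id, user_id),
    )
    conn.commit()


def path_for_download(conn, file_id, user_id):
    """Return the filesystem path only if the user owns the file or has a share."""
    row = conn.execute(
        "SELECT f.path FROM files f "
        "LEFT JOIN shares s ON s.file_id = f.id AND s.user_id = ? "
        "WHERE f.id = ? AND (f.owner_id = ? OR s.user_id IS NOT NULL)",
        (user_id, file_id, user_id),
    ).fetchone()
    return row[0] if row else None
```

The download handler then serves the file only when path_for_download returns a path, and answers with a "not found" or "forbidden" response otherwise.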

Are git submodules a good solution for storing a large DB dump?

I.e., we have a 20MB bzip2 sql file of development data that we'd like to have versioned along with our development code.
However, we don't want this file pulled down from the repo by default with every fresh clone/fetch.
One solution seems to be storing this large file in a separate repo and then linking to it with a submodule. Then, a developer would fetch the db file only when they need to retrieve and reset their development database. And then, when there's a schema change, the database file would be updated, committed to the external repo, and the submodule updated.
Is this a good development workflow? Or is there a better way of doing this?
EDIT: The uncompressed SQL dump is 360MB.
EDIT: GitHub says "no", don't do this:
Database dumps
Large SQL files do not play well with version control systems such as
Git. If you are looking to provide your developers with the most
recent production dataset, we recommend using Dropbox for sharing
files like these among your developers.
I ended up making a simple web server serve the schema dump directory instead of keeping the dumps in a repo. The repo where the dumps were stored grew really quickly because the dumps are large, and it was slowing people down just to clone it when they had to bring up new nodes.
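A "simple web server for the dump directory" can be very small. Here is a sketch with Python's standard http.server module (Python 3.7 or later). The function names are made up, and it has no authentication, so it is only suitable for an internal network:

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_dump_server(directory, host="127.0.0.1", port=0):
    """Build a server that exposes the files in `directory` over HTTP.

    port=0 lets the OS pick a free port; read it from server.server_address.
    """
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)


def serve_in_background(server):
    """Run the server in a daemon thread; call server.shutdown() to stop it."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

Developers then fetch a dump with any HTTP client only when they need to reset their database, and nothing large ends up in the repo history.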

local GAE datastore does not keep data after computer shuts down

On my local machine (i.e. http://localhost:8080/), I have entered data into my GAE datastore for some entity called Article. After turning off my computer and then restarting next day, I find the datastore empty: no entity. Is there a way to prevent this in the future?
How do I make a copy of the data in my local datastore? Also, will I be able to upload said data later into both localhost and production?
My model is ndb.
I am using Mac OS X and Python 2.7, if these matter.
I have experienced the same problem. Declaring the datastore path when running dev_appserver.py should fix it. These are the arguments I use when starting the dev_appserver:
python dev_appserver.py --high_replication --use_sqlite --datastore_path=myapp.datastore --blobstore_path=myapp_blobs
This will use sqlite and save the data in the file myapp.datastore. If you want to save it in a different directory, use --datastore_path=/path/to/myapp/myapp.datastore
I also use --blobstore_path to save my blobs in a specific directory. I have found that it is more reliable to declare which directory to save my blobs. Again, that is --blobstore_path=/path/to/myapp/blobs or whatever you would like.
Since declaring blob and datastore paths, I haven't lost any data locally. More info can be found in the App Engine documentation here:
https://developers.google.com/appengine/docs/python/tools/devserver#Using_the_Datastore
Data in the local datastore is preserved unless you start it with the -c flag to clear it, at least on the PC. You therefore probably have a different issue with temp files or permissions or something.
The local data is stored using a different method from the actual production servers, so I'm not sure you can make a direct backup as such. If you want to upload data to both the local and deployed servers, you can use the bulk loader described in the "Uploading data" documentation:
The bulk loader tool can upload and download data to and from your application's datastore. With just a little bit of setup, you can upload new datastore entities from CSV and XML files, and download entity data into CSV, XML, and text files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your app. You can customize the upload and download logic to use different kinds of files, or do other data processing.
So you can 'backup' by downloading the data to a file.
To load/pull data into the local development server just give it the local URL.
The datastore typically saves to disk when you shut down. If you turned off your computer without shutting down the server, I could see this happening.
