I'm trying to figure out the best way to store user-uploaded files in a file system. The files range from personal files to wiki files. Of course, the DB will point to those files in some way, which I have yet to figure out.
Basic Requirements:
Fairly Decent Security so People Can't Guess Filenames (Picture001.jpg, Picture002.jpg, Music001.mp3 is a big no-no)
Easily Backed Up & Mirrorable (I prefer a way so I don't have to copy the entire HDD every single time I want to back up. I like the idea of backing up just the newest items, but I'm flexible with the options here.)
Scalable to millions of files on multiple servers if needed.
One technique is to store the data in files named after the hash (SHA1) of their contents. This is not easily guessable, any backup program should be able to handle it, and it is easily sharded (by storing hashes starting with 0 on one machine, hashes starting with 1 on the next, etc.).
The database would contain a mapping between the user's assigned name and the SHA1 hash of the contents.
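For illustration, here is a minimal C# sketch of that scheme (the HashStore name and the storageRoot layout are illustrative, not from any particular library):

using System;
using System.IO;
using System.Security.Cryptography;

static class HashStore
{
    // Stores the contents under <root>\<first hex char>\<full hash> and returns
    // the hash, which the database can map to the user's assigned file name.
    public static string Save(string storageRoot, Stream contents)
    {
        string hash;
        using (SHA1 sha1 = SHA1.Create())
        {
            hash = BitConverter.ToString(sha1.ComputeHash(contents)).Replace("-", "").ToLowerInvariant();
        }
        string dir = Path.Combine(storageRoot, hash.Substring(0, 1)); // shard by first hex char
        Directory.CreateDirectory(dir);
        contents.Position = 0; // rewind after hashing (assumes a seekable stream)
        using (FileStream outFile = File.Create(Path.Combine(dir, hash)))
        {
            contents.CopyTo(outFile);
        }
        return hash;
    }
}

A nice side effect is that identical uploads deduplicate automatically, since identical contents hash to the same file name.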
GUIDs for filenames, and an automatically expanding folder hierarchy with no more than a couple of thousand files/folders in each folder. Backing up new files is done by backing up new folders.
You haven't indicated what environment and/or programming language you are using, but here's a C# / .NET / Windows example:
using System;
using System.IO;
using System.Xml.Serialization;

/// <summary>
/// Class for generating storage structure and file names for document storage.
/// Copyright (c) 2008, Huagati Systems Co.,Ltd.
/// </summary>
public class DocumentStorage
{
    private static StorageDirectory _StorageDirectory = null;

    public static string GetNewUNCPath()
    {
        string storageDirectory = GetStorageDirectory();
        if (!storageDirectory.EndsWith("\\"))
        {
            storageDirectory += "\\";
        }
        return storageDirectory + GuidEx.NewSeqGuid().ToString() + ".data";
    }

    public static void SaveDocumentInfo(string documentPath, Document documentInfo)
    {
        //the FileStream object doesn't like NTFS streams so this is disabled for now...
        return;

        //stores a document object in a separate "docinfo" stream attached to the file it belongs to
        //XmlSerializer ser = new XmlSerializer(typeof(Document));
        //string infoStream = documentPath + ":docinfo";
        //FileStream fs = new FileStream(infoStream, FileMode.Create);
        //ser.Serialize(fs, documentInfo);
        //fs.Flush();
        //fs.Close();
    }

    private static string GetStorageDirectory()
    {
        string storageRoot = ConfigSettings.DocumentStorageRoot;
        if (!storageRoot.EndsWith("\\"))
        {
            storageRoot += "\\";
        }

        //get storage directory if not set
        if (_StorageDirectory == null)
        {
            _StorageDirectory = new StorageDirectory();
            lock (_StorageDirectory)
            {
                string path = ConfigSettings.ReadSettingString("CurrentDocumentStoragePath");
                if (path == null)
                {
                    //no storage tree created yet, create first set of subfolders
                    path = CreateStorageDirectory(storageRoot, 1);
                    _StorageDirectory.FullPath = path.Substring(storageRoot.Length);
                    ConfigSettings.WriteSettingString("CurrentDocumentStoragePath", _StorageDirectory.FullPath);
                }
                else
                {
                    _StorageDirectory.FullPath = path;
                }
            }
        }

        int fileCount = (new DirectoryInfo(storageRoot + _StorageDirectory.FullPath)).GetFiles().Length;
        if (fileCount > ConfigSettings.FolderContentLimitFiles)
        {
            //if the directory has exceeded the number of files per directory, create a new one...
            lock (_StorageDirectory)
            {
                string path = GetNewStorageFolder(storageRoot + _StorageDirectory.FullPath, ConfigSettings.DocumentStorageDepth);
                _StorageDirectory.FullPath = path.Substring(storageRoot.Length);
                ConfigSettings.WriteSettingString("CurrentDocumentStoragePath", _StorageDirectory.FullPath);
            }
        }

        return storageRoot + _StorageDirectory.FullPath;
    }

    private static string GetNewStorageFolder(string currentPath, int currentDepth)
    {
        //walk up the tree until a parent folder with room for another subfolder is found
        string parentFolder = currentPath.Substring(0, currentPath.LastIndexOf("\\"));
        int parentFolderFolderCount = (new DirectoryInfo(parentFolder)).GetDirectories().Length;
        if (parentFolderFolderCount < ConfigSettings.FolderContentLimitFolders)
        {
            return CreateStorageDirectory(parentFolder, currentDepth);
        }
        else
        {
            return GetNewStorageFolder(parentFolder, currentDepth - 1);
        }
    }

    private static string CreateStorageDirectory(string currentDir, int currentDepth)
    {
        //creates a chain of sequential-GUID-named folders down to the configured depth
        string storageDirectory = null;
        string directoryName = GuidEx.NewSeqGuid().ToString();
        if (!currentDir.EndsWith("\\"))
        {
            currentDir += "\\";
        }
        Directory.CreateDirectory(currentDir + directoryName);
        if (currentDepth < ConfigSettings.DocumentStorageDepth)
        {
            storageDirectory = CreateStorageDirectory(currentDir + directoryName, currentDepth + 1);
        }
        else
        {
            storageDirectory = currentDir + directoryName;
        }
        return storageDirectory;
    }

    private class StorageDirectory
    {
        public string DirectoryName { get; set; }
        public StorageDirectory ParentDirectory { get; set; }

        public string FullPath
        {
            get
            {
                if (ParentDirectory != null)
                {
                    return ParentDirectory.FullPath + "\\" + DirectoryName;
                }
                else
                {
                    return DirectoryName;
                }
            }
            set
            {
                if (value.Contains("\\"))
                {
                    DirectoryName = value.Substring(value.LastIndexOf("\\") + 1);
                    ParentDirectory = new StorageDirectory { FullPath = value.Substring(0, value.LastIndexOf("\\")) };
                }
                else
                {
                    DirectoryName = value;
                }
            }
        }
    }
}
SHA1 hash of the filename + a salt (or, if you want, of the file contents. That makes detecting duplicate files easier, but also puts a LOT more stress on the server). This may need some tweaking to be unique (e.g. append the uploading user's ID or a timestamp), and the salt is there to make it not guessable.
Folder structure is then by parts of the hash.
For example, if the hash is "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12" then the folders could be:
/2
/2/2f/
/2/2f/2fd/
/2/2f/2fd/2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
This is to prevent large folders (some operating systems have trouble enumerating folders with a million files), hence the subfolders for parts of the hash. How many levels? That depends on how many files you expect, but 2 or 3 is usually reasonable.
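As a sketch, a small C# helper along these lines could derive the path from the hash (the class and method names are just for illustration):

using System.IO;

static class HashPaths
{
    // Builds root\2\2f\2fd\2fd4e1c6... from the hash, one extra character per level.
    public static string HashToPath(string root, string hash, int levels)
    {
        string dir = root;
        for (int i = 1; i <= levels; i++)
        {
            dir = Path.Combine(dir, hash.Substring(0, i));
        }
        return Path.Combine(dir, hash); // the full hash is the file name
    }
}

For example, HashToPath(@"D:\files", "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12", 3) yields the nested path shown above.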
Just in terms of one aspect of your question (security): the best way to safely store uploaded files in a filesystem is to ensure the uploaded files are out of the webroot (i.e., you can't access them directly via a URL - you have to go through a script).
This gives you complete control over what people can download (security) and allows for things such as logging. Of course, you have to ensure the script itself is secure, but it means only the people you allow will be able to download certain files.
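As a rough C# / ASP.NET sketch of such a script (the id lookup and the permission check below are hypothetical placeholders for your own logic):

using System.Web;

public class DownloadHandler : IHttpHandler
{
    // Serves a file stored outside the webroot, but only after an authorization check.
    public void ProcessRequest(HttpContext context)
    {
        string id = context.Request.QueryString["id"];
        string path = LookUpPath(id); // hypothetical: map the opaque id to a path via your DB
        if (path == null || !MayDownload(context, id)) // hypothetical permission check
        {
            context.Response.StatusCode = 404; // don't reveal whether the file exists
            return;
        }
        // log the download here if needed
        context.Response.ContentType = "application/octet-stream";
        context.Response.TransmitFile(path); // streams the file without buffering it in memory
    }

    public bool IsReusable { get { return true; } }

    private string LookUpPath(string id) { return null; /* your DB lookup */ }
    private bool MayDownload(HttpContext context, string id) { return false; /* your auth logic */ }
}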
Expanding on Phill Sacre's answer, another aspect of security is to use a separate domain name for uploaded files (for instance, Wikipedia uses upload.wikimedia.org), and make sure that domain cannot read any of your site's cookies. This prevents people from uploading an HTML file with a script to steal your users' session cookies (simply setting the Content-Type header isn't enough, because some browsers are known to ignore it and guess based on the file's contents; such a script can also be embedded in other kinds of files, so it's not trivial to check for HTML and disallow it).
I am trying to get an offline backup function working on Android 12. It has worked for years on previous versions of Android (6 and 8). It is required because the size of the backup can often exceed 25 MB. I am using a Samsung A7 Lite for this testing to ensure Android 12 compliance.

Essentially, the function first creates a backup folder in the Downloads folder if it does not exist, and then writes a backup file to that folder. All goes well, and I can repeat the function any number of times without a problem. It retains father and grandfather versions for security. However, if I try to use the same function where there are existing files the following day, I am presented with a java.io.FileNotFoundException: open failed: EACCES (Permission denied).

This whole situation appears very illogical and does not appear to follow the documentation on accessing the Downloads folder. If I manually delete the backup file from the previous day, the process succeeds; similarly, if I delete the backup directory within the Downloads folder, the backup proceeds successfully. The app asks the user for the appropriate permissions, which I believe are read and write external storage. Can anybody identify what I am doing wrong in this environment?
The code is below.
String path = "";
// if no external, set to download
if (path.equals("")) {
    File systemPath = Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS);
    path = systemPath.getAbsolutePath();
}
// set up backup subdirectory
path = path + "/backup";
// check if path exists
File backupDir = new File(path);
if (!backupDir.exists()) {
    try {
        backupDir.mkdirs();
        MediaScannerConnection.scanFile(this, new String[]{backupDir.getAbsolutePath()}, null, null);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
// first get rid of old backup files, leaving at least 2 older versions
File backupFile = new File(path, "backup3.bkp");
if (backupFile.exists())
    backupFile.delete();
for (int i = 3; i > 1; i--) {
    File renameBackupFile = new File(path, "backup" + i + ".bkp");
    File existBackupFile = null;
    if (i == 2)
        existBackupFile = new File(path, "backup.bkp");
    else
        existBackupFile = new File(path, "backup" + (i - 1) + ".bkp");
    if (existBackupFile.exists()) {
        try {
            existBackupFile.renameTo(renameBackupFile);
        } catch (Exception e) {
            String message = e.toString();
        }
    }
}
// create a new backup
String fileName = "backup.bkp";
String backup = path + "/" + fileName;
FileInputStream dataBaseFile = new FileInputStream(DB_PATH);
File newBackupFile = new File(backup);
newBackupFile.createNewFile();
FileOutputStream backupStream = new FileOutputStream(newBackupFile);
// transfer bytes from the input file to the output file
byte[] buffer = new byte[1024];
int length;
while ((length = dataBaseFile.read(buffer)) > 0) {
    backupStream.write(buffer, 0, length);
}
// close the streams
backupStream.flush();
backupStream.close();
dataBaseFile.close();
MediaScannerConnection.scanFile(this, new String[]{newBackupFile.getAbsolutePath()}, null, null);
To merge Storage files in Codename One, I came up with this solution:
/**
 * Merges the given list of Storage files in the output Storage file.
 * @param toBeMerged
 * @param output
 * @throws IOException
 */
public static synchronized void mergeStorageFiles(List<String> toBeMerged, String output) throws IOException {
    if (toBeMerged.contains(output)) {
        throw new IllegalArgumentException("The output file cannot be contained in the toBeMerged list of input files.");
    }
    // Note: the temporary file used for merging is placed in the FileSystemStorage because it offers the method
    // openOutputStream(String file, int offset) that allows appending to a stream. Storage doesn't have such a method.
    long writtenBytes = 0;
    String tempFile = FileSystemStorage.getInstance().getAppHomePath() + "/tempFileUsedInMerge";
    for (String partialFile : toBeMerged) {
        InputStream in = Storage.getInstance().createInputStream(partialFile);
        OutputStream out = FileSystemStorage.getInstance().openOutputStream(tempFile, (int) writtenBytes);
        Util.copy(in, out);
        writtenBytes = FileSystemStorage.getInstance().getLength(tempFile);
    }
    Util.copy(FileSystemStorage.getInstance().openInputStream(tempFile), Storage.getInstance().createOutputStream(output));
    FileSystemStorage.getInstance().delete(tempFile);
}
This solution is based on the API FileSystemStorage.openOutputStream(String file, int offset), which is the only API I found that allows appending the content of one file to another.
Are there other APIs that can be used to append or merge files?
Thank you
Since you end up copying everything to a Storage entry, I don't see the value of using FileSystemStorage as an intermediate merging tool.
The only reason I can think of is integrity of the output file (e.g. if a failure happens while writing), but that can happen here too. You can guarantee integrity by setting a flag, e.g. creating a file called "writeLock" and deleting it when the write has finished successfully.
To be clear, I would copy like this, which is simpler/faster:
try (OutputStream out = Storage.getInstance().createOutputStream(output)) {
    for (String partialFile : toBeMerged) {
        try (InputStream in = Storage.getInstance().createInputStream(partialFile)) {
            Util.copyNoClose(in, out, 8192);
        }
    }
}
I have tried doing this by encrypting individual files, but I have a lot of data (~20 GB), so it would take a lot of time: in my test it took 2.28 minutes to encrypt a single file of size 80 MB.
Is there a quicker way to password-protect that would apply to any file (text/binary/multimedia)?
If you are just trying to hide the file from others, you can try to encrypt the file path instead of encrypting the whole huge file.
For the path you mentioned, text/binary/multimedia, you can try to encrypt it with a method like this:
private static String getEncryptedPath(String filePath) {
    String[] tokens = filePath.split("/");
    List<String> tList = new ArrayList<>();
    for (int i = 0; i < tokens.length; i++) {
        tList.add(Hashing.md5().newHasher() // com.google.common.hash.Hashing
                .putString(tokens[i] + filePath, StandardCharsets.UTF_8).hash().toString()
                .substring(2 * i, 2 * i + 5)); // to make it harder to reverse, add your own secret here
    }
    return String.join("/", tList);
}
which produces an encrypted path such as:
72b12/9cbb3/4a5f3
Once you know the real path text/binary/multimedia, any time you want to access the file you can just run it through this method again to get the encrypted path 72b12/9cbb3/4a5f3.
I need to search a drive (C:, D:, etc.) for a particular file type (extension like .xml, .csv, .xls). How do I perform a recursive search to loop through all directories and inner directories and return the full path of where the file(s) are? Or where can I get information on this?
VB.NET or C#
Thanks
Edit ~ I am running into some errors, like "unable to access System Volume Information, access denied", etc. Does anyone know where I can see some sample code implementing a file search? I just need to search a selected drive and return the full path of the file type for all the files found.
System.IO.Directory.GetFiles(@"c:\", "*.xml", SearchOption.AllDirectories);
How about this? It avoids the exception often thrown by the in-built recursive search (i.e. you get access-denied to a single folder, and your whole search dies), and is lazily evaluated (i.e. it returns results as soon as it finds them, rather than buffering 2000 results). The lazy behaviour lets you build responsive UIs etc, and also works well with LINQ (especially First(), Take(), etc).
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;

static class Program { // formatted for vertical space
    static void Main() {
        foreach (string match in Search("c:\\", "*.xml")) {
            Console.WriteLine(match);
        }
    }
    static IEnumerable<string> Search(string root, string searchPattern) {
        Queue<string> dirs = new Queue<string>();
        dirs.Enqueue(root);
        while (dirs.Count > 0) {
            string dir = dirs.Dequeue();
            // files
            string[] paths = null;
            try {
                paths = Directory.GetFiles(dir, searchPattern);
            } catch { } // swallow
            if (paths != null && paths.Length > 0) {
                foreach (string file in paths) {
                    yield return file;
                }
            }
            // sub-directories
            paths = null;
            try {
                paths = Directory.GetDirectories(dir);
            } catch { } // swallow
            if (paths != null && paths.Length > 0) {
                foreach (string subDir in paths) {
                    dirs.Enqueue(subDir);
                }
            }
        }
    }
}
It looks like the recls library (the name stands for recursive ls) now has a pure .NET implementation; I just read about it in Dr. Dobb's.
It would be used as:
using Recls;
using System;

static class Program { // formatted for vertical space
    static void Main() {
        foreach (IEntry e in FileSearcher.Search(@"c:\", "*.xml|*.csv|*.xls")) {
            Console.WriteLine(e.Path);
        }
    }
}
How do I get the directory of a file (e.g. C:\myfolder\subfolder\mydoc.pdf)? I also want to add up the size of the subfolders, and finally the main folder size. This is for a .NET CLR assembly that I need to integrate with SQL Server 2005 for an SSRS report.
You can use Path.GetDirectoryName to get only the directory path of the file:
using System.IO;

string directoryName = Path.GetDirectoryName(@"C:\myfolder\subfolder\mydoc.pdf");
// directoryName now contains "C:\myfolder\subfolder"
For calculating the directory and subdirectory size, you can do something like this:
public static long DirSize(DirectoryInfo d)
{
    long Size = 0;
    // Add file sizes.
    FileInfo[] fis = d.GetFiles();
    foreach (FileInfo fi in fis)
    {
        Size += fi.Length;
    }
    // Add subdirectory sizes.
    DirectoryInfo[] dis = d.GetDirectories();
    foreach (DirectoryInfo di in dis)
    {
        Size += DirSize(di);
    }
    return Size;
}