I'm working on a piece of software that stores files in a file system, as well as references to those files in a database. Querying the uploaded files can thus be done in the database without having to access the file system. From what I've read in other posts, most people say it's better to use a file system for file storage rather than storing binary data directly in a database as a BLOB.
So now I'm trying to understand the best way to set this up so that both the database and the file system stay in sync and I don't end up with references to files that don't exist, or files taking up space in the file system that aren't referenced. Here are a couple of options that I'm considering.
Option 1: Add File Reference First
//Adds a reference to a file in the database
database.AddFileRef("newfile.txt");
//Stores the file in the file system
fileStorage.SaveFile("newfile.txt",dataStream);
This option would be problematic because the reference to the file is added before the actual file, so another user may end up trying to download a file before it is actually stored in the system. On the other hand, since the reference to the file is created beforehand, the primary key value could be used when storing the file.
Option 2: Store File First
//Stores the file
fileStorage.SaveFile("newfile.txt",dataStream);
//Adds a reference to the file in the database
//fails if the referenced file does not exist in the file system
database.AddFileRef("newfile.txt");
This option is better, but it would make it possible for someone to upload a file to the system that is never referenced. This could be remedied with a "Purge" or "CleanUpFileSystem" function that deletes any unreferenced files. This option also wouldn't allow the file to be stored using the primary key value from the database.
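A minimal sketch of what such a purge function could look like, written in Python for brevity. The function name, the `referenced_names` set (assumed to come from a database query that is not shown) and the `min_age_seconds` grace period are all assumptions of this sketch, not part of any existing API. The grace period is there because, with option 2, a freshly stored file may simply not have its reference yet.

```python
import time
from pathlib import Path


def purge_unreferenced(storage_dir, referenced_names, min_age_seconds=3600.0):
    """Delete files in storage_dir that have no reference in the database.

    referenced_names is the set of file names the database knows about
    (how it is queried is left out of this sketch). Files younger than
    min_age_seconds are skipped, since their reference may not exist yet.
    """
    now = time.time()
    removed = []
    for path in Path(storage_dir).iterdir():
        if not path.is_file() or path.name in referenced_names:
            continue
        if now - path.stat().st_mtime < min_age_seconds:
            continue  # possibly an upload that is still being registered
        path.unlink()
        removed.append(path.name)
    return sorted(removed)
```

Running this from a scheduled job keeps the unreferenced files bounded without having to make the two writes atomic.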
Option 3: Pending Status
//Adds a pending file reference to database
//pending files would be ignored by others
database.AddFileRef("newfile.txt");
//Stores the file, fails if there is no
//matching pending file reference in the database
fileStorage.SaveFile("newfile.txt",dataStream);
//marks the file reference as committed after file is uploaded
database.CommitFileRef("newfile.txt");
This option allows the primary key to be created before the file is uploaded, but also prevents other users from obtaining a reference to a file before it is uploaded. However, it would be possible for a file to never be uploaded, leaving its file reference stuck as pending. That said, it would be fairly trivial to purge pending references from the database.
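To make the pending-status flow concrete, here is a rough Python sketch using the standard `sqlite3` module. The table name `file_refs`, the `status` column and all function names are hypothetical. Storing the file under its primary key is one possible choice, not a requirement.

```python
import sqlite3
from pathlib import Path


def init_db(conn):
    # Hypothetical schema: every reference starts out as 'pending'.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_refs ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )


def upload(conn, storage_dir, name, data):
    # 1. Add a pending reference; this reserves the primary key.
    cur = conn.execute("INSERT INTO file_refs (name) VALUES (?)", (name,))
    conn.commit()
    file_id = cur.lastrowid
    # 2. Store the file, named after the primary key.
    (Path(storage_dir) / str(file_id)).write_bytes(data)
    # 3. Mark the reference as committed once the file is on disk.
    conn.execute("UPDATE file_refs SET status = 'committed' WHERE id = ?", (file_id,))
    conn.commit()
    return file_id


def visible_files(conn):
    # Other users only ever see committed references.
    rows = conn.execute(
        "SELECT name FROM file_refs WHERE status = 'committed' ORDER BY id"
    )
    return [row[0] for row in rows]


def purge_pending(conn):
    # Remove references whose upload never completed.
    cur = conn.execute("DELETE FROM file_refs WHERE status = 'pending'")
    conn.commit()
    return cur.rowcount
```

A real purge would probably only delete pending rows older than some timeout, so that uploads still in progress are left alone.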
I'm leaning toward option 2, because it's simple, and I don't have to worry about users trying to request files before they are uploaded. Storage is cheap, so it's not the end of the world if I end up with some unreferenced files taking up space. But this also seems like a common problem, and I'd like to hear how others have solved it or other considerations I should be making.
I want to propose another option. Make the filename always equal to the hash of its contents. Then you can safely write any content at all times provided that you do it before you add a reference to it elsewhere.
As contents never change there is never a synchronization problem.
This gives you deduplication for free. Deletes become harder though. I recommend a nightly garbage collection process.
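A small Python sketch of this idea, assuming SHA-256 as the hash and a flat storage directory (both are choices made for this example). The function name is made up. The temporary file plus rename is an extra precaution so that a half-written file never appears under its final name.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_by_hash(storage_dir, data):
    """Store data under the hex digest of its contents and return that name."""
    storage_dir = Path(storage_dir)
    digest = hashlib.sha256(data).hexdigest()
    target = storage_dir / digest
    if not target.exists():
        # Write to a temporary file first, then move it into place.
        fd, tmp_name = tempfile.mkstemp(dir=storage_dir)
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_name, target)
    return digest
```

The returned digest is what you would then store as the reference in the database, after the write has succeeded.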
What is the real use of the database? If it's just a list of files, I don't think you need it at all, and not having it saves you the hassle of synchronising.
If you are convinced you need it, then options 1 and 2 are completely identical from a technical point of view: the two resources can get out of sync, and you need a regular process to consolidate them again. So here you should choose the option that suits the application best.
Option 3 has no advantage whatsoever, but uses more resources.
Note that using hashes, as suggested by usr, bears a theoretical risk of collision. And you'd also need a periodical consolidation process, as for options 1 and 2.
Another question is how you deal with partial uploads and uploads in progress. Here option 2 could be of use, but you could also use a second "flag" file that is created before the upload starts and deleted when the upload is done. This would help you determine which uploads have been aborted.
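The flag-file idea could look roughly like this in Python. The `.uploading` suffix and both function names are invented for the sketch. Note that a leftover flag cannot by itself distinguish an aborted upload from one that is still running, so in practice you would also look at the flag's age.

```python
from pathlib import Path

FLAG_SUFFIX = ".uploading"  # hypothetical marker extension


def save_with_flag(storage_dir, name, chunks):
    """Write chunks to storage_dir/name, guarded by a flag file."""
    storage_dir = Path(storage_dir)
    flag = storage_dir / (name + FLAG_SUFFIX)
    flag.touch()  # created before the upload starts
    with open(storage_dir / name, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
    flag.unlink()  # only reached if every chunk was written


def unfinished_uploads(storage_dir):
    """Names of files whose flag is still present (aborted or in progress)."""
    flags = Path(storage_dir).glob("*" + FLAG_SUFFIX)
    return sorted(p.name[: -len(FLAG_SUFFIX)] for p in flags)
```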
To remedy the drawback you mentioned of option 1, I use something like fileStorage.FileExists("newfile.txt"); and filter out the results for which it returns a negative.
In Python lingo:
import os
op = os.path
filter(lambda ref: op.exists(ref.path()), database.AllRefs())
Related
We’ve got time-stamped directories containing text files, stored in HDFS.
We can regularly get new files added, so we’re using a FileSource (Flink 1.14.4, and a streaming job) with a monitoring duration, so that it continuously picks up any new files.
The challenge is that we need to include the parent directory’s timestamp in the output, for doing time-window joins of this enrichment data with another stream.
Previously I could extend the input format to extract path information, and emit a Tuple2<LongWritable, Text> (see my SO answer to a question about doing that).
But with the new FileSource architecture, I’m really not sure if it’s possible, or if so, the right way to go about doing it.
I’ve wandered through the source code (FileSource, AbstractFileSource, SourceReader, FileSourceReader, FileSourceSplit, ad nauseam) but haven’t seen any happy path to making that all work.
There might be a way using some really ugly hacks to TextLineFormat, where it would reverse engineer the FSDataInputStream to try to find information about the original file, but that feels very fragile.
Any suggestions?
The situation
I use LabVIEW 2012 on Windows 7.
My test result data is written in text files. First, information about the test is written to the file (product type, test type, test conditions, etc.), and after that the logged data is written each second.
All data files are stored in folders sorted by date, and the file names contain some info about the test.
I have years' worth of data files, and my search function currently only works on the file names (opening each file to look for search terms takes too much time).
The goal
To write metadata (additional properties like Word files can have) with the text files so that I can implement a search function to quickly find the file that I need
I found here the way to write/read metadata for images, but I need it for text files or something similar.
You would need to be writing to data files that support metadata to begin with (such as the LabVIEW TDMS or datalog file formats). In a similar situation, I would simply use a separate file with the same name but a different extension. Then you can index those metadata files, and if you want the data you just swap the metadata file's extension for the data file's and you are good to go.
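The sidecar-file approach is language-independent. Here is a rough illustration in Python rather than LabVIEW, assuming JSON as the metadata format, `.meta` as the sidecar extension and `.txt` as the data extension (all three are assumptions of this sketch, as are the function names).

```python
import json
from pathlib import Path


def write_sidecar(data_file, metadata):
    """Write metadata next to data_file, same name but a .meta extension."""
    sidecar = Path(data_file).with_suffix(".meta")
    sidecar.write_text(json.dumps(metadata), encoding="utf-8")
    return sidecar


def find_data_files(root, key, value):
    """Search only the small sidecar files, then map back to the data files."""
    matches = []
    for sidecar in sorted(Path(root).rglob("*.meta")):
        meta = json.loads(sidecar.read_text(encoding="utf-8"))
        if meta.get(key) == value:
            # Swap the extension back to get the data file.
            matches.append(sidecar.with_suffix(".txt"))
    return matches
```

Because the sidecars are tiny compared with years of logged data, scanning them is much cheaper than opening every data file.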
I would not bother with files and would use a database for results logging. It may not be what you are willing to do, but it is the ultimate solution to the search problem, and it opens up a lot of data analytics possibilities.
The metadata in Word files is from a feature called "Alternate Data Streams", which is actually a function of NTFS. You can learn more about it here.
I can't say I've ever used this feature. I don't think there is a nice API for LabVIEW, but one could certainly be made. With some research you should be able to play around with this feature and see if it really makes finding files any easier. My understanding is that the data can be lost if transferred over the network or onto a non-NTFS thumbdrive.
I would like to ask for your opinion and advice.
In my application I need to store files uploaded by users so that they can be imported into a database. A file could be XML or Excel (.xlsx), and I guess the maximum size is about 500 kB per file.
The files need to be stored because the import into the database is not done immediately, and also for backup.
I am considering a scenario with thousands (tens of thousands) of users.
Scenario: one user can upload many files to many categories. This means that a user can upload file_1 to category_1 and file_2 to category_2, but also file_3 to category_2_1 (a subcategory of category_2).
Generally, there is some kind of category tree, and a user can upload many files to many nodes.
Because of the import application, the filename will always contain:
user_code_category_code_timestamp
My problem is that I do not know what the best way to store those files is.
Should I have one directory per user -> one directory per category -> relevant files?
Should I have one directory per user -> all user files?
Should I have one root directory -> all users and all files?
By "the best way" I mean that there must be an import application that lists the relevant files for a given category and a given user. As I wrote above, there are many ways, so I am a bit confused.
What else should I consider? File system limitations?
I hope you understand the problem.
Thank you.
Are you using some kind of a framework? Best case is you use a plugin for it.
The standard basic solution for storing files is to have one directory for all files (images, for example). When you save a file, you change its name so that names do not collide in the directory. You keep all other data in a DB table.
From that base - you can improve and change the solution depending on the business logic.
You might want to restrict access to the files, you might want to put them in a tree directory if you need browsing in them.
And so on...
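As a rough sketch of that basic solution in Python: each upload gets a generated name (a random UUID here, which is just one way to avoid collisions), and the original name is kept in a table. The `uploads` table, its columns and the function names are made up for this example.

```python
import sqlite3
import uuid
from pathlib import Path


def create_table(conn):
    # Hypothetical table mapping the stored name to the user's original name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS uploads ("
        "stored_name TEXT PRIMARY KEY, original_name TEXT NOT NULL)"
    )


def save_upload(conn, storage_dir, original_name, data):
    # Generate a collision-free name but keep the original extension.
    stored_name = uuid.uuid4().hex + Path(original_name).suffix
    (Path(storage_dir) / stored_name).write_bytes(data)
    conn.execute(
        "INSERT INTO uploads (stored_name, original_name) VALUES (?, ?)",
        (stored_name, original_name),
    )
    conn.commit()
    return stored_name
```

Extra columns such as user code, category code or timestamp would go into the same table, which is what lets the import application list the relevant files without walking directories.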
Thank you for this question! It was difficult to find answers for this online, but in my case I have potentially 10k's of images/pdfs/files/etc. and it seems that using hashes and saving to one location directory is ideal and makes it much less complicated.
Useful things to think about:
1. Add some additional meta data (you can do this in S3 buckets)
2. I would make sure you have the option to resize images if relevant such as ?w=200&h=200.
3. Perhaps save a file name that can be displayed if the user downloads it so it doesn't give them some weird hash.
4. If you save based on a hash that works off of the current time, you can generate non-duplicating hashes.
5. Trying to view all the files at once would hurt performance, but when your app requests only one file at a time based on an endpoint, this shouldn't be an issue.
On my web server, I have two folders showcase and thumbnail to store images and their thumbnails, respectively. A database fetches these images to display them on a page.
The table column in the showcase table is s_image which stores something like /showcase/urlcode.jpg.
I heard that after around 10-20k files in a folder, it starts to slow down. So should I be creating a second folder, showcase2 once it's filled up? Is there some kind of automatic creation that can do this for me?
I appreciate your input.
The filesystem you're using matters when you put tens of thousands of files in a single directory. ext4 on Linux scales up better than NTFS on Windows.
Windows has a compatibility mode for 8.3 file names (the old-timey DOS file name standard). This causes every file name longer than abcdefgh.ext to have an alias created for it something like abcd~123.ext. This is slow, and gets very slow when you have lots of files in a single directory. You can turn off this ancient compatibility behavior. See here. https://support.microsoft.com/en-us/kb/121007. If you do turn it off, it's a quick fix for an immediate performance problem.
But, 20,000 files in one directory is a large number. Your best bet, on any sort of file system, is automatically creating subdirectories in your file system based on something that changes. One strategy is to create subdirectories based on year / month, for example
/showcase/2015/08/image1.jpg (for images uploaded this month)
/showcase/2015/09/image7.jpg (for images next month)
It's obviously no problem to store those longer file names in your s_image column in your table.
Or, if you have some system to the naming of the images, exploit it to create subdirectories. For example, if your images are named
cat0001.jpg
cat0002.jpg
...
cat0456.jpg
...
cat0987.jpg
You can create subdirectories based on, say, the first five letters of the names
/showcase/cat00/cat0001.jpg
/showcase/cat00/cat0002.jpg
...
/showcase/cat04/cat0456.jpg
...
/showcase/cat09/cat0987.jpg
If you do this, it's much better to keep the image names intact rather than make them shorter (for example, don't do this /showcase/cat09/87.jpg) because if you have to search for a particular image by name you want the full name there.
As far as I know, there's nothing automatic in a file system to do this for you. But it's not hard to do in your program.
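For instance, both strategies above fit in a few lines. This is a Python sketch with made-up function names; in a PHP or other web stack the same logic applies. Each function creates the subdirectory if needed and returns the full path to store in the s_image column.

```python
from datetime import date
from pathlib import Path


def dated_path(root, filename, uploaded):
    """Place the file under root/YYYY/MM/, based on its upload date."""
    folder = Path(root) / f"{uploaded.year:04d}" / f"{uploaded.month:02d}"
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


def prefixed_path(root, filename, prefix_len=5):
    """Place the file under a folder named after the first characters of its name."""
    folder = Path(root) / filename[:prefix_len]
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename  # the full file name is kept intact
```

With a root of /showcase, `dated_path(root, "image1.jpg", date(2015, 8, 14))` gives /showcase/2015/08/image1.jpg, and `prefixed_path(root, "cat0456.jpg")` gives /showcase/cat04/cat0456.jpg.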
I am going to convert a text file into an SQLite db; I am concerned about these points before putting any effort into writing code for it:
Will the text file and its corresponding SQLite db be the same size?
Would SQLite take less space than the text file?
Or is the text file db the one that takes the least space?
"Hardware is cheap" - I'd strongly recommend not worrying about size differences, which will likely be insignificant anyway, and instead pick whichever solution best meets the rest of your needs. A text file can work just fine for simple projects, but a database has many more features that can help you organize, backup, and query your data much more efficiently and robustly.
For a more in-depth look at the pros and cons of both options, check out: database vs. flat files
Some things to keep in mind:
(NOTE about this answer: "files" here refers to internal/external storage, not SharedPrefs)
SQL:
Databases have overhead, which takes up space
If the database or a table gets corrupted, all data is lost (how bad this is depends on your app. Losing several thousand pictures: bad. Losing a deletion log: not very bad)
Databases can be compressed (see this)
You can split up data into different tables, if you have issues with ID(or whatever way you identify row X), meaning one database can have several tables for each object where object X have identification conflicts with object Y. That basically means you can keep everything in one file, and still avoid conflicts with names. (Read more at the bottom of the answer)
Files:
Every single file has to be defined as its own separate file, which takes up space (the name of the file)
You cannot store all attributes in one file without having to set up an advanced reader that determines the different types of data. If you don't do that, and have one file for each attribute, you will use a lot of space.
Reading thousands of lines can be slow, especially if you have several (say 100+) very big files
The OS uses space for each file beyond its content. The name of the file, for instance, takes up space. But something to keep in mind is that you can keep all the data of an app in a single file. If you have an app where objects of two different types may have naming issues, you create a new database.
Naming conflicts
Say you have two objects, object X and Y.
Scenario 1
Object X stores two variables. The file names are(x and y are in this case coordinates):
x.txt
y.txt
But in a later version, object Y comes in with the same two files.
So you have to assign an ID to object X and Y:
0-x.txt
0-y.txt
Every file uses 3 chars(7 total, including extension) on the name alone. This grows bigger the more complex the setup is. See scenario 2
But saving in the database, you get the row with ID 0 and find column X or Y.
You do not have to worry about the file name.
Further, if every object saves a lot of files, the references needed to load or save each file will take up a lot of space. That affects your APK file and slowly pushes you up towards the 50 MB limit (the Google Play limit).
You can create universal methods, but you can do the same with SQL and save space in the APK file. But compared to text files, SQL does save some space in terms of name.
Note though, that if you save 2-3 files(just to take a number) those few bytes going to names aren't going to matter
It is when you start saving hundreds of files, long names to avoid naming conflicts, that is when SQL saves you space. And if the table gets too big, you can compress it. You can zip text files to maybe save some space, but with one-liner files, there is not much to save.
Scenario 2
Objects X and Y have three children each.
Every child has 3 variables it saves to the file system. If there was only one object with 3 children, it could have saved it like this
[id][variable name].txt
But because there is another parent with 3 children(of the same type, and they save the same files) the object's children who get saved last are the ones that stay saved. The first 3 get overwritten.
So you have to add the parent ID:
[parent ID][child ID][variable name].txt
And keep in mind, these examples are focused on a few objects. The amount of space saved is low, but when you save hundreds, if not thousands of files, that is when you start to save space.
Now, if you create a table, you can store your main objects (X and Y in this case). Then, you can either design the first table in a way that makes it recognisable whether an object is a parent or a child, or you can create a second table. The second table has two ID values: one to identify the parent and one to identify the child. So if you want to find all the children of object 436, you simply write this query:
SELECT * FROM childrentable WHERE `parent_id`='436'
And that will give you all the attributes for all the children with object 436 as its parent.
And everything is stored in the Cursor when returned.
If you were to do the same with files, you would need a line like this (where Saver is the file saving and loading class):
Saver.load("0-436-file_name", context);
It is, of course, possible to use a for-loop to cycle through the child IDs (the 0 at the start), but you would also have to save how many children there are. You cannot enumerate the files as easily, so you have to store values about the number of objects and each object's children.
This means you have to save more values in more files to be able to get the files you saved in the first place, which is a really hard way to do things. With a database, you would not have to write files to keep track of how many files you saved. The database returns [x] results for each query, so if object 436 has no children, SQL returns 0 rows. With files, you would have to save 0 as the number of children. Guessing file names leads to I/O exceptions.
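The parent/child lookup described above can be tried out with a small script. This sketch uses Python's built-in `sqlite3` module instead of Android's SQLite API, purely to keep it self-contained. The `childrentable` and `parent_id` names follow the query above; the `child_id` and `name` columns, the sample rows and the function names are assumptions.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few example children."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE childrentable ("
        "child_id INTEGER PRIMARY KEY, "
        "parent_id INTEGER NOT NULL, "
        "name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO childrentable (parent_id, name) VALUES (?, ?)",
        [(436, "a"), (436, "b"), (436, "c"), (437, "d")],
    )
    conn.commit()
    return conn


def children_of(conn, parent_id):
    # One query returns every child; no separate counter has to be stored.
    return conn.execute(
        "SELECT child_id, name FROM childrentable "
        "WHERE parent_id = ? ORDER BY child_id",
        (parent_id,),
    ).fetchall()
```

Asking for a parent with no children simply yields an empty list, which is the "0 rows" case mentioned above.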
I would expect the text file to be smaller as it has no overhead: all the things a Database gives you have a cost in terms of space.
It sounds like space is the only thing that matters to you, and that you expect to change the contents of the text file often (you call it a 'text file db'). Please note that there is no such thing as a 'text file db'. Reading and writing to it will be very slow compared to a proper db (such as SQLite). Adding different record types (tables in a db) will complicate your life, and I wouldn't want to try to maintain any sort of referential links between record types in a text file.
Hope that helps -
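If you want to check the size difference on your own data rather than guess, a quick measurement is easy to script. The following Python sketch writes the same lines once as a plain text file and once into a single-column SQLite table, then reports both file sizes in bytes. The table layout and the function name are assumptions, and the result will depend on your data and schema.

```python
import os
import sqlite3


def compare_sizes(lines, text_path, db_path):
    """Return (text_file_bytes, sqlite_file_bytes) for the same lines."""
    with open(text_path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")

    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE lines (content TEXT)")
    conn.executemany(
        "INSERT INTO lines (content) VALUES (?)", ((line,) for line in lines)
    )
    conn.commit()
    conn.close()

    return os.path.getsize(text_path), os.path.getsize(db_path)
```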