I would like to ask for your opinion and advice.
In my application I need to store files uploaded by users so they can be imported into the database. The files can be XML or Excel (.xlsx), and I expect a maximum size of about 500 kB per file.
The files have to be stored because the database import is not done immediately, and also for backup purposes.
I am considering a scenario with thousands (tens of thousands) of users.
Scenario: one user can upload many files to many categories. That means a user can upload file_1 to category_1, file_2 to category_2, but also file_3 to category_2_1 (a subcategory of category_2).
In general, there is a category tree and a user can upload many files to many of its nodes.
Because of the import application, the filename will always contain:
user_code_category_code_timestamp
My problem is that I do not know what the best way to store these files is.
Should I have one directory per user -> one directory per category -> the relevant files?
Should I have one directory per user -> all of that user's files?
Should I have one root directory -> all users and all files?
By "best way" I mean: there will be an import application that has to list the relevant files for a given category and user. As I wrote above, there are many options, so I am a bit confused.
What else should I consider? File system limitations?
I hope the problem is clear.
Thank you.
Are you using some kind of framework? The best case is to use a plugin for it.
The standard basic solution for storing files is to have one directory for all files (images, for example). When you save a file, you rename it so names do not collide in the directory, and you keep all other data in a DB table.
From that base you can improve and change the solution depending on the business logic.
You might want to restrict access to the files, or put them in a directory tree if you need to browse them.
And so on...
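As a rough sketch of that base solution in PHP, assuming a PDO connection and an uploads table with user, category, and filename columns (all names here are illustrative, not something from the question):

```php
<?php
// Rough sketch: save an upload under a collision-free name in one directory
// and record everything else (user, category, original name) in a DB table.
function store_upload(PDO $db, array $file, int $userId, int $categoryId, string $dir): string
{
    $ext        = pathinfo($file['name'], PATHINFO_EXTENSION);
    // Random name: no collisions even if two users upload "data.xlsx" at once.
    $storedName = bin2hex(random_bytes(16)) . ($ext ? '.' . $ext : '');

    if (!move_uploaded_file($file['tmp_name'], $dir . '/' . $storedName)) {
        throw new RuntimeException('Could not move uploaded file');
    }

    // The import application can then list files per user and category with a
    // simple SELECT instead of scanning directories.
    $stmt = $db->prepare(
        'INSERT INTO uploads (user_id, category_id, stored_name, original_name, uploaded_at)
         VALUES (?, ?, ?, ?, NOW())'
    );
    $stmt->execute([$userId, $categoryId, $storedName, $file['name']]);

    return $storedName;
}
```

The import job then runs something like SELECT stored_name FROM uploads WHERE user_id = ? AND category_id = ? rather than walking a directory tree.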
Thank you for this question! It was difficult to find answers to this online. In my case I have potentially tens of thousands of images/PDFs/other files, and it seems that using hashes and saving everything to one directory is ideal and much less complicated.
Useful things to think about:
1. Add some additional metadata (you can do this in S3 buckets).
2. I would make sure you have the option to resize images if relevant, e.g. via query parameters such as ?w=200&h=200.
3. Perhaps save a file name that can be displayed when the user downloads the file, so they don't get some weird hash.
4. If you save based on a hash that works off the current time, you can generate non-duplicating names (see the sketch after this list).
5. Trying to view all the files at once would hurt performance, but when your app requests only one file at a time through an endpoint, this shouldn't be an issue.
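As a rough illustration of points 3 and 4 (the function and variable names are made up for the example, not part of the answer above):

```php
<?php
// Illustrative only: derive a non-colliding storage key from the upload time
// plus some randomness (point 4), and keep the original name so a download
// can show it instead of the hash (point 3).
function make_storage_key(string $originalName): array
{
    $key = hash('sha256', microtime(true) . bin2hex(random_bytes(8)) . $originalName);
    return ['storage_key' => $key, 'display_name' => $originalName];
}

// When serving the file later, use the stored display name:
//   header('Content-Disposition: attachment; filename="' . $displayName . '"');
```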
I was trying to implement a system where a user can save custom configurations.
My query to the teacher was: "Why should I allow the user to have multiple custom configurations that are 100% the same, just with different names?" To this, my teacher responded with the example of the file system, where I can save multiple duplicate files.
I am not very convinced by this response, although it is true.
I want to know why we allow the user to save duplicate files, or in my case duplicate configurations. I believe it is just redundancy and a waste of available space that could be avoided.
Two configurations may be the same today, but next week one of them will be changed to do something different. Until then, it is a good idea to get used to loading ConfigA for JobA, and ConfigB for JobB. They are the same now, but next week ConfigB will change.
On my web server, I have two folders, showcase and thumbnail, to store images and their thumbnails, respectively. The image paths are fetched from a database to display the images on a page.
The relevant column in the showcase table is s_image, which stores something like /showcase/urlcode.jpg.
I heard that after around 10-20k files in a folder, things start to slow down. So should I create a second folder, showcase2, once the first fills up? Is there some kind of automatic creation that can do this for me?
I appreciate your input.
The filesystem you're using matters when you put tens of thousands of files in a single directory. ext4 on Linux scales up better than NTFS on Windows.
Windows has a compatibility mode for 8.3 file names (the old-timey DOS file name standard). This causes every file name longer than abcdefgh.ext to have an alias created for it, something like abcd~123.ext. This is slow, and gets very slow when you have lots of files in a single directory. You can turn off this ancient compatibility behavior; see https://support.microsoft.com/en-us/kb/121007. Turning it off is a quick fix for an immediate performance problem.
But 20,000 files in one directory is a large number. Your best bet, on any sort of file system, is to automatically create subdirectories based on something that changes. One strategy is to create subdirectories based on year/month, for example
/showcase/2015/08/image1.jpg (for images uploaded this month)
/showcase/2015/09/image7.jpg (for images next month)
It's obviously no problem to store those longer file names in your s_image column in your table.
Or, if you have some system to the naming of the images, exploit it to create subdirectories. For example, if your images are named
cat0001.jpg
cat0002.jpg
...
cat0456.jpg
...
cat0987.jpg
You can create subdirectories based on, say, the first five letters of the names
/showcase/cat00/cat0001.jpg
/showcase/cat00/cat0002.jpg
...
/showcase/cat04/cat0456.jpg
...
/showcase/cat09/cat0987.jpg
If you do this, it's much better to keep the image names intact rather than make them shorter (for example, don't do this /showcase/cat09/87.jpg) because if you have to search for a particular image by name you want the full name there.
As far as I know, there's nothing automatic in a file system to do this for you. But it's not hard to do in your program.
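For example, a small helper along these lines (the base path and column name come from the question; the function name is made up) would create the year/month directory on demand and return the value to store in s_image:

```php
<?php
// Build (and create, if needed) a year/month subdirectory for a new upload,
// returning the relative path to store in the s_image column.
function showcase_path(string $docRoot, string $imageName): string
{
    $relative = '/showcase/' . date('Y') . '/' . date('m');

    // Creates e.g. /showcase/2015/09 the first time a file lands in that month.
    if (!is_dir($docRoot . $relative)) {
        mkdir($docRoot . $relative, 0755, true);
    }

    return $relative . '/' . $imageName;
}

// Usage:
//   $path = showcase_path($docRoot, 'image7.jpg');   // e.g. "/showcase/2015/09/image7.jpg"
//   move_uploaded_file($tmpName, $docRoot . $path);
```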
I'm building a website, and I'm planning to publish various kinds of posts, like tutorials, articles, etc. I'm going to manage it with PHP, but when it comes to storing the content of each post (the very text that will be displayed), what's the better option: using a separate text file, or storing it as a field for each entry in the database?
I don't see a reason not to use the database directly to store the content, but this is the first time I've used a DB and it feels kind of wrong.
What's your experience in this matter?
OK friends, I am revisiting this question for the benefit of those who will read this answer. After a lot of trial and error I have reached the conclusion that keeping text in the database is much more convenient and easier to manipulate. Thus all my data is now in the database. Previously I had some details in the database and the text part in files, but now I have moved everything to the database.
The only problem is editing posts: fields like title, tags, or subject are changed through a simple HTML form, but for the main content I have created a text area. I just copy it from the text area into my favorite text editor, and after editing I paste it back.
Some benefits that pushed me to put everything in the database are:
EASY SEARCH: you can run queries like MySQL LIKE on your text (especially the main content).
EASY ESCAPING: you can easily run functions on your data to escape special characters, make it suitable for display, etc.
GETTING INPUT FROM THE USER: if you want the user to give you input, it makes sense to save that input in the database, escape it, and manipulate it as and when required.
Operations like moving tables, backing up, merging two records, arranging posts with similar content in sequential order, etc. are all easier in the database than in the file system.
In the file system there is always the problem of missing files, inconsistent file names, the wrong file being shown for a title, and so on.
I do not escape user input before adding it to the database, only just before display. This way no permanent changes are stored in the text. (I don't know if that's OK or not.)
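A small example of that escape-on-display idea (assuming a PDO connection and placeholder table/column names):

```php
<?php
// Store the raw text with a prepared statement; nothing in the content is
// altered on the way into the database.
$stmt = $db->prepare('INSERT INTO posts (title, content) VALUES (?, ?)');
$stmt->execute([$title, $rawContent]);

// Escape only at display time, so no permanent changes are stored in the text.
echo '<h1>' . htmlspecialchars($title, ENT_QUOTES, 'UTF-8') . '</h1>';
echo '<div>' . nl2br(htmlspecialchars($rawContent, ENT_QUOTES, 'UTF-8')) . '</div>';
```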
In fact, I am also doing something like you. However, I have reached the conclusion explained below (almost the same as the answer above mine). You have probably made your decision by now, but I will still explain it so that it is useful in the future.
My solution: I have a table called content_table that contains details about each and every article, post, or anything else that I write. The main text portion of the article/post is placed in a directory as a .php or .txt file. When a user clicks on an article to read it, a view of the article is created dynamically using the information in the database and then pulling the text part (I call it the main content) from the .txt file. The database contains information like content_id, creation date, author, and category (most of which become meta tags).
The two major benefits are:
performance, since there is less load on the database
editing the text content is easy.
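As a rough sketch of how such a view could be put together (the column names and file layout are my guesses at the setup described above, not the actual code):

```php
<?php
// Metadata comes from content_table; the main content comes from a .txt file
// named after the content_id.
$stmt = $db->prepare('SELECT content_id, title, author, category FROM content_table WHERE content_id = ?');
$stmt->execute([$contentId]);
$article = $stmt->fetch(PDO::FETCH_ASSOC);

$mainContent = file_get_contents('/path/to/content/' . $article['content_id'] . '.txt');

// The DB fields become the title / meta tags; the file supplies the body text.
echo '<h1>' . htmlspecialchars($article['title']) . '</h1>';
echo '<div>' . nl2br(htmlspecialchars($mainContent)) . '</div>';
```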
I am giving comments based on my experience.
Except for attachments, you can store things in the DB, because managing content, backup, restore, querying, and searching (especially full-text search) will be easy.
Store attached files in some folder and keep the path in DB tables.
Moreover, if you are willing to implement search inside attachments, you can use a search engine like Lucene, which is efficient at searching static content.
Whether to keep attachments in the DB or in the file system comes down to how important the files are.
We have a large number of documents and metadata (xml files) associated with these documents. What is the best way to organize them?
Currently we have created a directory hierarchy:
/repository/category/date(when they were loaded into our db)/document_number.pdf and .xml
We use the path as a unique identifier for the document in our system.
Having a flat structure doesn't seem to be a good option. Also, using the path as an ID helps keep our data independent of our database/application logic, so we can reload the documents easily in case of failure and they will all keep their old IDs.
Yet it introduces some limitations: for example, we can't move the files once they've been placed in this structure, and it takes work to arrange them this way.
What is the best practice? How do websites such as Scribd deal with this problem?
Your approach does not seem unreasonable, but might suffer if you get more than a few thousand documents added within a single day (file systems tend not to cope well with very large numbers of files in a directory).
Storing the .xml document beside the .pdf seems a bit odd. If it's really metadata about the document, shouldn't it be in the database (which it sounds like you already have), where it can be easily queried and indexed?
When storing very large numbers of files I've usually taken the file's key (say, a URL), hashed it, and then stored it X levels deep in directories based on the first characters of the hash...
Say you started with the key 'How to organize a large number of objects'. The md5 hash for that is 0a74d5fb3da8648126ec106623761ac5 so you might store it at...
base_dir/0/a/7/4/http___stackoverflow.com_questions_2734454_how-to-organize-a-large-number-of-objects
...or something like that which you can easily find again given the key you started with.
This kind of approach has one advantage over your date one in that it can be scaled to suit very large numbers of documents (even per day) without any one directory becoming too large, but on the other hand, it's less intuitive to someone having to manually find a particular file.
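A minimal sketch of that hashing scheme (the depth of four levels just mirrors the example above; adjust to taste):

```php
<?php
// Shard files into subdirectories using the first characters of the md5 hash
// of their key, so no single directory grows too large.
function hashed_path(string $baseDir, string $key, int $levels = 4): string
{
    $hash = md5($key);                 // e.g. "0a74d5fb3da8648126ec106623761ac5"
    $dir  = $baseDir;
    for ($i = 0; $i < $levels; $i++) {
        $dir .= '/' . $hash[$i];       // base_dir/0/a/7/4
    }
    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);
    }

    // A filesystem-safe version of the key keeps the file findable from the key.
    return $dir . '/' . preg_replace('/[^A-Za-z0-9._-]/', '_', $key);
}
```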
I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess. I have 1,500,000 files, duplicates, completely duplicated folders, and so on...
The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained with simple queries like:
Give me all files bigger than 100MB
Show all files older than 3 days
Get me all files ending with docx
But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".
Assume I have a bad habit of first putting all my downloaded files on the desktop. Then I extract them to the appropriate folder, without always deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check whether the ZIP has the same contents as folder X", would be suitable.
Assume another bad habit of duplicating files: some folder where "the clean files" are located in a nice structure, and other messy folders. Now my clean folder has 20 picture galleries, and my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic by seeing "Oh, those are all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates".
So, now to get to the point:
Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for a suitable combination. And it seems to me it is more than just filtering: it's dynamic grouping, combining multiple criteria to find the "best" groups.
One very rough approach would be this:
In the beginning, all files are equal
The first, not so "good" group is the directory
If you are a big, clean directory, you earn points (evenly distributed names)
If all files have the same creation date, you may be "autocreated"
If you are a child of Program-Files, I don't care for you at all
If I move you, group A, into group C, would this improve the "entropy"?
What are the best patterns fitting this situation? Strategy, Pipes and Filters, "Grouping"... Any comments welcome!
Edit in reaction to the answers:
The tagging approach:
Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. I will give this some thought and add my insights here.
The procrastination comment:
Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)
Chris
You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:
Make a copy of all the stuff on your drive on an external disk (USB or whatever)
Do a clean install of your system
As soon as you find you need something, get it from your copy, and place it in a well defined location
After 6 months, throw away your external drive. Anything that's on there can't be that important.
You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.
If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.
Hope this helps.
I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files beyond the obvious name, size, and timestamps:
In-band metadata such as MP3 ID3 tags, version information for EXEs/DLLs, HTML titles and keywords, Summary information for Office documents, etc. Even image files can have interesting metadata. A hash of the entire contents helps if you're looking for duplicates (see the sketch after this list).
Out-of-band metadata, such as what can be stored in NTFS alternate data streams, e.g. what you can edit in the Summary tab for non-Office files.
Your browsers keep information on where you downloaded files from (though Opera doesn't keep it for long), if you can read it.
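For the duplicate-hunting part mentioned above, a rough sketch (the path is a placeholder) that groups files by a hash of their full contents:

```php
<?php
// Group files by a hash of their entire contents; any group with more than
// one path is a set of byte-identical duplicates.
$byHash = [];
$files  = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/path/to/mess', FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    if ($file->isFile()) {
        $byHash[hash_file('sha1', $file->getPathname())][] = $file->getPathname();
    }
}

$duplicateGroups = array_filter($byHash, function ($paths) {
    return count($paths) > 1;
});
```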
You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by meta data as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share a popup appears and asks you to tag the file.
You should also get Google Desktop.