Best generic strategy to group items using multiple criteria - files

I have a simple, real-life problem I want to solve using an OO approach. My hard drive is a mess: I have 1,500,000 files, duplicates, complete duplicate folders, and so on...
The first step, of course, is parsing all the files into my database. No problems so far; now I have a lot of nice entries which are kind of "naturally grouped". Examples of this simple grouping can be obtained using simple queries like:
Give me all files bigger than 100MB
Show all files older than 3 days
Get me all files ending with .docx
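For illustration, those three queries might look something like this against a hypothetical SQLite table files(path, size_bytes, mtime, extension) filled by the parsing step (the table layout is only an assumption):

    # Sketch only: assumes a files table with columns path, size_bytes, mtime, extension.
    import sqlite3
    import time

    conn = sqlite3.connect("files.db")

    # All files bigger than 100 MB
    big = conn.execute(
        "SELECT path FROM files WHERE size_bytes > ?", (100 * 1024 * 1024,)
    ).fetchall()

    # All files older than 3 days (mtime stored as a Unix timestamp)
    old = conn.execute(
        "SELECT path FROM files WHERE mtime < ?", (time.time() - 3 * 86400,)
    ).fetchall()

    # All files ending with .docx
    docx = conn.execute(
        "SELECT path FROM files WHERE extension = ?", (".docx",)
    ).fetchall()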
But now assume I want to find groups with a little more natural meaning. There are different strategies for this, depending on the "use case".
Assume I have a bad habit of putting all my downloaded files on the desktop first. Then I extract them to the appropriate folder, without always deleting the ZIP file. Then I move them into an "attic" folder. For the system to find this group of files, a time-oriented search approach, perhaps combined with a "check if the ZIP content is the same as folder X", would be suitable.
Assume another bad habit of duplicating files: having one folder where "the clean files" are located in a nice structure, and other messy folders. Now my clean folder has 20 picture galleries, and my messy folder has 5 duplicated galleries and 1 new one. A human user could easily identify this logic by seeing: "Oh, those are all just duplicates, that's a new one, so I put the new one in the clean folder and trash all the duplicates."
So, now to get to the point:
Which combination of strategies or patterns would you use to tackle such a situation? If I chain filters, the "hardest" one wins, and I have no idea how to let the system "test" for suitable combinations. It also seems to me that this is more than just filtering: it's dynamic grouping by combining multiple criteria to find the "best" groups.
One very rough approach would be this:
In the beginning, all files are equal
The first, not so "good" group is the directory
If you are a big, clean directory, you earn points (evenly distributed names)
If all files have the same creation date, you may be "autocreated"
If you are a child of Program-Files, I don't care for you at all
If I move you, group A, into group C, would this improve the "entropy"?
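As a very rough sketch of that scoring idea (every heuristic, weight and field name below, such as path and created, is invented purely for illustration):

    # Start from directories as groups, score each group with ad-hoc heuristics,
    # and only merge two groups if the combined score ("entropy") improves.
    from collections import Counter
    from pathlib import Path

    def score(group):
        """Higher means 'more coherent'. group is a list of dicts with the
        assumed keys 'path' (str) and 'created' (datetime)."""
        paths = [Path(f["path"]) for f in group]
        s = 0.0
        # big, "clean" directory: many files, a consistent extension pattern
        extensions = Counter(p.suffix for p in paths)
        if extensions:
            s += len(paths) / (1 + len(extensions))
        # all files share one creation date -> probably "autocreated"
        if len({f["created"].date() for f in group}) == 1:
            s += 5
        # children of Program Files are not interesting at all
        if any("Program Files" in str(p) for p in paths):
            s -= 100
        return s

    def try_merge(a, b):
        """Merge group a into group b only if that improves the total score."""
        merged = a + b
        return merged if score(merged) > score(a) + score(b) else None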
What are the best patterns fitting this situation? Strategy, Filters and Pipes, "Grouping"... Any comments welcome!
Edit in reaction to answers:
The tagging approach:
Of course, tagging crossed my mind. But where do I draw the line? I could create different tag types, like InDirTag, CreatedOnDayXTag, TopicZTag, AuthorPTag. These tags could be structured in a hierarchy, but the question of how to group would remain. But I will give this some thought and add my insights here...
The procrastination comment:
Yes, it sounds like that. But the files are only the simplest example I could come up with (and the most relevant at the moment). It's actually part of the bigger picture of grouping related data in dynamic ways. Perhaps I should have kept it more abstract to stress this: I am NOT searching for a file tagging tool or a search engine, but for an algorithm or pattern to approach this problem... (or better, ideas, like tagging)
Chris

You're procrastinating. Stop that, and clean up your mess. If it's really big, I recommend the following tactic:
Make a copy of all the stuff on your drive on an external disk (USB or whatever)
Do a clean install of your system
As soon as you find you need something, get it from your copy, and place it in a well defined location
After 6 months, throw away your external drive. Anything that's on there can't be that important.
You can also install Google Desktop, which does not clean your mess, but at least lets you search it efficiently.
If you want to prevent this from happening in the future, you have to change the way you're organizing things on your computer.
Hope this helps.

I don't have a solution (and would love to see one), but I might suggest extracting metadata from your files besides the obvious name, size and timestamps.
in-band metadata such as MP3 ID3 tags, version information for EXEs / DLLs, HTML titles and keywords, Summary information for Office documents, etc. Even image files can have interesting metadata. A hash of the entire contents helps when looking for duplicates (see the sketch after this list).
out-of-band metadata such as can be stored in NTFS alternate data streams - e.g. what you can edit in the Summary tab for non-Office files
your browser keeps information on where you downloaded files from (though Opera doesn't keep it for long), if you can read it.
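As a small sketch of the "hash of the entire contents" idea (using SHA-256 here; the list of paths is assumed to come from the database built earlier):

    # Group files by the hash of their bytes; any group with more than one
    # member is a set of exact duplicates.
    import hashlib
    from collections import defaultdict

    def content_hash(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(paths):
        groups = defaultdict(list)
        for p in paths:
            groups[content_hash(p)].append(p)
        return {h: ps for h, ps in groups.items() if len(ps) > 1}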

You've got a fever, and the only prescription is Tag Cloud! You're still going to have to clean things up, but with tools like TaggCloud or Tag2Find you can organize your files by metadata as opposed to location on the drive. Tag2Find will watch a share, and when anything is saved to the share a popup appears and asks you to tag the file.
You should get Google Desktop too.

Related

Create a relational GWAS/Genomics database

I thought I'd ask before I try to build something from scratch.
Here is the type of problem I need to answer. One of our researchers comes to me and says "How many people in our data have such-and-such SNP genotyped?"
Our genetics data consists of several dozen GWAS files, typically flat delimited files. Each GWAS file has between 100,000 and 1,000,000 SNPs. There is some overlap in the SNPs, but less than I'd originally thought.
Anyway, what I want to do is have some sort of structured database that links our participant IDs to a particular GWAS study, and then links that GWAS study to a list of SNPs, so that I can write some kind of query that will pull all IDs that have the data. At no point do I need individual-level genotype data; it is way easier to pull the SNPs/samples that I need once I know where they are.
So that is my problem and what I'm looking for. For anyone who works with a lot of GWAS data, I'm sure you're familiar with the problem. Is there anything (free or paid) that is built for this type of problem? Or do you have thoughts on what direction I might want to go if I need to build this myself?
Thanks.
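Not a complete answer, but a minimal sketch of the kind of relational structure described above, using SQLite; all table and column names, and the rs123456 example, are placeholders:

    import sqlite3

    conn = sqlite3.connect("gwas_catalog.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS study (
        study_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    -- who was genotyped in which study
    CREATE TABLE IF NOT EXISTS study_participant (
        study_id       INTEGER REFERENCES study(study_id),
        participant_id TEXT NOT NULL
    );
    -- which SNPs each study covers
    CREATE TABLE IF NOT EXISTS study_snp (
        study_id INTEGER REFERENCES study(study_id),
        rsid     TEXT NOT NULL
    );
    """)

    # "How many people in our data have such-and-such SNP genotyped?"
    count = conn.execute("""
        SELECT COUNT(DISTINCT sp.participant_id)
        FROM study_snp ss
        JOIN study_participant sp ON sp.study_id = ss.study_id
        WHERE ss.rsid = ?
    """, ("rs123456",)).fetchone()[0]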

Best "architecture" to store lots of files per user on server

I would like to ask for your opinion and advice.
In my application I need to store files uploaded by users to provide an import into the database - it could be an XML or Excel file (.xlsx); I expect a maximum file size of about 500 kB per file.
The files need to be stored because the import into the database is not done immediately, and also for backup purposes.
I am considering a scenario with thousands (tens of thousands) of users.
Scenario - one user can upload many files to many categories. That means a user can upload file_1 to category_1, file_2 to category_2, but also file_3 to category_2_1 (a subcategory of category_2).
Generally, there is some kind of category tree and a user can upload many files to many nodes.
Because of the import application, the filename will always contain:
user_code_category_code_timestamp
And my problem is that I do not know what the best way to store these files is.
Should I have one directory per user -> one directory per category -> the relevant files?
Should I have one directory per user -> all of that user's files?
Should I have one root directory -> all users and all files?
By "the best way" I mean: there must be an import application which will list the relevant files per category and per user. As I wrote above, there are many ways, so I am a bit confused.
What else should I consider? File system limitations?
I hope you understand the problem.
Thank you.
Are you using some kind of framework? In the best case you can use a plugin for it.
The standard basic solution for storing files is to have one directory for all files (images, for example). When you save a file, you change its name so that names do not collide in the directory. You keep all other data in a DB table.
From that base - you can improve and change the solution depending on the business logic.
You might want to restrict access to the files, or you might want to put them in a directory tree if you need to browse them.
And so on...
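A minimal sketch of that basic approach, assuming SQLite and placeholder column names: store the file on disk under a generated unique name and keep everything else (user, category, original name) in a table:

    import sqlite3
    import uuid
    from pathlib import Path

    STORAGE_DIR = Path("uploads")
    STORAGE_DIR.mkdir(exist_ok=True)

    db = sqlite3.connect("files.db")
    db.execute("""CREATE TABLE IF NOT EXISTS uploaded_file (
        stored_name   TEXT PRIMARY KEY,
        user_code     TEXT,
        category_code TEXT,
        original_name TEXT,
        uploaded_at   TEXT DEFAULT CURRENT_TIMESTAMP
    )""")

    def store_upload(user_code, category_code, original_name, data):
        # unique name on disk so uploads never collide
        stored_name = uuid.uuid4().hex + Path(original_name).suffix
        (STORAGE_DIR / stored_name).write_bytes(data)
        db.execute(
            "INSERT INTO uploaded_file (stored_name, user_code, category_code, original_name) "
            "VALUES (?, ?, ?, ?)",
            (stored_name, user_code, category_code, original_name),
        )
        db.commit()
        return stored_name

Listing the relevant files per user and category for the import is then a simple SELECT on that table.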
Thank you for this question! It was difficult to find answers for this online, but in my case I have potentially tens of thousands of images/PDFs/files/etc., and it seems that using hashes and saving to one directory is ideal and makes things much less complicated.
Useful things to think about:
1. Add some additional metadata (you can do this in S3 buckets).
2. I would make sure you have the option to resize images if relevant, e.g. ?w=200&h=200.
3. Perhaps save a file name that can be displayed if the user downloads it, so it doesn't give them some weird hash.
4. If you save based on a hash that works off of the current time, you can generate non-duplicating hashes.
5. Trying to view all the files at once would hurt performance, but when your app requests only one file at a time based on an endpoint this shouldn't be an issue.

How should I store user-uploaded images for a web application?

On my web server, I have two folders showcase and thumbnail to store images and their thumbnails, respectively. A database fetches these images to display them on a page.
The table column in the showcase table is s_image which stores something like /showcase/urlcode.jpg.
I heard that after around 10-20k files in a folder, things start to slow down. So should I create a second folder, showcase2, once it's filled up? Is there some kind of automatic creation that can do this for me?
I appreciate your input.
The filesystem you're using matters when you put tens of thousands of files in a single directory. ext4 on Linux scales up better than NTFS on Windows.
Windows has a compatibility mode for 8.3 file names (the old-timey DOS file name standard). This causes an alias like abcd~123.ext to be created for every file name longer than abcdefgh.ext. This is slow, and gets very slow when you have lots of files in a single directory. You can turn off this ancient compatibility behavior; see https://support.microsoft.com/en-us/kb/121007. If you do turn it off, it's a quick fix for an immediate performance problem.
But 20,000 files in one directory is a large number. Your best bet, on any sort of file system, is to automatically create subdirectories based on something that changes. One strategy is to create subdirectories based on year / month, for example:
/showcase/2015/08/image1.jpg (for images uploaded this month)
/showcase/2015/09/image7.jpg (for images next month)
It's obviously no problem to store those longer file names in your s_image column in your table.
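A small sketch of that year/month scheme; the returned path is what would go into the s_image column (directory names are just placeholders):

    from datetime import date
    from pathlib import Path

    SHOWCASE_ROOT = Path("showcase")   # physical root on the web server

    def showcase_path(filename, upload_date=None):
        d = upload_date or date.today()
        rel = Path(f"{d.year:04d}") / f"{d.month:02d}" / filename
        # make sure showcase/2015/08/ (for example) exists before saving
        (SHOWCASE_ROOT / rel.parent).mkdir(parents=True, exist_ok=True)
        return (Path("/") / SHOWCASE_ROOT / rel).as_posix()   # e.g. /showcase/2015/08/image1.jpg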
Or, if you have some system to the naming of the images, exploit it to create subdirectories. For example, if your images are named
cat0001.jpg
cat0002.jpg
...
cat0456.jpg
...
cat0987.jpg
You can create subdirectories based on, say, the first five characters of the names:
/showcase/cat00/cat0001.jpg
/showcase/cat00/cat0002.jpg
...
/showcase/cat04/cat0456.jpg
...
/showcase/cat09/cat0987.jpg
If you do this, it's much better to keep the image names intact rather than make them shorter (for example, don't do this: /showcase/cat09/87.jpg), because if you ever have to search for a particular image by name you want the full name there.
As far as I know, there's nothing automatic in a file system to do this for you. But it's not hard to do in your program.

Extracting old data from a DOS software

I have an old software created on DOS. All I have is an executable which shows me the UI. What this software does is it takes details of an order given to a door manufacturing company, stores it somewhere and sends the data to a needle printer. The data stored includes things like the name and address of the customer, door dimensions and so on.
The original creators of the software are no longer reachable and I have no idea what language was used to create it. My company wishes to get rid of this system but right now the only way to access information about old orders is by inserting the order number into the UI.
What I need to do is extract this data and convert it to some readable format. I have read research papers, searched this website and many others, but have come up empty. I know that when I enter a new order, the files that get modified have the following extensions:
^01, WRK, DBK, STA
There are other files in the directory with extensions like .ALT, .DBI, .ASC, .BAS, .DDF, .MA3, but those don't seem to have changed in the last 20 years.
Thank you very much in advance guys
File extensions aren't always the best way of finding things out. They are fluid and there's never been much in the way of standardisation, or at least not back in the DOS days. If you look at FilExt, for example, there's a fair bit of doubling up.
You'd be better off running the files through a tool like TrID/32 - File Identifier v2.10 - (C) 2003-11 By M.Pontello, which does a good job of recognising files by their content rather than their file extension. It's not foolproof, but it can identify a few thousand different file types.
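TrID works from a large database of binary signatures; as a much cruder illustration of the same idea, a sketch like this checks a few well-known "magic bytes" at the start of each file instead of trusting the extension (the signature list here is tiny and only covers common formats):

    SIGNATURES = {
        b"%PDF":              "PDF document",
        b"PK\x03\x04":        "ZIP-based container",
        b"\x89PNG\r\n\x1a\n": "PNG image",
        b"\xff\xd8\xff":      "JPEG image",
        b"MZ":                "DOS/Windows executable",
    }

    def sniff(path):
        """Guess a file type from its first bytes rather than its extension."""
        with open(path, "rb") as f:
            head = f.read(16)
        for magic, name in SIGNATURES.items():
            if head.startswith(magic):
                return name
        return "unknown (possibly a custom record format)"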
I used to do a lot of development on DOS back in the day. If you want to contact me off list, bruce dot axtens at gmail dot com, I can help identify the files and perhaps cook up a mechanism to extract the data.

How to organize a large number of objects

We have a large number of documents and metadata (xml files) associated with these documents. What is the best way to organize them?
Currently we have created a directory hierarchy:
/repository/category/date(when they were loaded into our db)/document_number.pdf and .xml
We use the path as a unique identifier for the document in our system.
Having a flat structure doesn't seem to be a good option. Also, using the path as an ID helps to keep our data independent from our database/application logic, so we can reload the documents easily in case of failure, and all of them will keep their old IDs.
Yet, it introduces some limitations. For example, we can't move the files once they've been placed in this structure, and it takes work to put them there in the first place.
What is the best practice? How do websites such as Scribd deal with this problem?
Your approach does not seem unreasonable, but might suffer if you get more than a few thousand documents added within a single day (file systems tend not to cope well with very large numbers of files in a directory).
Storing the .xml document beside the .pdf seems a bit odd - if it's really metadata about the document, should it not be in the database (which it sounds like you already have), where it can be easily queried and indexed, etc.?
When storing very large numbers of files I've usually taken the file's key (say, a URL), hashed it, and then stored it X levels deep in directories based on the first characters of the hash...
Say you started with the key 'How to organize a large number of objects'. The md5 hash for that is 0a74d5fb3da8648126ec106623761ac5 so you might store it at...
base_dir/0/a/7/4/http___stackoverflow.com_questions_2734454_how-to-organize-a-large-number-of-objects
...or something like that which you can easily find again given the key you started with.
This kind of approach has one advantage over your date one in that it can be scaled to suit very large numbers of documents (even per day) without any one directory becoming too large, but on the other hand, it's less intuitive to someone having to manually find a particular file.
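A small sketch of that hashed-key scheme (the directory depth and the sanitizing rule for the file name are arbitrary choices for illustration):

    import hashlib
    import re
    from pathlib import Path

    BASE_DIR = Path("base_dir")
    DEPTH = 4   # number of single-character directory levels

    def path_for_key(key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        levels = list(digest[:DEPTH])                     # e.g. ['0', 'a', '7', '4']
        safe_name = re.sub(r"[^A-Za-z0-9_.-]", "_", key)  # key sanitized for use as a file name
        return BASE_DIR.joinpath(*levels, safe_name)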
