Using database instead of thousands of small files

Using database instead of thousands of small files - database

At work, I have started working on a program that can potentially generate hundreds of thousands of mostly small files an hour. My predecessors have found out that working with many small files can become very slow, so they have resorted to some (in my opinion) crude methods to alleviate the problem.
So I asked my boss why won't we use a database instead and he gave me his oh-so-famous I-know-better-than-you look and told me obviously a database that big won't have a good performance.
My question is, is it really so? It seems to me that a database engine should be able to handle such data much better than the file system. Here are the conditions we have:
The program mostly writes data. Queries are much less frequent and their performance is not very important.
Millions of files could be generated every day. Most of these are small (a few kilobytes) but some can be huge.
If you think we should opt with the database solution, what open source database system do you think will work best? (If I decide that a database will certainly work better, I'm going to push for a change whatever the boss says!)

This is another one of those "it depends" type questions.
If you are just writing data (write once, read hardly ever) then just use the file system. Maybe use a hash-directory approach to create lots of sub-directories (things tend to go slowly with many files in a single directory.
If you are writing hundreds of thousands of events for later querying (e.g. find everything with X > 10 and Y < 11) then a database sounds like a great idea.
If you are writing hundreds of thousands of bits of non-relational data (e.g. simple key-value pairs) then it might be worth investigating a NoSQL approach.
The best approach is probably to prototype all the ideas you can think of, measure and compare!

As a minimal impact improvement, I'd split your millions of small files into a heirachy of directories. So say you were using uuids as your file names, I'd stip out the redundant urn:uuid: at the front, and then make 16 directories based on the first letter, and inside them make 16 subdirectories based on the second letter, and add even more levels if you need it. That alone will speed up the access quite a bit. Also, I would remove the directory whenever it became empty, to make sure the directory entry itself doesn't grow larger and larger.

Related

OpenVMS ODS-5 freeblocks

Our OpenVMS 8.3 ODS-5 machines, disks mounted as shadow set members sometimes lose freeblocks suddenly with no obvious cause. Adding up the FREEBLOCKS and the total size of all files on the disk gives a much lower total than the actual total available blocks on the disk. Can anyone suggest what might be causing this?
I have found that purging files will usually eliminate the issue but have no explanation for it and cannot find the file(s) causing it.
The machine is not in a cluster and ANALYZE/RMS told me, and others whom I consulted, nothing. All file versions were considered but it may be that dir/size needs to be qualified further. I am not aware of any temporary/scratch files but ideally I would like to find them if they exist. The shortfall between TOTALBLOCK-FREEBLOCKS and the output of dir/siz/grand [000000...] was approx 60 million blocks (about half the drive).
I am unfamiliar with DFU.

Don't worry. be happy. It is sure NOT to be a problem just a lack of understanding. ( Of course one could consider that in and of itself a bigger problem than an apparent mismatch in numbers. :-)
"Much lower" is almost meaningless. Everything is relative. How about some quantitative numbers.
Is this a cluster? Each cluster member can, and will have its own extend cache, possibly 10% of the free space each. Did you flush that/those before counting?
Were the ALLOCATED blocks counted, as one should, or perhaps just used blocks?
Were all versions of all files included in the count (since purge possibly changed the result)
Do the application on the system use TEMPORARY files which are not entered into a directory, and thus possibly not counted?
Have you considered enabling DISK QUOTA, just for the count, not to limit usage?
How about ANALYZE / DISK?
How about poking at the drive with DFU... highly recommended! Likely "Much faster" :-), and "Much more accurate" than anything DIRECTORY based.
Regards,
Hein.

How can the different Closed Hashing Variations be implemented in secondary memory?

i don't really know how to put this but, aside from it being implemented in primary memory i.e. heap, how can i implement variation 1,2 or 3 or any of the variations into the secondary memory which is where we manipulate files right?

Assuming your secondary memory is something with relatively slow seek times like hard disk drives, typically you want to implement a closed hash scheme based on "buckets" where buckets can be paged into main memory in their full relatively quickly. In this manner you usually don't have to perform expensive disk seeks for collisions or unstored keys. This isn't a particularly trivial undertaking and often one will end up using a library such as the classic gdbm or others (also see wikipedia).
Most of the bucket schemes are based on extensible hashing with a special case for trying to store large keys or data that don't fit nicely into a bucket. CiteSeer is also a good place to look for papers related to extensible hashing. (See the references of the linked paper, for example.)

How to find all files with the same content?

This is an interview question: "Given a directory with lots of files, find the files that have the same content". I would propose to use a hash function to generate hash values of the file contents and compare only the files with the same hash values. Does it make sense ?
The next question is how to choose the hash function. Would you use SHA-1 for that purpose ?

I'd rather use the hash as a second step. Sorting the dir by file size first and hashing and comparing only when there are duplicate sizes may improve a lot your search universe in the general case.

Like most interview questions, it's more meant to spark a conversation than to have a single answer.
If there are very few files, it may be faster to simply to a byte-by-byte comparison until you reach bytes which do not match (assuming you do). If there are many files, it may be faster to compute hashes, as you won't have to shift around the disk reading in chunks from multiple files. This process may be sped up by grabbing increasingly large chunks of each file, as you progress through the files eliminating potentials. hIt may also be necessary to distribute the problem among multiple servers, if their are enough files.
I would begin with a much faster and simpler hash function than SHA-1. SHA-1 is cryptographically secure, which is not necessarily required in this case. In my informal tests, Adler 32, for example, is 2-3 times faster. You could also use an even weaker presumptive test, than retest any files which match. This decision also depends on the relation between IO bandwidth and CPU power, if you have a more powerful CPU, use a more specific hash to save having to reread files in subsequent tests, if you have faster IO, the rereads may be cheaper than doing expensive hashes unnecessarily.
Another interesting idea would be to use heuristics on the files as you process them to determine the optimal method, based on the files size, computer's speed, and the file's entropy.

Yes, the proposed approach is reasonable and SHA-1 or MD5 will be enough for that task. Here's a detailed analysis for the very same scenario and here's a question specifically on using MD5. Don't forget you need a hash function as fast as possible.

Yes, hashing is the first that comes to mind. For your particular task you need to take the fastest hash function available. Adler32 would work. Collisions are not a problem in your case, so you don't need cryptographically strong function.

How to manipulate huge amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think its important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precission floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files in the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking on moving into a Supercomputing facility in my country and store all the information in the RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm thinking on is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations that I would like to perform includes transposition of the array (change b[i,j,k,l] = a[j,i,k,l] and the likes), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained it wont be necessary to perform more operations on the data.
EDIT (2):
I also would like to be able to store more information in the future, so the solution should be somehow scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it'll be better if I can use N=512 (which means 512 GB to store it!!).

Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly..

Any decent answer will depend on how you need to access the data. Randomly access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.

As Chris pointed out, what are you going to do with the data.
Besides, I think storing it in a (relational) database will be faster than reading it from the harddrive since the RDBMS will perform some optimizations for you like caching.

If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.

Without more information, if you need quickest possible access to all the data I would go with using C for your programming language, using some flavor of *nix as the O/S, and buying RAM, it's relatively cheap now. This also depends on what you are familiar with, you can go the windows route as well. But as others have mentioned it will depend on how you are using this data.

So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow -, if your workflow is linear, that is one thing. If the workflow is not linear, I would design a binary tree referencing pages in memory. There are tons of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C since you will also be able to set up and manipulate your memory paging.

Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix

Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.

You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32Gb files.

The whole database technology is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good dbms principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data - if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you will scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already processed data and read new data.

For transpositions, it's faster to actually just change your understanding of what index is what. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].

Could it be possible to solve it by this procedure?
First create M child processes and execute them in paralel. Each process will be running in a dedicated core of a cluster and will load some information of the array into the RAM of that core.
A father process will be the manager of the array, calling (or connecting) the appropiate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?

The first thing that I'd recommend is picking an object-oriented language, and develop or find a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).

How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated out into independent chunks that are then processed either independently or in a serial/sequential fashon vs. random access to the data with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.

How to programmatically really clean Delete files?

So you are about to pass your work-computer to some of your colleague. How do you make sure you really delete all your personal data?
Re-formatting, Re-installing OS will not really solve the problem.
I searched around and found some programs does "Wipe out" disks.
This caught me thinking how does those programs work?
I mean, What algorithms they use and how low level those implementations go?
Any ideas?

Most of those programs do a "secure delete" by overwriting the file bits with random noise.
The biggest problem has more to do with the actual implementation of hard drives and file systems than anything else. Fragmentation, caching, where the data actually is that you're trying to overwrite: that's the big problem . And it's a very low-level problem -- driver level, really. You're not going to be able to do it with Python, C#, or Java.
Once that problem is solved, there's the one of physical media. Because of the nature of magnetic media, it's very frequently possible to read the previous bits that were once on the hard drive -- even if you overwrote them with a different bit. "Secure delete" programs solve this problem by overwriting several times -- preferably a random but suitably large number of times.
Further Reading:
Data Erasure
Data Remanence
The Great Zero Challenge (provided by #Stefano Borini -- vote him up!)

Safe delete programs overwrite the file multiple times with random patterns of data, so that even residual magnetization cannot be picked up and is lost in the noise.
However, assuming that the great zero challenge has some truth in it, I think you can just fill the file/disk with zeros and call yourself happy, as this residual magnetization is practically impossible to pick even with professional setup.

As far as I know most tools do this with X writes and deletes, where X is some suitably large number. The best way to do this is probably to interface with the hardware at some level, although a cheap and easy way would be to create files until the disk is full, writing random data, delete them, create new files and repeat.
Its all paranoia anyway. Just deleting a file is usually much more than enough...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight