How to programmatically really clean delete files? - filesystems

So you are about to pass your work computer on to one of your colleagues. How do you make sure you really delete all your personal data?
Re-formatting and re-installing the OS will not really solve the problem.
I searched around and found some programs that "wipe" disks.
This got me thinking: how do those programs work?
I mean, what algorithms do they use, and how low-level do those implementations go?
Any ideas?

Most of those programs do a "secure delete" by overwriting the file bits with random noise.
The biggest problem has more to do with the actual implementation of hard drives and file systems than anything else. Fragmentation, caching, and where the data you're trying to overwrite actually lives: that's the big problem. And it's a very low-level problem -- driver level, really. You're not going to be able to do it with Python, C#, or Java.
Once that problem is solved, there's the problem of physical media. Because of the nature of magnetic media, it's very frequently possible to read the previous bits that were once on the hard drive -- even if you overwrote them with a different bit. "Secure delete" programs solve this problem by overwriting several times -- preferably a random but suitably large number of times.
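For illustration, here is a minimal sketch of such a multi-pass overwrite on a POSIX system (the function name and pass count are made up). It only shows the basic idea; as noted above, journaling filesystems, caching, remapped sectors, and SSD wear levelling can all leave copies of the data behind, so it is not forensically sound on its own.

    /* Minimal sketch of a multi-pass overwrite on a POSIX system.
     * Journaling filesystems, SSD wear levelling and remapped sectors
     * can all defeat this; it is illustrative only. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int shred_file(const char *path, int passes)
    {
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return -1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }

        char buf[4096];
        for (int p = 0; p < passes; p++) {
            lseek(fd, 0, SEEK_SET);
            for (off_t left = st.st_size; left > 0; ) {
                size_t n = left < (off_t)sizeof(buf) ? (size_t)left : sizeof(buf);
                for (size_t i = 0; i < n; i++)       /* fill with pseudo-random noise */
                    buf[i] = (char)rand();
                ssize_t w = write(fd, buf, n);
                if (w < 0) { perror("write"); close(fd); return -1; }
                left -= w;
            }
            fsync(fd);                               /* push each pass to the device */
        }
        close(fd);
        return unlink(path);                         /* finally remove the name */
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        return shred_file(argv[1], 3) == 0 ? 0 : 1;
    }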
Further Reading:
Data Erasure
Data Remanence
The Great Zero Challenge (provided by Stefano Borini -- vote him up!)

Safe delete programs overwrite the file multiple times with random patterns of data, so that even residual magnetization cannot be picked up and is lost in the noise.
However, assuming that the Great Zero Challenge has some truth to it, I think you can just fill the file/disk with zeros and call yourself happy, as this residual magnetization is practically impossible to pick up, even with a professional setup.

As far as I know most tools do this with X writes and deletes, where X is some suitably large number. The best way to do this is probably to interface with the hardware at some level, although a cheap and easy way would be to create files until the disk is full, writing random data, delete them, create new files and repeat.
It's all paranoia anyway. Just deleting a file is usually much more than enough...

Related

OpenVMS ODS-5 freeblocks

On our OpenVMS 8.3 ODS-5 machines, disks mounted as shadow set members sometimes lose free blocks suddenly with no obvious cause. Adding up the FREEBLOCKS and the total size of all files on the disk gives a much lower total than the actual total available blocks on the disk. Can anyone suggest what might be causing this?
I have found that purging files will usually eliminate the issue but have no explanation for it and cannot find the file(s) causing it.
The machine is not in a cluster and ANALYZE/RMS told me, and others whom I consulted, nothing. All file versions were considered but it may be that dir/size needs to be qualified further. I am not aware of any temporary/scratch files but ideally I would like to find them if they exist. The shortfall between TOTALBLOCK-FREEBLOCKS and the output of dir/siz/grand [000000...] was approx 60 million blocks (about half the drive).
I am unfamiliar with DFU.
Don't worry, be happy. It is sure NOT to be a problem, just a lack of understanding. (Of course, one could consider that in and of itself a bigger problem than an apparent mismatch in numbers. :-)
"Much lower" is almost meaningless. Everything is relative. How about some quantitative numbers?
Is this a cluster? Each cluster member can, and will, have its own extent cache, possibly 10% of the free space each. Did you flush that/those before counting?
Were the ALLOCATED blocks counted, as one should, or perhaps just used blocks?
Were all versions of all files included in the count (since purging possibly changed the result)?
Do the applications on the system use TEMPORARY files which are not entered into a directory, and thus possibly not counted?
Have you considered enabling DISK QUOTA, just for the count, not to limit usage?
How about ANALYZE / DISK?
How about poking at the drive with DFU... highly recommended! Likely "Much faster" :-), and "Much more accurate" than anything DIRECTORY based.
Regards,
Hein.

Optimal Buffer Sizing

I guess this is a performance computing question. I'm writing a program in C which produces a large amount of output, much more than can typically be stored in RAM in its entirety. I intend to simply write the output to stdout, so it may just be going to the screen or it may be redirected into a file. My problem is how to choose an optimal buffer size for the data that will be held in RAM.
The output data itself isn't particularly important, so let's just say that it's producing a massive list of random integers.
I intend to have 2 threads: one which produces the data and writes it to a buffer, and the other which writes that buffer to stdout. This way, I can begin producing the next buffer of output whilst the previous buffer is still being written to stdout.
To be clear, my question isn't about how to use functions like malloc() and pthread_create() etc. My question is purely about how to choose the number of bytes (512, 1024, 1048576, ...) for the buffer size that will give the best performance.
Ideally, I'd like to find a way in which I could choose an optimal buffer size dynamically, so that my program could adjust to whatever hardware it was being run on at the time. I have tried searching for answers to this problem, and although I found a few threads about buffer size, I couldn't find anything particularly relevant to this problem. Therefore, I just wanted to post it as a question in the hope that I could get some different points of view, and come up with something better than I could on my own.
It's a big waste of time to mix design and optimization. This is considered one of the top canonical mistakes. It is likely to damage your design and not actually optimize much.
Get your program working, and if there is an indication of a performance issue, then profile it and consider analyzing the part that's really causing the problem.
I would think this applies particularly to a complex architectural optimization like multithreading your application. Multithreading a single image is something you never really want to do: it's impossible to test, prone to unreproducible bugs, it will fail differently in different execution environments, and there are other problems. But, for some programs, multithreaded parallel execution is required for functionality or is one way to get necessary performance. It's widely supported, and essentially it is, at times, a necessary evil.
It's not something you want in the initial design without solid evidence that programs like yours need it.
Almost any other method of parallelism (message passing?) will be easier to implement and debug, and you are getting lots of that in the I/O system of your OS anyway.
I personally think you are wasting your time.
First, run time ./myprog > /dev/null
Now, use time dd if=/dev/zero of=myfile.data bs=1k count=12M.
dd is about as simple a program as you can get, and it will write the file pretty quickly. But writing several gigabytes still takes a little while. (12G takes about 4 minutes on my machine - which is probably not the fastest disk in the world - the same size file to /dev/null takes about 5 seconds).
You can experiment with different numbers for bs=x count=y, where the combination gives the same total size as your program's output for the test run. I found that if you make VERY large blocks it actually takes longer (1 MB per write - probably because the OS needs to copy 1 MB before it can write the data, then write it out, and then copy the next 1 MB), whereas with smaller blocks (I tested 1k and 4k) it takes a lot less time to copy the data, and there's less "disk spinning round not doing anything before we write to it".
Compare both of these times to your program's running time. Is the time it takes to write the file with dd much shorter than the time your program takes writing to the file?
If there isn't much difference, then look at the time it takes to write to /dev/null with your program - is that accounting for some or all of the difference?
Short answer: Measure it.
Long answer: From my experience, it depends too much on the factors that are hard to predict in advance. On the other hand, you do not have to commit yourself before the start. Just implement a generic solution and when you are done, make a few performance tests and take the settings with the best results. A profiler may help you to concentrate on the performance critical parts in your program.
From what I've seen, those who produce the fastest code often try the simplest, most straightforward approach first. What they do better than average programmers is that they have a very good technique for writing performance tests, which is far from trivial.
Without experience, it is easy to fall into certain traps, for example, ignoring caching effects, or (maybe in your application?!) underestimating the costs of IO operations. In the worst case, you end up squeezing parts of the program which do not contribute to the overall performance at all.
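As a concrete example of "measure it", here is a minimal benchmark sketch (the file name, total size, and candidate buffer sizes are arbitrary) that times writing the same amount of data with different buffer sizes; run something like it on the target machine and pick the size that wins.

    /* Minimal "measure it" sketch: time writes of the same total amount of
     * data with different buffer sizes. File name and sizes are arbitrary. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TOTAL_BYTES (1L << 30)   /* 1 GiB per test run */

    static double time_writes(size_t bufsize)
    {
        int fd = open("bench.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }

        char *buf = malloc(bufsize);
        memset(buf, 'x', bufsize);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long written = 0; written < TOTAL_BYTES; written += bufsize)
            if (write(fd, buf, bufsize) < 0) { perror("write"); exit(1); }
        fsync(fd);                                   /* include flush time */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        free(buf);
        close(fd);
        unlink("bench.tmp");
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        size_t sizes[] = { 512, 4096, 65536, 1 << 20 };
        for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
            printf("%8zu bytes/write: %.2f s\n", sizes[i], time_writes(sizes[i]));
        return 0;
    }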
Back to your original question:
In the scenario that you describe (one CPU-bound producer and one IO-bound consumer), it is likely that one of them will be the bottleneck (unless the rate at which the producer generates data varies a lot). Depending on which one is faster, the whole situation changes radically:
Let us first assume, the IO-bound consumer is your bottleneck (doesn't matter whether it writes to stdout or to a file). What are the likely consequences?
Optimizing the algorithm that produces the data will not improve performance; instead, you have to maximize the write performance. I would assume, however, that the write performance will not depend very much on the buffer size (unless the buffer is too small).
In the other case, if the producer is the limiting factor, the situation is reversed. Here you have to profile the generation code and improve the speed of the algorithm and maybe the communication of the data between the reader and the writer thread. The buffer size, however, will still not matter, as the buffer will be empty most of the time, anyway.
Granted, the situation could be more complex than I have described. But unless you are actually sure that you are not in one of the extreme cases, I would not invest in tuning the buffer size yet. Just keep it configurable and you should be fine. I don't think that it should be a problem later to refit it to other hardware environments.
Most modern OSes are good at using the disk as a backing store for RAM. I suggest you leave the heuristics to the OS and just ask for as much memory as you want, till you hit a performance bottleneck.
There's no need to use buffering; the OS will automatically swap pages to disk for you whenever necessary, so you don't have to program that. The simplest thing would be for you to leave it in RAM if you don't need to save the data; otherwise you're probably better off saving it after generating the data, because that's better for the disk I/O.

Using database instead of thousands of small files

At work, I have started working on a program that can potentially generate hundreds of thousands of mostly small files an hour. My predecessors have found out that working with many small files can become very slow, so they have resorted to some (in my opinion) crude methods to alleviate the problem.
So I asked my boss why we don't use a database instead, and he gave me his oh-so-famous I-know-better-than-you look and told me that obviously a database that big won't have good performance.
My question is, is it really so? It seems to me that a database engine should be able to handle such data much better than the file system. Here are the conditions we have:
The program mostly writes data. Queries are much less frequent and their performance is not very important.
Millions of files could be generated every day. Most of these are small (a few kilobytes) but some can be huge.
If you think we should opt for the database solution, what open source database system do you think will work best? (If I decide that a database will certainly work better, I'm going to push for a change whatever the boss says!)
This is another one of those "it depends" type questions.
If you are just writing data (write once, read hardly ever) then just use the file system. Maybe use a hash-directory approach to create lots of sub-directories (things tend to go slowly with many files in a single directory).
If you are writing hundreds of thousands of events for later querying (e.g. find everything with X > 10 and Y < 11) then a database sounds like a great idea.
If you are writing hundreds of thousands of bits of non-relational data (e.g. simple key-value pairs) then it might be worth investigating a NoSQL approach.
The best approach is probably to prototype all the ideas you can think of, measure and compare!
As a minimal impact improvement, I'd split your millions of small files into a hierarchy of directories. So, say you were using UUIDs as your file names: I'd strip out the redundant urn:uuid: at the front, then make 16 directories based on the first letter, and inside them make 16 subdirectories based on the second letter, and add even more levels if you need them. That alone will speed up the access quite a bit. Also, I would remove a directory whenever it became empty, to make sure the directory entry itself doesn't grow larger and larger.
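A minimal sketch of that layout (the "data" base directory and the two-level depth are just examples): derive the nested path from the first characters of the file name and create the intermediate directories on demand.

    /* Sketch: derive a nested directory path from the first characters of a
     * UUID-style file name. Base directory and nesting depth are arbitrary. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Turn "3f2a..." into "data/3/f/3f2a...", creating directories as needed. */
    static int build_path(const char *name, char *out, size_t outlen)
    {
        if (strlen(name) < 2) return -1;

        char dir1[64], dir2[64];
        snprintf(dir1, sizeof(dir1), "data/%c", name[0]);
        snprintf(dir2, sizeof(dir2), "data/%c/%c", name[0], name[1]);

        mkdir("data", 0755);          /* ignoring EEXIST for brevity */
        mkdir(dir1, 0755);
        mkdir(dir2, 0755);

        snprintf(out, outlen, "%s/%s", dir2, name);
        return 0;
    }

    int main(void)
    {
        char path[512];
        if (build_path("3f2a9c51-0000-4b7e-9d21-example", path, sizeof(path)) == 0)
            printf("store the file at: %s\n", path);
        return 0;
    }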

How to manipulate *huge* amounts of data

I'm having the following problem. I need to store huge amounts of information (~32 GB) and be able to manipulate it as fast as possible. I'm wondering what's the best way to do it (combinations of programming language + OS + whatever you think its important).
The structure of the information I'm using is a 4D array (NxNxNxN) of double-precision floats (8 bytes). Right now my solution is to slice the 4D array into 2D arrays and store them in separate files on the HDD of my computer. This is really slow and the manipulation of the data is unbearable, so this is no solution at all!
I'm thinking of moving to a supercomputing facility in my country and storing all the information in RAM, but I'm not sure how to implement an application to take advantage of it (I'm not a professional programmer, so any book/reference will help me a lot).
An alternative solution I'm thinking of is to buy a dedicated server with lots of RAM, but I don't know for sure if that will solve the problem. So right now my ignorance doesn't let me choose the best way to proceed.
What would you do if you were in this situation? I'm open to any idea.
Thanks in advance!
EDIT: Sorry for not providing enough information, I'll try to be more specific.
I'm storing a discretized 4D mathematical function. The operations that I would like to perform include transposition of the array (set b[i,j,k,l] = a[j,i,k,l] and the like), array multiplication, etc.
As this is a simulation of a proposed experiment, the operations will be applied only once. Once the result is obtained it won't be necessary to perform more operations on the data.
EDIT (2):
I also would like to be able to store more information in the future, so the solution should be somehow scalable. The current 32 GB goal is because I want to have the array with N=256 points, but it'll be better if I can use N=512 (which means 512 GB to store it!!).
Amazon's "High Memory Extra Large Instance" is only $1.20/hr and has 34 GB of memory. You might find it useful, assuming you're not running this program constantly..
Any decent answer will depend on how you need to access the data. Random access? Sequential access?
32GB is not really that huge.
How often do you need to process your data? Once per (lifetime | year | day | hour | nanosecond)? Often, stuff only needs to be done once. This has a profound effect on how much you need to optimize your solution.
What kind of operations will you be performing (you mention multiplication)? Can the data be split up into chunks, such that all necessary data for a set of operations is contained in a chunk? This will make splitting it up for parallel execution easier.
Most computers you buy these days have enough RAM to hold your 32GB in memory. You won't need a supercomputer just for that.
As Chris pointed out, what are you going to do with the data?
Besides, I think storing it in a (relational) database will be faster than reading it from the hard drive, since the RDBMS will perform some optimizations for you, like caching.
If you can represent your problem as MapReduce, consider a clustering system optimized for disk access, such as Hadoop.
Your description sounds more math-intensive, in which case you probably want to have all your data in memory at once. 32 GB of RAM in a single machine is not unreasonable; Amazon EC2 offers virtual servers with up to 68 GB.
Without more information: if you need the quickest possible access to all the data, I would go with C for your programming language, some flavor of *nix as the OS, and buying RAM - it's relatively cheap now. This also depends on what you are familiar with; you can go the Windows route as well. But as others have mentioned, it will depend on how you are using this data.
So far, there are a lot of very different answers. There are two good starting points mentioned above. David suggests some hardware and someone mentioned learning C. Both of these are good points.
C is going to get you what you need in terms of speed and direct memory paging. The last thing you want to do is perform linear searches on the data. That would be slow - slow - slow.
Determine your workflow. If your workflow is linear, that is one thing. If the workflow is not linear, I would design a binary tree referencing pages in memory. There is a ton of information on B-trees on the Internet. In addition, these B-trees will be much easier to work with in C, since you will also be able to set up and manipulate your memory paging.
Depending on your use, some mathematical and physical problems tend to be mostly zeros (for example, Finite Element models). If you expect that to be true for your data, you can get serious space savings by using a sparse matrix instead of actually storing all those zeros in memory or on disk.
Check out wikipedia for a description, and to decide if this might meet your needs:
http://en.wikipedia.org/wiki/Sparse_matrix
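If the data really is mostly zeros, a coordinate-list (COO) layout is the simplest sparse form. Here is a small illustrative sketch (all names are made up) that stores only the non-zero entries of the 4D array.

    /* Sketch of a coordinate-list (COO) sparse representation: store only the
     * non-zero entries of the 4D array as (i, j, k, l, value) tuples. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        int i, j, k, l;
        double value;
    } Entry;

    typedef struct {
        Entry  *entries;
        size_t  count, capacity;
    } Sparse4D;

    static void sparse_set(Sparse4D *s, int i, int j, int k, int l, double v)
    {
        if (v == 0.0) return;                         /* zeros are simply not stored */
        if (s->count == s->capacity) {
            s->capacity = s->capacity ? s->capacity * 2 : 1024;
            s->entries = realloc(s->entries, s->capacity * sizeof(Entry));
        }
        s->entries[s->count++] = (Entry){ i, j, k, l, v };
    }

    int main(void)
    {
        Sparse4D s = { 0 };
        sparse_set(&s, 3, 1, 4, 1, 5.9);              /* only non-zeros cost memory */
        printf("stored %zu non-zero entries\n", s.count);
        free(s.entries);
        return 0;
    }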
Here's another idea:
Try using an SSD to store your data. Since you're grabbing very small amounts of random data, an SSD would probably be much, much faster.
You may want to try using mmap instead of reading the data into memory, but I'm not sure it'll work with 32 GB files.
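On a 64-bit system the address space is large enough. Here is a minimal sketch of the mmap approach (the file name and N are placeholders): map the whole file of doubles and index it as a flat 4D array, letting the OS page data in on demand.

    /* Sketch: memory-map a big file of doubles on a 64-bit POSIX system and
     * index it as a flat NxNxNxN array. File name and N are placeholders. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define N 256

    int main(void)
    {
        int fd = open("array.bin", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);                               /* expected: N*N*N*N*8 bytes */

        double *a = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (a == MAP_FAILED) { perror("mmap"); return 1; }

        /* a[i,j,k,l] lives at a flat offset; the OS pages data in on demand. */
        size_t i = 10, j = 20, k = 30, l = 40;
        double x = a[((i * N + j) * N + k) * N + l];
        printf("a[10,20,30,40] = %g\n", x);

        munmap(a, st.st_size);
        close(fd);
        return 0;
    }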
The whole database technology is about manipulating huge amounts of data that can't fit in RAM, so that might be your starting point (i.e. get a good dbms principles book and read about indexing, query execution, etc.).
A lot depends on how you need to access the data - if you absolutely need to jump around and access random bits of information, you're in trouble, but perhaps you can structure your processing of the data such that you will scan it along one axis (dimension). Then you can use a smaller buffer and continuously dump already processed data and read new data.
For transpositions, it's faster to actually just change your understanding of what index is what. By that, I mean you leave the data where it is and instead wrap an accessor delegate that changes b[i][j][k][l] into a request to fetch (or update) a[j][i][k][l].
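A small sketch of that trick (all names are illustrative): keep a permutation of the axes in an accessor and apply it when computing the flat index, so a "transpose" just swaps two entries of the permutation instead of moving gigabytes of data.

    /* Sketch: a "lazy transpose" accessor. The data never moves; the accessor
     * keeps a permutation of the axes and applies it when computing the index. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 4   /* tiny N just for the demo */

    typedef struct {
        double *data;     /* flat NxNxNxN storage */
        int     perm[4];  /* which logical axis feeds which physical axis */
    } View4D;

    static double get(const View4D *v, const int idx[4])
    {
        int p[4];
        for (int d = 0; d < 4; d++)
            p[d] = idx[v->perm[d]];                   /* reorder logical indices */
        return v->data[((p[0] * N + p[1]) * N + p[2]) * N + p[3]];
    }

    static void transpose01(View4D *v)                /* b[i,j,k,l] = a[j,i,k,l] */
    {
        int tmp = v->perm[0];
        v->perm[0] = v->perm[1];
        v->perm[1] = tmp;
    }

    int main(void)
    {
        View4D v = { calloc(N * N * N * N, sizeof(double)), { 0, 1, 2, 3 } };
        v.data[((1 * N + 2) * N + 0) * N + 0] = 42.0; /* a[1,2,0,0] = 42 */

        transpose01(&v);
        int idx[4] = { 2, 1, 0, 0 };                  /* b[2,1,0,0] now reads a[1,2,0,0] */
        printf("%g\n", get(&v, idx));

        free(v.data);
        return 0;
    }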
Could it be possible to solve it by this procedure?
First create M child processes and execute them in parallel. Each process will run on a dedicated core of a cluster and will load some portion of the array into the RAM of that core.
A father process will be the manager of the array, calling (or connecting to) the appropriate child process to obtain certain chunks of data.
Will this be faster than the HDD storage approach? Or am I cracking nuts with a sledgehammer?
The first thing that I'd recommend is picking an object-oriented language, and develop or find a class that lets you manipulate a 4-D array without concern for how it's actually implemented.
The actual implementation of this class would probably use memory-mapped files, simply because that can scale from low-power development machines up to the actual machine where you want to run production code (I'm assuming that you'll want to run this many times, so that performance is important -- if you can let it run overnight, then a consumer PC may be sufficient).
Finally, once I had my algorithms and data debugged, I would look into buying time on a machine that could hold all the data in memory. Amazon EC2, for instance, will provide you with a machine that has 68 GB of memory for $US 2.40 an hour (less if you play with spot instances).
How to handle processing large amounts of data typically revolves around the following factors:
Data access order / locality of reference: Can the data be separated out into independent chunks that are then processed either independently or in a serial/sequential fashion, versus random access to the data with little or no order?
CPU vs I/O bound: Is the processing time spent more on computation with the data or reading/writing it from/to storage?
Processing frequency: Will the data be processed only once, every few weeks, daily, etc?
If the data access order is essentially random, you will need either to get access to as much RAM as possible and/or find a way to at least partially organize the order so that not as much of the data needs to be in memory at the same time. Virtual memory systems slow down very quickly once physical RAM limits are exceeded and significant swapping occurs. Resolving this aspect of your problem is probably the most critical issue.
Other than the data access order issue above, I don't think your problem has significant I/O concerns. Reading/writing 32 GB is usually measured in minutes on current computer systems, and even data sizes up to a terabyte should not take more than a few hours.
Programming language choice is actually not critical so long as it is a compiled language with a good optimizing compiler and decent native libraries: C++, C, C#, or Java are all reasonable choices. The most computationally and I/O-intensive software I've worked on has actually been in Java and deployed on high-performance supercomputing clusters with a few thousand CPU cores.

File descriptor limits and default stack sizes

Where I work we build and distribute a library and a couple of complex programs built on that library. All code is written in C and is available on most 'standard' systems like Windows, Linux, AIX, Solaris, and Darwin.
I started in the QA department and while running tests recently I have been reminded several times that I need to remember to set the file descriptor limits and default stack sizes higher or bad things will happen. This is particularly the case with Solaris and now Darwin.
Now this is very strange to me because I am a believer in 0 required environment fiddling to make a product work. So I am wondering if there are times where this sort of requirement is a necessary evil, or if we are doing something wrong.
Edit:
Great comments that describe the problem and give a little background. However, I do not believe I worded the question well enough. Currently we require customers, and hence us the testers, to set these limits before running our code. We do not do this programmatically. And this is not a situation where they MIGHT run out; under normal load our programs WILL run out and seg fault.
So, rewording the question: is requiring the customer to change these ulimit values to run our software to be expected on some platforms, i.e., Solaris, AIX, or are we as a company making it too difficult for these users to get going?
Bounty:
I added a bounty to hopefully get a little more information on what other companies are doing to manage these limits. Can you set these programmatically? Should we? Should our programs even be hitting these limits, or could this be a sign that things might be a bit messy under the covers? That is really what I want to know; as a perfectionist, a seemingly dirty program really bugs me.
If you need to change these values in order to get your QA tests to run, then that is not too much of a problem. However, requiring a customer to do this in order for the program to run should (IMHO) be avoided. If nothing else, create a wrapper script that sets these values and launches the application so that users will still have a one-click application launch. Setting these from within the program would be the preferable method, however. At the very least, have the program check the limits when it is launched and (cleanly) error out early if the limits are too low.
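For the "check the limits when it is launched" part, here is a minimal POSIX sketch (the required descriptor count of 4096 is just an example figure): raise the soft limit toward the hard limit with setrlimit() and fail early with a clear message if even that is too low.

    /* Sketch: at startup, raise the soft open-file limit toward the hard limit
     * and fail early with a clear message if the hard limit is still too low.
     * The required count (4096) is just an example figure. */
    #include <stdio.h>
    #include <sys/resource.h>

    #define FDS_NEEDED 4096

    static int ensure_fd_limit(rlim_t needed)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("getrlimit");
            return -1;
        }
        if (rl.rlim_cur >= needed)
            return 0;                                  /* already sufficient */

        if (rl.rlim_max != RLIM_INFINITY && rl.rlim_max < needed) {
            fprintf(stderr,
                    "need %llu file descriptors but the hard limit is %llu;\n"
                    "please raise it (via ulimit or the system limits config) and restart\n",
                    (unsigned long long)needed, (unsigned long long)rl.rlim_max);
            return -1;
        }

        rl.rlim_cur = needed;                          /* raise only the soft limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        if (ensure_fd_limit(FDS_NEEDED) != 0)
            return 1;
        /* ... normal program startup ... */
        return 0;
    }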
If a software developer told me that I had to mess with my stack and descriptor limits to get their program to run, it would change my perception of the software. It would make me wonder "why do they need to exceed the system limits that are apparently acceptable for every other piece of software I have?". This may or may not be a valid concern, but being asked to do something that (to many) can seem hackish doesn't have the same professional edge as a program that you just launch and go.
This problem seems even worse when you say "this is not a situation where they MIGHT run out, under normal load our programs WILL run out and seg fault". A program exceeding these limits is one thing, but a program that doesn't gracefully handle the error conditions resulting from exceeding these limits is quite another. If you hit the file handle limit and attempt to open a file, you should get an error indicating that you have too many files open. This shouldn't cause a program crash in a well-designed program. It may be more difficult to detect stack usage issues, but running out of file descriptors should never cause a crash.
You don't give much details about what type of program this is, but I would argue that it's not safe to assume that users of your program will necessarily have adequate permissions to change these values. In any case, it's probably also unsafe to assume that nothing else might change these values while your program is running without the user's knowledge.
While there are always exceptions, I would say that in general a program that exceeds these limits needs to have its code re-examined. The limits are there for a reason, and pretty much every other piece of software on your system works within those limits with no problems. Do you really need that many files open at the same time, or would it be cleaner to open a few files, process them, close them, and open a few more? Is your library/program trying to do too much in one big bundle, or would it be better to break it into smaller, independent parts that work together? Are you exceeding your stack limits because you are using a deeply-recursive algorithm that could be re-written in a non-recursive manner? There are likely many ways in which the library and program in question can be improved in order to ease the need to alter the system resource limits.
The short answer is: it's normal, but not inflexible. Of course, limits are in place to prevent rogue processes or users from starving the system of resources. Desktop systems will be less restrictive than server systems but still have certain limits (e.g. filehandles.)
This is not to say that limits cannot be altered in persistent/reproducible manners, either by the user at the user's discretion (e.g. by adding the relevant ulimit calls in .profile) or programmatically from within programs/libraries which know with certitude that they will require large amounts of filehandles (e.g. setsysinfo(SSI_FD_NEWMAX,...)), stack (provided at pthread creation time), etc.
On Darwin, the default soft limit on the number of open files is 256; the default hard limit is unlimited.
AFAICR, on Solaris, the default soft limit on the number of open files is 16384 and the hard limit is 32768.
For stack sizes, Darwin has soft/hard limits of 8192/65536 KB. I forget what the limit is on Solaris (and my Solaris machine is unavailable - power outages in Poughkeepsie, NY mean I can't get to the VPN to access the machine in Kansas from my home in California), but it is substantial.
I would not worry about the hard limits. If I thought the library might run out of 256 file descriptors, I'd increase the soft limit on Darwin; I would probably not bother on Solaris.
Similar limits apply on Linux and AIX. I can't answer for Windows.
Sad story: a few years ago now, I removed the code that changed the maximum file size limit in a program - because it had not been changed from the days when 2 MB was a big file (and some systems had a soft limit of just 0.5 MB). Once upon a decade and some ago, it actually increased the limit; when it was removed, it was annoying because it reduced the limit. Tempus fugit and all that.
On SuSE Linux (SLES 10), the open files limits are 4096/4096, and the stack limits are 8192/unlimited.
As you have to support a large number of different systems, I would consider it wise to set up certain known-good values for system limits/resources, because the default values can differ wildly between systems.
The default size for pthread stacks is, for example, such a case. I recently found out that the default on HP-UX 11.31 is 256 KB(!), which isn't very reasonable, at least for our applications.
Setting up well-defined values increases the portability of an application, as you can be sure that there are X file descriptors, a stack size of Y, ... on every platform, and that things are not just working by good luck.
I tend to set up such limits from within the program itself, as the user then has fewer things to screw up (someone always tries to run the binary without the wrapper script). To optionally allow for runtime customization, environment variables could be used to override the defaults (while still enforcing the minimum limits).
Let's look at it this way: it is not very customer-friendly to require customers to set these limits. As detailed in the other answers, you are most likely hitting soft limits, and these can be changed. So change them automatically, if necessary in a script that starts the actual application (you can even write it so that it fails if the hard limits are too low and produces a nice error message instead of a segfault).
That's the practical part of it. Without knowing what the application does, I'm guessing a bit, but in most cases you should not be anywhere close to hitting any of the default limits of (even less progressive) operating systems. Assuming the system is not a server that is bombarded with requests (hence the large number of file/socket handles used), it is probably a sign of sloppy programming. Based on experience with programmers, I would guess that file descriptors are left open for files that are only read/written once, or that the system keeps a file descriptor open on a file that is only sporadically changed/read.
Concerning stack sizes, that can mean two things. The standard cause of a program running out of stack is excessive recursion (or unbounded recursion), which is an error condition that the limits actually are designed to address. The second thing is that some big (probably configuration) structures are allocated on the stack that should be allocated in heap memory. It might even be worse and those huge structures are being passed around by value (instead of reference) and that would mean a big hit on available (wasted) stack space as well as a big performance penalty.
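To illustrate the last point, here is a tiny sketch (sizes are arbitrary) contrasting passing a large structure by value, which copies the whole thing onto the stack at every call, with passing a pointer to it.

    /* Sketch: a large configuration structure. Passing it by value copies the
     * whole thing onto the stack at every call; passing a pointer does not. */
    #include <stdio.h>

    typedef struct {
        double samples[100000];   /* ~800 KB: a few nested by-value calls get
                                     close to a typical 8 MB default stack */
    } BigConfig;

    static double sum_by_value(BigConfig c)          /* copies ~800 KB per call */
    {
        double s = 0;
        for (int i = 0; i < 100000; i++) s += c.samples[i];
        return s;
    }

    static double sum_by_pointer(const BigConfig *c) /* copies 8 bytes per call */
    {
        double s = 0;
        for (int i = 0; i < 100000; i++) s += c->samples[i];
        return s;
    }

    int main(void)
    {
        static BigConfig cfg = { { 0 } };            /* static: keep it off the stack */
        printf("%g %g\n", sum_by_value(cfg), sum_by_pointer(&cfg));
        return 0;
    }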
A small tip: if you plan to run the application on a 64-bit processor, be careful about setting the stack size to unlimited, which on a 64-bit Linux system gives -1 as the stack size.
Thanks,
Shyam
Perhaps you could add whatever is appropriate to the start script, like 'ulimit -n -S 4096'.
But having worked with Solaris since 2.6, it's not unusual to modify rlim_fd_cur and rlim_fd_max in /etc/system permanently. In older versions of Solaris, they're just too low for some workloads, like running web servers.
