This question has been asked with varying degrees of success in the past...
Are there tools, or C/C++ unix functions to call that would enable me to retrieve the location on disk of a file? Not some virtual address of the file, but the disk/sector/block the file resides in?
The goal here is to enable overwriting of the actual bits that exist on disk. I would probably need a way to bypass the kernel's superimposition of addresses. I am willing to consider an x86 asm based solution...
However, I feel there are tools that do this quite well already.
Thanks for any input on this.
Removing files securely is only possible under very specific circumstances:
There are no uncontrolled layers of indirection between the OS and the actual storage medium.
On modern systems that can no longer be assumed. SSD drives with firmware wear-leveling code do not work like this; they may move or copy data at will with no logging or possibility of outside control. Even magnetic disk drives will routinely leave existing data in sectors that have been remapped after a failure. Hybrid drives do both...
The ATA specification does support a SECURE ERASE command which erases a whole drive, but I do not know how thorough the existing implementations are.
The filesystem driver has a stable and unique mapping of files to physical blocks at all times.
I believe that ext2fs has this property. I think that ext3fs and ext4fs also work like this in the default journaling mode, but not when mounted with the data=journal option, which allows file data, rather than just metadata, to be stored in the journal.
On the other hand reiserfs definitely works differently, since it stores small amounts of data along with the metadata, unless mounted with the notail option.
If these two conditions are met, then a program such as shred may be able to securely remove the content of a file by overwriting its content multiple times.
This method still does not take into account:
Backups
Virtualized storage
Left over data in the swap space
...
Bottom line:
You can no longer assume that secure deletion is possible. Better assume that it is impossible and use encryption; you should probably be using it anyway if you are handling sensitive data.
There is a reason that protocols regarding sensitive data mandate the physical destruction of the storage medium. There are companies that actually demagnetize their hard disk drives and then shred them before incinerating the remains...
Context: I'm a student who just finished an operating systems course and is currently taking a databases course.
I'm confused about how the OS and the DBMS interact with one another.
For example, what happens when a user program tries to access a file? Does a system call get invoked that is then handled by the OS to find the correct file and data? Or is the call handled by the DBMS, which can then more efficiently find the data (tuple/record) using a B+ tree for example? And then the DBMS makes a call to the OS to actually get the data?
Is the database only accessed if using a programming language like SQL? If I just write a simple C program that writes a file to disk, is the data really stored in a "database" or just in some block on disk where the information for the file is stored within the inode for that file?
I apologize if this isn't the correct forum to ask this question and also if this question is too simple. I tried looking online, but surprisingly didn't find much info (maybe I was searching for the wrong key words?)
For example, what happens when a user program tries to access a file?
That depends on how the user program is accessing the file. Plain fopen/read/write calls are serviced by the filesystem, which is managed by the OS.
Does a system call get invoked that is then handled by the OS to find the correct file and data?
If a database is used, the database manages its own set of data files, and indexes into those data files. The database engine issues IO requests, which are handled by the underlying OS. Additionally, the database will most likely also do caching to reduce file IO.
Or is the call handled by the DBMS, which can then more efficiently find the data (tuple/record) using a B+ tree for example? And then the DBMS makes a call to the OS to actually get the data?
Depending on the database query, the data can be read sequentially, or accessed randomly via an index lookup.
Is the database only accessed if using a programming language like SQL? If I just write a simple C program that writes a file to disk, is the data really stored in a "database" or just in some block on disk where the information for the file is stored within the inode for that file?
It's stored in the filesystem, which typically uses a block-based design. A simple C program would use an SDK to connect to a database and then invoke SQL statements.
Hope this helps!
The answer really depends upon the operating system and the database system. Industrial-strength operating systems have support for multiple file structures and record-level locking. A database system can make use of the facilities provided by the operating system for locking and indices. In that case, much of the database works by invoking system service calls to the operating system.
With the rise of brain-damaged operating systems that do not support records at all, let alone record or column locking, the database system does nearly all of the work beyond low-level I/O calls. Some operating systems are so brain-damaged that database systems have to create their own partitions and effectively manage their own file systems within the partition. There are no files at all from the perspective of the operating system.
I am writing an embedded system, where I am creating a USB mass storage device driver that uses an 8MB chunk of RAM as a FAT filesystem.
Although it made sense at the time to allow the OS to take my zero'd out RAM area and format a FAT partition itself, I ran into problems that I am diagnosing on Jan Axelson's (popular author of books on USB) PORTS forum. The problem is related to erasing the blocks in the drive, and may be circumvented by pre-formatting the memory area to a FAT filesystem before USB enumeration. Then, I could see if the drive will operate normally. Ideally, the preformatted partition would include a file on it to test the read operation.
It would be nice if I could somehow create and mount a mock 8MB FAT filesystem on my OS (OSX), write a file to it, and export it to an image file for inclusion in my project. Does someone know how to do this? I could handle the rest. I'm not too concerned whether that would be FAT12/16/32 at the moment; optional MBR inclusion would be nice.
If that option doesn't exist, I'm looking to use a pre-written utility to create a FAT img file that I could include in my project and upload directly to RAM. This utility would allow me to specify an 8MB filesystem with 512-byte sectors, for instance, and possibly FAT12 / FAT16 / FAT32.
Is anyone aware of such a utility? I wasn't able to find one.
If not, can someone recommend a first step to take in implementing this in C? I'm hoping a library exists. I'm pretty exhausted after implementing the mass storage driver from scratch, but I understand I might have to 'get crinkled' and manually create the FAT partition. It's not too hard. I imagine some packed structs and some options. I'll get there. I already have resources on FAT filesystem itself.
I ended up discovering that FatFS has facilities for formatting and partitioning the "drive" from within the embedded device, and it relieved me of having to format it manually or use host-side tools.
I would like to cover in more detail the steps taken, but I am exhausted. I may edit in further details at a later time.
There are several; they're normally hidden in the OS source.
On BSD (i.e. OS X) you should have a "mkdosfs" tool; if not, the source is available all over the place ... here's a random example:
http://www.blancco.com/downloads/source/utils/mkdosfs/mkdosfs.c
Also there's the 'mtools' package; it's normally used for floppies, but I think it handles disk images too.
Neither of these will create partition tables though; you'd need something else if that's required too.
I'm working on a personal project to regularly (monthly-ish) traverse my hard disk and shred (overwrite with zeros) any blocks on the disk not currently allocated to any inode(s).
C seemed like the most logical language to do this in given the low-level nature of the project, but I am not sure how best to find the unused blocks in the filesystem. I've found some questions around S.O. and other places that are similar to this, but did not see any consensus on the best way to efficiently and effectively find these unused blocks.
df has come up in nearly every question even remotely similar to this, but I don't believe it has the resolution necessary to specify exact block offsets, unless I am missing something. Is there another utility I should look into, or some other direction entirely?
Whatever solution I develop would need to be able to handle, at minimum, ext3 filesystems, and preferably ext4 also.
There isn't really any general solution for finding out which blocks are in use, other than writing your own implementation that reads and parses the on-disk filesystem data, which is highly specific to each filesystem you want to support. How the data looks on disk is often undocumented outside of the code for that filesystem, and when it is documented, the documentation is often out of date compared to the actual implementation.
Your best bet is to read the implementation of fsck for the filesystem you want to support since it does more or less what you're interested in, but be warned that many fsck implementations out there don't always check all of the data that belongs to the filesystem. You might have alternate superblocks and certain metadata that fsck doesn't check (or only checks in case the primary superblock is corrupted).
If you really want to do what you say you want to do and not just learn about filesystems your best bet is to dump your filesystem like a normal backup, wipe the disk and restore the backup. I highly doubt anything else is safe to do especially considering that your disk wiping application might break your filesystem with any kernel update you do.
Linux currently supports over a dozen different filesystems, so the answer will depend on which one you choose.
However, they should all have an easy way of finding free blocks otherwise creating new files or extending current files would be a bit slow.
For example, ext2 has, for each block group, metadata containing, among other things, the block bitmap that records which blocks in that group are free. I don't believe this has changed fundamentally in ext4, even though there's a lot of extra stuff in there.
You would probably be far better off traversing those free-block bitmaps rather than taking an arbitrary block and trying to figure out whether it's used or free.
Are there operating systems that don't organize storage as files and directories at all? It seems like there isn't anything inherent in an operating system that would necessarily require that sort of abstraction/metaphor.
If so, what are they? Are they still used anywhere? I'd be especially interested in knowing about examples that can be run/experimented with on a standard desktop computer.
Examples are Persistent Haskell, Squeak Smalltalk, and KeyKOS and its descendants.
It seems like there isn't anything inherent in an operating system
that would necessarily require that sort of abstraction/metaphor.
There isn't any necessity, it's completely bogus. In fact, forcing everything to be accessible via a human readable name is fundamentally flawed, and precludes security due to Zooko's triangle.
Similar hierarchies appear in DNS, URLs, programming language module systems (Python and Java are two good examples), torrents, and X.509 PKI.
One system that fixes some of the problems caused by DNS/URLs/X.509 PKI is Waterken's YURL.
All these systems exhibit ridiculous problems because the system is designed around some fancy hierarchy instead of for something that actually matters.
I've been planning on writing some blogs explaining why these types of systems are bad, I'll update with links to them when I get around to it.
I found this http://pages.stern.nyu.edu/~marriaga/papers/beyond-the-hfs.pdf but it's from 2003. Is something like that what you are looking for?
Around 1995, I started to design an object-oriented operating system (SOOOS) that has no file system. Almost everything is an object that exists in virtual memory, which is mapped/paged directly to the disk (either local or networked, i.e. rudimentary cloud computing).
There is a lot of overhead in programs to read and write data in specific formats. Imagine never reading and writing files. In SOOOS there are no such things as files and directories. Autonomous objects, which essentially replace files, can be organized to suit your needs, rather than forced into a restrictive hierarchical file system. There are no low-level drive format structures (i.e. clusters) imposing an additional level of abstraction and translation overhead.
SOOOS data storage overhead is simply limited to page tables, which can be indexed as quickly as with basic virtual memory paging. Autonomous objects each have their own dynamic virtual memory space, which serves as the persistent data store. When active, they are given a task context, added to the active process task list, and then exist as processes. A lot of complexity is eliminated in my design: simply instantiate objects in a program and let the memory manager and virtual memory system handle everything consistently, with minimal overhead.
Booting the operating system is simply a matter of loading the basic kernel, setting up the virtual memory page tables to point at the key OS objects, and (re)starting the OS object tasks. When the computer is turned off, shutdown is essentially analogous to hibernation, so the OS is nearly in instant-on status. The parts (pages) of data and code are loaded only as needed. For example, to edit a document, instead of starting a program by loading the entire executable into memory, simply load the task control structure of the autonomous object and set the instruction pointer to the function to be performed. The code is paged in only as the instruction pointer traverses its virtual memory. Data is always immediately ready to be used and is paged in only as accessed, with no need to parse files and manage data structures that often have a distinct representation in memory versus secondary storage. Simply use the program's native memory allocation mechanism and abstract data types, without disparate and/or redundant data structures.
Object Linking and Embedding-style program interaction, memory-mapped IO, and interprocess communication come practically for free, since memory sharing would be implemented using the facilities of the processor's Memory Management Unit.
I am wondering how the OS is reading/writing to the hard drive.
I would like as an exercise to implement a simple filesystem with no directories that can read and write files.
Where do I start?
Will C/C++ do the trick or do I have to go with a more low level approach?
Is it too much for one person to handle?
Take a look at FUSE: http://fuse.sourceforge.net/
This will allow you to write a filesystem without having to actually write a device driver. From there, I'd start with a single file. Basically create a file that's (for example) 100MB in length, then write your routines to read and write from that file.
Once you're happy with the results, then you can look into writing a device driver, and making your driver run against a physical disk.
The nice thing is you can use almost any language with FUSE, not just C/C++.
I found it quite easy to understand a simple filesystem by using the FAT filesystem on an AVR microcontroller.
http://elm-chan.org/fsw/ff/00index_e.html
Take a look at the code and you will figure out how FAT works.
For learning the ideas of a file system, it's not really necessary to use a disk, I think. Just create an array of 512-byte byte-arrays, imagine this is your hard disk, and start to experiment a bit.
Also, you may want to have a look at some of the standard OS textbooks, like http://codex.cs.yale.edu/avi/os-book/OS8/os8c/index.html
The answer to your first question is that, besides FUSE, as someone else told you, you can also use Dokan, which does the same for Windows. From there it's just a matter of doing reads and writes to a physical partition (http://msdn.microsoft.com/en-us/library/aa363858%28v=vs.85%29.aspx; read particularly the section on physical disks and volumes).
Of course, on Linux or Unix, besides using something like FUSE, you only have to issue a read or write call to the desired device in /dev/xxx (if you are root); in these terms the Unices are more friendly, or more insecure, depending on your point of view.
From there, try to implement a simple filesystem like FAT, something more exotic like a tar filesystem, a simple filesystem based on Unix concepts like UFS or Minix, or even just something that logs the calls that are made and their arguments to a log file (this will help you understand the calls that are made to the filesystem driver during regular use of your computer).
Now your second question (which is much simpler to answer): yes, C/C++ will do the trick, since they are the lingua franca of systems development; a lot of the example code you find will be in C/C++, so you will at least be reading C/C++ during your development.
Now for your third question: yes, this is doable by one person. For example, the ext filesystem (widely known in the Linux world through its successors ext2 and ext3) was made by a single developer, Rémy Card, so don't think that these things aren't doable by a single person.
Now the final notes: remember that a real filesystem interacts with a lot of other subsystems in a regular kernel. For example, if you have a laptop and hibernate it, the filesystem has to flush all changes made to the open files. If you have a pagefile on the partition, or the pagefile has its own filesystem, that will affect your filesystem too, particularly the block sizes: they tend to be equal to (or powers of) the page size, because it's easy to place a filesystem block in memory when it matches the page size (that's just one transfer).
And also security: you will want to control the users and what files they read/write, and that usually means that before opening a file you have to know which user is logged on and what permissions they have for that file. And obviously, without a filesystem, users can't run any program or interact with the machine. Modern filesystem layers also interact with the network subsystem, due to the fact that there are network and distributed filesystems.
So if you want to go and learn about writing kernel filesystems, those are some of the things you will have to worry about (besides knowing a VFS interface).
P.S.: If you want to make Unix permissions work on Windows, you can use something like what MS uses for NFS on the server versions of windows (http://support.microsoft.com/kb/262965)