I am on a VxWorks 6.9 platform and want to know how many files are in a folder. The file system is DOSFS (FAT). The only way I know of is to simply loop through every file in the folder and count, which gets increasingly expensive as the number of files grows. Is there a more sensible way to do this? Does there exist some internal database or count of all files in a folder?
The FAT filesystem does not keep track of the number of files it contains. What it does contain is:
A boot sector
A filesystem information sector (on FAT32) including:
Last number of known free clusters
Number of the most recently allocated cluster
Two copies of the file allocation table
An area for the root directory (on FAT12 and FAT16)
Data clusters
You'll need to walk the directory tree to get a count.
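For reference, the walk itself is simple with the POSIX dirent API that VxWorks 6.9 exposes on DOSFS. A minimal sketch (error handling reduced to the essentials; the directory path is whatever your volume uses):

#include <stdio.h>
#include <dirent.h>

/* Count the entries in dirPath by walking it; returns -1 on error. */
int countFiles(const char *dirPath)
{
    DIR *dirp = opendir(dirPath);
    struct dirent *entry;
    int count = 0;

    if (dirp == NULL)
        return -1;

    while ((entry = readdir(dirp)) != NULL)
    {
        /* Skip the "." and ".." pseudo-entries, if present. */
        if (entry->d_name[0] == '.' &&
            (entry->d_name[1] == '\0' ||
             (entry->d_name[1] == '.' && entry->d_name[2] == '\0')))
            continue;
        count++;
    }

    closedir(dirp);
    return count;
}

This is still O(number of entries); there is simply no cheaper metadata to consult on FAT.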
I am working with the disk directly. Since the size of a directory is 0 in its directory entry, I wonder how to detect the end of a directory file on the disk.
DIR_Name[0] == 0x00
The above way to detect the end of a directory doesn't seem reliable. I found on Wikipedia that the size of the root directory in FAT32 is fixed at 512 entries, but what about other subdirectories? I might need to traverse down directories using the FAT and the cluster number.
From the first Google search result for "fat32 on disk format", page 24:
When a directory is created (a file with the ATTR_DIRECTORY bit set in
its DIR_Attr field), you set its DIR_FileSize to 0. DIR_FileSize is not
used and is always 0 on a file with the ATTR_DIRECTORY attribute
(directories are sized by simply following their cluster chains to the
EOC mark).
Also: The FAT32 root directory size is not fixed at 512 entries; its size is determined in exactly the same way as any other directory.
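In practice, a reliable scan treats DIR_Name[0] == 0x00 as "no more entries in this directory" and DIR_Name[0] == 0xE5 as "this entry is free", and keeps following the directory's cluster chain to the EOC mark. A sketch, assuming the caller loads each cluster of the chain into buf:

#include <stdint.h>
#include <stddef.h>

#define DIR_ENTRY_SIZE 32
#define ENTRY_END      0x00  /* no allocated entries from here on */
#define ENTRY_FREE     0xE5  /* deleted entry; keep scanning */

/* Returns 1 if the end-of-directory marker was seen, 0 if the caller
 * should continue with the next cluster in the chain. */
int scanDirCluster(const uint8_t *buf, size_t bytes)
{
    size_t off;
    for (off = 0; off + DIR_ENTRY_SIZE <= bytes; off += DIR_ENTRY_SIZE)
    {
        uint8_t first = buf[off];  /* DIR_Name[0] */
        if (first == ENTRY_END)
            return 1;              /* end of directory */
        if (first == ENTRY_FREE)
            continue;              /* deleted entry */
        /* ... a live entry: inspect DIR_Attr at offset 11, etc. ... */
    }
    return 0;                      /* scan the next cluster */
}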
From another reliable source:
Reading Directories
The first step in reading directories is finding and reading the root
directory. On FAT12 or FAT16 volumes the root directory is at a
fixed position immediately after the File Allocation Tables:
first_root_dir_sector = first_data_sector - root_dir_sectors;
In FAT32, the root directory lives in the data area at a given cluster and can be
a cluster chain.
root_cluster_32 = extBS_32->root_cluster;
A non-root directory is just a file.
The root directory starts in a fixed place on the disk (following the FAT). An entry in the root directory contains a cluster number. That cluster contains the data of the file or directory. The entry for that cluster number in the FAT, i.e. FAT[cluster_number], contains the number of the next cluster that belongs to the file or directory. That cluster contains more data, and its FAT entry contains the number of the cluster after it, et cetera, until you encounter the end-of-chain mark: a value equal to or greater than 0x0FFFFFF8 (FAT32 entries only use the low 28 bits; the top 4 bits are reserved and should be masked off).
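In code, following a cluster chain is a simple loop over the FAT. A sketch, assuming the FAT has been loaded into memory as an array of 32-bit entries (a real implementation would also check for the bad-cluster mark 0x0FFFFFF7):

#include <stdint.h>

#define FAT32_ENTRY_MASK 0x0FFFFFFF  /* top 4 bits are reserved */
#define FAT32_EOC_MIN    0x0FFFFFF8  /* end-of-chain marks */

void walkChain(const uint32_t *fat, uint32_t firstCluster)
{
    uint32_t cluster = firstCluster & FAT32_ENTRY_MASK;
    while (cluster < FAT32_EOC_MIN)
    {
        /* ... read and process the data of `cluster` here ... */
        cluster = fat[cluster] & FAT32_ENTRY_MASK;  /* next link */
    }
}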
Using 7-Zip I compressed ~15 GB of pictures (split across folders) into fifteen 1024 MB volume files.
Compression method: LZMA2; Level: Ultra; Dictionary size: 64M;
At the end of compression some of the files had their "last modified" time changed to the time of completion, while some of the files didn't.
Why is this?
And if I have already uploaded most of the files, will I be able to unarchive them successfully?
You would need to ask the author of the program why it modifies volumes at the end of the operation. If I had to make an educated guess: 7-Zip doesn't know which volume is the last until it has finished (that depends on the compression ratio of the files being archived, which can't be predicted), so it needs to go back and update parts of the volume file headers accordingly.
In general, though, quoting the relevant 7-zip help file entry:
NOTE: Please don't use volumes (and don't copy volumes) before
finishing archiving. 7-Zip can change any volume (including first
volume) at the end of archiving operation.
The only safe assumption is that you can't reliably use any of your individual 1 GB volumes until 7-Zip has finished processing the whole 15 GB archive.
I have been exploring File Allocation Table recovery for the last couple of weeks. My goal is to locate a possibly deleted file by its signature (for example, a ZIP file by the bytes "50 4B 03 04") and recover the whole thing in order to search inside it.
I've found that there's a problem with FAT: the file system uses the allocation table entries both to store cluster chains and to mark clusters as free, so deleting a file wipes its chain, making recovery, at first sight, impossible.
But there's a lot of recovery software advertised as recovering files deleted from FAT file systems, so I assume there must be a workaround.
I've found that we can successfully recover files laid out contiguously on disk: the first cluster gives us an index, and from its address we have a strong possibility of finding the directory entry where the file size is stored. But is that the end of it? I'd like to recover fragmented files as well, but I can't find a way.
Does anyone know of a workaround and could help me out a bit, please?
The FAT file system uses a directory entry for each file and folder, which records the starting cluster, filename, date, and size. To access a file, the system looks in the directory, finds the file, and notes the starting cluster. Then it goes to the entry in the FAT (file allocation table) that corresponds to the starting cluster. That entry contains the cluster number of the next cluster; the next cluster's entry points to the one after that, and so on until you come to an end-of-file marker, which means that is the last cluster used by the file.
When you delete a file or folder, the system locates the directory it resides in, changes the first byte of the file or folder name entry to 0xE5, and clears the file's FAT chain (the clusters are marked free).
That is why, once a file is deleted on a FAT file system, you can recover only contiguous files. All data recovery utilities use this method; there is no other, unless you can find traces of the FAT with the correct cluster chains still in place.
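The contiguous case boils down to reading the deleted directory entry for the start cluster and size, then reading consecutive clusters. A sketch under those assumptions (readCluster and bytesPerCluster are hypothetical helpers, e points at a raw 32-byte entry whose first byte is 0xE5, and a little-endian host is assumed):

#include <stdint.h>
#include <string.h>

extern uint32_t bytesPerCluster;
extern void readCluster(uint32_t cluster, uint8_t *dst);

void recoverContiguous(const uint8_t *e, uint8_t *out)
{
    uint16_t hi, lo;
    uint32_t size, firstCluster, clusters, i;

    memcpy(&hi,   e + 20, 2);  /* DIR_FstClusHI */
    memcpy(&lo,   e + 26, 2);  /* DIR_FstClusLO */
    memcpy(&size, e + 28, 4);  /* DIR_FileSize  */

    firstCluster = ((uint32_t)hi << 16) | lo;
    clusters = (size + bytesPerCluster - 1) / bytesPerCluster;

    /* The FAT chain was cleared on deletion, so we can only assume
     * the file occupied consecutive clusters. */
    for (i = 0; i < clusters; i++)
        readCluster(firstCluster + i, out + i * bytesPerCluster);
}

For a fragmented file this guess fails at the first gap, which is exactly why generic FAT undelete tools only promise contiguous recovery.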
The documentation says that all files moved to the trash are normally stored in ~/.local/share/Trash/files. Is there an exception for files removed from removable media? Are they stored in a drive_root/.Trash-xxx directory? Or is this behavior obsolete?
How do I find the real file path of a file in the trash can? I have a list of GFileInfo objects obtained from g_file_enumerate_children for the trash:/// URI. This is easy if all files are stored in one directory, but I'm afraid it could be different for removable drives.
On removable media there are .Trash-$(user_id) folders, so you will have to scan all mounted disks as well as the home trash can.
Under each mounted device (other than the home folder) there will be a .Trash-<uid> folder for each user who has ever deleted something there. For example, for my user foo, which has ID 1000 (see /etc/passwd), you would look for .Trash-1000 folders.
This is AFAIK not obsolete; just think about the opposite: the file would have to be copied over to your home storage just to move it to the trash...
For the second part, you are probably better off asking on the glib/gtk mailing list.
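That said, GIO can report where each trashed file came from itself: trashed items carry a trash::orig-path attribute (G_FILE_ATTRIBUTE_TRASH_ORIG_PATH). A minimal sketch, assuming a reasonably recent GLib:

#include <gio/gio.h>

static void list_trash_origins(void)
{
    GFile *trash = g_file_new_for_uri("trash:///");
    GError *error = NULL;
    GFileEnumerator *en = g_file_enumerate_children(trash,
        G_FILE_ATTRIBUTE_STANDARD_NAME "," G_FILE_ATTRIBUTE_TRASH_ORIG_PATH,
        G_FILE_QUERY_INFO_NONE, NULL, &error);
    GFileInfo *info;

    if (en == NULL)
    {
        g_printerr("enumerate failed: %s\n", error->message);
        g_clear_error(&error);
        g_object_unref(trash);
        return;
    }

    while ((info = g_file_enumerator_next_file(en, NULL, NULL)) != NULL)
    {
        const char *orig = g_file_info_get_attribute_byte_string(info,
            G_FILE_ATTRIBUTE_TRASH_ORIG_PATH);
        g_print("%s -> %s\n", g_file_info_get_name(info),
                orig ? orig : "(unknown)");
        g_object_unref(info);
    }

    g_object_unref(en);
    g_object_unref(trash);
}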
I have millions of audio files whose names are generated from GUIDs (http://en.wikipedia.org/wiki/Globally_Unique_Identifier). How can I store these files in the file system so that I can efficiently add more files and search for a particular file efficiently? It should also be scalable in the future.
Files are named by GUID (unique file names).
Eg:
[1] 63f4c070-0ab2-102d-adcb-0015f22e2e5c
[2] ba7cd610-f268-102c-b5ac-0013d4a7a2d6
[3] d03cf036-0ab2-102d-adcb-0015f22e2e5c
[4] d3655a36-0ab3-102d-adcb-0015f22e2e5c
Please give your views.
PS: I have already gone through "Storing a large number of images". I need the particular data structure/algorithm/logic so that it can also scale in the future.
EDIT1: Files number around 1-2 million, and the file system is ext3 (CentOS).
Thanks,
Naveen
That's very easy: build a folder tree based on parts of the GUID values.
For example, make 256 folders, each named after the first byte, and store in each only the files whose GUID starts with that byte. If that's still too many files in one folder, do the same in each folder for the second byte of the GUID. Add more levels if needed. Searching for a file will be very fast.
By selecting the number of bytes you use for each level you can effectively choose the tree structure for your scenario.
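As a sketch, mapping a GUID name to its bucket is a one-liner. Here one level keyed on the first byte (two hex characters), matching the 256-folder layout above; /data/audio would be a placeholder base directory:

#include <stdio.h>

/* Build base/<first-byte>/<guid>, e.g.
 * /data/audio/63/63f4c070-0ab2-102d-adcb-0015f22e2e5c
 * Add a second level with the next two characters if needed. */
void guidToPath(const char *base, const char *guid,
                char *out, size_t outLen)
{
    snprintf(out, outLen, "%s/%.2s/%s", base, guid, guid);
}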
I would try to keep the number of files in each directory to some manageable number. The easiest way to do this is to name the subdirectory after the first 2-3 characters of the GUID.
Construct an n-level-deep folder hierarchy to store your files, where the names of the nested folders are the first n characters of the corresponding file name. For example, to store a file "63f4c070-0ab2-102d-adcb-0015f22e2e5c" in a four-level-deep hierarchy, construct the path 6/3/f/4 and place the file there. The depth of the hierarchy depends on the maximum number of files your system can have; for a few million files in my project, a 4-level-deep hierarchy works well.
I did the same thing in a project with nearly 1 million files, where my requirement was also to process the files by traversing this huge list. With a 4-level-deep folder hierarchy, the processing time dropped from nearly 10 minutes to a few seconds.
One addition to this optimization: if you want to process all the files in these deep hierarchies, then instead of calling a function to list the first 4 levels, just precompute all the possible 4-level-deep folder paths. If a GUID can contain 16 possible characters, there are 16 folders at each of the first four levels, so precomputing the 16*16*16*16 paths takes just a few milliseconds. This saves a lot of time when the files are stored at a shared location and listing a single directory takes nearly a second.
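A sketch of that precomputation for four one-hex-character levels (65536 paths, generated in a few milliseconds):

#include <stdio.h>

static const char HEX[] = "0123456789abcdef";

void precomputePaths(void)
{
    char path[16];
    int a, b, c, d;
    for (a = 0; a < 16; a++)
        for (b = 0; b < 16; b++)
            for (c = 0; c < 16; c++)
                for (d = 0; d < 16; d++)
                {
                    snprintf(path, sizeof path, "%c/%c/%c/%c",
                             HEX[a], HEX[b], HEX[c], HEX[d]);
                    /* ... enumerate the files under path ... */
                }
}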
Sorting the audio files into separate subdirectories may actually be slower if dir_index is used on the ext3 volume. (dir_index: "Use hashed b-trees to speed up lookups in large directories.")
This command will set the dir_index feature: tune2fs -O dir_index /dev/sda1