Deleting specific files from a directory

I am trying to delete all files from a directory apart from two (which will be erased, then re-written). One of these files, not to be deleted, contains the names of all files in the folder/directory (the other contains the number of files in the directory).
I believe there are (possibly?) two solutions:
Read the names of each file from the un-deleted file and delete them individually until only the final two files remain,
or...
Because all the other files end in .txt, I could use some sort of filter which would only delete files with that ending.
Which of these two would be more efficient, and how could it be done?
Any help would be appreciated.

You are going to end up deleting files one by one, regardless of which method you use. Any optimizations you make are going to be minuscule. Without actually timing your algorithms, I'd say they'd both take about the same amount of time (and this would vary from one computer to the next, based on CPU speed, HDD type, etc.). So, instead of debating that, I'll provide code for both of the ways you've mentioned:
Method 1:
import os

def deleteAll(infilepath):
    with open(infilepath) as infile:
        for line in infile:
            # strip the trailing newline before handing the name to os.remove
            os.remove(line.strip())
Method 2:
import os

def deleteAll():
    blacklist = set(['names/of/files/to/be/deleted', 'number/of/files'])
    for fname in (f for f in os.listdir('.') if f not in blacklist):
        os.remove(fname)
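If you would rather go with the .txt filter from your second option, a minimal sketch could look like the following (the names of the two protected files, file_list.txt and file_count.txt, are hypothetical placeholders; substitute your real names):
import glob
import os

def deleteTxtFiles(dirpath, keep=('file_list.txt', 'file_count.txt')):
    # keep: hypothetical names for the list file and the count file
    for path in glob.glob(os.path.join(dirpath, '*.txt')):
        if os.path.basename(path) not in keep:
            os.remove(path)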

Related

Can a tar file be efficiently randomly edited?

Is there a way to modify an individual file within a tar file without having to rewrite the entire archive? I recognize this would probably result in fragmentation.
Is there any other archive format that does this?
First off, you should only ask exactly one question on StackOverflow. If you truly want to do frequent writes to the "archive", then you might be better off simply creating a large file, formatting it with some file system of your choice and then mounting it:
truncate -s $(( 512*1024*1024 )) 512MiB-filesystem.ext4
mkfs.ext4 512MiB-filesystem.ext4
sudo mount -o loop 512MiB-filesystem.ext4 mountpoint
sudo chmod a+w mountpoint/
echo foo > mountpoint/bar
sudo umount mountpoint
As for your question about TAR: it is possible, and a fun exercise, but there may be no tools yet that actually implement it. First off, TAR is a very simple file format: it consists of 512 B blocks that can contain either metadata or actual file contents, simply copied from the original file without any compression.
A TAR can actually contain multiple files for the same path, and by convention the last duplicate path wins. This means that, in order to "modify" a file, you can simply append a newer version of that file to the TAR:
tar --append --file archive.tar modified-file
This should be fast, but it would grow the archive with every file change, so it should be used sparingly.
If you want even more in-place modifications, they should be possible but there is no tooling yet for that as far as I know. I would like to implement that into ratarmount but I'm not sure when I'll get to it.
File system operations and how to implement them:
Modifying a file:
File size is constant: As long as the file size does not change, we could simply change the file inside the TAR if we know the offset for the file contents in the TAR archive, which ratarmount does have stored in an SQLite database.
File size is quasi constant: Actually, the file size might even change by up to 511 B and it still would be possible to simply update the file inside the TAR, as long as it doesn't change the number of required TAR blocks (512 B each). This would also require updating the file size in the TAR metadata block and updating the checksum of that metadata block, though (see the sketch after this list).
Required TAR blocks shrink: If the required TAR blocks become fewer than before, then it still would be rather easy to modify the TAR on the fly as outlined above. But we would have to somehow format the unused blocks. We could simply fill them with zeros, but in this case, we would have to call tar with the --ignore-zeros option to still get a valid tar. Without that, all files after that position would suddenly appear lost, so it might be unsuited in some circumstances. But we could also simply fill the empty blocks with dummy data, e.g., a directory metadata entry for the / (root) folder. As long as it contains the same metadata as the actual root folder, it basically is a no-op. It might even be possible to create dummy metadata blocks for invalid paths like . or .. to effectively create blocks that are ignored even without the --ignore-zeros option.
Required TAR blocks grow: This is the most difficult case. If there is simply no space to put the added data into the file, then we might have to delete it and move it to the end of the archive (if it isn't already at the end). Removing the file without rewriting everything else in the TAR would be implemented as mentioned above, by either filling the parts with zeros or dummy metadata blocks. At this point, however, we could implement defragmentation techniques, e.g., by keeping track of all empty / dummy blocks in the TAR and looking for fitting places. Or, if we want to append 1 KiB to a 1 GiB file, it might avoid fragmentation better if we move a small file right after the 1 GiB file to the end of the TAR to make space for the 1 KiB to append.
Modifying file metadata:
In general: Metadata can be changed by simply changing it in the metadata block and updating the block checksum. This does not require rewriting anything else in the archive.
Removals: This is basically the same as file modifications for shrinking block counts. Simply overwrite the space for this file entry with zeros or dummy blocks and maybe keep track of it for writing files into this space at a later time.
Renames: Renames can actually be more tricky than one might think. In most cases, the metadata can simply be updated in place; however, there are two problematic cases:
The file name becomes too long: If the file name becomes too long, then the GNU long name extension will allocate further blocks right after the TAR metadata block, which will contain the very long filename. This, however, would require one more block, which might require moving around blocks inside the TAR as outlined for file modifications.
There are file name collisions: If the target path already exists, then simply updating the metadata might not suffice, depending on the order in which the files appear in the TAR: the last one with the same path wins. This might be easy to circumvent by simply forbidding moves to an existing path, or by calling remove on the existing file beforehand.
Create: This is simple: just append the file to the end of the archive. If implemented manually, we might have to find the actual end of the data, because TAR archives have at least 2 (often more) zero-byte blocks after the last valid data, and simply appending new files after those zero blocks would require the --ignore-zeros option.
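As far as I know there is no ready-made tool for this, but here is a minimal Python sketch of the "file size is quasi constant" case described above. It assumes an uncompressed, plain ustar/GNU archive (no pax extended headers or sparse members), relies on the header/data offsets that tarfile exposes, and is only meant to illustrate the block padding and checksum arithmetic; names like overwrite_member are made up for the example:
import tarfile

BLOCK = 512

def overwrite_member(archive_path, member_name, new_data):
    # In-place update: only valid for an uncompressed .tar and only when the
    # new contents need the same number of 512 B blocks as the old contents.
    with tarfile.open(archive_path, 'r:') as tar:
        info = tar.getmember(member_name)
        header_offset = info.offset       # start of the 512 B metadata block
        data_offset = info.offset_data    # start of the file contents
        old_blocks = -(-info.size // BLOCK)

    new_blocks = -(-len(new_data) // BLOCK)
    if new_blocks != old_blocks:
        raise ValueError("new data would change the number of TAR blocks")

    with open(archive_path, 'r+b') as f:
        # Rewrite the contents, padding the last block with zero bytes.
        f.seek(data_offset)
        f.write(new_data.ljust(new_blocks * BLOCK, b'\0'))

        # Patch the size field (12 B of octal at offset 124) in the header ...
        f.seek(header_offset)
        header = bytearray(f.read(BLOCK))
        header[124:136] = b'%011o\0' % len(new_data)

        # ... and recompute the checksum: the unsigned sum of all header bytes
        # with the 8 B checksum field (offset 148) treated as spaces.
        header[148:156] = b' ' * 8
        header[148:156] = b'%06o\0 ' % sum(header)

        f.seek(header_offset)
        f.write(header)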

Assign unique numbers to File during runtime

I want to assign unique file numbers to files during run time.
Creating a hash of the file name is not an option for me, as I do not want any collisions.
One good option is to create running numbers for all files, but I do not have access to the source files to walk the directory in the place where I am running my binary.
So I need some option that can extract the file name from the binary (say, using the symbol table, similar to GDB). I am not sure how to do that. Any help is appreciated.
You could try to use the inode number (st_ino) from the file itself -- you get that from using fstat (http://linux.die.net/man/2/fstat).
The inode number is how the file system keeps track of files, and it is unique for a given file system. Hence, as long as the files are not located on different file systems (different mount points), the inode number is unique.
This includes the case where there are multiple links to the same file, if that worries you as well.
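The question is about C, where fstat(2) fills a struct stat whose st_ino field carries this number. As a quick illustration of the same idea on a POSIX system (the file names below are hypothetical placeholders), the field is also visible from Python's os.stat:
import os

def unique_file_id(path):
    st = os.stat(path)
    # st_ino is unique per file system; pair it with st_dev if files
    # may live on different mount points.
    return st.st_ino

print(unique_file_id("some_file"))  # "some_file" is a hypothetical path
# A hard link to the same file reports the same inode number:
# assert os.stat("a_hard_link").st_ino == os.stat("some_file").st_ino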

How to extract data from multiple files with Python?

I am new to Python, which is also my first programming language. I have a set of txt files (academic papers), and I need to extract the paper ID (e.g. ID: a1111111) and abstract (e.g. ABSTRACT: .....). I have no idea how to extract this data from multiple files in multiple folders. Thanks a lot!
So your question has two parts: reading files and accessing folders.
Reading files
The methods/objects used in Python for reading files are covered in chapter 7 of Python's documentation:
http://docs.python.org/2/tutorial/inputoutput.html
The basic gist is that you use the open function to access files that are in the same directory:
f = open('stuff.txt', 'r')
Where stuff.txt is the name of a file in the same directory as your Python file.
Calling print f.read() will display the text of the file (as a string). Feel free to assign f.read() to a variable to capture the data.
>>> x = f.read()
>>> print x
This is the entire file.\n
Best to read the documentation for all these methods, because there are subtleties. For example, calling f.read() once will return the entire file contents to you, but calling f.read() again will return an empty string, as the end of the file has been reached.
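Continuing the hypothetical session above, a second call returns an empty string:
>>> f.read()
''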
Accessing Folders
Can you explain to me how exactly you'd like to access folders? In this case, it would be much easier to just put all your files in the same directory as where you are running your python file.
However, the basic way to move around in Python is os.chdir(path), which is basically cd'ing around. You must import os before you use it.
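Putting the two parts together, here is a minimal sketch (it assumes each paper is a .txt file and that the ID and abstract sit on single lines starting with "ID:" and "ABSTRACT:", as in your examples; adjust the parsing to your real format):
import os

def extract_papers(rootdir):
    papers = []
    # os.walk visits rootdir and every folder below it
    for dirpath, dirnames, filenames in os.walk(rootdir):
        for fname in filenames:
            if not fname.endswith('.txt'):
                continue
            paper_id, abstract = None, None
            with open(os.path.join(dirpath, fname)) as f:
                for line in f:
                    if line.startswith('ID:'):
                        paper_id = line[len('ID:'):].strip()
                    elif line.startswith('ABSTRACT:'):
                        abstract = line[len('ABSTRACT:'):].strip()
            papers.append((fname, paper_id, abstract))
    return papers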
Leave a comment if you'd like some more information

Concatenate a large number of HDF5 files

I have about 500 HDF5 files each of about 1.5 GB.
Each of the files has the same exact structure, which is 7 compound (int,double,double) datasets and variable number of samples.
Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.
Currently I am running an h5py script which:
creates an HDF5 file with the right datasets of unlimited max size
opens all the files in sequence
checks the number of samples (as it is variable)
resizes the global file
appends the data
This obviously takes many hours; would you have a suggestion for improving it?
I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need an already written tool.
I found that most of the time was spent in resizing the file, as I was resizing at each step, so I am now first going through all my files to get their lengths (they are variable).
Then I create the global h5file, setting the total length to the sum over all the files.
Only after this phase do I fill the h5file with the data from all the small files.
Now it takes about 10 seconds per file, so the whole job should take less than 2 hours, whereas before it was taking much more.
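A minimal h5py sketch of this two-pass approach (for a single one-dimensional dataset named "data"; the function name and dataset name are illustrative, and the real script would loop over all 7 compound datasets) might look like:
import h5py

def concatenate(input_paths, output_path, dataset='data'):
    # First pass: read each file's length so the output can be sized once.
    lengths = []
    for path in input_paths:
        with h5py.File(path, 'r') as f:
            lengths.append(f[dataset].shape[0])
    total = sum(lengths)

    # Take the (compound) dtype from the first file.
    with h5py.File(input_paths[0], 'r') as f:
        dtype = f[dataset].dtype

    # Second pass: create the output dataset at its final size and fill it.
    with h5py.File(output_path, 'w') as out:
        dset = out.create_dataset(dataset, shape=(total,), dtype=dtype)
        offset = 0
        for path, length in zip(input_paths, lengths):
            with h5py.File(path, 'r') as f:
                dset[offset:offset + length] = f[dataset][...]
            offset += length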
I get that answering this earns me a necro badge - but things have improved for me in this area recently.
In Julia this takes a few seconds.
Create a txt file that lists all the hdf5 file paths (you can use bash to do this in one go if there are lots)
In a loop, read each line of the txt file and use label$i = h5read(original_filepath$i, "/label")
Concatenate all the labels: label = [label label$i]
Then just write: h5write(data_file_path, "/label", label)
Same can be done if you have groups or more complicated hdf5 files.
Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:
Make a text file listing the files to concatenate, in bash:
ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt
Write a julia script to concatenate multiple files into one file:
# concatenate_HDF5.jl
using HDF5

inputfilepath = ARGS[1]
outputfilepath = ARGS[2]

f = open(inputfilepath)
firstit = true
data = []
for line in eachline(f)
    r = strip(line, ['\n'])
    print(r, "\n")
    datai = h5read(r, "/data")
    if firstit
        data = datai
        firstit = false
    else
        data = cat(4, data, datai)  # in this case, concatenating on the 4th dimension
    end
end
h5write(outputfilepath, "/data", data)
Then execute the script file above using:
julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5

C - Reading multiple files

I just had a general question about how to approach a certain problem I'm facing. I'm fairly new to C, so bear with me here.

Say I have a folder with 1000+ text files. The files are not named in any kind of numbered order, but they are alphabetical. For my problem, I have files of stock data, each file named after the company's respective ticker. I want to write a program that will open each file, read the data, find the historical low, compare it to the current price, calculate the percent change, and then print it. Searching and calculating are not a problem; the problem is getting the program to go through and open each file.

The only way I can see to attack this is to create a text file containing all of the ticker symbols, have the program read that into an array, and then run a loop that first opens the first filename in the array, performs the calculations, prints the output, closes the file, then loops back around, moving to the second element (the next ticker symbol) in the array. This would be fairly simple to set up (I think), but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this? I'm not really asking for code (unless there is some amazing function in C that will do this for me ;) ), just some advice from more experienced C programmers.
Thanks :)
Edit: This is on Linux, sorry I forgot to mention that!
Under Linux/Unix (BSD, OS X, POSIX, etc.) you can use opendir / readdir to go through the directory structure. There is no need to generate static files that need to be updated when the file system already has the information you want. If you only want a subset of stocks at a given time, then using glob would be quicker; there is also scandir.
I don't know what Win32 (Windows / Platform SDK) functions are called, if you are developing using Visual C++ as your C compiler. Searching MSDN Library should help you.
Assuming you're running on linux...
ls /path/to/text/files > names.txt
is exactly what you want.
opendir() on Linux.
http://linux.die.net/man/3/opendir
Example:
http://snippets.dzone.com/posts/show/5734
In pseudocode it would look like this; I cannot give real code, as I'm not 100% sure this is the correct approach...
for each directory entry
    scan the filename
    extract the ticker name from the filename
    open the file
    read the data
    create a record consisting of the filename, data, ...
    close the file
    add the record to a list/array
sort the list/array into alphabetical order based on the ticker name in the filename...
You could vary it slightly if you wish: scan the filenames in the directory entries and sort them first by building a record with the filenames, then go back to the start of the list/array and open each one individually, reading the data and putting it into the record, then...
Hope this helps,
best regards,
Tom.
There are no functions in standard C that have any notion of a "directory". You will need to use some kind of platform-specific function to do this. For some examples, take a look at this post from Cprogramming.com.
Personally, I prefer using the opendir()/readdir() approach as shown in the second example. It works natively under Linux and also on Windows if you are using Cygwin.
Approach 1) I would just have a specific directory in which I have ONLY these files containing the ticker data and nothing else. I would then use the C readdir API to list all files in the directory and iterate over each one performing the data processing that you require. Which ticker the file applies to is determined only by the filename.
Pros: Easy to code
Cons: It really depends where the files are stored and where they come from.
Approach 2) Change the file format so the ticker files start with a magic code identifying that this is a ticker file, plus a string containing the name. As before, use readdir to iterate through all files in the folder and open each file, check that the magic number is set, read the ticker name from the file, and process the data as before.
Pros: More flexible than before. Filename needn't reflect name of ticker
Cons: Harder to code, file format may be fixed.
but I'd really like to avoid typing out over a thousand file names into a text file. Is there a better way to approach this?
I have solved the exact same problem a while back, albeit for personal uses :)
What I did was to use the OS shell commands to generate a list of those files and redirected the output to a text file and had my program run through them.
On UNIX, there's the handy glob function:
#include <glob.h>
#include <stdio.h>
#include <string.h>

glob_t results;
size_t i;
memset(&results, 0, sizeof(results));
glob("*.txt", 0, NULL, &results);
for (i = 0; i < results.gl_pathc; i++)
    printf("%s\n", results.gl_pathv[i]);
globfree(&results);
On Linux or a related system, you could use the fts library; it's designed for traversing file hierarchies (man fts). Or even something as simple as readdir.
If on Windows, you can use the Directory Management APIs. More specifically, the FindFirstFile function, used with wildcards, in conjunction with FindNextFile.