Minimizing disk accesses when getting attributes of files in a directory

As the title suggests, I'm looking for a way to get attributes of a large number of files in a directory, but without adding the cost of an additional disk access for each file.
For example, if I read the Name property of the FileInfo objects in a collection, there is no additional disk access. However, if I read LastWriteTimeUtc, an additional disk access is made.
My code:
    DirectoryInfo di = new DirectoryInfo(myDir);
    FileInfo[] allFiles = di.GetFiles("*.*", SearchOption.TopDirectoryOnly);
    foreach (FileInfo fInfo in allFiles)
    {
        string name = fInfo.Name;                   // no additional disk access made
        DateTime lastMod = fInfo.LastWriteTimeUtc;  // further disk access made!
    }
Does anyone know of a way I can get this information in one round trip? I would have hoped that DirectoryInfo.GetFiles() does this but no luck.
Thanks in advance.

If you really care about this, you should probably write this in C using FindFirstFile/GetFileTime, etc.

So, this happens by design. LastWriteTimeUtc is lazy loaded, so there is nothing to do other than write my own component.
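The underlying idea, asking for the attributes together with the directory listing instead of one file at a time, is not specific to .NET. Here is a minimal sketch of it in Java, used purely for illustration; it is not the component mentioned above, and the names `DirAttributes` and `lastModifiedByName` are made up. `Files.walkFileTree` passes each entry's `BasicFileAttributes` to the visitor, so the code never issues a per-file query itself. Whether the runtime can serve those attributes without an extra disk access depends on the platform and file system.

```java
import java.io.IOException;
import java.nio.file.FileVisitOption;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;
import java.util.EnumSet;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class DirAttributes {

    // Maps each regular file directly inside dir to its last-modified time.
    static Map<String, FileTime> lastModifiedByName(Path dir) throws IOException {
        Map<String, FileTime> result = new TreeMap<>();
        // maxDepth = 1 visits only the direct children of dir (like TopDirectoryOnly).
        Files.walkFileTree(dir, EnumSet.noneOf(FileVisitOption.class), 1,
                new SimpleFileVisitor<Path>() {
                    @Override
                    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                        // attrs is handed over with the entry; no separate lookup is made here.
                        // Subdirectories at the depth limit also arrive here, so filter them out.
                        if (attrs.isRegularFile()) {
                            result.put(file.getFileName().toString(), attrs.lastModifiedTime());
                        }
                        return FileVisitResult.CONTINUE;
                    }
                });
        return result;
    }
}
```

Calling `DirAttributes.lastModifiedByName(Paths.get(myDir))` returns one name-to-timestamp entry per file, collected during a single walk of the directory.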

Related

Is there a file object to get path or name of a file in Nim?

Let's say, I would like to use a single object to represent a file and I'd like to get the filename (or path) of it so that I can use the name to remove the file or for other standard library procedures. I'd like to have a single abstraction which can be used with all available file-related standard library procedures.
I've found FileInfo but in my research I didn't find a get-file-name-procedure. File and FileHandle are pretty useless from a software engineering point of view because they provide no convenient abstraction and don't have members.
Is there a file abstraction (object) in Nim, which provides fast access to FileInfo as well as the file name so that a file doesn't need more than one procedure parameter?

There is no such abstraction in Nim, or any other language, simply because you are asking for something that is impossible with most filesystems. Consider the FileInfo structure and its linkCount field, which tells you the number of hard links the file object has. There is no way to get a filename from one or all of those links short of building and maintaining your own database of the whole filesystem.
While most filesystems allow access to files through paths, there is rarely a filesystem that gives paths from files because they actually don't need one! An example would be a Unix filesystem where one process opens a file through a path, then removes the path without closing the file. While the process holding the file open is alive, that file won't actually disappear, so you would have the case of a file without path.
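The Unix case described above can be reproduced in a few lines. This is a minimal sketch in Java, used only for illustration; the class and method names are hypothetical. It assumes a POSIX-style filesystem where a path can be removed while the file is still open. On other platforms the delete may fail or behave differently.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, for illustration only.
final class PathlessFile {

    // Writes content to a temp file, opens it, removes its only path,
    // then reads the content back through the still-open channel.
    static String readAfterDelete(String content) throws IOException {
        Path path = Files.createTempFile("pathless", ".txt");
        Files.write(path, content.getBytes(StandardCharsets.UTF_8));
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            Files.delete(path);  // the name is gone from the directory
            if (Files.exists(path)) {
                throw new IllegalStateException("path still exists");
            }
            ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
            while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
                // keep reading until the buffer is full or end of file is reached
            }
            buffer.flip();
            // The data is still reachable through the handle, although no path names it.
            return StandardCharsets.UTF_8.decode(buffer).toString();
        }
    }
}
```

Between the delete and the close, the program holds a perfectly usable file for which no filename could be reported.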
The issue of handling paths, especially considering cross platform applications, involves its own can of worms: if you store paths as strings, what is the path separator and how do you escape it? Does your filesystem support volumes that require special case handling? What string encoding do paths use to satisfy all users? Just the encoding issue requires tons of tables and conversions which would bog down every other API wishing to get just a file like handle to read or write bytes.
A FileInfo is just a snapshot of the state of the file at a given time, a file handle is the live file object you can operate on, and a path (or many paths if your filesystem supports hard links) is just a convenience name for end users.
These are all very different things, which is why they are separate. Your app may need a more complex abstraction than other programmers are willing to tolerate, so create your own abstraction which holds together all the individual pieces you need. For instance, consider the following structure:
    import os

    type
      AppFileInfo = object
        fileInfo: FileInfo
        file: File
        oneOfMany: string

    proc changeFileExt(appFileInfo: AppFileInfo, ext: string): string =
      changeFileExt(appFileInfo.oneOfMany, ext)

    proc readAll(appFileInfo: AppFileInfo): string =
      readAll(appFileInfo.file)
Those procs simply mimic the respective standard library APIs but use your more complex structure as inputs and transform it as needed. If you are worried about this abstraction not being optimised due to the extra proc call you could use a template instead.
If you follow this route, however, at some point you will have to ask yourself what is the lifetime of an AppFileInfo object: do you create it with a path? Do you create it from a file handle? Is it safe to access the file field in parts of your code or has it not been initialised properly? Do you return errors or throw exceptions when something goes wrong? Maybe when you start to ask yourself these questions you'll realise they are very app specific and are very difficult to generalise for every use case. Therefore such a complex object doesn't make much sense in the language standard library.

I created the missing solution myself. I basically extended the File type using a global encapsulated table. Extending Types like this could be a useful idiom in Nim because of UFCS.
    import tables

    type FileObject = object
      file : File
      mode : FileMode
      path : string

    proc initFileObject(name: string; mode: FileMode; bufsize = -1) : FileObject =
      result.file = open(name, mode, bufsize)
      result.path = name
      result.mode = mode

    var g_fileObjects = initTable[File, FileObject]()

    template get(this: File) : var FileObject = g_fileObjects[this]

    proc openFile*(filepath: string; mode: FileMode = fmRead; bufsize = -1) : File =
      var fileObject = initFileObject(filepath, mode, bufsize)
      result = fileObject.file
      g_fileObjects[result] = fileObject

    proc filePath*(this: File) : string {.raises: KeyError.} =
      return this.get.path

    proc fileMode*(this: File) : FileMode {.raises: KeyError.} =
      return this.get.mode

    from os import tryRemoveFile

    proc closeOrDeleteFile[delete = false](this: File) : bool =
      result = g_fileObjects.hasKey(this)
      if result:
        when delete:
          result = this.filepath.tryRemoveFile()
        g_fileObjects.del(this)
        this.close()

    proc closeFile*(this: File) : bool = this.closeOrDeleteFile[:false]
    proc deleteFile*(this: File) : bool = this.closeOrDeleteFile[:true]
Now you can write
    var f = openFile("myFile.txt", fmWrite)
    var g = openFile("hello.txt", fmWrite)
    echo f.filePath
    echo f.deleteFile()
    g.writeLine(g.filePath)
    echo g.closeFile()

find file content size docx,pptx etc

I want to find out the size of the content inside a docx, pptx, etc. Is there a package that can be used for this? I googled and found that POI is widely used to read and write MS file types, but I am not able to find the right API to get the size of the file content. I want to know the actual content size, not the compressed file size shown in the file properties.
I finally found a way, but it throws an OOM exception if the file is too large.
    OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
    XWPFDocument doc = new XWPFDocument(opcPackage);
    XWPFWordExtractor we = new XWPFWordExtractor(doc);
    String paragraphs = we.getText();
    System.out.println("Total Paragraphs: " + paragraphs.length() / 1024);
Please let me know if there is a better way to do this.

OK, this was asked a long time ago and there is still no response to this question. I have not used OPCPackage, so my answer is not based on that.
DOCX (and, for that matter, PPTX as well as XLSX) files are all zip files with a particular structure.
We could hence use the java.util.zip package, enumerate the entries of the zip file and sum the sizes of the entries under the main content folder: xl for xlsx files and word for docx files. A more generic method would probably be to ignore the following top-level zip entries, i.e. zip entries whose names start with:
docProps
_rels
[Content_Types].xml
The combined size of the remaining zip entries (do not ignore any folder within them) tells you the size of the content.
This method is also very efficient: you only read the entry metadata of the zip file and not the compressed data itself, so obtaining the size information takes negligible time and memory. As a quick test, I was able to get the size of a 4MB docx file in a fraction of a second.
A "good-enough", though not thoroughly tested, piece of code using this approach is pasted below. Please feel free to use it as a starting point and fix bugs if you find any. It would be great if you could post back your modifications or corrections so that others can benefit.
    import java.io.*;
    import java.util.*;
    import java.util.zip.*;

    private static void printUnzippedContentLength() throws IOException
    {
        // try-with-resources closes the archive even if an exception is thrown
        try (ZipFile zf = new ZipFile(new File("/home/chaitra/verybigfile.docx")))
        {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            long sumBytes = 0L;
            while (entries.hasMoreElements())
            {
                ZipEntry ze = entries.nextElement();
                String name = ze.getName();
                // skip package metadata; everything else counts as content
                if (name.startsWith("docProps") || name.startsWith("_rels") || name.startsWith("[Content_Types].xml"))
                {
                    continue;
                }
                long size = ze.getSize();  // -1 if the size is not known
                if (size > 0)
                {
                    sumBytes += size;
                }
            }
            System.out.println("Uncompressed content has size " + (sumBytes / 1024) + " KB");
        }
    }
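To see where the bytes go, the same enumeration can be grouped by top-level name instead of being summed into a single number. This is a minimal sketch along those lines, not part of the original answer: `ZipContentSizes` and `sizeByTopLevelEntry` are names invented here, and counting entries of unknown size as zero is an assumption.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
final class ZipContentSizes {

    // Sums uncompressed entry sizes per top-level name, e.g. "word" or "docProps".
    static Map<String, Long> sizeByTopLevelEntry(Path archive) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;  // directory entries carry no data
                }
                String name = entry.getName();
                int slash = name.indexOf('/');
                // Entries in the archive root are keyed by their full name.
                String top = slash < 0 ? name : name.substring(0, slash);
                long size = Math.max(entry.getSize(), 0L);  // getSize() is -1 when unknown
                totals.merge(top, size, Long::sum);
            }
        }
        return totals;
    }
}
```

For a docx file the `word` total is the figure of interest, while `docProps`, `_rels` and `[Content_Types].xml` appear as separate keys that can be ignored. Like the code above, this only reads entry metadata, so memory use does not grow with the size of the document.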

Importing Excel file with dynamic name into SQL table via SSIS?

I've done a few searches here, and while some issues are similar, they don't seem to be exactly what I need.
What I'm trying to do is import an Excel file into a SQL table via SSIS, but the problem is that I will never know the exact filename. We get files at no steady interval, and the file usually has a date/month in the name. For instance, our current file is "Census Data - May 2013.xls". We will only ever load ONE file at a time, so I don't need to loop through a directory for multiple Excel files.
My concept is that I can take this file, copy it to a "Loading" directory, and load it from there. At the start of the package, I will first clear out the loading directory, then scan the original directory for an Excel file, copy it to the loading directory and then load it into SQL. I suppose I may have to store the file names somewhere so I don't copy the same file into the loading directory in subsequent months, but I'm not really sure of the best way to handle that.
I've pretty much got everything down except the part that scans the directory for the Excel file and copies it to the loading directory. I've taken the majority of my info from this page, which (again) is close to what I want to do but not quite exactly the solution I need.
Can anyone get me over the finish line? I can't seem to get the Excel Connection Manager right (this is my first time using variables), and I can't figure out how to get the file into the Loading directory.

Problem statement
How do I dynamically identify a file name?
You will require some mechanism to inspect the contents of a folder and see what exists. Specifically, you are looking for an Excel file in your "Loading" directory. You know the file extension and that is it.
Resolution A
Use a ForEach File Enumerator.
Configure the Enumerator with an Expression on FileSpec of *.xls or *.xlsx depending on which flavor of Excel you're dealing with.
Add another Expression on Directory to be your Loading directory.
I typically create SSIS Variables named FolderInput and FileMask and assign those in the Enumerator.
Now when you run your package, the Enumerator is going to look in Directory and find all the files that match the FileSpec.
Something needs to be done with what is found. You need to use that file name that the Enumerator returns. That's done through the Variable Mappings tab. I created a third Variable called CurrentFileName and assign it the results of the enumerator.
If you put a Script Task inside the ForEach Enumerator, you should be able to see that the value in the "Locals" window for @[User::CurrentFileName] has updated from the Design time value of whatever to the "real" file name.
Resolution B
Use a Script Task.
You will still need to create a Variable to hold the current file name and it probably won't hurt to also have the FolderInput and FileMask Variables available. Set the former as ReadWrite and the latter as ReadOnly variables.
Choose the .NET language of your choice; I'm using C#. The method System.IO.Directory.EnumerateFiles does the work of finding the file:
    using System;
    using System.Data;
    using System.IO;
    using Microsoft.SqlServer.Dts.Runtime;
    using System.Windows.Forms;

    namespace ST_fe2ea536a97842b1a760b271f190721e
    {
        [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute]
        public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
        {
            public void Main()
            {
                string folderInput = Dts.Variables["User::FolderInput"].Value.ToString();
                string fileMask = Dts.Variables["User::FileMask"].Value.ToString();

                try
                {
                    var files = Directory.EnumerateFiles(folderInput, fileMask, SearchOption.AllDirectories);

                    // Only one file is expected, so take the first match and stop
                    foreach (string currentFile in files)
                    {
                        Dts.Variables["User::CurrentFileName"].Value = currentFile;
                        break;
                    }
                }
                catch (Exception e)
                {
                    Dts.Events.FireError(0, "Script overkill", e.ToString(), string.Empty, 0);
                    // Report the failure instead of falling through to Success
                    Dts.TaskResult = (int)ScriptResults.Failure;
                    return;
                }

                Dts.TaskResult = (int)ScriptResults.Success;
            }

            enum ScriptResults
            {
                Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
                Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
            };
        }
    }
Decision tree
Given the two resolutions to the above problem, how do you choose? Normally, people say "It depends", but the only time it would depend is if the process should stop or error out when more than one file exists in the Loading folder. That's a case where the ForEach enumerator would be more cumbersome than a Script Task. Otherwise, as I stated in my original response, a Script Task adds cost to your project for Development, Testing and Maintenance for no appreciable gain.
Bits and bobs
Further addressing nuances in the question: Configuring Excel - you'll need to be more specific in what isn't working. Both Siva's SO answer and the linked blogspot article show how to use the value of the Variable I call CurrentFileName to ensure the Excel File is pointing to the "right" file.
You will need to set the DelayValidation to True for both the Connection Manager and the Data Flow as the design-time value for the Variable will not be valid when the package begins execution. See this answer for a longer explanation but again, Siva called that out in their SO answer.

Retrieving Specific Active Directory Record Attributes using C#

I've been asked to set up a process which monitors the active directory, specifically certain accounts, to check that they are not locked so that should this happen, the support team can get an early warning.
I've found some code to get me started which basically sets up requests and adds them to a notification queue. This event is then assigned to a change event and has an ObjectChangedEventArgs object passed to it.
Currently, it iterates through the attributes and writes them to a text file, like so:
    private static void NotifierObjectChanged(object sender,
        ObjectChangedEventArgs e)
    {
        if (e.ResultEntry.Attributes.AttributeNames == null)
        {
            return;
        }

        // write the data for the user to a text file...
        using (var file = new StreamWriter(@"C:\Temp\UserDataLog.txt", true))
        {
            file.WriteLine("{0} {1}", DateTime.UtcNow.ToShortDateString(), DateTime.UtcNow.ToShortTimeString());
            foreach (string attrib in e.ResultEntry.Attributes.AttributeNames)
            {
                foreach (object item in e.ResultEntry.Attributes[attrib].GetValues(typeof(string)))
                {
                    file.WriteLine("{0}: {1}", attrib, item);
                }
            }
        }
    }
What I'd like to do is check the object and, if a specific field such as name has a specific value, check whether the IsAccountLocked attribute is True; otherwise skip the record and wait until the next notification comes in. I'm struggling to see how to access specific attributes of the ResultEntry without having to iterate through them all.
I hope this makes sense - please ask if I can provide any additional information.
Thanks
Martin

This could get gnarly depending upon your exact business requirements. If you want to talk in more detail ping me offline and I'm happy to help over email/phone/IM.
So the first thing I'd note is that, depending upon what the query looks like before this, this could be quite expensive or error prone (i.e. missing results). This worries me somewhat, as most sample code out there gets this wrong. :) How are you getting things that have changed? While this sounds simple, it is actually a somewhat tricky question in directory land, given the semantics supported by AD and the fact that it is a multi-master system where writes happen all over the place (and replicate in after the fact).
Other variables would be things like how often you're going to run this, how large the data set could be in AD, and so on.
AD has some APIs built to help you here (the big one that comes to mind is called DirSync) but this can be somewhat complicated if you haven't used it before. This is where the "ping me offline" part comes in.
To your exact question, I'm assuming your result is actually a SearchResultEntry (if not I can revise, tell me what you have in hand). If that is the case then you'll find an Attributes field hanging off of that guy, and from there there is AttributeNames and Values. I think you'll see how it works from there if you have Values in hand, for example:
    foreach (var attr in sre.Attributes.Values)
    {
        var da = (DirectoryAttribute)attr;
        Console.WriteLine(da.Name);
        foreach (var val in da.GetValues(typeof(byte[])))
        {
            // Handle a byte[] val ...
        }
    }
As I said, if you have something other than a SearchResultEntry in hand, let us know and I can revise the code sample.

How do I get a temporary File object (of correct content-type, without writing to disk) directly from a ZipEntry (RubyZip, Paperclip, Rails 3)?

I'm currently trying to attach image files to a model directly from a zip file (i.e. without first saving them on a disk). It seems like there should be a clearer way of converting a ZipEntry to a Tempfile or File that can be stored in memory to be passed to another method or object that knows what to do with it.
Here's my code:
    def extract (file = nil)
      Zip::ZipFile.open(file) { |zip_file|
        zip_file.each { |image|
          photo = self.photos.build
          # photo.image = image # this doesn't work
          # photo.image = File.open image # also doesn't work
          # photo.image = File.new image.filename
          photo.save
        }
      }
    end
But the problem is that photo.image is an attachment (via paperclip) to the model, and assigning something as an attachment requires that something to be a File object. However, I cannot for the life of me figure out how to convert a ZipEntry to a File. The only way I've seen of opening or creating a File is to use a string to its path - meaning I have to extract the file to a location. Really, that just seems silly. Why can't I just extract the ZipEntry file to the output stream and convert it to a File there?
So the ultimate question: Can I extract a ZipEntry from a Zip file and turn it directly into a File object (or attach it directly as a Paperclip object)? Or am I stuck actually storing it on the hard drive before I can attach it, even though that version will be deleted in the end?
UPDATE
Thanks to blueberry fields, I think I'm a little closer to my solution. Here's the line of code that I added, and it gives me the Tempfile/File that I need:
photo.image = zip_file.get_output_stream image
However, my Photo object won't accept the file that's getting passed, since it's not an image/jpeg. In fact, checking the content_type of the file shows application/x-empty. I think this may be because getting the output stream seems to append a timestamp to the end of the file, so that it ends up looking like imagename.jpg20110203-20203-hukq0n. Edit: Also, the tempfile that it creates doesn't contain any data and is of size 0. So it's looking like this might not be the answer.
So, next question: does anyone know how to get this to give me an image/jpeg file?
UPDATE:
I've been playing around with this some more. It seems output stream is not the way to go, but rather an input stream (which is which has always kind of confused me). Using get_input_stream on the ZipEntry, I get the binary data in the file. I think now I just need to figure out how to get this into a Paperclip attachment (as a File object). I've tried pushing the ZipInputStream directly to the attachment, but of course, that doesn't work. I really find it hard to believe that no one has tried to cast an extracted ZipEntry as a File. Is there some reason that this would be considered bad programming practice? It seems to me like skipping the disk write for a temp file would be perfectly acceptable and supported in something like Zip archive management.
Anyway, the question still stands:
Is there a way of converting an Input Stream to a File object (or Tempfile)? Preferably without having to write to a disk.

Try this
    Zip::ZipFile.open(params[:avatar].path) do |zipfile|
      zipfile.each do |entry|
        filename = entry.name
        basename = File.basename(filename)

        tempfile = Tempfile.new(basename)
        tempfile.binmode
        tempfile.write entry.get_input_stream.read
        tempfile.rewind  # so the next reader starts at the beginning

        user = User.new
        user.avatar = {
          :tempfile => tempfile,
          :filename => filename
        }
        user.save
      end
    end

Check out the get_input_stream and get_output_stream messages on ZipFile.
