SSIS Script Task (VB) issues when reading a large file - sql-server

I'm using the code below inside an SSIS Script Task to modify the contents of a file. The file contains many JSON documents, one after the other, and I'm basically combining them into a single JSON document.
This code works fine up to around a 1 GB file (reading the 1 GB file uses almost 7 GB of memory in SSIS); beyond that it crashes (I assume due to memory). I need to handle files of up to 5 GB.
Any help, please?
Public Sub Main()
    Dim filePath As String = Dts.Variables("User::filepath").Value.ToString()
    Dim content As String = File.ReadAllText(filePath).Replace("}", "},")
    content = content.Substring(0, Len(content) - 1)
    content = "{ ""query"" : [" + content + "] }"
    File.WriteAllText(filePath, content)
    Dts.TaskResult = ScriptResults.Success
End Sub

It is not recommended to use File.ReadAllText(filePath) to read big flat files, because it loads the entire content into memory. Consider using a simple Data Flow Task to transfer the data from this flat file to a new flat file, doing the transformation you need in a Script Component on each row.
Alternatively, you can read the file line by line in a script using a StreamReader and write the result to a new file using a StreamWriter; when finished, delete the original file and rename the new one.
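A minimal sketch of that StreamReader/StreamWriter approach, assuming each JSON document sits on its own line in the source file (adjust the splitting logic if it doesn't); the ".tmp" temporary file name is only illustrative:
Public Sub Main()
    Dim filePath As String = Dts.Variables("User::filepath").Value.ToString()
    Dim tempPath As String = filePath & ".tmp" ' temporary output next to the source file

    Using reader As New IO.StreamReader(filePath)
        Using writer As New IO.StreamWriter(tempPath, False)
            writer.Write("{ ""query"" : [")
            Dim first As Boolean = True
            Dim line As String = reader.ReadLine()
            While line IsNot Nothing
                If line.Trim().Length > 0 Then
                    ' separate the individual JSON documents with commas
                    If Not first Then writer.Write(",")
                    writer.Write(line)
                    first = False
                End If
                line = reader.ReadLine()
            End While
            writer.Write("] }")
        End Using
    End Using

    ' swap the rewritten file into place
    IO.File.Delete(filePath)
    IO.File.Move(tempPath, filePath)

    Dts.TaskResult = ScriptResults.Success
End Sub
Because only one line is held in memory at a time, memory usage should stay flat regardless of the file size.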
References
How to open a large text file in C#
File System Read Buffer Efficiency
c# - How to read a large (5GB) txt file in .NET?

Related

Managing links in Excel source workbooks when duplicating source files

I have three files:
Activefile - where my code is stored and run
Databasefile - where my raw data is housed (has lots of protection)
Copyofdatabasefile - is a copy without the protection
I have a macro that runs in activefile to update the databasefile workbook. Later in the macro I use the SaveAs method on databasefile to create copyofdatabasefile, removing some functionality so that people can access the data easily without going through some of the checks in the main databasefile.
When saving copyofdatabasefile, the links in my activefile are updated to point at the new copyofdatabasefile. I don't want this to happen.
How can I adjust my Excel links/code to ensure that the links in my file aren't transferred across to copyofdatabasefile?
The SaveAs macro options are currently:
Databasefile.SaveAs filename:="\\somelocation\copyofdatabasefile.xlsx", FileFormat:=51, CreateBackup:=False
Using the Workbook.SaveCopyAs Method should work.
If your original file is xlsx, use:
Databasefile.SaveCopyAs Filename:="\\somelocation\copyofdatabasefile.xlsx"
Note that it saves in the same FileFormat as the original file only!
If your original file is xlsm and you need to change the file format (e.g. from xlsm to xlsx), you need to save the copy in the original file format first, then reopen that copy with Workbooks.Open() and use .SaveAs to change the FileFormat:
Databasefile.SaveCopyAs Filename:="\\somelocation\copyofdatabasefile.xlsm" 'if original file was xlsm
Dim wb As Workbook
Set wb = Workbooks.Open("\\somelocation\copyofdatabasefile.xlsm")
wb.SaveAs filename:="\\somelocation\copyofdatabasefile.xlsx", FileFormat:=51, CreateBackup:=False
wb.Close False
Kill "\\somelocation\copyofdatabasefile.xlsm" 'delete old format

Filter file names from a folder that contain specific text in SSIS

I am trying to move some files from one folder to another, and I need to check whether a file name contains certain text; only those files should be moved.
For example, I have the following files in a folder:
abcd_takeme_fdsljker.txt
abcd_file_fdsljker.txt
abcd_takeme_fdsljsdfker.txt
abcd_filetk_fdsljker.txt
abcd_takeme_fdsljssker.txt
From the above, I want to pick the files whose names contain the text "takeme".
Using a Foreach Loop container
You have to add a Foreach Loop container to loop over the files in a specific directory.
Choose the following expression as the file name filter:
*takeme*
Map the file name to a variable.
Add a Data Flow Task inside the Foreach Loop to transfer the files.
Use the file name variable as the source.
You can follow the detailed article at:
http://www.sqlis.com/sqlis/post/Looping-over-files-with-the-Foreach-Loop.aspx
If you want to add multiple filters, follow my answer at:
How to add multiple file extensions to the Files: input field in the Foreach loop container SSIS
Using a Script Task
Alternatively, you can achieve this using a Script Task with code similar to the following (I used VB.NET):
Public Sub Main()
    For Each strFile As String In IO.Directory.GetFiles("C:\New Folder\", "*takeme*", IO.SearchOption.AllDirectories)
        Dim filename As String = IO.Path.GetFileName(strFile)
        IO.File.Copy(strFile, "D:\New Folder\" & filename)
    Next
    Dts.TaskResult = ScriptResults.Success
End Sub

Find the content size of a file (docx, pptx, etc.)

I want to find out the size of the content inside a docx, pptx, etc. Is there any package that can be used for this? I googled and found that POI is widely used to read/write MS file types, but I am not able to find the right API to get the size of the file content. I want to know the actual content size, not the compressed file size that can be seen in the file properties.
I finally found a way, but it throws an OOM exception if the file is too large.
OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
XWPFDocument doc = new XWPFDocument(opcPackage);
XWPFWordExtractor we = new XWPFWordExtractor(doc);
String paragraphs = we.getText();
System.out.println("Total Paragraphs: "+paragraphs.length() / 1024);
Please help me if there is any better way to do this.
OK, this was asked a long time ago and there has been no response to it. I have not used OPCPackage, so my answer is not based on that.
DOCX (and for that matter PPTX as well as XLSX) files are all zip files with a particular structure.
We can hence use the java.util.zip package to enumerate the entries of the zip file and get the size of the zip entry xl for xlsx files and word for docx files. A more generic method is probably to ignore the following top-level zip entries, i.e. zip entries starting with:
docProps
_rels
[Content_Types].xml
The combined size of the remaining zip entries (do not ignore any folders within these entries) tells you the correct size of the content.
This method is also very efficient: you only read the zip file's entry metadata, not the compressed contents themselves, so obtaining the size information takes negligible time and memory. As a quick test, I was able to get the size of a 4 MB docx file in a fraction of a second.
A "good-enough" but not fully working piece of code using this approach is pasted below. Please feel free to use it as a starting point and fix any bugs you find. It would be great if you could post back modifications or corrections so that others can benefit.
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

private static final void printUnzippedContentLength() throws IOException
{
    ZipFile zf = new ZipFile(new File("/home/chaitra/verybigfile.docx"));
    try
    {
        Enumeration<? extends ZipEntry> entries = zf.entries();
        long sumBytes = 0L;
        while (entries.hasMoreElements())
        {
            ZipEntry ze = entries.nextElement();
            // skip package metadata entries; only the document parts count as content
            if (ze.getName().startsWith("docProps") || ze.getName().startsWith("_rels") || ze.getName().startsWith("[Content_Types].xml"))
            {
                continue;
            }
            // getSize() returns the uncompressed size stored in the zip's central directory
            sumBytes += ze.getSize();
        }
        System.out.println("Uncompressed content has size " + (sumBytes / 1024) + " KB");
    }
    finally
    {
        zf.close(); // release the underlying file handle
    }
}

How do I get a temporary File object (of correct content-type, without writing to disk) directly from a ZipEntry (RubyZip, Paperclip, Rails 3)?

I'm currently trying to attach image files to a model directly from a zip file (i.e. without first saving them to disk). It seems like there should be a clear way of converting a ZipEntry to a Tempfile or File that can be held in memory and passed to another method or object that knows what to do with it.
Here's my code:
def extract (file = nil)
Zip::ZipFile.open(file) { |zip_file|
zip_file.each { |image|
photo = self.photos.build
# photo.image = image # this doesn't work
# photo.image = File.open image # also doesn't work
# photo.image = File.new image.filename
photo.save
}
}
end
But the problem is that photo.image is a Paperclip attachment on the model, and assigning something to an attachment requires it to be a File object. However, I cannot for the life of me figure out how to convert a ZipEntry to a File. The only way I've seen of opening or creating a File is to use a string path, meaning I would have to extract the file to a location on disk first. Really, that just seems silly. Why can't I just extract the ZipEntry to an output stream and convert it to a File there?
So the ultimate question: Can I extract a ZipEntry from a Zip file and turn it directly into a File object (or attach it directly as a Paperclip object)? Or am I stuck actually storing it on the hard drive before I can attach it, even though that version will be deleted in the end?
UPDATE
Thanks to blueberry fields, I think I'm a little closer to my solution. Here's the line of code that I added, and it gives me the Tempfile/File that I need:
photo.image = zip_file.get_output_stream image
However, my Photo object won't accept the file that's getting passed, since it's not an image/jpeg. In fact, checking the content_type of the file shows application/x-empty. I think this may be because getting the output stream seems to append a timestamp to the end of the file, so that it ends up looking like imagename.jpg20110203-20203-hukq0n. Edit: Also, the tempfile that it creates doesn't contain any data and is of size 0. So it's looking like this might not be the answer.
So, next question: does anyone know how to get this to give me an image/jpeg file?
UPDATE:
I've been playing around with this some more. It seems an output stream is not the way to go, but rather an input stream (which one is which has always kind of confused me). Using get_input_stream on the ZipEntry, I get the binary data in the file. I think now I just need to figure out how to get this into a Paperclip attachment (as a File object). I've tried pushing the ZipInputStream directly to the attachment, but of course that doesn't work. I really find it hard to believe that no one has tried to cast an extracted ZipEntry as a File. Is there some reason this would be considered bad programming practice? It seems to me that skipping the disk write for a temp file would be perfectly acceptable and supported in something like Zip archive management.
Anyway, the question still stands:
Is there a way of converting an Input Stream to a File object (or Tempfile)? Preferably without having to write to a disk.
Try this:
require 'zip/zip'  # rubyzip (pre-1.0 API), which provides Zip::ZipFile
require 'tempfile'

Zip::ZipFile.open(params[:avatar].path) do |zipfile|
  zipfile.each do |entry|
    filename = entry.name
    basename = File.basename(filename)
    # buffer the zip entry into a Tempfile so there is a real file for Paperclip to work with
    tempfile = Tempfile.new(basename)
    tempfile.binmode
    tempfile.write entry.get_input_stream.read
    user = User.new
    user.avatar = {
      :tempfile => tempfile,
      :filename => filename
    }
    user.save
  end
end
Check out the get_input_stream and get_output_stream messages on ZipFile.

Copy text from WPF DataGrid to Clipboard to Excel

I have a WPF DataGrid (VS2010, C#). I copy the data from the DataGrid to the Clipboard and write it to an Excel file. Below is my code.
dataGrid1.SelectAllCells();
dataGrid1.ClipboardCopyMode = DataGridClipboardCopyMode.IncludeHeader;
ApplicationCommands.Copy.Execute(null, dataGrid1);
dataGrid1.UnselectAllCells();
string path1 = "C:\\test.xls";
string result1 = (string)Clipboard.GetData(DataFormats.CommaSeparatedValue);
Clipboard.Clear();
System.IO.StreamWriter file1 = new System.IO.StreamWriter(path1);
file1.WriteLine(result1);
file1.Close();
Everything works OK, except that when I open the Excel file it gives me two warnings:
"The file you are trying to open 'test.xls' is in a different format than specified by the file extension. Verify that the file is not corrupted and is from a trusted source before opening the file. Do you want to open the file now?"
"Excel has detected that 'test.xls' is a SYLK file, but cannot load it."
But after I click through them, the Excel file still opens fine and the data is formatted as it is supposed to be. However, I can't find out how to get rid of the two warnings before the Excel file opens.
You need to use csv as the extension; xls is the extension for Excel's own binary format, while the clipboard data you are writing is comma-separated text.
So
string path1 = "C:\\test.csv";
should work.
A problem like yours has already been described here: generating/opening CSV from console - file is in wrong format error.
Does it help to solve yours?
Edit: Here is the related Microsoft KB article: http://support.microsoft.com/kb/323626
