I have a task that requires investigating why the MD5 value of a file keeps changing.
Example:
1. Generate the diagnostic file for a certain machine.
2. Generating the file produces a .zip file, say Diag.zip, which contains all the information/files for that machine.
3. Inside Diag.zip is a .xls file, say Data.xls, which contains a summary of every file on that machine, including each file's directory, file version, file size, creation time, and MD5.
4. Save all the information from Data.xls in a database.
5. After a day or so, repeat steps 1-4.
Then, when I queried all of the saved Data.xls data in the database over a two-week range, it showed that almost all of the files on that machine had their MD5 values change.
The question is: why does the MD5 value change every time I generate a new diagnostic file?
There seems to be an issue with Excel files, in particular Excel 2003 .xls files. Whenever they get opened in Excel, even if they don't get changed and don't get saved, Excel automatically updates some of the file's metadata, such as the "Document Properties and Personal Information" and "Last Accessed Statistics". As a result, the file changes a little bit every time it is opened, and that is enough to change its MD5.
One way to avoid this problem is to remove "document properties and personal information".
Excel 2007: Remove Hidden Data and Personal Information from Office Documents
Excel 2013, Excel 2010: Remove Hidden Data and Personal Information by Inspecting Workbooks
Another way to avoid this would be to use .xlsx files. I have been trying to replicate this behavior with .xlsx files, but it seems to happen only with .xls (2003) files.
The MD5 hash is computed from the file's contents, i.e. its bytes, not from filesystem metadata such as the file name or creation date. If any byte of the content changes, the MD5 hash changes. The exact same content will always return the exact same MD5 hash; content that differs, even slightly, produces a different hash.
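As a quick illustration, here is a minimal C# sketch (the file path is made up, and this is not part of any diagnostic tool) showing that the hash depends only on the bytes read from the file:

using System;
using System.IO;
using System.Security.Cryptography;

class Md5Demo
{
    // Computes the MD5 of a file's contents. The file name, creation date,
    // and other filesystem metadata never enter the calculation.
    static string Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }

    static void Main()
    {
        // Hypothetical path. If Excel rewrites the document properties inside
        // this .xls, the bytes change and so does the hash printed here.
        Console.WriteLine(Md5Of(@"C:\Temp\Data.xls"));
    }
}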
There are many aspects of what I want to do but I think learning one piece will let me derive the rest.
I have an SSIS package that uses PowerShell to download a publicly available zip file, an execute-script step to unzip it with 7-Zip, and then data flows to load the unzipped files into the corresponding tables.
What I want to do is add the file name (and eventually other aspects of the file, like creation date, record counts, and so on) from any one of the unzipped files to a log table that keeps track of the summary-level details of the files.
How do I dynamically store this type of information as part of the package? Derived columns? But what's the input? Thanks!
There are many options for dynamically working with files through SSIS. Below is an overview of one method. Of course this can vary, depending on your specific needs and requirements.
Add a Foreach Loop Container. On the Collection pane, the Folder property can be set either through the GUI or through a parameter or variable via the Directory expression. Searching subfolders can likewise be enabled by checking the "Traverse subfolders" checkbox or by using the Recurse expression, just like the Folder field.
The Files field indicates which files to use, and wildcards can be used. * matches any number of characters. For example, *.csv will get all csv files regardless of name, and Test*.txt will return all .txt files whose names begin with Test, regardless of how many or which characters follow. To limit this to a single character, use ?. The FileSpec expression allows this to be set dynamically, by variable or parameter, similar to the directory.
The Variable Mappings pane allows you to set a variable that will hold the current file name from the directory. To map these, add a variable that will hold the file name at index 0.
You indicated that you wanted to store the file name. The detail of this can be controlled from the "Retrieve file name" field on the Collection window. As their names imply, Fully Qualified will hold the complete file path, Name and Extension will return the file name with extension, and Name Only is just the file name.
As for other aspects of the file, I'd recommend using a Script Task, which gives you more functionality. The C# FileInfo class provides options for finding details about the file, such as the creation date, the last time the file was accessed, and when the file was most recently written to. Additional information on this can be found here.
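As a rough sketch only, the body of the Script Task's Main method could look something like the following C#. The variable names User::FileName, User::FileCreatedDate, User::FileLastModified, and User::FileSizeBytes are assumptions; create whatever variables fit your package and list them under the task's ReadOnlyVariables/ReadWriteVariables:

// Inside the ScriptMain class that SSIS generates for a C# Script Task.
public void Main()
{
    // Assumed variable names; adjust them to the variables defined in your package.
    string path = Dts.Variables["User::FileName"].Value.ToString();

    var info = new System.IO.FileInfo(path);
    Dts.Variables["User::FileCreatedDate"].Value = info.CreationTime;   // creation date
    Dts.Variables["User::FileLastModified"].Value = info.LastWriteTime; // last written to
    Dts.Variables["User::FileSizeBytes"].Value = info.Length;           // size in bytes

    Dts.TaskResult = (int)ScriptResults.Success;
}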
For the record counts from the file, you'll need to create a Connection Manager and work with the data within the package. I'm assuming these are flat files? If so, creating a Flat File Connection Manager and setting the same variable from the Variable Mappings pane of the Foreach Loop on the ConnectionString expression of the Connection Manager will allow you to dynamically loop through each file. Make sure the Fully Qualified option is used for the "Retrieve file name" field, as described earlier, if you decide to do this. You will also want to configure the correct columns and data types for the Connection Manager ahead of time. This same process can be followed for Excel files, except that the variable with the file name is used on the ExcelFilePath expression instead.
As for storing information about a file in a log table, there are a multitude of options. A very basic example of an INSERT statement within an Execute SQL Task placed inside the Foreach Loop is below. The three-part table name is only necessary if you're using a table that differs from the initial catalog of the Connection Manager. The ? is the parameter marker (assuming this is an OLE DB connection). After this, map the same variable/parameter that stores the file name using the Parameter Mapping pane. Set the direction to Input, choose an appropriate data type (likely VARCHAR/NVARCHAR), put 0 in the Parameter Name field to indicate that this is the first parameter in the SQL statement (additional ? markers can be used for subsequent parameters, incrementing this field accordingly), and the default Parameter Size can be left at -1. Again, this is a simple example and you'll probably want to store more about the files and their contents, but it should get you started.
Sample SQL Insert:
INSERT INTO YourDataBase.YourSchema.YourTable (ColumnToHoldFileName)
VALUES (?)
You can use a variable to store the file name when you loop through the files, and after a file has been loaded into the table, you can use the current file name to insert into or update the log table.
figured it out from looking at other posts. I had to expand the parameter size...easy fix!
I have a very simple database in Access, but for each record I need to attach a scanned document (probably a PDF). What is the best way to do this? The database should not just link to a file on the PC; it should copy and keep the file with it, so that if the original file goes missing or the database is moved or copied, the file is still accessible from within the database. Is this possible, and what is the easiest way of doing it? If I should, I can write a macro, I just don't know where to start. Also, when I display a report of the table, I would like to just see thumbnails of the documents.
Thank you.
As the other answerers have noted, storing file data inside a database table can be a questionable practice. That said, I wouldn't personally rule it out, though if you are going to take that option, I'd strongly suggest splitting out the file data into its own table in its own backend file. For example:
Create a new database file called Scanned files.mdb (or Scanned files.accdb).
Add a single table called Scans with fields such as FileID (AutoNumber, primary key), MainTableID (matches whatever is the primary key of the main table in the main database file), FileName (Text), FileExt (Text) and FileData ('OLE object', really just a BLOB - don't actually use OLE Objects because they will bloat the database horribly).
Back in the frontend, add a reference to Scans as a linked table.
Use a bit of VBA to upload and extract files from the Scans table (if you're interested in the mechanics of this, post a separate question).
Use the VBA Shell routine (if you must) or ShellExecute from the Windows API (= the better option IMO) to open extracted data.
If you are using the newer ACCDB format, then you have the 'attachment' field type available, as smk081 suggests. This basically does most of the above steps for you; however, doing things 'by hand' gives you greater flexibility - for example, it allows giving each file a 'DateScanned' or 'DateEffective' field.
That said, your requirement for thumbnails will require explicit coding whatever option you take. It might be possible to leverage the Windows file previewing API, though I'd be certain thumbnails are a definite requirement before investigating this - Access VBA is powerful enough to encourage attempts at complex solutions, but frequently not clean and modern enough to allow fulfilling them in a particularly maintainable fashion.
There is an Attachment type under Data Type when you go into Design View of your table. You can add an attachment field here. When you go into the Datasheet view of the table you can select this field for a particular row and a window will open for you to specify the attachment. This will cause your database to quickly grow in size if you add a lot of large attachments.
You can use an OLE field in a table, but I would really suggest you not use this approach. The database is going to be HUGE in no time, and you're going to regret it.
Instead, you should consider adding a field that stores the path to the file, and keep the files in one folder on your network. Then you can use a SHELL() command to open the file. What's the difference between restoring an Access database and restoring PDF files if something goes wrong? This will keep your database at a manageable size and reduce the possibility of corruption.
I have a huge amount of trouble loading spreadsheets into a SQL Server database.
Currently, I'm using an SSIS package to load the data and I have had to make lots of adjustments to get the data to load:
All numbers must be formatted as text (otherwise they don't load properly).
Sometimes numbers must be preceded with single quote (') to get them to load.
If a column has a mix of number cells and text cells, the text cells must come first in the file (otherwise only numbers load and text comes in as NULL).
If a user changes a column name the file will not load.
If a user changes a tab name the file won't load.
If a user adds a new column (even at the end of a sheet) the file won't load.
Extra sheets in the file are not a problem, thankfully!
Dates seem sensitive as to whether or not they will load properly.
Connection strings to the Excel file must include "IMEX=1" or things are worse.
Scheduled SSIS jobs must be run as 32-bit even on 64-bit system.
I've been loading the data (usually 200,000-500,000 rows per file) into a table with all fields defined as nvarchar. Then, when loaded I transfer that data in the next step of the SSIS package to the working table with typed data fields.
All of the requirements that I must put on the user for how to format the Excel file are really a pain. We usually have to send the file back multiple times until all the formatting issues are correct before the file will load. I'd like to eliminate this thrash.
I know I'm not the only one that is facing this type of problem. So, I must ask...
What is a better alternative to Excel for loading data into a SQL Server database?
Or, am I going about this the wrong way? Should I be using something other than SSIS to load Excel spreadsheets?
You can try OpenRowSet:
SELECT *
INTO SomeTable
FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
    'Excel 8.0;Database=\\servername\c$\filename.xls;HDR=YES;IMEX=1', [Sheet2$])
Not really a SQL answer, but an easy one:
You could require the users to copy and paste data into an Excel spreadsheet where everything but the data fields to be included is locked. This will prevent many of the pain points described.
I am using a file consisting of published scientific data. I'm using this file with a program that reads in the first 5 space delimited data fields, and everything after that is considered a comment by the program.
2 example lines (of thousands):
FeII 1608.4511 0.521 55.36 -1300 M03 Journal of Physics
FeII 1611.23045 0.0321 55.36 1100 01J AJ
The program reads it as:
FeII 1608.4511 0.521 55.36 -1300
FeII 1611.23045 0.0321 55.36 1100
These numbers are each measurements and most (don't get me started) have associated errors that are not listed in this file. I would like to store this information in a useful and updatable way. That is, say the first entry FeII 1608.4511 has an error of plus/minus 0.002. Consider when a new measurement is made and changes it to: FeII 1608.45034 plus/minus 0.0005. I would like to update the value, the error, and record some information about the publication that it came from.
The program that uses this file is legacy code and is both crucial and inflexible: it needs the file to look like the above output when it's read in. I would really like a way to update the input file to include things like errors on the values and publication hyperlinks in the comments. I would also like a kind of version-control ability, so I could return the state of this large file as it is today, or in 5 months after 20 more lines have been updated with new values.
Any suggestions on how best to accomplish this? Should I store everything in some kind of database?
Databases are deeply tied to identity. If a database can't identify a row by the data that's in it, a database isn't going to help you.
If I were you, I'd start by storing the base file in a version control system, not a database. At 20 changes per 5 months, I'd probably make those changes manually and commit each batch of changes. (I don't know what might constitute a batch for you. Could be a single change every time.)
Since the format of the existing file is both crucial and brittle, I'm not sure whether modifying it is a good idea. I think I'd feel better about storing error ranges and publication hyperlinks in a separate file, and using a script to put the pieces together for applications that can use error ranges and hyperlinks.
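A minimal C# sketch of that merge-script idea, assuming a hypothetical sidecar file (extras.txt) keyed by the first two fields (species and wavelength) and holding the error and a publication link. The base file stays untouched for the legacy program; the script only produces an enriched copy for tools that can use the extra columns:

using System;
using System.Collections.Generic;
using System.IO;

class MergeExtras
{
    static void Main()
    {
        // Hypothetical sidecar file; each line looks like:
        //   FeII 1608.4511 0.002 http://example.org/M03
        // i.e. species, wavelength, error, publication link.
        var extras = new Dictionary<string, string>();
        foreach (var line in File.ReadLines("extras.txt"))
        {
            var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length >= 3)
                extras[parts[0] + " " + parts[1]] = string.Join(" ", parts, 2, parts.Length - 2);
        }

        // Append the extra columns after the fields the legacy program reads;
        // everything past the fifth space-delimited field is treated as comment anyway.
        foreach (var line in File.ReadLines("base_data.txt"))
        {
            var parts = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
            var key = parts.Length >= 2 ? parts[0] + " " + parts[1] : line;
            Console.WriteLine(extras.ContainsKey(key) ? line + " " + extras[key] : line);
        }
    }
}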
A database sounds sensible, SQL Server Express is free and widely used.
You can read in the text file including all comments and output the edited data in the same format. You can use a number of front ends including Access, for rapid development, or something you create yourself in VB.Net, or even Excel, at a pinch.
You will need to consider the structure of the table(s) but it should not be too difficult, and you can get help here.
For updating the information in the file to introduce errors and links, you don't need any database; just open the file, iterate through the lines, and update each one.
If you want to be able to restore a line's state, you definitely need some kind of database. You can create a database in SQL Server or Firebird, for example, and store in it a row for each line's historical state (with a creation date, of course); your file itself would remain the repository for current values, and you would be able to restore the file for a given date with some simple fetching of the database information.
If you can't use a database like Firebird or SQL Server, you can store the historical data in a simple text file; it's up to you. Just remember that, as @CatCall commented, you will necessarily need a way to identify each line in order to create a relation between the line in the file and the historical data stored in your repository.
I'm using SQL2008 to load sensor data in a table with Integration Services. I have to deal with hundreds of files. The problem is that the CSV files all have slightly different schemas. Each file can have a maximum of 20 data fields. All data files have these fields in common. Some files have all the fields others have some of the fields. In addition, the order of the fields can vary.
Here's an example of what the file schemas look like.
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,RD_1,SH_1,CL_2
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,CL_1,RS_1,RI_1,PR_1,WS_1,WD_1,WSM_1,WDM_1,SH_1
Station Name,Station ID,LOCAL_DATE,T_1,TD_1,RH_1,RS_1,RI_1,PR_1,RD_1,WS_1,WD_1,WSM_1,WDM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,PW_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,WS_1,WD_1,WSM_1
Station Name,Station ID,LOCAL_DATE,T_1,RH_1,RS_1,PR_1,VI_1,WS_1,WD_1,WSM_1
I'm using a Data Flow Script Task to process the data via CreateNewOutputRows() and MyOutputBuffer.AddRow(). I have a working package that loads the data; however, it's not reliable or robust, because as I add more files the package fails whenever a file's schema has not been defined in CreateNewOutputRows().
I'm looking for a dynamic solution that can cope with the variation in the file schemas. Does anyone have any ideas?
Who controls the data model for the output of the sensors? If it's not you, do they know what they are doing? If they create new and inconsistent models every time they invent a new sensor, you are pretty much up the creek.
If you can influence or control the evolution of the schemas for CSV files, try to come up with a top level data architecture. In the bad old days before there were databases, files made up of records often had, as the first field of each record, a "record type". CSV files could be organized the same way. The first field of every record could indicate what type of record you are dealing with. When you get an unknown type, put it in the "bad input file" until you can maintain your software.
If that isn't dynamic enough for you, you may have to consider artificial intelligence, or looking for a different job.
Maybe the cmd command line is an option: from cmd, you can use SQL Server's import tools to load the CSV.
If the CSV files that share an identical format use the same file-name convention, or if they can be separated out in some fashion, you can use a Foreach Loop Container for each file schema type.
A possible way to separate out the CSV files is to run a Script (in VB) in SSIS that reads the first row of each CSV file, checks which of the differing types it matches (if the column names are in the first row), and then moves the file to the appropriate folder for use in the Foreach Loop Container.
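To illustrate the header check (sketched here in C# rather than VB, with a made-up file path and an arbitrary subset of the column names above), reading the first row and building a name-to-position map is what lets the load cope with columns that are missing or in a different order:

using System;
using System.Collections.Generic;
using System.IO;

class CsvHeaderDemo
{
    static void Main()
    {
        // Hypothetical path and target columns; in the real package this logic
        // would sit in the Script Task/Component rather than a console program.
        string path = @"C:\SensorData\station_file.csv";
        string[] wanted = { "Station Name", "Station ID", "LOCAL_DATE", "T_1", "RH_1", "WS_1" };

        using (var reader = new StreamReader(path))
        {
            // Map each column name in this file's header row to its position.
            string[] header = reader.ReadLine().Split(',');
            var position = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
            for (int i = 0; i < header.Length; i++)
                position[header[i].Trim()] = i;

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split(',');
                foreach (string column in wanted)
                {
                    int index;
                    // Columns this particular file lacks come back as null, which
                    // would become NULL (the _IsNull flag) in the output buffer.
                    string value = position.TryGetValue(column, out index) && index < fields.Length
                        ? fields[index]
                        : null;
                    Console.WriteLine(column + " = " + (value ?? "NULL"));
                }
            }
        }
    }
}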