Delete old backup files in pentaho etl tool - database

I want to know how to delete files based on their creation date using a Kettle job. I have a log folder that contains log files for the last four years, but I want to keep only the last week's log files. The job should delete all the log files which are more than one month old. There is a delete file option in a Pentaho job, but how do we get the file creation date and delete the files accordingly?
The steps I used in the Kettle transformation:
Get file name
Get system info
Add constants
Database lookup: I am using PostgreSQL here. The step looks up the entity_name and attribute_name fields in the database, and the date is stored in the database through this lookup.
Select values:
Calculator
Filter rows
Set files in result
Process files with option delete.
My file names look like this, e.g. abcd_2018_06_05.backup.
I need a hard-coded regular expression that matches this file name. Could anybody help me define one, so that I get the equivalent of right(file_name, len(file_name)-7)?
I know how to do this in a SQL query, but not in Pentaho.

The Get File Names step also returns the last modified timestamp. Can’t you use that instead?
Something like this:
Get File Names -> Get System Info (to get the current date) -> Calculator (subtract 7 days from the current date) -> Filter rows (let only files older than 7 days through) -> Process files: delete (delete the old files).
Alternatively, you can parse the date out of the filename with the Regex Evaluation step and then filter rows.
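The filename-parsing alternative can be sketched outside Kettle as follows. This is only an illustration in Python, assuming every backup is named like abcd_2018_06_05.backup (any prefix, then _YYYY_MM_DD, then .backup); the names NAME_PATTERN, date_from_name and files_to_delete are made up for this example, and the 7-day window is taken from the answer above. The same pattern would probably carry over to the regex step, since it uses only basic regex syntax.

```python
import re
from datetime import date, timedelta

# Assumed naming scheme: <prefix>_YYYY_MM_DD.backup
NAME_PATTERN = re.compile(r"^.*_(\d{4})_(\d{2})_(\d{2})\.backup$")


def date_from_name(file_name):
    """Return the date embedded in the file name, or None if it cannot be parsed."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched but do not form a real calendar date.
        return None


def files_to_delete(file_names, today, keep_days=7):
    """Return the names whose embedded date is more than keep_days before today."""
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in file_names:
        stamp = date_from_name(name)
        if stamp is not None and stamp < cutoff:
            old.append(name)
    return old
```

Files whose names do not match the pattern are never selected, so unrelated files in the folder are left alone.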

Related

How to import files week by week using SSIS?

I want to load files into a SQL Server database on a weekly basis. Each file name contains a date. Currently, I use a Foreach Loop Container to get the file names and store them in a table. The table contains 3 columns: FileName, Date and Week. After loading FileName using an Execute SQL Task, I extract Date and Week from the FileName and populate the Date and Week columns. Then I use an Execute SQL Task to SELECT all table data ORDER BY Date and Week and store it in an object variable. Finally, I use a Foreach Loop Container to load the actual files in date order using the ADO Enumerator and the object variable. This works fine. However, I want to load the files week by week. For example, all the files that have week 15 in the table should be loaded first, then all the files of week 16, and so on. The reason is that after loading one week of files I want to process it using a stored procedure.
I think the problem can be solved by making two edits:
Loop over weeks:
Add an Execute SQL Task that retrieves the distinct weeks from the table.
Add a Foreach Loop Container to loop over the weeks.
Inside that loop, add an Execute SQL Task that retrieves the rows for the current week.
Use another Foreach Loop Container to loop over the result.
Ordered results:
You can simply add an ORDER BY clause inside the Execute SQL Task to get an ordered result set.
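The control flow of the two nested loops can be sketched like this. It is an illustration in Python rather than SSIS, assuming the staging table rows are available as (file_name, date, week) tuples with sortable dates; load_by_week, load_file and process_week are hypothetical names standing in for the outer loop, the data flow and the stored procedure call.

```python
from itertools import groupby


def load_by_week(rows, load_file, process_week):
    """Load files one week at a time, then run the weekly processing step.

    rows: iterable of (file_name, date, week) tuples, like the staging table.
    """
    # Equivalent of ORDER BY Week, Date on the staging table.
    ordered = sorted(rows, key=lambda row: (row[2], row[1]))
    # Outer loop: one iteration per distinct week.
    for week, week_rows in groupby(ordered, key=lambda row: row[2]):
        # Inner loop: every file that belongs to the current week.
        for file_name, _date, _week in week_rows:
            load_file(file_name)
        # Runs only after all files of this week have been loaded.
        process_week(week)
```

The sketch ignores year boundaries; if the files span several years, the week key would need to include the year.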
This is a limitation of the ForEach loop enumerators - there is no way to load files in a sorted/ordered manner. If you want to load files in such a manner then there are two ways to do this:
Purchase an expensive package of components from third party vendors that provide a ForEach loop enumerator that can process files in a sorted/ordered manner
Do it yourself manually.
For option two, you will need to perform the following steps:
Create a ForEach File loop enumerator to scan the folder for all files and insert the file names into a database table.
Create an Execute SQL Task that will SELECT all file names, ordered by file name. You can add constraints in the WHERE clause to control the date range of files that you want to process.
Load the result set into a variable of type Object
Create a ForEach ADO loop enumerator to loop through each file name that is stored in the object.
Place a data flow in the loop and then process the files.

Trying to delete records or reports in MS Access before a certain date

I have a Microsoft Access Database file.
I wanted to delete records older than 5 years in it.
I made a backup before starting to modify the file.
I was able to create a query and run the command below against the database file:
DELETE FROM Inspections Report WHERE Date <= #01/01/2013#
I used the example:
Delete by Date In Access
The records still seem to be in there.
My desired Output:
An analogy to what I am trying to do would be the bottom left corner of a Microsoft Word file, where you see page 1 of 10 when it should say page 1 of 5 after deleting pages.
DELETE Table1.*, Table1.VisitDate
FROM Table1
WHERE (((Table1.VisitDate)<=#1/1/2013#));
I suggest you make the query object and save it, so it appears in the Navigation Pane and can be tested manually. [In which case you use Query Design View and don't need the syntax above]
Then use the OpenQuery method to fire that query.
To run a SQL command from Access VBA you need to preface it with DoCmd.RunSQL or CurrentDb.Execute, and then put your SQL in quotes.
Also, the space is probably causing an issue: if the table you're deleting records from is called "Inspections Report", you'd enclose both those words in square brackets to show it's a single entity.
Finally, "Date" is a reserved word in Access, and it doesn't like it when you use it as a field name, as it can cause problems when referencing that field later on. You might try something like "InspectionDate".
So your code would look like this:
DoCmd.RunSQL "DELETE FROM [Inspections Report] WHERE InspectionDate <=#1/01/2013#"
If you have a static date, you'd probably only need to complete this process once, which you could just do in the table by filtering: filter for records before that date, use Ctrl+A to select all that match the criteria, and hit Delete. It will ask if you want to delete them, and you can check that the number of records it's about to delete is only the number that satisfies your criteria.
Of course, if you never want to keep records older than a certain number of years, you could replace the fixed date in the original statement with something like < DateAdd("yyyy", -5, Date()) and set it to execute every time you launch the database.
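The rolling cutoff idea can be checked on a small scale before it is pointed at real data. The sketch below is not Access: it uses Python with an in-memory SQLite table, and the table name inspections, the column inspection_date (stored as ISO text) and both function names are assumptions made for this example. It only shows the logic: compute a date five years back, delete everything before it, and confirm how many rows were removed.

```python
import sqlite3
from datetime import date


def years_ago(today, years):
    """Same calendar day `years` earlier; Feb 29 falls back to Feb 28."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        return today.replace(year=today.year - years, day=28)


def delete_older_than(conn, cutoff):
    """Delete rows dated before cutoff and return how many rows were removed."""
    cur = conn.execute(
        "DELETE FROM inspections WHERE inspection_date < ?",
        (cutoff.isoformat(),),
    )
    conn.commit()
    return cur.rowcount
```

Comparing the returned count with a SELECT COUNT(*) run beforehand with the same condition is a cheap way to see whether the delete really took effect.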

Need to delete all the files in a folder except the file which contains current datetime in the file name using sql job?

I have a text file named BankDetails_09302014_153054.txt in a folder. I need to create a SQL job that deletes the old files and keeps only the file with the current datetime in its name (e.g. BankDetails_10012014_103104.txt). The job needs to run daily. Kindly give me some suggestions for achieving this.
A similar question about how to get the creation and modified dates of a file was answered here:
Source 1
Source 2
Source 3
Once you have the relevant date, you can compare and delete. Hope this helps.
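Since the timestamp is already part of the file name, the comparison can also be done on the names alone. The following is a rough Python sketch rather than a SQL job, assuming the names follow BankDetails_MMDDYYYY_HHMMSS.txt as in the two examples from the question; stamp_from_name and stale_bank_files are names invented for this illustration. It keeps the file with the newest timestamp and returns the rest as deletion candidates.

```python
import re
from datetime import datetime

# Assumed naming scheme: BankDetails_MMDDYYYY_HHMMSS.txt
NAME_PATTERN = re.compile(r"^BankDetails_(\d{8}_\d{6})\.txt$")


def stamp_from_name(file_name):
    """Return the datetime embedded in the name, or None if it cannot be parsed."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%m%d%Y_%H%M%S")
    except ValueError:
        return None


def stale_bank_files(file_names):
    """Return every matching file except the one with the newest timestamp."""
    stamped = [(stamp_from_name(name), name) for name in file_names]
    stamped = [(stamp, name) for stamp, name in stamped if stamp is not None]
    if not stamped:
        return []
    newest = max(stamped)[1]
    return sorted(name for _stamp, name in stamped if name != newest)
```

Names that do not match the pattern are ignored, so other files in the folder are never proposed for deletion.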

Merge Number of Excel Files and load to SQL Database

I am trying to merge a large number of files: about 40,000 Excel files, all in exactly the same format (columns etc.).
I ran a merge command through CMD, which merged them up to a point, but I am unable to open the resulting CSV file because of its size.
What is the best process for merging such a large number of files and then loading them into SQL Server?
Are there any tools for this, or is it something that would need to be custom built?
I don't know of a tool for that, but here is my first idea, assuming you are experienced with Transact-SQL:
open a command shell, change to the folder where your Excel files are stored, and enter the following command: dir *.xlsx /b > source.txt
This will create a text file named "source.txt", which contains the names (and only the names) of all your Excel files
import this file into a SQL Server table, e.g. one called "sourcefiles"
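create a new stored procedure that contains a cursor. The cursor should read your table "sourcefiles" row by row in a loop and store the name of the Excel file currently being read in a variable, e.g. one called "@FileName"
in this loop, execute a SQL statement like this for every Excel file read. OPENROWSET accepts only string literals, so the file name is concatenated into dynamic SQL; @sql is an nvarchar(max) variable declared before the loop:
SET @sql = N'INSERT INTO dbo.YourDatabaseTable
SELECT * FROM OPENROWSET(''Microsoft.ACE.OLEDB.12.0'',
''Excel 12.0 Xml;HDR=YES;Database=' + @FileName + N''',
''SELECT * FROM [YourWorkSheet$]'')';
EXEC sp_executesql @sql;
let the cursor read the next row
Replace "YourDatabaseTable" and "YourWorkSheet" according to your needs. The target table must already exist, because SELECT ... INTO would try to create it again for every file.
@FileName must contain the full path to the Excel file.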
You may have to install the Microsoft.ACE.OLEDB.12.0 provider before executing the SQL command.
Hope this helps you think about your next steps.
Michael
edit: have a look at this website for possible errors

SSIS file move and rename

As part of a migration from a traditional system to a new technology, I need to rename N files [.txt, .pdf, .xl, etc.] in a particular folder using SSIS. The steps are:
Move the file to the destination.
Parse the prefix of the file name, which is used as the ID that associates the file with a record in the table.
Ex: 1012BA12_Attach_Emp.doc [ID=1012BA12]
Then go to the database and look up the new ID.
Ex: old ID=1012BA12 and equivalent new ID=512
Then replace the old ID with the new one.
Ex: 512_Attach_Emp.doc
Insert one row into a table with the new name and path.
I have used the Foreach File enumerator, Execute SQL Task and File System Task,
but the process takes a day to run.
Please suggest the best approach.
The issue you are having is likely to be on the database side, not SSIS.
Do you have indexes on the tables you are accessing?
Are the files local to the SSIS instance, or does SSIS access the files remotely?
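If the per-file database lookup does turn out to be the slow part, one option worth testing is to read the whole old-ID to new-ID mapping once and do the renaming logic in memory. The sketch below shows only that logic in Python, not an SSIS package; it assumes the ID is everything before the first underscore, as in 1012BA12_Attach_Emp.doc, and the names new_name, plan_renames and id_map are hypothetical.

```python
def new_name(file_name, id_map):
    """Swap the ID prefix (text before the first underscore) for its new ID.

    Returns None when the name has no prefix or the old ID is not in id_map.
    """
    old_id, separator, rest = file_name.partition("_")
    if not separator or old_id not in id_map:
        return None
    return f"{id_map[old_id]}_{rest}"


def plan_renames(file_names, id_map):
    """Return (old_name, new_name) pairs for every file that can be mapped."""
    plan = []
    for name in file_names:
        target = new_name(name, id_map)
        if target is not None:
            plan.append((name, target))
    return plan
```

The resulting list of pairs could then drive both the file moves and a single bulk insert of the new names and paths, instead of one insert per file.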

Resources