I need to extract three columns from a file - dataset

I am doing a project in Association Rule Mining. For my data set, I need to extract three columns from a text file.
Here is the link for the text file.
I need to extract Billno, Product and Batch columns and write them on to a text file.

The easiest way would be to use grep
http://www.dreamincode.net/forums/topic/290545-using-a-grep-command-in-a-c-program-in-linux/
Otherwise, the data is consistent: all the bill numbers are the same length, and the fields appear to be easily regexed:
http://www.cplusplus.com/reference/regex/
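For example, here is a minimal C sketch of the extract-and-write step, in the spirit of the links above. The filenames and the field positions are assumptions only (the sketch treats the line as tab/whitespace delimited and guesses that Billno, Product and Batch sit at positions 0, 2 and 4); adjust them to the real layout of the file.

/* Sketch: pull three delimited fields from each line and write them out.
 * The filenames and the field positions (0, 2, 4) are assumptions only;
 * adjust them to wherever Billno, Product and Batch really sit.          */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *in  = fopen("input.txt",   "r");   /* hypothetical source file */
    FILE *out = fopen("columns.txt", "w");   /* hypothetical output file */
    char line[1024];

    if (!in || !out) { perror("fopen"); return 1; }

    while (fgets(line, sizeof line, in)) {
        char *field[16] = { 0 };
        int   n = 0;

        line[strcspn(line, "\r\n")] = '\0';
        for (char *tok = strtok(line, "\t "); tok && n < 16; tok = strtok(NULL, "\t "))
            field[n++] = tok;

        if (n > 4)   /* Billno, Product, Batch at assumed positions 0, 2, 4 */
            fprintf(out, "%s\t%s\t%s\n", field[0], field[2], field[4]);
    }
    fclose(in);
    fclose(out);
    return 0;
}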

Related

Merging Text/Data files into Rows and Columns for over 80m lines of data

I've been assigned to take over 5k CSV files and merge them to create separate files containing transposed data, with each filename becoming a column in a new file (column 1 of each source file supplying the data) and the rows being dates.
I was after some input/suggestions on how to accomplish this.
Example details as follows:
File1.csv -> File5000.csv
Each file contains the following
Date, Quota, Price, % Value, BaseCost,...etc..,Units
'date1','value1-1','value1-2',....,'value1-8'
'date2','value2-1','value2-2',....,'value2-8'
....etc....
'date20000','value20000-1','value20000-2',....,'value20000-8'
The resulting/merged csv file(s) would look like this:
Filename: Quota.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-1','file2-value1-1','file3-value1-1',etc.,'file5000-value1-1'
....etc....
'date20000','file1-value20000-1','file2-value20000-1','file3-value20000-1',etc.,'file5000-value20000-1'
Filename: Price.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-2','file2-value1-2','file3-value1-2',etc.,'file5000-value1-2'
....etc....
'date20000','file1-value20000-2','file2-value20000-2','file3-value20000-2',etc.,'file5000-value20000-2'
....up to Filename: Units.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-8','file2-value1-8','file3-value1-8',etc.,'file5000-value1-8'
....etc....
'date20000','file1-value20000-8','file2-value20000-8','file3-value20000-8',etc.,'file5000-value20000-8'
I've been able to use an array construct to reformat the data, but due to the sheer number of files and entries it uses far too much RAM: the array gets too big, and this approach is not scalable.
I was thinking of simply loading each of the 5,000 files one at a time, extracting each line one at a time per file, then outputting the results to each of the new files 1-8 row by row; however, this may take an extremely long time to convert the data, even on an SSD drive, with over 80 million lines of data across 5k+ files.
The idea was that it would load File1.csv, extract the first line, and store the date and first-column data in a simple array. Then load File2.csv, extract the first line, check whether the date matches, and if so store its first-column data in the same array... repeat for all 5k files, and once completed store the array in a new file, Column1-8.csv. Then go through each file again for the corresponding dates, extracting only the first data column of each file to add to the Value1.csv file. Then repeat the whole process for column 2's data, up to column 8... taking forever :(
Any ideas/suggestions on approach via scripting language?
Note: The machine it will likely run on only has 8GB RAM, using *nix.
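The streaming idea described above can stay within a few kilobytes of working memory if all of the input files are read in lockstep, one line from each per output row, so nothing is ever re-read. Below is a rough C sketch of that one-pass transpose (the same structure ports directly to awk, Perl or Python). It assumes the hypothetical names File1.csv..File5000.csv and Column1.csv..Column8.csv, that every file lists the same dates in the same order, and that the open-file limit has been raised first (e.g. ulimit -n 6000).

/* One-pass transpose sketch: keeps every input file open and reads them
 * in lockstep, so memory use is tiny. Assumes identical dates in identical
 * order across all files, no quoted commas, and a raised fd limit.        */
#include <stdio.h>
#include <string.h>

#define NFILES 5000          /* File1.csv .. File5000.csv */
#define NCOLS  8             /* Quota, Price, ... Units   */

int main(void)
{
    static FILE *in[NFILES];
    FILE *out[NCOLS];
    char line[1024], buf[1024], name[64];

    for (int f = 0; f < NFILES; f++) {
        snprintf(name, sizeof name, "File%d.csv", f + 1);
        if (!(in[f] = fopen(name, "r"))) { perror(name); return 1; }
        fgets(line, sizeof line, in[f]);              /* skip header row   */
    }
    for (int c = 0; c < NCOLS; c++) {
        snprintf(name, sizeof name, "Column%d.csv", c + 1);
        if (!(out[c] = fopen(name, "w"))) { perror(name); return 1; }
        fprintf(out[c], "Date");                      /* output header row */
        for (int f = 0; f < NFILES; f++) fprintf(out[c], ",File%d", f + 1);
        fputc('\n', out[c]);
    }

    /* One output row per date: read one line from every input file and
     * spread its 8 values across the 8 output files.                      */
    while (fgets(line, sizeof line, in[0])) {
        for (int f = 0; f < NFILES; f++) {
            char *fields[NCOLS + 1];
            int   n = 0;

            if (f > 0 && !fgets(line, sizeof line, in[f])) goto done;
            strcpy(buf, line);
            buf[strcspn(buf, "\r\n")] = '\0';

            for (int c = 0; c <= NCOLS; c++) fields[c] = "";
            for (char *tok = strtok(buf, ","); tok && n <= NCOLS; tok = strtok(NULL, ","))
                fields[n++] = tok;

            for (int c = 0; c < NCOLS; c++) {
                if (f == 0) fprintf(out[c], "%s", fields[0]);       /* date */
                fprintf(out[c], ",%s", fields[c + 1]);
            }
        }
        for (int c = 0; c < NCOLS; c++) fputc('\n', out[c]);
    }
done:
    for (int f = 0; f < NFILES; f++) fclose(in[f]);
    for (int c = 0; c < NCOLS; c++) fclose(out[c]);
    return 0;
}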

Export SQL Data to Fixed Width Text File with Multiple Record Types

I need to create an export of data from SQL Server (multiple tables) into a fixed width text file. The text file will have rows that are different based on the record type.
Header Info (Customer, Address)
Line Item Info (Customer, Item, Qty)
Summary Info (Customer, Total Qty)
Any suggestions to accomplish this efficiently?
I'm currently re-casting all columns to char to create the "fixed width", then using SSIS to merge the tables before exporting as a ragged-right text file. However, because not all the widths are the same, I'm having to concatenate the line-item info into one column to make the merge work. Also, the header info is being merged AFTER the line-item info, not before, so there's a sorting problem there. Not sure if I'm going down the right path?
Hope that made sense... this export is used to import into a COBOL-type system.
Thanks,
Using SSIS, create three data flow tasks, each creating a single text file in the fixed-width format:
File 1: Header Info
File 2: Line Item Info
File 3: Summary Info
Then concatenate them together into a fourth file using the approach described in the following link:
How to concatenate 2 files in SSIS (Integration Services)?
Hope this helps.
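Purely as an illustration of what a ragged-right, fixed-width file with multiple record types ends up looking like (the record types, widths and values below are made up; the real layout comes from the COBOL side, and the real export would still be produced by SSIS as above):

/* Illustration only: hypothetical header (H), detail (D) and summary (S)
 * records, each with its own fixed column widths and total length.        */
#include <stdio.h>

int main(void)
{
    FILE *out = fopen("export.txt", "w");     /* hypothetical output file */
    if (!out) { perror("fopen"); return 1; }

    fprintf(out, "H%-10s%-30s%-30s\n", "CUST001", "Acme Corp", "1 Main St Springfield");
    fprintf(out, "D%-10s%-15s%07d\n",  "CUST001", "ITEM-42", 12);
    fprintf(out, "D%-10s%-15s%07d\n",  "CUST001", "ITEM-99", 3);
    fprintf(out, "S%-10s%07d\n",       "CUST001", 15);

    fclose(out);
    return 0;
}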
For these sorts of problems, I reach for SSIS. It eats this kind of thing for lunch.

SAP Data Services .csv data file load from Excel with special characters

I am trying to load data from an Excel .csv file into a flat file format to use as a data source in a Data Services job data flow, which then transfers the data to a SQL Server (2012) database table.
I consistently lose 1 in 6 records.
I have tried various parameter values in the file format definition and settled on setting Adaptable file scheme to "Yes", file type "delimited", column delimiter "comma", row delimiter {windows new line}, text delimiter ", language eng (English), and everything else as defaults.
I have also set "write errors to file" to "yes", but it just creates an empty error file (I expected the 6,000-odd unloaded rows to be in there).
If we strip out the three columns containing special characters (visible in Excel), it loads a treat, so I think these characters are the problem.
The thing is, we need the data in those columns. Unfortunately, this .csv file is as good a data source as we are likely to get, and it is always likely to contain special characters in these three columns, so we need to be able to read it in if possible.
Should I try to specifically strip the columns in the Query source component of the dataflow? Am I missing a data-cleansing trick in the query or file format definition?
OK, so I didn't get the answer I was looking for, but I did get it to work by setting the "Row within Text String" parameter to "Row delimiter".
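If pre-cleaning ever does become necessary, one low-tech option is to copy the file and drop any byte outside printable ASCII before Data Services sees it. A rough C sketch of that idea (the filenames are hypothetical, and quoted fields with embedded delimiters are not handled here):

/* Sketch: copy a CSV byte by byte, keeping only printable ASCII plus
 * tab/newline, so stray special characters cannot break the load.       */
#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("source.csv", "rb");    /* hypothetical filenames */
    FILE *out = fopen("clean.csv",  "wb");
    int c;

    if (!in || !out) { perror("fopen"); return 1; }

    while ((c = fgetc(in)) != EOF)
        if ((c >= 0x20 && c < 0x7F) || c == '\n' || c == '\r' || c == '\t')
            fputc(c, out);

    fclose(in);
    fclose(out);
    return 0;
}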

How to Dynamically render Table name and File name in Pentaho DI

I have a requirement in which one source is a table and one source is a file. I need to join the two on a column. The problem is that I can do this for one table with one transformation, but I need to do it for multiple sets of files and tables, loading into another set of specific files as targets, using the same transformation.
Breaking down my requirement more specifically :
Source Table Source File Target File
VOICE_INCR_REVENUE_PROFILE_0 VoiceRevenue0 ProfileVoice0
VOICE_INCR_REVENUE_PROFILE_1 VoiceRevenue1 ProfileVoice1
VOICE_INCR_REVENUE_PROFILE_2 VoiceRevenue2 ProfileVoice2
VOICE_INCR_REVENUE_PROFILE_3 VoiceRevenue3 ProfileVoice3
VOICE_INCR_REVENUE_PROFILE_4 VoiceRevenue4 ProfileVoice4
VOICE_INCR_REVENUE_PROFILE_5 VoiceRevenue5 ProfileVoice5
VOICE_INCR_REVENUE_PROFILE_6 VoiceRevenue6 ProfileVoice6
VOICE_INCR_REVENUE_PROFILE_7 VoiceRevenue7 ProfileVoice7
VOICE_INCR_REVENUE_PROFILE_8 VoiceRevenue8 ProfileVoice8
VOICE_INCR_REVENUE_PROFILE_9 VoiceRevenue9 ProfileVoice9
The table and file names always correspond, i.e. VOICE_INCR_REVENUE_PROFILE_0 should always join with VoiceRevenue0 and the result should be stored in ProfileVoice0. There should be no mismatches in this case. I tried setting variables with the table names and file names, but a variable only takes one value at a time.
All table names and file names are constant. Is there any other way to get around this? Any help would be appreciated.
Try using the "Copy rows to result" step. It will store all the incoming rows (in your case, the table and file names) in memory, and for every row it will execute your transformation. In this way, you can read multiple filenames in one go.
Try reading this link. It's not the exact answer, but it's similar.
I have created a sample here. Please check if this is what is required.
In the first transformation, I read the table names and file names and loaded them into memory. After that I used the Get Variables step to read all the file and table names to generate the output. [Note: I have not used a table input as the source anywhere; I used TablesNames instead. You can replace that with the table input data.]
Hope it helps :)
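For what it's worth, the looping idea behind that setup is just: generate the ten (table, source file, target file) name triples and run the same transformation once per triple. A tiny sketch of that pattern follows, with run_transformation() as a placeholder standing in for whatever actually executes the parameterised job:

/* Sketch of the per-row looping pattern: build each name triple and hand
 * it to a placeholder that represents the parameterised transformation.  */
#include <stdio.h>

static void run_transformation(const char *table, const char *src, const char *dst)
{
    /* placeholder only; prints what each run would receive */
    printf("join %s with %s -> %s\n", table, src, dst);
}

int main(void)
{
    char table[64], src[32], dst[32];

    for (int i = 0; i <= 9; i++) {
        snprintf(table, sizeof table, "VOICE_INCR_REVENUE_PROFILE_%d", i);
        snprintf(src,   sizeof src,   "VoiceRevenue%d", i);
        snprintf(dst,   sizeof dst,   "ProfileVoice%d", i);
        run_transformation(table, src, dst);
    }
    return 0;
}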

Reading and writing xls and doc files in C

I have this particular problem where I have to write a C program that reads numerical data from a text file. The data is tab delimited. Here is a sample from the text file.
1 23099 345565 345569
2 908 66766 66768
This is data for clients, and each client has a row. Each column represents customer no., previous balance, previous reading, and current reading. Then I have to generate a .doc document that summarizes all this information and calculates the balance. I can write a function that does this, but how do I create an .xls document and a Word document where all the results are summarized using the program? The text document has only numerical data. Any ideas?
The easiest way is to create a CSV file rather than an XLS file.
Office opens CSV files with good results, and it is far easier to create an ASCII text file with comma-separated values than to write something in a closed format like the MS Office formats.
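A minimal sketch of that CSV route, reading the tab-delimited readings and writing them back out comma-separated, with a column left for the balance calculation (the filenames and the consumption column are placeholders; plug in the real billing formula):

/* Sketch: read "customer prev_balance prev_reading curr_reading" rows and
 * write a CSV that Excel can open; the real balance calculation goes
 * where the consumption placeholder is computed.                          */
#include <stdio.h>

int main(void)
{
    FILE *in  = fopen("readings.txt", "r");   /* hypothetical filenames */
    FILE *out = fopen("summary.csv",  "w");
    long cust, prev_balance, prev_reading, curr_reading;

    if (!in || !out) { perror("fopen"); return 1; }

    fprintf(out, "Customer,PrevBalance,PrevReading,CurrReading,Consumption\n");
    while (fscanf(in, "%ld %ld %ld %ld",
                  &cust, &prev_balance, &prev_reading, &curr_reading) == 4) {
        long consumption = curr_reading - prev_reading;   /* placeholder */
        fprintf(out, "%ld,%ld,%ld,%ld,%ld\n",
                cust, prev_balance, prev_reading, curr_reading, consumption);
    }
    fclose(in);
    fclose(out);
    return 0;
}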
The simplest way to create a spreadsheet that contains formulas and formatting, and can be opened by Excel, is to create an XML Spreadsheet file.
