Read a CSV file with an unknown number of columns in Flink - apache-flink

I need to read a CSV file using Flink file source. I am using the below code to read it:
final TypeInformation[] fieldTypes = IntStream.range(0, 4)
        .mapToObj(i -> BasicTypeInfo.STRING_TYPE_INFO)
        .toArray(TypeInformation[]::new);

RowCsvInputFormat rowCsvInputFormat =
        new RowCsvInputFormat(new Path(lookupPath), fieldTypes,
                System.getProperty(LOOKUP_RECORD_SEPARATOR, LookupSeparators.LINE_SEPARATOR.getSeparator()),
                lookUpProcessingData.getDelimiter().toString());
rowCsvInputFormat.setSkipFirstLineAsHeader(true);

DataStream<Row> lookupStream =
        Context.getEnvironment()
                .readFile(
                        rowCsvInputFormat,
                        lookupPath, // + "/"
                        FileProcessingMode.PROCESS_CONTINUOUSLY,
                        refreshIntervalinMS);
In the above code I am specifying that the number of columns in my Row is 4, but my problem is that I do not know the number of columns in the CSV file beforehand.
The type of each column will be String, but the number of fields is unknown.
Is there a way I can provide a dynamic number of columns to RowCsvInputFormat?
I also tried TextInputFormat and splitting each line on my CSV delimiter, but it does not have a setSkipFirstLineAsHeader API.
How can I both split my records on a delimiter and use the setSkipFirstLineAsHeader API without knowing the number of columns in the CSV file beforehand?
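One possible workaround, sketched below under assumptions that are not part of the question (lookupPath points at a single CSV file that is readable from the client building the job, and the delimiter is the same one used above; variable names such as fieldDelimiter and columnCount are introduced only for the sketch, and the usual java.io, java.nio.charset, java.util.regex and Flink imports are assumed), is to peek at the header line first, count the columns, and build the TypeInformation[] array with that count:
// Sketch only: derive the column count from the header line, then construct the
// RowCsvInputFormat exactly as in the snippet above. Assumes lookupPath is a single
// file readable from the machine that builds the job graph; the enclosing method
// needs to handle or declare IOException.
String fieldDelimiter = lookUpProcessingData.getDelimiter().toString();

int columnCount;
try (BufferedReader headerReader = new BufferedReader(
        new InputStreamReader(new FileInputStream(lookupPath), StandardCharsets.UTF_8))) {
    // -1 keeps trailing empty columns; Pattern.quote protects delimiters such as "|"
    columnCount = headerReader.readLine().split(Pattern.quote(fieldDelimiter), -1).length;
}

final TypeInformation[] fieldTypes = IntStream.range(0, columnCount)
        .mapToObj(i -> BasicTypeInfo.STRING_TYPE_INFO)
        .toArray(TypeInformation[]::new);

RowCsvInputFormat rowCsvInputFormat =
        new RowCsvInputFormat(new Path(lookupPath), fieldTypes,
                System.getProperty(LOOKUP_RECORD_SEPARATOR, LookupSeparators.LINE_SEPARATOR.getSeparator()),
                fieldDelimiter);
rowCsvInputFormat.setSkipFirstLineAsHeader(true);
// ... then pass rowCsvInputFormat to readFile(...) exactly as before
If the path can also be a directory of files (as the commented-out "/" hints), you would sample the header of one file inside it instead; the key point is only that fieldTypes is built at runtime rather than hard-coded to 4.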

Related

Merging Text/Data files into Rows and Columns for over 80m lines of data

I've been assigned to take over 5k CSV files and merge them to create separate files containing transposed data, with each filename becoming a column in the new file (the corresponding source column extracted from each file as the data) and the rows being dates.
I was after some input/suggestions on how to accomplish this.
Example details as follows:
File1.csv -> File5000.csv
Each file contains the following
Date, Quota, Price, % Value, BaseCost,...etc..,Units
'date1','value1-1','value1-2',....,'value1-8'
'date2','value2-1','value2-2',....,'value2-8'
....etc....
'date20000','value20000-1','value20000-2',....,'value20000-8'
The resulting/merged csv file(s) would look like this:
Filename: Quota.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-1','file2-value1-1','file3-value1-1',etc.,'file5000-value1-1'
....etc....
'date20000','file1-value20000-1','file2-value20000-1','file3-value20000-1',etc.,'file5000-value20000-1'
Filename: Price.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-2','file2-value1-2','file3-value1-2',etc.,'file5000-value1-2'
....etc....
'date20000','file1-value20000-2','file2-value20000-2','file3-value20000-2',etc.,'file5000-value20000-2'
....up to Filename: Units.csv
Date,'File1','File2','File3',etc.,'File5000'
'date1','file1-value1-8','file2-value1-8','file3-value1-8',etc.,'file5000-value1-8'
....etc....
'date20000','file1-value20000-8','file2-value20000-8','file3-value20000-8',etc.,'file5000-value20000-8'
I've been able to use an array construct to reformat the data, but due to the sheer number of files and entries it uses way too much RAM - the array gets too big, and this approach is not scalable.
I was thinking of simply loading each of the 5,000 files one at a time, extracting each line 'one at a time' per file, then outputting the results to each of the new files 1-8 row-by-row; however this may take an extremely long time to convert the data, even on an SSD drive, with over 80 million lines of data across 5k+ files.
The idea was that it would load File1.csv, extract the first line, and store the Date and first-column data in a simple array. Then load File2.csv, extract the first line, check if the Date matches, and if so store its first-column data in the same array... repeat for all 5k files, and once completed store the array into a new file Column1-8.csv. Then repeat each file again for the corresponding dates, extracting only the first data column of each file to add to the Value1.csv file. Then repeat the whole process for Column2 data, up to Column8... taking forever :(
Any ideas/suggestions on approach via scripting language?
Note: The machine it will likely run on only has 8 GB of RAM and runs *nix.
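For what it's worth, here is a rough, hypothetical sketch (in Java; a scripting language would follow the same shape) of the one-column-per-pass idea described above: one streaming pass over all input files per output column, buffering one output row per date. It assumes every FileN.csv contains the same set of dates; with 20,000 dates x 5,000 files the per-pass buffers still run to a few GB, so heap size matters on an 8 GB machine. Directory, file, and column names are illustrative.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TransposeByColumn {
    public static void main(String[] args) throws IOException {
        // Collect and order the input files (directory name is illustrative)
        List<Path> inputs = new ArrayList<>();
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("input"), "*.csv")) {
            dir.forEach(inputs::add);
        }
        Collections.sort(inputs);

        // One output file per source column; names are illustrative
        String[] metrics = {"Quota", "Price", "PctValue", "BaseCost", "Col5", "Col6", "Col7", "Units"};
        for (int col = 0; col < metrics.length; col++) {
            Map<String, StringBuilder> rowsByDate = new LinkedHashMap<>();
            StringBuilder header = new StringBuilder("Date");

            for (Path file : inputs) {
                header.append(',').append(file.getFileName().toString().replace(".csv", ""));
                try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    reader.readLine(); // skip the per-file header row
                    String line;
                    while ((line = reader.readLine()) != null) {
                        String[] parts = line.split(",", -1);
                        String date = parts[0];
                        // append this file's value for the current column to the date's row
                        rowsByDate.computeIfAbsent(date, d -> new StringBuilder(d))
                                  .append(',').append(parts[col + 1]);
                    }
                }
            }

            try (BufferedWriter out = Files.newBufferedWriter(Paths.get(metrics[col] + ".csv"))) {
                out.write(header.toString());
                out.newLine();
                for (StringBuilder row : rowsByDate.values()) {
                    out.write(row.toString());
                    out.newLine();
                }
            }
        }
    }
}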

Flink: How to implement TypeInformation without knowing the actual number of columns in a CSV

I am reading a CSV file through Flink. The CSV file has a specific number of columns.
I have defined:
RowCsvInputFormat format = new RowCsvInputFormat(filePath,
        new TypeInformation[] {
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO
        });
The code works fine if all the rows in the file have the proper 4 columns.
I want to handle the scenario where a few rows in the file do not have 4 columns, or there is some other issue in a few rows.
How can I achieve this in Flink?
If you look at the specification on Wikipedia or in RFC 4180, it seems like CSV files should only contain rows that have the same number of columns, so it makes sense that RowCsvInputFormat does not support this.
You could read the files using readTextFile(path) and then parse the strings into Row objects in a flatMap() operator (or ignore rows that have issues):
env.readTextFile(params.get("input"))
        .flatMap(someCsvRowParseFunction())
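A minimal sketch of what such a parse function might look like (assuming the streaming API as in the question above; the column count, the delimiter, and the silent dropping of bad rows are assumptions, not part of the original answer; it needs FlatMapFunction, Collector, and Row from the Flink API):
// Sketch: parse each text line into a Row and drop malformed rows.
DataStream<Row> rows = env
        .readTextFile(params.get("input"))
        .flatMap(new FlatMapFunction<String, Row>() {
            private final int expectedColumns = 4;  // assumption: adjust to your file
            private final String delimiter = ",";   // assumption: adjust to your file

            @Override
            public void flatMap(String line, Collector<Row> out) {
                String[] fields = line.split(delimiter, -1); // -1 keeps trailing empty fields
                if (fields.length != expectedColumns) {
                    return; // ignore (or log / side-output) rows with the wrong number of columns
                }
                Row row = new Row(expectedColumns);
                for (int i = 0; i < expectedColumns; i++) {
                    row.setField(i, fields[i]);
                }
                out.collect(row);
            }
        });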

Load csv file data into tables

I created tables as below:
source:([id:`symbol$()] ric:();source:();Date:`datetime$())
property:([id:`symbol$()] Value:())
Then I have two .csv files which contain the data for these two tables.
property.csv is as below:
id,Value
TEST1,1
TEST2,2
source.csv is as below:
id,ric,source,Date
1,TRST,QO,2017-07-07 11:42:30.603
2,TRST2,QOT,2018-07-07 11:42:30.603
Now, how do I load the CSV file data into each table in one go?
You can use 0: to load delimited records: https://code.kx.com/wiki/Reference/ZeroColon
The simplest form of the function is (types; delimiter) 0: filehandle
The types should be given as their uppercase letter representations, one for each column, or a blank space to ignore a column. E.g. using "SJ" for property.csv would mean reading the id column as a symbol and the Value column as a long.
The delimiter specifies how each column is separated; in your case it is comma-separated values (CSV). You can pass the delimiter as a string ",", which will treat every row as part of the data and return a nested list of the columns. You can either insert this into a table with a matching schema, or append headers and flip the dictionary to get a table, like so: flip `id`value!("IS";",") 0: `:test.txt.
If you have column headers as the first row in the csv, you can pass an enlisted delimiter, enlist ",", which will use the column headers and return a table in kdb with these as the headers, which you can then rename if you see fit.
As the files you want to read in have different column types and are to be loaded into different tables, you could create a function to read them in, for example:
{x insert (y;enlist ",") 0:z}'[(`source;`property);("SSSP";"SJ");(`:source.csv;`:property.csv)]
This allows you to specify the name of the table that should be created, the column types, and the file handle of each file.
I would suggest a timestamp instead of the (deprecated) datetime, as it is stored as a long instead of a float, so there will be no issues with comparisons.
You can use key to list the contents of the dir:
files: key `:.; /get the contents of the dir
files:files where files like "*.csv"; /filter the csv files
m:`property.csv`source.csv!("SJ";"JSSZ"); /create the mappings for each csv file
{[f] .[first ` vs f;();:; (m#f;enlist csv) 0: hsym f]}each files
and finally, the last line loads each csv file into a table named after the file; please note that here the directory is the pwd, so you might need to add the dir path to each file before using 0:

Load CSV to a database when columns can be added/removed by the vendor

I've got some SSIS packages that take CSV files that come from the vendor and puts them into our local database. The problem I'm having is that sometimes the vendor adds or removes columns and we don't have time to update our packages before our next run, which causes the SSIS packages to abend. I want to somehow prevent this from happening.
I've tried reading in the CSV files line by line, stripping out new columns, and then using an insert statement to put the altered line into the table, but that takes far longer than our current process (the CSV files can have thousands or hundreds of thousands of records).
I've started looking into using ADO connections, but my local machine has neither the ACE nor JET providers and I think the server the package gets deployed to also lacks those providers (and I doubt I can get them installed on the deployment server).
I'm at a loss as to what I can do to load the tables in a way that ignores newly added or removed columns (although if a CSV file lacks a column the table has, that's not a big deal) and is fast and reliable. Any ideas?
I went with a different approach, which seems to be working (after I worked out some kinks). What I did was take the CSV file rows and put them into a temporary datatable. When that was done, I did a bulk copy from the datatable to my database. In order to deal with missing or new columns, I determined what columns were common to both the CSV and the table and only processed those common columns (new columns were noted in the log file so they can be added later). Here's my BulkCopy module:
Private Sub BulkCopy(csvFile As String)
    Dim i As Integer
    Dim rowCount As Int32 = 0
    Dim colCount As Int32 = 0
    Dim writeThis As ArrayList = New ArrayList

    tempTable = New DataTable()

    Try
        '1) Set up the columns in the temporary data table, using commonColumns
        For i = 0 To commonColumns.Count - 1
            tempTable.Columns.Add(New DataColumn(commonColumns(i).ToString))
            tempTable.Columns(i).DataType = GetDataType(commonColumns(i).ToString)
        Next

        '2) Start adding data from the csv file to the temporary data table
        While Not csvReader.EndOfData
            currentRow = csvReader.ReadFields() 'Read the next row of the csv file
            rowCount += 1
            writeThis.Clear()

            For index = 0 To UBound(currentRow)
                If commonColumns.Contains(csvColumns(index)) Then
                    Dim location As Integer = tableColumns.IndexOf(csvColumns(index))
                    Dim columnType As String = tableColumnTypes(location).ToString

                    If currentRow(index).Length = 0 Then
                        writeThis.Add(DBNull.Value)
                    Else
                        writeThis.Add(currentRow(index))
                    End If
                End If
            Next

            Dim row As DataRow = tempTable.NewRow()
            row.ItemArray = writeThis.ToArray
            tempTable.Rows.Add(row)
        End While
        csvReader.Close()

        '3) Bulk copy the temporary data table to the database table.
        Using copy As New SqlBulkCopy(dbConnection)
            '3.1) Set up the column mappings
            For i = 0 To commonColumns.Count - 1
                copy.ColumnMappings.Add(commonColumns(i).ToString, commonColumns(i).ToString)
            Next
            '3.2) Set the destination table name
            copy.DestinationTableName = tableName
            '3.3) Copy the temporary data table to the database table
            copy.WriteToServer(tempTable)
        End Using
    Catch ex As Exception
        message = "*****ERROR*****" + vbNewLine
        message += "BulkCopy: Encountered an exception of type " + ex.GetType.ToString()
        message += ": " + ex.Message + vbNewLine + "***************" + vbNewLine
        LogThis(message)
    End Try
End Sub
There may be something more elegant out there, but this so far seems to work.
Look into BiML, which builds and executes your SSIS package dynamically based on the metadata at run time.
Based on this comment:
I've tried reading in the CSV files line by line, stripping out new columns, and then using an insert statement to put the altered line into the table, but that takes far longer than our current process (the CSV files can have thousands or hundreds of thousands of records).
And this:
I used a csvreader to read the file. The insert was via a sqlcommand object.
It would appear at first glance that the bottleneck is not in the flat file source, but in the destination. An OLE DB Command executes in a row-by-row fashion, one statement per input row. By changing this to an OLE DB Destination, the process becomes a bulk insert operation. To test this out, just use the flat file source and connect it to a derived column. Run that and check the speed. If it's faster, change to the OLE DB Destination and try again. It also helps to insert into a heap (no clustered or nonclustered indexes) and to use TABLOCK.
However, this does not solve your whole varied file problem. I don't know what the flat file source does if you are short a column or more from how you originally configured it at design time. It might fail, or it might import the rows in some jagged form where part of the next row is assigned to the last columns in the current row. That could be a big mess.
However, I do know what happens when a flat file source gets extra columns. I put in this Connect item for it, which was sadly rejected: https://connect.microsoft.com/SQLServer/feedback/details/963631/ssis-parses-flat-files-incorrectly-when-the-source-file-contains-unexpected-extra-columns
What happens is that the extra columns are concatenated into the last column. If you plan for it, you could make the last column large and then parse it in SQL from the staging table. Alternatively, you could just jam the whole row into SQL and parse each column from there. That's a bit clunky though, because you'll have a lot of CHARINDEX() calls checking the position of values all over the place.
An easier option might be to parse it in .NET in a script task, using some combination of Split() to get all the values and a check on the count of values in the array to know how many columns you have. This would also allow you to direct the rows to different buffers based on what you find.
And lastly, you could ask the vendor to commit to a format: either a fixed number of columns, or a format that handles variation, like XML.
I've got a C# solution (I haven't checked it, but I think it works) for a source script component.
It reads the header into an array using Split().
Then, for each data row, it uses the same Split() function and uses the header value to decide which output column to set with the row value.
You will need to put all of the output columns into the output area.
All columns that are not present will have a null value on exit.
public override void CreateNewOutputRows()
{
    using (System.IO.StreamReader sr = new System.IO.StreamReader(@"[filepath and name]"))
    {
        string fullText = sr.ReadToEnd();
        string[] rows = fullText.Split('\n');

        // Get header values
        string[] header = rows[0].TrimEnd('\r').Split(',');

        // Start at 1 to skip the header row
        for (int i = 1; i < rows.Length; i++)
        {
            string line = rows[i].TrimEnd('\r');
            if (line.Length == 0)
                continue; // skip the trailing empty entry left by the final newline

            string[] rowVals = line.Split(',');

            // One output row per data row
            Output0Buffer.AddRow();

            for (int j = 0; j < rowVals.Length; j++)
            {
                // Deal with each known header name
                switch (header[j])
                {
                    case "Field 1 Name": // this is where you use known column names
                        Output0Buffer.FieldOneName = rowVals[j]; // cast if not string
                        break;
                    case "Field 2 Name":
                        Output0Buffer.FieldTwoName = rowVals[j]; // cast if not string
                        break;
                    // continue this pattern for all column names
                }
            }
        }
    }
}

How can I compare one line in one CSV with all lines in another CSV file?

I have two CSV files:
Identity(no,name,Age) which has 10 rows
Location(Address,no,City) which has 100 rows
I need to extract the rows and match the no column in the Identity CSV with the no column in the Location CSV.
Take a single row from the Identity CSV file and check its Identity.no against Location.no across the 100 rows in the Location CSV file.
If it matches, combine the name and Age from Identity with the Address and City from Location.
Note: I need to take the 1st row from Identity and compare it with the 100 rows in the Location CSV file, then take the 2nd row and compare it with the 100 rows, and so on, up to the 10 rows in the Identity CSV file.
Then convert the overall result into JSON and move the results into SQL Server.
Is it possible in Apache NiFi?
Any help appreciated.
You can do this in NiFi by using the DistributedMapCache feature, which implements a key/value store for lookups. The setup requires a distributed map cache, plus two flows - one to populate the cache with your Address records, and one to look up the address by the no field.
The DistributedMapCache is defined by two controller services, a DistributedMapCacheServer and a DistributedMapCacheClientService. If your data set is small, you can just use "localhost" as the server.
Populating the cache requires reading the Address file, splitting the records, extracting the no key, and putting key/value pairs to the cache. An approximate flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> PutDistributedMapCache.
Looking up your identity records is actually fairly similar to the flow above, in that it requires reading the Identity file, splitting the records, extracting the no key, and then fetching the address record. Processor flow might include GetFile -> SplitText -> ExtractText -> UpdateAttribute -> FetchDistributedMapCache.
You can convert the whole or parts from CSV to JSON with AttributesToJSON, or maybe ExecuteScript.
