Read LINQ results - arrays

I'm writing a small application that will download a .csv file via FTP and read that into Excel in a particular format.
I found the LINQ code snippet below on this site that will read the .csv file into the var 'csv'. The problem is that I can't seem to figure out how to enumerate the var 'csv' into a string array (which I'll then use to populate the relevant Excel cells).
Can anyone help? Thanks, Gavin
var lines = File.ReadAllLines(lblShowFileName.Text).Select(a => a.Split(','));
var csv = from line in lines select (from piece in line select piece);

Use the ToArray() extension method to generate a string[] instead of an IEnumerable<string>:
var csv = (from line in lines
           select (from piece in line
                   select piece).ToArray()).ToArray();
Because you're calling ToArray() twice, in both the inner and the outer query, your csv variable will be a jagged array of strings: string[][]. You can then index it as csv[row][column], or loop over the rows and the cells within each row, to populate the relevant Excel cells.

Related

Read a CSV file having unknown number of columns in flink

I need to read a CSV file using Flink's file source. I am using the code below to read it:
final TypeInformation[] fieldTypes = IntStream.range(0, 4)
        .mapToObj(i -> BasicTypeInfo.STRING_TYPE_INFO)
        .toArray(TypeInformation[]::new);
RowCsvInputFormat rowCsvInputFormat =
        new RowCsvInputFormat(new Path(lookupPath), fieldTypes,
                System.getProperty(LOOKUP_RECORD_SEPARATOR, LookupSeparators.LINE_SEPARATOR.getSeparator()),
                lookUpProcessingData.getDelimiter().toString());
rowCsvInputFormat.setSkipFirstLineAsHeader(true);
DataStream<Row> lookupStream =
        Context.getEnvironment()
                .readFile(
                        rowCsvInputFormat,
                        lookupPath, //+ "/"
                        FileProcessingMode.PROCESS_CONTINUOUSLY,
                        refreshIntervalinMS);
In the above code I am specifying that the number of columns in my Row would be 4. But my problem is that I will not know the number of columns in the CSV file beforehand.
My type for each column would be String, but the number of fields is unknown.
Is there a way I can provide a dynamic number of columns in RowCsvInputFormat?
I also tried TextInputFormat and splitting each line on my CSV delimiter, but it does not have a setSkipFirstLineAsHeader API.
How can I simultaneously split my records on a delimiter and use the setSkipFirstLineAsHeader API without knowing the number of columns in the CSV file beforehand?
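For illustration, the splitting itself does not need a fixed column count. A minimal sketch of the idea in plain Python (not the Flink API; the names are hypothetical):

def split_rows(lines, delimiter=','):
    """Yield one list of fields per line, whatever the column count."""
    for i, line in enumerate(lines):
        if i == 0:
            continue  # the equivalent of setSkipFirstLineAsHeader(true)
        yield line.rstrip('\n').split(delimiter)

In Flink itself this logic would live in a flatMap over text lines, as the answer to the next question sketches.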

Flink : How to implement TypeInformation without actual number of columns in csv

I am reading a CSV file through Flink. The CSV file has a specific number of columns.
I have defined:
RowCsvInputFormat format = new RowCsvInputFormat(filePath,
        new TypeInformation[]{
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO
        });
The code works fine if all the rows in the file have the proper 4 columns.
I want to handle the scenario where a few rows in the file do not have 4 columns, or there is some other issue in a few rows.
How can I achieve this in Flink?
If you look at the CSV specification on Wikipedia or in RFC 4180, it seems like CSV files should only contain rows that have the same number of columns, so it makes sense that RowCsvInputFormat would not support this.
You could read the files using readTextFile(path) and then parse the strings into a Row object in a flatMap() operator (or drop a row if there are issues in it):
env.readTextFile(params.get("input"))
   .flatMap(someCsvRowParseFunction())
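someCsvRowParseFunction() is left to you; the per-line logic is ordinary string handling. A rough sketch of that logic in Python (the names and the column check are hypothetical; in the actual job this would be a Java FlatMapFunction that emits Row objects):

def parse_csv_row(line, delimiter=',', expected_columns=4):
    """Split one text line into fields; return None for rows that should be dropped."""
    fields = [f.strip() for f in line.split(delimiter)]
    if len(fields) != expected_columns:
        return None  # malformed row: skip it instead of failing the job
    return fields

Dropping the length check also covers the case where the number of columns is not known beforehand.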

Python splitting issues due to extra columns being created

This is for a project I'm working on.
Using Python, I've imported a big CSV file that has about 2000 rows and turned it into a list.
Below is the script that I used to create the list:
data = []  # will put the data in here
with open('output.csv', "r") as file:  # open the file
    for data_row in file:
        # get the data one row at a time; split the row into columns,
        # stripping whitespace from each one, and store it in 'data'
        data.append([x.strip() for x in data_row.split(",")])
My main goal with this project is to create a table directly in a SQL server from a Python script, using pandas, for example:
df = pd.DataFrame(mydata, columns=['column1', 'column2', ...])
However, I ran into a problem while splitting: some fields contain people's names in 'Doe, John' format, which creates extra columns, so when I pass column names to pd.DataFrame it throws 'AssertionError: 39 columns passed, passed data had 44 columns'.
Could someone please help me solve this problem? I'd appreciate it much!
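For what it's worth, Python's standard csv module is quoting-aware: assuming the names are actually quoted in the file (as in "Doe, John"), the comma inside the quotes will not create an extra column. A minimal sketch of the same loop:

import csv

data = []  # will put the data in here
with open('output.csv', "r", newline='') as file:
    for row in csv.reader(file):  # csv.reader keeps quoted fields together,
        data.append([x.strip() for x in row])  # so "Doe, John" stays one column

pandas can also read the file directly, e.g. df = pd.read_csv('output.csv'), which applies the same quoting rules.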

Stream Analytics GetArrayElements as String

I have a Stream Analytics job that gets its data from an external source (I do not have a say in how the data is formatted). I am trying to import the data into my data lake, storing it as JSON. This works fine, but I also want to get the output as a CSV, and this is where I am having trouble.
As the input data has an array as one of the columns, the JSON import recognizes it and provides the right data, i.e. places the values in brackets [A, B, C], but in the CSV output the column comes out as the literal word "Array". I thought I would convert it to XML, use STUFF, and get the values on one line, but it does not like using a SELECT statement in a CROSS APPLY.
Has anyone imported data with an array column into CSV from Stream Analytics? If so, how did you manage to import the array values?
Sample data:
[
  {"GID":"10","UID":1,"SID":"5400.0","PG":["75aef","e5f8e"]},
  {"GID":"10","UID":2,"SID":"4400.0","PG":["75aef","e5f8e","6d793"]}
]
PG is the column I am trying to extract, so the output CSV should look something like:
GID|UID|SID|PG
10|1|5400.0|75aef,e5f8e
10|2|4400.0|75aef,e5f8e,6d793
This is the query I am using:
SELECT
    D.GID,
    D.UID,
    D.SID,
    A.ArrayValue
FROM
    dummy AS D
CROSS APPLY GetArrayElements(D.PG) AS A
As you could imagine, this gives me results in this format:
GID|UID|SID|PG
10|1|5400.0|75aef
10|1|5400.0|e5f8e
10|2|4400.0|75aef
10|2|4400.0|e5f8e
10|2|4400.0|6d793
As Pete M said, you could try creating a JavaScript user-defined function to convert the array to a string, and then call this user-defined function in your query.
JavaScript user-defined function:
function main(inputobj) {
    // Array.prototype.toString() joins the elements with commas,
    // e.g. ["75aef","e5f8e"] becomes "75aef,e5f8e"
    var outstring = inputobj.toString();
    return outstring;
}
Call UDF in query:
SELECT
    TI.GID, TI.UID, TI.SID, udf.extractdatafromarray(TI.PG)
FROM
    [TEST-SA-DEMO-BLOB-Input] AS TI
Result: each PG value comes back as a single comma-separated string, matching the desired output above.

Create 2d array from csv file in python 3

I have a .csv file which contains data like the following example:
A,B,C,D,E,F,
1,-1.978,7.676,7.676,7.676,0,
2,-2.028,6.081,6.081,6.081,1,
3,-1.991,6.142,6.142,6.142,1,
4,-1.990,65.210,65.210,65.210,5,
5,-2.018,8.212,8.212,8.212,5,
6,54.733,32.545,32.545,32.545,6,
...and so on
The format is constant.
I want to load the file into a variable "log" and access the values as log[row][column].
For example:
log[0][2] should give C
log[3][1] should give -1.991
If I use this code:
file = open('log.csv')
log = list(file)
When I use this code I get only a one-dimensional list, log[row].
Is there any way to store them directly as a 2D structure?
For example:
read the file
split with '\n'
split again with ','
Thanks!!
Try this:
log = [line.strip().split(',') for line in open('log.csv')]
The strip() removes the trailing newline from each line. Note that because every row in your sample ends with a comma, each inner list will also have an empty final element.
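Alternatively, the standard library's csv module does the splitting for you (and copes with quoted fields, should the format ever need them); a minimal sketch:

import csv

with open('log.csv', newline='') as f:
    log = [row for row in csv.reader(f)]

print(log[0][2])  # 'C'
print(log[3][1])  # '-1.991' (every cell is a string; convert with float() where needed)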
