Ruby: storing an array in a CSV results in output with extra "" - arrays

I have a script that scrapes a website, generates a CSV file, and stores raw data in it. Everything works well except when I try to store an array in the CSV file:
tarif_jeune = []
tarif_adulte =[]
html_doc.search("td table table table tr").each do |tr|
unless (tr.css("td:nth-child(11)").text.squish == "") || (tr.css("td:nth-child(11)").text.squish.size > 5) || (tr.css("td:nth-child(11)").text.squish == "0,00")
tarif_adulte << tr.css("td:nth-child(11)").text.squish
end
unless (tr.css("td:nth-child(12)").text.squish == "") || (tr.css("td:nth-child(12)").text.squish.size > 5) || (tr.css("td:nth-child(12)").text.squish == "0,00")
tarif_jeune << tr.css("td:nth-child(12)").text.squish
end
end
Then I insert tarif_jeune and tarif_adulte into the CSV file:
csv << ["true", tr.css("td:nth-child(10)").text.squish, tr.css("td:nth-child(11)").text.squish, tr.css("td:nth-child(11)").text.squish, tr.css("td:nth-child(12)").text.squish, tr.css("td:nth-child(13)").text.squish, tr.css("td:nth-child(14)").text.squish, tr.css("td:nth-child(15)").text.squish, tr.css("td:nth-child(1) a").attr("href").value, tarif_jeune.uniq, tarif_adulte.uniq, cat.uniq, address]
cat, tarif_jeune and tarif_adulte are all arrays. I would expect them to look like ["poo", "faa", "foo"] in my CSV, but the output is quite different: "" is inserted everywhere and I get something like this:
tarif_jeune, tarif_adulte, cat
"[""15,00""]","[""20,00""]","[""Simple Messieurs 45"", ""Simple Dames Senior"", ""Simple Messieurs Senior""]"
Can someone explain where those extra "" come from and how to get rid of them?

The double quote " character is the default :quote_char in CSV class.
So if you write a string that contains double quote characters, the CSV class escapes each of them by doubling it, and it appears as "" in the file.
In your case, you are writing an array of strings to CSV. When Array#to_s is called on an array of strings, the output is a String that looks like this:
ary = ["a", "b", "c"]
puts "#{ary}"
#=> ["a", "b", "c"]
The double quotes in the string representation of the array are escaped by the CSV class, so ary above appears as "[""a"", ""b"", ""c""]" in the CSV file.
To solve this, consider why you need to store the output of Array#to_s in the CSV at all. You most likely want ary.join(" ") or an equivalent, and then write that to the file.
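As a rough illustration of the escaping described above, here is a minimal sketch in Python rather than Ruby (chosen only for illustration; Python's csv module uses the same default quote character and the same doubling rule). The ruby_style string is written by hand to mimic what Array#to_s produces, and the variable names are hypothetical.

```python
import csv
import io

ary = ["a", "b", "c"]

# Hand-written text that mimics the output of Ruby's Array#to_s for this array.
ruby_style = '["a", "b", "c"]'

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([ruby_style])       # field contains quotes and commas, so it gets quoted
writer.writerow([" ".join(ary)])    # plain joined text needs no quoting

lines = buf.getvalue().splitlines()
print(lines[0])
print(lines[1])
```

The first line comes out wrapped in quotes with every inner quote doubled, while the joined version is written as plain text.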

This is perfectly valid csv.
Assuming you want your data to look like this (as you say in ["poo", "faa", "foo"]):
tarif_jeune, tarif_adulte, cat
["15,00"],["20,00"],["Simple Messieurs 45", "Simple Dames Senior", "Simple Messieurs Senior"]
Here, the fields contain commas, and the comma is also your field separator, so the fields have to be surrounded with quotes. And because the fields also contain quotes, each of those quotes needs to be escaped with another quote:
tarif_jeune, tarif_adulte, cat
"[""15,00""]","[""20,00""]","[""Simple Messieurs 45"", ""Simple Dames Senior"", ""Simple Messieurs Senior""]"
Any decent csv parser should be able to handle those extra quotes. In fact, without them, your csv would be malformed.
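To check that claim, here is a small sketch that reads such a line back. It uses Python's csv module purely as an example of a standard parser (the question itself is about Ruby), and the sample line is a shortened version of the output shown in the question.

```python
import csv
import io
import json

# Shortened version of the quoted line from the question.
line = '"[""15,00""]","[""20,00""]","[""Simple Messieurs 45"", ""Simple Dames Senior""]"'

# The parser removes the outer quotes and collapses each "" back to ".
row = next(csv.reader(io.StringIO(line)))
print(row)

# Each field now happens to be valid JSON, so it can be turned back into a list.
lists = [json.loads(field) for field in row]
print(lists)
```

Each parsed field is the original bracketed text with single double quotes again, so nothing is lost by the escaping.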

Related

How to create array dynamically from a string without copy pasting?

I have these as strings: { column01 \ column02 \ column01 } (for other countries { column01 , column02 , column01 }). I want them evaluated as an array, as if they had been copy-pasted.
The array string range is created automatically, based on the user's selection. I created a dynamically personalized dataset based on a sheet named studeertijden. The user can select the wanted tables with checkboxes, which builds a Google Sheets array range (formula). I am trying to copy this content to another sheet ... to make the required data available for Google Data Studio.
The contents of page studeertijden is NOT important. Let's say, a cell in 'legende-readme'!B39 returns a string with the required columns/data in a format like this:
{ studeertijden!A:A \ studeertijden!B:B}
If I put this in an empty sheet, by copy and paste, it works fine :
={ studeertijden!A:A \ studeertijden!B:B}
How can this be done automatically?
My first thought was to use INDIRECT ...
What I've tried (does NOT work):
Cell 'legende - readme'!B39 contains:
{ studeertijden!A:A \ studeertijden!B:B}
=indirect('legende - readme'!B39)
returns :
#REF! - It is not a valid cell/range reference.
={ indirect('legende - readme'!B39) }
returns :
#REF! - It is not a valid cell/range reference.
={'legende - readme'!B39}
returns : { studeertijden!A:A \ studeertijden!B:B}
Note: for European locales, use '\' [backslash] as the column separator instead of ',' [comma].
Assuming I've understood the question, if string doesn't need to start and end with curly brackets, then is this the behaviour you are looking for?
=arrayformula(transpose(split(byrow(transpose(split(string,",")),lambda(row,join(",",indirect(row)))),",")))
N.B. In my case I'm assuming that string is of the format 'studeertijden!A:A,studeertijden!B:B' (i.e. comma separated). So SPLIT by the comma to generate a column vector of references, TRANSPOSE to a row vector, INDIRECT each row (with the JOIN to return a single cell per row), ARRAYFORMULA/SPLIT to get back to multiple cells per row, TRANSPOSE back into columns like the original data.
This would be a lot easier if BYROW/BYCOL could return a matrix rather than being limited to just a row/column vector - the outer SPLIT and the JOIN in the BYROW wouldn't be needed. Over in Excel world they can also use arrays of thunks rather than string manipulation to deal with this limitation (which Excel also has), but Google Sheets doesn't seem to allow them when I've tried - see https://www.flexyourdata.com/blog/what-is-a-thunk-in-an-excel-lambda-function/ for more details.
Thanks to Natalia Sharashova of AbleBits, who provided this working solution (for the complete sheet).
referenceString is a reference to the cell holding the array string with all the wanted columns, e.g. { studeertijden!A:A \ studeertijden!B:B}
Note: for European locales, use ' \ ' [backslash] as the column separator instead of the default ' , ' [comma].
=REDUCE(
FALSE;
ArrayFormula(TRIM(SPLIT(REGEXREPLACE( referenceString ; "^=?{(.*?)}$"; "$1"); "\"; TRUE; TRUE)));
LAMBDA(accumulator; current_value;
IF(
accumulator = FALSE;
INDIRECT(current_value);
{ accumulator \ INDIRECT(current_value)}
)
)
)
={"1" , "2"}
={"1" \ "2"}
both are valid. it all depends on your locale settings
see: https://stackoverflow.com/a/73767720/5632629
with indirect it would be:
=INDIRECT("sheet1!A:B")
where you can build it dynamically for example:
=INDIRECT("sheet1!"& A1 &":"& B1)
where A1 contains a string like A or A1 (and same for B1)
another way how to construct range is with ADDRESS like:
=INDIRECT(ADDRESS(1; 2))
from another sheet it could be:
=INDIRECT(ADDRESS(1; 2;;; "sheet2"))
or like:
=INDIRECT("sheet2!"&ADDRESS(1; 2))
for a range we can do:
=INDIRECT("sheet2!"&ADDRESS(1; 2)&":"&ADDRESS(10; 3))

How to escape single quote in a string taken by YAML file using Sprintf

I'm using Go and having trouble with an array whose elements are single-quoted. I'm building a query string to write to a .sql file, and the array field comes out with double quotes instead of single quotes.
This is what I have:
Yaml File:
Name: 'Myname'
Age: 9
Dimensions: ['go', 'lang']
Go code:
var Query string = "";
Query = fmt.Sprintf("INSERT INTO persons (name, age, dimensions) VALUES ('%s', %d, %q)")
OUTPUT:
Query:
INSERT INTO persons (name, age, dimensions) VALUES ('MyName', 9, ["go", "lang"])
I don't want "go" and "lang" double-quoted; I want them as they came in the YAML file, with single quotes.
I thought "%q" was supposed to escape the single quote...
Any idea what to do?

Can't display unicode characters from file properly

I'm writing a script that operates on words from a number of files containing Unicode characters in the form something\u0142somethingelse.
I use Python 3, so I expected that after reading a line, \u0142 would be replaced by the 'ł' character, but it isn't. I get "something\u0142somethingelse" in the console.
After manually copying "bad" output from console and pasting it to: print("something\u0142somethingelse") it is displayed correctly.
Problematic part of the script:
list_of_files = ['test/stack.txt']
for file in list_of_files:
with open(file,'r') as fp:
for line in fp:
print(line)
print("something\u0142somethingelse")
stack.txt:
something\u0142somethingelse
Output:
something\u0142somethingelse
somethingłsomethingelse
I experimented with utf-8 encoding when opening this file and really I'm out of ideas...
I think you can do what you want with ast.literal_eval. This uses the same syntax as the Python interpreter to understand literals: like eval but safer. So this works, for example:
a = 'something\\u0142somethingelse'
import ast
b = ast.literal_eval('"' + a + '"')
print('"' + a + '"')
print(b)
The output should be:
"something\u0142somethingelse"
somethingłsomethingelse
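Applied to lines read from a file, the same idea might look like the sketch below. It assumes Python 3, uses io.StringIO as a stand-in for the real file (fake_file is a hypothetical name), and assumes the lines contain no unescaped double quotes or trailing backslashes, since those would make literal_eval raise an error.

```python
import ast
import io

# Stand-in for open('test/stack.txt'): the file holds a literal backslash-u sequence.
fake_file = io.StringIO("something\\u0142somethingelse\n")

decoded = []
for line in fake_file:
    text = line.rstrip("\n")              # drop the newline before wrapping in quotes
    # Wrap the raw text in quotes so it is parsed like a Python string literal.
    decoded.append(ast.literal_eval('"' + text + '"'))

for word in decoded:
    print(word)
```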

Tab based split on lines misses empty columns - Perl

I have a tab-separated text file. I read it line by line and column by column, make a few changes in each column, and write the line to a new file. When I read the columns using Perl's split function
my @aLastOldElements = split(/\t/, $_);
I lose the empty columns at the end. For example, if the file has 33 tab-separated columns and the last 10 are empty, split creates an array of size 23. I want to have all the columns, because otherwise the header of the file (33 columns) doesn't match the data (23 columns) and I get errors while writing the file to the database.
split accepts an optional third parameter for the maximum number of fields to return. If this is present, empty trailing fields will not be discarded:
perl -E '@arr = split(/ /, "foo bar            ", 100); say scalar @arr'
14
So long as the tabs to separate the empty fields at the end of the line are present, this should always give you 33 fields in the array, even if the last 10 are empty. (In my example, there are 14 fields returned because the string contains 13 separators, even though the specified limit was 100.)
Edit: In answer to the question in the first comment:
perl -wE '@arr = split(/\t/, "foo\tbar\t\thello\t", 100); say $_ || "(empty field)" for @arr'
foo
bar
(empty field)
hello
(empty field)
If you know that the columns should be there, whether or not they have any data, you can just ensure the result yourself.
my @aLastOldElements = split(/\t/, $_);
my $short_fall = 33 - @aLastOldElements;
if ( $short_fall > 0 ) {
push @aLastOldElements => ( '' ) x $short_fall;
}

Reading a text file in MATLAB line by line

I have a CSV file. I want to read it and do some pre-calculations on each row, for example to see whether the row is useful to me, and if so save it to a new CSV file.
Can someone give me an example?
In more detail, this is what my data looks like: (string, float, float). The numbers are coordinates.
ABC,51.9358183333333,4.183255
ABC,51.9353866666667,4.1841
ABC,51.9351716666667,4.184565
ABC,51.9343083333333,4.186425
ABC,51.9343083333333,4.186425
ABC,51.9340916666667,4.18688333333333
Basically, I want to save the rows whose distance is 50 or more to a new file. The string field should also be copied.
thanks
You could actually use xlsread to accomplish this. After first placing your sample data above in a file 'input_file.csv', here is an example for how you can get the numeric values, text values, and the raw data in the file from the three outputs from xlsread:
>> [numData,textData,rawData] = xlsread('input_file.csv')
numData = % An array of the numeric values from the file
51.9358 4.1833
51.9354 4.1841
51.9352 4.1846
51.9343 4.1864
51.9343 4.1864
51.9341 4.1869
textData = % A cell array of strings for the text values from the file
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
'ABC'
rawData = % All the data from the file (numeric and text) in a cell array
'ABC' [51.9358] [4.1833]
'ABC' [51.9354] [4.1841]
'ABC' [51.9352] [4.1846]
'ABC' [51.9343] [4.1864]
'ABC' [51.9343] [4.1864]
'ABC' [51.9341] [4.1869]
You can then perform whatever processing you need to on the numeric data, then resave a subset of the rows of data to a new file using xlswrite. Here's an example:
index = sqrt(sum(numData.^2,2)) >= 50; % Find the rows where the point is
% at a distance of 50 or greater
% from the origin
xlswrite('output_file.csv',rawData(index,:)); % Write those rows to a new file
If you really want to process your file line by line, a solution might be to use fgetl:
1. Open the data file with fopen
2. Read the next line into a character array using fgetl
3. Retrieve the data you need using sscanf on the character array you just read
4. Perform any relevant test
5. Output what you want to another file
6. Go back to point 2 if you haven't reached the end of your file.
Unlike the previous answer, this is not very much in the style of Matlab but it might be more efficient on very large files.
Hope this will help.
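The control flow of the fgetl approach can be sketched as follows. This is written in Python rather than MATLAB, purely to show the loop structure. The function name filter_rows, the in-memory streams, the XYZ row, and the test itself (distance from the origin of at least 50) are assumptions for illustration, not part of the original answer.

```python
import io
import math

def filter_rows(src, dst, threshold=50.0):
    """Copy lines whose (x, y) point lies at least `threshold` from the origin."""
    kept = 0
    for line in src:                        # read the next line until end of file
        fields = line.strip().split(",")    # pull out the fields
        if len(fields) != 3:
            continue                        # skip blank or malformed lines
        x, y = float(fields[1]), float(fields[2])
        if math.hypot(x, y) >= threshold:   # the test (assumed criterion)
            dst.write(line)                 # write the original line unchanged
            kept += 1
    return kept

# In-memory stand-ins for the input and output files; the XYZ row is made up.
sample = io.StringIO("ABC,51.9358183333333,4.183255\nXYZ,10.0,4.0\n")
out = io.StringIO()
kept = filter_rows(sample, out)
print(kept)
print(out.getvalue(), end="")
```

Only one line is held in memory at a time, which is the main reason to prefer this style for very large files.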
You cannot read text strings with csvread.
Here is another solution:
fid1 = fopen('test.csv','r'); %# open csv file for reading
fid2 = fopen('new.csv','w'); %# open new csv file
while ~feof(fid1)
line = fgets(fid1); %# read line by line
A = sscanf(line,'%*[^,],%f,%f'); %# sscanf can read only numeric data :(
if A(2)<4.185 %# test the values
fprintf(fid2,'%s',line); %# write the line to the new file
end
end
fclose(fid1);
fclose(fid2);
Just read it in to MATLAB in one block
fid = fopen('file.csv');
data=textscan(fid,'%s %f %f','delimiter',',');
fclose(fid);
You can then process it using logical addressing
ind50 = data{2}>=50 ;
ind50 is then a logical index of the rows where column 2 is greater than or equal to 50. So
data{1}(ind50)
will list all the strings for the rows of interest.
Then just use fprintf to write out your data to the new file
here is the doc to read a csv : http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvread.html
and to write : http://www.mathworks.com/access/helpdesk/help/techdoc/ref/csvwrite.html
EDIT
An example that works :
file.csv :
1,50,4.1
2,49,4.2
3,30,4.1
4,71,4.9
5,51,4.5
6,61,4.1
the code :
File = csvread('file.csv')
[m,n] = size(File)
index=1
temp=0
for i = 1:m
if (File(i,2)>=50)
temp = temp + 1
end
end
Matrix = zeros(temp, 3)
for j = 1:m
if (File(j,2)>=50)
Matrix(index,1) = File(j,1)
Matrix(index,2) = File(j,2)
Matrix(index,3) = File(j,3)
index = index + 1
end
end
csvwrite('outputFile.csv',Matrix)
and the output file result :
1,50,4.1
4,71,4.9
5,51,4.5
6,61,4.1
This probably isn't the best solution, but it works! We can read the CSV file, check the distance in each row, and save the matching rows to a new file.
Hope it will help!
