I've got a csv file with this content:
"Compañía","Aeropuerto Base","Año","Clase","Grupo Compañía","Mes","Movimiento","País","Servicio","Tipo Avión","Tipo Tráfico","Operaciones Totales"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE SCHENGEN","Total","","","","","","","0"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","UE NO SCHENGEN","Total","","","","","","","4"
"2 EXCEL AVIATION LTD","ADOLFO SUÁREZ MADRID-BARAJAS","2020","INTERNACIONAL","Total","","","","","","","2"
I've uploaded it to snowflake using the stage feature:
PUT 'file://C:\\tmp\\opc2020.csv' @demo_stage;
I've created a file format:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',';
If I try to query the content:
SELECT C.$1 FROM @demo_stage (file_format => 'demo_file_format') C
I get an error:
SQL Error [100144] [22000]: Invalid UTF8 detected in string '0xFF0xFE"0x00C0x00o0x00m0x00p0x00a0x000xF10x000xED0x00a0x00"0x00'
File 'opc2020.csv.gz', line 1, character 1
Row 1, column "TRANSIENT_STAGE_TABLE"["$1":1]
If I add the VALIDATE_UTF8 = FALSE attribute then I can query the stage, but the UTF-8 characters are lost and there is some unexpected whitespace between characters:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',' VALIDATE_UTF8 = FALSE;
��" C o m p a � � a " |
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
" 2 E X C E L A V I A T I O N L T D "
How can I solve this?
If the original file was generated using a character set other than UTF-8, you will run into this issue.
If you know the character set that was used to generate the file, you can set the ENCODING parameter in the FILE FORMAT statement to the correct value.
In your case, if the original file was created using UTF-16LE, your CREATE FILE FORMAT would look like:
CREATE OR REPLACE FILE FORMAT demo_file_format TYPE = 'CSV' field_delimiter = ',' ENCODING = 'UTF-16LE' VALIDATE_UTF8 = TRUE FIELD_OPTIONALLY_ENCLOSED_BY = '"';
More information on character sets and encoding is available in the Snowflake documentation.
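With the corrected file format in place, the stage query from the question should then return readable text. For example (a sketch reusing the stage and format names above):

SELECT C.$1, C.$2 FROM @demo_stage (file_format => 'demo_file_format') C;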
Input data:
name date
G
A 2011-01-21
A
B
C 2011-02-04
D
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
G
H
A
My desired output is a list of distinct values from name and date, discarding rows where the name is a duplicate and the date is blank:
name date
A 2011-01-21
B
C 2011-02-04
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
H
My awk code below produces this result:
awk 'BEGIN { FS=OFS="\t"}
NR==1 { print }
NR>1 { a[$1]++ }
NR>1 && $2!="" { b[$1]=$2 }
NR>1 && $2=="" { c[$1]=$2 }
END { for (i in a) {
if ( c[i] ) {print i,b[i]}
else {print i,b[i]}
}
}
' test.tsv
However, it shouldn't produce the desired result because in the event c[i] is empty, b[i] should be empty and it should give up. What am I missing here please?
Your c[i] check is useless; you print the same combination either way. You can simplify it a bit, and I think it will become clearer:
$ awk 'NR==1 {print; next}
{a[$1]=a[$1]==""?$2:a[$1]}
END {for(k in a) print k,a[k]}' file | column -t
name date
A 2011-01-21
B
C 2011-02-04
D 2011-03-26
E 2011-05-13
F 2011-02-20
G 2011-05-10
H
The mapping is only updated while the stored value is still blank, so this captures the first non-blank value for each key, if there is one.
Assuming the values are dates, you can replace the middle block with !a[$1]{a[$1]=$2}
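Spelled out in full, that shorter version might look like this (a sketch, assuming the same tab-separated test.tsv as above):

awk 'BEGIN { FS=OFS="\t" }
     NR==1 { print; next }                # keep the header line
     !a[$1] { a[$1]=$2 }                  # first non-blank date wins; a blank row still creates the key
     END { for (k in a) print k, a[k] }' test.tsv | column -t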
How do I determine when to use Set features versus Sequence features on a column, and what is the difference between them? Some examples would help.
I'm trying to use Ludwig to perform classification. My dataset looks something like below:
The letters here are for representational purposes only.
For example, a Feature1 value (for the word "alpha") could stand for ^al lph pha ha$ (character trigrams).
LABEL, Feature1, Feature2
X, A B C, D A E
X, B C K, K J L
Y, A D C, D A E
Y, B D E, J L R
My current feature definition looks like this:
name: Feature1_trigrams
type: set
level: words
encoder:
    representation: dense
    embedding_size: 10
    embeddings_on_cpu: false
    pretrained_embeddings: null
    embeddings_trainable: true
    dropout: false
    initializer: null
    regularize: true
    reduce_output: sqrt
    tied_weights: null
    cell_type: lstm
    bidirectional: true
    num_layers: 2
reduce_output: null
preprocessing:
    format: space
Should I be using Sequence instead?
I have up to 16 datasets (only using 8 for the example here), and I'm trying to sort them into groups of 4, where the datasets within each group are as closely matched as possible. (Using a VBA macro in Excel).
My aim was to iterate through every possible combination of groups of 4, comparing each one to the previous "best match" and overwriting it whenever the new combination is better matched.
I've got no problems comparing how well matched the groups are, but the code I have won't iterate through every possible combination.
My question is why doesn't this code work? And if there is a better solution please let me know.
For a = 1 To UBound(Whitelist) - 3
  For b = a + 1 To UBound(Whitelist) - 2
    For c = b + 1 To UBound(Whitelist) - 1
      For d = c + 1 To UBound(Whitelist)
        ' First group: the current ascending combination of four datasets
        TempGroups(1, 1) = a: TempGroups(1, 2) = b: TempGroups(1, 3) = c: TempGroups(1, 4) = d
        ' Second group: four datasets not already present in TempGroups
        For e = 1 To UBound(Whitelist) - 3
          If InArray(TempGroups, e) = False Then
            For f = e + 1 To UBound(Whitelist) - 2
              If InArray(TempGroups, f) = False Then
                For g = f + 1 To UBound(Whitelist) - 1
                  If InArray(TempGroups, g) = False Then
                    For h = g + 1 To UBound(Whitelist)
                      If InArray(TempGroups, h) = False Then
                        TempGroups(2, 1) = e: TempGroups(2, 2) = f: TempGroups(2, 3) = g: TempGroups(2, 4) = h
                        ' Keep this split if it is better matched than the current best in Groups
                        If HowClose(Differences, TempGroups, 1) + HowClose(Differences, TempGroups, 2) < HowClose(Differences, Groups, 1) + HowClose(Differences, Groups, 2) Then
                          For x = 1 To 4
                            For y = 1 To 4
                              Groups(x, y) = TempGroups(x, y)
                            Next y
                          Next x
                        End If
                      End If
                    Next h
                  End If
                Next g
              End If
            Next f
          End If
        Next e
      Next d
    Next c
  Next b
Next a
For reference, UBound(Whitelist) can be taken as 8 (the number of datasets I have to match).
TempGroups is an array I write each candidate combination to, so that it can be compared to...
Groups, the array which will contain the data sorted into matched groups.
The InArray function checks whether a value is already allocated to a group, since each dataset can only be in one group.
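(InArray itself isn't shown above; a minimal sketch of such a helper, inferred purely from how it is called here, might be:)

Function InArray(arr As Variant, val As Variant) As Boolean
    ' True if val already appears anywhere in the two-dimensional group array
    Dim i As Long, j As Long
    For i = LBound(arr, 1) To UBound(arr, 1)
        For j = LBound(arr, 2) To UBound(arr, 2)
            If arr(i, j) = val Then
                InArray = True
                Exit Function
            End If
        Next j
    Next i
    InArray = False
End Function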
Thanks in advance!
[Images from the original post: Datasets; Relatively Well Matched Data; Fairly Poorly Matched Data]
Is it possible to write a string stored in a list to a .csv file as a single cell?
I have a folder with files and I want to write the file names onto a .csv file.
Folder with files:
Data.txt
Data2.txt
Data3.txt
Here is my code:
import csv
import os
index = -1
filename = []
filelist = []
filelist = os.listdir("dirname")
f = csv.writer(open("output.csv", "ab"), delimiter=",", quotechar=" ", quoting=csv.QUOTE_MINIMAL)
for file in filelist:
    if (len(filelist) + index) < 0:
        break
    filename = filelist[len(filelist) + index]
    index -= 1
    f.writerow(filename)
Output I'm getting is one letter per cell in the .csv file:
A B C D E F G H I
1 D a t a . t x t
2 D a t a 2 . t x t
3 D a t a 3 . t x t
Desired output would be to have each name in one cell. There should be three rows in the csv file, with "Data.txt" in cell A1, "Data2.txt" in cell A2, and "Data3.txt" in cell A3:
A B
1 Data.txt
2 Data2.txt
3 Data3.txt
Is it possible to do this? Let me know if you need more information. I am currently using Python 2.7 on Windows 7.
Solution/Corrected Code:
import csv
import os
index = -1
filename = []
filelist = []
filelist = os.listdir("dirname")
f = csv.writer(open("output.csv", "ab"), delimiter=",", quotechar=" ", quoting=csv.QUOTE_MINIMAL)
for file in filelist:
    if (len(filelist) + index) < 0:
        break
    filename = filelist[len(filelist) + index]
    index -= 1
    f.writerow([filename])  # Wrap the string in a list: writerow expects a sequence (tuple/list) of strings
You can do this:
import csv
import os
filelist = os.listdir("dirname") # Use a real directory
f = csv.writer(open("output.csv", 'ab'), delimiter=",", quotechar=" ", quoting=csv.QUOTE_MINIMAL)
for file_name in filelist:
    f.writerow([file_name])
writerow expects a sequence, for example a list of strings. You're giving it a single string, which it then iterates over, causing you to see each letter of the string with a comma in between.
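To make the difference concrete, here is a minimal sketch (Python 2, with a hypothetical file name) comparing the two calls:

import csv

# Python 2: open csv output files in binary mode
with open("demo.csv", "wb") as out:
    w = csv.writer(out)
    w.writerow("Data.txt")    # a bare string is iterated character by character: D,a,t,a,.,t,x,t
    w.writerow(["Data.txt"])  # a one-element list puts the whole string in a single cell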
You can also put each value that you want in one cell inside a list, and put all the inner lists inside one outer list. Something like this:
import csv

list_ = [["value1"], ["value2"], ["value3"]]
with open('test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(list_)
I'm using the Gviz library from Bioconductor. I input a tab-delimited file containing CNV positions that I need to plot on my chromosome ideogram.
My input file is read into dat and has 4 columns:
[1] chromosome
[2] start
[3] end
[4] width (could be '+' or '-' depending on the orientation of the copy number)
So I did this:
library(IRanges)
library(Gviz)
gen <- "mm9"
chr <- "chr1"
itrack <- IdeogramTrack(genome = gen, chromosome = chr)
gtrack <- GenomeAxisTrack()
dat <- read.delim("C:/R/1ips_chr1.txt", header = FALSE, sep ="\t")
s <- dat[2]
e <- dat[3]
l <- dat[4]
I get an error message when I create the annotation track from dat:
atrack1 <- AnnotationTrack( start = s, width = l , chromosome = chr, genome = gen, name = "Sample1")
Error : function (classes, fdef, mtable) : unable to find an inherited method for function ".buildRange", for signature "NULL", "data.frame", "NULL", "data.frame"
Obviously the way I'm passing the input data (from dat) doesn't satisfy R. Can someone help me, please? :)
From the reference manual for the Gviz package (with which I am not familiar), the arguments start and width in the AnnotationTrack function need to be integer vectors. When you subset dat using the single square bracket [, the resulting object is a data.frame (see ?`[.data.frame` for more on this). Try instead
s <- dat[[2]]
e <- dat[[3]]
l <- dat[[4]]
to obtain integer vectors.
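For illustration, a minimal sketch of the difference using made-up data:

df <- data.frame(start = c(100L, 200L), width = c(50L, 60L))
class(df[1])    # "data.frame" -- single brackets keep the data.frame structure
class(df[[1]])  # "integer"    -- double brackets extract the underlying vector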