ClickHouse query with dictionary

I imported the ontime airlines dataset from here: https://clickhouse.com/docs/en/getting-started/example-datasets/ontime/
Then I created a dictionary mapping the two-character airline codes to company names, like this:
id,code,company
1,UA,United Airlines
2,HA,Hawaiian Airlines
3,OO,SkyWest
4,B6,Jetblue Airway
5,QX,Horizon Air
6,YX,Republic Airway
7,G4,Allegiant Air
...
I used this statement to create it, and it seems to work:
CREATE DICTIONARY airlinecompany
(
    id UInt64,
    code String,
    company String
)
PRIMARY KEY id
SOURCE(FILE(path '/var/lib/clickhouse/user_files/airlinenames.csv' format 'CSVWithNames'))
LAYOUT(FLAT())
LIFETIME(3600)
The main table (ontime) looks like this:
SELECT Reporting_Airline AS R_air
FROM ontime
GROUP BY R_air
LIMIT 4
┌─R_air─┐
│ UA    │
│ HA    │
│ OO    │
│ B6    │
└───────┘
What I want is a result that takes R_air's two-character code and looks it up in the airlinecompany dictionary to produce a mapping, i.e.
R_Air | Company
UA    | United Airlines
HA    | Hawaiian Airlines
OO    | SkyWest
...
But I can't seem to form this query correctly:
SELECT
Reporting_Airline AS R_Air,
dictGetString('airlinecompany', 'company', R_Air) AS company
FROM ontime
GROUP BY R_Air
Received exception from server (version 22.3.3):
Code: 6. DB::Exception: Received from localhost:9000. DB::Exception: Cannot parse string 'UA' as UInt64: syntax error at begin of string. Note: there are toUInt64OrZero and toUInt64OrNull functions, which returns zero/NULL instead of throwing exception.: while executing 'FUNCTION dictGetString('airlinecompany' :: 1, 'company' :: 2, Reporting_Airline :: 0) -> dictGetString('airlinecompany', 'company', Reporting_Airline) String : 4'. (CANNOT_PARSE_TEXT)
What am I missing? I don't know why it thinks UA is a UInt64.

Your dictionary uses LAYOUT(FLAT()), which is keyed by the numeric id column, so dictGetString tries to parse 'UA' as that UInt64 key. Key the dictionary by the string code instead, using LAYOUT = COMPLEX_KEY_HASHED:
CREATE DICTIONARY airlinecompany
(
    id UInt64,
    code String,
    company String
)
PRIMARY KEY code
SOURCE(FILE(path '/var/lib/clickhouse/user_files/airlinenames.csv' format 'CSVWithNames'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(3600)
SELECT dictGet('airlinecompany', 'company', tuple('UA'))
┌─dictGet('airlinecompany', 'company', tuple('UA'))─┐
│ United Airlines                                   │
└───────────────────────────────────────────────────┘
SELECT Reporting_Airline AS R_Air,
    dictGetString('airlinecompany', 'company', tuple(R_Air)) AS company
FROM ontime
GROUP BY R_Air
LIMIT 4;
┌─R_Air─┬─company───────────┐
│ B6    │ Jetblue Airway    │
│ G4    │ Allegiant Air     │
│ HA    │ Hawaiian Airlines │
│ OO    │ SkyWest           │
└───────┴───────────────────┘
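As a side note (my own variation, not from the original post): if some codes might be missing from the CSV, dictGetOrDefault lets you supply a fallback value of your choosing per query:

SELECT Reporting_Airline AS R_Air,
    dictGetOrDefault('airlinecompany', 'company', tuple(R_Air), 'Unknown') AS company
FROM ontime
GROUP BY R_Air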

Related

Neo4j error when running a MERGE statement, how can it be solved?

LOAD CSV WITH HEADERS FROM
'file:///epl_mataches.csv' as row
MATCH (c1:Club {name:row.`Team1`}), (c2:Club {name:row.`Team2`})
MERGE (c1) -[f:FEATURED{
round:toInteger(row.Round),
date:row.Date,
homeTeamFTScore: toInteger(split(row.FT,"-" [0])),
awayTeamFTScore: toInteger(split(row.FT,"-" [1])),
homeTeamHTScore: toInteger(split(row.HT,"-" [0])),
awayTeamHTScore: toInteger(split(row.HT,"-" [1]))
}] -> (c2)
The error appears when I try to create the relationships and pull the required information through from the data file.
Neo.ClientError.Statement.SyntaxError
Type mismatch: expected List<T> but was String (line 7, column 45 (offset: 248))
" homeTeamFTScore: toInteger(split(row.FT,"-" [0])),"
There is a typo in your script, so instead of
homeTeamFTScore: toInteger(split(row.FT,"-" [0])),
Use below
homeTeamFTScore: toInteger(split(row.FT,"-") [0])
Notice the closing parenthesis comes before [0], NOT after it.
For example:
RETURN toInteger(split("2-test","-") [0]) as sample
result:
╒════════╕
│"sample"│
╞════════╡
│2       │
└────────┘
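Applying the same fix to every score property, the full statement becomes:

LOAD CSV WITH HEADERS FROM
'file:///epl_mataches.csv' AS row
MATCH (c1:Club {name: row.`Team1`}), (c2:Club {name: row.`Team2`})
MERGE (c1)-[f:FEATURED {
    round: toInteger(row.Round),
    date: row.Date,
    homeTeamFTScore: toInteger(split(row.FT, "-")[0]),
    awayTeamFTScore: toInteger(split(row.FT, "-")[1]),
    homeTeamHTScore: toInteger(split(row.HT, "-")[0]),
    awayTeamHTScore: toInteger(split(row.HT, "-")[1])
}]->(c2)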

Merging column with array from multiple rows

I'm trying to merge the data from a dataset as follows:
id  | sms         | longDescription | OtherFields
123 | contentSms  | ContentDesc     | xxx
123 | contentSms2 | ContentDesc2    | xxx
123 | contentSms3 | ContentDesc3    | xxx
456 | contentSms4 | ContentDesc     | xxx
The sms and longDescription columns have the following structure:
sms: array
|-- element: struct
|   |-- content: string
|   |-- languageId: string
The aim is to group the rows with the same id and merge the sms and longDescription columns each into one array of structs (with languageId as the key):
id  | sms                                  | longDescription                          | OtherFields
123 | contentSms, contentSms2, contentSms3 | ContentDesc, ContentDesc2, ContentDesc3  | xxx
456 | contentSms4                          | ContentDesc                              | xxx
I've tried using
x = df.select("*").groupBy("id").agg(collect_list("sms"))
but the result is:
collect_list(longDescription): array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = true)
| | | |-- content: string (nullable = true)
| | | |-- languageId: string (nullable = true)
which is one array level too many; the goal is an array of structs, in order to have the following result:
sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]
You're looking for the flatten function:
x = df.groupBy("id").agg(flatten(collect_list("sms")))
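To merge both columns in one pass (a sketch of my own, assuming the column names above and the usual pyspark.sql.functions imports):

from pyspark.sql.functions import collect_list, flatten

# one row per id; each column's per-row arrays of structs are concatenated
x = df.groupBy("id").agg(
    flatten(collect_list("sms")).alias("sms"),
    flatten(collect_list("longDescription")).alias("longDescription"),
)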

Filename stripped of prefix in form data

I am sending files from JS to my Golang server:
for (var value of formData.values()) {
console.log(value);
}
// File {name: 'img/<hash_key>.png', lastModified: 1635043863231, lastModifiedDate: Sat Oct 23 2021 23:51:03 GMT-0300 (Brasilia Standard Time), webkitRelativePath: '', size: 969, …}
// ...
var request = new Request( serverEndpoint, { body: formData, method: "POST", ... })
return fetch(request).then(response => { ... })
In my Golang server, I am using the following code to handle the multipart form data from the request and read the files:
if err := r.ParseMultipartForm(32 << 20); err != nil {
...
}
for _, fileHeader := range r.MultipartForm.File["files"] {
...
}
I expected to read the files in Go with the same filenames, like img/<hash_key>.png, but my server is reading the multipart form into the following struct:
f = {*mime/multipart.Form | 0xc000426090}
├── Value = {map[string][]string}
└── File = {map[string][]*mime/multipart.FileHeader}
├── 0 = files -> len:1, cap:1
│ ├── key = {string} "files"
│ └── value = {[]*mime/multipart.FileHeader} len:1, cap:1
│ └── 0 = {*mime/multipart.FileHeader | 0xc000440000}
│ ├── Filename = {string} "<hash_key>.png" // notice how FileName is missing 'img/' prefix
│ └── ...
└── ...
I am trying to figure out how this happens and how to prevent the prefix from being stripped, as I need it to correctly resolve the upload path for my files.
Edit:
Closer inspection revealed that my server IS in fact getting the files with the correct name. After calling r.ParseMultipartForm(32 << 20), I get the following in r.Body.src.R.buf:
------WebKitFormBoundary1uanPdXqZeL8IPUH
Content-Disposition: form-data; name="files"; filename="img/upload.svg"
---- notice the img/ prefix
Content-Type: image/svg+xml
<svg height="512pt" viewBox= ...
However, r.MultipartForm.File["files"][0].Filename shows it as upload.svg
The directory is removed in Part.FileName():
// RFC 7578, Section 4.2 requires that if a filename is provided, the
// directory path information must not be used.
return filepath.Base(filename)
Work around Part.FileName() by parsing the Content-Disposition header directly:
for _, fileHeader := range r.MultipartForm.File["files"] {
	_, params, err := mime.ParseMediaType(fileHeader.Header.Get("Content-Disposition"))
	filename := params["filename"]
	if err != nil || filename == "" {
		// TODO: Handle unexpected Content-Disposition header
		// (missing header, parse error, missing param).
	}
	// filename keeps the full client-supplied path, e.g. "img/upload.svg"
}
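Since the filename now comes straight from the client, it is worth sanitizing it before resolving the destination on disk. A minimal sketch (my own addition; safeUploadPath and uploadsRoot are hypothetical names):

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// safeUploadPath maps a client-supplied name like "img/upload.svg" to a
// destination under uploadsRoot, rejecting path traversal attempts.
func safeUploadPath(uploadsRoot, clientName string) (string, error) {
	clean := filepath.Clean("/" + clientName) // ".." segments collapse against the leading "/"
	dest := filepath.Join(uploadsRoot, clean)
	if !strings.HasPrefix(dest, filepath.Clean(uploadsRoot)+string(os.PathSeparator)) {
		return "", fmt.Errorf("unsafe filename %q", clientName)
	}
	return dest, nil
}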

validating a list of input variables in terraform

This validation block works for a single input variable.
variable "mytestname" {
validation {
condition = length(regexall("test$", var.mytestname)) > 0
error_message = "Should end in test"
}
}
I need it to work inside a for_each, or some workaround to accomplish this. The issue is that there is a restriction on the condition statement: the condition HAS to reference the input variable itself (i.e., it cannot accept an each.value).
variable "mytestnames" {
listnames = split(",",var.mytestnames)
for_each = var.listnames
validation {
condition = length(regexall("test$", each.value)) > 0
error_message = "Should end in test"
}
}
The above snippet does not work. I need a way to iterate over a list of values and validate each of them. It looks like the newly introduced validation block does not work on lists of input variables. There must be a way to do this without a validation block?
I believe it will not work as written. The arguments that can be defined in a variable block are type, description, and default (plus validation itself), so we cannot define an additional argument such as listnames:
variable "mytestnames" {
listnames = split(",",var.mytestnames)
}
$ terraform validate
Error: Unsupported argument
on hoge.tf line 3, in variable "mytestnames":
3: listnames = split(",",var.mytestnames)
An argument named "listnames" is not expected here.
We can, however, validate in a loop:
variable "mytestnames" {
type = string
description = "comma separated list of names"
# default = "nametest,name1test,name2test"
default = "nametest,nametest1,nametest2"
validation {
condition = alltrue([
for n in split(",", var.mytestnames) :
can(regex("test$", n)) # can't use a local var 'can only refer to the variable itself'
])
error_message = "Should end in test" # can't use local var here either
}
}
│ Error: Invalid value for variable
│
│ on main.tf line 5:
│ 5: variable "mytestnames" {
│ ├────────────────
│ │ var.mytestnames is "nametest,nametest1,nametest2"
│
│ Should end in test
│
...but we can do better by using an output:
locals { name_regex = "test$" }

output "mytestnames_valid" {
  value = "ok" # we can output whatever we want

  precondition {
    condition = alltrue([
      for n in split(",", var.mytestnames) :
      can(regex(local.name_regex, n)) # in an output we can use a local var
    ])
    error_message = format("invalid names: %s",
      join(",", [
        for n in split(",", var.mytestnames) :
        n if !can(regex(local.name_regex, n)) # we can reference a local AND list the bad names
      ])
    )
  }
}
│ Error: Module output value precondition failed
│
│ on main.tf line 23, in output "mytestnames_valid":
│ 23: condition = alltrue([
│ 24: for n in split(",", var.mytestnames) :
│ 25: can(regex(local.name_regex, n)) # in an output we can use a local var
│ 26: ])
│ ├────────────────
│ │ local.name_regex is "test$"
│ │ var.mytestnames is "nametest,nametest1,nametest2"
│
│ invalid names: nametest1,nametest2
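As an aside (my own variation, not from the original answer): if the input can be declared as a list(string) rather than a comma-separated string, the same alltrue pattern applies directly in the validation block:

variable "mytestnames" {
  type        = list(string)
  description = "list of names, each ending in test"
  default     = ["nametest", "name1test"]

  validation {
    condition = alltrue([
      for n in var.mytestnames : can(regex("test$", n))
    ])
    error_message = "Every name should end in test"
  }
}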

How to properly extract an array<bigint> column from a Hive table in Spark?

I have a Hive table which has a column (c4) of type array<bigint>. Now I want to extract this column with Spark, so here is the code snippet:
val query = """select c1, c2, c3, c4 from
some_table where some_condition"""
val rddHive = hiveContext.sql(query).rdd.map{ row =>
//is there any other ways to extract wid_list(String here seems not work)
//no compile error and no runtime error
val w = if (row.isNullAt(3)) List() else row.getAs[scala.collection.mutable.WrappedArray[String]]("wid_list").toList
w
}
-> rddHive: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[7] at map at <console>:32
rddHive.map(x => x(0).getClass.getSimpleName).take(1)
-> Array[String] = Array(Long)
So, I extract c4 with getAs[scala.collection.mutable.WrappedArray[String]], while the original data type is array<bigint>. However, there is no compile error or runtime error, and the data I extracted is still of bigint (Long) type. So, what happened here (why is there no compile error or runtime error)? What is the proper way to extract array<bigint> as List[String] in Spark?
================== added more information ==================
hiveContext.sql(query).printSchema
root
|-- c1: string (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: string (nullable = true)
|-- c4: array (nullable = true)
| |-- element: long (containsNull = true)
hiveContext.sql(query).show(3)
+--------+----+----------------+--------------------+
|      c1|  c2|              c3|                  c4|
+--------+----+----------------+--------------------+
|   c1111|   1|5511798399.22222|[21772244666, 111...|
|   c1112|   1|5511798399.88888|[11111111, 111111...|
|   c1113|   2| 5555117114.3333|[77777777777, 112...|
+--------+----+----------------+--------------------+
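A likely explanation (my own, based on JVM type erasure, not an answer from the original thread): getAs[T] is just an unchecked cast, and the element type parameter String is erased at runtime, so casting the underlying WrappedArray[Long] to WrappedArray[String] succeeds; a ClassCastException would only surface later, when an element is actually used as a String. A sketch of a safer extraction, reading the elements with their true type and converting explicitly:

val rddHive = hiveContext.sql(query).rdd.map { row =>
  // read c4 with its real element type (Long), then convert each element to String
  if (row.isNullAt(3)) List.empty[String]
  else row.getAs[Seq[Long]]("c4").map(_.toString).toList
}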
