Make dictionary from a chart setup - arrays

I'm trying to use keys (pipe sizes) where each key has outlets of different sizes, and each outlet has its own set of values. I created dictionaries nested within a dictionary, which is called from a pickerView. But I can't access individual items from the nested dictionaries/arrays...
[Diagram: a tee fitting of size 1/2" with two outlets, 1/2" and 3/8"; each outlet has a C and an M value.]
I have already tried this setup with a structure, and I also tried individual dictionaries that all reference each other:
var out2 = ["1": [["1": [1.5, 1.5]], ["3/4": [1.5, 1.5]], ["1/2": [1.5, 1.5]]]]
This does work but I can't call individual pieces from it.
struct teeDict {
    // Swift dictionary keys can't carry tuple-style labels; use plain String
    // keys, with each outlet mapping to its [c, m] pair of Doubles.
    var out1: [String: [[String: [Double]]]] = [
        "1/2": [
            ["1/2": [1.0, 1.0]],
            ["3/8": [1.0, 1.0]]
        ]
    ]
    // e.g. first outlet of a 1/2" tee:
    // let outlet = teeDict().out1["1/2"]?[0]   // ["1/2": [1.0, 1.0]]
}
I tried to create a dictionary instead of using arrays, but nothing is recognized. How do I get this to work? Thanks.

Related

How to compare two arrays of string columns in PySpark

I want to compare two arrays and filter the data frame
condition_1 = AAA
condition_2 = ["AAA","BBB","CCC"]
My Spark data frame has a column with an array of strings
df = df.withColumn("array_column", F.lit(["XXX","YYY","AAA"]))
# to filter a string condition_1 with the array column
df = df.filter(
    F.col('array_column').isin(condition_1) &
    # second filter here
)
But how can I filter condition_2 in a similar way, since they are both arrays?
Code I tried:
df = df.filter(
    F.col('array_column').isin(condition_1) &
    any(x in condition_2 for x in F.col('array_column'))
)
But I get an error - Column is not iterable.
I also tried - bool(set(F.col('array_column')).intersection(condition_2))
But I still get the same error. Can anyone help me with this?
Hope I got your question right; it wasn't entirely clear. Use PySpark's array functions.
Data
condition_1 = 'AAA'
condition_2 = ["AAA","BBB","CCC"]
df = spark.createDataFrame([('1A', '3412asd', 'value-1', ['XXX', 'YYY', 'AAA']),
                            ('2B', '2345tyu', 'value-2', ['DDD', 'YFFFYY', 'GGG']),
                            ('3C', '9800bvd', 'value-3', ['AAA']),
                            ('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
                           ('ID', 'Company_Id', 'value', 'array_column'))
df.show()
+---+----------+-------+------------------+
| ID|Company_Id| value| array_column|
+---+----------+-------+------------------+
| 1A| 3412asd|value-1| [XXX, YYY, AAA]|
| 2B| 2345tyu|value-2|[DDD, YFFFYY, GGG]|
| 3C| 9800bvd|value-3| [AAA]|
| 3C| 9800bvd|value-1| [AAA, YYY, CCCC]|
+---+----------+-------+------------------+
Code
from pyspark.sql.functions import array, array_contains, array_intersect, col, lit, size

df.where(
    array_contains(col('array_column'), lit(condition_1))
    & (size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0)
).show(truncate=False)
Outcome
+---+----------+-------+----------------+
|ID |Company_Id|value |array_column |
+---+----------+-------+----------------+
|1A |3412asd |value-1|[XXX, YYY, AAA] |
|3C |9800bvd |value-3|[AAA] |
|3C |9800bvd |value-1|[AAA, YYY, CCCC]|
+---+----------+-------+----------------+
How it works
condition_1: get a boolean selection of whether the column contains the string:
array_contains(col('array_column'), lit(condition_1))
condition_2: this happens in stages.
Intersect the column with the list:
array_intersect(col('array_column'), array([lit(x) for x in condition_2]))
Get the size of the outcome of the intersection above:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2])))
Check that the intersection contains at least one item:
size(array_intersect(col('array_column'), array([lit(x) for x in condition_2]))) != 0
Finally, chain condition_1 and condition_2 with the & operator and pass the result into the df.where() or df.filter() method.
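As a side note (my sketch, not part of the original answer): Spark 2.4+ also ships an arrays_overlap function that expresses the condition_2 check more directly, as an alternative to the array_intersect/size pair:
from pyspark.sql.functions import array, array_contains, arrays_overlap, col, lit

# arrays_overlap is true when the two arrays share at least one non-null element
df.where(
    array_contains(col('array_column'), lit(condition_1))
    & arrays_overlap(col('array_column'), array(*[lit(x) for x in condition_2]))
).show(truncate=False)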

How to create a new array of substrings from string array column in a spark dataframe

I have a Spark dataframe. One of the columns is an array of text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique left 8 characters of those strings.
df.printSchema()
root
(...)
|-- arr_agent: array (nullable = true)
| |-- element: string (containsNull = true)
example data from column "arr_agent":
["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF" ]
What I need to have in the new column:
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF" ]
I already tried to define a udf that does it for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T
def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return out_arr

make_list_of_unique_prefixes_udf = F.udf(
    lambda x, y=8: make_list_of_unique_prefixes(x, y),
    T.ArrayType(T.StringType()),
)
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
Throws an error
AnalysisException: grouping expressions sequence is empty,
Any tips would be appreciated.
Thanks.
You can solve this with the higher-order functions available in Spark 2.4+: transform combined with substring, followed by array_distinct:
from pyspark.sql import functions as F
n = 8
out = df.withColumn("New",F.expr(f"array_distinct(transform(arr_agent,x->substring(x,0,{n})))"))
out.show(truncate=False)
+-----------------------------------------------------+----------------------------------------+
|arr_agent |New |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF] |[MQBFDEFF] |
+-----------------------------------------------------+----------------------------------------+
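On Spark 3.1+, the same idea can also be written without an expr string, using the Python-level transform function; a sketch assuming the same df as above (note that substring positions are 1-based):
from pyspark.sql import functions as F

n = 8
# transform applies the lambda to every array element; array_distinct dedupes
out = df.withColumn(
    "New",
    F.array_distinct(F.transform("arr_agent", lambda x: x.substr(1, n))),
)
out.show(truncate=False)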

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
---------|-------------------------------
ID       | array_list
---------|-------------------------------
38292786 | [AAA,, JLT]
38292787 | [DFG]
38292788 | [SHJ, QKJ, AAA, YTR, CBM]
38292789 | [DUY, ANK, QJK, POI, CNM, ADD]
38292790 | []
38292791 | []
38292792 | [,,, HKJ]
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which uses a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)

DataFrame.transform = transform

df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error, but it produces nonsense. I think I just don't know the correct way to identify the missing elements of the array. Can anyone help me out?
In production our data has around 10 million rows, and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid both, but I need the code to run as efficiently as possible). I'm using Spark 2.4.4.
This is what I would like my output to look like:
---------|--------------------------------|--------------------------------
ID       | array_list                     | array_list2
---------|--------------------------------|--------------------------------
38292786 | [AAA,, JLT]                    | [AAA, ZZZ, JLT]
38292787 | [DFG]                          | [DFG]
38292788 | [SHJ, QKJ, AAA, YTR, CBM]      | [SHJ, QKJ, AAA, YTR, CBM]
38292789 | [DUY, ANK, QJK, POI, CNM, ADD] | [DUY, ANK, QJK, POI, CNM, ADD]
38292790 | []                             | [ZZZ]
38292791 | []                             | [ZZZ]
38292792 | [,,, HKJ]                      | [ZZZ, ZZZ, ZZZ, HKJ]
regexp_replace works at the character level, which is why the attempt above produces nonsense.
I could not get it to work with transform either, but with help from the first answerer I used a UDF - not that easy.
Here is my example with my data; you can tailor it.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)

arrayData = [
    ('James', ['Java', 'Scala']),
    ('Michael', ['Spark', 'Java', None]),
    ('Robert', ['CSharp', '']),
    ('Washington', None),
    ('Jefferson', ['1', '2'])]

df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
This was quite tough; I had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values into a string with a , delimiter (hoping you don't have commas in your strings, but you can use something else), using the null_replacement option to fill the null values. Then I split on the same delimiter.
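To see the mechanics on a single row, here is a small sketch (my illustration, not part of the original answer):
from pyspark.sql import functions as F

demo = spark.createDataFrame([(['AAA', None, 'JLT'],)], ['array_list'])
demo.select(
    F.array_join('array_list', ',', 'ZZZ').alias('joined'),             # "AAA,ZZZ,JLT"
    F.split(F.array_join('array_list', ',', 'ZZZ'), ',').alias('out'),  # [AAA, ZZZ, JLT]
).show(truncate=False)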
EDIT: Based on #thebluephantom comment, you can try this solution :
df.withColumn(
    "array_list_2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
The SQL built-in transform is not working for me locally, so I couldn't test it, but hopefully you'll get the result you wanted.
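One caveat worth noting (my sketch, not from the original answers): coalesce fills the nulls inside an array, but the empty-array rows, which per the desired output above should become [ZZZ], need a separate step. A sketch that covers both cases, assuming Spark 2.4+:
from pyspark.sql import functions as F

df = df.withColumn(
    "array_list2", F.expr("transform(array_list, x -> coalesce(x, 'ZZZ'))")
).withColumn(
    "array_list2",
    # an empty array has size 0; replace it with a one-element default
    F.when(F.size("array_list2") == 0, F.array(F.lit("ZZZ")))
     .otherwise(F.col("array_list2")),
)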

convert JSON text string to Pandas, but each row cell ends up as an array of values inside

I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different price bands comes in as a text string, and I can't seem to extract the values properly.
The JSON data I extract is shown in the second answer below.
I convert them into a pandas dataframe using this code
data = pd.DataFrame(r.json()['prices'])
and each price cell (closePrice, highPrice, etc.) ends up holding a set of bid/ask/lastTraded values rather than a single number.
I need to extract (for example) the data in the closePrice column, so that I can do data analysis and cleansing on it.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) convert the JSON to a DataFrame such that the prices within closePrice, bidPrice, etc. are extracted into individual columns, OR
b) if they are already saved in the DataFrame, extract the values within them, so that I can get the individual prices (e.g. the bid, ask and lastTraded) out of each cell?
A relatively brute-force way, using links from other Stack Overflow posts.
import requests
import pandas as pd

# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])

# convert the object list into individual columns
dat2 = pd.DataFrame()
dat2[['bidC', 'askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index=dat1.index)
dat2[['bidH', 'askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index=dat1.index)
dat2[['bidL', 'askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index=dat1.index)
dat2[['bidO', 'askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index=dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
This gets the equivalent flattened table.
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |
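As an optional follow-up (my sketch, not part of the original answer): the dotted column names that json_normalize produces can be renamed, and the timestamp parsed, for easier analysis; the short names here are made up for illustration:
import pandas as pd

df = pd.json_normalize(data['prices'])
df['snapshotTimeUTC'] = pd.to_datetime(df['snapshotTimeUTC'])
# rename the flattened columns to shorter labels (extend for the other bands)
df = df.rename(columns={'closePrice.ask': 'askC', 'closePrice.bid': 'bidC'})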

Implementation of Array data type in OCaml

I have very little knowledge about OCaml as a whole and just received an assignment to take one of the source files in the project and allow it to take a new data type (Array). I am not asking for someone to solve this problem for me, but instead I would appreciate someone walking me through this code. I would also appreciate any input on how difficult it is going to be to implement this new data type.
The file itself lacks much documentation, which doesn't make it any easier.
(* * Types (hashconsed) *)
(* ** Imports *)
open Abbrevs
open Util
(* ** Identifiers *)
module Lenvar : (module type of Id) = Id
module Tysym : (module type of Id) = Id
module Groupvar : (module type of Id) = Id
module Permvar : (module type of Id) = Id
(* ** Types and type nodes *)
type ty = {
  ty_node : ty_node;
  ty_tag  : int
}
and ty_node =
  | BS of Lenvar.id
  | Bool
  | G of Groupvar.id
  | TySym of Tysym.id
  | Fq
  | Prod of ty list
  | Int
(* ** Equality, hashing, and hash consing *)
let equal_ty : ty -> ty -> bool = (==)
let hash_ty t = t.ty_tag
let compare_ty t1 t2 = t1.ty_tag - t2.ty_tag
module Hsty = Hashcons.Make (struct
  type t = ty

  let equal t1 t2 =
    match t1.ty_node, t2.ty_node with
    | BS lv1, BS lv2       -> Lenvar.equal lv1 lv2
    | Bool, Bool           -> true
    | G gv1, G gv2         -> Groupvar.equal gv1 gv2
    | TySym ts1, TySym ts2 -> Tysym.equal ts1 ts2
    | Fq, Fq               -> true
    | Prod ts1, Prod ts2   -> list_eq_for_all2 equal_ty ts1 ts2
    | _                    -> false

  let hash t =
    match t.ty_node with
    | BS lv    -> hcomb 1 (Lenvar.hash lv)
    | Bool     -> 2
    | G gv     -> hcomb 3 (Groupvar.hash gv)
    | TySym gv -> hcomb 4 (Tysym.hash gv)
    | Fq       -> 5
    | Prod ts  -> hcomb_l hash_ty 6 ts
    | Int      -> 7

  let tag n t = { t with ty_tag = n }
end)
(** Create [Map], [Set], and [Hashtbl] modules for types. *)
module Ty = StructMake (struct
  type t = ty
  let tag = hash_ty
end)

module Mty = Ty.M
module Sty = Ty.S
module Hty = Ty.H
(* ** Constructor functions *)
let mk_ty n = Hsty.hashcons {
  ty_node = n;
  ty_tag  = (-1)
}

let mk_BS lv    = mk_ty (BS lv)
let mk_G gv     = mk_ty (G gv)
let mk_TySym ts = mk_ty (TySym ts)
let mk_Fq       = mk_ty Fq
let mk_Bool     = mk_ty Bool
let mk_Int      = mk_ty Int

let mk_Prod tys =
  match tys with
  | [t] -> t
  | _   -> mk_ty (Prod tys)
(* ** Indicator and destructor functions *)
let is_G ty = match ty.ty_node with
  | G _ -> true
  | _   -> false

let is_Fq ty = match ty.ty_node with
  | Fq -> true
  | _  -> false

let is_Prod ty = match ty.ty_node with
  | Prod _ -> true
  | _      -> false

let destr_G_exn ty =
  match ty.ty_node with
  | G gv -> gv
  | _    -> raise Not_found

let destr_BS_exn ty =
  match ty.ty_node with
  | BS lv -> lv
  | _     -> raise Not_found

let destr_Prod_exn ty =
  match ty.ty_node with
  | Prod ts -> ts
  | _       -> raise Not_found

let destr_Prod ty =
  match ty.ty_node with
  | Prod ts -> Some ts
  | _       -> None
(* ** Pretty printing *)
let pp_group fmt gv =
  if Groupvar.name gv = ""
  then F.fprintf fmt "G"
  else F.fprintf fmt "G_%s" (Groupvar.name gv)

let rec pp_ty fmt ty =
  match ty.ty_node with
  | BS lv    -> F.fprintf fmt "BS_%s" (Lenvar.name lv)
  | Bool     -> F.fprintf fmt "Bool"
  | Fq       -> F.fprintf fmt "Fq"
  | TySym ts -> F.fprintf fmt "%s" (Tysym.name ts)
  | Prod ts  -> F.fprintf fmt "(%a)" (pp_list " * " pp_ty) ts
  | Int      -> F.fprintf fmt "Int"
  | G gv     ->
    if Groupvar.name gv = ""
    then F.fprintf fmt "G"
    else F.fprintf fmt "G_%s" (Groupvar.name gv)
It's hard to walk through this code because quite a bit is missing (definitions of Id, Hashcons, StructMake). But in general this code manipulates data structures that represent types.
You can read about hash consing here: https://en.wikipedia.org/wiki/Hash_consing (which is what I just did myself). In essence it is a way of maintaining a canonical representation for a set of structures (in this case, structures representing types) so that two structures that are equal (having constituents that are equal) are represented by the identical value. This allows constant-time comparison for equality. When you do this with strings, it's called "interning" (a technique from Lisp I've used many times).
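As a toy illustration of the general idea (a Python sketch of hash consing, my illustration rather than a rendering of this OCaml code): a table maps structurally equal values to a single canonical instance, so equality checks reduce to identity checks:
_table = {}

def hashcons(node):
    # node is a hashable description of a structure, e.g. ('Prod', ('Fq', 'Bool'))
    return _table.setdefault(node, node)

a = hashcons(('Prod', ('Fq', 'Bool')))
b = hashcons(('Prod', ('Fq', 'Bool')))
assert a is b  # structurally equal values share one representative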
To add arrays, you need to know whether the array type will include the length of the array or just the type of its elements. The semi-mysterious type BS seems to include a length, which suggests you may want to include the length in your representation.
If I were doing this project I would look for every occurrence of Prod (which represents tuples) and add a type representing Array in an analogous way. Instead of a list of constituent types (as for a tuple), you would have one constituent type and (I would guess) a variable representing the length of the array.
Before starting out, I'd look for some documentation, for one thing on what BS represents. I also have no idea what "groups" are, but maybe you can worry about that later.
Update
Here's what I mean by copying Prod. Keep in mind that I am basing this almost entirely on guesswork. So, many details (or even the entire idea) could be wrong.
The current definition of a type looks like this:
and ty_node =
  | BS of Lenvar.id
  | Bool
  | G of Groupvar.id
  | TySym of Tysym.id
  | Fq
  | Prod of ty list
  | Int
If you add a representation for Array after Prod you get something like this:
and ty_node =
  | BS of Lenvar.id
  | Bool
  | G of Groupvar.id
  | TySym of Tysym.id
  | Fq
  | Prod of ty list
  | Array of Lenvar.id * ty
  | Int
You would then go through the rest of the code adding support for the new Array variant. The compiler will help you find many of the places that need fixing.
