Spark: Removing first array from Nested Array in Scala - arrays

I have a DataFrame with 2 columns.I want to remove first array of the nested array in every record. Example :- I have a DF like this
+---+-------+--------+-----------+-------------+
|id |arrayField |
+---+------------------------------------------+
|1 |[[Akash,Kunal],[Sonu,Monu],[Ravi,Kishan]] |
|2 |[[Kunal, Mrinal],[Priya,Diya]] |
|3 |[[Adi,Sadi]] |
+---+-------+---------+----------+-------------+
and I want my output like this:-
+---+-------+------+------+-------+
|id |arrayField |
+---+-----------------------------+
|1 |[[Sonu,Monu],[Ravi,Kishan]] |
|2 |[[Priya,Diya]] |
|3 | null |
+---+-------+------+------+-------+

From Spark-2.4 use slice function.
Example:
df.show(10,false)
/*
+------------------------+
|arrayField |
+------------------------+
|[[A, k], [s, m], [R, k]]|
|[[k, M], [c, z]] |
|[[A, b]] |
+------------------------+
*/
import org.apache.spark.sql.functions._
df.withColumn("sliced",expr("slice(arrayField,2,size(arrayField))")).
withColumn("arrayField",when(size(col("sliced"))==0,lit(null)).otherwise(col("sliced"))).
drop("sliced").
show()
/*
+----------------+
| arrayField|
+----------------+
|[[s, m], [R, k]]|
| [[c, z]]|
| null|
+----------------+
*/

Related

DataFrame column (Array type) contains Null values and empty array (len =0). How to convert Null to empty array?

I've Spark DataFrame with a Array column (StringType)
Sample DataFrame:
df = spark.createDataFrame([
[None],
[[]],
[['foo']]
]).toDF("a")
Current Output:
+-----+
| a|
+-----+
| null|
| []|
|[foo]|
+-----+
Desired Output:
+-----+
| a|
+-----+
| []|
| []|
|[foo]|
+-----+
I need to convert the Null values to an empty Array to concat with another array column.
Already tried this, but it's not working
df.withColumn("a",F.coalesce(F.col("a"),F.from_json(F.lit("[]"), T.ArrayType(T.StringType()))))
Convert null values to empty array in Spark DataFrame
Use array function.
df = spark.createDataFrame([
[None],
[[]],
[['foo']]
]).toDF("a")
import pyspark.sql.functions as F
df.withColumn('a', F.coalesce(F.col('a'), F.array(F.lit(None)))).show(10, False)
+-----+
|a |
+-----+
|[] |
|[] |
|[foo]|
+-----+
The result is now array(string), so there is no null value. Please check the results.
temp = spark.sql("SELECT a FROM table WHERE a is NULL")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table WHERE a = array(NULL)")
temp.show(10, False)
temp = spark.sql("SELECT a FROM table")
temp.show(10, False)
+---+
|a |
+---+
+---+
+---+
|a |
+---+
|[] |
+---+
+-----+
|a |
+-----+
|[] |
|[] |
|[foo]|
+-----+

Get the most common element of an array using Pyspark

How can I get the most common element of an array after concatenating two columns using Pyspark
df = spark.createDataFrame([
[['a','a','b'],['a']],
[['c','d','d'],['']],
[['e'],['e','f']],
[[''],['']]
]).toDF("arr_1","arr2")
df_new = df.withColumn('arr',F.concat(F.col('arr_1'),F.col('arr_2'))
expected output:
+------------------------+
| arr | arr_1 | arr_2 |
+------------------------+
| [a] | [a,a,b] | [a] |
| [d] | [c,d,d] | [] |
| [e] | [e] | [e,f] |
| [] | [] | [] |
+------------------------+
Try it
df1 = df.select('arr_1','arr_2',monotonically_increasing_id().alias('id'),concat('arr_1','arr_2').alias('arr'))
df1.select('id',explode('arr')).\
groupBy('id','col').count().\
select('id','col','count',rank().over(Window.partitionBy('id').orderBy(desc('count'))).alias('rank')).\
filter(col('rank')==1).\
join(df1,'id').\
select(col('col').alias('arr'), 'arr_1', 'arr_2').show()
+---+---------+------+
|arr| arr_1| arr_2|
+---+---------+------+
| a|[a, a, b]| [a]|
| | []| []|
| e| [e]|[e, f]|
| d|[c, d, d]| []|
+---+---------+------+
You can explode the array then by doing group by count, Window we can get the most occurring element.
Example:
df = spark.createDataFrame([
[['a','a','b'],['a']],
[['c','d','d'],['']],
[['e'],['e','f']],
[[''],['']]
]).toDF("arr_1","arr_2")
df_new = df.withColumn('arr_concat',concat(col('arr_1'),col('arr_2')))
from pyspark.sql.functions import *
from pyspark.sql import *
df1=df_new.withColumn("mid",monotonically_increasing_id())
df2=df1.selectExpr("explode(arr_concat) as arr","mid").groupBy("mid","arr").agg(count(lit("1")).alias("cnt"))
w=Window.partitionBy("mid").orderBy(desc("cnt"))
df3=df2.withColumn("rn",row_number().over(w)).filter(col("rn") ==1).drop(*["rn","cnt"])
df3.join(df1,['mid'],'inner').drop(*['mid','arr_concat']).withColumn("arr",array(col("arr"))).show()
#+---+---------+------+
#|arr| arr_1| arr_2|
#+---+---------+------+
#|[d]|[c, d, d]| []|
#|[e]| [e]|[e, f]|
#|[a]|[a, a, b]| [a]|
#| []| []| []|
#+---+---------+------+

why `execv` can't use implicit convert from char** to char* const*?

Consider the following code:
#include <stdio.h>
#include <unistd.h>
void foo(char * const arg[]) {
printf("success\n");
}
int main() {
char myargs[2][64] = { "/bin/ls", NULL };
foo(myargs);
execv(myargs[0], myargs);
return 0;
}
Both foo and execv require char * const * argument, but while my foo works (I get success in the output) the system call execv fails.
I would like to know why. Does this have something to do with the implementation of execv?
Also, assuming I have a char** variable - how can I send it to execv?
A two-dimensional array looks like this:
char myargs[2][16];
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
I reduced the size from 64 to 16 to keep the diagram from being annoyingly big.
With an initializer, it can look like this:
char myargs[2][16] = { "/bin/ls", "" }
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| /| b| i| n| /| l| s|\0| | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|\0| | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Notice I didn't try to put a null pointer in the second row. It doesn't make sense to do that, since that's an array of chars. There's no place in it for a pointer.
The rows are contiguous in memory, so if you look at a lower level, it's actually more like this:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| /| b| i| n| /| l| s|\0| | | | | | | | |\0| | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
When you pass myargs to a function, the famous "array decay" produces a pointer. That looks like this:
void foo(char (*arg)[16]);
...
char myargs[2][16] = { "/bin/ls", "" }
foo(myargs);
+-----------+ +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| POINTER==|===>| /| b| i| n| /| l| s|\0| | | | | | | | |\0| | | | | | | | | | | | | | | |
+-----------+ +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The pointer is arg contains a value which locates the beginning of the array. Notice there is no pointer pointing to the second row. If foo wants to find the value in the second row, it needs to know how big the rows are so it can break down this:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| /| b| i| n| /| l| s|\0| | | | | | | | |\0| | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
into this:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| /| b| i| n| /| l| s|\0| | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|\0| | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
That's why arg must be char (*arg)[16] and not char **arg or the equivalent char *arg[].
The exec family of functions doesn't work with this data layout. It wants this:
+-----------+ +-----------+-----------+
| POINTER==|===>| POINTER | NULL |
+-----------+ +-----|-----+-----------+
|
/----------------------/
|
|
| +--+--+--+--+--+--+--+--+
\--->| /| b| i| n| /| l| s|\0|
+--+--+--+--+--+--+--+--+
And when you want to add more arguments, it wants this:
+-----------+ +-----------+-----------+- -+-----------+
| POINTER==|===>| POINTER | POINTER | ... | NULL |
+-----------+ +-----|-----+-----|-----+- -+-----------+
| |
/----------------------/ |
| |
| /--------------------------------/
| |
| |
| | +--+--+--+--+--+--+--+--+
\-+->| /| b| i| n| /| l| s|\0|
| +--+--+--+--+--+--+--+--+
|
| +--+--+--+--+--+--+
\->| /| h| o| m| e|\0|
+--+--+--+--+--+--+
If you compare this to the two-dimensional array diagram, hopefully you can understand why this can't be an implicit conversion. It actually involves moving stuff around in memory.
Both foo and execv require char * const * argument,
Yes.
but while my foo works (I get success in the output) the system call execv fails.
Getting the output you expect does not prove that your code is correct. The call exhibits undefined behavior because its argument does not match the parameter type, but it is plausible that that has little practical effect because the implementation of foo() does not use the parameter in any way. More generally, your code could, in principle, exhibit absolutely any behavior at all, because that's what "undefined" means.
I would like to know why. Does this have something to do with the implementation of execv?
From the standard's perspective, both calls exhibit equally undefined behavior. As a practical matter, however, we know that execv does use its arguments, so it would be much more surprising for that call to produce the behavior you expected than it is for the call to foo to produce the behavior you expected.
The main problem is that 2D arrays are arrays of arrays, and arrays are not pointers. Thus, your 2D array myargs does not at all have the correct type for an argument to either function.
Also, assuming I have a char** variable - how can I send it to execv?
You do not have such a variable in your code, but if you did have, you could cast it to the appropriate type:
char *some_args[] = { "/bin/ls", NULL };
execv((char * const *) some_args);
In practice, most compilers would probably accept it if you omitted the cast, too, although the standard does require it. Best would be to declare a variable that has the correct type in the first place:
char * const correct_args[] = { "/bin/ls", NULL };
execv(correct_args);
Note also that although arrays are not pointers, they are converted to pointers in most contexts -- which I use in the example code -- but only the top level. An array of arrays thus "decays" to a pointer to an array, not a pointer to a pointer.

Apache pyspark How to create a column with array containing n elements

I have a dataframe with 1 column of type integer.
I want to create a new column with an array containing n elements (n being the # from the first column)
For example:
x = spark.createDataFrame([(1,), (2,),],StructType([ StructField("myInt", IntegerType(), True)]))
+-----+
|myInt|
+-----+
| 1|
| 2|
| 3|
+-----+
I need the resulting data frame to look like this:
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Note, It doesn't actually matter what the values inside of the arrays are, it's just the count that matters.
It'd be fine if the resulting data frame looked like this:
+-----+------------------+
|myInt| myArr|
+-----+------------------+
| 1| [item]|
| 2| [item, item]|
| 3|[item, item, item]|
+-----+------------------+
It is preferable to avoid UDFs if possible because they are less efficient. You can use array_repeat instead.
import pyspark.sql.functions as F
x.withColumn('myArr', F.array_repeat(F.col('myInt'), F.col('myInt'))).show()
+-----+---------+
|myInt| myArr|
+-----+---------+
| 1| [1]|
| 2| [2, 2]|
| 3|[3, 3, 3]|
+-----+---------+
Use udf:
from pyspark.sql.functions import *
#udf("array<int>")
def rep_(x):
return [x for _ in range(x)]
x.withColumn("myArr", rep_("myInt")).show()
# +-----+------+
# |myInt| myArr|
# +-----+------+
# | 1| [1]|
# | 2|[2, 2]|
# +-----+------+

Could anyone explain this strange behaviour of appending to golang slices

The program below has unexpected output.
func main(){
s:=[]int{5}
s=append(s,7)
s=append(s,9)
x:=append(s,11)
y:=append(s,12)
fmt.Println(s,x,y)
}
output: [5 7 9] [5 7 9 12] [5 7 9 12]
Why is the last element of x 12?
A slice is only a window over part of an array, it has no specific storage.
This means that if you have two slices over the same part of an array, both slices must "contain" the same values.
Here's exactly what happens here :
When you do the first append, you get a new slice of size 2 over an underlying array of size 2.
When you do the next append, you get a new slice of size 3 but the underlying array is of size 4 (append usually allocates more space than the immediately needed one so that it doesn't need to allocate at every append).
This means the next append doesn't need a new array. So x and y both will use the same underlying array as the precedent slice s. You write 11 and then 12 in the same slot of this array, even if you get two different slices (remember, they're just windows).
You can check that by printing the capacity of the slice after each append :
fmt.Println(cap(s))
If you want to have different values in x and y, you should do a copy, for example like this :
s := []int{5}
s = append(s, 7)
s = append(s, 9)
x := make([]int,len(s))
copy(x,s)
x = append(x, 11)
y := append(s, 12)
fmt.Println(s, x, y)
Another solution here might have been to force the capacity of the array behind the s slice to be not greater than the needed one (thus ensuring the two following append have to use a new array) :
s := []int{5}
s = append(s, 7)
s = append(s, 9)
s = s[0:len(s):len(s)]
x := append(s, 11)
y := append(s, 12)
fmt.Println(s, x, y)
See also Re-slicing slices in Golang
dystroy explained it very well. I like to add a visual explanation to the behaviour.
A slice is only a descriptor of an array segment. It consists of a pointer to the array (ptr), the length of the segment (len), and capacity (cap).
+-----+
| ptr |
|*Elem|
+-----+
| len |
|int |
+-----+
| cap |
|int |
+-----+
So, the explanation of the code is as follow;
func main() {
+
|
s := []int{5} | s -> +-----+
| []int | ptr +-----> +---+
| |*int | [1]int| 5 |
| +-----+ +---+
| |len=1|
| |int |
| +-----+
| |cap=1|
| |int |
| +-----+
|
s = append(s,7) | s -> +-----+
| []int | ptr +-----> +---+---+
| |*int | [2]int| 5 | 7 |
| +-----+ +---+---+
| |len=2|
| |int |
| +-----+
| |cap=2|
| |int |
| +-----+
|
s = append(s,9) | s -> +-----+
| []int | ptr +-----> +---+---+---+---+
| |*int | [4]int| 5 | 7 | 9 | |
| +-----+ +---+---+---+---+
| |len=3|
| |int |
| +-----+
| |cap=4|
| |int |
| +-----+
|
x := append(s,11) | +-------------+-----> +---+---+---+---+
| | | [4]int| 5 | 7 | 9 |11 |
| | | +---+---+---+---+
| s -> +--+--+ x -> +--+--+
| []int | ptr | []int | ptr |
| |*int | |*int |
| +-----+ +-----+
| |len=3| |len=4|
| |int | |int |
| +-----+ +-----+
| |cap=4| |cap=4|
| |int | |int |
| +-----+ +-----+
|
y := append(s,12) | +-----> +---+---+---+---+
| | [4]int| 5 | 7 | 9 |12 |
| | +---+---+---+---+
| |
| +-------------+-------------+
| | | |
| s -> +--+--+ x -> +--+--+ y -> +--+--+
| []int | ptr | []int | ptr | []int | ptr |
| |*int | |*int | |*int |
| +-----+ +-----+ +-----+
| |len=3| |len=4| |len=4|
| |int | |int | |int |
| +-----+ +-----+ +-----+
| |cap=4| |cap=4| |cap=4|
| |int | |int | |int |
+ +-----+ +-----+ +-----+
fmt.Println(s,x,y)
}

Resources