I have two dataframes; each has an array(string) column.
I am trying to create a new dataframe that keeps only the rows where at least one array element in a row matches an element of the other dataframe's array.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('3',['QQQ']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+---------------------+
| ID|Company_Id|  value|         array_column|
+---+----------+-------+---------------------+
| 1A|   3412asd|value-1|      [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C|   9800bvd|value-3|                [AAA]|
| 3C|   9800bvd|value-1|     [AAA, YYY, CCCC]|
+---+----------+-------+---------------------+
Code I tried:
The main idea is to use rdd.toLocalIterator(), because other functions inside the same for loop depend on these filters.
for x in main_df.rdd.toLocalIterator():
    a = x["refer_array_col"]
    b = x["No"]
    some_x_filter = F.col('array_column').isin(b)
    final_df = df.filter(
        # filter 1
        some_x_filter &
        # second filter is to compare 'a' with array_column - I tried using F.array_contains
        (F.array_contains(F.col('array_column'), F.lit(a)))
    )
some_x_filter works in a similar way: it compares a string value against an array-of-strings column.
But a now contains a list of strings, and I am unable to compare it with array_column.
With my code, F.array_contains raises an error:
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what I can use for the second filter instead?
From what I understood from our conversation in the comments, your requirement is essentially to compare an array column with a Python list.
Building a literal array column from the list would do the job:
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))
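If the goal is to keep rows whose array shares at least one element with the Python list, Spark 2.4 and later also provide F.arrays_overlap, which could be applied to array_column and a literal array built as above. The sketch below is only a plain-Python model of that overlap check on the question's sample data, so it can be run without a Spark session; the helper name overlaps is made up for illustration.

```python
# Plain-Python model of an "arrays overlap" filter; this is not Spark code.
# Sample rows copied from the question's second dataframe.
rows = [
    ("1A", "3412asd", "value-1", ["XXX", "YYY", "AAA"]),
    ("2B", "2345tyu", "value-2", ["DDD", "YFFFYY", "GGG", "1"]),
    ("3C", "9800bvd", "value-3", ["AAA"]),
    ("3C", "9800bvd", "value-1", ["AAA", "YYY", "CCCC"]),
]


def overlaps(array_column, reference):
    """Return True when the two lists share at least one element."""
    return not set(array_column).isdisjoint(reference)


# 'a' stands for one row's refer_array_col value from main_df.
a = ["YYY", "MZA"]
matched = [row for row in rows if overlaps(row[3], a)]

for row in matched:
    print(row)
```

In Spark, the analogous filter would presumably look like df.filter(F.arrays_overlap(F.col("array_column"), F.array(*[F.lit(v) for v in a]))), assuming Spark 2.4 or later.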
I'm trying to filter the JSON output of an API with jq.
The API sometimes returns an object and sometimes an array. I want to handle both cases with an if statement and return an empty string when the object/array is not available.
{
"result":
{
"entry": {
"id": "207579",
"title": "Realtek Bluetooth Mesh SDK on Linux\/Android Segmented Packet reference buffer overflow",
"summary": "A vulnerability, which was classified as critical, was found in Realtek Bluetooth Mesh SDK on Linux\/Android (the affected version unknown). This affects an unknown functionality of the component Segmented Packet Handler. There is no information about possible countermeasures known. It may be suggested to replace the affected object with an alternative product.",
"details": {
"affected": "A vulnerability, which was classified as critical, was found in Realtek Bluetooth Mesh SDK on Linux\/Android (the affected version unknown).",
"vulnerability": "The manipulation of the argument reference with an unknown input leads to a unknown weakness. CWE is classifying the issue as CWE-120. The program copies an input buffer to an output buffer without verifying that the size of the input buffer is less than the size of the output buffer, leading to a buffer overflow.",
"impact": "This is going to have an impact on confidentiality, integrity, and availability.",
"countermeasure": "There is no information about possible countermeasures known. It may be suggested to replace the affected object with an alternative product."
},
"timestamp": {
"create": "1661860801",
"change": "1661861110"
},
"changelog": [
"software_argument"
]
},
"software": {
"vendor": "Realtek",
"name": "Bluetooth Mesh SDK",
"platform": [
"Linux",
"Android"
],
"component": "Segmented Packet Handler",
"argument": "reference",
"cpe": [
"cpe:\/a:realtek:bluetooth_mesh_sdk"
],
"cpe23": [
"cpe:2.3:a:realtek:bluetooth_mesh_sdk:*:*:*:*:*:*:*:*"
]
}
}
}
I would also like to apply the check globally to the whole output so I can convert it to .csv and replace the nulls, since the software name can also be either an array or an object. A global if statement would simplify the filter and suppress the error with ?
The error I received from bash:
jq -r '.result [] | [ "https://vuldb.com/?id." + .entry.id ,.software.vendor // "empty",(.software.name | if type!="array" then [.] | join (",") else . //"empty" end )?,.software.type // "empty",(.software.platform | if type!="array" then [] else . | join (",") //"empty" end )?] | @csv' > tst.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7452 0 7393 100 59 4892 39 0:00:01 0:00:01 --:--:-- 4935
jq: error (at <stdin>:182): Cannot iterate over null (null)
I also tried the following filter on https://jqplay.org/, but its syntax is incorrect:
.result [] |( if .[] == null then // "empty" else . end
| ,software.name // "empty" ,.software.platform |if type!="array" then [.] // "empty" else . | join (",") end)
Current output
[
[
"Bluetooth Mesh SDK"
],
"Linux,Android"
]
Desired outcome
[
"Bluetooth Mesh SDK",
"empty"
]
After fixing your input JSON, I think you can get the desired output by using the following jq filter:
if (.result | type) == "array" then . else (.result |= [.]) end
| .result[].software | [ .name, (.platform // [ "Empty" ] | join(",")) ]
Where
if (.result | type) == "array" then . else (.result |= [.]) end
wraps .result in an array if its type isn't array,
.result[].software
loops through the software of each .result object, and
[ .name, (.platform // [ "Empty" ] | join(",")) ]
creates an array with .name and .platform (which is replaced by [ "Empty" ] when it doesn't exist); the platform array is then join()'d into a string.
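For comparison, the same normalize-then-extract idea can be sketched outside jq. The Python below is only an illustration under assumptions: the function name software_rows and the two tiny sample documents are made up, and treating a lone string platform as a one-element list is an extra not present in the jq filter.

```python
import json


def software_rows(document):
    """Yield [name, platforms] per result, tolerating object-or-array shapes."""
    result = document.get("result")
    # Wrap a single object in a list so both shapes are handled the same way.
    results = result if isinstance(result, list) else [result]
    for item in results:
        software = (item or {}).get("software") or {}
        # Fall back to a placeholder when platform is missing or null.
        platform = software.get("platform") or ["Empty"]
        # Assumption: a lone string platform is treated as a one-element list.
        if not isinstance(platform, list):
            platform = [platform]
        yield [software.get("name"), ",".join(platform)]


# Hypothetical inputs: one with an object result, one with an array result.
single = json.loads(
    '{"result": {"software": {"name": "Bluetooth Mesh SDK",'
    ' "platform": ["Linux", "Android"]}}}'
)
many = json.loads('{"result": [{"software": {"name": "Bluetooth Mesh SDK"}}]}')

print(list(software_rows(single)))
print(list(software_rows(many)))
```

Normalizing the shape once at the top keeps the per-item logic free of type checks, which is the same design choice the jq filter makes with its leading if.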
Outcome:
[
"Bluetooth Mesh SDK",
"Linux,Android"
]
Or, when .platform is missing:
[
"Bluetooth Mesh SDK",
"Empty"
]
I have a JSON file (input.json) which looks like this:
{"header1":"a","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"b","header2":2a, "header3":2a, "header4":"orange"}
{"header1":"c","header2":1a, "header3":2a, "header4":"banana"},
{"header1":"d","header2":2a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"},
{"header1":"b","header2":1a, "header3":2a, "header4":"orange"},
{"header1":"b","header2":1a, "header3":1a, "header4":"orange"},
{"header1":"d","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"} (repeat of line 5)
I want to keep only the unique combinations of the values, using jq.
Results should look like:
{"header1":"a","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"b","header2":2a, "header3":2a, "header4":"orange"}
{"header1":"c","header2":1a, "header3":2a, "header4":"banana"},
{"header1":"d","header2":2a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"},
{"header1":"b","header2":1a, "header3":2a, "header4":"orange"},
{"header1":"b","header2":1a, "header3":1a, "header4":"orange"},
{"header1":"d","header2":1a, "header3":1a, "header4":"apple"}
I tried grouping by header1 together with the other headers, but it didn't produce unique results.
I've also used unique, but that didn't produce the right results either.
How can I get this? I'm new to jq and haven't found many tutorials on it.
Thanks
The sample lines you give are not valid JSON. Since your preamble introduces them as JSON, the following will assume that you intended to present an array of JSON objects.
The question is unclear in several respects, but from the example, it looks as though unique might be what you're looking for, so consider:
Invocation: jq -c 'unique[]' input.json
Output:
{"header1":"a","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}
{"header1":"b","header2":"1a","header3":"1a","header4":"orange"}
{"header1":"b","header2":"1a","header3":"2a","header4":"orange"}
{"header1":"b","header2":"2a","header3":"2a","header4":"orange"}
{"header1":"c","header2":"1a","header3":"2a","header4":"banana"}
{"header1":"d","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"d","header2":"2a","header3":"1a","header4":"apple"}
If you need the output in some other format, you could do that using jq as well, but the requirements are not so clear, so let's leave that as an exercise :-)
Since, as peak indicated, your input isn't legal JSON, I've taken the liberty of correcting it and converting it to a list of individual objects:
{"header1":"a","header2":"1a", "header3":"1a", "header4":"apple"}
{"header1":"b","header2":"2a", "header3":"2a", "header4":"orange"}
{"header1":"c","header2":"1a", "header3":"2a", "header4":"banana"}
{"header1":"d","header2":"2a", "header3":"1a", "header4":"apple"}
{"header1":"a","header2":"2a", "header3":"1a", "header4":"banana"}
{"header1":"b","header2":"1a", "header3":"2a", "header4":"orange"}
{"header1":"b","header2":"1a", "header3":"1a", "header4":"orange"}
{"header1":"d","header2":"1a", "header3":"1a", "header4":"apple"}
{"header1":"a","header2":"2a", "header3":"1a", "header4":"banana"}
If this data is in data.json and you run
jq -M -s -f filter.jq data.json
with the following filter.jq
foreach .[] as $r (
{}
; ($r | map(.)) as $p | if getpath($p) then empty else setpath($p;1) end
; $r
)
it will generate the following output in the original order with no duplicates.
{"header1":"a","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"b","header2":"2a","header3":"2a","header4":"orange"}
{"header1":"c","header2":"1a","header3":"2a","header4":"banana"}
{"header1":"d","header2":"2a","header3":"1a","header4":"apple"}
{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}
{"header1":"b","header2":"1a","header3":"2a","header4":"orange"}
{"header1":"b","header2":"1a","header3":"1a","header4":"orange"}
{"header1":"d","header2":"1a","header3":"1a","header4":"apple"}
Note that the
($r | map(.))
is used to generate an array containing just the values from each row
which is assumed to always produce a unique key path. This is true for
the sample data but may not be true for more complex values.
A slower but more robust filter.jq is
foreach .[] as $r (
{}
; [$r | tojson] as $p | if getpath($p) then empty else setpath($p;1) end
; $r
)
which uses the JSON representation of the entire row as a unique key to determine whether a row has already been seen.
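The "serialized row as key" idea is not specific to jq. As a rough illustration, here is a minimal Python sketch of the same order-preserving de-duplication; the function name unique_in_order and the three sample lines are assumptions, and sort_keys=True is an addition that makes two rows with the same fields in a different key order count as duplicates.

```python
import json


def unique_in_order(records):
    """Yield each record the first time it is seen, preserving input order."""
    seen = set()
    for record in records:
        # The serialized row is the key; sort_keys ignores key order.
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield record


# Three sample rows in the corrected format; the last repeats the second.
lines = [
    '{"header1":"a","header2":"1a","header3":"1a","header4":"apple"}',
    '{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}',
    '{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}',
]
records = [json.loads(line) for line in lines]
deduped = list(unique_in_order(records))

for record in deduped:
    print(json.dumps(record, separators=(",", ":")))
```

Because it is a generator, the function can also consume rows lazily, for example while reading the file line by line.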
I am trying to parse a long file like this (the output of the play command on Linux):
File :1.mp3
In:0.00% 00:00:00.00 [00:03:14.51] Out:0 [ | ]
In:0.19% 00:00:00.37 [00:03:14.14] Out:16.4k [ | ]
In:0.29% 00:00:00.56 [00:03:13.95] Out:24.6k [======|======]
In:0.33% 00:00:00.65 [00:03:13.86] Out:28.7k [ =====|===== ]
In:0.43% 00:00:00.84 [00:03:13.67] Out:36.9k [ =====|===== ]
In:0.53% 00:00:01.02 [00:03:13.49] Out:45.1k [ -====|===== ]
In:0.62% 00:00:01.21 [00:03:13.30] Out:53.2k [ =====|===== ]
In:0.72% 00:00:01.39 [00:03:13.11] Out:61.4k [-=====|======]
In:0.81% 00:00:01.58 [00:03:12.93] Out:69.6k [-=====|=====-]
In:0.91% 00:00:01.76 [00:03:12.74] Out:77.8k [-=====|=====-]
In:0.96% 00:00:01.86 [00:03:12.65] Out:81.9k [ =====|===== ]
And so on
I would like to extract the percentage number.
How can I do it without loading the whole file into a String (it is too large, ~100KB)?
I thought of using this regular expression: "In:(\d{1,2}\.\d{2})"
How should I do it?
Try this regex:
/^In:([0-9]{1,3}\.[0-9]{1,2})\%/gm
Explanation:
/
  ^            Matches the start of the string (or of each line, given the m flag below).
  In:          Matches "In:".
  ( )          Groups the percentage (excluding the sign).
  [0-9]{1,3}   Matches 1-3 digits.
  \.           Matches a dot.
  [0-9]{1,2}   Matches 1-2 digits.
  \%           Matches a percent sign.
/gm            g allows multiple matches; m makes ^ match the beginning of each line (not just the beginning of the string).
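The question does not say which language is in use, so the following is only a sketch in Python of applying this pattern one line at a time, which avoids holding the whole file in a single string. The function name percentages is made up, and io.StringIO stands in for an open file or a pipe from the play command.

```python
import io
import re

# Same pattern as above, written for Python's re module.
# re.match anchors at the start of each line, so no m flag is needed here.
PERCENT = re.compile(r"^In:([0-9]{1,3}\.[0-9]{1,2})%")


def percentages(lines):
    """Yield the percentage from each progress line, one line at a time."""
    for line in lines:
        match = PERCENT.match(line)
        if match:
            yield float(match.group(1))


# io.StringIO stands in for a real file object opened for reading.
sample = io.StringIO(
    "File :1.mp3\n"
    "In:0.00% 00:00:00.00 [00:03:14.51] Out:0 [ | ]\n"
    "In:0.19% 00:00:00.37 [00:03:14.14] Out:16.4k [ | ]\n"
)
values = list(percentages(sample))
print(values)
```

Iterating over the file object reads it lazily, so memory use stays at roughly one line regardless of file size.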
How do I get the last element of an array and show the rest of the elements?
Like this :
@myArray = (1,1,1,1,1,2);
Expected output :
SomeVariable1 = 11111
SomeVariable2 = 2
# print last element
print $myArray[-1];
# joined rest of the elements
print join "", @myArray[0 .. $#myArray-1] if @myArray > 1;
If you don't mind modifying the array,
# print last element
print pop @myArray;
# joined rest of the elements
print join "", @myArray;
Сухой27 has given you the answer. I wanted to add that if you are creating a structured output, it might be nice to use a hash:
my @myArray = (1,1,1,1,1,2);
my %variables = (
    SomeVariable1 => [ @myArray[0 .. $#myArray -1] ],
    SomeVariable2 => [ $myArray[-1] ]
);
for my $key (keys %variables) {
    print "$key => ",@{ $variables{$key} },"\n";
}
Output:
SomeVariable1 => 11111
SomeVariable2 => 2