Unique combinations of different values in json using jq - arrays

I have a JSON file (input.json) which looks like this:
{"header1":"a","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"b","header2":2a, "header3":2a, "header4":"orange"}
{"header1":"c","header2":1a, "header3":2a, "header4":"banana"},
{"header1":"d","header2":2a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"},
{"header1":"b","header2":1a, "header3":2a, "header4":"orange"},
{"header1":"b","header2":1a, "header3":1a, "header4":"orange"},
{"header1":"d","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"} (repeat of line 5)
I want to filter out only the unique combinations of these values using jq.
Results should look like:
{"header1":"a","header2":1a, "header3":1a, "header4":"apple"},
{"header1":"b","header2":2a, "header3":2a, "header4":"orange"}
{"header1":"c","header2":1a, "header3":2a, "header4":"banana"},
{"header1":"d","header2":2a, "header3":1a, "header4":"apple"},
{"header1":"a","header2":2a, "header3":1a, "header4":"banana"},
{"header1":"b","header2":1a, "header3":2a, "header4":"orange"},
{"header1":"b","header2":1a, "header3":1a, "header4":"orange"},
{"header1":"d","header2":1a, "header3":1a, "header4":"apple"}
I tried doing a group_by of header1 with the other headers, but it didn't generate unique results.
I've used unique, but that didn't generate the proper results either.
How can I get this? I'm new to jq and not finding many tutorials on it.
Thanks

The sample lines you give are not valid JSON. Since your preamble introduces them as JSON, the following will assume that you intended to present an array of JSON objects.
The question is unclear in several respects, but from the example, it looks as though unique might be what you're looking for, so consider:
Invocation: jq -c 'unique[]' input.json
Output:
{"header1":"a","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}
{"header1":"b","header2":"1a","header3":"1a","header4":"orange"}
{"header1":"b","header2":"1a","header3":"2a","header4":"orange"}
{"header1":"b","header2":"2a","header3":"2a","header4":"orange"}
{"header1":"c","header2":"1a","header3":"2a","header4":"banana"}
{"header1":"d","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"d","header2":"2a","header3":"1a","header4":"apple"}
If you need the output in some other format, you could do that using jq as well, but the requirements are not so clear, so let's leave that as an exercise :-)
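As a starting point for that exercise: if, for instance, you wanted CSV output with the headers in their original column order, a filter along these lines should do (the column list is just one possible choice):
jq -r 'unique[] | [.header1, .header2, .header3, .header4] | @csv' input.json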

Since, as peak indicated, your input isn't legal JSON, I've taken the liberty of correcting it and converting it to a sequence of individual objects:
{"header1":"a","header2":"1a", "header3":"1a", "header4":"apple"}
{"header1":"b","header2":"2a", "header3":"2a", "header4":"orange"}
{"header1":"c","header2":"1a", "header3":"2a", "header4":"banana"}
{"header1":"d","header2":"2a", "header3":"1a", "header4":"apple"}
{"header1":"a","header2":"2a", "header3":"1a", "header4":"banana"}
{"header1":"b","header2":"1a", "header3":"2a", "header4":"orange"}
{"header1":"b","header2":"1a", "header3":"1a", "header4":"orange"}
{"header1":"d","header2":"1a", "header3":"1a", "header4":"apple"}
{"header1":"a","header2":"2a", "header3":"1a", "header4":"banana"}
If this data is in data.json and you run
jq -M -s -f filter.jq data.json
with the following filter.jq
foreach .[] as $r (
  {}
  ; ($r | map(.)) as $p | if getpath($p) then empty else setpath($p;1) end
  ; $r
)
it will generate the following output in the original order with no duplicates.
{"header1":"a","header2":"1a","header3":"1a","header4":"apple"}
{"header1":"b","header2":"2a","header3":"2a","header4":"orange"}
{"header1":"c","header2":"1a","header3":"2a","header4":"banana"}
{"header1":"d","header2":"2a","header3":"1a","header4":"apple"}
{"header1":"a","header2":"2a","header3":"1a","header4":"banana"}
{"header1":"b","header2":"1a","header3":"2a","header4":"orange"}
{"header1":"b","header2":"1a","header3":"1a","header4":"orange"}
{"header1":"d","header2":"1a","header3":"1a","header4":"apple"}
Note that
($r | map(.))
is used to generate an array containing just the values from each row, which is assumed to always produce a unique key path. This is true for the sample data but may not be true for more complex values.
A slower but more robust filter.jq is
foreach .[] as $r (
  {}
  ; [$r | tojson] as $p | if getpath($p) then empty else setpath($p;1) end
  ; $r
)
which uses the json representation of the entire row as a unique key to determine if a row has been previously seen.
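If you prefer not to keep a separate filter.jq, essentially the same order-preserving de-duplication can be sketched inline with reduce, again using the tojson form of each row as the key (a minimal sketch against the same data.json):
jq -M -c -s 'reduce .[] as $r ({seen: {}, out: []};
    if .seen[$r|tojson] then . else .seen[$r|tojson] = true | .out += [$r] end
  ) | .out[]' data.json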

Related

Compare two arrays from two different dataframes in Pyspark

I have two dataframes; each has an array(string) column.
I am trying to create a new dataframe that keeps only the rows where one of the array elements in a row matches an element from the other.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('3',['QQQ']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+---------------------+
| ID|Company_Id|  value|         array_column|
+---+----------+-------+---------------------+
| 1A|   3412asd|value-1|      [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C|   9800bvd|value-3|                [AAA]|
| 3C|   9800bvd|value-1|     [AAA, YYY, CCCC]|
+---+----------+-------+---------------------+
Code I tried:
The main idea is to use rdd.toLocalIterator(), as there are some other functions inside the same for loop that depend on these filters.
for x in main_df.rdd.toLocalIterator():
    a = x["refer_array_col"]
    b = x["No"]
    some_x_filter = F.col('array_column').isin(b)
    final_df = df.filter(
        # filter 1
        some_x_filter &
        # second filter is to compare 'a' with array_column - i tried using F.array_contains
        (F.array_contains(F.col('array_column'), F.lit(a)))
    )
some_x_filter is also working in a similar way;
it compares a string value against an array-of-strings column.
But now a contains a list of strings and I am unable to compare it with array_column.
With my code I am getting an error for array_contains.
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what I can use for the second filter instead?
From what I understood based on our conversation in the comments, your requirement is essentially to compare an array column with a Python list.
Thus, this would do the job:
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))

JQ substitute values in one array based on values from different array

I would like in JQ to substitute the values in the first array with the names in the second array based on the key "size" while keeping the length of the first array unchanged.
[
[{\"size\":0},{\"size\":1},{\"size\":2},{\"size\":3},{\"size\":4},{\"size\":5},{\"size\":6},{\"size\":7},{\"size\":8},{\"size\":9},{\"size\":0},{\"size\":1},{\"size\":2},{\"size\":3},{\"size\":4},{\"size\":5},{\"size\":6},{\"size\":7},{\"size\":8},{\"size\":0},{\"size\":1},{\"size\":2},{\"size\":3},{\"size\":4},{\"size\":5},{\"size\":6},{\"size\":7},{\"size\":8}]
[{\"size\":0,\"name\":\"6M\"},{\"size\":1,\"name\":\"6.5M\"},{\"size\":2,\"name\":\"7M\"},{\"size\":3,\"name\":\"7.5M\"},{\"size\":4,\"name\":\"8M\"},{\"size\":5,\"name\":\"8.5M\"},{\"size\":6,\"name\":\"9M\"},{\"size\":7,\"name\":\"9.5M\"},{\"size\":8,\"name\":\"10M\"},{\"size\":9,\"name\":\"11M\"}]
]
So the wanted result would look like this:
[{\"size\":6M},{\"size\":6.5M},{\"size\":7M},{\"size\":7.5M},{\"size\":8M},{\"size\":8.5M},{\"size\":9M},{\"size\":9.5M},{\"size\":10M},{\"size\":11M},{\"size\":6M},{\"size\":6.5M},{\"size\":7M},{\"size\":7.5M},{\"size\":8M},{\"size\":8.5M},{\"size\":9M},{\"size\":9.5M},{\"size\":10M},{\"size\":6M},{\"size\":6.5M},{\"size\":7M},{\"size\":7.5M},{\"size\":8M},{\"size\":8.5M},{\"size\":9M},{\"size\":9.5M},{\"size\":10M}]
I used this filter to obtain the above arrays: jq "[ .DATA.product.traits|([.colors.colorMap[]| {size:.sizes[]}]), ( [ .sizes.sizeMap[]| {size:.id, name}])]" and then tried using an if-then-else statement (|if (.[0:-1][][].size == .[-1][].size) then [.[-1][].name ] else null end") but I was unable to reach the desired output.
Given the input
[
[
{"size":0}, {"size":1}, {"size":2}, {"size":3}, {"size":4},
{"size":5}, {"size":6}, {"size":7}, {"size":8}, {"size":9},
{"size":0}, {"size":1}, {"size":2}, {"size":3}, {"size":4},
{"size":5}, {"size":6}, {"size":7}, {"size":8}, {"size":0},
{"size":1}, {"size":2}, {"size":3}, {"size":4}, {"size":5},
{"size":6}, {"size":7}, {"size":8}
],
[
{"size":0,"name":"6M"}, {"size":1,"name":"6.5M"},
{"size":2,"name":"7M"}, {"size":3,"name":"7.5M"},
{"size":4,"name":"8M"}, {"size":5,"name":"8.5M"},
{"size":6,"name":"9M"}, {"size":7,"name":"9.5M"},
{"size":8,"name":"10M"}, {"size":9,"name":"11M"}
]
]
you can generate an INDEX over the second item, and use it for reference in a map on the first
(.[1] | INDEX(.size)) as $map | .[0] | map($map[.size | tostring])
to get
[
{"size":0,"name":"6M"}, {"size":1,"name":"6.5M"},
{"size":2,"name":"7M"}, {"size":3,"name":"7.5M"},
{"size":4,"name":"8M"}, {"size":5,"name":"8.5M"},
{"size":6,"name":"9M"}, {"size":7,"name":"9.5M"},
{"size":8,"name":"10M"}, {"size":9,"name":"11M"},
{"size":0,"name":"6M"}, {"size":1,"name":"6.5M"},
{"size":2,"name":"7M"}, {"size":3,"name":"7.5M"},
{"size":4,"name":"8M"}, {"size":5,"name":"8.5M"},
{"size":6,"name":"9M"}, {"size":7,"name":"9.5M"},
{"size":8,"name":"10M"}, {"size":0,"name":"6M"},
{"size":1,"name":"6.5M"}, {"size":2,"name":"7M"},
{"size":3,"name":"7.5M"}, {"size":4,"name":"8M"},
{"size":5,"name":"8.5M"}, {"size":6,"name":"9M"},
{"size":7,"name":"9.5M"}, {"size":8,"name":"10M"}
]
Demo
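Note that INDEX/1 requires jq 1.6. On an older jq, the same lookup table can be built by hand, for example (a sketch):
(reduce .[1][] as $o ({}; .[$o.size | tostring] = $o)) as $map
| .[0] | map($map[.size | tostring])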

Pyspark Array Column - Replace Empty Elements with Default Value

I have a dataframe with a column which is an array of strings. Some of the elements of the array may be missing like so:
-------------|-------------------------------
ID |array_list
---------------------------------------------
38292786 |[AAA,, JLT] |
38292787 |[DFG] |
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |
38292790 |[] |
38292791 |[] |
38292792 |[,,, HKJ] |
I would like to replace the missing elements with a default value of "ZZZ". Is there a way to do that? I tried the following code, which is using a transform function and a regular expression:
import pyspark.sql.functions as F
from pyspark.sql.dataframe import DataFrame
def transform(self, f):
    return f(self)
DataFrame.transform = transform
df = df.withColumn("array_list2", F.expr("transform(array_list, x -> regexp_replace(x, '', 'ZZZ'))"))
This doesn't give an error but it is producing nonsense. I'm thinking I just don't know the correct way to identify the missing elements of the array - can anyone help me out?
In production our data has around 10 million rows and I am trying to avoid using explode or a UDF (not sure if it's possible to avoid using both though, just need the code run as efficiently as possible). I'm using Spark 2.4.4
This is what I would like my output to look like:
-------------|-------------------------------|-------------------------------
ID |array_list | array_list2
---------------------------------------------|-------------------------------
38292786 |[AAA,, JLT] |[AAA, ZZZ, JLT]
38292787 |[DFG] |[DFG]
38292788 |[SHJ, QKJ, AAA, YTR, CBM] |[SHJ, QKJ, AAA, YTR, CBM]
38292789 |[DUY, ANK, QJK, POI, CNM, ADD] |[DUY, ANK, QJK, POI, CNM, ADD]
38292790 |[] |[ZZZ]
38292791 |[] |[ZZZ]
38292792 |[,,, HKJ] |[ZZZ, ZZZ, ZZZ, HKJ]
regexp_replace works at the character level.
I could not get it to work with transform either, but with help from the first answerer I used a UDF; not that easy.
Here is my example with my data, you can tailor.
%python
from pyspark.sql.types import StringType, ArrayType
from pyspark.sql.functions import udf, col, lit

concat_udf = udf(
    lambda con_str, arr: [
        x if x is not None else con_str for x in arr or [None]
    ],
    ArrayType(StringType()),
)

arrayData = [
    ('James', ['Java', 'Scala']),
    ('Michael', ['Spark', 'Java', None]),
    ('Robert', ['CSharp', '']),
    ('Washington', None),
    ('Jefferson', ['1', '2'])]

df = spark.createDataFrame(data=arrayData, schema=['name', 'knownLanguages'])
df = df.withColumn("knownLanguages", concat_udf(lit("ZZZ"), col("knownLanguages")))
df.show()
returns:
+----------+------------------+
| name| knownLanguages|
+----------+------------------+
| James| [Java, Scala]|
| Michael|[Spark, Java, ZZZ]|
| Robert| [CSharp, ]|
|Washington| [ZZZ]|
| Jefferson| [1, 2]|
+----------+------------------+
Quite tough, this; I had some help from the first answerer.
I'm thinking of something, but I'm not sure if it is efficient.
from pyspark.sql import functions as F
df.withColumn("array_list2", F.split(F.array_join("array_list", ",", "ZZZ"), ","))
First I concatenate the values as a string with a delimiter , (hoping you don't have it in your string but you can use something else). I use the null_replacement option to fill the null values. Then I split according to the same delimiter.
EDIT: Based on @thebluephantom's comment, you can try this solution:
df.withColumn(
"array_list_2", F.expr(" transform(array_list, x -> coalesce(x, 'ZZZ'))")
).show()
SQL built-in transform is not working for me, so I couldn't try it but hopefully you'll have the result you wanted.
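If the missing entries turn out to be empty strings rather than nulls (the [AAA,, JLT] display can mean either), a slightly broader variant of the same transform idea might look like this (untested, same caveat as above):
df.withColumn(
    "array_list2",
    F.expr("transform(array_list, x -> case when x is null or x = '' then 'ZZZ' else x end)")
).show()
Neither variant turns an empty array into [ZZZ]; that case would need an extra step, for example wrapping the column in F.when(F.size("array_list") == 0, F.array(F.lit("ZZZ"))).otherwise(...).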

Many array elements; I need to search a file and print the array elements which are not present in the file, and there should be no duplicate records in my output

Please help me with the below code:
I have an array with 155 elements, and I have a file which has some elements of the array inside it. I need the values for all the array elements which are found in the file, and I also need the array element to be printed with zero if it is not found in the file.
Thanks in advance; this is what I have tried.
args=("C9" "DP10" "DP11" "DP20" "DP21" "DP30" "DP31" "DP50" "FR31" "G128" "G402" "G602" "GA" "GI" "GT08" "GT14" "GT17" "GT25" "GT37" "GT67" "H6" "H7" "IL" "IM" "J6" "JD05" "JD09" "JD14" "JD25" "JD37" "K1" "K2" "L100" "L106" "L116" "L150" "L202" "L7" "L8" "L9" "LD11" "LD21" "LE09" "LE26" "LP11" "LP21" "LP31" "LP55" "LQ11" "LQ21" "LQ31" "LS07" "LT09" "LT10" "LT12" "LT15" "LT20" "LT22" "LT24" "LT25" "LT30" "LT38" "LT42" "LT43" "LT44" "LT48" "LT50" "LT59" "LT60" "LT65" "M395" "OV04" "OV07" "OV14" "OV18" "OV23" "OV27" "OV35" "OV39" "OV40" "OV79" "Q15" "Q150" "Q19" "QD11" "QD21" "QD31" "QD65" "QE11" "QE21" "QE31" "QF50" "QM25" "QP10" "QP15" "QP20" "QP30" "QP31" "QP50" "QT25" "QT50" "R39" "R40" "r57" "R9" "rc23" "RC27" "RC39" "rc7" "rc79" "S1" "S101" "S117" "S118" "S13p" "S18" "S202" "S317" "S318" "S319" "S40" "S408" "S67" "S76" "S82" "S99" "SD11" "SD12" "SD14" "SD17" "SD29" "SD3" "SD5" "SD98" "SF20" "SF74" "SR07" "SV19" "SV6p" "T402" "T602" "TG00" "TG17" "TG43" "TG8" "TG92" "WD09" "WD14" "WD17" "WD24" "WD29" "WD37" "WD43" "WWE1" "XR91")
MY CODE:
I have used a for loop to traverse the elements and search for each one inside the file.
for i in ${args[@]}; do
    grep $i file.txt
    if [ $? -ne 0 ]; then
        echo $i"","""0"
    fi
done >> output.txt
TOTAL FILE:
C9,5015319
DP10,36870732
DP11,188
DP20,18728254
DP21,341182
DP30,8415555
DP31,2390000
DP50,12371853
FR31,24541
G128,49780
G402,2000
G602,2000
GA,879888
GT08,1580384
GT17,1968192
GT25,4104
GT37,21550
GT67,24770
H6,660652
IL,137651
JD05,1518400
JD14,325800
JD25,828600
JD37,357100
K1,261549
K2,4715330
L100,284
L116,80000
L7,200847
L8,3158
L9,5054495
LE09,75776
LE26,343410
LP11,1030
LP21,492
LP31,113
LP55,3
LQ11,6776000
LQ21,3543600
LQ31,4525600
LT09,682800
LT12,5715
LT15,568873
LT22,236077
LT24,702800
LT25,4600
LT38,28990
LT65,300125
M395,29600
OV14,462
OV18,86300
OV40,217899
Q150,678
QD11,1000022
QD31,50
QF50,58575
QM25,57900
QP10,1792153
QP15,953400
QP20,770000
QP30,179450
QP31,163223
QP50,8
QT50,66340
R39,62440
R40,18807
r57,3456
rc23,3370
RC27,2809
RC39,2570
rc7,7137
rc79,1296
S1,25007
S117,1000000
S13p,52313
S18,75000
S317,289148
S318,3046
S319,30000
S40,300
S408,4967
S76,28
S82,103238
S99,480
SD11,6719
SD12,23123
SD14,22595
SD17,100000
SD29,252392
SD3,20000
SD5,14090
SD98,653
SF20,1000
SF74,7330
SV19,26461
SV6p,154994
T402,2000
T602,2000
TG17,2031
TG8,2964
TG92,1759
WD17,131194
WD24,94589
WD29,202198
WD37,101794
WD43,112942
WWE1,9600
XR91,70000
EXPECTED OUTPUT:
The output should contain the value from the file for each array element that is present.
If an element is not present, the output should show that array element with zero. For example:
c9 is not present in the file,
so the output for c9 should be
c9,0
Your approach is not bad. I would just use
^$i,
as the grep pattern. With your current file data, it's not necessary, but maybe one day your file will contain things like
X,2354
XA,1234
and suddenly your algorithm would fail if args contained the element X.
Also, the echo statement is unnecessarily complex. I would write it simply as
echo $i,0
You can also simplify the if, by combining it with the grep
if ! grep ^$i, file.txt
but this is mere cosmetics and a matter of taste.
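Putting those suggestions together, the whole loop would look roughly like this (same args, file.txt and output.txt as in the question; matching lines are still printed by grep itself, and only the missing codes get a ",0" line):
for i in "${args[@]}"; do
    # anchored pattern: the code must start the line and be followed by a comma
    if ! grep "^$i," file.txt; then
        echo "$i,0"
    fi
done >> output.txt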

Extract Text from Array - perl

I am trying to extract the Interface from an array created from an SNMP Query.
I want to create an array like THIS:
my #array = ( "Gig 11/8",
"Gig 10/1",
"Gig 10/4",
"Gig 10/2");
It currently looks like THIS:
my @array =
( "orem-g13ap-01 Gig 11/8 166 T AIR-LAP11 Gig 0",
  "orem-g15ap-06 Gig 10/1 127 T AIR-LAP11 Gig 0",
  "orem-g15ap-05 Gig 10/4 168 T AIR-LAP11 Gig 0",
  "orem-g13ap-03 Gig 10/2 132 T AIR-LAP11 Gig 0");
I am doing THIS:
foreach $ints (@array) {
    @gig = substr("$ints", 17, 9);
    print("Interface: @gig");
}
Sure it works, but the hostname [orem-g15ap-01] doesn't always stay the same length, it varies depending on the site. I need to extract the word "Gig" plus the next 6 characters. I have no idea what is the best way of doing this.
I am a novice at perl but trying. Thanks
# "I need to extract the word "Gig" plus the next 6 characters."
# This looks like a fixed-with file format, so consider using unpack.
foreach ( #lines ) {
my( $orem, $gig, $rest ) = unpack 'a17 a9 a*';
print "[$gig]\n";
}
If it's not a fixed-width format, then you need to find out what the file spec is and then maybe use a regular expression, something like:
my( $orem, $gig, $rest ) = m/(\S+)\s+(.{9})(.*)/;
But this will not work in the general case without a proper file spec.
Stuff like that is what Perl is made for. Regular Expressions are the way to go. Read the perldoc perlre.
foreach $ints (@array) {
    $ints =~ s/.*?(Gig.{6}).*/$1/;   # keep only "Gig" plus the next 6 characters
}
So you want the second and third fields.
my @array = map { /^\S+\s+(\S+\s\S+)/s } @source;
This one is like ikegami's, but I recommend that if you know how something you want looks, then by all means, specify that. Because this is done in a list context, any string that does not match the spec returns an empty list--or is ignored.
my @results = map { m!(\bGig\s+\d+/\d+)! } @array;
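For instance, applied to the data from the question (a quick sketch):
use strict;
use warnings;

my @array = (
    "orem-g13ap-01 Gig 11/8 166 T AIR-LAP11 Gig 0",
    "orem-g15ap-06 Gig 10/1 127 T AIR-LAP11 Gig 0",
);

# keep only the "Gig N/N" part of each line
my @results = map { m!(\bGig\s+\d+/\d+)! } @array;
print "$_\n" for @results;    # prints "Gig 11/8" and "Gig 10/1"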
