Spark get datatype of nested object

Spark get datatype of nested object - arrays

I have some JSON data which looks like this:
{
"key1":"value1",
"key2":[
1,
2,
3
],
"key3":{
"key31":"value31",
"key32":"value32"
},
"key4":[
{
"key41":"value411",
"key42":"value412",
"key43":"value413"
},
{
"key41":"value421",
"key42":"value422",
"key43":"value423"
}
],
"key5":{
"key51":[
{
"key511":"value511",
"key512":"value512",
"key513":"value513"
},
{
"key511":"value521",
"key512":"value522",
"key513":"value523"
}
]
},
"key6":{
"key61":{
"key611":[
{
"key_611":"value_611",
"key_612":"value_612",
"key_613":"value_613"
},
{
"key_611":"value_621",
"key_612":"value_622",
"key_613":"value_623"
},
{
"key_611":"value_621",
"key_612":"value_622",
"key_613":"value_623"
}
]
}
}
}
It contains a mix of simple, complex and array-type values.
If I try to get the data type of key1 with schema("key1").dataType, I get StringType, and the same lookup works for key2, key3 and key4.
For key5 also, I get StructType.
But when I try to get the data type of key51, which is nested under key5, using schema("key5.key51").dataType, I get the following error:
java.lang.IllegalArgumentException: Field "key5.key51" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
... 48 elided
My main intention is to explode a given column if it is of ArrayType and to leave any other type alone.
The explode function recognizes the given key (key5.key51) properly and explodes the array. The problem is determining the data type.
One possible solution for me is to do a select of key5.key51 as a separate column key51 and then explode that column.
But is there any better and more elegant way of doing this while still being able to determine the datatype of the given column?

The simplest solution is to select the field of interest, and then retrieve the schema:
df.select("key5.key51").schema.head.dataType
Using the full schema directly would require traversing it, which can be hard to get right with field names that contain embedded dots, StructTypes and complex types (maps and arrays).
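To illustrate what traversing the schema by hand involves, here is a minimal Python sketch that resolves a dotted path against the dictionary form of a schema. It assumes the layout that PySpark's df.schema.jsonValue() produces (a "struct" with a "fields" list); the helper names field, resolve_path and type_name are made up for this example, and the sample schema is hand-written rather than read from a real DataFrame. It only descends through structs and splits naively on ".", so it shares the embedded-dot weakness mentioned above.

```python
from typing import Any, Dict, Union

SchemaType = Union[str, Dict[str, Any]]


def field(name: str, dtype: SchemaType) -> Dict[str, Any]:
    """Hypothetical helper: build one field entry in the JSON schema layout."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def resolve_path(schema: SchemaType, path: str) -> SchemaType:
    """Return the type found at a dotted path such as 'key5.key51'."""
    current = schema
    for name in path.split("."):
        # Only structs have named children to descend into.
        if not (isinstance(current, dict) and current.get("type") == "struct"):
            raise KeyError(f"cannot look up {name!r}: parent is not a struct")
        matches = [f for f in current["fields"] if f["name"] == name]
        if not matches:
            raise KeyError(f'Field "{name}" does not exist')
        current = matches[0]["type"]
    return current


def type_name(dtype: SchemaType) -> str:
    """Primitive types are plain strings; complex types carry a 'type' key."""
    return dtype if isinstance(dtype, str) else dtype["type"]


# Hand-written sample mirroring part of the question's data (an assumption).
schema = {"type": "struct", "fields": [
    field("key1", "string"),
    field("key5", {"type": "struct", "fields": [
        field("key51", {"type": "array", "containsNull": True,
                        "elementType": {"type": "struct",
                                        "fields": [field("key511", "string")]}}),
    ]}),
]}

print(type_name(resolve_path(schema, "key1")))        # string
print(type_name(resolve_path(schema, "key5.key51")))  # array
```

Compared with this, df.select("key5.key51").schema.head.dataType lets Spark do the path resolution itself.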

Here is some (recursive) code to find all ArrayType fields names:
import org.apache.spark.sql.types._
// Returns the dotted path of every ArrayType field reachable through nested structs.
// Each path is built separately, so a struct holding several arrays yields one entry per array.
def findArrayTypes(parents: Seq[String], f: StructField): Seq[String] = {
  f.dataType match {
    case _: ArrayType => Seq(parents.mkString("."))
    case struct: StructType =>
      struct.fields.toSeq.flatMap(child => findArrayTypes(parents :+ child.name, child))
    case _ => Seq.empty[String]
  }
}
val arrayTypeColumns = df.schema.fields.toSeq
  .flatMap(f => findArrayTypes(Seq(f.name), f))
For your dataframe, this gives:
arrayTypeColumns.foreach(println)
key2
key4
key5.key51
key6.key61.key611
This does not yet handle arrays inside maps or nested arrays.
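One way the traversal could be extended to arrays inside maps and nested arrays is sketched below in Python, again over the dictionary form of a schema (the layout assumed here is the one PySpark's df.schema.jsonValue() produces). The function name find_array_paths, the field helper, the sample schema, and the "[]" and "{}" path markers are all invented for this illustration; the markers only show where the walk passed through array elements or map values and are not valid Spark column expressions.

```python
from typing import Any, Dict, List, Union

SchemaType = Union[str, Dict[str, Any]]


def field(name: str, dtype: SchemaType) -> Dict[str, Any]:
    """Hypothetical helper: build one field entry in the JSON schema layout."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def find_array_paths(dtype: SchemaType, path: str = "") -> List[str]:
    """Collect the path of every array type reachable from dtype."""
    if isinstance(dtype, str):  # primitive such as "string" or "long"
        return []
    kind = dtype["type"]
    if kind == "struct":
        found: List[str] = []
        for f in dtype["fields"]:
            child = f"{path}.{f['name']}" if path else f["name"]
            found.extend(find_array_paths(f["type"], child))
        return found
    if kind == "array":
        # Report this array, then keep looking inside its elements.
        return [path] + find_array_paths(dtype["elementType"], path + "[]")
    if kind == "map":
        # Only map values are searched; keys are ignored in this sketch.
        return find_array_paths(dtype["valueType"], path + "{}")
    return []


def array_of(element: SchemaType) -> Dict[str, Any]:
    return {"type": "array", "elementType": element, "containsNull": True}


# Hand-written sample schema (an assumption, not taken from a real DataFrame).
schema = {"type": "struct", "fields": [
    field("key2", array_of("long")),
    field("m", {"type": "map", "keyType": "string",
                "valueType": array_of("long"), "valueContainsNull": True}),
    field("nested", array_of(array_of("long"))),
    field("key5", {"type": "struct", "fields": [
        field("key51", array_of({"type": "struct", "fields": [
            field("key511", "string"),
            field("tags", array_of("string")),
        ]})),
    ]}),
]}

for p in find_array_paths(schema):
    print(p)
```

For this sample the walk reports key2, m{}, nested, nested[], key5.key51 and key5.key51[].tags, in that order.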

Related

Using inherited struct array attributes in ranking and searching

I am currently using document inheritance to avoid having to include specific fields in every document type that I make. It works for most scenarios, but when an array of structs is inherited, using that field in ranking and searching seems to break things. For example, take the following application.
schema base {
document base {
field simple_base type string {
indexing: summary | index
}
struct base_struct {
field name type string {}
}
field complex_base type array<base_struct> {
indexing: summary
struct-field name { indexing: attribute }
}
}
}
schema sub {
document sub inherits base {
field simple_sub type string {
indexing: summary | index
}
struct sub_struct {
field name type string {}
}
field complex_sub type array<sub_struct> {
indexing: summary
struct-field name { indexing: attribute }
}
}
rank-profile default inherits default {
first-phase {
expression: nativeRank(simple_sub, complex_sub.name, simple_base, complex_base.name)
}
}
}
If you try to prepare this application, you will get the error
WARNING: invalid rank feature 'nativeRank(simple_sub,complex_sub.name,simple_base,complex_base.name)': The parameter list used for setting up rank feature nativeRank is not valid: Param[3]: Field 'complex_base.name' was not found in the index environment
If you take out the complex_base.name from the nativeRank function, it will compile properly. Additionally if you try to search using the various fields after feeding the following json:
{
"put": "id:content:sub::0",
"fields": {
"simple_base": "simple",
"simple_sub": "simple",
"complex_base": [
{
"name": "complex"
}
],
"complex_sub": [
{
"name": "complex"
}
]
}
}
Then the following are the results:
/search/?query=simple_sub:simple -> 1 result
/search/?query=complex_sub.name:complex -> 1 result
/search/?query=simple_base:simple -> 1 result
/search/?query=complex_base.name:complex -> 0 results
If this is the intended behavior, is there a workaround other than copy/pasting the complex_base field into every document that needs to inherit it? Thank you for your assistance.

This is not related to inheritance. Vespa.ai does not support ranking for user defined struct fields. See https://github.com/vespa-engine/vespa/issues/11580.

using .map (or another stdlib feature) to create a hash not array

I'm trying to target JMS servers in the cloud; the Puppet module's init.pp needs to add a key to a hash.
I'm reading a block of hiera and having to extract parts of it to form a new hash. .each doesn't return any values so I'm using .map.
The values I'm getting out are exactly as I want, however when I tried a deep_merge I discovered that .map outputs as an array.
service.yaml
jms_subdeployment_instances:
'BPMJMSModuleUDDs:BPMJMSSubDM':
ensure: 'present'
target:
- 'BPMJMSServer_auto_1'
- "BPMJMSServer_auto_%{::ec2_tag_name}"
targettype:
- 'JMSServer'
- 'JMSServer'
init.pp
$jms_subdeployments = lookup('jms_subdeployment_instances', $default_params)
$jms_target_args = $jms_subdeployments.map |$subdep, $value| {
$jms_short_name = $subdep[0, 3]
$jms_subdeployment_inst = $array_domain_jmsserver_addresses.map |$index, $server| {
"${jms_short_name}JMSServer_auto_${server}"
if defined('$jms_subdeployment_inst') {
$jmsTargetArg = {
"${subdep}" => {
'target' => $jms_subdeployment_inst
}
}
}
}
$merge_subdeployment_targets = merge($jms_subdeployments, $jms_target_args)
Output:
New JMS targets are : [{BPMJMSModuleUDDs:BPMJMSSubDM => {target => [BPMJMSServer_auto_server101, BPMJMSServer_auto_server102]}}]
The enclosing [ ] are causing me trouble. As far as I can see, in puppet .to_h doesn't work either
Thanks
Update 22/07/2019:
Thanks for the reply, I've had to tweak it slightly because puppet was failing with "Server Error: Evaluation Error: Error while evaluating a Method call, 'values' parameter 'hsh' expects a Hash value, got Tuple"
$array_domain_jmsserver_addresses =
any2array(hiera('pdb_domain_msserver_addresses'))
$array_domain_jmsserver_addresses.sort()
$jms_subdeployments = lookup('jms_subdeployment_instances', $default_params)
$hash_domain_jmsserver_addresses = Hash($array_domain_jmsserver_addresses)
if $hash_domain_jmsserver_addresses.length > 0 {
$jms_target_arg_tuples = $jms_subdeployments.keys.map |$subdep| {
$jms_short_name = $subdep[0, 3]
$jms_subdeployment_inst = regsubst(
$hash_domain_jmsserver_addresses.values, /^/, "${jms_short_name}JMSServer_auto_")
# the (key, value) tuple to which this element maps
[ $subdep, { 'target' => $jms_subdeployment_inst } ]
}
$jms_target_args = Hash($jms_target_arg_tuples)
} else {
$jms_target_args = {}
}
notify{"Normal array is : ${jms_subdeployments}": }
notify{"Second array is : ${jms_target_args}": }
$merge_subdeployment_targets = deep_merge($jms_subdeployments, $jms_target_args)
notify{"Merged array is : ${merge_subdeployment_targets}": }
Normal is : {BPMJMSModuleUDDs:BPMJMSSubDM => {ensure => present, target => [BPMJMSServer_auto_1, BPMJMSServer_auto_server1], targettype => [JMSServer, JMSServer]},
Second is : {BPMJMSModuleUDDs:BPMJMSSubDM => {target => [BPMJMSServer_auto_server2]}
Merged is : {BPMJMSModuleUDDs:BPMJMSSubDM => {ensure => present, target => [BPMJMSServer_auto_server2], targettype => [JMSServer, JMSServer]}
Desired output is:
{BPMJMSModuleUDDs:BPMJMSSubDM => {ensure => present, target => [BPMJMSServer_auto_1, BPMJMSServer_auto_server1, BPMJMSServer_auto_server2], targettype => [JMSServer, JMSServer, JMSServer]}

when I tried a deep_merge I discovered that .map outputs as an array.
Yes, this is its documented behavior. map() should be considered a function on the elements of a collection, not on the collection overall, and the results are always provided as an array.
It would probably be useful to look over the alternatives for converting values to hashes. Particularly attractive is this one:
An Array matching Array[Tuple[Any,Any], 1] is converted to a hash where each tuple describes a key/value entry
To make use of this, map each entry to a (key, value) tuple, and convert the resulting array of tuples to a hash. A conversion of your attempt to that approach might look something like this:
if $array_domain_jmsserver_addresses.length > 0 {
$jms_target_arg_tuples = $jms_subdeployments.keys.map |$subdep| {
$jms_short_name = $subdep[0, 3]
$jms_subdeployment_inst = regsubst(
$array_domain_jmsserver_addresses.sort, /^/, "${jms_short_name}JMSServer_auto_")
# the (key, value) tuple to which this element maps
[ $subdep, { 'target' => $jms_subdeployment_inst } ]
}
$jms_target_args = Hash($jms_target_arg_tuples)
} else {
$jms_target_args = {}
}
$merge_subdeployment_targets = merge($jms_subdeployments, $jms_target_args)
Note that since you don't use the values of $jms_subdeployments, I have taken the liberty of simplifying your code somewhat by applying the keys() function to it. I have also used regsubst() instead of map() to form target names from the elements of $array_domain_jmsserver_addresses, which I personally find more readable in this case, especially since you were not using the indexes.
I've also inferred what I think you meant your if defined() test to accomplish, and replaced it with the outermost test of the length of the $array_domain_jmsserver_addresses array. One could also write it in somewhat more functional form, by building the hash without regard to whether there are any targets, and then filter()ing it after, but that seems wasteful because it appears that either all entries will have (the same) targets, or none will.
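The pattern described above (map each entry to a (key, value) pair, then convert the list of pairs to a hash) is not specific to Puppet. As a rough illustration only, here is the same idea in Python, where dict() plays the role of Hash(). The variable names and sample values are assumptions modeled on the question, not part of the module.

```python
# Sample inputs modeled on the question (assumed values).
subdeployments = {"BPMJMSModuleUDDs:BPMJMSSubDM": {"ensure": "present"}}
servers = ["server102", "server101"]

# Map each key to a (key, value) pair; the first three characters of the
# key form the short name, like $subdep[0, 3] in the Puppet code.
pairs = [
    (name, {"target": [f"{name[:3]}JMSServer_auto_{s}" for s in sorted(servers)]})
    for name in subdeployments
]

# Convert the list of pairs to a mapping, or fall back to an empty one
# when there are no servers to target.
target_args = dict(pairs) if servers else {}

print(target_args)
```

The mapping step still produces a list; it is the final conversion that turns it into a hash keyed by subdeployment name.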

Using $rename in MongoDB for an item inside an array of objects

Consider the following MongoDB collection of a few thousand Objects:
{
_id: ObjectId("xxx")
FM_ID: "123"
Meter_Readings: Array
0: Object
Date: 2011-10-07
Begin_Read: true
Reading: 652
1: Object
Date: 2018-10-01
Begin_Reading: true
Reading: 851
}
The wrong key was entered for 2018 into the array and needs to be renamed to "Begin_Read". I have a list using another aggregate of all the objects that have the incorrect key. The objects within the array don't have an _id value, so are hard to select. I was thinking I could iterate through the collection and find the array index of the errored Readings and using the _id of the object to perform the $rename on the key.
I am trying to get the index of the array, but cannot seem to select it correctly. The following aggregate is what I have:
[
{
'$match': {
'_id': ObjectId('xxx')
}
}, {
'$project': {
'index': {
'$indexOfArray': [
'$Meter_Readings', {
'$eq': [
'$Meter_Readings.Begin_Reading', True
]
}
]
}
}
}
]
Its result is always -1 which I think means my expression must be wrong as the expected result would be 1.
I'm using Python for this script (I can use JavaScript as well). If there is a better way to do this (maybe a filter?), I'm open to alternatives; this is just what I've come up with.

I fixed this myself. I was close with the aggregate but needed to look at a different field; for some reason the original one did not work:
{
'$project': {
'index': {
'$indexOfArray': [
'$Meter_Readings.Water_Year', 2018
]
}
}
}
What I did learn was that to find an object within an array, you can reference one of its fields through the array path passed to $indexOfArray. I hope that might help someone else.
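For anyone who prefers to do the lookup client-side, the following is a plain-Python sketch of the same idea on an in-memory document: find the index of the array element that carries the wrong key, then rename that key. It does not use the MongoDB API, and writing the result back to the collection is left out. The helper names index_of_key and rename_key are made up, and the dates are shown as strings for simplicity (an assumption about the stored type).

```python
# In-memory stand-in for one document from the question.
doc = {
    "FM_ID": "123",
    "Meter_Readings": [
        {"Date": "2011-10-07", "Begin_Read": True, "Reading": 652},
        {"Date": "2018-10-01", "Begin_Reading": True, "Reading": 851},
    ],
}


def index_of_key(readings, key):
    """Index of the first element containing key, or -1 if none does."""
    return next((i for i, r in enumerate(readings) if key in r), -1)


def rename_key(readings, old, new):
    """Return a copy of readings with old renamed to new, keeping key order."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in readings]


idx = index_of_key(doc["Meter_Readings"], "Begin_Reading")
fixed = rename_key(doc["Meter_Readings"], "Begin_Reading", "Begin_Read")

print(idx)       # 1
print(fixed[1])  # {'Date': '2018-10-01', 'Begin_Read': True, 'Reading': 851}
```

The -1 result for a miss mirrors what $indexOfArray returns when the search value is not found.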

Array .map() returning undefined

This is the structure of the array when I console.log it.
-[]
-0: Array(31)
-0:
-date: "2018-08-26T00:00:00-04:00"
-registered:
-standard: 0
-vip: 0
-waitlisted:
-standard: 0
-vip: 0
This is my code to map the date and the registered (two separate arrays):
this.data.map((value) => value.date);
this.data.map((value) => value.registered['standard']);
I either get an empty array or undefined when I log these. What am I doing wrong?
I want to use these for a chart using ChartJS where:
this.lineChart = new Chart(lineCtx, {
type: 'line',
data: {
datasets: [{
label: (I want the dates to be the labels),
data: (I want the list of standard registrants)
}]...
EDIT:
I've updated the way I get the data to show the following structure:
{
"registrationHistory": [{
"date": "2018-08-26T00:00:00-4:00",
"registered": {
"vip":0,
"standard":0
},
"waitlisted":{
"vip":0,
"standard":0
}
{
,...
}
]}

Your array is two-dimensional and map is iterating only the first dimension, i.e.:
-[]
-0: Array(31) // first dimension
-0: // second dimension
-date: "2018-08-26T00:00:00-04:00"
...
This would look like the following JSON string:
[[{"date":"2018-08-26T00:00:00-04:00", ...}]]
Since you haven't provided a full example it's impossible to recommend the most applicable solution:
If you control the data source, remove the first dimension since it appears redundant.
Assuming you only want the first element of the first dimension, refer to that key:
this.data[0].map((value) => value.date);
If your data model is more complex than revealed in your question you'll need to figure out another approach.

Json creation in ruby for a list

I am new to Ruby.
I want to create a JSON file for a group of elements.
For this, I am using the each function to retrieve the data. I want to create JSON like the following for an array of length 4:
'{
"desc":{
"1":"1st Value",
"2":"2nd value",
"3":"3rd Value",
"4":"4th value"
}
}'
This is my array iteration,
REXML::XPath.each( doc, "//time" ) { |element1|
puts element1.get_text
}
I know this simple code generates JSON:
require 'json/add/core'
class Item < Struct.new(:id, :name); end
chair = Item.new(1, 'chair')
puts JSON.pretty_generate(chair)
This syntax will generate a json as follows,
{
"json_class": "Item",
"v": [
1,
"chair"
]
}
But I'm not sure how to do that to make JSON for my elements as stated above. Google search didn't give me a proper way to do this.
Can anyone help me here?

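Do you mean something like this?
require 'json'
my_arr = ["1st Value", "2nd Value", "3rd Value", "4th Value"]
tmp_str = {}
tmp_str["desc"] = {}
# Key each value by its 1-based position; keying on the string's first
# character would stop working once there are more than nine items.
my_arr.each_with_index do |x, i|
  tmp_str["desc"][(i + 1).to_s] = x
end
puts JSON.generate(tmp_str)
You can iterate over the string array and put the strings into a Hash object; JSON.generate can then serialize the Hash.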
