In PostgreSQL, my column type is json and the data is a JSON array like:
[{"attsId": "42a2ce04-52ab-4a3c-8dfb-98c3d14b307d", "planId": 46, "filePath": "fileOperate\\upload", "cfileName": "潜在客户名单 (1).xls", "ufileName": "42a2ce04-52ab-4a3c-8dfb-98c3d14b307d.xls"}, {"attsId": "1adb2f13-00b0-4780-ae76-7a068dc3289c", "planId": 46, "filePath": "fileOperate\\upload", "cfileName": "潜在客户名单.xls", "ufileName": "1adb2f13-00b0-4780-ae76-7a068dc3289c.xls"}, {"attsid": "452f6c62-28df-47c7-8c30-038339f7b223", "planid": 48.0, "filepath": "fileoperate\\upload", "cfilename": "技术市场印花税.xls", "ufilename": "452f6c62-28df-47c7-8c30-038339f7b223.xls"}]
I want to update one element of the array by key, something like:
UPDATE plan_base
set atts->.. = {"planId":47,"name":"qm"}
where atts->"planId"= 46 and id=2;
Does someone have a better way to do it?
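There is no atts->.. syntax, but one workaround is to unnest the array, patch the matching element, and aggregate it back. Below is a minimal sketch of that idea, assuming PostgreSQL 9.5+ (for the jsonb || merge operator); it runs the SQL through the psycopg2 driver purely for concreteness, and the DSN is a placeholder:

import psycopg2

# Rebuild the whole array: unnest it, merge the patch into the element whose
# planId is 46, and aggregate the elements back into a json value.
SQL = """
UPDATE plan_base p
SET atts = (
    SELECT jsonb_agg(
               CASE WHEN elem->>'planId' = '46'
                    THEN elem || '{"planId": 47, "name": "qm"}'::jsonb
                    ELSE elem
               END)
    FROM jsonb_array_elements(p.atts::jsonb) AS elem
)::json
WHERE p.id = 2;
"""

with psycopg2.connect("dbname=mydb user=me") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(SQL)

On 9.5+ with a jsonb column you could also patch a single element in place with jsonb_set, but that needs the element's array index rather than a key, so the unnest-and-rebuild approach above fits this question better.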
Related
table name: muscle_groups
fields: id, name, segment_ids
data:
{"f": [], "m": [31, 32, 33, 34, 35, 36, 38, 39]}
I tried many variations like:
select id, name, segment_ids->>"m"
where 5 = any(json_array_element(segment_ids->>"m")
You are looking for the containment operator @>:
select *
from muscle_groups
where segment_ids @> '{"m": [5]}'
(Note that @> is defined for jsonb; if the column is plain json, cast it first: segment_ids::jsonb @> '{"m": [5]}'.)
Sorry for my annoying questions again.
Having a bit of trouble with some code. The task is to write a function that takes a nested list of currency conversions and turns it into a table (I've attached some pictures for clarification).
(I could only attach one image; this is the nested list it converts:)
[[10, 9.6, 7.5, 6.7, 4.96], [20, 19.2, 15.0, 13.4, 9.92], [30, 28.799999999999997, 22.5, 20.1, 14.879999999999999], [40, 38.4, 30.0, 26.8, 19.84], [50, 48.0, 37.5, 33.5, 24.8], [60, 57.599999999999994, 45.0, 40.2, 29.759999999999998], [70, 67.2, 52.5, 46.900000000000006, 34.72], [80, 76.8, 60.0, 53.6, 39.68], [90, 86.39999999999999, 67.5, 60.300000000000004, 44.64], [100, 96.0, 75.0, 67.0, 49.6]]
I've got the header row for the table working fine.
I'm having issues iterating over each sublist in the nested list and converting each entry to a string (with two decimal places), with a tab between entries.
The code I've got so far is:
def printTable(cur):
    list2 = makeTable(cur)
    lst1 = Extract(cur)
    lst1.insert(0, "$NZD")
    lne1 = "\t".join(lst1)
    print(lne1)
    list = map(str, list2)
    print(list2)
    for list in list2:
        for elem in list:
            linelem = "\t".join(elem)
            print(linelem)

printTable(cur)
(Note: The first function I call and assign to list2 is what generates the data/nested list)
I've tried playing around a bit but I keep coming up with different error messages trying to convert each sublist to a string.
Thank you all for your help!
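The formatting step the question is stuck on can be done by formatting each number with an f-string and tab-joining the row; "\t".join() needs strings, so joining the raw floats raises a TypeError. A minimal self-contained sketch (the currency labels after $NZD are invented, and makeTable/Extract are replaced by a shortened literal list):

def print_table(rows):
    # header: "$NZD" plus one label per converted currency (labels invented here)
    print("\t".join(["$NZD", "USD", "EUR", "GBP", "JPY"]))
    for row in rows:
        # format every number to two decimal places, then tab-join the strings
        print("\t".join(f"{value:.2f}" for value in row))

rates = [[10, 9.6, 7.5, 6.7, 4.96],
         [20, 19.2, 15.0, 13.4, 9.92],
         [30, 28.8, 22.5, 20.1, 14.88]]
print_table(rates)

The f-string also hides the floating-point noise (28.799999999999997 prints as 28.80), which is exactly the two-decimal output the table needs.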
Try using this method:
import csv
import pandas as pd

# "rows" is the nested list from the question
with open("out.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

df = pd.read_csv("out.csv", names=["col1", "col2", "col3", "col4", "col5"])
print(df.round(2))

(Writing the list to CSV and reading it back is a detour; pd.DataFrame(rows, columns=[...]) builds the same table directly.)
I am trying to flatten a JSON file to be able to load it into PostgreSQL all in AWS Glue. I am using PySpark. Using a crawler I crawl the S3 JSON and produce a table. I then use an ETL Glue script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df2.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database, which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, so 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first due to the structure of the data. Specifically the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I also tried uploading a classifier so that the data would flatten in the crawl itself, but AWS confirmed this wouldn't work.
The JSON format of the original file that I am trying to normalise is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    # explode array columns first, so their elements become rows
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
    # then promote struct fields one level up, renaming "nc.c" to "nc_c"
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    # recurse until no struct columns remain
    return flatten_df(flat_df)

df = flatten_df(df)
It replaces all dots with underscores. Note that it uses explode_outer rather than explode so that rows are kept when the array itself is null; this function is available in Spark v2.4+ only.
Also remember that exploding an array duplicates the other columns across the new rows, while flattening a struct adds more columns. In short, your original df grows both vertically and horizontally, which may slow down processing later.
Therefore my recommendation would be to identify the feature-related data, store only that in PostgreSQL, and keep the original JSON files in S3.
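To see the explode vs explode_outer difference mentioned above concretely, here is a tiny standalone example, separate from the Glue job (the two-row DataFrame is invented):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# one row with an array, one row where the array itself is null
df = spark.createDataFrame([(1, [10, 20]), (2, None)], ["id", "vals"])

df.withColumn("v", F.explode("vals")).show()        # row id=2 is dropped
df.withColumn("v", F.explode_outer("vals")).show()  # row id=2 kept, v = null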
Once you have Relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example:
Nested JSON:
{
"player": {
"username": "user1",
"characteristics": {
"race": "Human",
"class": "Warlock",
"subclass": "Dawnblade",
"power": 300,
"playercountry": "USA"
},
"arsenal": {
"kinetic": {
"name": "Sweet Business",
"type": "Auto Rifle",
"power": 300,
"element": "Kinetic"
},
"energy": {
"name": "MIDA Mini-Tool",
"type": "Submachine Gun",
"power": 300,
"element": "Solar"
},
"power": {
"name": "Play of the Game",
"type": "Grenade Launcher",
"power": 300,
"element": "Arc"
}
},
"armor": {
"head": "Eye of Another World",
"arms": "Philomath Gloves",
"chest": "Philomath Robes",
"leg": "Philomath Boots",
"classitem": "Philomath Bond"
},
"location": {
"map": "Titan",
"waypoint": "The Rig"
}
}
}
Flattened-out JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus in your case, request.data is already a new column flattened out from the request column, and its type is interpreted as bigint by Spark.
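A quick way to verify this from the question's own script (a sketch; df2 is the post-Relationalize DataFrame): the period is just part of the flattened column name, so the name must be backquoted when selecting, and the schema shows a plain scalar:

df2.printSchema()                     # `request.data` shows up as: long (bigint)
df2.select("`request.data`").show(5)  # a flat scalar column; nothing to explode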
Reference: Simplify querying nested JSON with the AWS Glue Relationalize transform
I am currently evaluating graph databases, and other databases, against one specific requirement:
The ability to retrieve, in one query, the top n children of each node in a tree, ranked by an aggregated property over the node's direct children and all of their direct and indirect children. The result should preserve the correct hierarchical structure.
Example
root
+ 11
++ 111
+++ 1111
++ 112
+++ 1121
+++ 1122
+++ 1123
++ 113
+ 12
++ 121
++ 122
+++ 1221
+++ 1222
+++ 1223
++ 123
+ 13
++ 131
++ 132
++ 133
++ 134
+ 14
Each node has a property recording how many direct children it has, and the tree has no more than 8 levels. Say I want to query the entire tree, keeping at each level only each node's top 2 children by total count of direct and indirect children. It would give us the following:
root
+ 11
++ 111
+++ 1111
++ 112
+++ 1121
+++ 1122
+ 12
++ 121
++ 122
+++ 1221
+++ 1222
I am wondering whether there is any graph database, or any other database, that supports such a query efficiently, and if so, how?
Using Neo4j
You can do this with Neo4j, but you'll need to ensure you're using the APOC Procedures plugin for access to some of the map and collection functions and procedures.
One thing to note first: you didn't define any criteria for choosing between child nodes when their descendant counts are tied. As such, the results of the following may not match yours exactly, as alternate nodes (with tied counts) may have been selected. If you do need additional criteria for the ordering and selection, you will have to add that to your description so I can modify the queries accordingly.
Create the test graph
First, let's create the test data set. We can do this through the Neo4j browser.
First let's set the parameters we'll need to create the graph:
:param data => [{id:11, children:[111, 112, 113]}, {id:12, children:[121, 122, 123]}, {id:13, children:[131,132,133,134]}, {id:14, children:[]}, {id:111, children:[1111]}, {id:112, children:[1121, 1122, 1123]}, {id:122, children:[1221,1222,1223]}]
Now we can use this query to use those parameters to create the graph:
UNWIND $data as row
MERGE (n:Node{id:row.id})
FOREACH (x in row.children |
MERGE (c:Node{id:x})
MERGE (n)-[:CHILD]->(c))
We're working with nodes of type :Node connected to each other by :CHILD relationships, outgoing toward the leaf nodes.
Let's also add a :Root:Node at the top level to make some of our later queries a bit easier:
MERGE (r:Node:Root{id:0})
WITH r
MATCH (n:Node)
WHERE NOT ()-[:CHILD]->(n)
MERGE (r)-[:CHILD]->(n)
The :Root node is now connected to the top nodes (11, 12, 13, 14) and our test graph is ready.
The Actual Query
Because the aggregation you want needs the count of all descendants of a node and not just its immediate children, we can't use the child count property of how many direct children a node has. Or rather, we COULD, summing the counts from all descendants of the node, but since that requires us to traverse down to all descendants anyway, it's easier to just get the count of all descendants and avoid property access entirely.
Here's the query in its entirety; you should be able to run it as-is on the test graph. I'm breaking it into sections with linebreaks and comments to better show what each part is doing.
// for each node and its direct children,
// order by the child's descendant count
MATCH (n:Node)-[:CHILD]->(child)
WITH n, child, size((child)-[:CHILD*]->()) as childDescCount
ORDER BY childDescCount DESC
// now collect the ordered children and take the top 2 per node
WITH n, collect(child)[..2] as topChildren
// from the above, per row, we have a node and a list of its top 2 children.
// we want to gather all of these children into a single list, not nested
// so we collect the lists (to get a list of lists of nodes), then flatten it with APOC
WITH apoc.coll.flatten(collect(topChildren)) as topChildren
// we now have a list of the nodes that can possibly be in our path
// although some will not be in the path, as their parents (or ancestors) are not in the list
// to get the full tree we need to match down from the :Root node and ensure
// that for each path, the only nodes in the path are the :Root node or one of the topChildren
MATCH path=(:Root)-[:CHILD*]->()
WHERE all(node in nodes(path) WHERE node:Root OR node in topChildren)
RETURN path
Without the comments, this is merely an 8-line query.
Now, this actually returns multiple paths, one per row; taken together, the paths make up the visual tree you're after if you view the graphical results.
Getting the results as a tree in JSON
However, if you're not using a visualizer to view the results graphically, you would probably want a JSON representation of the tree. We can get that by collecting all the result paths and using a procedure from APOC to produce the JSON tree structure. Here's a slightly modified query with those changes:
MATCH (n:Node)-[:CHILD]->(child)
WITH n, child, size((child)-[:CHILD*]->()) as childDescCount
ORDER BY childDescCount DESC
WITH n, collect(child)[..2] as topChildren
WITH apoc.coll.flatten(collect(topChildren)) as topChildren
MATCH path=(:Root)-[:CHILD*]->()
WHERE all(node in nodes(path) WHERE node:Root OR node in topChildren)
// below is the new stuff to get the JSON tree
WITH collect(path) as paths
CALL apoc.convert.toTree(paths) YIELD value as map
RETURN map
The result will be something like:
{
"_type": "Node:Root",
"_id": 52,
"id": 0,
"child": [
{
"_type": "Node",
"_id": 1,
"id": 12,
"child": [
{
"_type": "Node",
"_id": 6,
"id": 122,
"child": [
{
"_type": "Node",
"_id": 32,
"id": 1223
},
{
"_type": "Node",
"_id": 31,
"id": 1222
}
]
},
{
"_type": "Node",
"_id": 21,
"id": 123
}
]
},
{
"_type": "Node",
"_id": 0,
"id": 11,
"child": [
{
"_type": "Node",
"_id": 4,
"id": 111,
"child": [
{
"_type": "Node",
"_id": 26,
"id": 1111
}
]
},
{
"_type": "Node",
"_id": 5,
"id": 112,
"child": [
{
"_type": "Node",
"_id": 27,
"id": 1121
},
{
"_type": "Node",
"_id": 29,
"id": 1123
}
]
}
]
}
]
}
So, I'm trying to set up check_json.pl in NagiosXI to monitor some statistics. https://github.com/c-kr/check_json
I'm using the code with the modification I submitted in pull request #32, so line numbers reflect that code.
The json query returns something like this:
[
{
"total_bytes": 123456,
"customer_name": "customer1",
"customer_id": "1",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
},
{
"total_bytes": 123456,
"customer_name": "customer2",
"customer_id": "2",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
}
]
I'm trying to monitor the sizes of specific files, so a check should look something like:
/path/to/check_json.pl -u https://path/to/my/json -a "SOMETHING" -p "SOMETHING"
...where I'm trying to figure out the SOMETHINGs so that I can monitor the total_bytes of filename1 in customer2 where I know the customer_id and index but not their position in the respective arrays.
I can monitor customer1's total bytes by using the string "[0]->{'total_bytes'}", but I need to be able to specify which customer and dig deeper to the file name (known) and file size (the stat to monitor). On top of that, the working query only gives me the status (OK, WARNING, or CRITICAL); whenever I add -p, all I get are errors.
The error with -p, no matter how I phrase it, is always:
Not a HASH reference at ./check_json.pl line 235.
Even when I can get a valid OK from the example "[0]->{'total_bytes'}", using that in -p still gives the same error.
Links pointing to documentation on the format to use would be very helpful. Examples in the README for the script or in the -h output are failing me here. Any ideas?
I really have no idea what your question is. I'm sure I'm not alone, hence the downvotes.
Once you have the decoded json, if you have a customer_id to search for, you can do:
my ($customer_info) = grep { $_->{customer_id} eq $customer_id } @$json_response;
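Extending that one level further to the lookup the question actually wants (customer_id plus index name), here is the same two-step navigation sketched in Python for clarity; it doesn't answer the -p syntax question, but shows the structure (the data is an abbreviated stand-in for the sample JSON above):

import json

# abbreviated stand-in for the JSON the check fetches (structure as in the question)
response_text = '''
[{"customer_id": "1", "total_bytes": 123456,
  "indices": [{"index": "filename1", "total_bytes": 12345}]},
 {"customer_id": "2", "total_bytes": 123456,
  "indices": [{"index": "filename1", "total_bytes": 12345},
              {"index": "filename2", "total_bytes": 45678}]}]
'''
data = json.loads(response_text)

# first select the customer by id, then the index entry by file name
customer = next(c for c in data if c["customer_id"] == "2")
entry = next(i for i in customer["indices"] if i["index"] == "filename1")
print(entry["total_bytes"])  # -> 12345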
Regarding the error on line 235, this looks odd:
foreach my $key ($np->opts->perfvars eq '*' ? map { "{$_}"} sort keys %$json_response : split(',', $np->opts->perfvars)) {
# ....................................... ^^^^^^^^^^^^^
$perf_value = $json_response->{$key};
If perfvars eq "*", you appear to be looking for $json_response->{"{total}"}, for example. You might want to validate the user's input:
die "no such key in json data: '$key'\n" unless exists $json_response->{$key};
This entire business of stringifying the hash ref lookups just smells bad.
A better question would look like:
I have this JSON data. How do I get the sum of total_bytes for the customer with id 1?
See https://stackoverflow.com/help/mcve