Postgres INET to Snowflake compatible

I need help migrating a Postgres function to a Snowflake function.
Currently, I have the following function that takes an ip_address and returns the start of the IP address's range:
...
begin
if (p_ip_address is not null) then
return p_ip_address::inet - '0.0.0.0';
else
return null;
end if;
end
...
I know Snowflake has PARSE_IP, which returns a JSON object, but I need just one piece of that JSON (ipv4_range_start).
For example:
Input = 192.168.242.188
Output = 3232297472

If you only want a piece of the information, select it:
select parse_ip('127.0.0.0/24','INET') as a
,a:ipv4_range_end;
gives:
A = { "family": 4, "host": "127.0.0.0", "ip_fields": [ 2130706432, 0, 0, 0 ], "ip_type": "inet", "ipv4": 2130706432, "ipv4_range_end": 2130706687, "ipv4_range_start": 2130706432, "netmask_prefix_length": 24, "snowflake$type": "ip_address" }
A:IPV4_RANGE_END = 2130706687
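The same pattern gives the start of the range; ipv4_range_start is the value already visible in the JSON above. As a further, hedged sketch, the Postgres function could be mirrored as a Snowflake SQL UDF (the function name is illustrative, and it assumes PARSE_IP's output can be traversed with the : operator as in the query above):

select parse_ip('127.0.0.0/24', 'INET') as a
      ,a:ipv4_range_start::number as range_start;
-- range_start = 2130706432 (the ipv4_range_start value shown in the JSON above)

-- Sketch of a SQL UDF mirroring the Postgres function; the name is illustrative.
create or replace function ip_range_start(p_ip_address varchar)
returns number
as
$$
  iff(p_ip_address is null,
      null,
      parse_ip(p_ip_address, 'INET'):ipv4_range_start::number)
$$;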

Neo4j: How to do queries on a specific database?

I'm new to Neo4j and want to implement a service that makes use of it.
I've read the docs and searched around, but I still didn't find an answer to this simple question:
How do I specify which database to query in a Neo4j query?
E.g. I connected to bolt://localhost:7687 and have three databases in there: system, neo4j, and mydb. The neo4j database is the default.
When I open the Neo4j Browser and run a query such as MATCH (n) RETURN n, it automatically assumes that I want to query the default DB, which is called neo4j. However, I want to query another one, mydb.
The output of the aforementioned query says
{
  "query": {
    "text": "match (n) return n",
    "parameters": {}
  },
  "queryType": "r",
  "counters": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "updateStatistics": {
    "_stats": {
      "nodesCreated": 0,
      "nodesDeleted": 0,
      "relationshipsCreated": 0,
      "relationshipsDeleted": 0,
      "propertiesSet": 0,
      "labelsAdded": 0,
      "labelsRemoved": 0,
      "indexesAdded": 0,
      "indexesRemoved": 0,
      "constraintsAdded": 0,
      "constraintsRemoved": 0
    },
    "_systemUpdates": 0
  },
  "plan": false,
  "profile": false,
  "notifications": [],
  "server": {
    "address": "localhost:7687",
    "version": "Neo4j/4.4.5",
    "agent": "Neo4j/4.4.5",
    "protocolVersion": 4.4
  },
  "resultConsumedAfter": {
    "low": 2,
    "high": 0
  },
  "resultAvailableAfter": {
    "low": 8,
    "high": 0
  },
  "database": {
    "name": "neo4j"
  }
}
The last JSON value is proof that the query was executed against the neo4j database.
What do I have to add to my queries to query another database in the same DBMS instead?
You can change or specify the database using one of the following options.
From the Neo4j Browser, you can select the database in the sidebar.
In Neo4j Browser or cypher-shell, the :use command lets you switch databases:
:use mydb
If you connect to Neo4j through an application driver, you can specify the database while creating the session object (a fuller sketch follows after this list).
For example, if you are using the Python driver:
from neo4j import GraphDatabase
driver = GraphDatabase.driver(uri, auth=(user, password))
session = driver.session(database="mydb")
You can also set the default database system-wide by modifying the dbms.default_database value in the neo4j.conf file.
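Putting the driver option together, a minimal end-to-end sketch (the URI and credentials are placeholders):

from neo4j import GraphDatabase

# Placeholder connection details; replace with your own.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The database argument routes every query in this session to mydb instead of the default neo4j database.
with driver.session(database="mydb") as session:
    result = session.run("MATCH (n) RETURN n LIMIT 5")
    for record in result:
        print(record["n"])

driver.close()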

Parsing pipe-delimited JSON data in Python

I am trying to parse an API response, which is JSON. The JSON looks like this:
{
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98|'
}
Its length is 1200; if I convert it, I should get 1200 rows. My goal is to parse this JSON like below:
id name type address iplong state macAddress
112 stalin-PC IP4Address 10.0.1.110 277893412 DHCP Allocated 41-1z-y4-23-dd-98
I am getting the first 3 elements but having an issue with the "properties" key, whose value is pipe-delimited. I have tried the code below:
for network in networks:  # here networks = response.json()
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    print(network_id, network_name, network_type)
It works fine and gives me the result:
112 stalin-PC IP4Address
But when I tried to parse the properties key with the code below, it's not working.
for network in networks:
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    for line in network['properties']:
        properties_value = line.split('|')
        network_address = properties_value[0]
        print(network_id, network_name, network_type, network_address)
How can I parse the pipe-delimited properties key? Would anyone help me, please?
Thank you
Use str methods.
Example:
network = {
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98'
}

for n in network['properties'].split("|"):
    key, value = n.split("=")
    print(key, "-->", value)
Output:
address --> 10.0.1.110
ipLong --> 277893412
state --> DHCP Allocated
macAddress --> 41-1z-y4-23-dd-98
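To produce the flat rows shown in the question (id, name, type plus the address fields), one way is to merge the top-level keys with the parsed properties pairs. A sketch, assuming networks is the list returned by response.json() as in the question:

rows = []
for network in networks:
    row = {
        'id': network['id'],
        'name': network['name'],
        'type': network['type'],
    }
    # rstrip('|') guards against the trailing pipe seen in the question's sample value
    for pair in network['properties'].rstrip('|').split('|'):
        key, value = pair.split('=', 1)
        row[key] = value
    rows.append(row)

print(rows[0])
# e.g. {'id': 112, 'name': 'stalin-PC', 'type': 'IP4Address', 'address': '10.0.1.110',
#       'ipLong': '277893412', 'state': 'DHCP Allocated', 'macAddress': '41-1z-y4-23-dd-98'}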

How to flatten an array in nested JSON in AWS Glue using PySpark?

I am trying to flatten a JSON file so that I can load it into PostgreSQL, all in AWS Glue. I am using PySpark. Using a crawler I crawl the S3 JSON and produce a table. I then use a Glue ETL script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df1.select(explode(col('`request.data`')).alias("request_data"))
<then i write df1 to a PostgreSQL database which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, and therefore 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first due to the structure of the data. Specifically the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I tried to also upload a classifier so that the data would flatten in the crawl itself but AWS confirmed this wouldn't work.
The JSON format of the original file, which I am trying to normalise, is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    # Explode every array column; explode_outer keeps rows whose array is null
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
    # If no struct columns remain, the frame is flat
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    # Promote each struct field to a top-level column named <parent>_<child>
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    # Recurse until nothing nested is left
    return flatten_df(flat_df)

df = flatten_df(df)
It will replace all dots with underscores. Note that it uses explode_outer and not explode, so that rows are kept even when the array itself is null; the function is available in Spark 2.4+ only (a tiny standalone sketch of the difference follows below).
Also remember that exploding an array adds more rows (duplicates) and flattening a struct adds more columns. In short, your original df will grow both vertically and horizontally, which may slow down processing the data later.
Therefore my recommendation would be to identify the feature-related data, store only that in PostgreSQL, and keep the original JSON files in S3.
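For illustration, a tiny standalone sketch of the explode vs explode_outer difference (toy data, not from the question):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "arr"])

df.select("id", F.explode("arr")).show()        # the row with the null array is dropped
df.select("id", F.explode_outer("arr")).show()  # the row is kept, with a null value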
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example:
Nested JSON:
{
  "player": {
    "username": "user1",
    "characteristics": {
      "race": "Human",
      "class": "Warlock",
      "subclass": "Dawnblade",
      "power": 300,
      "playercountry": "USA"
    },
    "arsenal": {
      "kinetic": {
        "name": "Sweet Business",
        "type": "Auto Rifle",
        "power": 300,
        "element": "Kinetic"
      },
      "energy": {
        "name": "MIDA Mini-Tool",
        "type": "Submachine Gun",
        "power": 300,
        "element": "Solar"
      },
      "power": {
        "name": "Play of the Game",
        "type": "Grenade Launcher",
        "power": 300,
        "element": "Arc"
      }
    },
    "armor": {
      "head": "Eye of Another World",
      "arms": "Philomath Gloves",
      "chest": "Philomath Robes",
      "leg": "Philomath Boots",
      "classitem": "Philomath Bond"
    },
    "location": {
      "map": "Titan",
      "waypoint": "The Rig"
    }
  }
}
Flattened JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus, in your case, request.data is already a new column flattened out of the request column, and its type is interpreted as bigint by Spark.
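To see this concretely, you can list the frames that Relationalize produced. A hedged sketch reusing the variables from the question's script (the exact frame names depend on your data, so check what keys() prints):

# df0 is the DynamicFrameCollection returned by Relationalize in the script above.
print(df0.keys())        # the root frame plus one frame per pivoted array

root_df = df0.select(dfc_root_table_name).toDF()
root_df.printSchema()    # if request.data was an array, it appears here as a numeric id
                         # referencing its own pivoted frame, which is why Spark reports bigint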
Reference: Simplify querying nested JSON with the AWS Glue Relationalize transform

Why can't I retrieve the master appointment of a series via `AppointmentCalendar.FindAppointmentsAsync`?

I'm retrieving multiple appointments via AppointmentCalendar.FindAppointmentsAsync. I'm evaluating the Recurrence.RecurrenceType and noticed an unexpected value of 1 for master appointments of a series. I expect the Recurrence.RecurrenceType to be 0 (Master) but instead it is 1 (Instance).
(Note: I added AppointmentProperties.Recurrence to the FindAppointmentsOptions.FetchProperties that is passed to FindAppointmentsAsync, so the Recurrence data should be fetched properly.)
To double-check, I retrieved the respective master appointment via GetAppointmentAsync (instead of FindAppointmentsAsync) using its LocalId, and here the RecurrenceType is correctly set to 0.
Here is demo output for a test appointment series:
Data returned by FindAppointmentsAsync (Instance??):
"Recurrence": {
  "Unit": 0,
  "Occurrences": 16,
  "Month": 1,
  "Interval": 1,
  "DaysOfWeek": 0,
  "Day": 1,
  "WeekOfMonth": 0,
  "Until": "2016-09-29T02:00:00+02:00",
  "TimeZone": "Europe/Budapest",
  "RecurrenceType": 1,
  "CalendarIdentifier": "GregorianCalendar"
},
"StartTime": "2016-09-14T19:00:00+02:00",
"OriginalStartTime": "2016-09-14T19:00:00+02:00",
Data returned by GetAppointmentAsync for the same appointment (Master):
"Recurrence": {
  "Unit": 0,
  "Occurrences": 16,
  "Month": 1,
  "Interval": 1,
  "DaysOfWeek": 0,
  "Day": 1,
  "WeekOfMonth": 0,
  "Until": "2016-09-29T02:00:00+02:00",
  "TimeZone": "Europe/Budapest",
  "RecurrenceType": 0,
  "CalendarIdentifier": "GregorianCalendar"
},
"StartTime": "2016-09-14T19:00:00+02:00",
"OriginalStartTime": null,
Notice the difference in RecurrenceType. Also note that OriginalStartTime is set to null for the master returned by GetAppointmentAsync, but has a value for the appointment returned by FindAppointmentsAsync.
You can also see that the StartTime for the master appointment is the start time set for the alleged Instance (which in reality is the master).
Shouldn't FindAppointmentsAsync return a master as the first element of a series, instead of an instance?
(SDK: 10.0.14393.0, Anniversary)
Code to explicitly find such a master/instance situation for a given calendar:
var appointmentsCurrent = await calendar.FindAppointmentsAsync(DateTimeOffset.Now, TimeSpan.FromDays(365), findAppointmentOptions);
foreach (var a in appointmentsCurrent)
{
    var a2 = await calendar.GetAppointmentAsync(a.LocalId);
    if (a2.Recurrence?.RecurrenceType == RecurrenceType.Master &&
        a2.StartTime == a.StartTime &&
        a.Recurrence?.RecurrenceType == RecurrenceType.Instance &&
        a.OriginalStartTime == a2.StartTime)
    {
        Debug.WriteLine("Gotcha!");
    }
}
I tested the above code on my side. If you get the count of the appointments returned by FindAppointmentsAsync with var count = appointmentsCurrent.Count;, you will find that it is the count of the appointment instances, not the count of master appointments. So the FindAppointmentsAsync method returns all instances of the appointments, not the master appointments. This is the reason why the RecurrenceType is Instance.
It seems we can get the master appointment via the GetAppointmentAsync method, as you mentioned above, so I suppose this may not block you.
If you think this is not a good design for this API, or you require an API for finding all the master appointments in one calendar, you can submit your ideas through the Windows 10 feedback tool or the UserVoice site.

Opa database: how to know if a value exists in the database?

I'd like to know if a record is present in a database, using a field different from the key field.
I've tried the following code:
function start()
{
    jlog("start db query")
    myType d1 = {A:"rabbit", B:"poney"};
    /myDataBase/data[A == d1.A] = d1
    jlog("db write done")
    option opt = ?/myDataBase/data[B == "rabit"]
    jlog("db query done")
    match(opt)
    {
        case {none} : <>Nothing in db</>
        case {some:data} : <>{data} in database</>
    }
}

Server.start(
    {port:8092, netmask:0.0.0.0, encryption: {no_encryption}, name:"test"},
    [
        {page: start, title: "test" }
    ]
)
But the server hangs up and never gets to the line jlog("db query done"). I misspelled "rabit" deliberately. What should I have done?
Thanks
Indeed, it fails with your example; I get a "Match failure 8859742" exception, don't you see it?
But do not use ?/myDataBase/data[B == "rabit"] (which should have been rejected at compile time; bug report sent). Use /myDataBase/data[B == "rabit"] instead, which is a DbSet. The reason is that when you don't query by the primary key, you can get more than one value in return, i.e. a set of values.
You can convert a DbSet to an iterator with DbSet.iterator, then manipulate it with Iter: http://doc.opalang.org/module/stdlib.core.iter/Iter
