Hive lateral view not working AWS Athena - arrays

I'm working on AWS CloudTrail log analysis and I'm getting stuck extracting JSON from a row.
This is my table definition.
CREATE EXTERNAL TABLE cloudtrail_logs (
eventversion STRING,
eventName STRING,
awsRegion STRING,
requestParameters STRING,
elements STRING ,
additionalEventData STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://XXXXXX/CloudTrail'
If I run select elements from cl1 limit 1 it returns this result.
{"groupId":"sg-XXXX","ipPermissions":{"items":[{"ipProtocol":"tcp","fromPort":22,"toPort":22,"groups":{},"ipRanges":{"items":[{"cidrIp":"0.0.0.0/0"}]},"prefixListIds":{}}]}}
I need to show this result as virtual columns, like:
| groupId | ipProtocol | fromPort | toPort | ipRanges.items.cidrIp |
|---------|------------|----------|--------|-----------------------|
| -1      | 0          |          |        |                       |
I'm using AWS Athena; I tried LATERAL VIEW and get_json_object, but neither works in Athena.
It's an external table.

Athena is Presto-based, so Hive's LATERAL VIEW and get_json_object are not available; use json_extract/json_extract_scalar together with CROSS JOIN UNNEST instead:
select json_extract_scalar(i.item,'$.ipProtocol') as ipProtocol
,json_extract_scalar(i.item,'$.fromPort') as fromPort
,json_extract_scalar(i.item,'$.toPort') as toPort
from cloudtrail_logs
cross join unnest (cast(json_extract(elements,'$.ipPermissions.items')
as array(json))) as i (item)
;
ipProtocol | fromPort | toPort
------------+----------+--------
"tcp" | 22 | 22

Related

Store a list of values as a string when creating a table in snowflake

I am trying to create a table with 5 columns. COLUMN #2 (PROGRESS) is a comma-separated list (i.e. 1,2,3,4 etc.), but when I try to create this column as either a string, variant, or varchar, Snowflake refuses to allow it. Any advice on how I can create a comma-separated list column from a CSV? I tried to import the data as a TSV, XML, and JSON file, but had no success.
create or replace TABLE AD_HOC.TEMP.NEW_DATA (
VISITOR_ID VARCHAR(16777216),
PROGRESS VARCHAR(16777216),
DATE DATETIME,
ROLE VARCHAR(16777216),
FIRST_VISIT DATETIME
)COMMENT='Interaction data'
;
Goal:
VISITOR_ID | PROGRESS  | DATE      | ROLE  | FIRST_VISIT
111        | [1,2,3]   | 1/1/2022  | OWNER | 1/1/2021
123        | [1]       | 1/2/2022  | ADMIN | 2/2/2021
23321      | [1,2,3,4] | 2/22/2022 | USER  | 3/12/2021
I encoded the column in Python and loaded the data into Snowflake!
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# One-hot encode the PROGRESS lists and join the indicator columns back onto the frame
mlb = MultiLabelBinarizer()
df = doc_data.join(pd.DataFrame(mlb.fit_transform(doc_data.pop('PROGRESS')),
                                columns=mlb.classes_,
                                index=doc_data.index))
df
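For what it's worth, a Snowflake-side alternative is to declare PROGRESS as an ARRAY and split the comma-separated string during the load. This is only a sketch: AD_HOC.TEMP.RAW_DATA is a hypothetical staging table holding the CSV rows as-is, with PROGRESS still a plain VARCHAR.
-- Sketch only: RAW_DATA is a hypothetical staging table
create or replace TABLE AD_HOC.TEMP.NEW_DATA (
VISITOR_ID VARCHAR(16777216),
PROGRESS ARRAY, -- stored as ['1','2','3'] rather than '1,2,3'
DATE DATETIME,
ROLE VARCHAR(16777216),
FIRST_VISIT DATETIME
)COMMENT='Interaction data'
;
insert into AD_HOC.TEMP.NEW_DATA
select VISITOR_ID,
       split(PROGRESS, ','),  -- SPLIT turns '1,2,3' into an array of strings
       DATE, ROLE, FIRST_VISIT
from AD_HOC.TEMP.RAW_DATA;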

How to inner join two windowed tables in Flux query language?

The goal is to join tables min and max returned by the following query:
data = from(bucket: "my_bucket")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
min = data
|> aggregateWindow(
every: 1d,
fn: min,
column: "_value")
max = data
|> aggregateWindow(
every: 1d,
fn: max,
column: "_value")
The columns of max look like this:
+---------------------------------+
| Columns |
+---------------------------------+
| table MAX |
| _measurement GROUP STRING |
| _field GROUP STRING |
| _value NO GROUP DOUBLE |
| _start GROUP DATETIME:RFC3339 |
| _stop GROUP DATETIME:RFC3339 |
| _time NO GROUP DATETIME:RFC3339 |
| env GROUP STRING |
| path GROUP STRING |
+---------------------------------+
The min table looks the same except for the name of the first column. Both tables return data, which can be confirmed by running yield(tables: min) or yield(tables: max). The join should be an inner join on the columns _measurement, _field, _time, env and path, and it should contain both the minimum and the maximum _value of every window.
When I try to run the following within the InfluxDB Data Explorer
join(tables: {min: min, max: max}, on: ["_time", "_field", "path", "_measurement", "env"], method: "inner")
I get the following error:
Failed to execute Flux query
When I run the job in Bash via influx query --file ./query.flux -r > ./query.csv; I get the following error:
Error: failed to execute query: 504 Gateway Timeout: unable to decode response content type "text/html; charset=utf-8"
No further logging output is available to investigate the issue. What's wrong with this join?
According to the docs, join can only take two tables as parameters. You could try union instead, which can take two or more tables as input; see the union documentation for more details.
You might just need to modify the script as below:
union(tables: [min, max])

How can I update many rows with different values for the same column?

I have a table with a column containing a path to a file. The path is an absolute path, and values for this column look like this: C:\CI\Media\animal.jpg.
The table looks like so, except there are many rows so editing by hand is not practical:
+----+-----------------------------------+
| ID | Path                              |
+----+-----------------------------------+
| 1  | C:\CI\Media\sushi.jpg             |
| 2  | C:\CI\Media\animal.jpg            |
| 3  | C:\CI\Media\Tuscany Trip\pisa.png |
+----+-----------------------------------+
Path is an nvarchar(260)
And what I'd like to do is run a query that will update each record so the path for each record replaces C:\CI\ with C:\CI\Net, and end up with a table that looks like so:
+----+---------------------------------------+
| ID | Path                                  |
+----+---------------------------------------+
| 1  | C:\CI\Net\Media\sushi.jpg             |
| 2  | C:\CI\Net\Media\animal.jpg            |
| 3  | C:\CI\Net\Media\Tuscany Trip\pisa.png |
+----+---------------------------------------+
Is there a way to format a query that will update every record, but update it based on the existing value (replace the C:\CI portion with C:\CI\Net for each record while maintaining the rest of the value), instead of setting each column to the same value like a normal Update table set column = value?
Gosh you almost wrote the code yourself.
Update YourTable
set path = replace(path, 'C:\CI', 'C:\CI\Net')
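If there's any chance the script runs more than once, a slightly more defensive sketch of the same idea guards the REPLACE so rows already under C:\CI\Net\ are left untouched:
Update YourTable
set path = replace(path, 'C:\CI\', 'C:\CI\Net\')
where path like 'C:\CI\%'        -- only rows under C:\CI\
  and path not like 'C:\CI\Net\%' -- skip rows that were already converted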

importing json array into hive

I'm trying to import the following JSON into Hive:
[{"time":1521115600,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115659,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115720,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":1521115779,"latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]
CREATE TABLE json_serde (
s array<struct<time: timestamp, latitude: string, longitude: string, pm1: string>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'mapping.value' = 'value'
)
STORED AS TEXTFILE
location '/user/hduser';
The import works, but if I try
Select * from json_serde;
it returns only the first element per file, from every document under /user/hduser.
Is there good documentation on working with JSON arrays?
If I may suggest another approach: just load the whole JSON string into a single String column of an external table. The only restriction is to define LINES TERMINATED BY properly, e.g. if each JSON document is on its own line, you can create the table as below:
CREATE EXTERNAL TABLE json_data_table (
json_data String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001' LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION '/path/to/json';
Then use Hive's get_json_object to extract individual columns. This function supports basic XPath-like queries against a JSON string. For example, if the json_data column holds the JSON string below:
{"store":
{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
"bicycle":{"price":19.95,"color":"red"}
},
"email":"amy#only_for_json_udf_test.net",
"owner":"amy"
}
The query below
SELECT get_json_object(json_data, '$.owner') FROM json_data_table;
returns amy.
In this way you can extract each JSON element as a column from the table.
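Applied to the sensor JSON from the question, the same idea looks roughly like this (a sketch, assuming each top-level array sits on a single line of json_data and that your Hive version accepts array indexes such as $[0] in the path):
SELECT get_json_object(json_data, '$[0].time')     AS time_0,
       get_json_object(json_data, '$[0].latitude') AS latitude_0,
       get_json_object(json_data, '$[0].pm1')      AS pm1_0
FROM json_data_table;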
You have an array of structs. What you pasted is only one line.
If you want to see all the elements, you need to use inline
SELECT inline(s) FROM json_serde;
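With the table definition from the question, the same inline call can also be written as a LATERAL VIEW to get named columns per array element (a sketch, reusing the struct fields you declared):
SELECT t.`time`, t.latitude, t.longitude, t.pm1
FROM json_serde
LATERAL VIEW inline(s) t AS `time`, latitude, longitude, pm1;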
Alternatively, you need to rewrite your files such that each object within that array is a single JSON object on its own line of the file
Also, I don't see a value field in your data, so I'm not sure what you're mapping in the serde properties
The JSON that you provided will not work well here: this SerDe expects each record to be a single JSON object, starting with an opening curly brace "{" and ending with a closing curly brace "}", rather than a top-level array.
So, the first thing to fix is the shape of your JSON.
Your JSON should have been like the one below:
{"key":[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"}]}
And the second thing is that you have declared the data type of the time field as timestamp, but the data (1521115600) is a numeric epoch value, not a formatted timestamp. The timestamp data type needs data in the format YYYY-MM-DD HH:MM:SS[.fffffffff].
So, your data should ideally be in the format below:
{"myjson":[{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":21.70905,"pm25":16.5,"pm10":14.60085,"gas1":0,"gas2":0.12,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":24.34045,"pm25":18.5,"pm10":16.37065,"gas1":0,"gas2":0.08,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":23.6826,"pm25":18,"pm10":15.9282,"gas1":0,"gas2":0,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0},{"time":"1970-01-18
20:01:55","latitude":44.3959,"longitude":26.1025,"altitude":53,"pm1":25.65615,"pm25":19.5,"pm10":17.25555,"gas1":0,"gas2":0.04,"gas3":0,"gas4":0,"temperature":null,"pressure":0,"humidity":0,"noise":0}]}
Now you can query the records from the table.
hive> select * from json_serde;
OK
[{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"21.70905"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"24.34045"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"23.6826"},{"time":"1970-01-18 20:01:55","latitude":"44.3959","longitude":"26.1025","pm1":"25.65615"}]
Time taken: 0.069 seconds, Fetched: 1 row(s)
hive>
If you want each value separately displayed in tabular format, you can use the below query.
select b.* from json_serde a lateral view outer inline (a.myjson) b;
The result of the above query would be like this:
+------------------------+-------------+--------------+-----------+--+
| b.time | b.latitude | b.longitude | b.pm1 |
+------------------------+-------------+--------------+-----------+--+
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 21.70905 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 24.34045 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 23.6826 |
| 1970-01-18 20:01:55.0 | 44.3959 | 26.1025 | 25.65615 |
+------------------------+-------------+--------------+-----------+--+
Beautiful, isn't it?
Happy learning.
If you cannot update your input file format, you can import the JSON directly in Spark and work with it there; once the data is finalized, write it back to a Hive table.
scala> val myjs = spark.read.format("json").option("path","file:///root/tmp/test5").load()
myjs: org.apache.spark.sql.DataFrame = [altitude: bigint, gas1: bigint ... 13 more fields]
scala> myjs.show()
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
|altitude|gas1|gas2|gas3|gas4|humidity|latitude|longitude|noise| pm1| pm10|pm25|pressure|temperature| time|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
| 53| 0|0.12| 0| 0| 0| 44.3959| 26.1025| 0|21.70905|14.60085|16.5| 0| null|1521115600|
| 53| 0|0.08| 0| 0| 0| 44.3959| 26.1025| 0|24.34045|16.37065|18.5| 0| null|1521115659|
| 53| 0| 0.0| 0| 0| 0| 44.3959| 26.1025| 0| 23.6826| 15.9282|18.0| 0| null|1521115720|
| 53| 0|0.04| 0| 0| 0| 44.3959| 26.1025| 0|25.65615|17.25555|19.5| 0| null|1521115779|
+--------+----+----+----+----+--------+--------+---------+-----+--------+--------+----+--------+-----------+----------+
scala> myjs.write.json("file:///root/tmp/test_output")
Alternatively, you can write directly to a Hive table:
scala> myjs.createOrReplaceTempView("myjs")
scala> spark.sql("select * from myjs").show()
scala> spark.sql("create table tax.myjs_hive as select * from myjs")

Returning BigQuery data filtered on nested objects

I'm trying to create a query that returns data filtered on two nested objects. I've added (1) and (2) to the code to indicate that I want results from two different nested objects (I know this isn't a valid query). I've been looking at WITHIN RECORD but I can't get my head around it.
SELECT externalIds.value(1) AS appName, externalIds.value(2) AS driverRef, SUM(quantity)/ 60 FROM [billing.tempBilling]
WHERE callTo = 'example' AND externalIds.type(1) = 'driverRef' AND externalIds.type(2) = 'applicationName'
GROUP BY appName, driverRef ORDER BY appName, driverRef;
The data loaded into BigQuery looks like this:
{
"callTo": "example",
"quantity": 120,
"externalIds": [
{"type": "applicationName", "value": "Example App"},
{"type": "driverRef", "value": 234}
]
}
The result I'm after is this:
+-------------+-----------+----------+
| appName | driverRef | quantity |
+-------------+-----------+----------+
| Example App | 123 | 12.3 |
| Example App | 234 | 132.7 |
| Test App | 142 | 14.1 |
| Test App | 234 | 17.4 |
| Test App | 347 | 327.5 |
+-------------+-----------+----------+
If all of the quantities that you need to sum are within the same record, then you can use WITHIN RECORD for this query. Use NTH() WITHIN RECORD to get the first and second values for a field in the record. Then use HAVING to perform the filtering because it requires a value computed by an aggregation function.
SELECT callTo,
NTH(1, externalIds.type) WITHIN RECORD AS firstType,
NTH(1, externalIds.value) WITHIN RECORD AS maybeAppName,
NTH(2, externalIds.type) WITHIN RECORD AS secondType,
NTH(2, externalIds.value) WITHIN RECORD AS maybeDriverRef,
SUM(quantity) WITHIN RECORD
FROM [billing.tempBilling]
HAVING callTo LIKE 'example%' AND
firstType = 'applicationName' AND
secondType = 'driverRef';
If the quantities to be summed are spread across multiple records, then you can start with this approach and then group by your keys and sum those quantities in an outer query.
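A sketch of that outer aggregation, reusing the inner query above and the SUM(quantity)/60 from the question (legacy SQL, untested against your data):
SELECT maybeAppName AS appName,
       maybeDriverRef AS driverRef,
       SUM(quantity) / 60 AS quantity
FROM (
  SELECT callTo,
         NTH(1, externalIds.type) WITHIN RECORD AS firstType,
         NTH(1, externalIds.value) WITHIN RECORD AS maybeAppName,
         NTH(2, externalIds.type) WITHIN RECORD AS secondType,
         NTH(2, externalIds.value) WITHIN RECORD AS maybeDriverRef,
         SUM(quantity) WITHIN RECORD AS quantity
  FROM [billing.tempBilling]
  HAVING callTo LIKE 'example%' AND
         firstType = 'applicationName' AND
         secondType = 'driverRef'
)
GROUP BY appName, driverRef
ORDER BY appName, driverRef;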
