MongoDB group-by count query

I have a MongoDB collection whose documents have fields such as Id, EmployeeID, SiteID, and EmployeeAddress.
An employee can be present at a site more than once.
I want a group-by query with a count that gives a result set of the form
EmployeeID SiteID Count EmployeeAddress
basically, how many times an employee is present at a site.
I am using this query but am not getting the desired data.
db.pnr_dashboard.aggregate([
  {
    "$group": {
      "_id": { "siteId": "$siteId", "employeeId": "$employeeId" },
      "count": { "$sum": 1 }
    }
  }
]);
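A sketch of a pipeline that should return the desired columns, assuming the stored field names use the casing described in the question (EmployeeID, SiteID, EmployeeAddress); note that the query above groups on "$siteId"/"$employeeId", which will not match fields stored as SiteID/EmployeeID, and it never carries the address along:
db.pnr_dashboard.aggregate([
  {
    "$group": {
      // group key: one bucket per (employee, site) pair
      "_id": { "employeeId": "$EmployeeID", "siteId": "$SiteID" },
      // how many times this employee appears at this site
      "count": { "$sum": 1 },
      // pick one address per group (assumes it is the same in all docs)
      "employeeAddress": { "$first": "$EmployeeAddress" }
    }
  }
]);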

Related

SQL Searching on columns generated in output

I have two tables and am writing a query that takes data from both of them, as below:
select distinct on (e.pol) e.pol, ei.bene, ei.status,
(jsonb_array_elements(ei.name_json) ->> 'custName1') as name1,
(jsonb_array_elements(ei.name_json) ->> 'custName2') as name2
from table1 e, table2 ei
cross join jsonb_array_elements(ei.name_json) as j(value)
where e.pol = ei.pol and j.value ->> 'custName1' like '%Tes%'
order by e.pol, ei.bene
In this query I am trying to search within the JSON array "name_json", which looks like this:
[
{
"custName1": "Tesla",
"custName2": ""
},
{
"custName1": "Gerber",
"custName2": "N"
}
]
I am displaying only the rows with a distinct policy and their respective custName1, and I want to search only on the field that is displayed in my output. How do I search only on the columns displayed in the output?
Help is appreciated.
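One possible approach, sketched under the assumption that table1(pol, ...) and table2(pol, bene, status, name_json jsonb) match the question: expand the array exactly once with a LATERAL join, so the same array element is both displayed and searched, instead of calling jsonb_array_elements separately in the select list and the where clause:
select distinct on (e.pol)
       e.pol, ei.bene, ei.status,
       n.elem ->> 'custName1' as name1,
       n.elem ->> 'custName2' as name2
from table1 e
join table2 ei on e.pol = ei.pol
-- each array element becomes one row; the filter below sees the very
-- same element that the select list displays
cross join lateral jsonb_array_elements(ei.name_json) as n(elem)
where n.elem ->> 'custName1' like '%Tes%'
order by e.pol, ei.bene;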

Parsing string with multiple delimiters into columns

I want to split strings into columns.
My columns should be:
account_id, resource_type, resource_name
I have a JSON file source that I have been trying to parse via an ADF data flow. That hasn't worked for me, so I flattened the data and brought it into SQL Server (I am open to parsing the values via ADF or SQL if anyone can show me how). Please check the JSON file at the bottom.
Use this code to recreate the data I am working with:
CREATE TABLE test.test2
(
resource_type nvarchar(max) NULL
)
INSERT INTO test.test2 ([resource_type])
VALUES
('account_id:224526257458,resource_type:buckets,resource_name:camp-stage-artifactory'),
('account_id:535533456241,resource_type:buckets,resource_name:tni-prod-diva-backups'),
('account_id:369798452057,resource_type:buckets,resource_name:369798452057-s3-manifests'),
('account_id:460085747812,resource_type:buckets,resource_name:vessel-incident-report-nonprod-accesslogs')
The output that I should be able to query in SQL Server should look like this:
account_id    resource_type  resource_name
224526257458  buckets        camp-stage-artifactory
535533456241  buckets        tni-prod-diva-backups
and so forth.
Please help me out and ask for clarification if needed. Thanks in advance.
EDIT:
Source JSON Format:
{
"start_date": "2021-12-01 00:00:00+00:00",
"end_date": "2021-12-31 23:59:59+00:00",
"resource_type": "all",
"records": [
{
"directconnect_connections": [
"account_id:227148359287,resource_type:directconnect_connections,resource_name:'dxcon-fh40evn5'",
"account_id:401311080156,resource_type:directconnect_connections,resource_name:'dxcon-ffxgf6kh'",
"account_id:401311080156,resource_type:directconnect_connections,resource_name:'dxcon-fg5j5v6o'",
"account_id:227148359287,resource_type:directconnect_connections,resource_name:'dxcon-fgvfo1ej'"
]
},
{
"virtual_interfaces": [
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:'dxvif-fgvj25vt'",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:'dxvif-fgbw5gs0'",
"account_id:401311080156,resource_type:virtual_interfaces,resource_name:'dxvif-ffnosohr'",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:'dxvif-fg18bdhl'",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:'dxvif-ffmf6h64'",
"account_id:390251991779,resource_type:virtual_interfaces,resource_name:'dxvif-fgkxjhcj'",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:'dxvif-ffp6kl3f'"
]
}
]
}
Since you don't have a valid JSON string, and not wanting to get into the business of string manipulation... perhaps this will help.
Select B.*
From test2 A
Cross Apply ( -- split each CSV row once, then pivot the three key:value pairs
              -- into columns; STUFF strips the leading 'key:' prefix
              Select account_id    = max(case when value like 'account_id:%'    then stuff(value,1,11,'') end),
                     resource_type = max(case when value like 'resource_type:%' then stuff(value,1,14,'') end),
                     resource_name = max(case when value like 'resource_name:%' then stuff(value,1,14,'') end)
              from string_split(resource_type, ',')
            ) B
Results
account_id resource_type resource_name
224526257458 buckets camp-stage-artifactory
535533456241 buckets tni-prod-diva-backups
369798452057 buckets 369798452057-s3-manifests
460085747812 buckets vessel-incident-report-nonprod-accesslogs
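To see what the STUFF calls are doing: STRING_SPLIT returns a single column named value, and STUFF(value, 1, n, '') deletes the first n characters, i.e. the 'key:' prefix (11 characters for 'account_id:', 14 for 'resource_type:' and 'resource_name:'). A quick illustration:
SELECT STUFF('account_id:224526257458', 1, 11, '');  -- 224526257458
SELECT STUFF('resource_type:buckets', 1, 14, '');    -- buckets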
Unfortunately, the values inside the arrays are not valid JSON. You can patch them up by adding {} to the beginning/end, and adding " on either side of : and ,.
DECLARE @json nvarchar(max) = N'{
"start_date": "2021-12-01 00:00:00+00:00",
"end_date": "2021-12-31 23:59:59+00:00",
"resource_type": "all",
"records": [
{
"directconnect_connections": [
"account_id:227148359287,resource_type:directconnect_connections,resource_name:''dxcon-fh40evn5''",
"account_id:401311080156,resource_type:directconnect_connections,resource_name:''dxcon-ffxgf6kh''",
"account_id:401311080156,resource_type:directconnect_connections,resource_name:''dxcon-fg5j5v6o''",
"account_id:227148359287,resource_type:directconnect_connections,resource_name:''dxcon-fgvfo1ej''"
]
},
{
"virtual_interfaces": [
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:''dxvif-fgvj25vt''",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:''dxvif-fgbw5gs0''",
"account_id:401311080156,resource_type:virtual_interfaces,resource_name:''dxvif-ffnosohr''",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:''dxvif-fg18bdhl''",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:''dxvif-ffmf6h64''",
"account_id:390251991779,resource_type:virtual_interfaces,resource_name:''dxvif-fgkxjhcj''",
"account_id:227148359287,resource_type:virtual_interfaces,resource_name:''dxvif-ffp6kl3f''"
]
}
]
}';
SELECT
j4.account_id,
j4.resource_type,
TRIM('''' FROM j4.resource_name) resource_name
FROM OPENJSON(@json, '$.records') j1
CROSS APPLY OPENJSON(j1.value) j2
CROSS APPLY OPENJSON(j2.value) j3
CROSS APPLY OPENJSON('{"' + REPLACE(REPLACE(j3.value, ':', '":"'), ',', '","') + '"}')
WITH (
account_id bigint,
resource_type varchar(20),
resource_name varchar(100)
) j4;
db<>fiddle
The first three calls to OPENJSON have no schema, so the resultset has three columns: key, value, and type. In the case of arrays (j1 and j3), key is the index into the array. In the case of single objects (j2), key is each property name.
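A minimal illustration of that default, schema-less resultset (my own toy document, not from the answer above):
SELECT [key], [value], [type]
FROM OPENJSON(N'{"a": 1, "b": [true]}');
-- key | value  | type
-- a   | 1      | 2  (number)
-- b   | [true] | 4  (array)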

Compare multiple date fields in JSON and use them in where clause

So I have a text field in my Postgres 10.8 DB (json_array_elements is not possible). It has a JSON structure like this:
{
"code_cd": "02",
"tax_cd": null,
"earliest_exit_date": [
{
"date": "2023-03-31",
"_destroy": ""
},
{
"date": "2021-11-01",
"_destroy": ""
},
{
"date": "2021-12-21",
"_destroy": ""
}
],
"enter_date": null,
"leave_date": null
}
earliest_exit_date can also be empty, like this:
{
"code_cd": "02",
"tax_cd": null,
"earliest_exit_date":[],
"enter_date": null,
"leave_date": null
}
Now I want to get back the earliest_exit_date whose date is after current_date and closest to current_date. For the example above, the output has to be: 2021-12-21
Does anyone know how to do this?
If your table has a unique value or an id column, you can use the query below:
Sample table and data structure: dbfiddle
select distinct
       id,
       -- earliest future date per id
       min("date") filter (where "date" > current_date) over (partition by id)
from test t
cross join jsonb_to_recordset(t.data::jsonb -> 'earliest_exit_date') as e("date" date)
order by id
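A variant of the same idea (my sketch, not from the linked fiddle) that also keeps ids whose earliest_exit_date array is empty; jsonb_to_recordset produces no rows for an empty array, so the plain CROSS JOIN above silently drops those ids:
select t.id,
       min(e."date") filter (where e."date" > current_date) as next_exit_date
from test t
left join lateral jsonb_to_recordset(t.data::jsonb -> 'earliest_exit_date')
       as e("date" date) on true
group by t.id
order by t.id;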

How do I import an array of data into separate rows in a Hive table?

I am trying to import data in the following format into a Hive table:
[
{
"identifier" : "id#1",
"dataA" : "dataA#1"
},
{
"identifier" : "id#2",
"dataA" : "dataA#2"
}
]
I have multiple files like this and I want each {} to form one row in the table. This is what I have tried:
CREATE EXTERNAL TABLE final_table(
identifier STRING,
dataA STRING
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "s3://bucket/path_in_bucket/"
This is not creating a single row for each {}, though. I have also tried:
CREATE EXTERNAL TABLE final_table(
rows ARRAY< STRUCT<
identifier: STRING,
dataA: STRING
>>
) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION "s3://bucket/path_in_bucket/"
but this does not work either. Is there some way to tell the Hive query that the input is an array, with each record being an item in that array? Any suggestions on what to do?
Here is what you need
Method 1: Adding name to the array
Data
{"data":[{"identifier" : "id#1","dataA" : "dataA#1"},{"identifier" : "id#2","dataA" : "dataA#2"}]}
SQL
SET hive.support.sql11.reserved.keywords=false;
CREATE EXTERNAL TABLE IF NOT EXISTS ramesh_test (
data array<
struct<
identifier:STRING,
dataA:STRING
>
>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 'my_location';
SELECT rows.identifier,
rows.dataA
FROM ramesh_test d
LATERAL VIEW EXPLODE(d.data) d1 AS rows ;
Output
identifier  dataA
id#1        dataA#1
id#2        dataA#2
Method 2 - No Changes to the data
Data
[{"identifier":"id#1","dataA":"dataA#1"},{"identifier":"id#2","dataA":"dataA#2"}]
SQL
CREATE EXTERNAL TABLE IF NOT EXISTS ramesh_raw_json (
json STRING
)
LOCATION 'my_location';
SELECT get_json_object(exp.json_object, '$.identifier') AS identifier,
       get_json_object(exp.json_object, '$.dataA') AS dataA
FROM ( -- turn '},{' into '};{', strip the surrounding [ ], then split on ';'
       -- so each JSON object of the array becomes one exploded row
       SELECT json_object
       FROM ramesh_raw_json a
       LATERAL VIEW EXPLODE (split(regexp_replace(regexp_replace(a.json, '\\}\\,\\{', '\\}\\;\\{'), '\\[|\\]', ''), '\\;')) json_exploded AS json_object
     ) exp;
Output
identifier  dataA
id#1        dataA#1
id#2        dataA#2
JSON records in data files must appear one per line; an empty line would produce a NULL record.
This JSON should work (note: no trailing commas, since each line must be a standalone JSON object):
{ "identifier" : "id#1", "dataA" : "dataA#1" }
{ "identifier" : "id#2", "dataA" : "dataA#2" }

In ArangoDB, will querying, with filters, from the neighbor(s) be done in O(n)?

I've been reading AQL Graph Operations and Graphs, and have found no concrete example or performance explanation for the SQL-Traverse use case.
E.g.:
If I have a collection Users, which has a company relation to collection Company;
collection Company has a location relation to collection Location;
collection Location is either a city, country, or region, and has relations city, country, and region to itself.
Now I would like to query all users who belong to companies in Germany or the EU.
SELECT from Users where Users.company.location.city.country.name="Germany";
SELECT from Users where Users.company.location.city.parent.name="Germany";
or
SELECT from Users where Users.company.location.city.country.region.name="europe";
SELECT from Users where Users.company.location.city.parent.parent.name="europe";
Assuming that Location.name is indexed, can the two queries above be executed in O(n), with n being the number of documents in Location (O(1) for the graph traversal, O(n) for the index scan)?
Of course, I could just save regionName or countryName directly in company, since these cities and countries are in the EU and probably won't change, unlike in ... other places, but what if... you know what I mean (kidding; what if I have other use cases that require constant updates)?
I'm going to explain this using ArangoDB 2.8 traversals.
We create these collections to match your schema using arangosh:
db._create("countries")
db.countries.save({_key:"Germany", name: "Germany"})
db.countries.save({_key:"France", name: "France"})
db.countries.ensureHashIndex("name")
db._create("cities")
db.cities.save({_key: "Munich"})
db.cities.save({_key: "Toulouse")
db._create("company")
db.company.save({_key: "Siemens"})
db.company.save({_key: "Airbus"})
db._create("employees")
db.employees.save({lname: "Kraxlhuber", cname: "Xaver", _key: "user1"})
db.employees.save({lname: "Heilmann", cname: "Vroni", _key: "user2"})
db.employees.save({lname: "Leroy", cname: "Marcel", _key: "user3"})
db._createEdgeCollection("CityInCountry")
db._createEdgeCollection("CompanyIsInCity")
db._createEdgeCollection("WorksAtCompany")
db.CityInCountry.save("cities/Munich", "countries/Germany", {label: "beautiful South near the mountains"})
db.CityInCountry.save("cities/Toulouse", "countries/France", {label: "crowded city at the mediteranian Sea"})
db.CompanyIsInCity.save("company/Siemens", "cities/Munich", {label: "darfs ebbes gscheits sein? Oder..."})
db.CompanyIsInCity.save("company/Airbus", "cities/Toulouse", {label: "Big planes Ltd."})
db.WorksAtCompany.save("employees/user1", "company/Siemens", {employeeOfMonth: true})
db.WorksAtCompany.save("employees/user2", "company/Siemens", {veryDiligent: true})
db.WorksAtCompany.save("employees/user3", "company/Eurocopter", {veryDiligent: true})
In AQL we would write this query the other way around.
We start with the constant-time FILTER on the indexed attribute name and start our traversal from there.
Therefore we filter for the country "Germany":
db._explain("FOR country IN countries FILTER country.name == 'Germany' RETURN country ")
Query string:
FOR country IN countries FILTER country.name == 'Germany' RETURN country
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 1 - FOR country IN countries /* hash index scan */
5 ReturnNode 1 - RETURN country
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash countries false false 66.67 % [ `name` ] country.`name` == "Germany"
Optimization rules applied:
Id RuleName
1 use-indexes
2 remove-filter-covered-by-index
Now that we have our well-filtered start node, we do a graph traversal in the reverse direction. Since we know that employees are exactly 3 steps away from the start vertex, and we're not interested in the path, we only return the 3rd layer:
db._query("FOR country IN countries FILTER country.name == 'Germany' FOR v IN 3 INBOUND country CityInCountry, CompanyIsInCity, WorksAtCompany RETURN v")
[
{
"cname" : "Xaver",
"lname" : "Kraxlhuber",
"_id" : "employees/user1",
"_rev" : "1286703864570",
"_key" : "user1"
},
{
"cname" : "Vroni",
"lname" : "Heilmann",
"_id" : "employees/user2",
"_rev" : "1286729095930",
"_key" : "user2"
}
]
Some words about this query's performance:
Locating Germany via the hash index takes constant time -> O(1)
From there we traverse m paths, where m is the number of employees in Germany; each path can be traversed in constant time -> O(m) for this step
Returning the result takes constant time -> O(1)
Combined, we need O(m), where we expect m to be less than n (the number of employees) as used in your SQL traversal.
