empty include paths in cosmos db index policy - database

I have a Cosmos DB with the following indexing policy:
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [],
"excludedPaths": [
{
"path": "/*"
},
{
"path": "/\"_etag\"/?"
}
]
}
As you can see, includedPaths is empty, yet the index size is 62.6GB. What data is in the index when, according to the index policy, there is nothing to index?
Here are the other DB metrics:
Data size -> 1.5TB
Index size -> 62.6GB
Document count -> 746M

Related

To fetch elements and compare elements from JSON object in Ruby

I have many JSON resources similar to the one below, but I need to fetch only the resources that satisfy these two conditions:
(1) component.code.text == Diastolic Blood Pressure
(2) valueQuantity.value < 90
This is the JSON object/resource
{
"fullUrl": "urn:uuid:edf9439b-0173-b4ab-6545-3b100165832e",
"resource": {
"resourceType": "Observation",
"id": "edf9439b-0173-b4ab-6545-3b100165832e",
"component": [ {
"code": {
"coding": [ {
"system": "http://loinc.org",
"code": "8462-4",
"display": "Diastolic Blood Pressure"
} ],
"text": "Diastolic Blood Pressure"
},
"valueQuantity": {
"value": 81,
"unit": "mm[Hg]",
"system": "http://unitsofmeasure.org",
"code": "mm[Hg]"
}
}, {
"code": {
"coding": [ {
"system": "http://loinc.org",
"code": "8480-6",
"display": "Systolic Blood Pressure"
} ],
"text": "Systolic Blood Pressure"
},
"valueQuantity": {
"value": 120,
"unit": "mm[Hg]",
"system": "http://unitsofmeasure.org",
"code": "mm[Hg]"
}
} ]
}
}
I need to write a condition to fetch the resource with text: "Diastolic Blood Pressure" AND valueQuantity.value > 90
I have written the following code:
def self.hypertension_observation(bundle)
entries = bundle.entry.select {|entry| entry.resource.is_a?(FHIR::Observation)}
observations = entries.map {|entry| entry.resource}
hypertension_observation_statuses = ((observations.select {|observation| observation&.component&.at(0)&.code&.text.to_s == 'Diastolic Blood Pressure'}) && (observations.select {|observation| observation&.component&.at(0)&.valueQuantity&.value.to_i >= 90}))
end
I get output without any error, but the second condition is not applied: the output even contains values < 90.
Could anyone please help correct this Ruby code so that it fetches only the output containing value < 90?
I will write out what I would do for a problem like this, based on the (edited) version of your JSON data. I'm inferring that the full JSON file is a list of records with medical data, and that we want to fetch only the records for which the individual's diastolic blood pressure reading is < 90.
If you want to do this in Ruby, I recommend using the JSON parser that comes with your Ruby distribution. It takes some (hopefully valid) JSON data and returns nested Ruby arrays and hashes; for a list of records like this one, that is an array of hashes. In my solution I saved the JSON you posted to a file, so I would do something like this:
require 'json'
require 'pp'

# Assumes medical_info.json holds an array of records shaped like the one above.
json_data = File.read("medical_info.json")
parsed_data = JSON.parse(json_data)

# Keep only records whose first component is a diastolic reading below 90.
fetched_data = parsed_data.select do |record|
  component = record["resource"]["component"][0]
  component["code"]["text"] == "Diastolic Blood Pressure" &&
    component["valueQuantity"]["value"] < 90
end

pp fetched_data
This will print a new array of hashes which contains only the results with the desired values for diastolic pressure. The 'pp' library ('Pretty Print') ships with Ruby; it isn't perfect, but it makes the hierarchy a little easier to read.
When faced with deeply nested JSON data that I want to parse in Ruby, I save the JSON data to a file, as I did here, and then run IRB in the directory where the file is, so I can experiment with accessing the hash values and array elements I'm looking for.
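The Ruby above only looks at the first entry of component. As a rough sketch of the same filter that checks every component instead, here is a Python version; the function name diastolic_below and the sample records are made up for illustration, and it assumes each record has the shape shown in the question.

```python
import json

def diastolic_below(records, limit=90):
    """Return the records that have a diastolic component with value below limit."""
    matches = []
    for record in records:
        components = record.get("resource", {}).get("component", [])
        for component in components:
            text = component.get("code", {}).get("text")
            value = component.get("valueQuantity", {}).get("value")
            if text == "Diastolic Blood Pressure" and value is not None and value < limit:
                matches.append(record)
                break  # one matching component is enough for this record
    return matches

# Hypothetical sample data: the diastolic reading is not always the first component.
sample = json.loads("""
[
  {"resource": {"id": "a", "component": [
    {"code": {"text": "Systolic Blood Pressure"}, "valueQuantity": {"value": 120}},
    {"code": {"text": "Diastolic Blood Pressure"}, "valueQuantity": {"value": 81}}]}},
  {"resource": {"id": "b", "component": [
    {"code": {"text": "Diastolic Blood Pressure"}, "valueQuantity": {"value": 95}}]}}
]
""")

selected_ids = [r["resource"]["id"] for r in diastolic_below(sample)]
print(selected_ids)
```

Both conditions are tested on the same component, which is the part the original select ... && select ... expression in the question does not do.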

Patch request in SCIM with Azure AD

How should I handle the following PATCH request, for a user that when initially added didn't have any address (not even an empty addresses array)?
{
"schemas": [
"urn:ietf:params:scim:api:messages:2.0:PatchOp"
],
"Operations": [
{
"op": "Add",
"path": "addresses[type eq \"work\"].formatted",
"value": "Columbus"
}
]
}
Should I "proactively" create an addresses array with a single value, as follows (which seems like a very bad solution)?
{"type": "work", "formatted": "Columbus"}
I would expect a patch request that looks like:
{
"schemas": [
"urn:ietf:params:scim:api:messages:2.0:PatchOp"
],
"Operations":[{
"op":"add",
"value":{
"addresses":[
{
"formatted":"Columbus",
"type":"work"
}
]
}
}]
}
If no array exists yet, you should create the array and then add the value to it. You can either initialize the attribute as an empty array ahead of time, or leave it as null until a value needs to be added, and at that point create the array and add the value.
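A minimal sketch of that "create the array on demand" handling, written in Python. The function apply_add, the regular expression, and the sample user are hypothetical; the pattern only covers the single path shape from the question (attr[key eq "value"].subAttr), not the full SCIM filter grammar.

```python
import re

# Matches paths such as: addresses[type eq "work"].formatted
FILTERED_PATH = re.compile(r'^(\w+)\[(\w+) eq "([^"]*)"\]\.(\w+)$')

def apply_add(user, path, value):
    """Apply an 'add' operation with a filtered path, creating the array if needed."""
    match = FILTERED_PATH.match(path)
    if match is None:
        raise ValueError(f"unsupported path: {path}")
    attr, filter_key, filter_value, sub_attr = match.groups()

    # Treat a missing attribute and an explicit null the same way.
    items = user.get(attr) or []
    user[attr] = items

    # Update the matching element if there is one.
    for item in items:
        if item.get(filter_key) == filter_value:
            item[sub_attr] = value
            return user

    # Otherwise create the element the filter describes.
    items.append({filter_key: filter_value, sub_attr: value})
    return user

user = {"userName": "homer"}
apply_add(user, 'addresses[type eq "work"].formatted', "Columbus")
print(user)
# {'userName': 'homer', 'addresses': [{'type': 'work', 'formatted': 'Columbus'}]}
```

Applying the same operation again updates the existing work address instead of appending a second one.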

Variant column separate json values in file by comma?

This is a basic question: I am trying to break one variant row into multiple columns and am running into an error.
Create or replace table App_versions(data variant);
CREATE or Replace FILE FORMAT x_json
TYPE = "JSON"
COMPRESSION = "GZIP"
FILE_EXTENSION= 'json.gz';
COPY INTO App_versions
FROM @~/staged
file_format = 'x_json'
on_error = 'skip_file';
list @~;
SELECT * FROM App_versions limit 10;
Select data:available.value::boolean as avail, data:color.value::string as col, data:name.value::string as title, data:version.value::float as version from App_versions;
Data Stored in Column
[
{
"available": false,
"color": "Indigo",
"name": "Bigtax",
"version": "2.2.9"
},
{
"available": false,
"color": "Khaki",
"name": "Solarbreeze",
"version": "7.00"
}
]
All of the resulting columns contain NULL values. What am I doing wrong?
I based it on: https://support.snowflake.net/s/article/json-data-parsing-in-snowflake
If you want each { ... } object to land in its own row, use the STRIP_OUTER_ARRAY = TRUE file format option, or FLATTEN() the data on the fly after loading. To access multiple objects in a single row without flattening, you have to include an index to specify which object you want, for example: select data[0].available::boolean as avail ....
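To see why the original paths come back empty, it can help to mimic the lookups outside the database. This is only a Python analogy, not Snowflake code: get_key is a made-up helper that returns None where a VARIANT path would yield NULL, and the list comprehension stands in for one row per object (the effect of STRIP_OUTER_ARRAY or FLATTEN).

```python
import json

raw = """
[
  {"available": false, "color": "Indigo", "name": "Bigtax", "version": "2.2.9"},
  {"available": false, "color": "Khaki", "name": "Solarbreeze", "version": "7.00"}
]
"""
# One "row" holding the whole array, as when the outer array is not stripped.
data = json.loads(raw)

def get_key(value, key):
    """Key lookup that yields None on anything that is not an object."""
    return value.get(key) if isinstance(value, dict) else None

key_on_array = get_key(data, "color")     # the row is an array, so there is no such key
first_color = get_key(data[0], "color")   # index first, then key
nested_value = get_key(first_color, "value")  # "Indigo" is a string with no "value" key

# One tuple per object, analogous to one row per object.
rows = [
    (get_key(o, "available"), get_key(o, "color"), get_key(o, "name"), get_key(o, "version"))
    for o in data
]

print(key_on_array, first_color, nested_value)
print(rows)
```

The third lookup mirrors the extra .value in the question's paths: the sample objects store plain values under color, name and version, with no nested value key.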

sagemaker clienterror rows 1-5000 have more fields than expected size 3

I have created a K-means training job with a csv file that I have stored in S3. After a while I receive the following error:
Training failed with the following error: ClientError: Rows 1-5000 in file /opt/ml/input/data/train/features have more fields than than expected size 3.
What could be the issue with my file?
Here are the parameters I am passing to sagemaker.create_training_job
TrainingJobName=job_name,
HyperParameters={
'k': '2',
'feature_dim': '2'
},
AlgorithmSpecification={
'TrainingImage': image,
'TrainingInputMode': 'File'
},
RoleArn='arn:aws:iam::<my_acc_number>:role/MyRole',
OutputDataConfig={
"S3OutputPath": output_location
},
ResourceConfig={
'InstanceType': 'ml.m4.xlarge',
'InstanceCount': 1,
'VolumeSizeInGB': 20,
},
InputDataConfig=[
{
'ChannelName': 'train',
'ContentType': 'text/csv',
"CompressionType": "None",
"RecordWrapperType": "None",
'DataSource': {
'S3DataSource': {
'S3DataType': 'S3Prefix',
'S3Uri': data_location,
'S3DataDistributionType': 'FullyReplicated'
}
}
}
],
StoppingCondition={
'MaxRuntimeInSeconds': 600
}
I've seen this issue appear when doing unsupervised learning, such as the clustering in the example above. The expected size of 3 is consistent with feature_dim = 2 plus one label column, which is presumably what the algorithm assumes for CSV input by default. If your CSV has no label column, you can address this by setting label_size=0 in the ContentType parameter of the SageMaker API call, within the InputDataConfig branch.
Here's an example of what the relevant section of the call might look like:
"InputDataConfig": [
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "some/path/in/s3",
"S3DataDistributionType": "ShardedByS3Key"
}
},
"CompressionType": "None",
"RecordWrapperType": "None",
"ContentType": "text/csv;label_size=0"
}
]
Make sure your .csv doesn't have column headers, and that the label (if there is one) is the first column.
Also make sure the values of your hyperparameters are accurate, i.e. feature_dim must equal the number of features in your data set. If you give it the wrong value, training will fail.
Here's a list of SageMaker kNN hyperparameters and their meanings: https://docs.aws.amazon.com/sagemaker/latest/dg/kNN_hyperparameters.html
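Before re-running the job, it may be worth checking the file locally. The sketch below is plain Python with no SageMaker calls; check_csv_width is a made-up helper, and the rule "expected columns = feature_dim + label_size" is an assumption inferred from the error message rather than something taken from the documentation.

```python
import csv
import io

def check_csv_width(csv_text, feature_dim, label_size):
    """Return the expected column count and the rows that do not match it."""
    expected = feature_dim + label_size  # assumption inferred from the error message
    bad_rows = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != expected:
            bad_rows.append((line_no, len(row)))
    return expected, bad_rows

# Hypothetical file contents: four columns per row, no header.
sample_csv = "1.0,2.0,3.0,4.0\n5.0,6.0,7.0,8.0\n"

# With feature_dim=2 and one label column, 3 columns are expected, so both rows mismatch.
expected, bad_rows = check_csv_width(sample_csv, feature_dim=2, label_size=1)
print(expected, bad_rows)

# With feature_dim=4 and label_size=0, the same file matches.
expected_fixed, bad_rows_fixed = check_csv_width(sample_csv, feature_dim=4, label_size=0)
print(expected_fixed, bad_rows_fixed)
```

If the check reports mismatches, either feature_dim or the label_size in the content type probably needs to change to match the file.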

Query nested arrays in ArangoDB

I'm looking for a way to query nested arrays in ArangoDB.
The JSON structure I have is:
{
"uid": "bykwwla4prqi",
"category": "party",
"notBefore": "2016-04-19T08:43:35.388+01:00",
"notAfter": "9999-12-31T23:59:59.999+01:00",
"version": 1.0,
"aspects": [
"participant"
],
"description": [
{ "value": "User Homer Simpson, main actor in 'The Simpsons'", "lang": "en"}
],
"properties": [
{
"property": [
"urn:project:domain:attribute:surname"
],
"values": [
"Simpson"
]
},
{
"property": [
"urn:project:domain:attribute:givennames"
],
"values": [
"Homer",
"Jay"
]
}
]
}
I tried to use a query like the following to find all parties having a given name 'Jay':
FOR r IN resource
FILTER "urn:project:domain:attribute:givennames" IN r.properties[*].targets[*]
AND "Jay" IN r.properties[*].values[*]
RETURN r
but unfortunately it does not work - it returns an empty array. If I use a '1' instead of '*' for the properties array it works. But the array of the properties has no fixed structure.
Does anybody have an idea how to solve this?
Thanks a lot!
You can inspect what the filter does using a simple trick: you RETURN the actual filter condition:
db._query(`FOR r IN resource RETURN r.properties[*].property[*]`).toArray()
[
[
[
"urn:project:domain:attribute:surname"
],
[
"urn:project:domain:attribute:givennames"
]
]
]
which makes it pretty clear what's going on. The IN operator can only work on one-dimensional arrays. You could work around this by using FLATTEN() to remove the sub-layers:
db._query(`FOR r IN resource RETURN FLATTEN(r.properties[*].property[*])`).toArray()
[
[
"urn:project:domain:attribute:surname",
"urn:project:domain:attribute:givennames"
]
]
However, while your documents are valid JSON (I guess it's converted from XML?), you should alter the structure to what one would normally use in JSON:
"properties" : {
"urn:project:domain:attribute:surname":[
"Simpson"
],
"urn:project:domain:attribute:givennames": [
"Homer",
"Jay"
]
}
This matters because the FILTER combination you specify would also find any other Jay (not only those found in givennames), and using FLATTEN() prevents the filter from using indices. For performance reasons, you don't want queries that can't use indices on reasonably sized collections.
In contrast, you can use an array index on givennames with the above document layout:
db.resource.ensureIndex({type: "hash",
fields:
["properties.urn:project:domain:attribute:givennames[*]"]
})
Now double-check the explain output for the query:
db._explain("FOR r IN resource FILTER 'Jay' IN " +
"r.properties.`urn:project:domain:attribute:givennames` RETURN r")
...
6 IndexNode 1 - FOR r IN resource /* hash index scan */
...
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash resource false false 100.00 % \
[ `properties.urn:project:domain:attribute:givennames[*]` ] \
("Jay" in r.`properties`.`urn:project:domain:attribute:givennames`)
which shows that it's using the index.
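The two pitfalls discussed above (membership tests on nested arrays, and the two independent conditions matching "any other Jay") can be reproduced with ordinary lists. This is a Python analogy only, not AQL; the document is a trimmed copy of the one in the question plus a hypothetical extra surname value of "Jay" to show the false positive.

```python
GIVEN = "urn:project:domain:attribute:givennames"
SURNAME = "urn:project:domain:attribute:surname"

# Trimmed document; the surname "Jay" is invented to demonstrate the false positive.
doc = {
    "properties": [
        {"property": [SURNAME], "values": ["Jay"]},
        {"property": [GIVEN], "values": ["Homer"]},
    ]
}

# Like r.properties[*].property[*]: a list of lists, so a plain membership test fails.
nested = [p["property"] for p in doc["properties"]]
found_nested = GIVEN in nested

# Like FLATTEN(...): one level of nesting removed, so membership works.
flat_props = [name for group in nested for name in group]
flat_values = [v for p in doc["properties"] for v in p["values"]]
found_flat = GIVEN in flat_props

# Two independent conditions match even though "Jay" is a surname here.
loose_match = GIVEN in flat_props and "Jay" in flat_values

# Requiring both conditions on the same properties entry avoids that.
strict_match = any(GIVEN in p["property"] and "Jay" in p["values"] for p in doc["properties"])

print(found_nested, found_flat, loose_match, strict_match)
# False True True False
```

The restructured layout suggested above sidesteps the problem, because the values are stored directly under the givennames key.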
