Columnarization for "non-native" data types loaded into VARIANT column - snowflake-cloud-data-platform

In the documentation there is a section on semi-structured considerations that warns about certain situations in which values for a given path in a VARIANT column will not be materialized; for example, if the values do not all have the same data type. That is, if both {"foo":1} and {"foo":"1"} are present, the value for "foo" obviously cannot be extracted into its own column.
But in the case where all values have the same type, it's still not entirely clear how so-called non-native types are handled.
The documentation describes, for example, dates and timestamps as non-native in the context of the VARIANT data type (meaning that such values are "stored as strings"). My question is whether this extends to number types such as FLOAT8. The documentation suggests that native types might be understood in the context of JSON (which has a native number type that is rather hybrid in nature).
Is FLOAT8 a native type in VARIANT data or is it stored as a string?
Would such a value (stored as a string) be extracted into its own column or appear as a "parsed semi-structured structure" along with those remaining values that weren't extracted into other columns?
The documentation suggests that one run performance tests "to see which structure provides the best performance", but this would be a lot easier with an accurate understanding of the extraction logic.

The following test suggests that FLOAT8 is indeed not stored as a string, so I expect it to be stored in a way where it can be retrieved (and pruned, etc.) independently of the other data in the JSON during query execution.
-- 100M rows of {"x": <float8 value>}
create table foo (col variant) as (
    select object_construct('x', FLOOR(random()*1000000)::float8) col
    from table(generator(rowcount => 100000000))
);
-- Same values, but cast to string: {"x": "<number>"}
create table foo2 (col variant) as (
    select object_construct('x', col:x::string) col
    from foo
);
TABLE_NAME  BYTES
FOO         800071680
FOO2        1480282112
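For reference, the byte counts above can be read from the information schema; a simple way to retrieve them, assuming both tables live in the current database and schema:
select table_name, bytes
from information_schema.tables
where table_name in ('FOO', 'FOO2');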

Related

ORA-22835: Buffer too small and ORA-25137: Data value out of range

We are using software that has limited Oracle capabilities. I need to filter through a CLOB field by making sure it has a specific value. Normally, outside of this software, I would do something like:
DBMS_LOB.SUBSTR(t.new_value) = 'Y'
However, this isn't supported so I'm attempting to use CAST instead. I've tried many different attempts but so far these are what I found:
The software has a built-in query checker/validator and these are the ones it shows as invalid:
DBMS_LOB.SUBSTR(t.new_value)
CAST(t.new_value AS VARCHAR2(10))
CAST(t.new_value AS NVARCHAR2(10))
However, the validator does accept these:
CAST(t.new_value AS VARCHAR(10))
CAST(t.new_value AS NVARCHAR(10))
CAST(t.new_value AS CHAR(10))
Unfortunately, even though the validator lets these ones go through, when running the query to fetch data, I get ORA-22835: Buffer too small when using VARCHAR or NVARCHAR. And I get ORA-25137: Data value out of range when using CHAR.
Are there other ways I could try to check that my CLOB field has a specific value when filtering the data? If not, how do I fix my current issues?
The error you're getting indicates that Oracle is trying to apply the CAST(t.new_value AS VARCHAR(10)) to a row where new_value has more than 10 characters. That makes sense given your description that new_value is a generic audit field holding values from a large number of different tables with a variety of data lengths. Given that, you'd need to structure the query in a way that forces the optimizer to reduce the set of rows down to only those where new_value is a single character before applying the cast.
Not knowing what sort of scope the software you're using provides for structuring your code, I'm not sure what options you have there. Be aware that depending on how robust you need this, the optimizer has quite a bit of flexibility to choose to apply predicates and functions on the projection in an arbitrary order. So even if you find an approach that works once, it may stop working in the future when statistics change or the database is upgraded and Oracle decides to choose a different plan.
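If the software does allow inline views, one common trick is to use ROWNUM as an optimizer barrier, sketched below. This is my own illustration, not a guaranteed fix: my_table is a placeholder for your table name, dbms_lob.getlength may be rejected by the same validator, and (as noted above) the optimizer retains a lot of freedom.
select *
from (
    select t.*, rownum as rn -- ROWNUM prevents Oracle from merging the view
    from my_table t          -- and pushing the cast past this filter
    where dbms_lob.getlength(t.new_value) = 1
)
where cast(new_value as varchar(10)) = 'Y';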
Using this as sample data
create table tab1(col clob);
insert into tab1(col) values (rpad('x',3000,'y'));
You need to use dbms_lob.substr(col,1) to get the first character (from the default offset= 1)
select dbms_lob.substr(col,1) from tab1;
DBMS_LOB.SUBSTR(COL,1)
----------------------
x
Note that the default amount (= length) of the substring is 32767, so using only DBMS_LOB.SUBSTR(COL) will return more than you expect.
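With the sample row above (a 3000-character CLOB), the default amount therefore returns the whole value:
select length(dbms_lob.substr(col)) from tab1;
LENGTH(DBMS_LOB.SUBSTR(COL))
----------------------------
                        3000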
CAST for CLOB does not cut the string to the casted length, but (as you observed) raises the exception ORA-25137: Data value out of range if the original string is longer than the casted length.
As documented for the CAST statement
CAST does not directly support any of the LOB data types. When you use CAST to convert a CLOB value into a character data type or a BLOB value into the RAW data type, the database implicitly converts the LOB value to character or raw data and then explicitly casts the resulting value into the target data type. If the resulting value is larger than the target type, then the database returns an error.
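You can reproduce this with the sample table above; the 3000-character value is longer than the 10-character target, so the cast raises the error you saw instead of truncating:
select cast(col as char(10)) from tab1;
ORA-25137: Data value out of range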

How to query multiple JSON document schemas in Snowflake?

Could anyone tell me how to change the Stored Procedure in the article below to recursively expand all the attributes of a json file (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2
Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP and how it works well.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';
with MAIN_TABLE as
(
select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'),'[^a-zA-Z0-9]','_') AS path_name, -- Strips bracket-enclosed array element references (like [0]) from the path, then replaces any remaining non-alphanumeric characters (quotes, dots) with underscores to form a column name
typeof(f.value) AS attribute_type, -- This generates column datatypes.
path_name AS alias_name -- This generates column aliases based on the path
from
MAIN_TABLE,
LATERAL FLATTEN(identifier($VARIANT_COLUMN), RECURSIVE=>true) f
where TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[');
Be sure to replace the variables to your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify this SP to do what you need. If it doesn't but there's a way to modify the query to get it to pick up the columns, that would work too.
If it doesn't pick up the columns: based on Craig's idea, I decided to write type inference for non-VARIANT data (such as strings from CSV log files that carry no type information). Try the SQL above and see what it returns first.

Get Count of Shared DataSet when defined with a Parameter

I commonly display the number of rows of my datasets in SSRS e.g.
=CountRows("MyDataSet")
However this doesn't work when the dataset is a shared dataset with a parameter.
=CountRows("MySharedDatasetWithParameter")
Instead it throws an error:
The Value expression for the textrun 'Textbox25.Paragraphs[0].TextRuns[0]' contains an error: (processing): (null != aggregateObj)
How can I get the number of rows in this case?
The dataset "MySharedDatasetWithParameter" does work in normal circumstances, because I am using it to provide the available values to another parameter.
Example of shared dataset with parameter
select [Name], [Value]
from dbo.MyList
where MasterList = @MasterList
A workaround taken from this answer (it's not a duplicate question, else I would flag it as such) is to create a hidden, multi-valued parameter, e.g. MyHiddenDataSetValues, which stores the values from "MySharedDatasetWithParameter", and then
=Parameters!MyHiddenDataSetValues.Count
gives the number of rows.
Rather clunky, so still hoping for a way to use CountRows.

How to compare numeric in PostgreSQL JSONB

I ran into a strange situation working with the jsonb type.
Expected behavior
Using short jsonb structure:
{"price": 99.99}
I wrote a query like this:
SELECT * FROM table t WHERE t.data->'price' > 90.90
And it fails with the error operator does not exist: jsonb > numeric; the text operator (->>) fails the same way with operator does not exist: text > numeric.
Then I wrote the comparison as mentioned in many resources:
SELECT * FROM table t WHERE (t.data->>'price')::NUMERIC > 90.90
And it works as expected.
What's strange:
SELECT * FROM table t WHERE t.data->price > '90.90';
A little weird, but the query above works correctly.
EXPLAIN: Filter: ((data -> 'price'::text) > '90.90'::jsonb)
But if I change jsonb value to text as: {"price": "99.99"}
there is no result any more - empty.
Question: how does PostgreSQL actually compare numeric data, and what is the preferable way to do this kind of comparison?
But you aren't comparing numeric data, are you.
I can see that you think price contains a number, but it doesn't. It contains a JSON value. That might be a number, or it might be text, or an array, or an object, or an object containing arrays of objects containing...
You might say "but the key is called 'price', of course it is a number", but that's no use to PostgreSQL, particularly if I come along and sneakily insert an object containing arrays of objects containing... [1]
So - if you want a number to compare to, you need to convert it to a number with (t.data->>'price')::NUMERIC, or convert your target value to JSON and let PostgreSQL do a JSON-based comparison (which might do what you want, or it might not - I don't know what the exact rules are for JSON).
[1] And that's exactly the sort of thing I would do, even though it is Christmas. I'm a bad person.
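For completeness, here is a minimal demonstration of both behaviors, using a throwaway table modeled on the question:
create table t (data jsonb);
insert into t values ('{"price": 99.99}'), ('{"price": "99.99"}');

-- Casting the extracted text to numeric matches both rows, because ->>
-- returns text regardless of the underlying JSON type:
select * from t where (data->>'price')::numeric > 90.90;

-- The jsonb comparison matches only the row holding a JSON number: jsonb
-- orders values of different types by type first
-- (Object > Array > Boolean > Number > String > Null), so the JSON string
-- "99.99" sorts below any JSON number, which is why the text variant
-- returns an empty result.
select * from t where data->'price' > '90.90';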

Create postgresql index on text column casted to array

I have a PostgreSQL table with a column of data type 'text', on which I need to create an index that involves this column being cast to integer[]. However, whenever I try to do so, I get the following error:
ERROR: functions in index expression must be marked IMMUTABLE
Here is the code:
create table test (a integer[], b text);
insert into test values ('{10,20,30}','{40,50,60}');
CREATE INDEX index_test on test USING GIN (( b::integer[] ));
Note that one potential workaround is to create a function marked IMMUTABLE that takes the column value and performs the type cast inside the function, but the problem (aside from adding overhead) is that I have many different 'target' array data types (e.g. text[], int2[], int4[], etc.), and it would not be practical to create a separate function for each potential target array data type.
Answered in this thread on the PostgreSQL mailing lists. Click on "Follow-ups" or "next by thread" in the links after the post to follow the (short) thread on the topic.
There's no recipe given there, but Tom's just talking about defining an explicit cast from text[] to integer[]. If time permits I'll flesh this answer out with an example.
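In the meantime, here is a minimal sketch of the usual workaround (my own illustration, not taken from the thread): wrap the cast in a function explicitly declared IMMUTABLE and index that expression. The function name is hypothetical, and as the question notes, you'd need one such function per target array type:
create function text_to_int_array(t text) returns integer[]
    language sql immutable strict
    as $$ select t::integer[] $$;

create index index_test on test using gin (text_to_int_array(b));

-- Queries must use the same expression for the index to be usable:
select * from test where text_to_int_array(b) @> array[40];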
