I am loading two Parquet files, each with one row, into a VARIANT column in a table on Snowflake.
When I read those two files and print their fields using Python, I see the same number of fields (30 in this case).
When I load those two Parquet files into a VARIANT column in a Snowflake table and query that table, I see only 29 fields from one file and 30 fields from the other.
When I look at the Python output for the missing field, one file has a value (13 in this case) and the other file has NaN.
For some reason, Snowflake is not showing the field that has an empty value.
Do I need to do something different while loading into Snowflake so that it doesn't ignore fields that have no value in the Parquet files?
Parquet file loads into Snowflake do omit null fields (NaN is treated the same as null within the Parquet file), and there is no option to project them as null values in the VARIANT representation. This is currently the expected behaviour for semi-structured file loads.
However, semi-structured querying does let you reference field names that are missing from some rows: NULL is returned for any row where the field is not found.
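For reference, a load along these lines will reproduce that behaviour (a sketch only; the table PRQ, its single VARIANT column V, and the stage name @prq_stage are stand-ins, not your actual objects):

-- Sketch: assumes an internal stage @prq_stage that the two Parquet files
-- have been PUT into, and a one-column VARIANT table to load them into.
CREATE OR REPLACE TABLE PRQ (V VARIANT);

COPY INTO PRQ
  FROM (SELECT $1 FROM @prq_stage)
  FILE_FORMAT = (TYPE = PARQUET);

-- Any field that is null/NaN in a given Parquet row is simply omitted
-- from that row's VARIANT value after the load.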
Here's an example where two rows are missing a field due to NaNs being treated as nulls within the source Parquet file:
> SELECT V FROM PRQ;
+-------------------------------+
| V                             |
|-------------------------------|
| {                             |
|   "a": 1.00,                  |
|   "b": "foo"                  |
| }                             |
| {                             |
|   "a": 2.00,                  |
|   "b": "bar"                  |
| }                             |
| {                             |
|   "a": 3.00,                  |
|   "b": "spam"                 |
| }                             |
| {                             |
|   "b": "eggs" [a is missing]  |
| }                             |
| {                             |
|   "b": "ham" [a is missing]   |
| }                             |
+-------------------------------+
Since querying V:a will emit nulls on the final two rows, you can leverage IFNULL to re-add the NaNs (if the data cannot truly be null):
> SELECT V:b, IFNULL(V:a, 'NaN') FROM PRQ;
+--------+--------------------+
| V:B    | IFNULL(V:A, 'NAN') |
|--------+--------------------|
| "foo"  | 1                  |
| "bar"  | 2                  |
| "spam" | 3                  |
| "eggs" | NaN                |
| "ham"  | NaN                |
+--------+--------------------+
Related
I have this table named student_classes:
| id | name    | class_ids |
|----|---------|-----------|
| 1  | Rebecca | {1,2,3}   |
| 2  | Roy     | {1,3,4}   |
| 3  | Ted     | {2,4,5}   |
name is type text / string
class_ids is type integer[]
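For reference, the table amounts to something like this (a rough sketch from the description above, not my actual DDL):

CREATE TABLE student_classes (
    id        integer PRIMARY KEY,
    name      text,
    class_ids integer[]  -- the array column that disappears in BigQuery
);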
I created a datastream from PostgreSQL to BigQuery (following these instructions), but when I looked at the table's schema in BigQuery, the class_ids field was gone and I am not sure why.
I was expecting class_ids to be ingested into BigQuery rather than dropped.
I'm thinking through a database design and was wondering if anyone could chime in. I have some structured data that I will occasionally filter against somewhat unstructured data. I'm thinking a lot about performance, so I'm trying to keep the schema as denormalized as possible. Do people have opinions about an indexed JSONB column versus a separate table? For example:
| (smurfs) id | name   | filters (GIN index)                 |
|-------------+--------+-------------------------------------|
| 1           | Papa   | { "color": "blue" }                 |
| 2           | Brainy | { "brain": "big", "color": "blue" } |
And I'd query against the indexed JSONB data.
Or:
| (smurfs) id | name   |
|-------------+--------|
| 1           | Papa   |
| 2           | Brainy |
| (filters) id | smurf_id | filter_type | filter_value |
|--------------+----------+-------------+--------------|
| 1            | 1        | color       | blue         |
| 2            | 2        | brain       | big          |
| 3            | 2        | color       | blue         |
and I'd JOIN the filters with the data for my query.
There's a lot of talk about misuse of JSON in relational databases. Does this fit into that category? Would one be preferable over the other from a design standpoint? Is one more performant? I'm trying to optimize for reads on a large table. It seems like in the second case I'd have two large tables instead of one.
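To make the comparison concrete, the two query shapes would look roughly like this (the index name is made up):

-- Option 1: JSONB column with a GIN index, filtered by containment
CREATE INDEX idx_smurfs_filters ON smurfs USING GIN (filters);

SELECT id, name
FROM smurfs
WHERE filters @> '{"color": "blue"}';

-- Option 2: separate filters table, filtered via a join
SELECT s.id, s.name
FROM smurfs AS s
JOIN filters AS f ON f.smurf_id = s.id
WHERE f.filter_type = 'color'
  AND f.filter_value = 'blue';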
Here's what I have: An Access database made up of 3 tables linked from SQL server. I need to create a new table in this database by querying the 3 source tables. Here are examples of the 3 tables I'm using:
PlanTable1
+------+------+------+------+---------+---------+
| Key1 | Key2 | Key3 | Key4 | PName | MainKey |
+------+------+------+------+---------+---------+
| 53 | 1 | 5 | -1 | Bikes | 536681 |
| 53 | 99 | -1 | -1 | Drinks | 536682 |
| 53 | 66 | 68 | -1 | Balls | 536683 |
+------+------+------+------+---------+---------+
SpTable
+----+---------+---------+
| ID | MainKey | SpName |
+----+---------+---------+
| 10 | 536681 | Wing1 |
| 11 | 536682 | Wing2 |
| 12 | 536683 | Wing3 |
+----+---------+---------+
LocTable
+-------+-------------+--------------+
| LocID | CenterState | CenterCity |
+-------+-------------+--------------+
| 10 | IN | Indianapolis |
| 11 | OH | Columbus |
| 12 | IL | Chicago |
+-------+-------------+--------------+
You can see the relationships between the tables. The NewMasterTable I need to create based on these will look something like this:
NewMasterTable
+-------+--------+-------------+------+--------------+-------+-------+-------+
| LocID | PName | CenterState | Key4 | CenterCity | Wing1 | Wing2 | Wing3 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
| 10 | Bikes | IN | -1 | Indianapolis | 1 | 0 | 0 |
| 11 | Drinks | OH | -1 | Columbus | 0 | 1 | 0 |
| 12 | Balls | IL | -1 | Chicago | 0 | 0 | 1 |
+-------+--------+-------------+------+--------------+-------+-------+-------+
The hard part for me is making this new table dynamic. In the future, rows may be added to the source tables. I need my NewMasterTable to reflect any changes/additions to the source. How do I go about building the NewMasterTable as described? Does this make any sort of sense?
Since an Access table is a necessary requirement, probably the only way to go about it is to create a set of Update and Insert queries that are executed periodically. There is no built-in "dynamic" feature of Access that will monitor and update the table.
First, create the table. You could either 1) do this manually from scratch by defining the columns and constraints yourself, or 2) create a make-table query (i.e. SELECT... INTO) that generates most of the schema, then add any additional columns, edit necessary details and add appropriate indexes.
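For example, a make-table query along these lines would generate most of the schema (a rough sketch only: the joins are guesses from your sample data, and since Access SQL does not accept comments, drop the comment lines before pasting):

-- Assumes PlanTable1.MainKey matches SpTable.MainKey and SpTable.ID
-- matches LocTable.LocID; adjust to your real relationships.
SELECT l.LocID, p.PName, l.CenterState, p.Key4, l.CenterCity
INTO NewMasterTable
FROM (PlanTable1 AS p
      INNER JOIN SpTable AS s ON s.MainKey = p.MainKey)
     INNER JOIN LocTable AS l ON l.LocID = s.ID;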
Define and save Update and Insert (and optionally Delete) queries to keep the table synced. I'm not sharing actual code here, because I think that goes beyond your primary issue and requires specifics that you need to define. Due to some ambiguity with your key values (the field names and sample data still are not sufficient to reveal precise relationships and constraints), it is likely that you'll need multiple Update statements.
In particular, the "Wing" columns will likely require a transform statement.
You may not be able to update all columns appropriately using a single query. I recommend not trying to force such an "artificial" requirement. Multiple queries can actually be easier to understand and maintain.
In the event that you experience "query is not updateable" errors, you may need to define other "temporary" tables with appropriate indexes, into which you do initial inserts from the linked tables, then subsequent queries to update your master table from those.
Finally, and I think this is the key to solving your problem, you need to define some Access form (or other code) that periodically runs your set of "sync" queries. Access forms have a [Timer Interval] property and corresponding Timer event that fires periodically. Add VBA code in the Form_Timer sub that runs all your queries. I would suggest "wrapping" such VBA in a transaction and adding appropriate error handling and error logging, etc.
I'm trying to make a database that will hold a table of objects, and these objects are composed of objects from a second table. One table is a table of possible sets, and the second is a table of possible components. The table of sets has to include fields for each of its components, but each set has an unknown number of components. How do I make a table with fields (Component 1, Component 2, Component 3, ...) whose number depends on how many components each set needs?
Is there a way to do this just using the Access interface or will I actually have to get into the code behind it?
I think it would also solve my problem if there were a way to make a field act like an ArrayList, so if anyone can think of how to do that, please let me know.
Assuming that a component can be part of more than one set, what you need here is a many-to-many relationship.
In a database you don't do this with an arbitrary number of columns; you use a junction table.
When you need a tabular representation, you use a Pivot / Crosstab query (see the sketch after the data model below).
Your data model could look like this:
Sets
+--------+----------+
| Set_ID | Set_Name |
+--------+----------+
| 1 | foo |
| 2 | bar |
+--------+----------+
Components
+--------------+----------------+
| Component_ID | Component_Name |
+--------------+----------------+
| 1 | aaa |
| 2 | bbb |
| 3 | ccc |
| 4 | ddd |
+--------------+----------------+
Junction table
+----------+----------------+
| f_Set_ID | f_Component_ID |
+----------+----------------+
| 1 | 2 |
| 1 | 4 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
+----------+----------------+
(f_ as in Foreign Key)
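And when you do need the tabular view, a crosstab over the junction table produces one column per component. A sketch with the names above (the junction table name is bracketed because of the space; cells come out as 1 where the component belongs to the set and Null where it does not):

TRANSFORM Count(j.f_Component_ID)
SELECT s.Set_Name
FROM (Sets AS s
      INNER JOIN [Junction table] AS j ON j.f_Set_ID = s.Set_ID)
     INNER JOIN Components AS c ON c.Component_ID = j.f_Component_ID
GROUP BY s.Set_Name
PIVOT c.Component_Name;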
In Cucumber, we can directly validate database table content in tabular format by specifying the values in the format below:
| Type | Code | Amount |
| A    | HIGH | 27.72  |
| B    | LOW  | 9.28   |
| C    | LOW  | 4.43   |
Do we have something similar in Robot Framework? I need to run a query on the DB, and the output looks like the table above.
No, there is nothing built in to do exactly what you describe. However, it's fairly straightforward to write a keyword that takes a table of data and compares it to another table of data.
For example, you could write a keyword that takes the result of the query and then rows of information (though, the rows must all have exactly the same number of columns):
| | ${ResultOfQuery}= | <do the database query>
| | Database should contain | ${ResultOfQuery}
| | ... | #Type | Code | Amount
| | ... | A | HIGH | 27.72
| | ... | B | LOW | 9.28
| | ... | C | LOW | 4.43
Then it's just a matter of iterating over all of the arguments three at a time, and checking if the data has that value. It would look something like this:
*** Keywords ***
| Database should contain
| | [Arguments] | ${actual} | @{expected}
| | :FOR | ${type} | ${code} | ${amount} | IN | @{expected}
| | | <verify that the values are in ${actual}>
Even easier might be to write a Python-based keyword, which makes it a bit easier to iterate over datasets.