Let's say I have a canvas where there are various objects I can add in, such as a/an:
Drawing
Image
Chart
Note
Table
For each object I need to store the dimensions and the layer order, for example something like this:
ObjectID
LayerIndex
Dimensions ((x1, y1), (x2, y2))
Each of the objects has vastly different properties and so is stored in a different table (or class, or whatever). Would it be possible to store this in a relational database, and if so, how could it be done? In JSON it would be something like this:
// LayerIndex is the ArrayIndex
// No need to store ObjectID, since the object is stored within the array itself
Layers = [
{Type: Drawing, Props: <DrawingPropertyObj>, Dimensions: [(1,2), (3,4)]},
{Type: Chart, Props: <ChartPropertyObj>, Dimensions: [(3,4), (10,4)]},
{Type: Table, Props: <TablePropertyObj>, Dimensions: [(10,20), (30,44)]},
...
]
The one option I thought of is storing a FK to each table, but in that case, I could potentially join this to N different tables for each object type, so if there are 100 object types, ...
A "strict" relational database doesn't suit this task well because you're left with a choice of:
Different tables for each object type, with a column for each attribute that applies to that particular object type
A single table for all object types, with columns for each attribute, most of which aren't used for any given object type
A child table, one row for each attribute
Before moving on to a good general solution, let's discuss these:
1. Different tables for each object type
This is a non-starter. The problems are:
high maintenance cost: you must create a new table every time you add a new object type to your app
painful queries: you must join to every table, either horizontally (every table joined into one enormously long row) or vertically (a series of unioned joins), leading to a sparse array (see option 2)
2. A single table for all object types
Although you're dealing with a sparse array, if most object types use most of the attributes (i.e. it's not that sparse), this is a good option. However, if the number of distinct attributes across your domain is high, and/or most attributes aren't used by every type, you have to add columns when introducing a new type. That's better than adding tables, but it still requires a schema change for every new type = high maintenance
3. A child table
This is the classic approach, but the worst to work with: either you run a separate query to collect all the attributes for each object (slow, high maintenance), or you write separate queries for each object type, joining to the child table once for each attribute to flatten the many rows into one row per object. That effectively reproduces option 1, with an even higher maintenance cost for writing the queries.
None of those are great options. What you want is:
One row per object
Simple queries
Simple schema
Low maintenance
A document database, such as Elasticsearch, gives you all of this out of the box, but you can achieve the same effect with a relational database by relaxing "strictness" and saving the whole object as json in a single column:
create table object (
id int, -- typically auto incrementing
-- FK to parent - see below
json text -- store object as json
);
BTW, Postgres would be a good choice, because it has native support for json via the json (and jsonb) datatypes.
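For instance, once the object is stored as json you can filter or project on its fields directly in SQL. A minimal sketch, assuming Postgres, the table above, and the key names from the question (the json column is text here, so it is cast before applying the ->> and #>> operators):

select id,
       json::json ->> 'Type' as object_type,
       json::json #>> '{Dimensions,0}' as first_corner
from object
where json::json ->> 'Type' = 'Chart';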
I have used this several times in my career, always successfully. I added a column for the object class type (in a Java context):
create table object (
id int,
-- FK to parent - see below
class_name text,
json text
);
and used a json library to deserialize the json using the specified class into an object of that class. Whatever language you're using will have a way of achieving this idea.
As for the hierarchy, a relational database does this well. From the canvas:
create table canvas (
id int,
-- various attributes
);
If objects are not reused:
create table object (
id int,
canvas_id int not null references canvas,
class_name text,
json text,
layer int not null
);
If objects are reused:
create table object (
id int,
class_name text,
json text
);
create table canvas_object (
canvas_id int not null references canvas,
object_id int not null references object,
layer int not null
);
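Either way, reassembling a canvas is one simple query. A sketch for the reusable-object variant (the canvas id is just an example value):

select o.id, o.class_name, o.json
from canvas_object co
join object o on o.id = co.object_id
where co.canvas_id = 42
order by co.layer;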
You have many options as shown below.
There is not much difference in which one you pick, but I would avoid the multi-table design, which is the one you described. An object type with 100 properties would be scattered across 101 tables for no gain: 101 disk page accesses every time an object of that type is read. That's unnecessary (if those pages are cached the problem is smaller, but it is still waste).
Even the dual-table design is not really necessary unless you want to filter on properties, e.g. 'all objects with color=red'. But performance is probably not so critical that you need to go that far (other things matter more, or other bottlenecks dominate), so pick whichever of the single- or dual-table options fits you best.
Single table - flexible schema per object type
objlayerindex | type    | props                                    | x0 | y0 | x1 | y1
0             | drawing | {color:#00FF00,background-color:#00FFFF} | 1  | 2  | 3  | 4
1             | chart   | {title:2021_sales,values:[[0,0],[3,4]]}  | 11 | 22 | 33 | 44
In props, the keys are used for flexibility: different objects of the same type may have different keys; for example, a chart without a subtitle can simply omit that key.
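If props is held in a json/jsonb column, filtering on a property such as 'all objects with color=red' stays a one-liner. A sketch assuming Postgres, the columns above, and a table name (layer_object) that is purely illustrative:

select objlayerindex, type
from layer_object
where props ->> 'color' = '#FF0000';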
Single table - fixed schema per object type
objlayerindex | type    | props                      | x0 | y0 | x1 | y1
0             | drawing | #00FF00,#00FFFF            | 1  | 2  | 3  | 4
1             | chart   | 2021_sales,"[[0,0],[3,4]]" | 11 | 22 | 33 | 44
This schema is fixed: drawing always has color+background-color, chart always has title+values, and so on. Less space is used, but changing the schema involves some work on already existing data.
Dual table
Main

objlayerindex | type    | x0 | y0 | x1 | y1
0             | drawing | 1  | 2  | 3  | 4
1             | chart   | 11 | 22 | 33 | 44
Properties

objlayerindex | propertyname     | propertyvalue
0             | color            | #00FF00
0             | background-color | #00FFFF
1             | title            | 2021_sales
1             | values           | [[0,0],[3,4]]
Here we assume that property ordering is not important; if it is, an extra propertyindex column would be needed. For those who love normalization, it is also possible to move propertyname out of this table into a propertykey-propertydescription table and reference it by propertykey.
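With the dual-table layout, the 'all objects with color=red' style of filter becomes a join from Main to Properties. A sketch with assumed table names main and properties:

select m.objlayerindex, m.type
from main m
join properties p on p.objlayerindex = m.objlayerindex
where p.propertyname = 'color'
  and p.propertyvalue = '#FF0000';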
Multi table
Main

objlayerindex | type    | x0 | y0 | x1 | y1
0             | drawing | 1  | 2  | 3  | 4
1             | chart   | 11 | 22 | 33 | 44

Color

objlayerindex | colorcode
0             | #00FF00

Background-Color

objlayerindex | colorcode
0             | #00FFFF

Title

objlayerindex | title
1             | 2021_sales
Values

objlayerindex | chart
1             | [[0,0],[3,4]]
Specifically this kind of data can be normalized one extra level:
Values

objlayerindex | datapoint | x | y
1             | 0         | 0 | 0
1             | 1         | 3 | 4
You can also use non-relational formats.
Document (Json) Store
[
{type:drawing,props:{color:#00FF00,background-color:#00FFFF},position:[1,2,3,4]},
{type:chart,props:{title:2021_sales,values:[[0,0],[3,4]]},position:[11,22,33,44]}
]
We cite JSON here because it is a popular and simple format, but different encodings can be used instead (CSV, protocol buffers, Avro, etc.).
Related
I have a key column in two many-to-many related tables, and a sample representation of the data is below
(attaching a sample version of the tables to get the point across, as there are numerous other columns that don't contribute to this visual)
table 1 -
table 2 -
I am making a line graph with date on the x-axis and value1 and value2 on the y-axis. value1 applies to all dates; it is basically a standard target value. Now I want all of value1 summed up to show in my visual as value1, not just the values for the dates on which I have data. To explain it better, I want 1717 on the graph as well, like the total in the table in the following image -
visual -
Is there a way to do this in Power BI? I tried making a shared dimension of all unique keys as a separate table and connecting both tables to it, but that made no change to the visual.
You can follow the steps below to achieve the required output.
Step-1: Create a custom column in your table 1 as below -
value_1_sum =
CALCULATE(
SUM(table_2[value1]),
ALL(table_2)
)
Step-2: Configure your line chart as below. Remember, the aggregation for the new custom column will be Average, as shown in the image.
And here below is the final output -
Additional reference: here below is the list of options you will get after right-clicking on the measure name -
I have an SSRS report with several levels of drilling down. Data is aggregated up for the top level view, but I need to show a different drill down report depending on the type of one of the columns.
Eg:
Table 1 - Apples
Name Cost
Fuji 1.5
Gala 3.5
Table 2 - Squashes
Name Cost
Pumpkin 2
Gourd 4.5
I have a stored procedure which aggregates these and puts them in a table for the top level report to show. Ie:
Name Cost ItemType
Apples 5 1
Squashes 6.5 2
In reality, the two tables have different columns which I need to show in the drill-through reports. Is it possible to look at the ItemType column and drill down to one of two sub-reports, depending on its value?
If you need to choose between two or more different sub-reports then make the ReportName property of the action on the textbox an expression like this.
=IIF(Fields!ItemType.Value = 1, "subReport_Apples", "subReport_Oranges")
If you have more than a handful, SWITCH will probably be better:
= SWITCH (
Fields!ItemType.Value = 1, "subReport_Apples",
Fields!ItemType.Value = 2, "subReport_Oranges",
Fields!ItemType.Value = 3, "subReport_Lemons",
True, "subReport_AnythingElse"
)
If you have a LOT of item types, consider adding the names of the subreports to your database by creating a new table containing ItemType and SubReportName. You can then join to this in your query and get the actual subreport name. The ReportName property of the textbox action would then simply be Fields!SubReportName.Value
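A sketch of that lookup table and join, with assumed names (SubReportLookup, and AggregatedItems standing in for whatever table your dataset query already reads):

CREATE TABLE dbo.SubReportLookup (
    ItemType      int          NOT NULL PRIMARY KEY,
    SubReportName varchar(100) NOT NULL
);

INSERT INTO dbo.SubReportLookup (ItemType, SubReportName)
VALUES (1, 'subReport_Apples'), (2, 'subReport_Squashes');

-- dataset query: each row now carries the name of the subreport it should open
SELECT a.Name, a.Cost, a.ItemType, l.SubReportName
FROM dbo.AggregatedItems a
JOIN dbo.SubReportLookup l ON l.ItemType = a.ItemType;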
You can add the ItemType as a parameter in your subreport(s). Then from your main report just link or jump to the sub report and pass along the Fields!ItemType.Value in the parameter configuration tab.
I have a legacy database that I am doing some ETL work on. I have columns in the old table that are conditionally mapped to columns in my new table. The conditions are based on an associated column (a column in the same table that represents the shape of an object, we can call that column SHAPE). For example:
Column dB4D is mapped to column:
B4 if SHAPE=5
B3 if SHAPE=1
X if SHAPE=10
or else Y
I am using a condition to split the table based on SHAPE, then I am using 10-15 "copy column" transformations to take the old column (dB4D) and map it to the new column (B4, B3, X, etc).
Some of these columns "overlap". For example, I have multiple legacy columns (dB4D, dB3D, dB2D, dB1D, dC1D, dC2D, etc) and multiple new columns (A, B, C, D, etc). In one of the "copy columns" (which are broken up by SHAPE) I could have something like:
If SHAPE=10
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | B |
+--------------+--------------+
If SHAPE=5
+--------------+--------------+
| Input Column | Output Alias |
+--------------+--------------+
| dB4D | C |
+--------------+--------------+
I now need to bring these all together into one final staging table (or "destination"). No two rows will have the same size, so there is no conflict. But I need to map dB4D (and other columns) to different new columns based on a value in another column. I have tried to merge them, but I can't merge multiple data sources. I have tried to join them, but not all columns (or output aliases) would show up in the destination. Can anyone recommend how to resolve this issue?
Here is the current design that may help:
As inputs to your data flow, you have a set of columns dB4D, dB3D, dB2D, etc.
Your destination will only have column names that do not exist in your source data.
Based on the Shape column, you'll project the dB columns into different mappings for your target table.
If the Conditional Split logic makes sense as you have it, don't try to Union All it back together. Instead, just wire up 8 OLE DB Destinations. You'll probably have to change them from the "fast load" option to the table name option. This means they will perform singleton inserts, so hopefully the data volumes won't be an issue. If they are, then create 8 staging tables that do use the "Fast Load" option, and then have a successor task to your Data Flow perform set-based inserts into the final table.
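A sketch of those set-based inserts, run from an Execute SQL Task after the Data Flow (table and column names here are assumptions):

-- rows split out for SHAPE = 10, where dB4D was mapped to B
INSERT INTO dbo.FinalTable (KeyColumn, B)
SELECT KeyColumn, B
FROM dbo.Staging_Shape10;

-- rows split out for SHAPE = 5, where dB4D was mapped to C
INSERT INTO dbo.FinalTable (KeyColumn, C)
SELECT KeyColumn, C
FROM dbo.Staging_Shape5;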
The challenge you'll run into with the Union All component is that if you make any changes to the source, the Union All rarely picks up on the change (the column changed from varchar to int, sorry!).
I have a tricky problem trying to find an efficient way of ordering a set of objects (~1000 rows) that contain a large (~5 million) number of indexed data points. In my case I need a query that allows me to order the table by a specific datapoint. Each datapoint is a 16-bit unsigned integer.
I am currently solving this problem by using a large array:
Object Table:
id serial NOT NULL,
category_id integer,
description text,
name character varying(255),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
data integer[],
GIST index:
CREATE INDEX object_rdtree_idx
ON object
USING gist
(data gist__intbig_ops)
This index is not currently being used when I do a select query, and I am not certain it would help anyway.
Each day the array field is updated with a new set of ~5 million values
I have a webserver that needs to list all objects ordered by the value of a particular data point:
Example Query:
SELECT name, data[3916863] as weight FROM object ORDER BY weight DESC
Currently, it takes about 2.5 Seconds to perform this query.
Question:
Is there a better approach? I am happy for the insertion side to be slow as it happens in the background, but I need the select query to be as fast as possible. In saying this, there is a limit to how long the insertion can take.
I have considered creating a lookup table where every value has its own row - but I'm not sure how the insertion/lookup time would be affected by this approach, and I suspect entering 1000+ records with ~5 million data points as individual rows would be too slow.
Currently inserting a row takes ~30 seconds which is acceptable for now.
Ultimately I am still on the hunt for a scalable solution to the base problem, but for now I need this solution to work, so this solution doesn't need to scale up any further.
Update:
I was wrong to dismiss having a giant table instead of an array, while insertion time massively increased, query time is reduced to just a few milliseconds.
I am now altering my generation algorithm to only save a datum if it is non-zero and has changed since the previous update. This has reduced insertions to just a few hundred thousand values, which only takes a few seconds.
New Table:
CREATE TABLE data
(
  object_id integer,
  data_index integer,
  value integer
);
CREATE INDEX index_data_on_data_index
ON data
USING btree
("data_index");
New Query:
SELECT name, coalesce(value,0) as weight FROM objects LEFT OUTER JOIN data on data.object_id = objects.id AND data_index = 7731363 ORDER BY weight DESC
Insertion Time: 15,000 records/second
Query Time: 17ms
First of all, do you really need a relational database for this? You do not seem to be relating some data to some other data. You might be much better off with a flat-file format.
Secondly, your index on data is useless for the query you showed. You are querying for a datum (a position in your array) while the index is built on the values in the array. Dropping the index will make the inserts considerably faster.
If you have to stay with PostgreSQL for other reasons (bigger data model, MVCC, security), then I suggest you change your data model: alter the data column to type bytea and SET STORAGE EXTERNAL. Since the data column is about 4 x 5 million = 20MB it will be stored out-of-line anyway, but if you set it explicitly, then you know exactly what you have.
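The storage setting itself is a single statement; a sketch, assuming the column has already been migrated to bytea (that data conversion is a separate step, not shown here):

ALTER TABLE object ALTER COLUMN data SET STORAGE EXTERNAL;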
Then create a custom function in C that fetches your data value "directly" using the PG_GETARG_BYTEA_P_SLICE() macro and that would look somewhat like this (I am not a very accomplished PG C programmer so forgive me any errors, but this should help you on your way):
#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

// Function get_data_value() -- Get a 4-byte value from a bytea
// Arg 0: bytea* The data
// Arg 1: int32 The position of the element in the data, 1-based
PG_FUNCTION_INFO_V1(get_data_value);

Datum
get_data_value(PG_FUNCTION_ARGS)
{
    int32 element = PG_GETARG_INT32(1) - 1;      // second argument, make 0-based
    bytea *data = PG_GETARG_BYTEA_P_SLICE(0,     // first argument
                      element * sizeof(int32),   // offset into the data
                      sizeof(int32));            // get just the required 4 bytes

    // the slice's payload follows the varlena header; read it back as an int32
    PG_RETURN_INT32(*((int32 *) VARDATA(data)));
}
The PG_GETARG_BYTEA_P_SLICE() macro retrieves only a slice of data from the disk and is therefore very efficient.
There are some samples of creating custom C functions in the docs.
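Once the C function is compiled into a shared library, it would be declared to Postgres with something along these lines (the library path and naming are assumptions):

CREATE FUNCTION get_data_value(bytea, integer) RETURNS integer
    AS '/path/to/get_data_value', 'get_data_value'
    LANGUAGE C STRICT IMMUTABLE;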
Your query now becomes:
SELECT name, get_data_value(data, 3916863) AS weight FROM object ORDER BY weight DESC;
This question came up based on the responses I got to the question
Getting weird issue with TO_NUMBER function in Oracle
As everyone suggested that storing numeric values in VARCHAR2 columns is not a good practice (which I totally agree with), I am wondering about a basic design choice our team has made and whether there is a better way to design it.
Problem statement: We have many tables where we want to offer a certain number of custom fields. The number of required custom fields is known, but what kind of attribute is mapped to each column is up to the user.
E.g. I am putting down a hypothetical scenario below
Say you have a laptop table which stores 50 attribute values for every laptop record. Each laptop's attributes are defined by the admin who creates the laptop.
A user creates a laptop product, let's say lap1, with attributes String, String, numeric, numeric, String.
A second user creates laptop lap2 with attributes String, numeric, String, String, numeric.
Currently the data in our design gets persisted as follows:
Laptop Table
Id Name field1 field2 field3 field4 field5
1 lap1 lappy lappy 12 13 lappy
2 lap2 lappy2 13 lappy2 lapp2 12
This example kind of simulates our requirement and our design
Now, if somebody is looking up records for lap2 and doing a comparison on field2, we need to apply TO_NUMBER:
select * from laptop
where name='lap2'
and TO_NUMBER(field2) < 15
TO_NUMBER fails in some cases when the query plan decides to apply to_number before the other filter.
QUESTIONS
Is this a valid design?
What are the other alternative ways to solve this problem?
One of our team mates suggested creating tables on the fly for such cases. Is that a good idea?
How do popular ORM tools give custom fields or flex fields handling?
I hope I was able to make sense in the question.
Sorry for such a long text..
This causes us to use TO_NUMBER when querying.
This is a common problem and there is no perfect solution. A couple of solutions:
1.
Define X fields of type varchar2, Y fields of type number and Z fields of type date. That comes out to potentially 3 times the number of custom fields, but you will never have any conversion problems again.
Your example would come out like this:
Id Name field_char1 field_char2 field_char3 ... field_num1 field_num2 ...
1 lap1 lappy lappy lappy ... 12 13
2 lap2 lappy2 lappy2 lapp2 ... 13 12
In your example you have the same number of numeric values and character values on both rows but it doesn't have to be this way: the third row could have no numeric field for example.
2.
Define X fields of type varchar2 and apply a bijective function to store number or date fields (for example, a date could be stored as YYYYMMDDHH24MISS). You will also need an extra field that defines the context of the row. You would apply the to_number or to_char function only when the rows are of the right type.
Your example:
Id Name context field1 field2 field3 field4 field5
1 lap1 type A lappy lappy 12 13 lappy
2 lap2 type B lappy2 13 lappy2 lapp2 12
You could query the table using DECODE or CASE:
SELECT *
FROM laptop
WHERE CASE WHEN context = 'TYPE A' THEN to_number(field3) END = 12
The second design is the one used in the Oracle Financials ERP (among others). The context allows you to define CHECK constraints with this design (for example CHECK (CASE WHEN context = 'TYPE A' THEN to_number(field3) END > 0)) to ensure integrity.
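As a sketch, such a constraint could be added like this (the table and constraint names are assumptions):

-- only 'TYPE A' rows are forced to hold a positive number in field3;
-- for other contexts the CASE yields NULL and the check is satisfied
ALTER TABLE laptop ADD CONSTRAINT laptop_type_a_field3_chk
  CHECK (CASE WHEN context = 'TYPE A' THEN to_number(field3) END > 0);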
This is a common scenario with shrink-wrapped apps, where it represents the only opportunity for customizing the data model. But from a purist point of view it is bad practice. Because if a column can contain '27-MAY-2010' or 178.50 or 'Red badger' then clearly it is dependent on something external to the database to give it meaning.
But using an XMLType is even worse because you lose what little structure you have. It becomes difficult to query on the flexible columns. Still there are some scenarios where this is the appropriate solution: mainly when we're not interested in the individual elements, just the collection of properties.
So, what is the best way of dealing with it? Customised functions to go with your custom columns:
SQL> create or replace function get_number
2 ( p_str in varchar2 )
3 return number
4 deterministic
5 is
6 return_value number;
7 begin
8 begin
9 return_value := to_number(trim(p_str));
10 exception
11 when others then
12 return_value := null;
13 end;
14 return return_value;
15 end;
16 /
Function created.
SQL>
We can build a function-based index on this column, for performance:
SQL> create index t42_flex_idx on t42 ( get_number( flex_col))
2 /
Index created.
SQL>
So given this test data ....
SQL> select * from t42
2 /
ID FLEX_COL
---------- ------------------------------
1 27-MAY-2010
2 138.50
3 Red badger
2 23
SQL>
... here's how it works:
SQL> select * from t42
2 where get_number(flex_col) < 50
3 /
ID FLEX_COL
---------- ------------------------------
2 23
SQL>
If all of the column types are decided at the time the table is created, then generating tables on the fly sounds good to me.
However, if two users are using the same table with different fields, you could create new tables just for the custom fields and join them to the main table. This is more of an object oriented approach.
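A sketch of that idea, with assumed names: one extension table per product type, keyed back to the main table and joined in when needed:

CREATE TABLE lap1_fields (
    laptop_id  NUMBER REFERENCES laptop(id),
    field1     VARCHAR2(100),
    field2     VARCHAR2(100),
    field3     NUMBER,
    field4     NUMBER,
    field5     VARCHAR2(100)
);

SELECT l.id, l.name, f.*
FROM laptop l
JOIN lap1_fields f ON f.laptop_id = l.id
WHERE l.name = 'lap1';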
Could you create an XML graph in the code layer and store it in a SYS.XMLTYPE field type?
http://www.oracle-base.com/articles/9i/XMLTypeDatatype.php
This would allow you to strongly type (in XML) your values and retain meaningful structure.