I am reading about the "entity-attribute-value model" (EAV), which sort of reminds me of a star schema as used in data warehousing.
One table holds all the facts (even if you mix apples and bananas, e.g. date of farming, weight, price, color, type, name) and a bunch of tables hold the details (e.g. infected_with_banana_virus_type, apple_specific_acid_level).
I do this in both approaches, so I can't see the difference between these two terms.
Please enlighten me. CHEERS
In all approaches you have entities, attributes and values. Everything reduces to this logically. Since everything has entities, attributes and values, you can always claim that everything is the same. All data structures are -- from that point of view -- identical.
Please draw a diagram of a star schema, with a fact table (say, web site GET requests) and some dimensions like Time, IP Address, Requested Resource Path, and session User.
Actually draw the actual diagram, please. Don't read the words, look at the picture of five tables.
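A rough sketch of those five tables (all table and column names here are illustrative, not prescribed):

create table dim_time (
    time_key    int primary key,
    full_date   date,
    day_of_week varchar(10)
);
create table dim_ip_address (
    ip_key     int primary key,
    ip_address varchar(45)
);
create table dim_resource_path (
    path_key int primary key,
    path     varchar(500)
);
create table dim_user (
    user_key int primary key,
    username varchar(100)
);
create table fact_get_request (
    -- one row per GET request, pointing at the four dimensions
    time_key int references dim_time(time_key),
    ip_key   int references dim_ip_address(ip_key),
    path_key int references dim_resource_path(path_key),
    user_key int references dim_user(user_key)
);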
After drawing that picture, draw a single EAV table.
Actually draw the picture with entity, attribute and value columns. Don't read the words. Look at the picture of one table.
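And a rough sketch of the single EAV table (again, names are illustrative):

create table eav (
    entity    varchar(100),   -- e.g. an id identifying one GET request
    attribute varchar(100),   -- e.g. 'time', 'ip_address', 'path', 'user'
    value     varchar(500)    -- everything stored as text
);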
Okay?
Now write down all the differences between the two pictures. Number of tables. Number of columns. Data types of each column. All the differences.
We're not done.
Write a SQL query to count GET requests by day of the week for a given user using the star schema. Actually write the SQL. It's a three-table join. With a GROUP BY and a WHERE.
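For example, a sketch against the star schema sketched above (names are illustrative):

select t.day_of_week, count(*) as get_requests
from fact_get_request f
join dim_time t on t.time_key = f.time_key
join dim_user u on u.user_key = f.user_key
where u.username = 'some_user'
group by t.day_of_week;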
Try and write a SQL query to count GET requests by day of week for the EAV table.
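Against the single EAV table sketched above, an attempt might look something like this (a sketch only, assuming every entity is one GET request and that the 'time' values parse as dates; PostgreSQL syntax):

select extract(dow from cast(tv.value as date)) as day_of_week,
       count(*) as get_requests
from eav uv
join eav tv on tv.entity = uv.entity and tv.attribute = 'time'
where uv.attribute = 'user'
  and uv.value = 'some_user'
group by extract(dow from cast(tv.value as date));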
Okay?
Now write down all the differences between the two queries. Complexity of the SQL, for example. Performance of the SQL. Time required to write the SQL.
Now you know the differences.
I wanted to ask whether there may be a different and better approach than mine.
I have a model entity that can have an arbitrary number of hyperparameters. Depending on the specific model I want to insert as a row into the model table, I may have specific hyperparameters. I do not want to continuously add new columns to my model table for new hyperparameters that I encounter when trying out new models (plus I don't like having a lot of columns that are null for many rows). I also want to easily filter models on specific hyperparameter values, e.g. "select * from models where model.hyperparameter_x.value < 0.5". So, an n-to-n relationship to a hyperparameter table comes to mind. The issue is that the data type for hyperparameters can differ, so I cannot define a general value column on the relationship table with a data type that's easily comparable across different models.
So my idea is to define a JSON-typed "value" column in the relationship table to support different value data types (float, array, string, ...). What I don't like about that idea, and what was legitimately criticized by colleagues, is that this can result in chaos within the value column pretty fast, e.g. people inserting data with very different JSON structures for the same hyperparameters. To mitigate this issue, I would introduce a "json_regex_template" column in the hyperparameter table, so that at the API level I can easily validate whether the JSON value for hyperparameter x is correctly defined by the user. An additional "json_example" column in the hyperparameter table would further help the user on the other side of the API make correct requests.
This solution would still not guarantee non-chaos at the database level (even though no user should directly insert data without using the API, so I don't think that's a very big deal). And the solution still feels a bit hacky. I would think that I'm not the first person with this problem, so maybe there is a best practice for solving it?
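For what it's worth, a minimal sketch of what that idea could look like (assuming PostgreSQL with a jsonb column and a models table with an integer id; all names are illustrative):

create table hyperparameter (
    id                  serial primary key,
    name                text not null unique,
    json_regex_template text,    -- used by the API to validate submitted values
    json_example        jsonb    -- shown to API users as a reference
);
create table model_hyperparameter (
    model_id          int not null references models(id),
    hyperparameter_id int not null references hyperparameter(id),
    value             jsonb not null,
    primary key (model_id, hyperparameter_id)
);
-- filtering models on a numeric hyperparameter value
select m.*
from models m
join model_hyperparameter mh on mh.model_id = m.id
join hyperparameter h on h.id = mh.hyperparameter_id
where h.name = 'hyperparameter_x'
  and (mh.value #>> '{}')::numeric < 0.5;

The #>> '{}' pulls the jsonb scalar out as text so it can be cast to numeric; it only makes sense for hyperparameters whose values are plain numbers.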
Is my aversion to continuously adding columns reasonable? It's probably about 3-5 new columns per month; it may saturate at some point to a lower number, but that's speculative.
I'm aware of this post (Storing JSON in database vs. having a new column for each key), but it's pretty old, so my hope is that there may be new stuff I could use. The model-hyperparameter thing is of course just a small part of my full database model. Changing to a non-relational database is not an option.
Opinions are much appreciated
What's a good approach to data warehouse design if requested reports require summarized information about the same dimensions (and at the same granularity) but the underlying data is stored in separate fact tables?
For example, a report showing total salary paid and total expenses reported for each employee for each year, when salary and expenses are recorded in different fact tables. Or a report listing total sales per month and inventory received per month for each SKU sold by a company, when sales comes from one fact table and receiving comes from another.
Solving this problem naively seems pretty easy: simply query and aggregate both fact tables in parallel, then stitch together the aggregated results either in the data warehouse or in the client app.
But I'm also interested in other ways to think about this problem. How have others solved it? I'm wondering both about data-warehouse schema and design, as well as making that design friendly for client tools to build reports like the examples above.
Also, does this "dimension sandwich" use-case have a name in canonical data-warehousing terminology? If so, that will make it easier to research via Google.
We're working with SQL Server, but the questions I have at this point are hopefully platform-neutral.
I learned today that this technique is called Drilling Across:
Drilling across simply means making separate queries against two or more fact tables where the row headers of each query consist of identical conformed attributes. The answer sets from the two queries are aligned by performing a sort-merge operation on the common dimension attribute row headers. BI tool vendors refer to this functionality by various names, including stitch and multipass query.
Sounds like the naive solution above (query multiple fact tables in parallel and stitch together the results) is also the suggested solution.
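As a rough illustration, using the salary/expenses example from the question (hypothetical table and column names, simplified so the conformed attributes appear directly on the facts):

-- aggregate each fact table separately at the same grain,
-- then stitch the result sets together on the conformed attributes
select coalesce(s.employee_id, e.employee_id)       as employee_id,
       coalesce(s.calendar_year, e.calendar_year)   as calendar_year,
       s.total_salary,
       e.total_expenses
from (
    select employee_id, calendar_year, sum(salary_amount) as total_salary
    from fact_salary
    group by employee_id, calendar_year
) s
full outer join (
    select employee_id, calendar_year, sum(expense_amount) as total_expenses
    from fact_expense
    group by employee_id, calendar_year
) e on e.employee_id = s.employee_id and e.calendar_year = s.calendar_year;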
More info:
Drilling Across - Kimball overview article
http://blog.oaktonsoftware.com/2011/12/three-ways-to-drill-across.html - SQL implementation suggestions for drilling across
Many thanks to #MarekGrzenkowicz for pointing me in the right direction to find my own answer! I'm answering it here in case someone else is looking for the same thing.
The "naive solution" you described is most of the time the preferred one.
A common exception is when you need to filter the detailed rows of one fact using another fact table. For example: "show me the capital tie-up (stock inventory) for the articles we have not sold this year". You cannot simply sum up the capital tie-up in one query. In this case a consolidated fact can be a solution, if you are able to express both measures on a common grain.
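A rough sketch of that kind of requirement, where one fact's detail rows filter the other (hypothetical table and column names; the year value is a placeholder):

select i.article_id, sum(i.stock_value) as capital_tie_up
from fact_inventory i
where not exists (
    select 1
    from fact_sales s
    where s.article_id = i.article_id
      and s.sales_year = 2015   -- "this year", illustrative
)
group by i.article_id;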
I'm an Excel user trying to solve this one problem, and the only efficient way I can think of is to do it with a database. I use arrays in VBA/Python programming and I've queried databases before, but I've never really designed one. So I'm here to look for suggestions on how to structure this DB in Access.
Anyway, I currently maintain a sheet of ~50 economic indicators for ~100 countries. It's a very straightforward sheet, with
Column headers: GDP , Unemployment , Interest Rate, ... ... ... Population
And Rows:
Argentina
Australia
...
...
Yemen
Zambia
etc.
I want to take snapshots of this sheet so I can see trends and run some analysis in the future. I thought of just duplicating the worksheet in Excel, but that feels inefficient.
I've never designed a database before. My question would be what's the most efficient way to store these data for chronological snapshots? In the future I will probably do these things:
Query a snapshot for day mm-dd-yy in the past.
Take two different data points of a metric, for however many countries, and track the change / rate of change, etc.
Once I can query them well enough, I'll probably do some kind of statistical analysis, which just requires getting the right data set.
I feel like I need to create an individual table for each country and add a row to every country table every time I take a snapshot. I'll try to play with VBA to automate this.
I can't think of any other way to do this with fewer tables. What would you suggest? Is it recommended practice to use more than a dozen tables for this task?
There are a couple of ways of doing this.
Option 1
I'd suggest you probably only need a single table, something akin to:
Country, date_of_snapshot, columns 1-50 (GDP, etc.)
Effectively, you would add a new row for each snapshot date and each country.
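A minimal sketch of that single table (column names are illustrative, and exact types will vary by database):

create table country_snapshot (
    country          varchar(100) not null,
    date_of_snapshot date         not null,
    gdp              double precision,
    unemployment     double precision,
    interest_rate    double precision,
    -- ... one column per indicator, roughly 50 in total ...
    population       double precision,
    primary key (country, date_of_snapshot)
);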
Option 2
You could also use a table structured as below, though this would require more complex queries, which may be too much for Access:
Country, date_of_snapshot, factor, value
with each factor (GDP, etc.) getting a row for each date and country.
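A minimal sketch of that variant, with an example query (names are illustrative):

create table country_indicator (
    country          varchar(100) not null,
    date_of_snapshot date         not null,
    factor           varchar(50)  not null,  -- e.g. 'GDP', 'Unemployment'
    value            double precision,
    primary key (country, date_of_snapshot, factor)
);
-- e.g. the GDP trend for one country over time
select date_of_snapshot, value
from country_indicator
where country = 'Australia' and factor = 'GDP'
order by date_of_snapshot;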
I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table, or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
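A minimal sketch of that layout, with an example point-in-time query (names, types, and the open-ended 9999-12-31 convention are all illustrative):

create table client (
    client_id           int not null,
    name                varchar(200),
    client_type         varchar(50),
    relationship_type   varchar(50),
    relationship_status varchar(50),
    valid_dt_start      date not null,        -- valid time: when the facts were true in the real world
    valid_dt_end        date not null,
    audit_dt_start      timestamp not null,   -- transaction time: when the row was recorded
    audit_dt_end        timestamp not null
);
-- the state of client 42 as it was valid on 2015-06-30, according to the current records
select *
from client
where client_id = 42
  and valid_dt_start <= '2015-06-30' and valid_dt_end > '2015-06-30'
  and audit_dt_end = '9999-12-31';            -- i.e. the row has not been superseded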
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
In relation to my previous question where I was asking for some database suggestions, it just occurred to me that I don't even know whether what I'm trying to store there is appropriate for a database, or whether some other data storage method should be used.
I have some physical model testing data (let's say wind tunnel data, or something similar) where for every model (M-1234) I have:
name (M-1234)
length L
breadth B
height H
L/B ratio
L/H ratio
...
lot of other ratios and dimensions ...
force versus speed curve given in the form of a lot of points for x-y plotting
...
a few other similar curves (all of them of type x-y).
Now, what I'm trying to accomplish is to store that in some reasonable way, so that the user who will be using the database can come and see which are the closest ten models to L/B=2.5 (or some similar query), and then somehow get all the data for those models, including the curve data (in a plain text file format).
Is an SQL database (or any other, for that matter) an appropriate way of handling something like this? Or should I take some other approach?
I have about a month to finish this, and in that time I have to learn enough about databases as well, so ... give your suggestions, please, bearing that in mind. Assume no previous knowledge of the subject whatsoever.
I think what you're looking for is possible. I'm using PostgreSQL here, but any database should work. This is my test database:
CREATE TABLE test (
id serial primary key,
ratio double precision
);
COPY test (id, ratio) FROM stdin;
1 0.29999999999999999
2 0.40000000000000002
3 0.59999999999999998
4 0.69999999999999996
\.
Then, to find the nearest values to a particular ratio:
select id,ratio,abs(ratio-0.5) as score from test order by score asc limit 2;
In this case, I'm looking for the 2 nearest to 0.5
I'd probably do a data model where you have one table for the main data, the ratios and so on, and then a second table which holds the curve points, as I'm assuming that the curves aren't always the same size.
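A minimal sketch of that data model, continuing in PostgreSQL syntax (names are illustrative):

create table model (
    id        serial primary key,
    name      text not null,          -- e.g. 'M-1234'
    length_l  double precision,
    breadth_b double precision,
    height_h  double precision,
    lb_ratio  double precision,
    lh_ratio  double precision
    -- ... further ratios and dimensions ...
);
create table curve_point (
    model_id   int references model(id),
    curve_name text not null,         -- e.g. 'force_vs_speed'
    x          double precision not null,
    y          double precision not null
);
-- the ten models closest to L/B = 2.5
select id, name, lb_ratio, abs(lb_ratio - 2.5) as score
from model
order by score asc
limit 10;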
Yes, a database is probably the best approach for this.
A relational database (which usually uses SQL for data access) is suitable for data that is more or less structured as tables.
To give you an idea:
You could have a main table model with fields name, width, etc. Then subtable(s) for any values which can appear more than once, which refer back to model (look up "foreign key").
Then a subtable for your actual curves, again referring back to model.
How to actually model the curves in the DB I don't know, as I don't know how you model them. But if it's lots of numbers, it can go into the DB.
It seems you know little about relational DBMSs. Consider reading something on Wikipedia, or doing a few simple DBMS tutorials (PostgreSQL has some: http://www.postgresql.org/docs/8.4/interactive/tutorial.html, but there are many others). Then pick a DBMS to try out (PostgreSQL is probably not a bad choice, but again there are many others).
Then try implementing a simple table schema, and get back to us with any detail questions (which you'll probably have).
One more thing: Those questions are probably more appropriate to serverfault.com.
This is arguably scientific data: you might find libraries/formats intended for arbitrary scientific data useful, e.g. HDF5 (http://www.hdfgroup.org/). (Note: I am not an expert.)