A problem with relationships between tables, and the data bloat it causes

I don't know how to explain my exact problem, but in any case I will try.
In the item_period table, if I modify any data that has already been used on invoices, a new period is created for the item's data, and when creating this period I have to copy all the units and all the pricing into the new period. In my view this causes a large inflation of the data, especially since there may be 100,000 items or more, in which case the volume of data becomes very large. My method may be incorrect, but I use it so that each period keeps its correct data.
I should also mention the billing table: items are entered on it by id_item_rate rather than by item name or unit, so the id_item_rate is used to look up the price and to identify the item and its unit.
My question is: how can I avoid this data inflation when modifying data in the item_period table?
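One common way to avoid the copying is to version only the rates that actually change, using effective dates, so that invoices keep referencing the id_item_rate that was current when they were created. A rough sketch under assumed table and column names (only id_item_rate comes from the question; everything else here is a guess, not the actual schema):

create table item_rate (
    id_item_rate integer primary key,
    item_id      integer not null,
    unit_id      integer not null,
    price        numeric(12,2) not null,
    valid_from   date not null,   -- start of the period this rate applies to
    valid_to     date             -- null = still the current rate
);

-- changing one price closes the old row and inserts one new row,
-- instead of copying every unit and price into a whole new period
update item_rate
set valid_to = current_date
where id_item_rate = 42 and valid_to is null;

insert into item_rate (id_item_rate, item_id, unit_id, price, valid_from)
values (43, 7, 1, 19.50, current_date);

-- invoice lines that reference id_item_rate = 42 are untouched and
-- still resolve to the price that was in force at the time

With this shape, only the rows that actually change get new versions; unchanged units and prices never need to be duplicated.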


DB design for sensor data (lots and LOTS of data)

I am writing an application for viewing and managing sensor data. I can have an unlimited number of sensors, and each sensor makes one reading every minute, recording the values as (time, value, sensor_id, location_id, [a bunch of other doubles]).
As an example, I might have 1,000 sensors and collect data every minute for each of them, which generates 525,600,000 rows after a year. Multiple users (say up to 20) can plot the data of any time period, zoom in and out of any range, and add annotations to a sensor's data. Users can also modify certain data points, and I need to keep track of both the raw data and the modified data.
I'm not sure what the database for an application like this should look like. Should it be just one table, SensorData, with indexes on time, sensor_id, and location_id? Should I partition this single table by sensor_id? Should I save the data in files per sensor per day (say, .csv files) and load them into a temp table on request? And how should I manage annotations?
I have not decided on a DBMS yet (maybe MySQL or PostgreSQL), but my intention is to get insight into data management for applications like this in general.
I am assuming that users cannot change the fields you show (time, value, sensor_id, location_id), only the other fields implied.
In that case, I would suggest Version Normal Form. The fields you name are static: once entered, they never change. However, the other fields are changeable by many users.
You don't state whether users see all users' changes or only their own. I will assume all changes are seen by all users; you should be able to make the appropriate adjustments if that assumption is wrong.
First, let's explain Version Normal Form. As you will see, it is just a special case of Second Normal Form.
Take a tuple of the fields you have named, rearranged to group the key values together:
R1( sensor_id(k), time(k), location_id, value )
As you can see, the location_id (assuming the sensors are movable) and value are dependent on the sensor that generated the value and the time the measurement was made. This tuple is in 2NF.
Now you want to add updatable fields:
R2( sensor_id(k), time(k), location_id, value, user_id, date_updated, ... )
But the updatable fields (contained in the ellipsis) are dependent not only on the original key fields but also on user_id and date_updated. The tuple is no longer in 2NF.
So we add the new fields not to the original tuple, but create a normalized tuple:
R1( sensor_id(k), time(k), location_id, value )
Rv( sensor_id(k), time(k), user_id(k), date_updated(k), ... )
This makes it possible to have a series of any number of versions for each original reading.
To query the latest update for a particular reading:
select R1.sensor_id, R1.time, R1.location_id, R1.value,
       R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
  on R2.sensor_id = R1.sensor_id
  and R2.time = R1.time
  and R2.date_updated = (
      select max( date_updated )
      from Rv
      where sensor_id = R2.sensor_id
        and time = R2.time )
where R1.sensor_id = :ThisSensor
  and R1.time = :ThisTime;
To query the latest update for a particular reading made by a particular user, just add the user_id value to the filtering criteria of the main query and subquery. It should be easy to see how to get all the updates for a particular reading or just those made by a specific user.
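For instance, a sketch of that per-user variant (:ThisUser is an assumed bind variable, added to both the join condition and the subquery):

select R1.sensor_id, R1.time, R1.location_id, R1.value,
       R2.user_id, R2.date_updated, R2.[...]
from R1
left join Rv as R2
  on R2.sensor_id = R1.sensor_id
  and R2.time = R1.time
  and R2.user_id = :ThisUser
  and R2.date_updated = (
      select max( date_updated )
      from Rv
      where sensor_id = R2.sensor_id
        and time = R2.time
        and user_id = :ThisUser )
where R1.sensor_id = :ThisSensor
  and R1.time = :ThisTime;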
This design is very flexible in how you can access the data and, because the key fields are also indexed, it is very fast even on Very Large Tables.
Looking for an answer, I came across this thread. While it is not entirely the same as my case, it answers many of my questions, such as whether using a relational database is a reasonable way of doing this (the answer is "Yes"), and what to do about partitioning, maintenance, archiving, etc.
https://dba.stackexchange.com/questions/13882/database-redesign-opportunity-what-table-design-to-use-for-this-sensor-data-col

Finding unique products (never seen before by a user) in a datastore sorted by a dynamically changing value (i.e. product rating)

I've been trying to solve this problem for a week and couldn't come up with any solutions in all my research, so I thought I'd ask you all.
I have a "Product" table and a "productSent" table, here's a quick scheme to help explain:
class Product(ndb.Model):
    name = ndb.StringProperty()
    rating = ndb.IntegerProperty()

class productSent(ndb.Model):  # the key name here is md5(Product key + UUID)
    pId = ndb.KeyProperty(kind=Product)
    uuId = ndb.KeyProperty(kind=userData)
    action = ndb.StringProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)
My goal is to show users the highest-rated product they've never seen before--fast. To keep track of the products users have seen, I use the productSent table. I created this table instead of using cursors because every time the rating order changes, there's a possibility that the cursor skips a new, higher-ranking product. An example: assume the user has seen products 1-24 in the db. Next, 5 users like product #25, making it the #10 product in the database--I'm worried that the product will then never be shown to the user (and possibly mess things up on a higher scale).
The problem with the way I'm doing it right now is that once the user has blown past the first 1,000 products, query performance really starts to slow down, because I'm literally pulling 1,000+ results, checking whether each has been sent by querying against the productSent table (doing a keyName lookup to speed things up), and looping until 15 new ones have been found.
One solution I thought of was to add a repeated property (listProperty) to the Product table holding all the users who have seen a product. Or, if I don't want inequality filters, I could put a repeated property of all the users who haven't seen a product; that way, when I query, I can dynamically take those out. But I'm afraid of what happens when I have 1,000+ users:
a) I'll go through the roof on the limit of repeated properties in one entity.
b) The index size will increase storage costs.
Has anyone dealt with this problem before (I'm sure someone has!)? Any tips on the best way to structure it?
Update
Okay, so I had another idea. In order to minimize the changes that take place when a rating (number of likes) changes, I could have a secondary column with only 3 possible values--positive, neutral, negative--and sort by that. Of course, items that have a rating of 0 and get a 'like' (making them positive) would still have a chance of being out of order or skipped by the cursor--but it'd be less likely. What do y'all think?
Sounds like the inverse, productNotSent, would work well here. Every time you add a new product, you would add a new productNotSent entity for each user. When the user wants to see the highest-rated product they have not seen, you only have to query over the productNotSent entities that match that user. If you put the rating directly on productNotSent, you can speed the query up even more, since you will only have to query against one model.
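In relational terms (the question is about the App Engine datastore, so this is only an illustrative analogy with assumed names), the inverse-table idea looks like:

-- one row per (user, unseen product), with the rating denormalized
-- onto the row so no join against the product table is needed
select product_id
from product_not_sent
where user_id = :this_user
order by rating desc
limit 15;

-- once a product has been shown, its row is simply removed
delete from product_not_sent
where user_id = :this_user and product_id = :shown_product;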
Another idea would be to limit the number of productNotSent entities per user, so each user only has ~100 of these entities at a time. This would mean your query would be constant for each user, regardless of the number of products or users you have. The creation of new productNotSent entities would become more complex, though. You'd have to have a cron job or something that "tops up" a user's collection of productNotSent entities as they use them up. You would also want to double-check that products rated higher than those already within the user's set of productNotSent entities get pushed in there. These are a little more difficult and will require some design trade-offs.
Hope this helps!
I do not know your expected volumes and exact issues (I only did a quick perusal of your question), but you may consider using JSON TextProperty storage as part of your plan. Create dictionaries/lists and store them in records by json.dumps()ing them to a TextProperty. When the client calls, simply send the TextProperties to the client, and figure everything out on the client side once you JSON.parse() them. We have done some very large array/object processing in JS this way, and it is very fast (particularly with indexed arrays). When the user clicks on something, send a transaction back to update their record. Set up some pull or push queue processes to handle your overall product listing updates, major customer record updates, etc.
One downside is higher bandwidth going out of your app, but I think this cost will be minimal given the potential processing savings on GAE. If you structure this right, you may be able to use get_by_id() to replace all or most of your planned indices and queries. We have found json.loads() and json.dumps() to be very fast inside the app, but we only use simple dictionary/list structures. This approach should be, though, a great deal cheaper than your planned use of queries. The other potential issue is that very large objects may run into soft memory limits. Be sure that your JSON objects are fairly simple and lightweight to avoid this (e.g. do not include product descriptions, sub-objects, etc. in the JSON item--just the basics, such as product number). HTH, -stevep

Database Normalization and User Defined Data Storage

I am looking to let the users of my web application define their own attributes for products and then enter data for those products. I have found out that this technique is called n(th) normal form.
The following is the DB structure I am currently considering deploying, and I was wondering what the positives and negatives would be with regard to integrity and scalability (and any other -ities you can think of).
EDIT
(Sorry, this is more what I mean.)
I have been staring at this for the last 15 minutes, and I know the spot where the red arrow is induces duplication, hence you would have to have integrity checks. But I just don't understand how else what I want could be done.
The products would number no more than 10. The variables would number no more than 200 (max 20 per product). The number of product instances would not exceed 100,000; therefore the maximum size of pVariable_data would not exceed 2 million rows.
This model is called a database in a database and is not nice, though sometimes it is unavoidable. First check whether you really need it and whether your database is really the right database for the job.
With PostgreSQL you could use hstore (http://www.postgresql.org/docs/8.4/static/hstore.html), which is a standardized solution for this kind of issue.
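A minimal sketch of the hstore approach (the linked docs are for 8.4, where hstore was installed from contrib scripts; the create extension syntax below assumes PostgreSQL 9.1+, and the table is hypothetical):

create extension if not exists hstore;

create table product (
    id    serial primary key,
    name  text not null,
    attrs hstore              -- user-defined attributes as key/value pairs
);

insert into product (name, attrs)
values ('widget', 'color => red, weight_g => 150');

-- find products that define a "color" attribute and read it back
select name, attrs -> 'color' as color
from product
where attrs ? 'color';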
Assuming that pVariable is more of a pVariable type, drop the reference to product_fk; otherwise you would need a new entry in that table for every Product record. Maybe try something like this:
Product(id, active, allow_new)
pVariable_type(id, name)
pVariable_data(id, product_fk, pvariable_fk, non_typed_value, bool, int, etc)
I would use non_typed_value as your text value and (unless you are keeping streams) write a record into that field along with the typed value. It means storing each value twice (and is more of a pain on updates, etc.), but it makes querying easier, along with reporting (anything where you just need to display the value).
Note: it would also be ideal to pull anything that is common to all products into the product table. For example, all products will most likely have a name, a suggested price, etc.
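A rough DDL sketch of the suggestion above (the column types are my assumptions):

create table product (
    id        integer primary key,
    name      text not null,     -- common to all products
    active    boolean not null,
    allow_new boolean not null
);

create table pVariable_type (
    id   integer primary key,
    name text not null
);

create table pVariable_data (
    id              integer primary key,
    product_fk      integer not null references product(id),
    pvariable_fk    integer not null references pVariable_type(id),
    non_typed_value text,        -- always written, for easy display/reporting
    bool_value      boolean,     -- plus exactly one typed column per row
    int_value       integer
);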

Invoicing database design

I created an application a few days ago that deals with invoicing. I would like to know how best to integrate a discount into my invoices. Should I put it as a negative item (in the invoice_items table), or should I create a "discount" column in the invoice table?
I would have it as a negative-valued item (see the sketch after this list). The reasons are:
With invoicing, it's very important that the calculated value remains constant forever; even if your calculation formula later changes, you can correctly reproduce any given invoice. This is true even if the value was incorrectly calculated at the time - it was what it was.
Having a value amount means that manual adjustments for exceptional circumstances are easily handled - e.g., your marketing manager/accountant may decide to give a one-off discount of $100 because of a late delivery. This is trivial with negative values - just add another row - but a difficult hassle with discount rates.
You can have multiple discount amounts per invoice
It's totally flexible - it has its own space to exist and be whatever it needs to be. In fact, I would make the discount another "product" (maybe even multiple products - one for each distinct discount reason, e.g. Xmas, coupon, referral, etc.).
With its own item, you can add a reason description just like any other "product" - eg "10% discount for paying cash" or whatever
You don't need any special code or database columns! Just total items up as before and print them on the invoice. "There is no spoon (discount)": It's just another line item - what could be more simple than no code/db changes required?
Not all items should be discounted - eg refunds, returns, subscriptions (if applicable). It becomes too complicated and it's unnecessary to represent the business logic of discounts in the database. Leave the calculation etc in the app code, store the result in the db
Having its own item means the calculation can be arbitrarily complex. This means no db maintenance as the complexity grows. It's a whole lot easier to maintain/alter code than it is to maintain/alter a database
Finally: I successfully built an invoicing system taking the "item" approach, and it worked really well.
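A minimal sketch of the negative-item approach (invoice_items comes from the question; the column names and values are assumptions):

-- the discount is just another line on the invoice
insert into invoice_items (invoice_id, description, quantity, unit_price)
values (1001, 'Widget', 2, 25.00),
       (1001, '10% discount for paying cash', 1, -5.00);

-- totalling needs no special-case code for discounts
select invoice_id, sum(quantity * unit_price) as total
from invoice_items
where invoice_id = 1001
group by invoice_id;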
What consequences would either of those choices have for you down the road? For example, would you like to have multiple discounts, or very specific discounts, later on? If there will only ever be one discount per invoice, then I wouldn't make it any more complicated than it needs to be. In my opinion it's easier and clearer to have it in the invoice table - having it as a negative item will make the processing of items more difficult, I think.
I fully agree with making it as simple as possible, but one thing to consider is whether any item should be exempt from the discount. In that case you need to add a bool field in the details to remember which lines should be discounted.

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time-varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time-varying attribute is changed.
2) I could split the time-varying attributes out into one or more tables. If I go this route, do I put all three in one table, or create two tables (one for RelationshipType and RelationshipStatus, and one for ClientType)? Creating multiple tables for time-varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
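A hedged sketch of that single-table design (the column types, half-open date ranges, and query are my assumptions, built from the columns named above):

create table client (
    client_id           integer not null,
    name                text,
    client_type         text,
    relationship_type   text,
    relationship_status text,
    valid_dt_start      date not null,  -- valid time: when the facts were true
    valid_dt_end        date not null,
    audit_dt_start      date not null,  -- transaction time: when we recorded them
    audit_dt_end        date not null
);

-- "what did we believe on :as_of about the client's state on :effective?"
select *
from client
where client_id = :id
  and :effective >= valid_dt_start and :effective < valid_dt_end
  and :as_of >= audit_dt_start and :as_of < audit_dt_end;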
