Currently i'm storing two day's data(200M rows with 5 columns in each) in an RDBMS (mssql)- yesterday's and today's tables, so i keep removing older tables which are no longer useful. I always read and update data in yesterday's table and insert data in today's table.
Whenever i read some rows from yesterday's table, i increment a status column in the table for those rows by 1, so that i'd read those rows again only after i've read all the other rows which match the criteria (some criteria based on other columns).
I want to move to a noSql db for my use case. Please suggest which ones might be worth looking at.
It is not very clear why you would want to move to a NoSQL database - I am not suggesting you shouldn't.
You could take a look at Oracle NoSQL Database, it will allow you to map your current model to its Table API, also supports Time-to-live that purges data based on elapsed time (It is not clear if you purge tables only after some time window is elapsed). It supports JSON Document, and Key-Value API as well.
Related
I have time series data in a relational database (postgres). Data import to the database every 5 minutes, but imput get overwritten during the day, meaning at the end of the day there is only 1 record for that day for specific id (id and date-> composite PKs).
current process is like this ->Data comes in and is evaluated the same way 1:1. (data comes in every table as they are in source, there is many redundancy.
3 problems:
currently performance of getting data out of database(reading) is fast (good performance)
frontend get query from this database and show data. result of the query is very fast. if I do normalization then getting the query become slower, but writing and updating become easier.
how can I optimize this database?
missing data (ignore this problem )
if we are able to store more records daily (history of one ID in different points of time everyday) then we can show comparison of two points in time in a day. does database support huge amoount of data every day?
DWH
source is just one, all data come from one source. can we have DWH for it or since source is only one, there is no need for it?
Edit:
How can I optimise this database?
currently there is only one Schema in a database. Data comes in and is evaluated the same way 1:1. writng is hard since we have redundany.
my solution:
I want to create 3 schemas for this database.
1 schema, for inserting data into tables, tables structure is base on data source. ( I assume data remains here temporary, and will be transfer in second schema)
2 schema, incoming data stored, and data is structured in 3NF.
3 Schema, denormlising data again because we need to get fast query (fast reading is required).
Your three schema model is exactly how this has been done for many years.
Schema 1:
Names: Staging/Landing/Ingestion
Schema matches the source system but it is cleared and reloaded for every load batch. Typically has a "looser" schema definition to allow for import and capture of bad data
Schema 2:
Names: Replica/ODS/Persisted data store
Schema 2 is never cleared, it's permanent. Following a data load, this layer should look exactly like your source systems. Data in schema 1 is "merged" into schema 2 each time. For example on a daily load cycle, Schema 1 just contains that days data but schema 2 contains the entire history of data loaded. Reference data is merged on a known primary key. Transactional data might be merged on a key or it might be merged on a "windowing" basis - i.e. delete the last days data from schema 2 and load schema 1 in
Some people like to have a "point in time view" where they can recreate what the source system looks like a historical point in time. I've never seen anyone use that though.
Schema 3:
Names: Business Layer/Star Schema/Reporting Layer/Datamart/Sematic Layer
Layer 2, which is usually a replica of an OLTP data model (OLTP is optimised for entering data). This is transformed into a data model that is optimised for reporting.
The tried and tested data model here is a star schema. It's been around for decades. If you research any reporting tool (i.e. Power BI), thay all say that the preferred data model to report from is a star schema. Yes a star schema is denormalised and has other benefits beyonf perforamnce, for example it is more easily understood by a business user, supports slowly changing dimensions etc.
All these concepts are explained further online but of you have any specific questions happy to expand further
I need your help in designing one table.
I have some groups tables and we need to load data in that group tables from xml files that contain column names and data.
The column name is actually index of some main column like activity_col1, activity_col2 and so on and not fixed every time, there is possibility that same table file contains 1000 columns sometimes and 10 column values some time also there is maximum limit is also defined so no file will contain more than
2000 column per group.
So I need to design a table that is the best possible solution for this also I need to do the aggregation of column values. The files contain min level data and I need to store this data in min table and after that this min data need to be aggregated in an hour, day, week and month.
If I create max columns in all tables but data will not come every time in all columns so this design seems not good because most of the values will be null.
If I insert column name as rows in column_name column and values against each column value in values column then aggregation will be a tedious task for me
and it will impact performance.
Please suggest.
One option would be EAV, but it's more complicated to build, to query and to insert, and readability is very low.
You require a schema-less design, Allowing an unlimited number of columns,
your best bet is probably to use a NoSQL solution. Even though the weaknesses of EAV relative to relational databases also apply to NoSQL alternatives.
Also take a look at here :
Benefits of NoSQL
Recommendations (as priority):
Choice EAV, If you are using a relational-database and this is
where you turn either the whole table or a portion (in another
table) on its side. This is a good choice if you already have a
relational-database in-house that you can't move away from easily.
Choice NoSQL, If does not matter kind of DBMS for you It is very
flexible and fast and not all of the report writers out there
support this style of storage. There are many example database
implementations of NoSQL. The one that seems to be most popular
right now, is MongoDB.
and the last option that I don't recommend you to use it:
Choice Standard tables with XML columns, If the you don't need to
query them, and you just want to be stored and retrieved as plain
text for using some extra usage.
I hope to be helpful for you:)
Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen of changes made for each record.
Right now we need to add a following functionality to our system: every time user makes a change to the record of one of our tables, the system needs to keep track of it - we need to have complete history of changes and also be able to restore row data to selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and also we keep the track from which table the data originates. But the original data row is being stored in large text column in JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add new tracked table,
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using laravel 5.3 and mysql database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere differently, even possibly a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage of the backups.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just backup the MySQL database on a regular basis and store the backups elsewhere?
I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never change. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The
structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, both sets of data in their own table, using a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds high for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy of a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melts into my next, but similar, issue, audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you gonna use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the data rows exceed more than a few 100 millions per table. But with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place as per the queries you would issue to fetch that data. Whats ever you have in the where clause should have an appropriate index in db for it.
If you row counts per table exceed many millions you should look at partitioning of tables DBs store data in filesystems as files actually so partitioning would help in making smaller groups of data files based on some predicates e.g: date or some unique column type. You would see it as a single table but on the file system the DB would store the data in different file groups.
Then you can also try table sharding. Which actually is what you mentioned....different tables based on some predicate like date.
Hope this helps.
You are over thinking this. 300k rows is not significant. Just about any relational database or NoSQL database will not have any problems.
Your design sounds fine, however, I highly advise that you utilize the facility of the database to add a primary key for each row, using whatever facility is available to you. Typically this involves using AUTO_INCREMENT or a Sequence, depending on the database. If you used a nosql like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
I have a database that increases every month. The schema remains the same, so I think I use one of these two methods:
Use only one table, new data will be appended to this table, and will be identified by a date column. The increasing data every month is about 20,000 rows, but in long term, I think this should be problem to search and analyze this data
create dynamically one table per month, the table name will indicate which data it contains (for example, Usage-20101125), this will force us to use dynamic SQL, but in long term, it seems fine.
I must confess that I have no experiences about designing this kind of database. Which one should I use in real world?
Thank you so much
20 000 rows per month is not a lot. Go with your first option. You didn't mention which database you'll be using, but SQL Server, Oracle, Sybase and PostgreSQL, to name just a few, can handle millions of rows comfortably.
You will need to investigate a proper maintenance plan, including indexing and statistics, but that will come with lots of reading and experience.
Look into partitioning your table.
That way you can physically store the data on different disks for performance while logically it would be one table so your database stays well designed.