I'm designing a database for tracking stock transactions and real-time portfolio holdings. The system should be able to pull historical positions for a custom time period (e.g. end-of-day holdings). My current design includes a transaction table and a real-time position table; when each transaction is booked into the database, it automatically triggers an update of the real-time position table. I'm planning to use PostgreSQL and the TimescaleDB extension for the transaction table.
I'm somewhat confused about how to implement historical holdings, since the holdings at a certain timestamp t can be derived by aggregating all transactions with timestamp <= t. Should I use a separate table to record the historical holdings, or simply aggregate on demand? I am also considering using binary files to store snapshots of the real-time positions at the end of each day to support historical position look-ups.
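For example, I think the holdings at timestamp t could be derived with something like this (the table and column names are just placeholders for illustration):

    -- Hypothetical schema: transactions(symbol, quantity, trade_time),
    -- with buys as positive quantities and sells as negative
    SELECT symbol, SUM(quantity) AS position
    FROM   transactions
    WHERE  trade_time <= :t
    GROUP  BY symbol
    HAVING SUM(quantity) <> 0;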
I have little experience with database design, so any advice/help is appreciated.
This question is lacking detail, so my answer is general.
One thing you could do is have two tables: one for the detailed data and one for the aggregation. Then you calculate one record for the latter from the former every day. Use partitioning for the detail table to be able to get rid of old data easily.
You can also use the same (partitioned) table for both, if the data structure allows it. Then you calculate a new aggregated record every day, drop the partition for that day and extend the partition boundary for the “aggregated” partition.
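As a rough sketch of the first (two-table) variant in PostgreSQL, with made-up table and column names, the daily roll-up could look like this:

    -- Detail table, partitioned by day so old partitions can be dropped cheaply
    -- (the individual daily partitions are created separately)
    CREATE TABLE transactions (
        symbol     text        NOT NULL,
        quantity   numeric     NOT NULL,
        trade_time timestamptz NOT NULL
    ) PARTITION BY RANGE (trade_time);

    -- One aggregated record per symbol per day (end-of-day holdings)
    CREATE TABLE daily_positions (
        as_of_date date    NOT NULL,
        symbol     text    NOT NULL,
        position   numeric NOT NULL,
        PRIMARY KEY (as_of_date, symbol)
    );

    -- Run once per day (e.g. from a scheduled job): holdings as of yesterday's close
    INSERT INTO daily_positions (as_of_date, symbol, position)
    SELECT current_date - 1, symbol, SUM(quantity)
    FROM   transactions
    WHERE  trade_time < current_date
    GROUP  BY symbol;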
Consider carefully if you need the TimescaleDB extension. If it offers benefits you know you need, go for it. If not, do without it. It is always nice to have few dependencies. Just because you are storing time series data doesn't mean you need TimescaleDB.
I use SQL Server as my database, but I want to know in general, regardless of the database I use.
I had a scenario on an auction website where I wanted to always have the highest price a user has entered. I added an attribute named "Highest Price" to the item entity (the item being sold) and, whenever a user enters a higher price, it gets updated.
So my question is: is it faster to do this, or to have a bids table and search it every time I insert a new record?
My direct answer to your question (as you put it in the title) would definitely be searching, as writing takes longer than reading.
So, here are your two options for your example (that is, if you want to actually identify the user who made the highest bid, which I guess you do):
Use the "bids" table.
Advantages: Allows you to keep history of bids, adds coherence (totally makes sense linking users and items in a table).
Disadvantages: Slower, since finding the maximum price requires searching with MAX(); you also have to write every entry, regardless of its price.
Add two attributes to the table that contains the items, referring to the highest price and the user ID that put that price.
Advantages: Faster, you select directly one row and you only write given certain conditions.
Disadvantages: Doesn't look too coherent linking the user to an item, rather than to a bid.
Common things to both options: You write data to two columns, although with different intensity.
If you absolutely need the highest speed possible, go for the second one, but, especially when you're engineering software, you must also take coherence into account. Both options are sketched below.
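Roughly, in SQL Server syntax (table and column names here are illustrative):

    -- Option 1: a bids table; look up the highest bid (and who made it) when needed
    SELECT TOP (1) UserId, Price
    FROM   Bids
    WHERE  ItemId = @ItemId
    ORDER  BY Price DESC;

    -- Option 2: denormalized columns on the item row, written only when the new bid is higher
    UPDATE Items
    SET    HighestPrice    = @Price,
           HighestBidderId = @UserId
    WHERE  ItemId = @ItemId
      AND  (HighestPrice IS NULL OR HighestPrice < @Price);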
We have a feature request where a given user's choice of a certain entity has to be recorded; this is an integer that is incremented every time the user chooses that specific entity. The information is later used to order the entities whenever the user fetches all of their associated entities.
Assuming that we are limited to SQL Server as our persistent storage, and that we could have millions of users, each with anywhere between 1 and 10 of those entities, performance could quickly become an issue.
One option is an append-only log table where the user id and entity id are saved every time a user chooses an entity; on the query side we fetch all the records and group by entity id. This could quickly cause the table to grow very large.
The other is a table with three columns (user id, entity id, count), where we increment the count every time the user chooses the entity.
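To make the two options concrete, here are rough table shapes (names are just for illustration):

    -- Option 1: append-only log, aggregated at query time
    CREATE TABLE UserEntityChoiceLog (
        UserId   BIGINT    NOT NULL,
        EntityId BIGINT    NOT NULL,
        ChosenAt DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    SELECT EntityId, COUNT(*) AS ChoiceCount
    FROM   UserEntityChoiceLog
    WHERE  UserId = @UserId
    GROUP  BY EntityId
    ORDER  BY ChoiceCount DESC;

    -- Option 2: one counter row per (user, entity), incremented on every choice
    CREATE TABLE UserEntityChoiceCount (
        UserId      BIGINT NOT NULL,
        EntityId    BIGINT NOT NULL,
        ChoiceCount INT    NOT NULL,
        PRIMARY KEY (UserId, EntityId)
    );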
The above two options are what I have been able to come up with.
I would like to know if there are any other options and also what would be the performance implications of the above solutions.
Thanks
If this were 2001, performance could be an issue, but if you are using SQL Server 2012 or 2016 I'd say either option is fine. Ints or bigints index well, and the performance hit would be insignificant.
You could also store the data in an XML or JSON varchar field, but I would go with your first option.
Use bigints, and make sure you use indexes no matter what you do.
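As a minimal sketch of the counter-style increment and the index, reusing the illustrative table names sketched in the question (adjust to your real schema):

    -- Upsert-style increment for the (user, entity) counter
    MERGE UserEntityChoiceCount AS tgt
    USING (SELECT @UserId AS UserId, @EntityId AS EntityId) AS src
          ON tgt.UserId = src.UserId AND tgt.EntityId = src.EntityId
    WHEN MATCHED THEN
        UPDATE SET ChoiceCount = tgt.ChoiceCount + 1
    WHEN NOT MATCHED THEN
        INSERT (UserId, EntityId, ChoiceCount) VALUES (src.UserId, src.EntityId, 1);

    -- For the append-only log, an index on (UserId, EntityId) keeps the GROUP BY cheap
    CREATE INDEX IX_UserEntityChoiceLog_User_Entity
        ON UserEntityChoiceLog (UserId, EntityId);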
We are moving a logging table from DB2 to Oracle; in it we log exceptions and warnings from many applications. With the move I want to accomplish two main things, among others: less space consumption (so we can add more rows, because our tablespaces are kinda small) while not increasing server processing too much (CPU usage increases our bill).
In DB2 we basically have a table that holds text strings.
In Oracle I am taking the approach of normalizing the tables for columns with duplicated data (event_type, machine, assemblies, versno). I have a procedure that receives multiple parameters and I query the reference tables to get the IDs.
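In outline, the normalization looks something like this (simplified; the types and the extra columns are placeholders, not my actual definitions):

    -- One reference table per duplicated column (event_type shown; machine,
    -- assemblies and versno follow the same pattern)
    CREATE TABLE event_type (
        event_type_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        description   VARCHAR2(200) NOT NULL UNIQUE
    );

    -- The log table stores only the IDs plus the free-text message
    CREATE TABLE app_log (
        log_id        NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        log_time      TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL,
        event_type_id NUMBER NOT NULL REFERENCES event_type,
        machine_id    NUMBER NOT NULL,
        message       VARCHAR2(4000)
    );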
This is the Oracle table description.
One piece of feedback I have so far from a co-worker is that I will not necessarily reduce table space, since the indexes take space too, and my solution might end up using more than storing the full strings would. We don't know if this is true; does anyone have more information on this?
Am I taking the right approach?
Will this approach help me accomplish my 2 main goals?
Additional feedback is welcome and appreciated.
The approach of using surrogate keys (numeric IDs) and dimension tables (containing the ID key and the description) is popular in both OLTP and data warehouses. IMO its use for logging is a bit strange.
The problem is that the logging component should not make many assumptions about the data to be logged; it is vital to be able to log the exceptional cases.
So if you can't map the host name to an ID (it was misspelled or simply not configured), it is not much help to log unknownhost, or to suppress the logging entirely.
Regarding storage, you can indeed save a lot by storing IDs instead of long strings, but (depending on the data) you may get a similar effect using table compression.
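If you do keep the lookup tables, one way to avoid losing the exceptional cases is a get-or-create lookup, so an unknown value is added on the fly instead of being rejected. A rough PL/SQL sketch with made-up table and column names:

    -- Returns the ID for a host name, inserting it first if it is not known yet.
    -- Assumes host(host_id identity, host_name unique); ignores the race where
    -- two sessions insert the same name at once.
    CREATE OR REPLACE FUNCTION get_host_id(p_host_name IN VARCHAR2)
        RETURN NUMBER
    IS
        v_id NUMBER;
    BEGIN
        SELECT host_id INTO v_id FROM host WHERE host_name = p_host_name;
        RETURN v_id;
    EXCEPTION
        WHEN NO_DATA_FOUND THEN
            INSERT INTO host (host_name) VALUES (p_host_name)
            RETURNING host_id INTO v_id;
            RETURN v_id;
    END;
    /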
I think the most important thing about logging is that it works all the time. It must be bullet-proof, because if the logger fails you lose the information you need to diagnose problems in your system. Even worse would be to fail business transactions because the logger abended.
There is a strong inverse relationship between the complexity of a logging implementation and its reliability. The more moving parts it has, the more things there are to go wrong. So, while normalization is a good thing for business data it introduces unwelcome risk. Also, look-ups increase the overhead of writing a log message and that is also undesirable.
" tablespaces are kinda small"
Increasing the chance of failure is not a good trade-off. Ask for more space.
We are looking to create software that receives log files from a large number of devices. We are looking at around 20 million log rows a day (roughly 2 KB per log line).
I have developed a lot of software, but never with this large a quantity of input data. The data needs to be searchable, sortable, and groupable by source IP, destination IP, alert level, etc.
It should also combine similar log entries ("occurred 6 times", etc.).
Any ideas and suggestions on what type of design, database and general thinking around this would be much appreciated.
UPDATE:
Found this presentation, which seems like a similar scenario. Any thoughts on it?
http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density
I see a couple of things you may want to consider.
1) A message queue, to drop log lines onto and let another part of the system (a worker) take care of them when time permits.
2) NoSQL: Redis, MongoDB, Cassandra.
I think your real problem will be in querying the data, not in storing it.
You will also probably need a scalable solution; some NoSQL databases are distributed, which may help there.
Check this out; it might be helpful:
https://github.com/facebook/scribe
A web search on "Stackoverflow logging device data" yielded dozens of hits.
Here is one of them. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.
I'd base many decisions on how users most often will be selecting subsets of data -- by device? by date? by sourceIP? You want to keep indexes to a minimum and use only those you need to get the job done.
For low-cardinality columns where indexing overhead is high yet the value of using an index is low, e.g. alert-level, I'd recommend a trigger to create rows in another table to identify rows corresponding to emergency situations (e.g. where alert-level > x) so that alert-level itself would not have to be indexed, and yet you could rapidly find all high-alert-level rows.
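Since no specific database is named, here is a rough sketch of that trigger idea in PostgreSQL syntax (table and column names are guesses):

    -- Small side table holding only the emergency rows,
    -- so alert_level itself never needs an index
    CREATE TABLE high_alert_log (
        log_id      bigint      NOT NULL,
        logged_at   timestamptz NOT NULL,
        alert_level int         NOT NULL
    );

    CREATE FUNCTION flag_high_alert() RETURNS trigger AS $$
    BEGIN
        INSERT INTO high_alert_log (log_id, logged_at, alert_level)
        VALUES (NEW.log_id, NEW.logged_at, NEW.alert_level);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_flag_high_alert
        AFTER INSERT ON device_log
        FOR EACH ROW
        WHEN (NEW.alert_level > 5)   -- the "x" from the text; the threshold is arbitrary here
        EXECUTE FUNCTION flag_high_alert();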
Since users are updating the logs, you could move handled/managed rows older than 'x' days out of the active log and into an archive log, which would improve performance for ad-hoc queries.
For identifying recurrent problems (same problem on same device, or same problem on same ip address, same problem on all devices made by the same manufacturer, or from the same manufacturing run, for example) you could identify the subset of columns that define the particular kind of problem and then create (in a trigger) a hash of the values in those columns. Thus, all problems of the same kind would have the same hash value. You could have multiple columns like this -- it would depend on your definition of "similar problem" and how many different problem-kinds you wanted to track, and on the subset of columns you'd need to enlist to define each kind of problem. If you index the hash-value column, your users would be able to very quickly answer the question, "Are we seeing this kind of problem frequently?" They'd look at the current row, grab its hash-value, and then search the database for other rows with that hash value.
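A sketch of the hashing idea, again in PostgreSQL syntax; which columns define a "kind of problem" is up to you (device_id and error_code are stand-ins here):

    -- Hash over the columns that define this kind of problem
    ALTER TABLE device_log ADD COLUMN problem_hash text;

    CREATE FUNCTION set_problem_hash() RETURNS trigger AS $$
    BEGIN
        NEW.problem_hash := md5(coalesce(NEW.device_id::text, '') || '|' ||
                                coalesce(NEW.error_code::text, ''));
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_set_problem_hash
        BEFORE INSERT ON device_log
        FOR EACH ROW
        EXECUTE FUNCTION set_problem_hash();

    CREATE INDEX idx_device_log_problem_hash ON device_log (problem_hash);

    -- "Are we seeing this kind of problem frequently?"
    SELECT count(*) FROM device_log WHERE problem_hash = :hash_of_current_row;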
In an application I'm designing, the user has to be able to specify rather complex event scheduling (continuous time-block vs. daily time-blocks, exception date/times, recurrence patterns etc.)
Does anyone know of a good design page for such a thing online? For example, I was highly impressed with this page's description of how to do database audit trails, and would love something similar.
Current thinking
My database would contain the following tables: Events and ScheduleItems.
The relationship would be Events {1 -- 0..*} ScheduleItems
Events would have the following columns: eventId, schedulePattern
ScheduleItems would have the following columns: eventId, startDateTime, endDateTime
The front-end controls would allow the user to specify general rules (daily/continuous, include/exclude weekends, start/end date/time, etc.). If they are not satisfied with the existing controls, they could then opt to display and manually tweak the generated "time-blocks".
On saving the event schedule...
If only the provided controls were used (no tweaking) I would save their selections as a pattern in the Events table (i.e. "sd:2010-04-28;st:09:20:00;ed:2010-05-12;et:17:20:00;r:2w[M-Th];z:EST" etc.)
If the user manually tweaked the generated time-blocks, I would save each individual time-block in the ScheduleItems table and give Events.schedulePattern a special code ("MANUAL" or something).
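In rough DDL form (generic SQL; the types are placeholders), what I'm picturing:

    CREATE TABLE Events (
        eventId         INT PRIMARY KEY,
        -- e.g. 'sd:2010-04-28;st:09:20:00;ed:2010-05-12;et:17:20:00;r:2w[M-Th];z:EST'
        -- or the special value 'MANUAL'
        schedulePattern VARCHAR(255) NOT NULL
    );

    CREATE TABLE ScheduleItems (
        eventId       INT NOT NULL REFERENCES Events (eventId),
        startDateTime TIMESTAMP NOT NULL,
        endDateTime   TIMESTAMP NOT NULL
    );

    -- Pattern-only event: one Events row, no ScheduleItems rows.
    -- Manually tweaked event: schedulePattern = 'MANUAL' plus one ScheduleItems row per time-block.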
Pros
I should be able to save > 90% of the events through the pattern field directly, and be able to handle any other corner cases through the "brute-force" ScheduleItems table. Since some of the handled cases include events that can go on for months (which would otherwise result in a very large number of time-blocks), having it in one line is rather attractive.
Cons
This is a fairly complex solution; any other system requiring this data would need to be able to handle parsing the schedulePattern as well as knowing when to fetch ScheduleItems.
The "schedulePattern", in a mature way, uses Cron format to store the users' schedule is a good idea for running task.
This format is simple but sophisticated. In relational database, there would be some performance benefit if you seperate every "entry" of Cron to a column of table with suitable indexes.
The effort, however, is the translation between this format and user interface. And the raw data to which user input should be recorded originally.
I would design two kinds of table, one for raw data inputted by user, another for running task for schedule.
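As a sketch of that split (generic SQL; names are illustrative): one table keeps what the user typed, the other holds one indexed column per Cron field.

    -- Raw schedule exactly as entered by the user
    CREATE TABLE user_schedule_input (
        schedule_id INT PRIMARY KEY,
        event_id    INT NOT NULL,
        raw_input   VARCHAR(500) NOT NULL
    );

    -- One Cron "entry" per column, so indexes can narrow down due tasks quickly
    CREATE TABLE scheduled_task (
        task_id     INT PRIMARY KEY,
        schedule_id INT NOT NULL REFERENCES user_schedule_input (schedule_id),
        cron_minute VARCHAR(20) NOT NULL,   -- e.g. '0' or '*/15'
        cron_hour   VARCHAR(20) NOT NULL,
        cron_dom    VARCHAR(20) NOT NULL,   -- day of month
        cron_month  VARCHAR(20) NOT NULL,
        cron_dow    VARCHAR(20) NOT NULL    -- day of week
    );

    CREATE INDEX idx_scheduled_task_hour_dow ON scheduled_task (cron_hour, cron_dow);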