Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 5 years ago.
We are moving a logging table from DB2 to Oracle; in it we log exceptions and warnings from many applications. With the move I want to accomplish two main things, among others: lower space consumption (so we can store more rows, because the tablespaces are kinda small) without increasing server processing too much (CPU usage increases our bill).
In DB2 we basically have a table that holds text strings.
In Oracle I am taking the approach of normalizing the tables for columns with duplicated data (event_type, machine, assemblies, versno). I have a procedure that receives multiple parameters and I query the reference tables to get the IDs.
This is the Oracle table description.
One piece of feedback I have so far from a co-worker is that I will not necessarily reduce table space, since indexes take space and my solution might end up using more than saving all the strings does. We don't know if this is true; does anyone have more information on this?
Am I taking the right approach?
Will this approach help me accomplish my 2 main goals?
Additional feedback is welcome and appreciated.
The approach of using surrogate keys (numerical IDs) and dimension tables (containing the ID key and the description) is popular in both OLTP and data warehouses. IMO its use for logging is a bit strange.
The problem is that the logging component should not make many assumptions about the data to be logged - it is vital to be able to log the exceptional cases.
So in case you can't map the host name to an ID (it was misspelled or simply not configured), it would not be very helpful to log unknownhost or to suppress the logging altogether.
Concerning storage, you can indeed save a lot by storing IDs instead of long strings, but (depending on the data) you may get a similar effect using table compression.
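One way to keep a normalized log from rejecting unexpected values is a get-or-create lookup: an unknown machine name gets a new ID instead of failing. This is only a minimal sketch, using Python's sqlite3 as a stand-in for Oracle; the table names (`machine`, `log_entry`) and the helpers `get_or_create_id` and `log_message` are hypothetical, not the asker's schema or procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machine (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE log_entry (
        id         INTEGER PRIMARY KEY,
        machine_id INTEGER NOT NULL REFERENCES machine(id),
        message    TEXT NOT NULL
    );
""")

def get_or_create_id(conn, name):
    """Return the ID for a machine name, inserting the name if it is new."""
    conn.execute("INSERT OR IGNORE INTO machine (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM machine WHERE name = ?", (name,)).fetchone()
    return row[0]

def log_message(conn, machine_name, message):
    """Write one log row; unknown machine names are added, never rejected."""
    machine_id = get_or_create_id(conn, machine_name)
    conn.execute(
        "INSERT INTO log_entry (machine_id, message) VALUES (?, ?)",
        (machine_id, message),
    )

log_message(conn, "app-server-01", "disk almost full")
log_message(conn, "app-sever-01", "misspelled host still gets logged")
log_message(conn, "app-server-01", "second message, same machine")
```

Note that every log write still pays for the lookup, so this reduces the risk of losing messages but not the extra overhead.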
I think the most important thing about logging is that it works all the time. It must be bullet-proof, because if the logger fails you lose the information you need to diagnose problems in your system. Even worse would be to fail business transactions because the logger abended.
There is a strong inverse relationship between the complexity of a logging implementation and its reliability. The more moving parts it has, the more things there are to go wrong. So, while normalization is a good thing for business data it introduces unwelcome risk. Also, look-ups increase the overhead of writing a log message and that is also undesirable.
" tablespaces are kinda small"
Increasing the chance of failure is not a good trade-off. Ask for more space.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 12 months ago.
The project I work on has blog articles, and I guess their text field should be limited somehow (it will probably become a JSON field one day).
There is no limit in terms of the domain: a user can write as much as they want. But some limit seems necessary, just to protect the DB from attacks that submit unusually huge amounts of text.
According to SO Q&As:
PostgreSQL limits a text field to 1 GB (see "Is there a maximum length when storing into PostgreSQL TEXT")
HTTP POST limits depend on the browser (2 GB to 4 GB): https://serverfault.com/questions/151090/
Reportedly, Nginx's default client_max_body_size is 1 MB
So, how to deal with all of that?
Is there perhaps some common practice like:
"Just limit it to a million characters at the app level and don't worry"?
This is an interesting question. If I understand correctly you are working on an application where the DB entry will essentially be a blog post (typically 500-1000 words based on most blogs I read).
You are storing the blog post as a text field in your database. You are quite reasonably worried about what happens with large blobs of data.
I fully advocate for you having a limit on the amount of data the user can enter. Without fully understanding the architecture of your system it's impossible to say what is the theoretical max size based on the technologies used.
However, it is better to look at this from a user perspective. What is the maximum reasonable amount of text for you to have to store? Then add a bit more, say 10%, because let's face it, users will do the unexpected. You can then add an error condition for when someone tries to enter more data.
My reason for proposing this approach is a simple one: once you have defined a maximum post size, you can use boundary value analysis (testing just either side of the limit) to prove that your product behaves correctly just below, at, and just above the limit. This way you will know the product behaviour and can explain it to the users etc.
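To make the boundary value analysis concrete, here is a minimal sketch of an app-level check with tests on either side of the limit. The limit value `MAX_POST_CHARS` and the names `validate_post`, `is_accepted`, and `PostTooLongError` are made up for illustration; pick whatever limit fits your users.

```python
MAX_POST_CHARS = 100_000  # hypothetical limit, not a recommendation

class PostTooLongError(ValueError):
    """Raised when a post body exceeds MAX_POST_CHARS."""

def validate_post(body: str) -> str:
    """Return the body unchanged, or raise if it is over the limit."""
    if len(body) > MAX_POST_CHARS:
        raise PostTooLongError(
            f"post is {len(body)} characters; the limit is {MAX_POST_CHARS}"
        )
    return body

def is_accepted(body: str) -> bool:
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        validate_post(body)
        return True
    except PostTooLongError:
        return False

# Boundary value analysis: just below, at, and just above the limit.
assert is_accepted("x" * (MAX_POST_CHARS - 1))
assert is_accepted("x" * MAX_POST_CHARS)
assert not is_accepted("x" * (MAX_POST_CHARS + 1))
```

The same three cases can then be repeated against each layer (web server, application, database) to confirm that the defined limit is the one that actually triggers first.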
If you choose to let the architecture define the limit then you will have undefined behaviour. You will need to analyze each component in turn to work out the maximum size they will accept and the behaviour they exhibit when that size is exceeded.
Typically (in my experience) developers don't put this effort in and let the users do the testing for them. This of course is generally worse because the user reports a strange error message and then the debugging is ultimately time consuming and expensive.
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
What could prevent SQL Server from treating a query filter optimally, meaning, from using an index efficiently to support the filter? What other query elements could also be affected in a similar manner and what can you do to get optimal treatment?
This is a very broad question - it would be good to have more details about your specific situation if you want help with something.
Therefore the answers here are only pointers to potential problems - you will need to research how to fix them in your own specific circumstances.
SQL-Server issues
Parameter sniffing and cached execution plans - occurs in stored procedures, where the first time you run it saves an execution plan that isn't great for the same procedure with other parameters.
Complex queries causing poor cardinality estimation - this doesn't often affect choice of indexes imo, but it can (especially if you have multiple indexes on the same table)
Human issues
In these, SQL Server doesn't use an index - but is correct in (not) doing so.
Index fields are in the wrong order, e.g., you have training records with an index on Student_ID then Unit_ID. You try to select * from trgrecords WHERE Unit_ID = 100. The index doesn't help you much here - it would still need to read all rows of the index to determine which rows match. This also compounds with the problem below.
Inappropriate belief that an index-seek/key-lookup pair is faster. When you're filtering by an index (that isn't a covering index) and it will return a decent proportion of rows, SQL Server will often read the whole table rather than looking up all the relevant rows one-by-one. And it's usually right. But you may think it's not right. (Also note - usually the problem is the other way - SQL Server uses an index-seek/key-lookup strategy when a full scan would be better).
SELECT * (e.g., in sub-queries). Because this is telling SQL Server to get all fields, it will always go to the clustered index at some point (even if you're only needing a few columns/fields) unless you have an index across all fields (not common!). This forces the issue above about index-seek/key-lookups.
Non-SARGable WHERE clauses etc - e.g., SELECT Name FROM Person WHERE Name LIKE 'Simon%' is SARGable, while SELECT Name FROM Person WHERE LEFT(Name, 5) = 'Simon' is not SARGable. By definition, non-SARGable queries require a full scan of the appropriate data.
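The index-order and SARGability points above can be tried out on a small scale. The sketch below uses SQLite through Python's sqlite3 module purely as a stand-in: SQL Server's optimizer and plan output are different, so treat this as an illustration of the principle, not of SQL Server's behaviour. The index names and the `plan` helper are made up; the range predicate is a hand-written equivalent of the `LIKE 'Simon%'` prefix search, and `substr` plays the role of `LEFT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE trgrecords (Student_ID INTEGER, Unit_ID INTEGER, Score INTEGER);
    CREATE INDEX ix_student_unit ON trgrecords (Student_ID, Unit_ID);
    CREATE TABLE Person (Name TEXT);
    CREATE INDEX ix_person_name ON Person (Name);
""")

def plan(sql):
    """Return the first step of SQLite's query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows[0][3]  # the fourth column holds the human-readable detail

# Filter on the leading index column vs. only the second column.
leading = plan("SELECT * FROM trgrecords WHERE Student_ID = 5")
trailing = plan("SELECT * FROM trgrecords WHERE Unit_ID = 100")

# Bare column compared to constants vs. column wrapped in a function.
sargable = plan("SELECT Name FROM Person WHERE Name >= 'Simon' AND Name < 'Simoo'")
wrapped = plan("SELECT Name FROM Person WHERE substr(Name, 1, 5) = 'Simon'")

for label, detail in [
    ("leading column", leading),
    ("trailing column", trailing),
    ("sargable range", sargable),
    ("function-wrapped", wrapped),
]:
    print(f"{label:17} {detail}")
```

In SQLite, the leading-column and range queries should be reported as a SEARCH on the index, while the other two fall back to a SCAN that reads every row of the table or index.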
I suggest watching these by Brent Ozar - I learned a lot from them.
How to think like an SQL Server engine
Identifying and fixing Parameter sniffing
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
I have recently been tasked with making an account system for all of our products, much like that of a Windows account across Microsoft's products. However, one of the requirements is that we are able to easily check for accounts with the same information across them so that we can detect traded accounts, sudden and possibly fraudulent changes of information, etc.
When thinking of a way to solve this problem, I thought we could reduce redundancy of our data while we're at it. I thought it might help us save some storage space and processing time, since in the end we're going to just be processing the data set into what I explain below.
A bit of background on how this is set up right now:
An account table just contains an id and a username
A profile table contains a reference to an account and references to separate pieces of profile data: names, mailing addresses, email addresses
A name table contains an id and the first, last, and middle name of an individual
An address table contains data about an address
An email address table contains an id and the mailbox and domain of an email address
A profile record is what relates the unique pieces of profile data (shared across many accounts) to a specific account. If there are fifty people named "John Smith", then there is only one "John Smith" record in the names table. If a user changes any piece of their information, the profile record is soft deleted and a new one is created. This is to facilitate change tracking.
After profiling, I have noticed that creating constraints like UNIQUE(FirstName, MiddleName, LastName) is pretty painful in terms of record insertion. Is that simply the price we're going to have to pay or is there a better approach?
Having records about 2 persons named John Smith is not a redundancy, but a necessity.
The approach you've suggested is far from being optimal. "The profile record is soft deleted and a new one is created. This is to facilitate change tracking." Deleting and re-inserting will cause problems with dependent records in other tables.
There are much easier methods to track changes - search for some 3rd party tools.
As for the tables created, it's not necessary to split the data into so many tables. Why don't you merge the Name and Account tables? Are both the Address and Email address tables necessary?
I have concluded my research and decided that this approach is fine if insert performance is not critical. In cases where it is critical, increasing data redundancy within reason is an acceptable trade off.
The solution described in my question is adequate for my performance needs. Storage is considered more expensive than insertion time, in our model.
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
We are looking to create software that receives log files from a large number of devices. We are looking at around 20 million rows a day (about 2 KB per log line).
I have developed a lot of software but never with this large quantity of input data. The data needs to be searchable, sortable, groupable by source IP, dest IP, alert level etc.
It should combine similar log entries (occurred 6 times, etc.).
Any ideas and suggestions on what type of design, database and general thinking around this would be much appreciated.
UPDATE:
Found this presentation, seems like a similar scenario, any thoughts on this?
http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density
I see a couple of things you may want to consider.
1) Message queue - to drop off a log line and let another part of the system (a worker) take care of it when time permits
2) NoSQL - Redis, MongoDB, Cassandra
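The message-queue idea from point 1 can be sketched with Python's standard library alone. This is a minimal in-process illustration, assuming a single worker thread; a real deployment would more likely use a separate queueing service, and the `stored` list is just a stand-in for whatever database is chosen. The names `submit` and `worker` are made up.

```python
import queue
import threading

log_queue = queue.Queue()
stored = []  # stand-in for the real database or NoSQL store

def worker():
    """Drain the queue and persist each log line until a None sentinel arrives."""
    while True:
        line = log_queue.get()
        if line is None:
            break
        stored.append(line)  # a real worker would batch-insert here

def submit(line):
    """Called by the receiver: enqueue the line and return immediately."""
    log_queue.put(line)

thread = threading.Thread(target=worker)
thread.start()

for i in range(5):
    submit(f"device-{i} alert_level=3")

log_queue.put(None)  # tell the worker to stop
thread.join()
```

The receiver only pays for the `put`, so a slow or briefly unavailable store delays persistence rather than blocking the devices that send logs.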
I think your real problem will be querying the data, not storing it.
You will also probably need a scalable solution.
Some NoSQL databases are distributed; you may need that.
Check this out, it might be helpful
https://github.com/facebook/scribe
A web search on "Stackoverflow logging device data" yielded dozens of hits.
Here is one of them. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.
I'd base many decisions on how users most often will be selecting subsets of data -- by device? by date? by sourceIP? You want to keep indexes to a minimum and use only those you need to get the job done.
For low-cardinality columns where indexing overhead is high yet the value of using an index is low, e.g. alert-level, I'd recommend a trigger to create rows in another table to identify rows corresponding to emergency situations (e.g. where alert-level > x) so that alert-level itself would not have to be indexed, and yet you could rapidly find all high-alert-level rows.
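A minimal sketch of that trigger idea, using SQLite through Python's sqlite3 as a stand-in for whichever database you pick. The table names (`device_log`, `high_alert`), the trigger name, and the threshold value are all assumptions for illustration.

```python
import sqlite3

ALERT_THRESHOLD = 7  # hypothetical cut-off for "emergency"

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
    CREATE TABLE device_log (
        id          INTEGER PRIMARY KEY,
        source_ip   TEXT,
        alert_level INTEGER,
        message     TEXT
    );
    CREATE TABLE high_alert (
        log_id INTEGER PRIMARY KEY REFERENCES device_log(id)
    );
    CREATE TRIGGER trg_high_alert
    AFTER INSERT ON device_log
    WHEN NEW.alert_level > {ALERT_THRESHOLD}
    BEGIN
        INSERT INTO high_alert (log_id) VALUES (NEW.id);
    END;
""")

sample_rows = [
    ("10.0.0.5", 3, "login ok"),
    ("10.0.0.6", 9, "power failure"),
    ("10.0.0.7", 8, "link down"),
    ("10.0.0.8", 7, "high temperature"),  # equal to the threshold, so not copied
]
conn.executemany(
    "INSERT INTO device_log (source_ip, alert_level, message) VALUES (?, ?, ?)",
    sample_rows,
)

# Find emergencies through the small side table instead of an index on alert_level.
emergencies = [
    row[0]
    for row in conn.execute(
        "SELECT d.message FROM high_alert h "
        "JOIN device_log d ON d.id = h.log_id ORDER BY d.id"
    )
]
```

The side table stays small because it only holds the rare high-alert rows, which is what makes it cheap to scan.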
Since users are updating the logs, you could move handled/managed rows older than 'x' days out of the active log and into an archive log, which would improve performance for ad-hoc queries.
For identifying recurrent problems (same problem on same device, or same problem on same ip address, same problem on all devices made by the same manufacturer, or from the same manufacturing run, for example) you could identify the subset of columns that define the particular kind of problem and then create (in a trigger) a hash of the values in those columns. Thus, all problems of the same kind would have the same hash value. You could have multiple columns like this -- it would depend on your definition of "similar problem" and how many different problem-kinds you wanted to track, and on the subset of columns you'd need to enlist to define each kind of problem. If you index the hash-value column, your users would be able to very quickly answer the question, "Are we seeing this kind of problem frequently?" They'd look at the current row, grab its hash-value, and then search the database for other rows with that hash value.
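A rough sketch of the hash-per-problem-kind idea. Note that it computes the hash in application code rather than in a database trigger as suggested above, simply to keep the example self-contained. The column names, the `problem_hash` helper, and the choice of SHA-256 are assumptions.

```python
import hashlib
from collections import Counter

def problem_hash(row, columns):
    """Hash the chosen columns so rows describing the same kind of problem match."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(row[col]) for col in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hypothetical definition of "same problem on the same device".
SAME_PROBLEM_ON_DEVICE = ("device_id", "error_code")

rows = [
    {"device_id": "fw-01", "error_code": "E42", "source_ip": "10.0.0.5"},
    {"device_id": "fw-01", "error_code": "E42", "source_ip": "10.0.0.9"},
    {"device_id": "fw-02", "error_code": "E42", "source_ip": "10.0.0.5"},
]

hashes = [problem_hash(row, SAME_PROBLEM_ON_DEVICE) for row in rows]

# How often has each kind of problem been seen?
counts = Counter(hashes)
```

Stored in an indexed column, the hash turns "have we seen this kind of problem before?" into a single equality lookup; a second definition of "similar" would get its own tuple of columns and its own hash column.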
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
I have a question about best practices regarding how one should approach storing complex workflow states for processing tasks in a database. I've been looking online to no avail, so I figured I'd ask the community what they thought was best.
This question comes out of the same "BoxItem" example I gave in a prior question. This "BoxItem" is being tracked in my system as various tasks are performed on it. The task may take place over several days and with human interaction, so the state of the BoxItem must be persisted. Who did the task (if applicable), and when the task was done must also be tracked.
At first, I approached this by adding three fields to the "BoxItems" table for every human-interactive task that needed to be done:
IsTaskNameComplete
DateTaskNameComplete
UserTaskNameComplete
This worked when the workflow was simple... but now that it has grown to a complex process (> 10 possible human interactions in the flow... about half of which are optional, and may or may not be done for the BoxItem, which resulted in me beginning to add "DoTaskName" fields as well for those optional tasks), I've found that what should've been a simple table now has 40 or so fields devoted entirely to the retaining of this state information.
I find myself asking if there isn't a better way to do it... but I'm at a loss.
My first thought was to make a generic "BoxItemTasks" table which defined the tasks that may be done on a given box, but I still would need to save the Date and User information individually, so it didn't really help.
My second thought was that perhaps it didn't matter, and I shouldn't worry if this table has 40 or more fields devoted to state retaining... and maybe I'm just being paranoid. But it feels like that's a lot of information to retain.
Anyways, I'm at a loss as far as what a third option might be, or if one of the two options above is actually reasonable. I can see this workflow potentially getting even more complex in the future, and for each new task I'm going to need to add 3-4 fields just to support the tracking of it... it feels like it's spiraling out of control.
What would you do in this situation?
I should note that this is maintenance of an existing system, one that was built without an ORM, so I can't just leave it up to the ORM to take care of it.
EDIT:
Kev, are you talking about doing something like this:
BoxItems
    (PK) BoxItemID
    (Other irrelevant stuff)

BoxItemActions
    (PK) BoxItemID
    (PK) BoxItemTaskID
    IsCompleted
    DateCompleted
    UserCompleted

BoxItemTasks
    (PK) TaskType
    Description (if even necessary)
Hmm... that would work... it would represent a need to change how I currently approach doing SQL Queries to see which items are in what state, but in the long term something like this looks like it would work better (without having to make a fundamental design change like the Serialization idea represents... though if I had the time, I'd like to do it that way I think.).
So is this what you were mentioning Kin, or am I off on it?
EDIT: Ah, I see your idea as well with the "Last Action" to determine the current state... I like it! I think that might work for me... I might have to change it up a little bit (because at some point tasks happen concurrently), but the idea seems like a good one!
EDIT FINAL: So in summation, if anyone else is looking this up in the future with the same question... it sounds like the serialization approach would be useful if your system has the information pre-loaded into some interface where it's queryable (i.e. not directly calling the database itself, as the ad-hoc system I'm working on does), but if you don't have that, the additional tables idea seems like it should work well! Thank you all for your responses!
If I'm understanding correctly, I would add the BoxItemTasks table (just an enumeration table, right?), then a BoxItemActions table with foreign keys to BoxItems and to BoxItemTasks for what type of task it is. If you want to make it so that a particular task can only be performed once on a particular box item, just make the (Items + Tasks) pair of columns be the primary key of BoxItemActions.
(You laid it out much better than I did, and kudos for correctly interpreting what I was saying. What you wrote is exactly what I was picturing.)
As for determining the current state, you could write a trigger on BoxItemActions that updates a single column BoxItems.LastAction. For concurrent actions, your trigger could just have special cases to decide which action takes precedence.
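Put together, the three tables, the composite primary key, and the LastAction trigger might look roughly like this. It is a sketch in SQLite via Python's sqlite3, not the asker's actual database; the trigger name, the `complete_task` helper, and the sample data are made up, and the special-casing for concurrent actions is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
    CREATE TABLE BoxItems (
        BoxItemID  INTEGER PRIMARY KEY,
        LastAction INTEGER
    );
    CREATE TABLE BoxItemTasks (
        BoxItemTaskID INTEGER PRIMARY KEY,
        Description   TEXT
    );
    CREATE TABLE BoxItemActions (
        BoxItemID     INTEGER NOT NULL REFERENCES BoxItems(BoxItemID),
        BoxItemTaskID INTEGER NOT NULL REFERENCES BoxItemTasks(BoxItemTaskID),
        DateCompleted TEXT,
        UserCompleted TEXT,
        PRIMARY KEY (BoxItemID, BoxItemTaskID)
    );
    CREATE TRIGGER trg_last_action
    AFTER INSERT ON BoxItemActions
    BEGIN
        UPDATE BoxItems
           SET LastAction = NEW.BoxItemTaskID
         WHERE BoxItemID = NEW.BoxItemID;
    END;
    INSERT INTO BoxItems (BoxItemID) VALUES (1);
    INSERT INTO BoxItemTasks VALUES (10, 'Inspect'), (20, 'Label');
""")

def complete_task(box_id, task_id, user, when):
    """Record that a user finished a task on a box item."""
    conn.execute(
        "INSERT INTO BoxItemActions VALUES (?, ?, ?, ?)",
        (box_id, task_id, when, user),
    )

complete_task(1, 10, "alice", "2024-01-05")
complete_task(1, 20, "bob", "2024-01-06")

# The composite primary key rejects a second completion of the same task.
try:
    complete_task(1, 10, "carol", "2024-01-07")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

last_action = conn.execute(
    "SELECT LastAction FROM BoxItems WHERE BoxItemID = 1"
).fetchone()[0]
```

Adding a new task type then means inserting a row into BoxItemTasks rather than adding three or four columns to BoxItems.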
As the previous answer suggested, I would break your table into several.
BoxItemActions, containing a list of actions that the work flow needs to go through, created each time a BoxItem is created. In this table, you can track the detailed dates \ times \ users of when each task was completed.
With this type of application, knowing where the Box is to go next can get quite tricky, so having a 'Map' of the remaining steps for the Box will prove quite helpful. As well, this table can grow like crazy, to hundreds of rows per box, and it will still be very easy to query.
It also makes it possible to have 'different paths' that can easily be changed. A master data table of 'paths' through the work flow is one solution, where as each box is created, the user has to select which 'path' the box will follow. Or you could set it up so that when the user creates the box, they select which tasks are required for this particular box. It depends on your business problem.
How about a hybrid of the serialization and the database models? Have an XML document that serves as your master workflow document, containing a node for each step with attributes and elements that detail its name, order in the process, conditions for whether it's optional or not, etc. Most importantly each step node can have a unique step id.
Then in your database you have a simple two table structure. The BoxItems table stores your basic BoxItem data. Then a BoxItemActions table much like in the solution you marked as the answer.
It's essentially similar to the solution accepted as the answer, but instead of a BoxItemTasks table to store the master list of tasks, you use an XML document that allows for some more flexibility for the actual workflow definition.
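As a rough illustration of the hybrid, here is what reading such a master workflow document could look like. The XML layout (a `workflow` root with `step` elements carrying `id`, `order`, and `optional` attributes) and the helper names are invented for this sketch; the completed step ids would come from the BoxItemActions table.

```python
import xml.etree.ElementTree as ET

# Hypothetical master workflow document.
WORKFLOW_XML = """
<workflow>
  <step id="inspect" order="1" optional="false">Inspect contents</step>
  <step id="photograph" order="2" optional="true">Photograph item</step>
  <step id="label" order="3" optional="false">Apply label</step>
</workflow>
"""

def load_steps(xml_text):
    """Parse the workflow document into a list of step dicts, sorted by order."""
    root = ET.fromstring(xml_text.strip())
    steps = [
        {
            "id": node.get("id"),
            "order": int(node.get("order")),
            "optional": node.get("optional") == "true",
            "name": node.text,
        }
        for node in root.findall("step")
    ]
    return sorted(steps, key=lambda step: step["order"])

def remaining_required(steps, completed_ids):
    """Return ids of mandatory steps that have not been completed yet."""
    return [
        step["id"]
        for step in steps
        if not step["optional"] and step["id"] not in completed_ids
    ]

steps = load_steps(WORKFLOW_XML)
todo = remaining_required(steps, completed_ids={"inspect"})
```

Changing the workflow then means editing the document, while the per-box history stays in ordinary queryable rows keyed by step id.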
For what it's worth, in BizTalk they "dehydrate" long-running message patterns (workflows and the like) by binary serializing them to the database.
I think I would serialize the Workflow object to XML and store in the database with an ID column. It may be more difficult to report on, but it sounds like it may work in your case.