Best practice to limit a DB text field just to prevent harm [closed]

The project I work on has some blog articles, and I think their text field should be limited somehow (it may become a JSON field one day).
There is no limitation in the domain sense: a user can write as much as they want. But just to protect the DB from hack-attacks that submit uncommonly huge amounts of text, I guess some limit is needed.
As existing SO Q&A says:
PostgreSQL limits a TEXT field to 1 GB ("Is there a maximum length when storing into PostgreSQL TEXT?")
HTTP POST limits depend on the browser (2 GB - 4 GB): https://serverfault.com/questions/151090/
Nginx's default client_max_body_size is reportedly 1 MB
So, how should I deal with all of that?
Is there some common practice like "just limit it to a million chars at the app level and don't worry"?

This is an interesting question. If I understand correctly you are working on an application where the DB entry will essentially be a blog post (typically 500-1000 words based on most blogs I read).
You are storing the blog post as a text field in your database. You are quite reasonably worried about what happens with large blobs of data.
I fully advocate for you having a limit on the amount of data the user can enter. Without fully understanding the architecture of your system it's impossible to say what is the theoretical max size based on the technologies used.
However, it is better to look at this from a user perspective. Decide the maximum reasonable amount of text you need to store, then add a bit more, say 10%, because let's face it, users will do the unexpected. You can then raise an error condition when someone tries to enter more data.
My reason for proposing this approach is a simple one, once you have defined a maximum post size you can use boundary value analysis (testing just either side of the limit) to prove that your product behaves correctly both just below and at the limit. This way you will know and can explain the product behaviour to the users etc.
If you choose to let the architecture define the limit then you will have undefined behaviour. You will need to analyze each component in turn to work out the maximum size they will accept and the behaviour they exhibit when that size is exceeded.
Typically (in my experience) developers don't put this effort in and let the users do the testing for them. This of course is generally worse because the user reports a strange error message and then the debugging is ultimately time consuming and expensive.
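To make that concrete, here is a minimal sketch of an app-level limit with boundary-value checks; the MAX_POST_CHARS value, the function names and the use of a plain character count are illustrative assumptions, not something from the question or answer:

// Hypothetical limit chosen from the "reasonable maximum plus ~10%" rule above.
const MAX_POST_CHARS = 100_000;

interface ValidationResult {
  ok: boolean;
  error?: string;
}

// Reject an oversized post body before it ever reaches the database.
function validatePostBody(body: string): ValidationResult {
  if (body.length > MAX_POST_CHARS) {
    return { ok: false, error: `Post exceeds ${MAX_POST_CHARS} characters` };
  }
  return { ok: true };
}

// Boundary value analysis: just below, at, and just above the limit.
console.assert(validatePostBody("a".repeat(MAX_POST_CHARS - 1)).ok);
console.assert(validatePostBody("a".repeat(MAX_POST_CHARS)).ok);
console.assert(!validatePostBody("a".repeat(MAX_POST_CHARS + 1)).ok);

You can pair an app-level check like this with a database length constraint on the column, so the limit still holds if data bypasses the application.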

Related

Utilize Vuex/Redux instead of a DB [closed]

In this post I am looking for some advice/clarification regarding performance of an external DB (or API) vs local storage, specifically for a VueJS/React application (but that doesn't matter much).
I am developing a website where the user can list, sort and filter around 200 products. During development I have stored the data in Vuex using object literal format:
{
  name: "Productname",
  description: "Description of product",
  price: 20,
  student: true
}
My plan was to put the data in a NoSQL database for production. The reasoning was that it would give better performance and remove a thousand lines of code from the store. However, using (e.g.) MongoDB I would probably fetch all products on mount and put them in the store anyway.
1. If I assume all users will see all 200 products, is it a bad idea to fetch all 200 products (from the API or DB) into Vuex/Redux on mount? I mean, we would not risk loads of fetch calls (which could become costly) and the data would be preloaded for the user. For this specific example we are talking about under 10,000 rows of ASCII JSON (but I am also curious in general).
2. Let's say a user wants to filter for student products; could there be any benefit to doing that remotely? Why not do products.filter(p => p.student) in the store or a component (we already fetched all products)?
3. If 1 and 2 are true, then why use an external DB at all? Is it mostly for maintaining the data stored there, for example adding/removing products, that we use them? Can this statement be made: "Yes, of course, if you have X products then external storage is not needed", and if so, what is X?
It is considered a bad idea performance-wise and network-wise. The advantage of using an API in this case is that you can limit the data you send and paginate it. Most of the time a user doesn't need to see 200 items at once, and it can happen that the item they want to see/update/delete is among the first ones returned, which means you sent a lot of data that didn't need to be sent. That is why you have pagination or infinite scroll (when you get to the bottom of the page and it loads more data).
You could first filter the data that is already fetched, and if that doesn't return anything, you could then call an endpoint that you defined in your backend and query your DB there to return the data the user is searching for.
A user can delete their localStorage and all the items go bye-bye unless they are hard-coded, in which case why even use localStorage? If your data is in a DB and you took all the precautions to make it secure and built the API without security faults, then you can make sure your users always have data available to them. It doesn't really matter what X is supposed to be; what really matters is: would various users have access to the same data, which needs to be the same for all of them? Can the users alter the data in any way?
This is really what I've learned, and you need to think about what your application will actually do. Your state manager in the frontend should be considered more of a, well, state manager. It will manage the data you fetched so you can guarantee one source of truth for your application.
I hope this somewhat helps, and I would also appreciate it if someone with more experience could explain it better or tell me why I'm wrong.
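As a rough illustration of the "filter locally first, then fall back to the backend" idea above, here is a sketch; the Product shape mirrors the question's object literal, but the /api/products endpoint, its query parameters and the function name are assumptions:

interface Product {
  name: string;
  description: string;
  price: number;
  student: boolean;
}

// Filter what is already in the store; only hit the network if nothing matches locally.
async function findStudentProducts(cached: Product[]): Promise<Product[]> {
  const local = cached.filter(p => p.student);
  if (local.length > 0) {
    return local; // no network round trip needed
  }
  // Hypothetical paginated endpoint; adapt to whatever your backend actually exposes.
  const res = await fetch("/api/products?student=true&page=1&pageSize=50");
  if (!res.ok) {
    throw new Error(`Request failed: ${res.status}`);
  }
  return (await res.json()) as Product[];
}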

Performance implications of keeping count of how many times a user chose a certain entity [closed]

We have a feature request where a given user's choice of a certain entity has to be recorded; this would be an integer that is incremented every time the user chooses that specific entity. This information is later used for ordering the entities every time the user fetches all of their associated entities.
Assuming that we are limited to SQL Server as our persistent storage, and that we could have millions of users where each user could have anywhere between 1 and 10 of those entities, performance could quickly become an issue.
One option is to have an append-only log table where the user ID and entity ID are saved every time a user chooses an entity, and on the query side we get all the records and group by entity ID. This could quickly cause the table to grow very large.
The other would be to have three columns consisting of user ID, entity ID and count, and we would increment the count every time the user chooses the entity.
The above two options are what I have been able to come up with.
I would like to know if there are any other options and also what would be the performance implications of the above solutions.
Thanks
If this were 2001, performance could be an issue. If you are using SQL Server 2012 or 2016, then I'd say either option is fine. Ints or bigints index well and the performance hit would be insignificant. You could also store the data in an XML or JSON varchar field, but I would go with your first option. Use bigints, and make sure you use indexes no matter what you do.
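To make the two options from the question concrete, here is a rough sketch of the append-only variant this answer favours; the db.query client, the table name and the column names are placeholders I am assuming, not anything given in the question:

// Hypothetical parameterised SQL client (e.g. a thin wrapper around the mssql package).
declare const db: { query(sql: string, params: unknown[]): Promise<any[]> };

// Option 1: append-only log, one row per choice.
async function recordChoice(userId: number, entityId: number): Promise<void> {
  await db.query(
    "INSERT INTO entity_choice_log (user_id, entity_id, chosen_at) VALUES (@p0, @p1, SYSUTCDATETIME())",
    [userId, entityId]
  );
}

// Query side: counts per entity for one user, used to order that user's entities.
async function choiceCounts(userId: number): Promise<Array<{ entityId: number; choiceCount: number }>> {
  return db.query(
    "SELECT entity_id AS entityId, COUNT(*) AS choiceCount FROM entity_choice_log WHERE user_id = @p0 GROUP BY entity_id ORDER BY choiceCount DESC",
    [userId]
  );
}

The counter-table variant would instead do a single upsert per choice (insert the row with count 1, or increment the existing count), trading a larger write cost per choice for much cheaper reads and a smaller table.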

Logging to Oracle Database Table Approach [closed]

We are moving a logging table from DB2 to Oracle, where we log exceptions and warnings from many applications. With the move I want to accomplish two main things, among others: less space consumption (to allow more rows, because the tablespaces are kinda small) while not increasing server processing usage too much (CPU usage increases our bill).
In DB2 we basically have a table that holds text strings.
In Oracle I am taking the approach of normalizing the tables for columns with duplicated data (event_type, machine, assemblies, versno). I have a procedure that receives multiple parameters, and I query the reference tables to get the IDs.
This is the Oracle table description.
One piece of feedback I have so far from a co-worker is that I will not necessarily reduce table space, since indexes take space and my solution might end up using more than simply saving all the strings does. We don't know if this is true; does anyone have more information on this?
Am I taking the right approach?
Will this approach help me accomplish my 2 main goals?
Additional feedback is welcome and appreciated.
The approach of using surrogate keys (numerical IDs) and dimension tables (containing the ID key and the description) is popular in both OLTP and data warehousing. IMO its use for logging is a bit strange.
The problem is that the logging component should not make many assumptions about the data to be logged; it is vital to be able to log the exceptional cases.
So if you can't map the host name to an ID (it was misspelled or simply not configured), it is not much help to log unknownhost, or to suppress the logging altogether.
As for storage, you can indeed save a lot by storing IDs instead of long strings, but (depending on the data) you may get a similar effect using table compression.
I think the most important thing about logging is that it works all the time. It must be bullet-proof, because if the logger fails you lose the information you need to diagnose problems in your system. Even worse would be to fail business transactions because the logger abended.
There is a strong inverse relationship between the complexity of a logging implementation and its reliability. The more moving parts it has, the more things there are to go wrong. So, while normalization is a good thing for business data it introduces unwelcome risk. Also, look-ups increase the overhead of writing a log message and that is also undesirable.
" tablespaces are kinda small"
Increasing the chance of failure is not a good trade-off. Ask for more space.
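A rough sketch of the point about exceptional cases: resolve the dimension ID if you can, but never let a failed lookup block or distort the log write. The lookup and logging functions here are hypothetical placeholders, not anything from the question:

// Hypothetical lookup; returns undefined when the host is misspelled or not configured.
declare function lookupHostId(hostname: string): Promise<number | undefined>;
// Hypothetical write to the log table, keeping the raw value alongside the optional ID.
declare function writeLogRow(row: { hostId?: number; hostRaw: string; message: string }): Promise<void>;

async function logEvent(hostname: string, message: string): Promise<void> {
  let hostId: number | undefined;
  try {
    hostId = await lookupHostId(hostname);
  } catch {
    // A lookup failure must not prevent the log write itself.
  }
  await writeLogRow({ hostId, hostRaw: hostname, message });
}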

DB design for a high amount of data (20 million rows/day) [closed]

We are looking to create software that receives log files from a high number of devices. We are looking at around 20 million log rows a day (2 KB for each log line).
I have developed a lot of software, but never with this large a quantity of input data. The data needs to be searchable, sortable, and groupable by source IP, dest IP, alert level, etc.
It should combine similar log entries ("occurred 6 times", etc.).
Any ideas and suggestions on what type of design, database and general thinking around this would be much appreciated.
UPDATE:
Found this presentation, seems like a similar scenario, any thoughts on this?
http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density
I see a couple of things you may want to consider.
1) A message queue - drop a log line onto it and let another part of the system (a worker) take care of it when time permits.
2) NoSQL - Redis, MongoDB, Cassandra.
I think your real problem would be in querying the data, not in storing it.
You probably also need a scalable solution.
Some NoSQL databases are distributed, and you may need that.
Check this out, it might be helpful
https://github.com/facebook/scribe
A web search on "Stackoverflow logging device data" yielded dozens of hits.
Here is one of them. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.
I'd base many decisions on how users most often will be selecting subsets of data -- by device? by date? by sourceIP? You want to keep indexes to a minimum and use only those you need to get the job done.
For low-cardinality columns where indexing overhead is high yet the value of using an index is low, e.g. alert-level, I'd recommend a trigger to create rows in another table to identify rows corresponding to emergency situations (e.g. where alert-level > x) so that alert-level itself would not have to be indexed, and yet you could rapidly find all high-alert-level rows.
Since users are updating the logs, you could move handled/managed rows older than 'x' days out of the active log and into an archive log, which would improve performance for ad-hoc queries.
For identifying recurrent problems (same problem on same device, or same problem on same ip address, same problem on all devices made by the same manufacturer, or from the same manufacturing run, for example) you could identify the subset of columns that define the particular kind of problem and then create (in a trigger) a hash of the values in those columns. Thus, all problems of the same kind would have the same hash value. You could have multiple columns like this -- it would depend on your definition of "similar problem" and how many different problem-kinds you wanted to track, and on the subset of columns you'd need to enlist to define each kind of problem. If you index the hash-value column, your users would be able to very quickly answer the question, "Are we seeing this kind of problem frequently?" They'd look at the current row, grab its hash-value, and then search the database for other rows with that hash value.
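As a sketch of the hash-column idea (the answer suggests computing it in a database trigger; it is shown at application level here purely to illustrate, and the row fields are assumed, not the asker's actual schema):

import { createHash } from "node:crypto";

// One log row; the fields are illustrative.
interface LogRow {
  deviceId: string;
  sourceIp: string;
  destIp: string;
  alertLevel: number;
  message: string;
}

// Hash only the columns that define "the same kind of problem".
// Rows for the same kind of problem then share a hash value you can index and group by.
function problemKindHash(row: LogRow): string {
  const keyColumns = [row.deviceId, row.sourceIp, row.message];
  return createHash("sha256").update(keyColumns.join("|")).digest("hex");
}

With an index on the stored hash column, "are we seeing this kind of problem frequently?" becomes a simple equality lookup.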

Real-world usage for artificial neural networks [closed]

I have written an artificial neural network (ANN) implementation for myself (it was fun). I am now thinking about where I can use it.
What are the key areas in the real world, where ANN is being used?
ANNs are an example of a "learning" system, one that "trains" on input data (in some domain) in order to effectively classify (unseen) data in that domain. They've been used for everything from character recognition to computer games and beyond.
If you're trying to find a domain, pick some topic or field that interests you, and see what kinds of classification problems exist there.
Most often for classifying noisy inputs into fixed categories, like handwritten letters into their equivalent character, spoken voice into phonemes, or noisy sensor readings into a set of fixed values. Usually, the set of categories is small (26 letters, a couple of dozen phonemes, etc.).
Others will point out how all these things are better done with specialized algorithms....
I once wrote an ANN to predict the stock market. It succeeded with about 80% accuracy.
The key here was to first get hold of a couple of million rows of real stock data. I used this data to train the network and prime it for real data. There were about 8-10 input variables and a single output value that would indicate the predicted value of the stock on the next day.
You could also check out the (ancient) ALVINN network where a car learnt to drive by itself by observing road data when a human driver was behind the wheel.
ANNs are also widely used in bioinformatics.
