Suppose I have a database with some personally identifiable information that I'd like to send to a third-party developer. They need "real-world" data, but I need to hide people's names, addresses, etc.
Two constraints:
Since the database is not completely normalized, there are some intentionally duplicated strings, so I need to make sure that the replacement maps every copy of a duplicate to the same value rather than to two different values.
The columns have different sizes, and the replacement text has to fit in the same columns. That precludes, e.g., blindly taking the first x characters of a hash.
Is there any tool or script out there that does this? If not, what would be a good way to go about this? Global search/replace scripts seem, at best, like a very slow and primitive building block.
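One approach that satisfies both constraints is to build a mapping table from each distinct original value to a short generated replacement and update through it: duplicates stay consistent because the map is keyed on the original string, and the generated values are short enough to fit the existing columns. A minimal sketch, assuming MySQL 8 syntax and a hypothetical people table with a name column:

    CREATE TABLE name_map (
        original    VARCHAR(100) PRIMARY KEY,
        replacement VARCHAR(100) NOT NULL
    );

    -- one replacement per distinct original value, so duplicates stay consistent
    INSERT INTO name_map (original, replacement)
    SELECT name, CONCAT('Person_', ROW_NUMBER() OVER (ORDER BY name))
    FROM (SELECT DISTINCT name FROM people) AS d;

    -- apply the mapping in place
    UPDATE people p
    JOIN name_map m ON m.original = p.name
    SET p.name = m.replacement;

The same pattern, repeated per sensitive column, keeps the export consistent without exceeding any column sizes.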
Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation. Is this a defensible design choice, or should I have normalized it from the start?
Some more context: this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and making it more maintainable. There are some things in there I'm not entirely happy with, and one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exceptional cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
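For comparison, here is a minimal sketch of the normalized alternative for the checkbox example above (all table and column names are assumptions). A junction table with a composite primary key gives you type checking, referential integrity, uniqueness, and ordinary indexed lookups:

    CREATE TABLE response (
        id INT PRIMARY KEY
    );

    CREATE TABLE checkbox_option (
        option_id INT PRIMARY KEY,
        label     VARCHAR(50) NOT NULL
    );

    CREATE TABLE response_option (
        response_id INT NOT NULL,
        option_id   INT NOT NULL,
        PRIMARY KEY (response_id, option_id),                      -- no duplicate values
        FOREIGN KEY (response_id) REFERENCES response (id),
        FOREIGN KEY (option_id)   REFERENCES checkbox_option (option_id)
    );

    -- all responses that selected option 2, with no string matching
    SELECT r.id
    FROM response r
    JOIN response_option ro ON ro.response_id = r.id
    WHERE ro.option_id = 2;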
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that contain only a given set of 2/3/etc. specific values in that comma-separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
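As a rough illustration of the bit-flag idea (the column name and flag values are invented): each checkbox gets a power-of-two value, the column stores their sum, and membership tests use a bitwise AND.

    -- hypothetical flags: 1 = newsletter, 2 = sms, 4 = phone, 8 = post
    CREATE TABLE form_response (
        id            INT PRIMARY KEY,
        contact_flags INT NOT NULL DEFAULT 0
    );

    -- 'newsletter' and 'phone' checked: 1 + 4 = 5
    INSERT INTO form_response (id, contact_flags) VALUES (1, 5);

    -- all responses where the 'phone' flag (4) is set
    SELECT id FROM form_response WHERE contact_flags & 4 <> 0;

It keeps everything in one column, though it still can't enforce referential integrity to a lookup table.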
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column; it could be implemented as an XML field.
It could be converted to a comma-delimited list as necessary, and the XML list can be queried in SQL Server using XQuery.
Because it is an XML field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with a delimited list, and it can be converted to a delimited list as needed.
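To make the XML variant concrete, here is a minimal T-SQL sketch (the idlist structure and names are assumptions) showing a containment search and how to shred the list into rows for joins or aggregates:

    DECLARE @t TABLE (id int, idlist xml);
    INSERT INTO @t VALUES (1, '<ids><id>1</id><id>2</id><id>3</id></ids>');

    -- rows whose list contains the value 2
    SELECT id
    FROM @t
    WHERE idlist.exist('/ids/id[. = "2"]') = 1;

    -- shred the list into rows so it can be joined or aggregated
    SELECT t.id, x.n.value('.', 'int') AS list_value
    FROM @t AS t
    CROSS APPLY t.idlist.nodes('/ids/id') AS x(n);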
Yes, it is that bad. My view is that if you don't like using relational databases, then look for an alternative that suits you better; there are lots of interesting "NoSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that serializes/deserializes the key/value pairs, then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or CHAR(0) (nullable) for each. You could also use a SET (I forget the exact syntax).
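A hedged MySQL sketch of both suggestions (the table, column names and option labels are invented): one BIT column per checkbox, or a SET column restricted to a fixed list of options.

    CREATE TABLE checkbox_answers (
        id            INT AUTO_INCREMENT PRIMARY KEY,
        wants_email   BIT NOT NULL DEFAULT b'0',   -- one column per checkbox
        wants_sms     BIT NOT NULL DEFAULT b'0',
        -- or a SET column holding any combination of fixed options
        contact_prefs SET('email', 'sms', 'phone') NOT NULL DEFAULT ''
    );

    -- rows where 'sms' is among the selected SET values
    SELECT id FROM checkbox_answers
    WHERE FIND_IN_SET('sms', contact_prefs) > 0;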
I am trying to design a database to act as a language dictionary where each word is associated not only with its definition but also with its grammatical "taxon". E.g., it should look something like this:
"eat": verb.imperative
"eat": verb.present
"ate": verb.past
"he": pronoun.masculine.singular
"she": pronoun.feminine.singular
"heiress": noun.feminine.singular
"heirs": noun.masculine.plural
"therefore": adverb
"but": conjunction
It seems that a natural data structure to hold such a grammatical "taxonomy" should be some kind of tree or graph. Although I haven't thought it through, I presume that should make it easier to perform queries of the type
plural OF masculine OF "heiress" -> "heirs"
At this point, however, I am just trying to come up with the least ineffective way to store such a dictionary in a regular relational database (namely LibreOffice Base). What do you suggest the data schema should be like? Is there something more efficient than the brute-force method where I'd have as many boolean columns as there are grammatical types and sub-types? E.g., "she" would be true for the columns pronoun, feminine, and singular, but false for all other columns (verb, adverb, conjunction, etc.)?
This is a really wide-open question, and there are many applications and much related research. Let me give some pointers based on software I have used.
One column would be the lexeme, for example "eat." A second column would give the part of speech, which in your data above would be a string or other identifier that shows whether it is a verb, pronoun, noun, adverb or conjunction.
It might make sense to create another table for verb information. For example, tense, aspect and mood might each be separate columns. But these columns would only make sense for verbs. For the nouns table, the columns would include number (singular, plural) and gender, and perhaps whether it is a count or mass noun. Pronouns would also include person (first, second or third person).
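A minimal sketch of that layout in generic SQL (all table and column names are assumptions): one table of lexemes with their part of speech, plus per-category tables for the attributes that only make sense for that category.

    CREATE TABLE lexeme (
        lexeme_id      INTEGER PRIMARY KEY,
        form           VARCHAR(64) NOT NULL,   -- e.g. 'eat', 'heir'
        part_of_speech VARCHAR(16) NOT NULL    -- 'verb', 'noun', 'pronoun', ...
    );

    CREATE TABLE verb_info (
        lexeme_id INTEGER PRIMARY KEY REFERENCES lexeme (lexeme_id),
        tense     VARCHAR(16),
        aspect    VARCHAR(16),
        mood      VARCHAR(16)
    );

    CREATE TABLE noun_info (
        lexeme_id INTEGER PRIMARY KEY REFERENCES lexeme (lexeme_id),
        gender    VARCHAR(16),
        number    VARCHAR(16),                 -- 'singular' or 'plural'
        is_count  BOOLEAN
    );

    CREATE TABLE pronoun_info (
        lexeme_id INTEGER PRIMARY KEY REFERENCES lexeme (lexeme_id),
        gender    VARCHAR(16),
        number    VARCHAR(16),
        person    SMALLINT                     -- 1, 2 or 3
    );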
Do you plan to include every form of every word? For example, will this database store "eats" and "eating" as well as "jumps" and "jumping?" It is much more efficient to store rules like "-s" for present singular and "-ing" for progressive. Then if there are exceptions, for example "ate," it can be described as having the underlying form of "eat" + "-ed." This rule would go under the "eat" lexeme, and there would be no separate "ate" entry.
There are also rules, such as the one that changes the plural of words ending in -y to -ies. This would go under the plural noun suffix ("-s"), not under individual words.
With these things in mind, I offer a more specific answer to your question: No, I do not think this data is best described hierarchically, nor with a tree or graph, but rather analytically and relationally. LibreOffice Base would be a reasonable choice for a fairly simple project of this type, using macros to help with the processing.
So for:
"heiress" -> masculine plural = "heirs"
The first thing to do would be to analyze "heiress" as "heir" + feminine. Then compose the desired wordform by combining "heir" and "-s."
I was going to add a list of related software such as Python NLTK, but for one thing, the list of available software is nearly endless, and for another, software recommendations are off-topic for stackoverflow.
So I'm making a database for a personal project just to get more than my feet wet with PostgreSQL and certain languages and applications that can use a PostgreSQL database.
I've come to the realization that using an array isn't necessarily even compliant with 1NF (arrays are not atomic, right?). So my question is: is there a lack of efficiency or data safety this way? Should I learn early not to use arrays?
Short answer to the title: No
A bit longer answer:
You should learn to use arrays when appropriate. Arrays are not bad design in themselves; they are as atomic as a character varying field (an array of characters, no?), and they exist to make our lives easier and our databases faster and lighter. There are issues concerning portability (most database systems don't support arrays, or do so in a different way than Postgres).
Example:
You have a blog with posts and tags, and each post may have 0 or more tags. The first thing that comes to mind is to make a different table with two columns postid and tagid and assign the tags in that table.
If we need to search through posts with tagid, then the extra table is necessary (with appropriate indexes of course).
But if we only want the tag information to be shown as the post's extra info, then we can easily add an integer array column to the posts table and extract the information from there. This can still be done with the extra table, but using an array reduces the size of the database (no extra tables or extra rows needed) and simplifies the query by letting us run our SELECTs with one less join, and it seems easier for a human to read (the last part is in the eye of the beholder, but I think I speak for a majority here). If our tags are preloaded, then not even one join is necessary.
The example may be poor but it's the first that came to mind.
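To make it concrete, here is a minimal PostgreSQL sketch of the blog example (the names are assumptions); a GIN index keeps the "contains this tag" search efficient:

    CREATE TABLE posts (
        post_id SERIAL PRIMARY KEY,
        title   TEXT NOT NULL,
        tag_ids INTEGER[] NOT NULL DEFAULT '{}'
    );

    -- index for array containment searches
    CREATE INDEX posts_tag_ids_idx ON posts USING GIN (tag_ids);

    INSERT INTO posts (title, tag_ids) VALUES ('Hello world', ARRAY[1, 3, 7]);

    -- all posts tagged with tag 3
    SELECT post_id, title FROM posts WHERE tag_ids @> ARRAY[3];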
Conclusion:
Arrays are not necessary. They can be harmful if you use them wrong. You can live without them and have a great, fast and optimized database. When you are considering portability (e.g. rewriting your system to work with other databases), then you must not use arrays.
If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant. When you use them in the right places, they can help a little with simplicity of database structures and your code, as well as space and speed optimization. That is all.
Whether an array is atomic depends on what you're interested in. If you generally want the whole array then it's atomic. If you are more interested in the individual elements then it is being used as structure. A text field is basically a list of characters. However, we're usually interested in the whole string.
Now, from a practical viewpoint, many frameworks and ORMs don't automatically unpack PostgreSQL's array types. Also, if you want to port the database to e.g. MySQL, you'll have to rework the schema, since MySQL has no equivalent array type.
Likewise foreign-key constraints can't be added to an array (EDIT: this is still true as of 2021).
Short answer: Yes, it is bad design. Using arrays will guarantee that your design is not 1NF, because to be 1NF there must be no repeating values. Proper design is unequivocal: make another table for the array's values and join when you need them all.
Arrays may be the right tool for the job in certain limited circumstances, but I would still try hard to avoid them. They're a feature of last resort.
The biggest problem with arrays is that they're a crutch. You know them already and you want to use them because they're familiar to you. But they do not work quite like you expect, and they will only allow you to postpone a true understanding of SQL and relational databases. You're much better off waiting until you're forced to use them than learning them and looking for opportunities to rely on them.
I believe arrays are a useful and appropriate design in cases where you're working with array-like data and want to use the power of SQL for efficient queries and analysis. I've begun using PostgreSQL arrays regularly for data science purposes, as well as in PostGIS for edge cases, as examples.
In addition to the well-explained challenges mentioned above, I'm finding the biggest problem in getting third-party client apps to be able to handle the array fields in ways I'd expect. In Tableau and QGIS, for example, arrays are treated as strings, so array operations are unavailable.
Arrays are a first class data type in the SQL standard, and generally allow for a simpler schema and more efficient queries. Arrays, in general, are a great data type. If your implementation is self-contained, and doesn't need to rely on third-party tools without an API or some other middleware that can deal with incompatibilities, then use the array field.
If, however, you interface with third-party software that queries the DB directly and has to build queries against those array fields, then I'd avoid arrays in favor of simpler lookup tables and other traditional relational approaches.
Trying to define some policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (use as little space as possible in the DB for keys, to leave room for as much data as possible)
As collision-free as possible (avoid two different objects ending up with the same key)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>). The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if an id has the character : in it?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like Ordered Set or Hash) to store namespaces. The main drawback to this would be loss of "shardability", since the structure to store the namespaces would need to be in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1 - i.e. namespaces separated by a character such as a colon. That said, the namespaces are almost always one level deep. For example: person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can get into a different shard or same shard depending on how you set it up.
Namespaced - Namespaces as a way to avoid collisions work with this approach. However, namespaces as a way to group keys don't work out. In general, using keys as a way to group data is a bad idea. For example, what if the person moves from one department to another? If you change the key, you will have to update all references - and that gets tricky.
It's best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, let's say you want to group people by department, by salary range, by location. Here's how you'd do it -
Individual people go in separate hashes with keys like persons:12321
Create a set for each group-by - for example, persons_by:department - and only store the numeric identifiers for each person in this set, for example [12321, 43432]. This way, you get the advantages of Redis' Integer Set.
Efficient - The method explained above is pretty efficient memory-wise. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision-free - This depends on your application. Each User or Person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
You mentioned two problems with this approach, and I will try to address them
What if the id has a colon?
It is of course possible, but your application's design should prevent it. It's best not to allow special characters in identifiers, because they will be used across multiple systems. For example, the identifier will very likely be part of a URL, and colon is a reserved character even for URLs.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
Size Efficiency
There is a cost to long keys, however it isn't too much. In general, you should worry about the data size of your values rather than the keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.
Could anyone point me to some patterns addressing internationalization tasks at the database level?
The simplest way would be to add a text column for every language for every text column, but that is somewhat smelly - really I want the ability to add supported languages dynamically.
The solution I'm coming to is one main language that is saved in the model, plus a dictionary entity that gets queried for translations and that translations get saved to.
All I want is to hear from other people who have done this.
You could create a table that has three columns: target language code, original string, translated string. The index on the table would be on the first two columns, but I wouldn't bind this table to other tables with foreign keys. You'd need to add a join (possibly a left join to account for missing translations) for each of the terms you need to translate in each query you run. However, this will make all your queries very hairy and possibly kill performance as well.
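A minimal sketch of that three-column table and the kind of join it forces on you (table and column names are assumptions):

    CREATE TABLE products (
        id   INT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

    CREATE TABLE translation (
        lang_code  CHAR(2)      NOT NULL,   -- target language, e.g. 'de'
        original   VARCHAR(255) NOT NULL,   -- string as stored in the source table
        translated VARCHAR(255) NOT NULL,
        PRIMARY KEY (lang_code, original)
    );

    -- product names shown in German, falling back to the original string
    -- when no translation exists
    SELECT p.id,
           COALESCE(t.translated, p.name) AS display_name
    FROM products p
    LEFT JOIN translation t
           ON t.original  = p.name
          AND t.lang_code = 'de';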
Another thing you need to be aware of is actually translating the terms and maintaining an up-to-date translation table. This is very inconvenient to do directly against the database and is often done by non-technical people.
Usually when localizing an application you'd use something like gettext. The idea behind this suite of tools is that it can parse source code to extract strings for translation and then create translation files from them. Since this suite has been around for a long time, there are a lot of different utilities based on it that help with the translation task, one of which is Poedit, a nice GUI editor for translating strings into different languages. It might be simpler to generate the unique list of terms as they appear in the database in a format gettext can parse, and do the translation in the application code. This way you'd be able to do the translation of the hard-coded strings in the application and the database values using the same technique.