Is there a "ready-to-use" method to anonymize datas, but keeping relations between keys ?
For example, I have:

Table #1
user code | zip code
ztxp15    | 45789

And:

Table #2
user code | order date
ztxp15    | 2021-06-27 06:22pm

I want it anonymized as:

Table #1
user code | zip code
xvdf65    | 32165

And:

Table #2
user code | order date
xvdf65    | 2021-06-27 06:22pm
This would need a bijective function that transforms a value while keeping its format ([a-z]{4}[0-9]{2}) and always produces the same output for the same input, keyed by a passphrase for example.
In this way uniqueness is kept, the format too, etc. But maybe I am missing something.
I think this problem is very common, so I am looking for previous work on it.
It is common practice to use a user identifier which by itself has no meaning to a viewer. I assume that in your case this is the user code.
You should only anonymise PII (Personally Identifiable Information). You can encrypt it for bi-directionality, or hash it for one-way anonymisation. Hashing is usually done when exporting data to analytics dashboards.
It is not common practice to anonymise the user code. If all PII is anonymised, then the user code is effectively anonymised.
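If you do want a keyed, format-preserving replacement for the user code itself, here is a minimal sketch of the idea, assuming a shared secret passphrase; the function name pseudonymise is made up. A keyed hash like this is deterministic but not a true bijection (unlike format-preserving encryption such as NIST FF1), so collisions are theoretically possible once you have many codes; it only illustrates the approach.

```python
import hashlib
import hmac

# Deterministic, passphrase-keyed pseudonymisation that keeps the
# [a-z]{4}[0-9]{2} format, so the same user code maps to the same
# replacement in every table and joins are preserved.
LETTERS = "abcdefghijklmnopqrstuvwxyz"
DIGITS = "0123456789"

def pseudonymise(user_code: str, passphrase: str) -> str:
    digest = hmac.new(passphrase.encode(), user_code.encode(), hashlib.sha256).digest()
    out = [LETTERS[digest[i] % 26] for i in range(4)]      # 4 letters
    out += [DIGITS[digest[i] % 10] for i in range(4, 6)]   # 2 digits
    return "".join(out)

# Same secret, same input -> same replacement in both tables.
print(pseudonymise("ztxp15", "my secret passphrase"))
print(pseudonymise("ztxp15", "my secret passphrase"))
```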
I have a database table with an autoincrement ID as primary key.
For each record of this table, I can have up to 3 files, which can be publicly available so random filename generation is not mandatory, and these files are optional.
I think I have 2 possible solutions:
Store a randomly generated filename in 3 nullable varchar columns and store all the files in the same place:
columns: a | b | c
uploads/f6se54fse654.jpg
Don't store the filenames, but place the files in specific folders and name them after the primary key value:
uploads/a/1.jpg
uploads/b/1.jpg
uploads/c/1.jpg
With this last solution, I know that uploads/a/1.jpg belongs to record with ID 1, and is a file of type a. But I have to check if the file exists because the files are optional.
Do you think there is a good practice among these? Or maybe there is a better approach?
If the files you are talking about are intended to be displayed or downloaded by users (whether visitors or authenticated users, filtered by roles (ACL) or not), it is important (IMHO) to ensure that a user cannot guess any information other than the content of the resource that was actually sent to them. There is no perfect solution that applies to every case without exception, so let's take an example to explain further.
To enhance the security and total opacity of sensitive data, take the specific case of uploads/users/7/invoices/3.pdf: I think it would be wise to ensure that absolutely no one can guess how many files are potentially associated with the user or any other entity (otherwise, in this example, one could assume that other accessible files, 1.pdf and 2.pdf, probably exist). By design, we generally want to give access to files only in well-defined cases and contexts. However, this may not matter for an image that is intended to be seen by everyone (a profile photo, for example). That is why context matters.
If you choose to keep the auto-incremented identifiers as the names of your files, this also leaks information about the size of the data stored in your database (/uploads/invoices/128.pdf suggests that you may already have 127 other invoices on your server) and can motivate unscrupulous people to try to reach resources that should never be fetched outside the defined context. This is far less obvious if you use some kind of generated unique identifier (GUID).
I recommend reading up on the generation of GUIDs/UUIDs (128-bit identifiers, usually written as hexadecimal) and storing one in your database for each uploaded or created file. If you use a recent version of MySQL, it is even possible to store this identifier in a BINARY(16) column, with functions to convert to and from the textual UUID form. The result would look like /uploads/invoices/b0016303-8e4f-487a-8c30-5dddf1ebf7e9.pdf, which is a lot better as long as you ensure that the generated identifier is unique.
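As an illustration only (not tied to any particular framework), here is a tiny sketch of generating such a name, keeping the original extension and also producing the 16-byte value you could put in a BINARY(16) column; stored_name is a made-up helper name.

```python
import os
import uuid

def stored_name(original_filename: str) -> tuple[str, bytes]:
    """Return a UUID-based public filename plus the raw 16-byte form
    suitable for a BINARY(16) column."""
    file_id = uuid.uuid4()
    _, ext = os.path.splitext(original_filename)
    return f"{file_id}{ext.lower()}", file_id.bytes

public_name, db_key = stored_name("invoice scan.PDF")
print(public_name)   # e.g. b0016303-8e4f-487a-8c30-5dddf1ebf7e9.pdf
print(len(db_key))   # 16
```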
It does not seem useful to me to discuss performance issues here, because today there are many methods for caching files, paths and URLs that avoid hitting the database every time a resource is requested (often prioritised by popularity in big-data cases).
Last but not least, many web and mobile applications (Slack, Discord, Facebook, Twitter, ...) that store a lot of media files every day, often associated with user accounts and covering both public and confidential files and information, generate a unique identifier for each of them.
Twitter uses its own unique identifier generator (producing a 64-bit BIGINT) called Twitter Snowflake, which you might find interesting to read about too. It combines the UNIX epoch time in milliseconds with a machine identifier and a sequence number, so ids generated within the same millisecond remain unique.
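For illustration, a minimal Snowflake-style generator might look like the sketch below; the bit layout (41-bit timestamp, 10-bit machine id, 12-bit sequence) follows the commonly documented format, and the class name is made up.

```python
import threading
import time

# Commonly cited Twitter epoch; any fixed past instant would work.
EPOCH_MS = 1288834974657

class SnowflakeLike:
    """Time-ordered 64-bit ids: 41-bit ms timestamp | 10-bit machine id | 12-bit sequence."""

    def __init__(self, machine_id: int):
        assert 0 <= machine_id < 1024
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:  # sequence exhausted, spin until the next millisecond
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0      # no handling of a clock moving backwards in this sketch
            self.last_ms = now_ms
            return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

print(SnowflakeLike(machine_id=1).next_id())
```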
There isn't one global, perfect solution that can be applied to everything, but I hope this helps; you may want to dig deeper and find the "best solution" for each context and entity for which you store and link files.
I have a free text column in a database that needs to contain links to other objects in the database, like definitions in an appendix. This database will feed a system like a CMS.
So far, I can only think of two ways of representing links in a free text field:
Markdown format: [link](/entries/999)
HTML
Am I missing any easier solutions?
Also, are there any ways to represent a link to entry 999 (for example) without hardcoding a URL? I want to generate the URLs automatically, and make the contents of the database resilient to changes in the way that the URLs are structured.
Maybe similar: How to insert in a database elements with links to other elements?
I think that to solve a problem like this a couple of important points should be considered:
How important is the efficiency of the database queries (this depends on the size of the database, the frequency of the queries, the load on the server, etc.)
What kind of updates are done to the data: is the text (the pages) modified frequently? Do these updates modify the links?
And another minor point is: how do you prefer to balance the work between the database server and the application, in terms of complexity of programming, performance, etc.
If I have understood your problem correctly (and I am not sure of this), there is always the need to translate “links” from the application level to the database level (otherwise you should not have particular problems). If this is true, then I think you have the following options:
Maintain the links with a "database semantics": that is, transform them into links to fields through a pair of values (primary key of the record, name (or number) of the referenced field). Then you have two sub-options: keep those links inside the text, or extract them into a (sub)table that contains the two end-points of each link (the starting point would be something like [key of record, name of field, position in the text where the link occurs]; the ending point could simply be [key of record, name of field]).
Keep a "text semantics" for the links, leaving them inside the text, and invent some kind of URL-like notation that can easily be converted into a database link or, alternatively, be used to perform efficient searches in the database.
To decide which option to choose, then, one should consider the points above about efficiency and the kind of updates.
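To make the second option concrete, here is a minimal sketch, assuming a made-up entry:ID notation stored inside the Markdown and a central url_for_entry() helper, so the stored text survives changes to the URL structure.

```python
import re

# Stored text keeps an internal "entry:999" token; it is only turned into a
# real URL at render time. The scheme and helper names are illustrative.
ENTRY_LINK = re.compile(r"\(entry:(\d+)\)")

def url_for_entry(entry_id: int) -> str:
    # Centralised URL scheme; change only this when the site structure changes.
    return f"/entries/{entry_id}"

def render(stored_text: str) -> str:
    return ENTRY_LINK.sub(lambda m: f"({url_for_entry(int(m.group(1)))})", stored_text)

stored = "See the [glossary definition](entry:999) in the appendix."
print(render(stored))  # ... [glossary definition](/entries/999) ...
```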
I am doing a project which needs to store 30 distinct fields for a piece of business logic, which will later be used to generate a report for each record.
The 30 distinct fields are not written all at once; the business logic involves many transactions, so it goes something like:
Transaction 1, update field 1-4
Transaction 2, update field 3,5,9
Transaction 3, update field 8,12, 20-30
...
...
N.B. each transaction (all belonging to the same business process) updates an arbitrary number of fields, and not in any particular order.
I am wondering which database design would be best:
1. Have 30 columns in the Postgres database representing those 30 distinct fields.
2. Have the 30 fields stored as XML or JSON in just one Postgres column.
Which one is better, 1 or 2?
If I choose 1:
I know that from a programming perspective it is easier, because I don't need to read the whole XML/JSON, update a few fields and write it all back to the database; I can update just the few columns I need in each transaction.
If I choose 2:
I can potentially reuse the table generically for something else, since what's inside the blob column is just XML. But is it wrong to use a generic table to store something totally unrelated in business terms just because it has a blob column storing XML? This could save the effort of creating a few new tables, but is this kind of generic table reuse wrong in an RDBMS?
Also, by choosing 2 it seems I could handle potential changes such as altering certain fields or adding more fields; at least it seems I wouldn't need to change the database table. But I would still need to change the C++ and C# code to handle the change internally, so I'm not sure this is much of an advantage.
I am not experienced enough in database design to make the decision, so any input is appreciated.
N.B. there is a good chance I won't need to index or search on those 30 columns for now; a primary key would be created on an extra column if I choose 2. But I am not sure whether I will later be required to search on any of those columns/fields.
Basically all my fields are predefined in the requirement documents; they are generally simple fields like:
field1: value(max len 10)
field2: value(max len 20)
...
field20: value (max len 2)
No nested fields. Is it worth creating 20 columns, one for each of those fields (some are strings like date/time, some are plain strings, some are integers, etc.)?
2>
Is putting different business data in a shared table a bad design idea, if it is only put in a shared table because it shares the same structure? E.g. they all have a datetime column, a primary key and an XML column with different business data inside. This way we save some effort of creating new tables... Is this saving of effort worth it?
Always store your XML/JSON fields as separate fields in a relational database. Doing so keeps your database normalized, allowing the database to do its thing with queries, indices, etc. And you will save other developers the headache of deciphering your XML/JSON field.
It will be more work up front to extract the fields from the XML/JSON, and perhaps to maintain it if fields need to be added, but once you create a class or classes to do so, that hurdle is eliminated and it will more than make up for a cryptic blob field.
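To make this concrete, here is a minimal sketch of option 1 using an in-memory SQLite database for illustration (the table and field names are made up); each transaction updates only the columns it touches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE report_record (
        id     INTEGER PRIMARY KEY,
        field1 TEXT,      -- max len 10 in the requirements
        field2 TEXT,      -- max len 20
        field3 INTEGER,
        field4 TEXT
        -- ... one column per remaining field, with an appropriate type
    )
""")
conn.execute("INSERT INTO report_record (id) VALUES (1)")

# Transaction 1 touches only fields 1-4; nothing else is read or rewritten.
with conn:
    conn.execute(
        "UPDATE report_record SET field1 = ?, field2 = ?, field3 = ?, field4 = ? WHERE id = ?",
        ("abc", "first order", 42, "2021-06-27", 1),
    )
print(conn.execute("SELECT * FROM report_record").fetchone())
```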
In general it's wise to split the JSON or XML document out and store it as individual columns. This gives you the ability to set up constraints on the columns for validation and checking, to index columns, to use appropriate data types for each field, and generally use the power of the database.
Mapping it to/from objects isn't generally too hard, as there are numerous tools for this. For example, Java offers JAXB and JPA.
The main time when splitting it out isn't such a great idea is when you don't know in advance what the fields of the JSON or XML document will be or how many of them there will be. In this case you really only have two choices - to use an EAV-like data model, or store the document directly as a database field.
In this case (and this case only) I would consider storing the document in the database directly. PostgreSQL's SQL/XML support means you can still create expression indexes on xpath expressions, and you can use triggers for some validation.
This isn't a good option, it's just that EAV is usually an even worse option.
If the document is "flat", i.e. a single level of keys and values with no nesting, then consider storing it as hstore instead, as the hstore data type is a lot more powerful.
Option (1) is more standard, for good reasons. For one thing, it lets the database do the heavy lifting on things like search and indexing.
I was building an RSS reader that stores the pulled articles in a database (SQLite in particular, but I don't think that matters).
Anyway, when I originally designed and coded it, the idea was to create a new table for every feed the user is subscribed to, plus a big meta table. After reading a bit more about database management, I found another way to handle this: have two tables, the meta table and a single items table holding every item from every RSS feed, with a column storing the id of the feed each item came from.
So, is there any major reason why I should switch to the model with one large items table, rather than having a table for each feed the user is subscribed to?
From what you wrote:
"to create a new table for every feed the user is subscribed to"
In a database world, at least for me, that is insane.
Just try to picture a user who wants to subscribe to 1,000 RSS feeds: will you create 1,000 tables? No way.
You can relate your data through primary keys and foreign keys, so why not use that strength?
First, it will be easier for you to write your queries. You won't have to worry about table names: you will have a table rssfeed and a table post, and everything will be linked together.
Spend time modelling your database. In your case it won't be that hard.
You might need 3 or 4 tables in order to handle RSS feeds, posts and metadata.
Ask another question here: "How to design a database for this need?"
People will help you with pleasure.
Ask your question: you'll save time and money (even if it's not about that), and you'll get best practices (avoiding ugly design).
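As a rough illustration of the rssfeed/post design mentioned above (names are only examples), using the SQLite you already have:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rssfeed (
        id    INTEGER PRIMARY KEY,
        url   TEXT NOT NULL UNIQUE,
        title TEXT
    );
    CREATE TABLE post (
        id        INTEGER PRIMARY KEY,
        feed_id   INTEGER NOT NULL REFERENCES rssfeed(id),
        guid      TEXT,        -- item id from the feed, useful for deduplication
        title     TEXT,
        published TEXT,
        content   TEXT
    );
    CREATE INDEX post_feed_idx ON post(feed_id);
""")
conn.execute("INSERT INTO rssfeed (url, title) VALUES (?, ?)",
             ("https://example.com/feed.xml", "Example feed"))
conn.execute("INSERT INTO post (feed_id, title, published) VALUES (?, ?, ?)",
             (1, "First article", "2021-06-27"))

# One query covers every feed; no per-feed table names to juggle.
print(conn.execute(
    "SELECT rssfeed.title, post.title FROM post JOIN rssfeed ON rssfeed.id = post.feed_id"
).fetchall())
```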
The typical way of storing such data (assuming that the structure of the data is the same for all feeds) is indeed to have a single table for all feeds.
Why? Because this will allow you to access all feeds in the same way. For example, let's say you want to combine all feeds in a single view, or calculate some kind of statistic over all of your feeds. By having them all located in a single table this will be extremely simple; having them all in different tables will make this much more complex, without any (as far as I can see) added value.
It's a matter of simplicity of coding versus the probably slight performance edge of having one table per RSS feed. Having one table (rather than one per feed) means your code doesn't have to do any DDL and you could more easily do cross-RSS-feed searching; but queries and updates could be a little slower. I'd probably opt for a single table with a Feed column (indexed) to make searches simpler.
What sort of database schema would you use to store email messages, with as much header information as practical/possible, into a database?
Assume that they have been fed into a script from the MTA and parsed into the relevant headers/body/attachments.
Would you store the message body whole in the database table, or split any MIME-parts apart? What about attachments?
You may want to check the architecture and the DB schema of "Archiveopteryx".
You may want to use a schema where the message body and attachment records can be shared between multiple recipients on the message. It's not uncommon to see email servers where fully 50% of the disk storage is used by duplicate emails.
A simple hash of the body/attachment would be enough to see if that record was already in the database. However, you would still need to keep separate headers.
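A minimal sketch of that idea, with a plain dict standing in for a blobs table keyed by a SHA-256 content hash (store_blob is an invented helper):

```python
import hashlib

# Content-addressed deduplication for bodies/attachments: hash the bytes,
# and only store a new blob row when that hash has not been seen before.
blob_store: dict[str, bytes] = {}

def store_blob(content: bytes) -> str:
    digest = hashlib.sha256(content).hexdigest()
    if digest not in blob_store:   # duplicate content is stored only once
        blob_store[digest] = content
    return digest                  # each message references this key

pdf = b"%PDF-1.4 ... same attachment sent to 50 recipients ..."
keys = {store_blob(pdf) for _ in range(50)}
print(len(blob_store), keys)       # 1 stored copy, 1 shared key
```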
Depends on what you're going to be doing with it. If you're going to need to do frequent searching against certain bits of it, you'll want to break it up in a way that makes sense for your usage case. If it's just for something like storage of e-mail for Sarbanes-Oxley compliance, you'd probably be okay storing the whole thing - headers, parts, etc. - as one big text field.
Suggestion: create a well-defined table for storing e-mail with a column for each relevant part of a message: sender, headers, subject, body. It will be much simpler later if you want to query, for example, by the subject field. In the same table you can define a field to keep the path of an attachment and store the attached file on the file system, rather than storing it in blob fields.
An important step in database schema design is to figure out what types of entity you want to model. For this application the entities might be:
Messages
E-mail addresses
Conversation threads (perhaps: if you want to do efficient threading)
Attachments (perhaps: as suggested in other answers)
...
Once you know the entities, you can identify relationships between entities, which can be represented by tables:
Messages have a many-many relationship to messages (In-Reply-To and References headers).
Messages have a many-many relationship to e-mail addresses (From, To, Cc etc headers).
Messages have a many-one relationship with threads.
Messages have a many-many relationship with attachments.
...
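As one possible illustration of these entities and relationships (table and column names are only examples, and a real schema would carry many more header columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE message (
        id         INTEGER PRIMARY KEY,
        message_id TEXT UNIQUE,          -- Message-ID header
        subject    TEXT,
        sent_at    TEXT,
        body       TEXT
    );
    CREATE TABLE address (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE
    );
    -- many-many: messages <-> addresses, with the header role (From/To/Cc...)
    CREATE TABLE message_address (
        message_id INTEGER NOT NULL REFERENCES message(id),
        address_id INTEGER NOT NULL REFERENCES address(id),
        role       TEXT NOT NULL,
        PRIMARY KEY (message_id, address_id, role)
    );
    -- many-many: messages <-> messages (In-Reply-To / References headers)
    CREATE TABLE message_reference (
        message_id    INTEGER NOT NULL REFERENCES message(id),
        referenced_id INTEGER NOT NULL REFERENCES message(id),
        PRIMARY KEY (message_id, referenced_id)
    );
    -- many-many: messages <-> attachments (deduplicated by content hash)
    CREATE TABLE attachment (
        id           INTEGER PRIMARY KEY,
        content_hash TEXT UNIQUE,
        filename     TEXT,
        media_type   TEXT
    );
    CREATE TABLE message_attachment (
        message_id    INTEGER NOT NULL REFERENCES message(id),
        attachment_id INTEGER NOT NULL REFERENCES attachment(id),
        PRIMARY KEY (message_id, attachment_id)
    );
""")
print([row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")])
```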
It all depends on what you want to do with the data, but in general I would want to store all data and also make sure that the semantics interpreted by the MUA are preserved in the db, so for example:
- All headers that are parsed should have their own column
- A column should contain the whole raw headers
- The attachments (including the body parts of a multipart message) should be in a table with a many-to-one relationship to the email table.
You'll probably want to at least store attachments separately to optimize storage. It's astonishing to see the size and quantity of attachments (videos, etc.) that most users unhesitatingly attach to emails.
In the case of outgoing emails you may have multiple emails sending the same attachment. It's far more efficient to store a single copy of the attachment that is referenced by all emails that share it.
Another reason for storing attachments separately is that it gives you some archiving options later on. Should storage space become an issue, you can always go back and delete large attachments older than a given date in order to compact the database.
If it is already split up, and you can be sure that the routine to split the data is sound, then I would split the table up as granularly as possible. You can always parse it back together in your middle tier. If space is not an issue, you could even store it twice: once split into the relevant fields, and once in another field that holds the whole thing as one blob, in case putting it back together is hard.