Our application will store some information from a user that we do not want to be traced back to any other records in the database. For example (albeit a stupid one) - a user must pay to tell us anonymously what their favorite color is. We want to store each color record as a new row in the database and keep track of the transaction information.
If we stored the colors and transactions in separate tables, the rows could be correlated to one another if the server were hacked, by using the sequential ID of the rows (because a color will always have a transaction) or by the creation time of the row. So to solve this we won't have a sequential ID column for the colors table, or an update/modification time for the colors table.
Now, the only way to associate a color with a transaction is to look at the files that are used to actually store the database information. While this may be difficult and tedious, I imagine it is still possible because the colors table information would probably be stored sequentially in the files.
How can I store database information in an un-ordered matter, so that this could never happen? I suppose a more general question is how do I store information anonymously and securely? (But that is way too broad)
Obviously, an answer is don't let your database get hacked, but not a good one.
You can pre-generate millions of rows and randomly populate them.
If you need to analyze data, you will need to understand it, and if you can attacker can also. No matter what clever solution you will come up with correletion will still be possible. Relational DB transaction logs, wil show what and when and where was inserted updated deleted. So you cannot provide 100% decoupling of data, if you want to use the same db. You could encrypt data with some HSM, which would render stolen data useless for attacker. Or you can store data on some other machine with random delay or some batch processing, (wait and insert 20 records instead of one)... but it can be tricky and it can fail.
Consider leveraging a non-realtional database, e.g. NoSQL.
Related
Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen of changes made for each record.
Right now we need to add a following functionality to our system: every time user makes a change to the record of one of our tables, the system needs to keep track of it - we need to have complete history of changes and also be able to restore row data to selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and also we keep the track from which table the data originates. But the original data row is being stored in large text column in JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add new tracked table,
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using laravel 5.3 and mysql database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere differently, even possibly a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage of the backups.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just backup the MySQL database on a regular basis and store the backups elsewhere?
I'm pretty proficient with VBA, but I know almost nothing about Access! I'm running a complex simulation using Arrrays in VBA, and I want to store the results somewhere. Since the results of the simulation will be quite large (~1GB in memory), I'd like to store this in Access rather than Excel.
I currently have a large number of Arrays populated with my data, but I'm not sure how to write these to a database, or even how to create one with VBA. Here's what I need to do, in a nutshell, with VBA:
Create a new Access Database
Create a new Access Table (the db will be only a single table)
Create ~1200 fields programmatically
Copy the results from my arrays to the new Access table.
I've looked at a number of answers on here, but none of them seem to answer my question fully. For instance, Adding field to MS Access Table using VBA talks about adding fields to a database. But I don't see doubles listed here. Most of my arrays are doubles. Will this be a problem?
EDIT:
Here are a few more details about the project:
I am running a network design simulation. Thus, I start by generating ~150,000 unique networks. Then, I run a lot of calculations (no, these can't be simplified to queries unfortunately!) of characteristics for the network. There end up being ~1200 for each possible network (unique record). Thus, I would like to store these in an Access database. Each record will be a unique network, and each field will be a specific characteristic associated with that network.
Virtually all of the fields (arrays at this point!) are doubles.
You (almost?) never want a database with one table. You might as well store it in a text file. One of the main benefits of databases is relating data in different tables, and with one table you don't need it.
Fortunately for you, you need more than one table and a database might be the way to go. You (almost) never need to create permanent tables in code (temp tables, sure, but not permanent ones). If your field names are variable, you need to change your design. When data is variable, it goes in the data part of a database. When it's fixed, it can be a table or a field. Based on what you've said, I think you need this:
In Access create a tabled called tblNetworks with these fields
NetworkID AutoNumber
NetworkName Short Text
Then create another tabled called tblCalculations with these fields
CalcID Autonumber
NetworkID Long (Relates to tblNetworks, one to many)
CalcDesc Short Text
Result Number (Double)
What you were going to name your fields in your Access table will be the CalcDesc data. You'll use ADODB to execute INSERT INTO sql statements that put the data in the tables.
You'll end with tblNetworks with 150k records and tblCalculations with 1,200 x 150k records or so. When you're tables grow longer and not wider as things change, that a good indication you designed it right.
If you're really unfamiliar with Access, I recommend learning how to create Tables, setting up Relationships, and Referential Integrity. If you don't know SQL, search for INSERT INTO. And if you haven't used ADO before in Excel, search for ADODB Connections and the Execute method.
Update
You can definitely get away with a CSV for this. Like you said, it's pretty low overhead. Whether a text file or a database is the right answer probably depends more on how you're going to use the data and how often.
If you're going to pull this into Excel a small number of times, do a few sorts or filters, maybe a pivot table, then any performance hit you get from a CSV isn't going to be that bad. And if you only need to deal with a subset of the data at a time, you can use ADO to read a text file and only pull in the data you want at that time, further mitigating the slowness of sorting and filtering 150k rows. Not to mention if you have a few gigs of RAM, 150k x 1,200 probably won't be bad at all.
If you find that the performance of a CSV stinks because your hardware isn't up to the task, you have to access it often, or you doing a ton of different queries against the data, it may be to your benefit to use the database. If you fields are structured as you say, you may benefit from even more tables. You'd still have the network table and the calc table, but you'd also have Market, Slot, and Characteristic tables. Then your Calc table would look like:
CalcID
CalcDesc
NetworkID
MarketID
SlotID
CharacteristicID
Result
If you looking for data a lot of times and you need it quickly, you're not going to do better than a bunch of INNER JOINs on those tables and a WHERE clause that limits what you want.
But only you can decide if it's worth all the setup and overhead of using a database. And because of that, I would start down the CSV path until the reason to change presented itself. I would design my code in a way that switching from CSV to database only touched a few procedure (like by using class modules) so that the change didn't affect any already-tested business logic.
I decided to use HBase in a project to store the users activities in a social network. Despite the fact that HBase has a simple way to express data (column oriented) I'm facing some difficulties to decide how I would represent the data.
So, imagine that you have millions of users, and each user is generating an activity when they, for example, comment in a thread, publishes something, like, vote, etc. I thought basically in two approaches with an Activity hbase table:
The key could be the user reference + timestamp of activity creation, the value all the activity metadata (most of time fixed size)
The key is the user reference, and then each activity would be stored as a new column inside a column family.
I saw examples for others types of system (such as blogs) that uses the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.
What would be the impact in the way I access the data for these 2 approaches?
In general you are asking if your table should be wide or long. HBase works with both, up to a point. Wide tables should never have a row that exceeds region size (by default 256MB) -- so a really prolific user may crash the system if you store large chunks of data for their actions. However, if you are only storing a few bytes per action, then putting all user activity in one row will allow you to get their full history with one get. However, you will be retrieving the full row, which could cause some slowdown for a lot of history (10s of seconds for > 100MB rows).
Going with a tall table and an inverse time stamp would allow you to get a users recent activity very quickly (start a scan with the key = user id).
Using timestamps as the end of a key is a good idea if you want to query by time, but it is a bad idea if you want to optimize writes to your database (writes will always be in the most recent region in the system, causing hot spotting).
You might also want to consider putting more information (such as the activity) in the key so that you can pick up all activity of a particular type more easily.
Another example to look at is OpenTSDB
I am working on an employee objectives web application.
Lead/Manager sets objectives for team members after discussing with them. This is an yearly/half-yearly/quarterly depending on appraisal cycle the organization follows.
Now question is is better approach to add time period based fields or archive previous quarter's/year's data. When a user want to see previous objectives (not so frequent activity), the archive that belongs to that date may be restored in some temp table and shown to employee.
Points to start with
archiving: reduces db size, results in simpler db queries, adds an overhead when someone tried to see old data.
time-period based field/tables: one or more extra joins in queries, previous data is treated similar to current data so no overhead in retrieving old data.
PS: it is not space cost, my point is if we can achieve some optimization in terms of performance, as this is a web app and at peak times all the employees in an organization will be looking/updating it. so removing time period makes my queries a lot simpler.
Thanks
Assuming you're talking about data that changes over time, as opposed to logging-type data, then my preferred approach is to keep only the "latest" version of the data in your primary table(s), and to automatically copy the previous version of the data into a archive table. This archive table would mirror the primary, with the addition of versioned fields, such as timestamps. This archiving can be done with a trigger.
The main benefit that I see with this approach is that it doesn't compromise your database design. In particular, you don't have to worry about using composite keys that incorporate the version fields (in fact using time-based fields as keys may not even be permitted by your database).
If you need to go and look at the old data, you can run a select against the archive table and add version constraints to the query.
I would start off adding your time period fields and waiting until size becomes an issue. The kind of data you are describing does not sound like it is going to consume a lot of storage space.
Should it grow uncontrollably you can always look at the archive approach later - but the coding is going to take much longer than simply storing the relevant period with your data.
It seems to me that if you have the requirement that a user can look arbitrarily far back in the past, then you really must keep the data accessible.
This just won't be sustainable:
the archive that belongs to that date may be restored in some temp table and shown to employee.
My recommendation would be to periodically (read when absolutely necessary) move 'very old' data to another table for this purpose. Disk space is extremely cheap at this point, so keeping that data around is not nearly as expensive as implementing the system that can go back to an arbitrary time and restore an archive.
I'm looking for some design help here.
I'm doing work for a client that requires me to store data about their tens of thousands of employees. The data is being given to me in Excel spreadsheets, one for each city/country in which they have offices.
I have a database that contains a spreadsheets table and a data table. The data table has a column spreadsheet_id which links it back to the spreadsheets table so that I know which spreadsheet each data row came from. I also have a simple shell script which uploads the data to the database.
So far so good. However, there's some data missing from the original spreadsheets, and instead of giving me just the missing data, the client is giving me a modified version of the original spreadsheet with the new data appended to it. I cannot simply overwrite the original data since the data was already used and there are other tables that link to it.
The question is - how do I handle this? It seems to me that I have the following options:
Upload the entire modified spreadsheet, and mark the original as 'inactive'.
PROS: It's simple, straightforward, and easily automated.
CONS: There's a lot of redundant data being stored in the database unnecessarily, especially if the spreadsheet changes numerous times.
Do a diff on the spreadsheets and only upload the rows that changed.
PROS: Less data gets loaded into the database.
CONS: It's at least partially manual, and therefore prone to error. It also means that the database will no longer tell the entire story - e.g. if some data is missing at some later date, I will not be able to authoritatively say that I never got the data just by querying the database. And will doing diffs continue working even if I have to do it multiple times?
Write a process that compares each spreadsheet row with what's in the database, inserts the rows that have changed data, and sets the original data row to inactive. (I have to keep track of the original data also, so I can't overwrite it.)
PROS: It's automated.
CONS: It will take time to write and test such a process, and it will be very difficult for me to justify the time spent doing so.
I'm hoping to come up with a fourth and better solution. Any ideas as to what that might be?
If you have no way to be 100 % certain you can avoid human error in option 2, don't do it.
Option 3: It should not be too difficult (or time consuming) to write a VBA script that does the comparison for you. VBA is not fast, but you can let it run over night. Should not take more than one or two hours to get it running error free.
Option 1: This would be my preferred approach: Fast, simple, and I can't think of anything that could go wrong right now. (Well, you should first mark the original as 'inactive', then upload the new data set IMO). Especially if this can happen more often in the future, having a stable and fast process to deal with it is important.
If you are really worried about all the inactive entries, you can also delete them after your update (delete from spreadsheets where status='inactive' or somesuch). But so far, all databases I have seen in my work had lots of those. I wouldn't worry too much about it.