How do BTRFS and ZFS snapshots work? - filesystems

More specifically, how do they manage to look at the entire subvolume and remember everything about it (files, sizes of files, folder structure) while fitting it into such a small amount of data?

Suppose I have a list of names:
Joe
Bob
Fred
You tell me to remember this list. So I go, okay:
Joe
Bob
Fred
(as of 06/01/15)
The next day, you tell me to add the name "John" to the end of the list. I then copy over the list to get:
Joe
Bob
Fred
(as of 06/01/15)
Joe
Bob
Fred
John
(current)
This is the super simple description of how a snapshot works. The filesystem leaves itself a note of when the snapshot took place, and then when changes are made, it will make a fresh copy from the snapshot and write to that instead.
Of course, the copying happens on demand: only the parts of files you write to are copied. The net effect from a high-level point of view is that BTRFS "freezes" the files and then records future changes as deltas against the frozen data. The deltas can also be stacked and branched.
To answer your question, the act of saying "Note to self: Don't touch these files!" doesn't take much time at all.
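The list analogy above can be sketched as a toy copy-on-write store. This is only an illustration under simplifying assumptions, not how BTRFS or ZFS are implemented: the class `CowVolume`, its methods, and the snapshot label are all hypothetical, and where this toy copies a small table of block references, the real filesystems share their metadata trees as well.

```python
class CowVolume:
    """Toy copy-on-write volume: data blocks are never modified in place."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes (written once, never changed)
        self.live = {}       # file name -> list of block ids (current state)
        self.snapshots = {}  # snapshot label -> frozen copy of the file table
        self._next_id = 0

    def _store(self, data):
        # Every write lands in a brand-new block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        return block_id

    def write_file(self, name, chunks):
        self.live[name] = [self._store(chunk) for chunk in chunks]

    def snapshot(self, label):
        # Only references are copied; no file data is duplicated.
        self.snapshots[label] = {n: list(ids) for n, ids in self.live.items()}

    def overwrite_chunk(self, name, index, data):
        # The old block stays untouched for any snapshot that still points at it.
        self.live[name][index] = self._store(data)

    def read(self, table, name):
        return b"".join(self.blocks[i] for i in table[name])


vol = CowVolume()
vol.write_file("names.txt", [b"Joe\n", b"Bob\n", b"Fred\n"])
vol.snapshot("as of 06/01/15")
vol.overwrite_chunk("names.txt", 2, b"Fred\nJohn\n")

old = vol.read(vol.snapshots["as of 06/01/15"], "names.txt")
new = vol.read(vol.live, "names.txt")
```

Here `old` still reads as the original three names while `new` includes John, and only one extra block was written: the "Joe" and "Bob" blocks are shared between the snapshot and the live file.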

Related

Can individual entry in NTFS USN Journal be deleted?

Let's say NTFS journalling is enabled, but I don't want the change records for some of my files to be added to the journal. Is this possible? If not, is there any way, even after a change related to a particular file has been added to the USN journal, to delete only the record related to that file? From what I have read so far, you can delete the whole journal in one go using the defragmentation API or the fsutil tool, but not an individual record.
Any help would be appreciated.
It's true. While the journal exists, you cannot hide file changes, and you cannot delete single USN records the regular way. As Xearinox pointed out, the only way to manipulate that data is through direct disk write operations.
If you are interested in that, this is what you want to read:
Keeping an Eye on Your NTFS Drives: the Windows 2000 Change Journal Explained
Keeping an Eye on Your NTFS Drives, Part II: Building a Change Journal Application
In short: the USN journal is a non-fragmented series of USN records. The Update Sequence Number is actually just an offset. [1] So the whole structure is pretty straightforward.
The Change Journal always writes new records to the end of the file, so the implementors chose to use the file offset of a record as its USN
Source: Keeping an Eye on Your NTFS Drives: the Windows 2000 Change Journal Explained
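To make the "USN is just an offset" idea concrete, here is a minimal sketch of an append-only log whose record identifier is the byte offset at which the record starts. It is an assumption-laden toy: the class `AppendOnlyJournal` and its 4-byte length prefix are invented for illustration and do not reflect the real USN record layout on disk.

```python
class AppendOnlyJournal:
    """Toy append-only log: a record's sequence number is its byte offset."""

    def __init__(self):
        self.data = bytearray()

    def append(self, payload: bytes) -> int:
        usn = len(self.data)  # offset of the new record = its identifier
        # Hypothetical framing: 4-byte little-endian length, then the payload.
        self.data += len(payload).to_bytes(4, "little") + payload
        return usn

    def read(self, usn: int) -> bytes:
        length = int.from_bytes(self.data[usn:usn + 4], "little")
        return bytes(self.data[usn + 4:usn + 4 + length])


journal = AppendOnlyJournal()
first = journal.append(b"created a.txt")
second = journal.append(b"renamed a.txt")
```

Because identifiers are offsets, removing a record from the middle of such a log would shift every later record and invalidate its identifier, which is consistent with individual records not being deletable the regular way.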

Which of these data models is the most correct?

I'm in the process of creating the data model for an application I will be developing, and I was hoping to get some feedback on part of the model. The app will be a complete redevelopment of something that was created in Lotus Notes, and one of the main purposes of the redevelopment is to move toward a relational data storage layer.
The application is focused on managing Things. The requirements/constraints of the application are:
A Thing must have an associated Location.
A Location could be for example 'McDonalds', or 'Melbourne Uni, Building AK, Room 301' where 'Melbourne Uni', 'Building AK', and 'Room 301' are separate related Locations.
(at least) 3 levels/tiers of Location must exist
There must be a provision for 'Other' locations, so that users can enter free text for a location that does not exist in the database
So I've come up with 4 different implementations of the above, but I don't really have enough DBA experience to know which one is the most correct.
Location / Thing relational model
Any thoughts and/or suggestions on this would be greatly appreciated!
Both option 1s are likely to prove inflexible and difficult to amend if your estimate of three levels proves to be insufficient.
In your option 2s, the entity that looks dubious is ThingOtherLocations. Anything that (from its name) is concatenating two different concepts is automatically suspect. If it is the case that you do have two separate concepts here, then the structure of option 2b does not need either OtherLocation or ThingOtherLocation. I suspect that the relation you are trying to represent (gets its name from) is actually another relation between locations, though I am not clear on this.
EDIT
In the light of your clarification of the ThingOtherLocations, I would suggest that you treat the text associated with Other simply as a new location, and store the new location along with other locations. There does not seem to be any reason to include special database handling for these cases.
EDIT
To deal with the child location issue, you might like to consider Joe Celko's work on nested sets. The primary reference for this is:
Joe Celko.
Joe Celko's Trees and Hierarchies in SQL for Smarties,
(The Morgan Kaufmann Series in Data Management Systems)
ISBN 1-55860-920-2
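As a rough illustration of the nested sets idea applied to the Location examples from the question, here is a minimal sketch using SQLite from Python. The table name `location`, the column names `lft` and `rgt`, and the helper functions are assumptions made for this example and are not taken from the question's diagrams or from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE location ("
    " id INTEGER PRIMARY KEY, name TEXT NOT NULL,"
    " lft INTEGER NOT NULL, rgt INTEGER NOT NULL)"
)
# Each node's (lft, rgt) interval encloses the intervals of all its descendants.
conn.executemany(
    "INSERT INTO location (name, lft, rgt) VALUES (?, ?, ?)",
    [
        ("Melbourne Uni", 1, 6),
        ("Building AK", 2, 5),
        ("Room 301", 3, 4),
        ("McDonalds", 7, 8),
    ],
)


def path_to(name):
    """Return the chain of locations from the top level down to `name`."""
    rows = conn.execute(
        "SELECT parent.name FROM location AS node"
        " JOIN location AS parent ON node.lft BETWEEN parent.lft AND parent.rgt"
        " WHERE node.name = ? ORDER BY parent.lft",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def descendants_of(name):
    """Return every location nested anywhere below `name`."""
    rows = conn.execute(
        "SELECT child.name FROM location AS parent"
        " JOIN location AS child"
        " ON child.lft > parent.lft AND child.rgt < parent.rgt"
        " WHERE parent.name = ? ORDER BY child.lft",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


room_path = path_to("Room 301")
uni_children = descendants_of("Melbourne Uni")
```

The point of the design is that the number of levels is not fixed in the schema, so a fourth or fifth tier needs no table changes; the trade-off is that inserting a node means renumbering the `lft`/`rgt` values to its right.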

What are the arguments against merging contact details into a single field?

We have a customer that insists on putting contact details, at this time first and last names, into a single field. Take, for example, Mr. Bob Smith and Mrs. Jane Smith. Mr. Bob and Mrs. Jane would be entered into the first name field and Smith would be entered into the last name. It gets messier if the contacts have different last names or if there is a hyphenated name. The customer only wants one contact record so they came up with this system and implemented it on their own.
Our system is designed around contacts and each individual person is intended to be an individual contact, even married. Due to some of the attributes we must assign to people and notes we need to keep, a contact-centric approach is best. The above issue occurs in about 1/3 of the cases we handle.
Internally, my team has discussed how to sell the customer on using the database the way it was designed. We listed form letters and contact lists as being the main reasons for keeping the data clean and in the fields we designed. For example, using our recommendation, the customer will have much more granular control over form letter creation and sorting of data.
Any suggestions for how we sell this to the customer?
Tell them what they can get out of your system is only as good as what gets put in. If they want to enter inconsistent data, the cost they'll pay down the line is the inability to generate letters or mailing lists in the future.
They may need to learn this lesson the hard way for themselves. I see more problems with switching the names, for example, entering Smith as the first name and Bob as the last.
Also, can you make both fields required?
It sounds like what they want to enter is similar to AddressLine1, AddressLine2. It's just a poor design. I thought you had 2 name fields but they would only enter data in one of them (the first name).
All you can do is try to help them when they ask for it. They'll get the system they deserve.
Just show your customer the normal forms for database design:
> http://www.phlonx.com/resources/nf3/
Tell him that these normal forms are designed to make the database more manageable over time and make it more flexible.
Can't you just create a view that holds First and Last name together? For some servers you can also create editable views... So your customer will be happy and data will be stored normalized.
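A minimal sketch of the view suggestion, using SQLite from Python. The table `contact`, the view `contact_display`, and the column names are hypothetical, chosen only for this example. Note that SQLite views are read-only unless you add INSTEAD OF triggers, so the "editable view" part depends on the database server in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contact ("
    " id INTEGER PRIMARY KEY,"
    " first_name TEXT NOT NULL,"
    " last_name TEXT NOT NULL)"
)
# Each person stays an individual contact with separate name fields.
conn.executemany(
    "INSERT INTO contact (first_name, last_name) VALUES (?, ?)",
    [("Bob", "Smith"), ("Jane", "Smith")],
)
# The view presents the combined name without changing how it is stored.
conn.execute(
    "CREATE VIEW contact_display AS"
    " SELECT id, first_name || ' ' || last_name AS full_name FROM contact"
)

full_names = [
    row[0]
    for row in conn.execute("SELECT full_name FROM contact_display ORDER BY id")
]
```

The customer sees a single name column, while form letters and sorting can still use `first_name` and `last_name` directly.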
I'd try to put it in terms of money and time. You're going to spend more time trying to keep duplicates out of a db with their design, more time building relevant reports or queries (constantly having to parse a block name field... do they want address all in one too?!?), more money to scrub the data (either themselves or someone else) if they ever want to send the data to a third party for analysis and metrics.
It sounds like they don't want to let go of their design, maybe partly because they understand it. You may want to try and meet them halfway somehow at first, and involve them in the process of making incremental improvements to the design. That way they can see and understand the benefits that right now may just be over their head, pushing them out of their comfort zone. They have to trust you with their baby :)
The best argument is that you won't be responsible for the behavior of the database unless they put things where they belong.
If they want to make a single mailing to each "household", then I'm sure your app can do that. (Probably already does.) Y'all just have to come to terms on what "household" means. Since there may be rented rooms or long-term guests, it doesn't always mean "only one mailing piece per address".
FWIW, I've been doing this stuff for decades, and I still find doctors and attorneys (and their staffs) the hardest people to deal with. One time, I walked out of a meeting (and, of course, lost the chance to bid on the contract) when a doctor's IT guy stood up, pounded his fist on the table, and screamed at me over and over, "Doctors are not people! Doctors are not people!".

How to keep an ordered list on the App Engine DataStore, in a way each entity knows its position on the list?

I have a ranking of people, based on points they earn. I need to tell each participant their position in the game, for example:
John (1) - 170 points, Mary (2) - 160 points, Sarah (3) - 110 points
So John's the first one, Mary's the second and Sarah's the third. Now if Mary wins 20 more points, she'll be the first and John will be the second.
I'm trying to avoid having to run a task on cron to list and recalculate everybody's position.
My first try was to maintain a separate set of entities (PersonRank) so I wouldn't run into transaction problems. Each rank entity would have the same key name as its Person, so I could db.get() it by key, and it would hold the person's calculated rank. When a Person receives points, I'd check whether the next Person up the line has fewer points than me and, if so, exchange places, repeating until that is no longer the case.
The problem is that Sarah, in the example, may have won 100 points, and is now number one. With the previous algorithm, I'd have to "walk" among a lot of entities, which means a lot of DataStore gets and puts (updating each involved Entity to the new position).
My next guess is maybe some kind of linked list with ReferenceProperties, maybe using the key names to denote the position.
Any clues about how to implement this?
Much more complex than I thought, but fortunately some guys at Google have already implemented this solution - http://googleappengine.blogspot.com/2009/01/google-code-jams-ranking-library.html
As you self-answered 3 years ago, Google implemented this solution, but only for Python. There is an unofficial solution for Java: someone rewrote the Python solution and decided to share it:
http://toolongdidntread.com/google-app-engine/using-googles-datastore-to-implement-a-ranked-scoreboard/
People say it works fine, though I have not used it yet. I might decide to use it in the next few months, so I could write more about it then. Be aware that both the Java and Python solutions can handle only 3 requests/sec (!).
More reading here:
https://cloud.google.com/developers/articles/fast-and-reliable-ranking-in-datastore/
Basically, it says Google upgraded the code so it could deal with 300 requests/sec, and that if you buy their premium-level support they could help you with that, but they will not share the ready-to-go solution with mere mortals. Also, all the code and references are in Python, but that's pretty obvious.
There is still no better solution than those mentioned here, at least I'm not aware of any.
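To show why a counting structure avoids "walking" the list, here is a small in-memory sketch. It is not the Google ranking library and does not touch the DataStore; it swaps in a Fenwick (binary indexed) tree over score counts, and the class `ScoreRanker` and the upper bound `max_score` are assumptions made for the example. A player's rank is one plus the number of players with a strictly higher score, so nobody stores their own position.

```python
class ScoreRanker:
    """Toy ranker: counts players per score in a Fenwick tree."""

    def __init__(self, max_score):
        self.max_score = max_score           # assumed upper bound on scores
        self.tree = [0] * (max_score + 2)    # 1-based Fenwick tree
        self.scores = {}                     # player -> current score

    def _add(self, score, delta):
        i = score + 1
        while i <= self.max_score + 1:
            self.tree[i] += delta
            i += i & -i

    def _count_at_most(self, score):
        # Number of players whose score is <= the given score.
        i = score + 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def set_score(self, player, score):
        if player in self.scores:
            self._add(self.scores[player], -1)
        self.scores[player] = score
        self._add(score, 1)

    def rank(self, player):
        higher = len(self.scores) - self._count_at_most(self.scores[player])
        return higher + 1


ranker = ScoreRanker(max_score=1000)
ranker.set_score("John", 170)
ranker.set_score("Mary", 160)
ranker.set_score("Sarah", 110)
before = [ranker.rank(p) for p in ("John", "Mary", "Sarah")]

ranker.set_score("Sarah", 210)  # Sarah wins 100 points
after = [ranker.rank(p) for p in ("John", "Mary", "Sarah")]
```

Both updating a score and looking up a rank touch a logarithmic number of counters regardless of how many players are overtaken, which is the property the question is after; persisting such counters in the DataStore would bring contention issues that this sketch ignores.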

Are zip code and postal code violation of 3rd normal form?

Given that state information is implicit in the zip code, isn't storing both of them a violation of third normal form? Can or should you simply combine them into one field?
According to this post, there are a few zip codes that cross state boundaries. So no, it is not a violation of 3NF.
Actually, there are a few rare cases where a ZIP Code crosses state boundaries. Usually it is due to access problems, such as being on a military base or due to constraints of the transportation network.
One such case is Protem, Missouri (ZIP Code 65733). Some of the Arkansas roads north of Bull Shoals Lake can best be accessed by the Protem delivery unit rather than an Arkansas post office. Some examples of such roads include Ann Street, Kalijah Road, McBride Road, Red Oak Lane, and Vance Road on Highway Carrier Route H002 in ZIP Code 65733. McBride Road actually crosses the state boundary. If you look at the road network in an online mapping program, you can see that a rural carrier from say, nearby Diamond City, AR (ZIP code 72644), on the south side of Bull Shoals Lake, would need to drive several miles to be able to access the roads listed above.
For another example, Fort Campbell, Kentucky (ZIP Code 42223) also has some roads that exist within Tennessee.
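The 3NF question comes down to whether the functional dependency ZIP → state actually holds. A small sketch of that check, using only the ZIP codes mentioned above; the function name `determines` and the row layout are assumptions made for the example.

```python
from collections import defaultdict


def determines(rows, lhs, rhs):
    """Return True if each value of `lhs` maps to exactly one value of `rhs`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return all(len(values) == 1 for values in seen.values())


# Delivery addresses implied by the examples above.
addresses = [
    {"zip": "65733", "state": "MO"},  # Protem, Missouri
    {"zip": "65733", "state": "AR"},  # Arkansas roads served from Protem
    {"zip": "42223", "state": "KY"},  # Fort Campbell, Kentucky
    {"zip": "42223", "state": "TN"},  # Fort Campbell roads within Tennessee
    {"zip": "72644", "state": "AR"},  # Diamond City, AR
]

zip_determines_state = determines(addresses, "zip", "state")
```

Since `zip_determines_state` comes out false for this data, state is not a fact about the ZIP code alone, so storing both columns is not the transitive dependency that 3NF forbids.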
That statement isn't actually true in all geographical areas. Australia has a few sister cities that straddle state boundaries yet share the same postcode.
And 3NF, while incredibly useful, is not inviolable. I've sometimes reverted some table information back to 2NF for performance reasons.
Nope. There are some zip codes that cross state lines. See Wikipedia for some examples. Furthermore, normalization reduces redundancy, while addresses are actually fairly complicated things that are easy to get one component of wrong. Redundancy means that even if part of the address is wrong, there is a good chance that the mail will be able to get where it's going.
I recall a time when a hiker from Europe stayed at my fraternity, and wanted to send a thank-you note. He did not understand American addresses or geography very well, so when he sent the note it was addressed to "<fraternity name> <not quite correct name of university> New England? USA". The mail actually got there, amazingly enough.
Redundancy in addresses can be a very good thing, and you generally shouldn't assume more about an address than you need to. For instance, some people don't have a street number; you put "general delivery", and the mailman is expected to know where the letter goes (or you can pick it up at the post office if he doesn't).
There is a different issue. You might want to make a difference between the data that was entered (which could be conflicting) and the conclusion you make from that.
3NF violation by example
Consider a denormalized table for a blog posts project, in which each row is keyed by Post Id and also stores author details such as Author Email. It is not in 3rd normal form; it's broken. If there are multiple posts by the same author, we may update a few rows and leave others un-updated, leaving the table data inconsistent.
Hence this violates normalization. A common way of describing tables in 3rd normal form is that every non-key attribute in the table must provide a fact about the key, the whole key and nothing but the key. That is kind of a play on words on what you say in a US courtroom: telling the truth, the whole truth and nothing but the truth. The key in this case is the Post Id, and there is a non-key attribute, Author Email, which does not follow that rule, because it in fact tells something about the author rather than about the post. So the table violates 3rd normal form and does not achieve the goals of normalization.
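The update anomaly described above can be reproduced with a small SQLite sketch. The table names, column names, and sample rows are hypothetical stand-ins, since the answer's original table is not available; only the Post Id key and the Author Email attribute come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: the author's email is repeated on every post row.
conn.execute(
    "CREATE TABLE post_denormalized ("
    " post_id INTEGER PRIMARY KEY, title TEXT,"
    " author_name TEXT, author_email TEXT)"
)
conn.executemany(
    "INSERT INTO post_denormalized VALUES (?, ?, ?, ?)",
    [
        (1, "First post", "Ann", "ann@old.example"),
        (2, "Second post", "Ann", "ann@old.example"),
    ],
)
# Updating only one of Ann's rows leaves the table contradicting itself.
conn.execute(
    "UPDATE post_denormalized SET author_email = 'ann@new.example'"
    " WHERE post_id = 1"
)
denormalized_emails = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT author_email FROM post_denormalized"
        " WHERE author_name = 'Ann'"
    )
)

# Normalized: the email is a fact about the author and is stored once.
conn.execute(
    "CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
)
conn.execute(
    "CREATE TABLE post ("
    " post_id INTEGER PRIMARY KEY, title TEXT,"
    " author_id INTEGER REFERENCES author(author_id))"
)
conn.execute("INSERT INTO author VALUES (1, 'Ann', 'ann@old.example')")
conn.executemany(
    "INSERT INTO post VALUES (?, ?, ?)",
    [(1, "First post", 1), (2, "Second post", 1)],
)
conn.execute("UPDATE author SET email = 'ann@new.example' WHERE author_id = 1")
normalized_emails = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT a.email FROM post AS p"
        " JOIN author AS a ON a.author_id = p.author_id"
        " WHERE a.name = 'Ann'"
    )
)
```

In the denormalized table Ann ends up with two different emails, while in the normalized pair of tables a single update is reflected on all of her posts.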
hope this helps.

Resources