Is there a design pattern for separating data into spaces? - database

I'm working on a problem where application data needs to be separated into spaces:
a production space
a demo space
a test space
a training space
...
Some product data is shared across spaces, but almost all user data should be read/write per space, controlled by a permission system.
Is there a common design pattern for this? I'm concerned about the risk of accidentally mixing data between spaces, e.g. test-space data ending up in the production space.
Thanks!

Related

Multiple JSON files or a single file with multiple arrays

I have a large blob (Azure) file with 10k JSON objects in a single array. This does not perform well because of its size. As I look to re-architect it, I can either create multiple files, each with a single array of 500-1000 objects, or I could keep the one file but split the single array into an array of arrays - maybe 10 arrays of 1000 objects each.
For simplicity, I'd rather break it into multiple files. However, I thought this was worth asking the question and seeing if there was something to be learned in the answers.
I would think this depends strongly on your use case. The multiple files or multiple arrays you create will partition your data somehow: will the partitions be used mostly together or mostly separately? I.e. will there be a lot of cases in which you only read one or a small number of the partitions?
If the answer is "yes, I will usually only care about a small number of partitions" then creating multiple files will save you having to deal with most of your data on most of your calls. If the answer is "no, I will usually need either 1.) all/most of my data or 2.) data from all/most of my partitions" then you probably want to keep one file just to avoid having to open many files every time.
I'll add: in this latter case, it may well turn out that the file structure (one array vs an array-of-arrays) doesn't change things very much, since a full scan is a full scan either way. If that's the case, then you may need to start thinking about how to move to the former case, where you partition your data so that your calls fall neatly within a few partitions, or about how to move to a different data format.
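If you do end up in the former case - reads usually touching only one or a few partitions - the splitting step itself is straightforward. A minimal sketch, assuming Jackson is on the classpath and using made-up file names:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;

import java.io.File;

public class SplitBlob {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Read the single large array (hypothetical local copy of the blob).
        ArrayNode all = (ArrayNode) mapper.readTree(new File("all-objects.json"));

        int chunkSize = 1000; // e.g. 10 files of 1000 objects each
        for (int start = 0, part = 0; start < all.size(); start += chunkSize, part++) {
            ArrayNode chunk = mapper.createArrayNode();
            for (int i = start; i < Math.min(start + chunkSize, all.size()); i++) {
                chunk.add(all.get(i));
            }
            // Each chunk becomes its own blob/file, named by partition index.
            mapper.writeValue(new File("objects-part-" + part + ".json"), chunk);
        }
    }
}

Reads then only have to open the part files they actually need, e.g. objects-part-3.json.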

How does SQL Server treat text comparisons with LEADING spaces?

Much is made (and easily found on the internet) of the fact that you do not need to use where rtrim(columnname) = 'value' in SQL Server, because it automatically considers a value with or without trailing spaces to be the same.
However, I've had a hard time finding info about LEADING spaces. What if (for whatever reason) our data warehouse has leading spaces on certain varchar/char fields and we need where clauses on them - do we still need where ltrim()? I'm trying to avoid that big performance hit by researching other options.
Thank You
Leading spaces are never ignored in comparisons for any text-based data type. If you are comparing text columns for equality, the best option is to validate your values on data entry to make sure that text with unwanted leading spaces is never allowed in. For example, if your database application expects the user to pick something from a list of possible values, do not let your user interfaces accept free-form text; force them to enter one of the explicit valid values. If you need the user to be able to enter free-form text but never want leading spaces, then strip them on the insert. Normalizing your database should prevent a lot of these types of issues.
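A small sketch of stripping on insert at the application layer - assuming JDBC and a hypothetical codes table and column; stripLeading() needs Java 11+, trim() works on older JDKs:

import java.sql.Connection;
import java.sql.PreparedStatement;

public class CodeDao {
    // Hypothetical insert that normalizes the value before it ever reaches the table.
    public void insertCode(Connection conn, String rawCode) throws Exception {
        String normalized = (rawCode == null) ? null : rawCode.stripLeading();
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO codes (code) VALUES (?)")) {
            ps.setString(1, normalized);
            ps.executeUpdate();
        }
    }
}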

Lots of little objects in the datastore adds a lot of metadata space usage?

I've been getting more into App Engine, and am a little concerned about the space used per object in the datastore (I'm using Java). It looks like for each record, the names of the object's fields are encoded as part of the object. Therefore, if I have lots of tiny records, the additional space used for encoding field names could grow to be quite a significant portion of my datastore usage - is this true?
If true, I can't think of any good strategies around this - for example, in a game application, I want to store a record each time a user makes a move. This will result in a lot of little objects being stored. If I were to rewrite and use a single object which stored all the moves a user makes, then serialize/deserialize time would increase as the moves list grows:
So:
// Lots of these?
class Move {
    String username;
    String moveaction;
    long timestamp;
}

// Or:
class Moves {
    String username;
    String[] moveactions;
    long[] timestamps;
}
Am I interpreting the tradeoffs correctly?
Thank you
Your assessment is entirely correct. You can reduce overhead somewhat by choosing shorter field names; if your mapping framework supports it, you can then alias the shorter names so that you can use more user-friendly ones in your app.
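If you're on the low-level datastore API rather than a mapping framework, one way to keep the code readable while storing short names is to keep the friendly names as constants. A sketch (the short property names here are illustrative):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;

public class MoveWriter {
    // Readable constants in code, short names on disk.
    static final String USERNAME = "u";
    static final String ACTION = "a";
    static final String TIMESTAMP = "t";

    public void saveMove(String username, String moveaction, long timestamp) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Entity move = new Entity("Move");
        move.setProperty(USERNAME, username);
        move.setProperty(ACTION, moveaction);
        move.setProperty(TIMESTAMP, timestamp);
        ds.put(move);
    }
}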
Your idea of aggregating moves into a single entity is probably a good one; it depends on your access pattern. If you regularly need to access information on only a single move, you're correct that the time spent will grow with the number of moves, but if you regularly access lists of sequential moves, this is a non-issue. One possible compromise is separating the moves into groups - one entity per hundred moves, for example.
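A sketch of that grouping compromise, again with the low-level API and made-up kind/property names - one entity per hundred moves, keyed by user and group index:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.KeyFactory;

import java.util.ArrayList;
import java.util.List;

public class MoveGroups {
    static final int GROUP_SIZE = 100;

    @SuppressWarnings("unchecked")
    public void appendMove(String username, int moveNumber, String moveaction, long timestamp) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // One entity per hundred moves, with a predictable key name.
        String keyName = username + ":" + (moveNumber / GROUP_SIZE);
        Entity group;
        try {
            group = ds.get(KeyFactory.createKey("MoveGroup", keyName));
        } catch (EntityNotFoundException e) {
            group = new Entity("MoveGroup", keyName);
        }
        List<String> actions = (List<String>) group.getProperty("actions");
        List<Long> timestamps = (List<Long>) group.getProperty("timestamps");
        if (actions == null) actions = new ArrayList<>();
        if (timestamps == null) timestamps = new ArrayList<>();
        actions.add(moveaction);
        timestamps.add(timestamp);
        group.setProperty("actions", actions);
        group.setProperty("timestamps", timestamps);
        ds.put(group);
    }
}

Fetching a user's recent history then means reading a handful of group entities instead of hundreds of tiny ones.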

Hadoop block size issues

I've been tasked with processing multiple terabytes worth of SCM data for my company. I set up a Hadoop cluster and have a script to pull data from our SCM servers.
Since I'm processing data in batches through the streaming interface, I came across an issue with block sizes that O'Reilly's Hadoop book doesn't seem to address: what happens to data straddling two blocks? How does the wordcount example get around this? To get around the issue so far, we've resorted to making our input files smaller than 64 MB each.
The issue came up again when thinking about the reducer script: how is the aggregated data from the maps stored? And would the issue come up when reducing?
This should not be an issue provided that each block can cleanly break apart the data for the splits (like by line break). If your data is not a line-by-line data set then yes, this could be a problem. You can also increase the size of the blocks on your cluster (dfs.block.size).
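If you do go the route of raising the block size, here is a sketch of setting it from a Java client when writing input files to HDFS; the paths and sizes are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBigBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for new files written with this configuration.
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Or set it explicitly per file:
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/scm/input/part-0000"), true, 4096, (short) 3, 256L * 1024 * 1024)) {
            out.writeBytes("one record per line\n");
        }
    }
}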
You can also customize, in streaming, how the inputs go into your mapper:
http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs
Data from the map step gets sorted based on a partitioner class applied to the key of the map.
http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
The data is then shuffled so that all values for the same map key end up together, and then transferred to the reducer. Optionally, a combiner can run before the reduce step.
Most likely you can create your own custom -inputreader (here is an example of how to stream XML documents: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html)
If you have multiple terabytes of input, you should consider setting the block size to even more than 128 MB.
If a file is bigger than one block, it can either be split, so that each block of the file goes to a different mapper, or the whole file can go to one mapper (for example, if the file is gzipped). But I guess you can set this using some configuration options.
Splits are taken care of automatically and you should not worry about them. Output from the maps is stored in a tmp directory on HDFS.
Your question about "data straddling two blocks" is what the RecordReader handles. The purpose of a RecordReader is threefold:
Ensure each k,v pair is processed
Ensure each k,v pair is only processed once
Handle k,v pairs which are split across blocks
What actually happens in (3) is that the RecordReader goes back to the NameNode, gets the handle of a DataNode where the next block lives, and then reaches out via RPC to pull in that full block and read the remaining part of that first record up to the record delimiter.
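A simplified, self-contained illustration of that boundary rule - not the real Hadoop classes, just the logic: given a byte range [start, end) that stands in for one block's split, skip the partial first line unless the split starts at offset 0, and read past end to finish the last record.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitReaderSketch {
    // Returns the complete records "owned" by the split [start, end) of data.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // A split that does not begin the file skips its partial first line;
        // that line belongs to the previous split, which reads past its end to finish it.
        if (start != 0) {
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        // Read whole lines; the last one may extend beyond 'end' into the next block.
        while (pos < data.length && pos < end) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbravo\ncharlie\n".getBytes(StandardCharsets.UTF_8);
        // Pretend the block boundary falls in the middle of "bravo".
        System.out.println(readSplit(file, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(file, 8, 20));  // [charlie]
    }
}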

Should I trim strings before storing in the database?

The situation I have run into is this: when storing the Code of an entity (which must be unique in the database), someone could technically put "12345" and "12345 " as codes, and the database would consider them distinct; but to the end user, the trailing space is invisible, so the codes look like duplicates and could cause confusion.
In this case, I would definitely trim before storing.
Should this become the standard for all strings?
I would think that unless the space is important to the data, you should remove it.
This is one of those questions whose answer is "It depends".
What you need to keep in mind here is the principle of least astonishment. A user would be very astonished to see two codes that look identical, especially since, when you display them in a form or table, the space at the end essentially vanishes. The user is also expecting these codes to be unique, and they're probably expecting your system to enforce this. For a user, a space is not really something that they expect to cause a difference.
In other cases, for example in a Content Management System or a word processor, when the user is consciously putting in spaces, they expect the underlying data store to persist those spaces. Here the user is probably putting in spaces to align content or for visual purposes, and removing spaces at the end would astonish them.
So always look to model the user's workflow as far as possible.
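For the code case specifically, a sketch of normalizing once at the write boundary, so the unique constraint sees exactly what the user sees (the class and field names are made up):

public class ProductCode {
    private final String value;

    public ProductCode(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("code is required");
        }
        // Normalize once, at the boundary, so "12345" and "12345 " can never coexist.
        this.value = raw.trim();
        if (this.value.isEmpty()) {
            throw new IllegalArgumentException("code must not be blank");
        }
    }

    public String value() {
        return value;
    }
}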
Put a constraint on the table that disallows leading spaces if you use varchar - remember GIGO?
If codes are numeric, then use a numeric datatype.
