Store attribute as binary or String (JSON) - database

I have to store some attributes in DynamoDB and am unsure whether some of the JSON attributes should be stored as String or Binary. I understand that storing them as binary will reduce the size of the attribute.
I considered DDB limits as 1 Read/Write IOPS consumes 4KB.
My total data in item is less than 4KB even if I store it as String.
What things should I consider to choose binary vs String ?

Given that your item sizes are less than 4KB uncompressed, whether to encode attributes as binary or string depends on whether the attribute will be a partition or sort (range) key of the table, and on your typical read patterns.
A partition key has a maximum size of 2048 bytes (2 KB).
A sort key (if you specify one on the table) has a maximum size of 1024 bytes (1 KB).
If you foresee your string attribute exceeding these maximums on any item, it would make sense to compress it to binary first to keep your attribute sizes within DynamoDB's limits.
Depending on how many items are in your typical query and your tolerance for throttled queries, your provisioned RCUs may not be enough to satisfy a Query or Scan where you perform the read in a single request.
For instance, if you have 1KB items and want to query 100 items in a single request, your RCU requirement will be as follows:
(100 * 1024 bytes = 100 KB) / 4 KB = 25 read capacity units
Converting some attributes to binary could reduce your RCU requirement in this case. Again it largely depends on your typical usage pattern.
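To make the arithmetic above concrete, here is a minimal sketch. The helper name query_rcus and the sample attribute are made up for illustration, and the calculation assumes strongly consistent reads that are billed in 4 KB steps on the total size read. The zlib step only shows how a JSON string could be turned into bytes suitable for a Binary attribute; how much it shrinks depends entirely on your data.

```python
import json
import math
import zlib

def query_rcus(item_count, item_bytes, block_bytes=4096):
    """Rough RCU estimate for one strongly consistent Query (assumption)."""
    total_bytes = item_count * item_bytes
    return math.ceil(total_bytes / block_bytes)

# 100 items of 1 KB each, as in the example above
print(query_rcus(100, 1024))

# Hypothetical JSON attribute, stored either as a String or as compressed Binary
attribute = {"tags": ["sensor"] * 40, "notes": "ok " * 100}
as_string = json.dumps(attribute).encode("utf-8")
as_binary = zlib.compress(as_string)
print(len(as_string), len(as_binary))

# The reader has to reverse the steps to get the JSON back
restored = json.loads(zlib.decompress(as_binary))
```

Small or high-entropy attributes may not shrink at all, so it is worth measuring on real items before committing to Binary.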


Why does a tfidf object take so much space?

I have roughly 100,000 long articles totalling about 5GB of text. When I run TfidfVectorizer
from sklearn it constructs a model of 6GB. How is that possible? Don't we only need to store the document frequency of those 4000 words and what those 4000 words are? I am guessing TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some settings wrong?
A TF-IDF matrix shape is (number_of_documents, number_of_unique_words). So for each document you get a feature for each word from the dataset. It can get bloated for large datasets.
In your case
(100000 (docs) * 4000 (words) * 8 (np.float64 bytes))/1024**3 ~ 3 GB
Moreover, sklearn's TfidfVectorizer by default compensates for this by returning a sparse matrix (scipy.sparse.csr_matrix). Even for long documents the matrix tends to contain lots of zeros, so it is usually an order of magnitude smaller than the dense size. If I am correct, it should be well below 3 GB.
So the question is: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
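As a rough check of the numbers above, the dense and sparse sizes can be estimated with plain arithmetic. This is only a sketch: the function names are made up, the 5% density is an assumed figure (measure it on your own matrix), and the CSR estimate assumes 8-byte values with 4-byte indices.

```python
def dense_bytes(n_docs, n_features, itemsize=8):
    """Size of a dense float64 matrix of shape (n_docs, n_features)."""
    return n_docs * n_features * itemsize

def csr_bytes(n_docs, n_features, density, itemsize=8, index_size=4):
    """Approximate size of a CSR matrix: data + column indices + row pointers."""
    nnz = round(n_docs * n_features * density)
    return nnz * (itemsize + index_size) + (n_docs + 1) * index_size

GIB = 1024 ** 3
dense = dense_bytes(100_000, 4_000)
sparse = csr_bytes(100_000, 4_000, density=0.05)  # 5% non-zeros is an assumption
print(f"dense:  {dense / GIB:.2f} GiB")
print(f"sparse: {sparse / GIB:.2f} GiB")
```

If the pickled object is far larger than either figure, the extra space is probably not the matrix itself.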
If you don't care about individual word frequencies you can decrease the vector size using PCA or other techniques.
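from sklearn.decomposition import PCA
# PCA needs a dense array, which can itself be very large for a big corpus
dense_matrix = tf_idf_matrix.toarray()
components_number = 300
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)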
Or you can use something like doc2vec.
Using it you'll get a matrix of shape (number_of_documents, embedding_size). The embedding size is usually between 100 and 600. You can train a doc2vec model without storing individual word vectors using the dbow_words parameter.
If you care about individual word features, the only reasonable solution that I see is to decrease the amount of words.
Relevant stackoverflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy so much space. I suppose it is only possible if you have some heavy objects in the TfidfVectorizer tokenizer or preprocessor attributes.
import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # Large float64 array that gets pickled along with the bound method
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        return text.lower().split()

tokenizer = Tokenizer()
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
with open("vectorizer.pcl", "wb") as f:
    pickle.dump(vectorizer, f)
This will occupy more than 700 MB after pickling.
I know there is already an answer, but here is some additional information for others to consider. When you pickle the TfidfVectorizer directly, you also save the vectorizer's stop_words_ attribute, which is not needed once the vocabulary is established. In one of our models there were 3000 words in the vocabulary, but the saved model occupied 250MB. Inspecting the model, we saw that 10 million stop words were also stored with it. Then we saw the following warning in the TfidfVectorizer documentation:
"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."
Applying that reduced our model size significantly.
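The effect is easy to reproduce without scikit-learn. The sketch below uses a made-up stand-in class (FakeVectorizer is hypothetical, not the real TfidfVectorizer) that carries a small vocabulary and a large set of discarded terms, and compares the pickle size before and after clearing the large attribute.

```python
import pickle

class FakeVectorizer:
    """Hypothetical stand-in for a fitted vectorizer; not scikit-learn."""
    def __init__(self, vocabulary, dropped_terms):
        self.vocabulary_ = vocabulary      # what is needed to transform new text
        self.stop_words_ = dropped_terms   # kept for introspection only

vocab = {f"word{i}": i for i in range(3000)}
dropped = {f"rare{i}" for i in range(200_000)}
model = FakeVectorizer(vocab, dropped)

size_before = len(pickle.dumps(model))
model.stop_words_ = None                   # or: delattr(model, "stop_words_")
size_after = len(pickle.dumps(model))
print(size_before, size_after)
```

With a real vectorizer the same idea applies, as long as the attribute exists in your scikit-learn version and you do not need it after fitting.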

Compressing a sparse bit array

I have arrays of 1024 bytes (8192 bits) which are mostly zero.
Between 0.01% and 10% of bits will be set (random, no pattern).
How could these be compressed, given the lack of structure and the relatively small size?
(My first thought was to store the distances between set bits. I need 13 bits for each distance, but at worst case 10% occupancy this needs 13 * 816 / 8 = 1326 bytes, which is not an improvement.)
This is for ultra-low bandwidth comms, so every byte matters.
I've dealt deeply with a similar problem, but my sets are much bigger (30 million possible values with between 1 and 30 million elements in each set), so they both gain much more from compression and the compression metadata is insignificant compared to the size of the data. I have never gone down to squeezing things into units smaller than uint16_t, so the things I write below might not apply if you start chopping up 13 bit values into pieces. It feels like it should work, but caveat emptor.
What I've found works is to employ several strategies that depend on the particular data we have. The good news is that the count of elements in each set is a very good indicator of which compression strategy will work best for a particular set. So all the metadata you need is a count of elements in the set. In my data format the first and only metadata value (I'll be unspecific and just call it "value", you can squeeze things in bytes, 16 bit values or 13 bit values however you feel) is the count of elements in the set, the rest is just the encoding of the set elements.
The strategies are:
1. If very few elements are in the set, you can't do better than an array that says "1, 4711, 8140", so in this case the data is encoded as: [3, 1, 4711, 8140]
2. If almost all elements are in the set, you can just keep track of the elements that aren't. For example [8190, 17, 42].
3. If around half of the elements are in the set, you pretty much can't do better than a bitmap, so you get [4000, {bitmap}]. This is the only case where your data ends up being longer than strictly uncompressed.
4. If more than "a few" but many fewer than "around half" of the elements are set, I found another strategy. Divide the bits of your possible values in the set in half. Let's say we have 2^16 possible values (it's easier to describe, and it should probably work for 2^13). The values are divided into 256 ranges, each with 256 possible values. We then have an array of 256 bytes, where each byte describes how many values are in each range (so byte 0 tells us how many elements are in [0,255], byte 1 gives us [256,511], etc.). Immediately after follow arrays with the values in each range mod 256. The trick here is that while every element in the set encoded as an array (strategy 1) would be 2 bytes, in this scheme each element is only 1 byte, plus 256 static bytes for the counts of elements. This means that as soon as we have more than 256 elements in the set, switching from strategy 1 to 4 saves us space.
5. Strategy 4 can be refined (probably meaningless if your data is random as you mention, but my data sometimes had more patterns, so it worked for me). Since we still need 8 bits for each element in the previous encoding, as soon as a sub-array of elements goes over 32 elements (256 bits), we can store it as a bitmap instead. This is also a good breakpoint for switching from strategies 4/5 to 3. If all the arrays in this strategy are just bitmaps, then we should just use strategy 3 (it's more complicated than that, but the breakpoints between strategies can be precomputed accurately enough that you'll end up picking the most likely efficient strategy each time).
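Strategy 4 can be sketched roughly as follows for 2^16 possible values. The function names are made up and this is not the format I actually use. Note one caveat of a one-byte count: a range holding all 256 values cannot be represented, so the sketch simply rejects that case (in practice that is where the bitmap refinement would take over).

```python
def encode_split(values):
    """Encode ints in [0, 65536) as 256 per-range counts followed by low bytes."""
    counts = [0] * 256
    lows = bytearray()
    for v in sorted(set(values)):
        hi, lo = divmod(v, 256)
        counts[hi] += 1
        lows.append(lo)
    if max(counts) > 255:
        raise ValueError("a full range does not fit in a one-byte count")
    return bytes(counts) + bytes(lows)

def decode_split(blob):
    """Inverse of encode_split: returns the sorted list of values."""
    counts, pos, out = blob[:256], 256, []
    for hi, n in enumerate(counts):
        for lo in blob[pos:pos + n]:
            out.append(hi * 256 + lo)
        pos += n
    return out

sample = [1, 255, 256, 4711, 8140, 65535]
blob = encode_split(sample)
print(len(blob), decode_split(blob) == sample)
```

The encoded size is always 256 bytes plus one byte per element, which is why it only pays off against the 2-bytes-per-element array once the set has more than 256 elements.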
I have only vaguely tried saving deltas between numbers in the set. Quick experiments showed that they weren't really much more efficient than the strategies I mentioned above, had unpredictable degenerate cases, but most importantly, the application I work with really likes to not have to deserialise its data, just use it raw straight from disk (mmap).

Worth a unique table for database values that repeat ~twice?

I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question in a more concrete way: Will I gain a lot more efficiency by separating out this field into its own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"
If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it — and you could squeeze more if you're sure you won't need to go above 65535 distinct entries by using 2-byte integers instead of 4 byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice the half megabyte in the multi-gigabyte storage that's available, then it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means that you have to make sure that the pairs of names are handled correctly — or, more likely, you ignore the possibility and you end up without pairs where you should have pairs, or you end up with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4 byte integer, using two tables will save nothing; it will cost you space.
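The same arithmetic as a small sketch, in case you want to plug in your own numbers. The function names are made up, and the figures ignore per-row overhead and index space, exactly as the rough calculation above does.

```python
def single_table_bytes(rows, avg_len):
    """Value stored inline in every row of the main table."""
    return rows * avg_len

def two_table_bytes(rows, distinct, avg_len, key_bytes=4):
    """Integer key in the main table plus a lookup table of (key, value)."""
    return rows * key_bytes + distinct * (key_bytes + avg_len)

old = single_table_bytes(60_000, 30)
new4 = two_table_bytes(60_000, 30_000, 30, key_bytes=4)
new2 = two_table_bytes(60_000, 30_000, 30, key_bytes=2)
print(old, new4, new2)

# If the repeated value is itself a 4-byte integer, two tables cost more
print(single_table_bytes(60_000, 4), two_table_bytes(60_000, 30_000, 4))
```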
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.

How big is AppEngine BlobStore meta-data per entry?

I'm trying to get a handle on the data-overhead required to store a blob in AppEngine's BlobStore.
Let's say I save a 1KB blob, how many bytes will that cost me in BlobStore and in DataStore respectively?
In other words: How big does an entity need to be, before it's worth it to move it to BlobStore?
The answer to this question is not documented, but you can do a bit of guess-work to get a minimum overhead per blob.
Each blob created requires a blob info and a blob key. The blob key, I believe, is 500 bytes. The blob-info has a content_type (string), creation time (datetime), filename (string), and size (integer). We can assume that each string uses 1 more byte than its length. Assuming you do not use the optional filename or content type fields, the blob-info items will be approximately 1 byte, 8 bytes, 1 byte, and 8 bytes, respectively, totaling 18 bytes.
Therefore, the minimum likely overhead for a blob item will be at least 518 bytes per blob, stored in the datastore. But we're not done, we still need to figure out the optimal pricing.
Pricing for the blob-store per month (with blob_file_size in bytes) will be:
= $0.13/GB * blob_file_size + 518 bytes * $0.24/GB
= blob_file_size/1024**3*0.13 + 0.000000115782
Whereas pricing for storage entirely in the datastore is:
= blob_file_size/1024**3*0.24
The break-even point where the two cost the same amount is 1130.2 bytes. Any more and the blobstore will be cheaper; any less and the datastore will be cheaper. Of course, this is based on the minimum overhead of 518 bytes, and I would bet the overhead will often be higher, so maybe a rule of thumb would be 2 KB.
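The break-even point follows from setting the two monthly costs equal and solving for the blob size. A minimal sketch, with a made-up function name and using the guessed 518-byte overhead and the prices quoted above as defaults:

```python
def break_even_bytes(overhead_bytes=518, blob_price=0.13, datastore_price=0.24):
    """Blob size at which blobstore and datastore storage cost the same.

    Solves: blob_price * size + datastore_price * overhead = datastore_price * size
    Prices are per GB per month; the unit cancels out.
    """
    return overhead_bytes * datastore_price / (datastore_price - blob_price)

print(round(break_even_bytes(), 1))
# A larger (assumed) overhead moves the break-even point up proportionally
print(round(break_even_bytes(overhead_bytes=1000), 1))
```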

What is the length of time to send a list of 200,000 integers from a client's browser to an internet server?

Over the connections that most people in the USA have in their homes, what is the approximate length of time to send a list of 200,000 integers from a client's browser to an internet server (say Google app engine)? Does it change much if the data is sent from an iPhone?
How does the length of time increase as the size of the integer list increases (say with a list of a million integers) ?
Context: I wasn't sure if I should write code to do some simple computations and sorting of such lists for the browser in javascript or for the server in python, so I wanted to explore this issue of how long it takes to send the output data from a browser to a server over the web in order to help me decide where (client's browser or app engine server) is the best place for such computations to be processed.
More Context:
Type of Integers: I am dealing with 2 lists of integers. One is a list of ids for the 200,000 objects whose integers look like {0,1,2,3,...,99,999}. The second list of 100,000 is just single digits {...,4,5,6,7,8,9,0,1,...} .
Type of Computations: From the browser a person will create her own custom index (or rankings) based on changing the weights associated with about 10 variables referenced to the 100,000 objects. INDEX = w1*Var1 + w2*Var2 + ... + wN*VarN. So the computations are the multiplication of a vector (array) by a scalar and the addition of 2 vectors, as well as sorting the final INDEX variable vector of 100,000 values.
In a nutshell...
This is probably a bad idea,
in particular for mobile devices where, aside from the delay associated with the transfer(s), limits and/or extra fees for exceeding a plan's monthly data volume make this a poor option economically.
A rough estimate (more info below) is that the one-way transmission takes between 0.7 and 5 seconds.
There is a lot of variability in this estimate, due mainly to two factors
Network technology and plan
the compression ratio which can be obtained for 200k integers.
Since the network characteristics are more or less a given, the most significant improvement would come from the compression ratio. This in turn depends greatly on the statistical distribution of the 200,000 integers. For example, if most of them are smaller than say 65,000, it would be quite likely that the list would compress to about 25% of its original size (75% size reduction). The time estimates provided assumed only a 25 to 50% size reduction.
Another network consideration is the availability of binary mime extension (8 bits mime) which would avoid the 33% overhead of B64 for example.
Other considerations / ideas:
This type of network usage for iPhone / mobile devices plans will not fare very well!!!
ATT will love you (maybe), your end-users will hate you at least the ones with plan limits, which many (most?) have.
Rather than sending one big list, you could split the list over 3 or 4 chunks, allowing the server-side sorting to take place [mostly] in parallel to the data transfer.
One gets better compression ratio for integers when they are [roughly] sorted, maybe you can have a first pass sorting of some kind client-side.
How do I figure? ...
1) Amount of data to transfer (one-way)
200,000 integers
= 800,000 bytes (assumes 4 bytes integers)
= 400,000 to 600,000 bytes compressed (you'll want to compress!)
= 533,000 to 800,000 bytes in B64 format for MIME encoding
2) Time to upload (varies greatly...)
Low-end home setup (ADSL) = 3 to 5 seconds
broadband (eg DOCSIS) = 0.7 to 1 second
iPhone = 0.7 to 5 seconds possibly worse;
possibly a bit better with high-end plan
3) Time to download (back from server, once list is sorted)
Assume same or slightly less than upload time.
With portable devices, the differential is more notable.
The question is unclear about what would have to be done with the resulting
(sorted) array; so I didn't worry too much about the "return trip".
==> Multiply by 2 (or 1.8) for a safe estimate of a round trip, or inquire
about specific network/technology.
By default, typically integers are stored in a 32-bit value, or 4 bytes. 200,000 integers would then be 800,000 bytes, or 781.25 kilobytes. It would depend on the client's upload speed, but at 640Kbps upload, that's about 10 seconds.
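These estimates are simple enough to script. A minimal sketch with made-up function names: it ignores latency, protocol overhead and congestion, and the 640 Kbps uplink is just the figure used above.

```python
def upload_seconds(payload_bytes, uplink_kbps):
    """Idealised transfer time: payload bits divided by uplink bits per second."""
    return payload_bytes * 8 / (uplink_kbps * 1000)

def base64_len(n_bytes):
    """Length of the Base64 encoding (with padding) of n_bytes of data."""
    return 4 * ((n_bytes + 2) // 3)

raw = 200_000 * 4                       # 4-byte integers
print(upload_seconds(raw, 640))         # uncompressed, 640 Kbps uplink
print(base64_len(400_000), base64_len(600_000))
print(upload_seconds(1_000_000 * 4, 640))  # a million integers scales linearly
```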
Well, that is 800,000 bytes or 781.3 KB, or you could say the size of a normal JPEG photo. For broadband, that would be within seconds, and you could always consider compression (there are libraries for this).
The time increases linearly with the amount of data.
Since you're sending the data from JavaScript to the server, you'll be using a text representation. The size will depend a lot on the number of digits in each integer. Are we talking about 200,000 two to three digit integers or six to eight digit integers? It also depends on whether HTTP compression is enabled and whether Safari on the iPhone supports it (I'm not sure).
The amount of time will be linear depending on the size. Typical upload speeds on an iPhone will vary a lot depending on if the user is on a business wifi, public wifi, home wifi, 3G, or Edge network.
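To get a feel for the text representation, you can build lists shaped like the ones in the question and measure them. This sketch runs in Python rather than in the browser, uses made-up sample data (sequential ids and cycling digits, which compress unusually well), and uses zlib as a stand-in for whatever compression the client and server agree on.

```python
import json
import zlib

ids = list(range(100_000))                   # ids like 0, 1, 2, ..., 99999
digits = [i % 10 for i in range(100_000)]    # single-digit values

# Compact JSON, roughly what a browser would send as text
text = json.dumps({"ids": ids, "digits": digits}, separators=(",", ":")).encode("utf-8")
packed = zlib.compress(text, 9)
print(len(text), len(packed))

# The server side reverses the steps
received = json.loads(zlib.decompress(packed))
```

Real ids and values will be less regular than this sample, so expect a worse ratio and measure with your own data.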
If you're so dependent on performance perhaps this is more appropriate for a native app than an HTML app. Even if you don't do the calculations on the client, you can send/receive binary data and compress it which will reduce time.
