Does anyone have a practical downloadable/viewable example of a 16 MB (maximum size) MongoDB document?
It should be a lot of data, but I'm trying to get a feel for how much data you can store in a 16 MB document,
like a "How many SQL rows of a 10-column table would that be?" sort of question.
Thanks
You can calculate the size of various documents using the BSON spec.
For example, a document {a:1} consisting of one key with an integer value would take 5+1+2+4=12 bytes.
You can use various drivers to convert your data to BSON to see how much space it actually takes up:
serene% irb -rbson
irb(main):001:0> {a:1}.to_bson.to_s
=> "\f\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
irb(main):002:0> {a:1}.to_bson.to_s.length
=> 12
If you have, let's say, documents which are flat (non-nested) mappings with keys that are 10 bytes long and 64-bit integer values, each key-value pair takes up 1+10+1+8=20 bytes. You can have about 800,000 such key-value pairs in a single document.
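The byte counts above can be reproduced without a driver. The following is a minimal sketch that hand-encodes a flat document of integer values according to the BSON layout described here; the function name `encode_flat_int_document` and the constant `MAX_BSON_SIZE` are made up for this illustration, and it handles only int32/int64 values, not the full BSON type set.

```python
import struct

MAX_BSON_SIZE = 16 * 1024 * 1024  # MongoDB's 16 MB document limit, in bytes

def encode_flat_int_document(doc, int64=False):
    """Hand-encode a flat {str: int} mapping as BSON (illustrative helper)."""
    type_byte = b"\x12" if int64 else b"\x10"  # 0x12 = int64, 0x10 = int32
    fmt = "<q" if int64 else "<i"              # little-endian 8 or 4 bytes
    body = b""
    for key, value in doc.items():
        # element = type byte + key as a null-terminated string + value
        body += type_byte + key.encode("utf-8") + b"\x00" + struct.pack(fmt, value)
    # document = int32 total length + elements + trailing null byte (5 bytes overhead)
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

encoded = encode_flat_int_document({"a": 1})
print(len(encoded))  # 12

# 10-byte keys with 64-bit integer values: 1 + 10 + 1 + 8 = 20 bytes per pair
pair_size = 1 + 10 + 1 + 8
max_pairs = (MAX_BSON_SIZE - 5) // pair_size
print(max_pairs)  # 838860
```

For a table with 10 such columns stored as flat pairs, that is on the order of 80,000 rows, before counting any per-row nesting overhead.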
As Wernfried Domscheit said, https://ourworldindata.org/coronavirus-source-data provides a JSON file with a lot of data to view and test with.
I suggest importing the data into MongoDB Compass to view all the documents clearly.
This question may be a little vague, but let me try to explain it clearly. I have been reading a database related tutorial, and it mentioned tables are serialized to bytes to be persisted on the disk. When we deserialize them, we can locate each column based on the size of its type.
For example, we have a table:
---------------------------------------------------
| id (unsigned int 8) | timestamp (signed int 32) |
---------------------------------------------------
| Some Id | Some time |
---------------------------------------------------
When we are deserializing a byte array loaded from a file, we know the first 8 bits are the id, and the following 32 bits are the timestamp.
But the tutorial never mentioned how strings are handled in databases. They are not limited to a specific size, like 32 bits, and their size is not predictable (there can always be a very long string, who knows). So how exactly do databases handle strings?
I know that in an RDBMS you need to specify the size of the string, as VARCHAR(45) for example, which makes it easier. But what about databases like MongoDB or Redis, which do not require a size specification for strings? Do they just assume a specific length and increase the size once a longer one comes in?
That is basically my vague, non-specific question. I hope someone can give me some ideas on this. Thank you very much.
In MongoDB, documents are serialized as BSON (Binary JSON-like objects). See the BSON spec for details on how each data type is encoded.
A string is stored as:
<unsigned32 strsizewithnull><cstring>
From this line in the MongoDB source.
So a string field is stored with its length (including the null terminator) in the BSON object. The string itself is UTF-8 encoded as per the BSON spec, so it can take a variable number of bytes per character. Together with the other fields that make up a document, it is compressed using Snappy by default. This compressed representation is the one persisted to disk.
WiredTiger is a no-overwrite storage engine. If that document is updated, WiredTiger creates a new document, updates the internal pointer to the new one, and marks the old document as "space available for reuse".
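The length-prefixed string layout described above can be sketched in a few lines. This is an illustration of the value encoding only (no element type byte or field name), and the helper names `encode_bson_string_value` and `decode_bson_string_value` are made up for this example rather than taken from any driver.

```python
import struct

def encode_bson_string_value(text):
    """Length prefix (4 bytes, little-endian, counts the null) + UTF-8 bytes + null."""
    data = text.encode("utf-8")
    return struct.pack("<i", len(data) + 1) + data + b"\x00"

def decode_bson_string_value(buf, offset=0):
    """Return (text, offset of the byte after this string)."""
    (size,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    # size includes the trailing null byte, so drop one byte when decoding
    text = buf[start:start + size - 1].decode("utf-8")
    return text, start + size

print(encode_bson_string_value("hi"))  # b'\x03\x00\x00\x00hi\x00'
```

Because the length comes first, a reader never has to scan for the end of the string or assume a maximum size; it reads four bytes, then exactly that many more.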
I have roughly 100,000 long articles totaling about 5 GB of text. When I run TfidfVectorizer from sklearn, it constructs a model of 6 GB. How is that possible? Don't we only need to store the document frequency of those 4000 words and what those 4000 words are? I am guessing TfidfVectorizer stores such a 4000-dimensional vector for every document. Is it possible that I have some setting set wrong?
A TF-IDF matrix shape is (number_of_documents, number_of_unique_words). So for each document you get a feature for each word from the dataset. It can get bloated for large datasets.
In your case:
(100000 (docs) * 4000 (words) * 8 (np.float64 bytes))/1024**3 ~ 3.0 GB
Moreover, scikit-learn's TfidfVectorizer compensates for this by default by returning a sparse matrix (scipy.sparse.csr_matrix). Even for long documents the matrix tends to contain lots of zeros, so it is usually an order of magnitude smaller than the dense size. If I am correct, it should be well under 3 GB.
So the question is: do you really have only 4000 words in your model (controlled by TfidfVectorizer(max_features=4000))?
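To make the size estimate concrete, here is a rough back-of-the-envelope sketch. It assumes 8-byte np.float64 values and 4-byte integer indices for the CSR arrays, and the figure of 500 distinct vocabulary words per document is purely hypothetical; the function names are made up for this example.

```python
def dense_bytes(n_docs, n_features, itemsize=8):
    """Size of a dense (n_docs, n_features) matrix of itemsize-byte values."""
    return n_docs * n_features * itemsize

def csr_bytes(n_docs, nnz, itemsize=8, index_size=4):
    """Approximate size of a CSR matrix: data + column indices + row pointers."""
    return nnz * itemsize + nnz * index_size + (n_docs + 1) * index_size

GIB = 1024 ** 3
n_docs, n_features = 100_000, 4_000

dense = dense_bytes(n_docs, n_features)
# Hypothetical density: each document contains 500 of the 4000 vocabulary words
sparse = csr_bytes(n_docs, nnz=n_docs * 500)

print(f"dense:  {dense / GIB:.2f} GiB")
print(f"sparse: {sparse / GIB:.2f} GiB")
```

Either way the result stays far below 6 GB for a 4000-word vocabulary, which is why the actual vocabulary size is worth checking.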
If you don't care about individual word frequencies you can decrease the vector size using PCA or other techniques.
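from sklearn.decomposition import PCA
# PCA needs a dense array, so this only works if the dense matrix fits in memory
dense_matrix = tf_idf_matrix.toarray()
components_number = 300
reduced_data = PCA(n_components=components_number).fit_transform(dense_matrix)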
Or you can use something like doc2vec. https://radimrehurek.com/gensim/models/doc2vec.html
Using it, you'll get a matrix of shape (number_of_documents, embedding_size). The embedding size is usually between 100 and 600. You can train a doc2vec model without storing individual word vectors using the dbow_words parameter.
If you care about individual word features, the only reasonable solution that I see is to reduce the number of words.
Relevant Stack Overflow posts:
----On dimensionality reduction
How do i visualize data points of tf-idf vectors for kmeans clustering?
----On using generators to train TFIDF
Sklearn TFIDF on large corpus of documents
How to get tf-idf matrix of a large size corpus, where features are pre-specified?
tf-idf on a somewhat large (65k) amount of text files
The model itself should not occupy so much space. I suppose it is possible only if you have some heavy objects in the TfidfVectorizer's tokenizer or preprocessor attributes.
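import pickle
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

class Tokenizer:
    def __init__(self):
        # Heavy attribute: 10000 x 10000 float64 values, about 800 MB
        self.s = np.random.uniform(0, 1, size=(10000, 10000))

    def tokenizer(self, text):
        text = text.lower().split()
        return text

tokenizer = Tokenizer()
# The bound method drags the whole Tokenizer instance into the pickle
vectorizer = TfidfVectorizer(tokenizer=tokenizer.tokenizer)
with open("vectorizer.pcl", "wb") as f:
    pickle.dump(vectorizer, f)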
This will occupy more than 700 MB after pickling.
I know there is already an answer, but here is some additional information for others to consider. When you pickle a TfidfVectorizer directly, you also save the vectorizer's stop_words_ attribute, which is not needed once the vocabulary is established. In one of our models there were 3000 words in the vocabulary, but the saved model occupied 250 MB. Inspecting the model, we saw that 10 million stop words were also stored with it. Then we found the following warning in the TfidfVectorizer documentation:
"The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling."
Applying that reduced our model size significantly.
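The effect is easy to reproduce with a stand-in object. The sketch below deliberately does not use scikit-learn: `model` is a plain `SimpleNamespace` with made-up contents that merely mimics a small `vocabulary_` next to a large `stop_words_` set, so the sizes are illustrative only.

```python
import pickle
from types import SimpleNamespace

# Stand-in for a fitted vectorizer (NOT a real TfidfVectorizer):
# a small vocabulary next to a large set of ignored terms.
model = SimpleNamespace(
    vocabulary_={f"word{i}": i for i in range(3000)},
    stop_words_={f"rare{i}" for i in range(200_000)},
)

size_before = len(pickle.dumps(model))

# The ignored terms are only there for introspection, so drop them before pickling
model.stop_words_ = None
size_after = len(pickle.dumps(model))

print(size_before, size_after)
```

With a real vectorizer the same idea applies: clear the attribute on the fitted object, then pickle it.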
Say 100 million digits, one string.
The purpose is to query the DB to find recurrence of a search string.
While I know that the LONGTEXT type in MySQL would allow the string to be stored, I am not sure that querying for a substring would give acceptable performance.
Would a NoSQL key-value model perform better?
Any suggestions or experience would be welcome (it does not have to be pi).
This may be taking you in the wrong direction but...
Using MySQL seems to be a high-overhead way of solving the specific problem of finding a string in a file.
100M digits, at 8 bits (one byte) each, is only 100 MB. In Python, you could write the file as a sequence of bytes, each byte representing one ASCII digit. Or you could pack them into nibbles (4 bits will cover digits 0-9), which halves the size.
In Python, you can read the file using:
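filename = "yourfilenamehere"  # placeholder: path to your digits file
number_of_digit_you_want = 0   # placeholder: zero-based position of the digit
with open(filename, "rb") as fInput:
    fInput.seek(number_of_digit_you_want)
    digit = fInput.read(1)  # to get a single byte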
From this, it is easy to build a search solution to seek out a specific string.
Again, MySQL might be the way to go, but for your application a lower-overhead method might be the ticket.
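One possible shape for such a search, as a minimal sketch: it assumes one ASCII digit per byte, reads the file in chunks, and keeps a small overlap so that matches spanning a chunk boundary are not missed. The name `find_occurrences` and the chunk size are arbitrary choices for this example.

```python
import os
import tempfile

def find_occurrences(path, pattern, chunk_size=1 << 20):
    """Yield the zero-based byte offsets at which `pattern` occurs in the file."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1
    with open(path, "rb") as f:
        tail = b""  # last `overlap` bytes of the previous window
        base = 0    # file offset of the first byte of the current window
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            window = tail + chunk
            start = window.find(pattern)
            while start != -1:
                yield base + start
                start = window.find(pattern, start + 1)
            # The tail is shorter than the pattern, so no match is reported twice
            tail = window[-overlap:] if overlap else b""
            base += len(window) - len(tail)

# Demo on a small temporary file; chunk_size=4 forces matches to span chunks
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"3141592653589793141")
try:
    offsets = list(find_occurrences(tmp.name, b"314", chunk_size=4))
finally:
    os.remove(tmp.name)
print(offsets)  # [0, 15]
```

For a 100 MB file, reading everything into memory and calling `bytes.find` in a loop would also work; the chunked version just keeps memory use flat.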
I am using Redis to store some information and detect changes in that information over time (for example, think users and locations). What is the value of using a longer or shorter key name? Using a longer key is clearer, but is there much cost in memory or performance to using a longer key name?
Here are examples:
SET L:123456 "<name> <latitude> <longitude> ..."
HSET U:987654321 loc 123456 time <epoch>
or
SET loc:{123456} "<name> <latitude> <longitude> ..."
HSET user:{U987654321} loc 123456 time <epoch>
It all depends on how you are going to use it.
If every byte counts, for example when you have to pay for each kB transferred to a cloud service, you can calculate the costs. The maths is simple; a byte is a byte 'on the wire'. Inside Redis, for larger values it is equally simple. For smaller values, Redis does some memory optimization.
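As a rough illustration of that 'on the wire' arithmetic, the sketch below compares the two key styles from the question. It counts raw key bytes only; Redis's own per-key overhead and its small-value optimizations are not modeled, and the figure of one million keys is hypothetical.

```python
def key_bytes_saved(long_key, short_key, n_keys):
    """Raw bytes saved by the shorter key name across n_keys keys."""
    per_key = len(long_key.encode("utf-8")) - len(short_key.encode("utf-8"))
    return per_key * n_keys

# Key styles from the question; 1,000,000 keys is a hypothetical count
saved = key_bytes_saved("user:{U987654321}", "U:987654321", 1_000_000)
print(saved)  # 6000000
```

Six bytes per key is about 6 MB per million keys, which is usually small next to the values themselves.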
In your HSET example, you split out the members, which only makes sense if you need them separated from each other most of the time. A better approach -might- be: HSET user:data 987654321 '{"loc": "123456", "time": "2014-01-01T13:00:00"}'. Separate keys/members 'cost' a lot more than longer strings, performance-wise. You can even put a whole table or dataset in one member if it's only going to be used as one complete semi-static entity.
Speed and Size: There is a notable difference between keys and values.
Keys:
Shorter is generally more efficient in both memory and speed. If you use a Redis Sorted Set you can even use 'numbers' as keys (sorted set 'members' plus 'scores'). I say 'numbers' because a score is technically a float64, but to be used as an ID it has to be between -999999999999999 and 999999999999999 inclusive (that's 15 digits), without any fractional part. This can be really helpful, since Redis does fast and scalable O(log(n)) on-the-fly sorting of Sorted Sets (using skiplists, simplified).
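The 15-digit range is a conservative bound: a float64 represents every integer exactly up to 2**53, and beyond that some integers can no longer be told apart. A quick check in plain Python, whose `float` is a float64 (the helper name `survives_as_score` is made up for this example):

```python
def survives_as_score(n):
    """True if the integer n round-trips through a float64 unchanged."""
    return int(float(n)) == n

MAX_15_DIGITS = 999_999_999_999_999
LIMIT = 2 ** 53  # 9007199254740992

print(survives_as_score(MAX_15_DIGITS))   # True
print(survives_as_score(-MAX_15_DIGITS))  # True
print(survives_as_score(LIMIT + 1))       # False
```

An ID above that limit could silently collide with a neighbouring ID once it is stored as a score.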
Values:
The MsgPack format (uncompressed) takes up the least space, especially if you store the definitions once and the values many times. JSON is a bit less memory efficient, but is of course such a common IPC format that it should not be left out. Raw strings, character separated, fixed length (ugh), whatever you desire, it's possible to use. You can always compress your data before storing it in Redis. So much for memory efficiency. When it comes to speed, it's less simple. If you want to use Lua server-side scripting (which you should), you can't do anything with compressed data. JSON and MsgPack can be deserialized, but only 'as a whole', which is fine in most scenarios. Most flexible is storing separate values (for example as members of a HSET), but this comes at a price as well (most of the time: too high a price). You can also combine all of these. What we use most: a prefix of two or three delimiter-separated values, followed by a MsgPack payload.
My general advice is: start with using only HSETs and ZSETs, don't split out data that belongs together, use descriptive PascalCased names of 10-25 chars for your keys, use ':' if you need delimiters in your keys (namespaces), serialize as JSON (for simplicity, but code for easy switching to MsgPack), and use Lua scripting (even if you don't know Lua, the subset you use in Redis is tiny).
I wouldn't worry about it too much in the startup phase of your project, you can always change it later on and do some A/B comparisons as soon as you have some interpolatable data.
Hope this helps, TW
Now that Redis v3.2 is almost here, you should consider switching to the new geo hashing functionality: http://redis.io/commands/geoadd
I am going through the database sizes of open dictionaries like WordNet (http://wordnet.princeton.edu/). It has a database size of almost 52 MB. But I have seen some offline dictionary applications on Google Play, like the English Dictionary app, which uses the Wiktionary database. I do not know how they manage to provide an offline dictionary of more than 167,000 words in only 15 MB.
What might be the way of keeping the words in the database?
WordNet packs quite a punch in a small memory footprint.
How? Here is a brief picture:
- Words are stored in index files for fast search: index.noun, index.adj, etc.
- The index gives, for each word, its offsets into the definition files: data.noun, etc.
- Each line in a definition file corresponds to one definition, and relationships between words are marked by a symbol and an offset, e.g. ! for antonym, @ for hypernym ("kind of"), etc.
This makes the whole thing pretty compact.
For more info on this read: man 5 wndb.
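The index-plus-offset idea can be sketched in a few lines. This is a simplified stand-in, not the actual wndb file format: the "data file" is a bytes object with one definition per line, and the "index" is a dict from word to byte offset. The function names and sample entries are made up for this example.

```python
def build_index(entries):
    """Write definitions back to back and record each word's byte offset."""
    data = bytearray()
    index = {}
    for word in sorted(entries):
        index[word] = len(data)  # offset where this word's line starts
        data += entries[word].encode("utf-8") + b"\n"
    return index, bytes(data)

def lookup(word, index, data):
    """Jump straight to the stored offset and read up to the end of the line."""
    start = index[word]
    end = data.index(b"\n", start)
    return data[start:end].decode("utf-8")

index, data = build_index({
    "hot": "having a high temperature",
    "cold": "having a low temperature",
})
print(index)
print(lookup("hot", index, data))
```

Relations can be stored the same way: a symbol plus the offset of the related line costs a few bytes instead of repeating the related word's text.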
Regarding size:
52 MB = 52,000 KB
With about 180K words in WordNet, each word gets 52,000 KB / 180K ~ 290 bytes.
An average of roughly 300 bytes to represent the definitions plus relations is good enough.
For example, an (approximate) average word might have 4 definitions (20 chars each), 2 usage examples (20 chars each), and the overhead of about 10 relations.