I have a graph collection that is just a single 10 GB collection, into which I have inserted about 37,000 vertices and edges (combined). I started getting errors about reaching the storage quota, but I should be nowhere near it: the storage metrics page says I am at 10 GiB with 36.97K documents. I queried samples of these vertices and edges and uploaded the JSON to the Capacity Planner (https://www.documentdb.com/capacityplanner); according to it, my total data storage for what is there should be around 30.75 MB. I uploaded two sample files (a vertex sample and an edge sample) and entered the exact quantity of each of them to arrive at that 30.75 MB figure.
For sample document 1: 11.21 KB x 18560 = 18.71 MB
For sample document 2: 20.85 KB x 18405 = 12.04 MB
Why and how would this run the collection up to the max size? What is going on with this?
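For reference, the vertex and edge counts can be double-checked straight from the collection with a couple of Gremlin queries. Below is a minimal gremlinpython sketch; the endpoint, database, collection, and key are placeholders, not values from my setup:

# Minimal sanity check of the document counts with the gremlinpython driver.
# Endpoint, database/collection names, and key below are placeholders.
from gremlin_python.driver import client, serializer

gremlin = client.Client(
    'wss://<your-account>.gremlin.cosmosdb.azure.com:443/', 'g',
    username='/dbs/<database>/colls/<graph-collection>',
    password='<primary-key>',
    message_serializer=serializer.GraphSONSerializersV2d0())

print('vertices:', gremlin.submit('g.V().count()').all().result())
print('edges:   ', gremlin.submit('g.E().count()').all().result())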
We are facing high Solr query times for fingerprint matching. Here is our setup:
echonest/echoprint-server (Solr 1.0) running on a single node, an Amazon EC2 m3.2xlarge instance with 30 GB RAM and 8 cores
2.5 million tracks (segment count 19933333) ingested, with a Solr 1.0 index size of around 91 GB.
Applied the HashQueryComponent.java optimization from https://github.com/playax/echoprint-server/commit/706d26362bbe9141203b2b6e7846684e7a417616#diff-f9e19e870c128c0d64915f304cf43677
We also tried to capture stats for the eval method; some of the loop iterations over the index reader's sequential sub-readers took more than 1 second to iterate over all the terms.
Any suggestions or pointers in the right direction would be very helpful.
I have a Lucene index with close to 480M documents; the index is 36 GB on disk. I ran around 10,000 queries against the index. Each query is a boolean AND query with 3 term queries inside, i.e. three operands that MUST occur. Executing such 3-word queries gives the following latency percentiles:
50th = 16 ms
75th = 52 ms
90th = 121 ms
95th = 262 ms
99th = 76010 ms
99.9th = 76037 ms
Is latency expected to degrade when the number of docs is as high as 480M? All segments in the index are merged into a single segment; even when they are not merged, the latencies are not very different. Each document has 5-6 stored fields, but as mentioned above, these latencies are for boolean queries that don't access any stored fields and just do posting-list lookups on the 3 tokens.
Any ideas on what could be wrong here?
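For concreteness, this is the shape of each query, shown here as a minimal PyLucene sketch (the index path, the field name "body", and the three tokens are placeholders; it assumes a recent Lucene/PyLucene build):

import time
import lucene
from java.nio.file import Paths
from org.apache.lucene.index import DirectoryReader, Term
from org.apache.lucene.search import BooleanClause, BooleanQuery, IndexSearcher, TermQuery
from org.apache.lucene.store import FSDirectory

lucene.initVM()
reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))
searcher = IndexSearcher(reader)

# Three term queries that MUST all occur, matching the setup described above.
builder = BooleanQuery.Builder()
for token in ["foo", "bar", "baz"]:
    builder.add(TermQuery(Term("body", token)), BooleanClause.Occur.MUST)
query = builder.build()

start = time.time()
hits = searcher.search(query, 10)   # no stored fields are loaded here
print("hits:", hits.totalHits, "latency_ms:", (time.time() - start) * 1000.0)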
I'm looking for a way to order Google Books Ngrams by frequency.
The original dataset is here: http://books.google.com/ngrams/datasets. Inside each file, the ngrams are sorted alphabetically and then chronologically.
My computer is not powerful enough to handle 2.2 TB worth of data, so I think the only way to sort this would be "in the cloud".
The AWS-hosted version is here: http://aws.amazon.com/datasets/8172056142375670.
Is there a financially efficient way to find the 10,000 most frequent 1grams, 2grams, 3grams, 4grams, and 5grams?
To throw a wrench in it, the datasets contain data for multiple years:
As an example, here are the 30,000,000th and 30,000,001st lines from file 0
of the English 1-grams (googlebooks-eng-all-1gram-20090715-0.csv.zip):
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The first line tells us that in 1978, the word "circumvallate" (which means
"surround with a rampart or other fortification", in case you were wondering)
occurred 313 times overall, on 215 distinct pages and in 85 distinct books
from our sample.
Ideally, the frequency lists would only contain data from 1980 to the present (summing the counts across those years).
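For a single shard, the aggregation I have in mind looks roughly like this (illustrative Python over one unzipped file, with the column layout from the sample lines above; the whole point of the question is that I can't run this over all 2.2 TB locally):

import csv
from collections import defaultdict

# Sum the match counts per ngram for years >= 1980 in one shard of the dataset.
# Columns follow the sample above: ngram, year, matches, pages, books.
totals = defaultdict(int)
with open("googlebooks-eng-all-1gram-20090715-0.csv") as f:
    for ngram, year, matches, pages, books in csv.reader(f, delimiter="\t"):
        if int(year) >= 1980:
            totals[ngram] += int(matches)

top_10000 = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10000]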
Any help would be appreciated!
Cheers,
I would recommend using Pig!
Pig makes things like this very easy and straightforward. Here's a sample Pig script that does pretty much what you need:
raw = LOAD '/foo/input' USING PigStorage('\t') AS (ngram:chararray, year:int, count:int, pages:int, books:int);
filtered = FILTER raw BY year >= 1980;
grouped = GROUP filtered BY ngram;
counts = FOREACH grouped GENERATE group AS ngram, SUM(filtered.count) AS count;
sorted = ORDER counts BY count DESC;
limited = LIMIT sorted 10000;
STORE limited INTO '/foo/output' USING PigStorage('\t');
Pig on AWS Elastic MapReduce can even operate directly on S3 data, so you would probably replace /foo/input and /foo/output with s3:// paths (e.g. s3://your-bucket/ngrams/input and s3://your-bucket/ngrams/output) too.
I need a NoSQL database for writing continuous log data, approximately 100 writes per second. A single record contains 3 columns and is less than 1 KB. Reads are needed only once a day, after which I can delete all of that day's data. But I can't decide which is the cheapest solution: Google App Engine with the Datastore, or Heroku with MongoLab?
I can give you costs for GAE:
Going by the billing docs and assuming about 258M operations per month (86,400 seconds per day * 100 requests/s * 30 days), this would cost you:
Writing: 258M records * ($0.20 / 100k ops) = $516 for writing unindexed data
Reading: 258M records * ($0.07 / 100k ops) = $180 for reading the data once
Deleting: 258M records * ($0.20 / 100k ops) = $516 for deleting unindexed data
Storage: 8.6M entities at 1 KB each per day = 8.6 GB per day ≈ 258 GB added over the month, averaging about 129 GB stored
Storage cost: 129 GB * $0.12/GB ≈ $15 / month
So your total on GAE would come to roughly $1,230 per month. Note that using a structured database to write unstructured data is not optimal, and that is reflected in the price.
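As a sanity check, the arithmetic above can be reproduced in a few lines of Python (prices as quoted in this answer; they may not reflect current billing):

OPS_PER_MONTH = 100 * 86400 * 30              # about 259M ops; rounded to 258M above
WRITE_PRICE = 0.20 / 100000                   # $ per op for unindexed writes/deletes
READ_PRICE = 0.07 / 100000                    # $ per op for reads
STORAGE_PRICE = 0.12                          # $ per GB-month

write_cost = OPS_PER_MONTH * WRITE_PRICE      # ~$518
read_cost = OPS_PER_MONTH * READ_PRICE        # ~$181
delete_cost = OPS_PER_MONTH * WRITE_PRICE     # ~$518
avg_stored_gb = (8.64 * 30) / 2               # 8.64 GB/day accumulating, ~130 GB average
storage_cost = avg_stored_gb * STORAGE_PRICE  # ~$15.5
print(write_cost + read_cost + delete_cost + storage_cost)  # roughly $1,230 per month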
With App Engine, it is recommended that you use memcache for operations like this, and memcache does not incur database charges. Using Python 2.7 and ndb, memcache is used automatically, and you will end up with at most 1 database write per second.
At current billing:
6 cents per day for reads/writes.
Less than $1 per day for storage
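For concreteness, here is one rough way to realize this on the Python 2.7 runtime with ndb and the memcache API. The key and model names are illustrative, and memcache is not durable, so buffered lines can be evicted before they are flushed:

from google.appengine.api import memcache
from google.appengine.ext import ndb

class DailyLogBatch(ndb.Model):
    lines = ndb.TextProperty(repeated=True)   # one entity per flushed batch

def log(line):
    # Each incoming record is a cheap memcache write, not a billed datastore op.
    seq = memcache.incr('log_seq', initial_value=0)
    memcache.set('log_%d' % seq, line)

def flush(batch_size=1000):
    # A periodic task (e.g. cron) turns many buffered lines into one datastore write.
    seq = int(memcache.get('log_seq') or 0)
    keys = ['log_%d' % i for i in range(max(1, seq - batch_size + 1), seq + 1)]
    lines = [v for v in memcache.get_multi(keys).values() if v]
    if lines:
        DailyLogBatch(lines=lines).put()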
I realized it is very expensive to create many new entities (and properties), so I decided to store each data chunk (~50 KB of zipped JSON) in a single entity as a byte array (blob) in the Datastore.
However, I have no idea how many write/read ops are necessary to write/read blob data. Does it depend on the size of the blob, or is it a constant number of write/read ops?
Thank you in advance :)
Blobstore data is billed as stored data:
Stored Data (billable): The total amount of data stored in datastore entities and corresponding indexes, in the task queue, and in the Blobstore.
So, just as with entities, you pay per read and write operation, not by size.
Quotas: https://developers.google.com/appengine/docs/quotas#Datastore
Costs: https://developers.google.com/appengine/docs/billing
Entity Get (per entity): 1 Read
New Entity Put (per entity, regardless of entity size): 2 Writes + 2 Writes per indexed property value + 1 Write per composite index value
Existing Entity Put (per entity): 1 Write + 4 Writes per modified indexed property value + 2 Writes per modified composite index value
Entity Delete (per entity): 2 Writes + 2 Writes per indexed property value + 1 Write per composite index value
Query: 1 Read + 1 Read per entity retrieved
Query (keys only): 1 Read + 1 Small per entity retrieved
Key allocation (per key): 1 Small
Write ops: $0.10 per 100k operations
Read ops: $0.07 per 100k operations
Small ops: $0.01 per 100k operations
Also consider storage costs:
Stored Data (Blobstore): $0.13 per GB per month [free limit: 5 GB]
Stored Data (Datastore): $0.24 per GB per month [free limit: 1 GB]
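To make the constant-ops point concrete, here is a minimal sketch of the one-blob-per-entity approach from the question, assuming the Python ndb API; the model and property names are illustrative:

import json
from google.appengine.ext import ndb

class Chunk(ndb.Model):
    # BlobProperty is not indexed, so a new-entity put() costs the flat
    # 2 writes listed above, regardless of how large the blob is.
    data = ndb.BlobProperty(compressed=True)   # ndb compresses the bytes for you

def store_chunk(records):
    return Chunk(data=json.dumps(records)).put()   # one entity write, ~50 KB payload

def load_chunk(key):
    return json.loads(key.get().data)              # one entity read (1 Read op)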