Regarding the data amount limit of Snowflake

Are there any restrictions in Snowflake such as the following?
・ Maximum number of records per table
・ Maximum capacity limit per table
・ Limit on the number of tables that can be created

There are no limits like that as of now.
For more read here.

I've used Snowflake extensively. In a previous position, we had more than 50 Petabytes of data in Snowflake, spread out over more than 10,000 tables (I don't have the exact number, I just stopped counting at 10,000).
In my current position, we have a single table with more than 100 TB of data - that is the compressed size on Snowflake. We can run text search queries on this table in a matter of seconds.

Snowflake scales really well and has no "limits" on things like this, per se. Just be aware that with the pay-for-what-you-use pricing model, your budget can be the limiting factor.
I've heard a horror story about a company that shifted their entire process into Snowflake with no apparent problems, but their first month's bill exceeded their entire project's budget. Find ways to learn from mistakes while they're small, and impose your own limits while you figure out how to optimize things for cost.

Related

What decides the number of partitions in a DynamoDB table?

I'm a beginner to DynamoDB, and my online instructor doesn't answer his Q&A lol, so I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas.
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the number of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3000 read operations and 1000 write operations per second. It keeps a divider between read and write ops so they do not interfere with each other. If you had a table that was configured to support 18000 reads and 6000 writes, you'd have at least 12 partitions, but probably a few more for some headroom.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 items does not mean you have 200 partitions. It is very possible for those 200 items to be in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
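To make that concrete, here is a rough sketch of how hashing places keys into partitions. The MD5 hash and the modulo step are assumptions chosen purely for illustration (DynamoDB's real hash function and key ranges are internal to AWS), and assign_partition is a hypothetical helper, not a DynamoDB API.

```python
import hashlib
from collections import Counter


def assign_partition(partition_key: str, partition_count: int) -> int:
    """Map a key to a partition index by hashing it.

    Illustrative only: MD5 + modulo is an assumption, not what DynamoDB
    actually does internally.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


user_ids = [f"user_{i}" for i in range(200)]

# With a single partition, all 200 distinct keys land in partition 0.
one_partition = Counter(assign_partition(u, 1) for u in user_ids)

# With 4 partitions, the same 200 keys are spread over at most 4 buckets.
four_partitions = Counter(assign_partition(u, 4) for u in user_ids)

print(one_partition)
print(four_partitions)
```

The point is that the partition count is an input to the placement, not something derived from how many distinct keys you have.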
There are a few distinct times where DynamoDB will add partitions.
When partitions grow in storage size larger than 10GB. DynamoDB might see that you are taking on data and try to do this proactively, but 10GB is the cut off.
When your table needs to support more operations per second than it is currently doing. This can happen manually because you configured your table to support 20,000 reads/sec where before it only supported 2000. DynamoDB would have to add partitions and move data to be able to handle that 20,000 reads/sec. Or it can happen automatically because you configured floor and ceiling values in DynamoDB auto-scaling: DynamoDB senses your ops/sec is climbing and adjusts the number of partitions in response to capacity exceptions.
Your table is in on-demand capacity mode and DynamoDB attempts to automatically keep 2x your previous high water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that is past your previous high water mark and start adding more partitions as it tries to keep 2x the capacity at the ready in case you peak up again like you just did.
DynamoDB actively monitors your table. If it sees that one or more items being hit particularly hard (hot keys) are in the same partition, this might create a hot partition. If that is happening, DynamoDB might split the partition to help isolate those items and prevent or fix a hot partition situation.
There are one or two other, rarer edge cases, but you'd likely be talking to AWS Support if you encountered them.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Number of partitions by throughput: Pt = RCU/3000 + WCU/1000
Number of partitions by storage: Ps = GB/10
Unless one of the above is more than 1.0, there will likely only be a single partition. But I'm sure the split happens as you approach the limits. When exactly is something only AWS knows.
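As a rough sketch, the two formulas can be turned into a small estimator. The answers here give the formulas but not how to combine them, so rounding each up and taking the larger is an assumption, and estimate_partitions is a hypothetical helper. As noted, the real split points are only known to AWS, so treat the result as a lower-bound estimate.

```python
import math


def estimate_partitions(rcu: float, wcu: float, storage_gb: float) -> int:
    """Estimate the partition count from throughput and storage.

    Assumption: round each formula up and take the larger of the two,
    never going below one partition.
    """
    by_throughput = math.ceil(rcu / 3000 + wcu / 1000)
    by_storage = math.ceil(storage_gb / 10)
    return max(by_throughput, by_storage, 1)


# Table from the first answer: 18000 reads and 6000 writes, little data.
print(estimate_partitions(18000, 6000, 5))

# Small table: both formulas stay below 1.0, so a single partition.
print(estimate_partitions(100, 50, 1))

# Low throughput but 45 GB of data: storage drives the count.
print(estimate_partitions(100, 50, 45))
```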

Best approach for Cassandra Partitioning

I am very new to Cassandra. I have worked with Oracle SQL and MongoDB, and I am trying to learn Apache Cassandra to use it in a project I am working on.
I have a certain number of sensors (let's say 20), which might increase in the future. They send data to store every 10 seconds. I am aware of bucketing to deal with this type of situation, but I'm wondering which of these two options is better.
PRIMARY KEY ((sensor_id, day_month_year), reported_at);
PRIMARY KEY ((sensor_id, month_year), reported_at);
I don't know if using month_year is too much data for a single partition and on the other hand I think that if I use day_month_year it creates too many partitions and it slows reading too much when trying to get data since it has to access multiple partitions.
Which one should I use? If you have other good suggestions or just some explanations for me I'd like to hear them.
Posting my answer here since you also asked on https://community.datastax.com/questions/10596/.
Sensor data collected every 10 seconds is equivalent to:
6 entries per minute
360 entries per hour
8,640 entries per day
260K entries per month
Depending on what other data you store for each row, it will be difficult to keep the size of each partition to the recommended 100MB. This isn't a hard limit so your partitions can go beyond 100MB but you are trading off performance the larger your partition gets.
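A quick back-of-the-envelope sketch of the two bucketing options. The per-row sizes (100 and 1000 bytes) are made-up values for illustration, the 30-day month is a simplification, and the helper names are hypothetical; plug in your own row size.

```python
SECONDS_PER_DAY = 24 * 60 * 60
REPORT_INTERVAL_S = 10                      # one reading every 10 seconds
RECOMMENDED_MAX_BYTES = 100 * 1024 * 1024   # the recommended ~100MB


def rows_per_bucket(days: int) -> int:
    """Rows one sensor writes into a bucket spanning `days` days."""
    return days * SECONDS_PER_DAY // REPORT_INTERVAL_S


def partition_size_bytes(days: int, bytes_per_row: int) -> int:
    """Approximate partition size, ignoring storage overhead."""
    return rows_per_bucket(days) * bytes_per_row


for label, days in (("day_month_year", 1), ("month_year", 30)):
    for bytes_per_row in (100, 1000):       # hypothetical row sizes
        size = partition_size_bytes(days, bytes_per_row)
        verdict = "over" if size > RECOMMENDED_MAX_BYTES else "within"
        print(f"{label}, {bytes_per_row} B/row: "
              f"{size / 1024 / 1024:.1f} MiB ({verdict} the recommendation)")
```

With small rows both buckets stay well within the recommendation. It is the monthly bucket with larger rows that goes past it.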
On its own, Cassandra isn't ideal for performing analytics queries because it is optimised for OLTP workloads where you are reading one partition for each app request. If you need to do OLAP, you will need to do it in Spark for efficiency. Cheers!

Set the right partitions for Crate Database

I am modelling for the Database CrateDB.
I have an avg. of 400 customers and they produce different amounts of time-series data every day. (Between 5K and 500K; avg. ~15K)
Later I should be able to query per customer_year_month and per customer_year_calendar_week.
That means I will only query by two intervals: week and month.
Now I'm asking myself how to partition this table.
I would partition per customer and year.
Does this make sense?
Or would it be better to partition by customer, year and month?
The question of how to partition a table is quite complex and depends on a lot of things. Among others:
What queries should be run?
The way the data is inserted
Available hardware resources
Cluster size
Essentially, each partition also creates overhead by multiplying the shard count (a partition can be considered a "sub-table" based on a column value), which - if chosen improperly - can hinder performance a lot.
So in your case, 15k inserts a day is not too much. However, the distribution of inserts might cause problems: the partition of a customer with 500k inserts a day will run into performance problems earlier than that of a customer with 5k. As a consequence I would use weekly partitioning only.
create table "customer-logging" (
    customer_id long,
    log string,
    ts timestamp,
    -- generated column: start of the week that ts falls in
    week as date_trunc('week', ts)
) partitioned by (week) clustered into 8 shards;
Please only use 8 shards if you have an appropriate amount of CPU cores ;)
Docs: date_trunc(), partitioned tables
Ideally you try out a few different combinations and find what works best for you. Insights into shard sizes and locations are provided by our sys tables, so you can see if there's a particularly fat shard that overloads a node ;)
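To put rough numbers on the shard multiplication mentioned above, here is a small sketch comparing the schemes from the question with the weekly one. It assumes every partition gets the table's configured 8 shards, counts primaries only (no replicas), and uses 52 weeks per year; shards_per_year is a hypothetical helper.

```python
CUSTOMERS = 400          # average customer count from the question
WEEKS_PER_YEAR = 52      # simplification: some years span 53 weeks
MONTHS_PER_YEAR = 12


def shards_per_year(partitions_per_year: int, shards_per_partition: int = 8) -> int:
    """Primary shards created per year for a given partitioning scheme."""
    return partitions_per_year * shards_per_partition


partitions_by_scheme = {
    "week": WEEKS_PER_YEAR,
    "customer + year": CUSTOMERS,
    "customer + year + month": CUSTOMERS * MONTHS_PER_YEAR,
}

shard_counts = {
    scheme: shards_per_year(partitions)
    for scheme, partitions in partitions_by_scheme.items()
}

for scheme, count in shard_counts.items():
    print(f"{scheme}: {count} primary shards per year")
```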
Cheers, Claus

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never changes. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, both sets of data in their own table, using a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds high for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and poor general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy of a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melts into my next, but similar, issue, audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you going to use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the row count exceeds a few hundred million per table. But with effective indexing and cleanup you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place that match the queries you will issue to fetch the data. Whatever you have in the WHERE clause should have an appropriate index in the DB.
If your row counts per table exceed many millions, you should look at table partitioning. DBs actually store data as files on the filesystem, so partitioning helps by splitting the data into smaller groups of files based on some predicate, e.g. a date or some unique column. You would still see a single table, but on the filesystem the DB would store the data in different file groups.
Then you can also try table sharding, which is actually what you mentioned: different tables based on some predicate like date.
Hope this helps.
You are overthinking this. 300k rows is not significant. Just about any relational or NoSQL database will handle it without any problems.
Your design sounds fine. However, I highly advise that you add a primary key for each row, using whatever facility your database provides. Typically this involves AUTO_INCREMENT or a sequence, depending on the database. If you used a NoSQL database like MongoDB, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A: tableA_id | A | B | CreatedOn
Table B: tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.
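Here is a minimal sketch of that design for Table A, using Python's built-in sqlite3 module so it runs without a server. SQLite is just a stand-in for whatever database you end up choosing. The table and column names, the sample rows, and the ISO-8601 text timestamps are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.executescript("""
CREATE TABLE table_a (
    table_a_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate primary key
    a          INTEGER NOT NULL,
    b          INTEGER NOT NULL,
    created_on TEXT    NOT NULL                    -- ISO-8601 timestamp
);
-- Index that supports date range filters on created_on.
CREATE INDEX idx_table_a_created_on ON table_a (created_on);
""")

# Made-up sample rows: (a, b, created_on).
sample_rows = [
    (1, 10, "2024-01-01 08:00:00"),
    (2, 20, "2024-01-01 16:00:00"),
    (3, 30, "2024-01-02 08:00:00"),
]
conn.executemany(
    "INSERT INTO table_a (a, b, created_on) VALUES (?, ?, ?)", sample_rows
)

# Limit by date range, then summarize on day boundaries.
daily_totals = conn.execute("""
    SELECT date(created_on) AS day, COUNT(*) AS row_count, SUM(b) AS total_b
    FROM table_a
    WHERE created_on >= '2024-01-01' AND created_on < '2024-01-03'
    GROUP BY day
    ORDER BY day
""").fetchall()

print(daily_totals)
```

Table B would follow the same pattern with its eight string columns in place of a and b.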

Maximum number of records for a custom object in salesforce.com

What is the maximum number of records within a single custom object in salesforce.com?
There does not seem to be a limit indicated in https://login.salesforce.com/help/doc/en/limits.htm
But of course, there has to be a limit of some kind. E.g., could 250 million records be stored in a single salesforce.com custom object?
As far as I'm aware the only limit is your data storage, you can see what you've used by going to Setup -> Administration Setup -> Data Management -> Storage Usage.
In one of the Orgs I work with I can see one object has almost 2GB of data for just under a million records, and this accounts for a little over a third of the storage available. Your storage space depends on your Salesforce Edition and number of users. See here for details.
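As a rough sketch, the figures above (almost 2GB for just under a million records) work out to roughly 2 KB per record, which lets you estimate the storage for the 250 million records asked about. The 2 KB figure is an approximation derived from that one org, and estimated_storage_gb is a hypothetical helper.

```python
# Assumption: ~2 KB per record, derived from "almost 2GB for just under
# a million records" in the org described above.
APPROX_BYTES_PER_RECORD = 2_000


def estimated_storage_gb(record_count: int,
                         bytes_per_record: int = APPROX_BYTES_PER_RECORD) -> float:
    """Estimate data storage in GB (decimal) for a number of records."""
    return record_count * bytes_per_record / 1_000_000_000


for record_count in (1_000_000, 250_000_000):
    print(f"{record_count:,} records: "
          f"~{estimated_storage_gb(record_count):,.0f} GB")
```

So the question becomes whether your edition and user count give you storage on that order, rather than whether there is a record-count cap.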
I've seen the performance issue as well, though after about 1-2M records the performance hit appears magically to plateau, or at least it didn't appear to significantly slow down between 1M and 10M. I wonder if orgs are tier-tuned based on volume... :/
But regardless of this, there are other challenges which make it less than ideal for big data. Even though they've increased the SOQL governor limit to permit up to 50 million records to be retrieved in one call, you're still strapped with a 200,000 line execution limit in Apex and a 10K DML limit (per execution thread). These can be bypassed through Batch Apex, yet this has limitations as well. You can only execute 250K batches in 24 hours and only have 5 batches running at any given time.
So... the moral of the story seems to be that even if you managed to get a billion records into a custom object, you really can't do much with the data at that scale anyway. Therefore, it's effectively not the right tool for that job in its current state.
2-cents
LaceySnr is correct. However, there is an inverse relationship between the number of records for an object and performance. Any part of the system that filters on that object will be impacted, such as views, reports, SOQL queries, etc.
It's hard to talk specific numbers since Salesforce has upwards of a dozen server clusters, each with its own performance characteristics. And there's probably a lot of dynamic performance management that occurs regularly. But in the past I've seen performance issues start to creep in around 2M records. One possible remedy is to ask Salesforce to index the fields that you plan to filter on.
