Maximum Number of Cells in a Cassandra Table

I have a system that stores measurements from machines with many transducers, once per second. I'm considering using Cassandra and would like to store the 1 second sample of machine state measurements in a single table, which would be something like:
create table inst_samples (
    machine_id text,
    batch_id int,
    sample_time timestamp,
    var1 double,
    var2 double,
    .....
    varN double,
    PRIMARY KEY ((machine_id, batch_id), sample_time)
);
There are approximately 20 machines with 400 state variables each, and the batch_id will update every 1-2 hours. I have reviewed the documentation on the 2 billion cells maximum per table and noted similar questions here: "What are the maximum number of columns allowed in Cassandra" and "Cassandra has a limit of 2 billion cells per partition, but what's a partition?"
If I am understanding this limit correctly I would hit the 2 billion cell limit for a single machine in the inst_samples table in approximately 60 days?
(2e9 cells / 400 cols/row) / (3600 rows / hour) / (24 hours / day) =~ 58 days?
I am a total Cassandra newbie. Thanks.

The 2 billion limit is per partition, and with a good data model you should have many partitions. In practice it's recommended to keep the number of cells per partition under control - roughly no more than 100,000 cells per partition - otherwise you may run into performance problems. The practical threshold depends on multiple factors, such as the Cassandra version, what queries are executed, and so on.
In your case the partition key is machine_id + batch_id, which for a 2-hour batch gives 400 x 7200 = 2,880,000 - almost 3 million cells per partition. It may still work (it would be better to use a 1-hour batch), but it will require testing on real hardware - this could be done, for example, with NoSQLBench.
There are also other ways to optimize the data model - for example, instead of allocating a separate column for every variable, use frozen<map<text, double>>; then all measurements for a sample are stored as a single cell. The drawback is that you can't change an individual value without reading the whole map and re-inserting it with the changed value, and you always have to read all measurements at once - but that may be acceptable.
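As a rough sketch of that map-based layout, assuming the DataStax Python driver, a local node, and a pre-existing keyspace named telemetry (all names here are illustrative, not from the original question):
from cassandra.cluster import Cluster

# Assumed local contact point and pre-existing keyspace; adjust for your cluster.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("telemetry")

# One frozen map per sample: all ~400 variables are stored as a single cell.
session.execute("""
    CREATE TABLE IF NOT EXISTS inst_samples_map (
        machine_id text,
        batch_id int,
        sample_time timestamp,
        vars frozen<map<text, double>>,
        PRIMARY KEY ((machine_id, batch_id), sample_time)
    )
""")

# Inserting a sample: the whole map is written (and later read) in one go.
session.execute(
    "INSERT INTO inst_samples_map (machine_id, batch_id, sample_time, vars) "
    "VALUES (%s, %s, toTimestamp(now()), %s)",
    ("machine-01", 42, {"var1": 1.23, "var2": 4.56}),
)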

Related

Finding unique customers in a selected day range

I have a simple table as follows:
day order_id customer_id
1 1 1
1 2 1
1 3 2
2 4 1
2 5 1
I want to find the number of unique customers from Day 1 to Day 2, and the answer is 2.
But the table is huge and querying takes a long time, so I want to store aggregated data in another table to reduce the data size and query faster. I have created a new table out of the above table:
day uniq_customer
1 2
2 1
Now if I want to find the unique customers from Day 1 to Day 2, I get 2 + 1 = 3, whereas the answer is 2.
Is there any way to work around this without having to query the old table?
PS: I am using Druid as a data source.
This depends on the shape of your data. For example, if you have a small number of distinct customers and days, you can keep a bit vector of customers per day. To answer a query, OR together the bit vectors of the days in the range; the number of set bits in the result is the number of unique customers. It may be tedious to implement.
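A minimal sketch of the bit-vector idea in pure Python, assuming customer IDs are small non-negative integers; the orders list mirrors the example data above:
# day -> bit vector of customers seen that day (bit i set means customer i ordered)
orders = [(1, 1, 1), (1, 2, 1), (1, 3, 2), (2, 4, 1), (2, 5, 1)]  # (day, order_id, customer_id)

per_day = {}
for day, _order_id, customer_id in orders:
    per_day[day] = per_day.get(day, 0) | (1 << customer_id)

def unique_customers(start_day, end_day):
    combined = 0
    for day in range(start_day, end_day + 1):
        combined |= per_day.get(day, 0)
    return bin(combined).count("1")  # popcount = number of distinct customers

print(unique_customers(1, 2))  # -> 2, not 3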
If you have a large number of distinct customers and days, group the records per customer and sort them by day. Then, for each customer, binary-search for the index of the first row whose day is greater than or equal to the query start and the index of the last row whose day is less than or equal to the query end. The difference between those two indices plus one is the number of that customer's records falling inside the range; every customer with at least one such record counts toward the unique total. The complexity is O(#customers x log #recordsPerCustomer).
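And a sketch of that per-customer binary-search approach using Python's standard bisect module, with the per-customer day lists assumed to be pre-sorted (data again mirrors the example):
from bisect import bisect_left, bisect_right

# customer_id -> sorted list of days on which that customer placed at least one order
days_by_customer = {1: [1, 1, 2, 2], 2: [1]}

def unique_customers(start_day, end_day):
    count = 0
    for days in days_by_customer.values():
        lo = bisect_left(days, start_day)   # first index with day >= start_day
        hi = bisect_right(days, end_day)    # first index with day > end_day
        if hi - lo > 0:                     # at least one order in the range
            count += 1
    return count

print(unique_customers(1, 2))  # -> 2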
Apache Druid supports the use of approximations for this kind of query. Take a look at the tutorial on using approximations in Druid: https://druid.apache.org/docs/latest/tutorials/tutorial-sketches-theta.html
In Druid you can also partially aggregate into Theta sketches at ingestion time and then aggregate them over time or over other grouping dimensions at query time. This is specifically designed to deal with large data volumes, and you can control the accuracy of the approximation.
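As an illustration (not taken from the tutorial), assuming the druid-datasketches extension is loaded, the Druid SQL endpoint is reachable on the router at localhost:8888, and a hypothetical datasource named orders with a day column, an approximate distinct count over a day range could be queried from Python like this:
import requests

# Hypothetical datasource "orders"; APPROX_COUNT_DISTINCT_DS_THETA requires the
# druid-datasketches extension to be loaded on the cluster.
sql = """
    SELECT APPROX_COUNT_DISTINCT_DS_THETA(customer_id) AS uniq_customers
    FROM orders
    WHERE "day" BETWEEN 1 AND 2
"""
response = requests.post("http://localhost:8888/druid/v2/sql/", json={"query": sql})
print(response.json())  # e.g. [{"uniq_customers": 2}]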

Choosing best datatype for numeric column in SQL Server

I have a table in SQL Server with a large amount of data - around 40 million rows. The base structure is like this:
Title       | type           | length | Null distribution
Customer-Id | number         | 8      | 60%
Card-Serial | number         | 5      | 70%
...         | ...            | ...    | ...
Note        | string-unicode | 2000   | 40%
Both numeric columns are filled with numbers of a fixed length.
I have no idea which data type to choose to keep the database as small as possible while still getting good performance from an index on the Customer-Id column. According to this post, if I choose CHAR(8) the database consumes 8 bytes per row even for null data.
I decided to use INT to reduce the database size and get a good index, but null data will still use 4 bytes per row. If I want to reduce that size I could use VARCHAR(8), but I don't know whether an index on that type performs well. The main question is whether reducing the database size matters more than having a good index on a numeric type.
Thanks.
If it is a number - then by all means choose a numeric datatype!! Don't store your numbers as char(n) or varchar(n) !! That'll just cause you immeasurable grief and headaches later on.
The choice is pretty clear:
if you have whole numbers - use TINYINT, SMALLINT, INT or BIGINT - depending on the number range you need
if you need fractional numbers - use DECIMAL(p,s) for the best and most robust behaviour (no rounding errors like FLOAT or REAL)
Picking the most appropriate datatype is much more important than any micro-optimization for storage. Even with 40 million rows - that's still not a big issue, whether you use 4 or 8 bytes. Whether you use a numeric type vs. a string type - that makes a huge difference in usability and handling of your database!
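As a quick back-of-the-envelope check of that claim (pure Python, counting only the raw column data and ignoring page and row overhead):
rows = 40_000_000
for bytes_per_value, type_name in [(4, "INT"), (8, "BIGINT")]:
    mib = rows * bytes_per_value / 2**20
    print(f"{type_name}: ~{mib:.0f} MiB for the column alone")
# INT: ~153 MiB, BIGINT: ~305 MiB -- a modest gap next to the usability of a true numeric type.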

Google App Engine - Search API index growth

I would like to know how I can estimate the growth (how much the size increases over a period of time) of an App Engine Search API (FTS) index, based on the number of entities inserted and the amount of information. For this I would basically like to know how the index size is calculated (what it depends on). Specifically:
When inserting new entities, is the growth (size) influenced by the number of previously existing entities (i.e. is the growth exponential)? For example, if I have 1,000 entities and insert 10, the index will grow by X bytes. But if I have 100,000 entities and insert 10, will it increase by X, or by much more than X (exponentially, let's say 10*X)?
Does the number of fields (properties) influence the size exponentially? For example, if I have entity A with 2 fields and entity B with 4 fields (identical in values, for mathematical simplicity), will the size increase when adding entity B be twice that of entity A, or much more than that?
What other means can I use to find statistical information? Do I have other tools in the App Engine Cloud Console, or can I do this programmatically?
Thank you.
You can check the size of a given index by running the code below.
import logging
from google.appengine.api import search

# Log the current storage usage of every index.
for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s: %s", index.name, index.storage_usage)

# Insert a batch of documents, then re-check storage_usage to see how much it grew.
amount_of_items_to_add = 100
for _ in range(amount_of_items_to_add):
    search_api_insert(data)  # placeholder for your own insert routine

for index in search.get_indexes(fetch_schema=True):
    logging.info("index %s: %s", index.name, index.storage_usage)
This code is obviously not a complete working example, but you should be able to build a simple method that takes some data, inserts it into the Search API, and returns how much the used storage increased.
I have run a number of tests with different numbers of entities and different numbers of indexed properties per entity, and the estimated index growth reported by the API appears to be linear, not exponential.
The most interesting thing to know is that although the reported size is close to real-time, after deleting documents from the index it may take 12, 24, or even 36 hours to update.

Worth a unique table for database values that repeat ~twice?

I have a static database of ~60,000 rows. There is a certain column for which there are ~30,000 unique entries. Given that ratio (60,000 rows/30,000 unique entries in a certain column), is it worth creating a new table with those entries in it, and linking to it from the main table? Or is that going to be more trouble than it's worth?
To put the question more concretely: will I gain much more efficiency by separating this field out into its own table?
** UPDATE **
We're talking about a VARCHAR(100) field, but in reality, I doubt any of the entries use that much space -- I could most likely trim it down to VARCHAR(50). Example entries: "The Gas Patch and Little Canada" and "Kora Temple Masonic Bldg. George Coombs"
If the field is a VARCHAR(255) that normally contains about 30 characters, and the alternative is to store a 4-byte integer in the main table and use a second table with a 4-byte integer and the VARCHAR(255), then you're looking at some space saving.
Old scheme:
T1: 30 bytes * 60 K entries = 1800 KiB.
New scheme:
T1: 4 bytes * 60 K entries = 240 KiB
T2: (4 + 30) bytes * 30 K entries = 1020 KiB
So, that's crudely 1800 - 1260 = 540 KiB space saving. If, as would be necessary, you build an index on the integer column in T2, you lose some more space. If the average length of the data is larger than 30 bytes, the space saving increases. If the ratio of repeated rows ever increases, the saving increases.
Whether the space saving is significant depends on your context. If you need half a megabyte more memory, you just got it — and you could squeeze more if you're sure you won't need to go above 65535 distinct entries by using 2-byte integers instead of 4 byte integers (120 + 960 KiB = 1080 KiB; saving 720 KiB). On the other hand, if you really won't notice the half megabyte in the multi-gigabyte storage that's available, then it becomes a more pragmatic problem. Maintaining two tables is harder work, but guarantees that the name is the same each time it is used. Maintaining one table means that you have to make sure that the pairs of names are handled correctly — or, more likely, you ignore the possibility and you end up without pairs where you should have pairs, or you end up with triplets where you should have doubletons.
Clearly, if the type that's repeated is a 4 byte integer, using two tables will save nothing; it will cost you space.
A lot, therefore, depends on what you've not told us. The type is one key issue. The other is the semantics behind the repetition.
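For illustration, a minimal sketch of the normalized two-table layout using Python's built-in sqlite3 module; the table and column names are hypothetical, not from the question:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Lookup table: one row per distinct name (~30,000 rows).
    CREATE TABLE place_names (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(100) UNIQUE NOT NULL
    );
    -- Main table references the lookup by a small integer key (~60,000 rows).
    CREATE TABLE records (
        id            INTEGER PRIMARY KEY,
        place_name_id INTEGER NOT NULL REFERENCES place_names(id)
    );
""")

# Reads join back to the text; every reference is guaranteed to use the exact same spelling.
rows = conn.execute("""
    SELECT r.id, p.name
    FROM records r
    JOIN place_names p ON p.id = r.place_name_id
""").fetchall()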

Storing vast amounts of (simple) timeline graph data in a DB

I need to store the number of plays for every second of a podcast / audio file. This will result in a simple timeline graph (like the "hits" graph in Google Analytics) with seconds on the x-axis and plays on the y-axis.
However, these podcasts could potentially go on for up to 3 hours, and 100,000 plays for each second is not unrealistic. That's 10,800 seconds with up to 100,000 plays each. Obviously, storing each played second in its own row is unrealistic (it would result in 1+ billion rows) as I want to be able to fetch this raw data fast.
So my question is: how do I best go about storing these massive amounts of timeline data?
One idea I had was to use a text/blob column and then comma-separate the plays, each comma representing a new second (in sequence) and then the number for the amount of times that second has been played. So if there's 100,000 plays in second 1 and 90,000 plays in second 2 and 95,000 plays in second 3, then I would store it like this: "100000,90000,95000,[...]" in the text/blob column.
Is this a feasible way to store such data? Is there a better way?
Thanks!
Edit: the data is being tracked in another source and I only need to update the raw graph data every 15 minutes or so. Hence, fast reads are the main concern.
Note: due to nature of this project, each played second will have to be tracked individually (in other words, I can't just track 'start' and 'end' of each play).
The problem with blob storage is that you need to update the entire blob for every change. This is not necessarily a bad thing. Using your format (roughly 7 characters per second: up to 6 digits plus a comma), that's 7 * 3600 * 3 = ~75K bytes. But it means you're rewriting that 75K blob for every play, every second.
And, of course, the blob is opaque to SQL, so "what second of what song has the most plays" will be an impossible query at the SQL level (that's basically a table scan of all the data to learn that).
And there's a lot of parsing overhead marshalling that data in and out.
On the other hand: podcast ID (4 bytes) + second offset (2 bytes unsigned, which allows podcasts up to 18 hours long) + play count (4 bytes) = 10 bytes per second. So, minus any blocking overhead, a 3-hour podcast is 3600 * 3 * 10 = 108K bytes per podcast.
If you stored it as a binary blob (a packed block of 4-byte counts) instead of text, that's 4 * 3600 * 3 = ~43K.
So, the row-per-second structure is "only" about two and a half times the size (in a perfect world - consult your DB server for details) of a binary blob. Considering the extra benefits this grants you in terms of being able to query things, that's probably worth doing.
The only downside of row-per-second is if you need to do a lot of updates (several seconds at once for one podcast): that's a lot of UPDATE traffic to the DB, whereas with the blob method that's likely a single update.
Your traffic patterns will influence that more than anything.
Would it be problematic to store one row per second, recording how many plays that second has?
That means ~10K rows per podcast, which isn't bad, and you just have to INSERT a row every second with the current data.
EDIT: I would say that this solution is better than doing a comma-separated something in a TEXT column... especially since getting and manipulating the data (which you say you want to do) would be very messy.
I would view it as a key-value problem.
for each second played
Song[second] += 1
end
As a relational database -
song
----
name | second | plays
And a hacky pseudo-SQL to start a second:
insert into song (name, second, plays) values ('xyz', 'abc', 0)
and another to update the second:
update song set plays = plays + 1 where name = 'xyz' and second = 'abc'
A 3-hour podcast would have 11K rows.
It really depends on what is generating the data.
As I understand it, you want to implement a map with the key being the second mark and the value being the number of plays.
What are the pieces in the event, unit of work, or transaction you are loading?
Can I assume you have a play event with the podcast name, start and stop times, and that you want to load it into the map for analysis and presentation?
If that's the case you can have a table
podcastId
secondOffset
playCount
Each event would update the rows between the start and end positions:
update t
set playCount = playCount +1
where podCastId = x
and secondOffset between y and z
and then followed by an insert to add the rows between start and stop that don't yet exist, with a playCount of 1, unless you preload the table with zeros.
Depending on the DB, you may be able to set up a sparse table where empty columns are not stored, making it more efficient.
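For illustration, a minimal sketch of that row-per-second layout with an upsert, using Python's built-in sqlite3 module; the table and column names are hypothetical, and the ON CONFLICT upsert syntax needs SQLite 3.24 or later:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plays (
        podcast_id    INTEGER NOT NULL,
        second_offset INTEGER NOT NULL,
        play_count    INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (podcast_id, second_offset)
    )
""")

def record_play(podcast_id, start_second, end_second):
    # One upsert per second of the played range: insert the row if it is missing,
    # otherwise bump its counter.
    conn.executemany(
        """
        INSERT INTO plays (podcast_id, second_offset, play_count)
        VALUES (?, ?, 1)
        ON CONFLICT (podcast_id, second_offset)
        DO UPDATE SET play_count = play_count + 1
        """,
        [(podcast_id, s) for s in range(start_second, end_second + 1)],
    )

record_play(1, 0, 10)
record_play(1, 5, 15)
print(conn.execute(
    "SELECT second_offset, play_count FROM plays WHERE podcast_id = 1 ORDER BY second_offset"
).fetchall())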
