I'm using Google BigTable to store event log data according to the following constraints:
Each key should contain a username and timestamp, allowing contiguous reads for time-series data on a per-user basis, like this: USERNAME_TIMESTAMP.
I will be storing up to 10,000,000 event logs or more per day, and so naturally, I need to avoid hotspotting and ensure that I am evenly distributing records across each node.
There is a massive security component to this database, and as such, I'd like to encrypt the username before using it as a key in BigTable.
Obviously, I'd like to avoid doing extra steps whenever I read or write, so I was thinking of encrypting usernames using SHA1 before adding them as a key in BigTable. As a result, all keys in BigTable will now be formatted like this:
cf23df2207d99a74fbe169e3eba035e633b65d94_2018_01_30_15090001
We know that SHA1 is normally distributed, so given that, is it safe to assume that all of my records will be evenly distributed across nodes, while ensuring that all usernames will reside together? Will this in effect prevent hotspotting? Are there any edge cases in this approach that I've missed?
Assuming that User Id is well distributed (i.e. there isn't a user that will have more than 10K operations per second), this approach should be fine.
FYI, Cloud Bigtable measures operations in rows per second, and you want to consider your peak throughput in determining the number of nodes. Each node can support 10,000 simple reads or writes per second. Our smallest production configuration is 3 nodes, which can support up to 30,000 rows per second (2.6 Billion rows per day if used continuously at the maximum).
Following is the scenario:
Customer places an order.
Order has type: Physical / Downloadable.
Order is placed from: Web / App.
Order is placed from a Location: UK,AUS,etc.
Can have more dimensions in future.
Consider that all of the dimensions change frequently in every order. And the data is quite huge, approximately 1.3 million records per hour.
Want to design this in a way that reports should be able able to drill down with any requested dimension for each customer.
Example:
- Customer 'A' has placed how many orders of type 'Physical' from 'AUS'
- Customer 'A' has placed how many orders in all.
- Customer 'A' has placed how many orders from of type 'Downloadable' from'APP'.
etc.
Need these reports on realtime, hence low latency writes and reads are a must. What nosql database can be a good fit. And how can this data be well structured to be able to sliced and diced in any required dimension as well as combination of more than one dimension.
If you need high performance then I would recommend ScyllaDB which can handle over 1M ops/s per node (on a good hardware). It shares data model with Cassandra so you can model and query your data using CQL. You can give it a free test drive with just couple of clicks here.
Regarding modeling: A useful technique is to model around your queries. So if you have a particular query you should prepare a table that will serve this query in most efficient way. In this technique you duplicate data by creating as many tables with the same data as many different types of queries you have. Duplicating data comes with a price so you need to trade off the performance and cost depending on your needs. You can read more about it here.
I'm developing a worldwide application in which most searches are based on geospacial data (nearest records given coordinates) and date ranges.
So, basically is likely the main searches of applications like AirBnb, Booking, etc.
Which Partition Key should I choose in a DocumentDB Partitioned Collection considering these context?
Thank you!
UPDATE: like I told to Matias (see answers), me and my friend we're thinking about something like the Country.
The app is all about searches. And another important thing is that we have dates. Tons of dates.
Since we are new to DDB, our question is: "what happens if we choose Country as Partition Key and our queries must search within different countries?". i.e. a georadius search near country borders.
Like Matias mentioned, some more information will help us provide a better recommendation. I've added some ideas/options for partition key selection below:
Use a generic partition key like user ID or product ID. In this model, your geospatial queries will be executed across partitions, but since DocumentDB locally builds a spatial index within partitions, this might meet your performance needs
Use a partitioning scheme based on the GeoHash of the location. This will ensure that data points in similar locations will get placed on the same partitions. This will require some additional work in your app to add "GeoHash > abcdef and GeoHash < abcfff" clauses to narrow down query execution to a few partitions
Partition based on a property like country, if most of your queries fall within a single country. Rare queries that need to span countries will also perform well (though not as low latency as queries against a single partition/country), as they can use the local index within each partition. You might need to handle special cases separately. For example if US has >30-40% of your data, you might want to choose a hybrid approach where US data uses state as the partition key, and countries with lesser data use the country as the partition key. A composite key of country + day/month/year might also work depending on the data distribution.
If your queries are spread evenly across time ranges, you can consider using dates as the partition key. But for most applications, since recent data is more frequently accessed, this is not a good option.
Without knowing a bit more is hard to say but I'd start with these official Partition guide: Partitioning and scaling, especially the section about Designing.
Main points should be throughput distribution (You don't want "hot spots") and Transaction atomicity probably. Remember that when you issue a Query it can spans multiple Partitions and DDB will distribute throughput evenly (you can use this feature with the EnableCrossPartitionQuery option).
So, what truly determines which would be the best partition keys really depend on how your data is distributed and how your queries are built.
Since the app is worldwide, maybe the best Partition approach is to divide by country/continent/region (one of those) but it really depends on the amount of data, it should be evenly distributed to avoid having a really hot partition/zone.
Finally, you can also check the Performance and scale test example and DocumentDB performance tips to work on improving performance.
If you are using partitioning because you have a lot of data but expect queries to return one or a few records only based on the geospatial criteria alone then something like country may work as it will cut out a lot of irrelevant data immediately and indexes within the partition will allow the required documents to be found quickly. This will likely cause irregular partition sizes though - imagine if Russia and China end up in the same partition.
If, however, your queries will return a lot of documents based on the geospatial criteria and you wish to either extract all of those records or apply further filtering or other functions over them then you will want to spread that processing over as many partitions as possible. In this case you want a partition key that will spread data evenly over the partitions. If you expect queries to combine multiple document types for the same coordinates, user id or site id etc then it is best to have a key based on that value so that all related documents can be processed together within the same partition.
In practical applications I have found using an incrementing value as partition key to be the best general purpose solution as it allows queries to be processed evenly over all partitions.
In Google App Engine Datastore HRD in Java,
We can't do joins and query multiple table using Query object or GQL directly
I just want to know that my idea is correct approach or not
If We build Index in Hierarchical Order Like Parent - Child - Grand child by node
Node
- Key
- IndexedProperty
- Set
In case if we want to collect all the sub child's & grand child's. We can collect all the keys which are matching within the hierarchy filter condition and provide the result of keys
and In Memcache we can hold each key and pointing to DB entity, if the cache does not have also in a single query using set of keys we can get all the records from DB.
Pros
1) Fast retrieval - Google recommends using get entities by keys.
2) Single Transaction is enough to collect multiple table data.
3) Memcache and Persistent Datastore will represent the same form.
4) It will scan only the related data to the group like user or parent node.
Cons
1) Meta Data of the DB size will increase so the DB size increase.
2) If the Index of the Single Parent is going to take more than 1MB then we have to split and Save as blob in the DB.
This structure is good approach or not.
In case If we have long deeper levels in the hierarchy, this will solve running lot of query operation to collect all the items dependent to parents.
In case of multiple parents -
Collect all the Indexes and Get the Keys related to the Query.
Collect all the data in single transactions using list of keys.
If any one found some more Pros or Cons Please add them and justify this approach will correct or not.
Many thanks
Krishnan
There are quite a few things going on here that are important to think about:
Datastore is not a relational database. You definitely should not be approaching your data storage from a tables and join perspective. It will lead to a messy and most likely inefficient setup.
It seems like you are trying to restructure your use of Datastore to provide complete transactional and strongly consistent use of your data. The reason Datastore cannot provide this natively is that it is too inefficient to provide these guarantees along with high availability.
With the Datastore, you want to be able to provide the ability to support many (thousands, hundreds of thousands, millions, etc) writes per second to different entities. The reason that the Datastore provides the notion of an entity group is that it allows the developer to specify a specific scope of consistency.
Consider an example todo tracking service. You might define a User and a Todo kind. You wouldn't want to provide strong consistency for all Todos, since every time a user adds a new note, the underlying system would have to ensure that it was put transactionally with all other users writing notes. On the other hand, using entity groups, you can say that a single User represents your unit of consistency. This means that when a user writes a new note, this has to be updated transactionally with any other modification to that user's notes. This is a much better unit of consistency since as your service scales to more users, they won't conflict with each other.
You are talking about creating and managing your own indexes. You almost certainly don't want to do this from an efficiency point of view. Further, you'd have to be very careful since it seems you would have a huge number of writes to a single entity / range of entities which represent your table. This is a known Datastore anti-pattern.
One of the hard parts about the Datastore is that each project may have very different requirements and thus data layout. There is definitely not one size fits all for how to structure your data, but here are some resources:
What actually happens when you do a write to Datastore
How Datastore stores data
Datastore Entity relationship modeling
Datastore transaction isolation
I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.
I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:
What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.
Relational
Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:
Fields: id fk_to_device fk_to_metric metric_value timestamp
When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:
SELECT metric_value, timestamp FROM data_instances
WHERE fk_to_device=1 AND fk_to_metric=2
The number of rows in this table would be:
d * m_d * f * t
where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.
For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.
Indexes
Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.
Non-Relational (NoSQL)
MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.
I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.
Undecided
Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?
Definitely Relational. Unlimited flexibility and expansion.
Two corrections, both in concept and application, followed by an elevation.
Correction
It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).
Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:
(Device, Metric, DateTime)
The Id column does nothing, it is totally and completely redundant.
An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.
You can get rid of it. Please.
Elevation
Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.
(I have one index only, not three; on the Non-SQLs you may need three indices).
I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.
(Server, Device, Metric, DateTime)
The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.
Monitor Statistics Data Model.
(Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)
It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.
One More Thing
Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.
Found very interesting the above answers.
Trying to add a couple more considerations here.
1) Data aging
Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:
1-sec raw samples for a short period (e.g. for 24 hours)
5-min detail aggregate samples for a medium period (e.g. 1 week)
1-hour detail over that (e.g. up to 1 year)
Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:
automated data purging (see Redis' EXPIRE command)
multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)
2) Real-time collection
Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).
You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.
Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.
NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.
Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).
Create a file, name it 1_2.data. weired idea? what you get:
You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
You save up even more space because you don't need any indices.
Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.
if you like it even more optimized start thinking about splitting your files like that;
1_2_january2014.data
1_2_february2014.data
1_2_march2014.data
or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.
There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru
You should look into Time series database. It was created for this purpose.
A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).
Popular example of time-series database InfluxDB
I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage.
Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc.
Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location).
For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)
I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.
As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO):
You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series
5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.
I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.