Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
Does anybody know how data in Google Analytics is organized? Difficult selection from large amounts of data they perform very-very fast, what structure of database is it?
AFAIK Google Analytics is derived from Urchin. As it has been said it is possible that since now Analytics is part of the Google family it is using MapReduce/BigTable. I can assume that Google had integrated the old format of Urchin DB with the new BigTable/MapReduce.
I found this links which talk about Urchin DB. Probably some of the things are still in use at the moment.
http://www.advanced-web-metrics.com/blog/2007/10/16/what-is-urchin/
this says:
[snip] ...still use a proprietary database to store reporting data, which makes ad-hoc queries a bit more limited, since you have to use Urchin-developed tools rather than the more flexible SQL tools.
http://www.urchinexperts.com/software/faq/#ques45
What type of database does Urchin use?
Urchin uses a proprietary flat file database for report data storage. The high-performance database architecture handles very high traffic sites efficiently. Some of the benefits of the data base architecture include:
* Small database footprint approximately 5-10% of raw logfile size
* Small number of database files required per profile (9 per month of historical reporting)
* Support for parallel processing of load-balanced webserver logs for increased performance
* Databases are standard files that are easy to back up and restore using native operating system utilitiesv
More info about Urchin
http://www.google.com/support/urchin45/bin/answer.py?answer=28737
Long time ago I used to have a tracker and on their site they were discussing about data normalization: http://www.2enetworx.com/dev/articles/statisticus5.asp
There you can find a bit of info of how to reduce the data in DB and maybe it is a good start in research.
BigTable
Google Publication: Chang, Fay, et al. "Bigtable: A distributed storage system for structured data."ACM Transactions on Computer Systems (TOCS) 26.2 (2008):
Bigtable is used by more than sixty Google products and projects,
including Google Analytics, Google Finance, Orkut, Personalized
Search, Writely, and Google Earth.
I'd assume they use their 'Big Table'
I can't know exactly how they implement it.
But because I've made a product that extracts non-sampled, non-aggregated data from Google Analytics I have learned a thing or two about the structure.
I makes sense that the data is populated via BigTable.
BT offers localization data awareness and map/reduce querying across n-nodes.
Distinct counts
(Whether a data service can provide distinct counts or not is a simple measure of flexibility of a data model - but it's typically also a measure of cost and performance)
Google Analytics is not built to do distinct counts even though GA can count users across almost any dimension - but it can't count e.g. Sessions per ga:pagePath?
How so...
Well they only register a session with the first pageView in a session.
This means that we can only count how many landingpages that have had a session.
We have no count for all the other 99% of pages on your site. :/
The reason for this is that Google made the choice NOT to count discount counts at all. It simply doesn't scale well economically when serving millions of sites for free.
They needed an approach where they could avoid counting distinct. Distinct count is all about sorting, grouping lists of ids for every cell in data intersection.
But...
Isn't it simple to count the distinct number of session on a ga:pagePath value?
I'll answer this in a bit
The User and data partitioning
The choice they made was to partition data on users (clientIds or userIds)
Because when they know that clientId/userId X is only present in a certain table in BT, they can run a map/reduce function that counts users and they don't have to be concerned that the same user is present in another dataset and be forced to store all clientIds/userIds in a list - group them - and then count them - distinct.
Since the current GA tracking script is called Universal Analytics they have to be able to count users correct. Especially when focusing on cross-device tracking.
OK, but how does this affect session count?
You have a set of users, each having multiple sets of sessions each having a list of page hits.
When counting within a specific session looking for a pagePaths, you will find the same page multiple times but you will not count the page more than once.
You need to write down you've already seen this page before.
When you have traversed all pages within that session you need only count the session once per page. This procedure requires a state/memory. And since the counting process is probably done in parallel on the same server. You can't be sure that a specific session is handled by the same process. Which makes the counting even more memory consuming.
Google decided not to chase that rabit any longer and just ignore that the session count is wrong for pagePath and other hit scoped dimensions.
"Cube" storage
The reason I write "cube" is that I don't know exactly if they use traditional a OLAP cube structure, but I know they have up to 100 cubes populated for answering different dimension/metric combinations.
By isolation/grouping dimensions in smaller cubes, data won't explode exponentially like it would if they put all data in a single cube.
The drawback is that not all data combinations are allowed. Which we know is true.
E.g. ga:transactionId and ga:eventCategory can't be queried together.
By choosing this structure the dataset can scale well economical and performance-wise
Many places and applications in the Google portfolio use the MapReduce algorithm for storage and processing of large quantities of data.
See the Google Research Publications on MapReduce for further information and also have a look at page 4 and page 5 of this Baseline article.
Google analytics runs on 'Mesa: Geo-Replicated, Near Real-Time, Scalable DataWarehousing'.
https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf
"Mesa is a highly scalable analytic data warehousing systemthat stores critical measurement data related to Google’sInternet advertising business."
Related
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I'm currently receiving 2000 prices per second from a stock exchange and need to save those in an appropriate database. My current choice is PostgresQL which is way too slow. I need to save those prices (ticks) in an aggregated form like OHLC. So if I want to save D1 data for instance, I need to first get the previous D1 record for the stock from the database, check if the high or low price has changed and set a new close price and then save it to the database again. This is taking forever and is not possible with Postgres. I don't want to save the OHLC data, I prefer querying (aggregating) those in real-time.
So my requirements are:
persistance
fast writes (currently 2k per second, up to 10k)
queries, e.g. aggregating OHLC data in real-time (50-100 per second)
adoptable to any modern programming language without writing raw queries (SDK for Python or JS for that database)
deployable on AWS or GCP without hassle
I was thinking about Apache Cassandra. I'm not familiar with Cassandra, are powerful queries like OHLC one possible? Are there any alternatives to Cassandra?
Thanks in advance!
Given what I've understood from your question, I believe Cassandra should easily fit your use-case.
Regarding your requirements:
persistence : Cassandra will not only persist your data but also cover redundancy with minimal configuration;
fast writes : this is what Cassandra is most optimized for and while the exact throughput depends on a lot of factors, in general Cassandra will manage writes measured in the thousands/sec/core; Also, the eventual number o writes is not really relevant as Cassandra can scale linearly with no real penalty so 5k,10k, 100k or more are all doable;
adaptability : Cassandra has official drivers for the most common languages(Python, C family, NodeJs, Java, Ruby, PHP, Scala) as well as community developed ones for more languages (list of divers);
deployable : It's very easy to deploy in the cloud. You can chose to deploy it manually on independent instances or maybe use a managed Cassandra cluster (AWS has one, it's called 'AWS Keyspaces', Datastax(the company driving most of the development behind Cassandra) has one called 'Astra' and there are even more possible solutions. Given that Cassandra is one of the major players when it comes to big-data storage finding a place for you DB in the cloud should be easy.
I have only mentioned 4 of the 5 requirements. That is because when talking about reading, things get more complex and a larger discussion is needed.
500-100 reads/s given the 2k+ writes per second seem to be in line with the general idea of Cassandra being optimized for write intensive tasks. In Cassandra the way you will model your tables will dictate how well things can work. For a task like you have described my first thoughts are:
You bucket each stock per day => you get a partition with around 30k rows (1 update/s for 8 trading hours) and a size of under 0.2MB (30k * 4B). This would be well within the recommended values and clearly under the worst case scenario ones;
when you need the aggregated data you have 2 options:
2a. You read the partition as is and aggregate it application side (what I would recommend);
2b. You implement an "User-Defined Aggregate" function on your database that will do the work (docs). This should be doable although I won't guarantee it. Apart from being harder to implement, the problem is that putting this kind of extra workload on the DB might not be want you want given your apparent use-case. Let me explain: I'd expect your reading load to be most active during certain times, (before, during and after trading hours) with times when the load is lighter. Depending on your architecture, you could have multiple application instances up during peak times, and then scale them back during off-peak in order to lower costs. While applications can be easily scaled up and down on cloud providers like AWS and GC. Cassanadra cannot be scaled up and down like this (5 nodes in the morning, 3 in the night and so on)(well it could but it's not designed to and would be a terrible decision). So moving as much of the non-constant workload to the application seems the best idea;
(Optional) have a worker that at the end of the day/trading day will aggregate the values for each stock and save them to another table so that when looking at historic data it will be easier. This data could even be bucketed by week, month or even year depending on how much space the aggregated data takes.
You could also add Spark and Kafka in front of Casandra for a more powerful approach to the real-time aggregation but we should't deviate that much from the question at hand.
Cassandra is very powerful with the right modeling and the right architecture. At first glance what you need seems to be a good fit for Cassandra however as powerful as it can be, as bad as it can get if you use it in ways it wasn't designed to. I hope this answer puts you on a path into making the right decision.
Cheers.
Reaching out to the community to pressure test our internal thinking.
We are building a simplified business intelligence platform that will aggregate metrics (i.e. traffic, backlinks) and text list (i.e search keywords, used technologies) from several data providers.
The data will be somewhat loosely structured and may change over time with vendors potentially changing their response formats.
Data volume may be long term 100,000 rows x 25 input vectors.
Data would be updated and read continuously but not at massive concurrent volume.
We'd expect to need to do some ETL transformations on the gathered data from partners along the way to the UI (e.g show trending information over the past five captured data points).
We'd want to archive every single data snapshot (i.e. version it) vs just storing the most current data point.
The persistence technology should be readily available through AWS.
Our assumption is our requirements lend themselves best towards DynamoDB (vs Amazon Neptune or Redshift or Aurora).
Is that fair to assume? Are there any other questions / information I can provide to elicit input from this community?
Because of your requirement to have a schema-less structure, and to version each item, DynamoDB is a great choice. You will likely want to build the table as a composite Partition/Sort key structure, with the Sort key being the Version, and there are several techniques you can use to help you locate the 'latest' version etc. This is a very common pattern, and with DDB Autoscaling you can ensure that you only provision the amount of capacity that you actually need.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
Following the Prometheus webpage one main difference between Prometheus and InfluxDB is the usecase: while Prometheus stores time series only InfluxDB is better geared towards storing individual events. Since there was some major work done on the storage engine of InfluxDB I wonder if this is still true.
I want to setup a time series database and apart from the push/push model (and probably a difference in performance) I can see no big thing which separates both projects. Can someone explain the difference in usecases?
InfluxDB CEO and developer here. The next version of InfluxDB (0.9.5) will have our new storage engine. With that engine we'll be able to efficiently store either single event data or regularly sampled series. i.e. Irregular and regular time series.
InfluxDB supports int64, float64, bool, and string data types using different compression schemes for each one. Prometheus only supports float64.
For compression, the 0.9.5 version will have compression competitive with Prometheus. For some cases we'll see better results since we vary the compression on timestamps based on what we see. Best case scenario is a regular series sampled at exact intervals. In those by default we can compress 1k points timestamps as an 8 byte starting time, a delta (zig-zag encoded) and a count (also zig-zag encoded).
Depending on the shape of the data we've seen < 2.5 bytes per point on average after compactions.
YMMV based on your timestamps, the data type, and the shape of the data. Random floats with nanosecond scale timestamps with large variable deltas would be the worst, for instance.
The variable precision in timestamps is another feature that InfluxDB has. It can represent second, millisecond, microsecond, or nanosecond scale times. Prometheus is fixed at milliseconds.
Another difference is that writes to InfluxDB are durable after a success response is sent to the client. Prometheus buffers writes in memory and by default flushes them every 5 minutes, which opens a window of potential data loss.
Our hope is that once 0.9.5 of InfluxDB is released, it will be a good choice for Prometheus users to use as long term metrics storage (in conjunction with Prometheus). I'm pretty sure that support is already in Prometheus, but until the 0.9.5 release drops it might be a bit rocky. Obviously we'll have to work together and do a bunch of testing, but that's what I'm hoping for.
For single server metrics ingest, I would expect Prometheus to have better performance (although we've done no testing here and have no numbers) because of their more constrained data model and because they don't append writes to disk before writing out the index.
The query language between the two are very different. I'm not sure what they support that we don't yet or visa versa so you'd need to dig into the docs on both to see if there's something one can do that you need. Longer term our goal is to have InfluxDB's query functionality be a superset of Graphite, RRD, Prometheus and other time series solutions. I say superset because we want to cover those in addition to more analytic functions later on. It'll obviously take us time to get there.
Finally, a longer term goal for InfluxDB is to support high availability and horizontal scalability through clustering. The current clustering implementation isn't feature complete yet and is only in alpha. However, we're working on it and it's a core design goal for the project. Our clustering design is that data is eventually consistent.
To my knowledge, Prometheus' approach is to use double writes for HA (so there's no eventual consistency guarantee) and to use federation for horizontal scalability. I'm not sure how querying across federated servers would work.
Within an InfluxDB cluster, you can query across the server boundaries without copying all the data over the network. That's because each query is decomposed into a sort of MapReduce job that gets run on the fly.
There's probably more, but that's what I can think of at the moment.
We've got the marketing message from the two companies in the other answers. Now let's ignore it and get back to the sad real world of time-data series.
Some History
InfluxDB and prometheus were made to replace old tools from the past era (RRDtool, graphite).
InfluxDB is a time series database. Prometheus is a sort-of metrics collection and alerting tool, with a storage engine written just for that. (I'm actually not sure you could [or should] reuse the storage engine for something else)
Limitations
Sadly, writing a database is a very complex undertaking. The only way both these tools manage to ship something is by dropping all the hard features relating to high-availability and clustering.
To put it bluntly, it's a single application running only a single node.
Prometheus has no goal to support clustering and replication whatsoever. The official way to support failover is to "run 2 nodes and send data to both of them". Ouch. (Note that it's seriously the ONLY existing way possible, it's written countless times in the official documentation).
InfluxDB has been talking about clustering for years... until it was officially abandoned in March. Clustering ain't on the table anymore for InfluxDB. Just forget it. When it will be done (supposing it ever is) it will only be available in the Enterprise Edition.
https://influxdata.com/blog/update-on-influxdb-clustering-high-availability-and-monetization/
Within the next few years, we will hopefully have a well-engineered time-series database that is handling all the hard problems relating to databases: replication, failover, data safety, scalability, backup...
At the moment, there is no silver bullet.
What to do
Evaluate the volume of data to be expected.
100 metrics * 100 sources * 1 second => 10000 datapoints per second => 864 Mega-datapoints per day.
The nice thing about times series databases is that they use a compact format, they compress well, they aggregate datapoints, and they clean old data. (Plus they come with features relevant to time data series.)
Supposing that a datapoint is treated as 4 bytes, that's only a few Gigabytes per day. Lucky for us, there are systems with 10 cores and 10 TB drives readily available. That could probably run on a single node.
The alternative is to use a classic NoSQL database (Cassandra, ElasticSearch or Riak) then engineer the missing bits in the application. These databases may not be optimized for that kind of storage (or are they? modern databases are so complex and optimized, can't know for sure unless benchmarked).
You should evaluate the capacity required by your application. Write a proof of concept with these various databases and measures things.
See if it falls within the limitations of InfluxDB. If so, it's probably the best bet. If not, you'll have to make your own solution on top of something else.
InfluxDB simply cannot hold production load (metrics) from 1000 servers. It has some real problems with data ingestion and ends up stalled/hanged and unusable. We tried to use it for a while but once data amount reached some critical level it could not be used anymore. No memory or cpu upgrades helped.
Therefore our experience is definitely avoid it, it's not mature product and has serious architectural design problems. And I am not even talking about sudden shift to commercial by Influx.
Next we researched Prometheus and while it required to rewrite queries it now ingests 4 times more metrics without any problems whatsoever compared to what we tried to feed to Influx. And all that load is handled by single Prometheus server, it's fast, reliable, and dependable. This is our experience running huge international internet shop under pretty heavy load.
IIRC current Prometheus implementation is designed around all the data fitting on a single server. If you have gigantic quantities of data, it may not all fit in Prometheus.
So I just started messing around with Google BigQuery about 10 minutes ago, and I was wondering if anyone is aware of the underlying architecture that they're using to store the data? For example, is this just the next generation of their own BigTable infrastructure?
Also, is it clear what sorts of strategies they're using for indexes, index rebuilds, etc? I'm just trying to analyze whether this is mature enough at this point where you can be 100% sure of what's going on with your data end-to-end, or is there a bit of a black box area where "things just work"?
There are no indexes... every query is a table scan. The query architecture is described here.
Your data is stored in a proprietary columnar format called ColumnIO on Colossus (a successor to GFS). Colossus replicates the data within a datacenter and your data is also replicated to other geographic regions to make sure it stays available even if a Google datacenter goes offline.
To answer your specific questions
While data may be temporarily stored in Bigtable, all data is stored long-term in Colossus (for now!).
New data added to bigquery is encrypted at rest (that is, whenever it is written out to permanent storage). It is also encrypted when sent over the network.
As mentioned, no indexes, so there are no strategies for rebuilding the index. Depending on how you add data to your table, your table may be coalesced, which means rewriting the underlying files in a more efficient manner.
Colossus underlies a massive amount of Google data across a wide range of services, ColumnIO is a standard throughout Google. I would call both of these technologies mature.
However, you should also consider it a black box. All of the details here may change as storage systems at Google mature or architectures change. However, it should always "just work" (within SLA caveats, of course)
If you're interested in more details about how BigQuery works under the covers or how to use it effectively, here is a shameless plug for our book on the subject which is due out in June.
I alway wondered how could a very big site like facebook to be faster than any other sites ,though the very big large amount of data which stored everyday ..
what they are using to store information and if I use sql server to store e.g news feed is that ok or what (the news feed will be stored in a separate table which called News) .
in the other hand what could happen if I joined many huge tables with each other - it should be slow (maybe) or it doesn't matter how big the table is !?
thanx :)
When you talk about scaling at the size of Facebook, is a whole different ball park. Latest estimates put Facebook datacenter at about 60000 servers (sixty thousand). Only the cache is estimated to be at about 30 TB (terabytes) ina a masive Memcached cluster. Although their back end is stil MySQL, is used as a pure key-value store, according to publicly available information:
Facebook uses MySQL, but
primarily as a key-value persistent
storage, moving joins and logic onto
the web servers since optimizations
are easier to perform there (on the
“other side” of the Memcached layer).
There are various other technologies in use there:
HipHop to compile PHP into native code
Haystack for media (photo) storage
BigPipe for HTTP delivery
Cassandra for Inbox search
You can also watch this year SIGMOD 2010 key address Building Facebook: Performance at big scale. They even present their basic internal API:
cache_get ($ids,
'cache_function',
$cache_params,
'db_function',
$db_params);
So if you connect the dots you'll see that at such scale you no longer talk about a 'big database'. You talk about huge clusters of services, key-value storage partitioned across thousands of servers, many technologies used together and so on and so forth.
As a side note, you can also see a pretty good presentation of MySpace internals. Although the technology stack is completely different (Microsoft .Net and SQL Server based, with a huge emphasis on message passing via Service Broker) there are similar points in how they approach storage. To sum up: application layer partitioning.
It depends, Facebook is very fast because they have a server farm, so queries are optimised and each single query hits many servers.
In regards to huge tables, they can be fast as long as you have enough physical memory to index whatever you need to search on. Having correct index's can improve database performance hugely (When it comes to retrieving data).
As long as it makes sense to join many huge tables together into one then yes, but if they're separate, and not related then no. If you provide more details on what kind of tables you would be looking to merge, we might be able to help you more.
According to link text and other pages Facebook uses a technique called Sharding.
It simply uses a bunch of databases with a small portion of the site on each database. A simple algorithm for deciding which database to use could be using the first letter in the username as an index for the database. One database for 'a', one for 'b', etc. I'm sure Facebook has a more advanced scheme than that, but the principle is the same.
The result is many small independent databases that are small enough to handle the load. Facebook and all other major sites has all sorts of similar tricks to make the sites fast and responsive.
They continuously monitor the sites for performance and other metrics and come up with solutions to the issues the find.
I think the monitoring part is more important to the performance success than the actual techniques used to gain the performance. You can not make a fast site by blindly throw some "good performance spells" at it. You have to know where and why you have bottlenecks before you can remove them.
Depends what the performance bottleneck is. One problem is often using the wrong technology for the problem, eg using a relational DB when an object DB or document store would be better, or vice versa of course.
Some people try and use the same DB for everything which is not always the answer. Sometimes it is useful to have multiple denormalizations of the same data for different purposes.
Thinking about the nature of the data and how it is written, read, queried etc is important. You can put all write-once data in one DB and optimize that db for that. Other data that is written frequently could be stored on a db optimized for that.
Distribution techniques can also assist with upscaling.