BigTable aggregation data - database

I have been trying to use BigTable with the BigQuery connector. When I tested query performance on 1 million rows, the query took about 50 seconds.
My SQL:
SELECT
  DATE(geo_table_cell.timestamp) AS day,
  geo_table_cell.value,
  COUNT(*) AS countNumber
FROM
  `project-dev.project_dev_bt_eu.dev-project`,
  UNNEST(geo.COLUMN) AS geo_table,
  UNNEST(geo_table.cell) AS geo_table_cell
WHERE
  geo_table.name = 'cc'
  AND rowkey LIKE 'profile%'
GROUP BY
  geo_table_cell.value,
  DATE(geo_table_cell.timestamp)
My questions are:
What is the best solution for aggregating data from BigTable? (The same aggregation in ElasticSearch takes less than 2 seconds.)
Why is BigQuery so slow with the BigTable connector?
If I understand correctly, BigTable is not a good choice for presenting data on dashboards (filters work very slowly).

1.- If query speed is a must, loading the data into BigQuery instead of setting up an external data source would be the most efficient approach.
Nevertheless, there are some things you can do to improve BigQuery or BigTable performance.
2.- This connector is still in Beta and has some performance considerations. Keep in mind, too, that BigTable is a NoSQL (non-relational) database and is not intended for SQL queries.
In case you are exploring the data model you want to use in your application, I recommend you consider all these options and choose the one that best fits your needs.
3.- I would say it is not a good choice if you want to query your data using SQL. Given the non-relational architecture of BigTable, the most effective way to read your data is to send read requests. You can find code samples for this, in different languages, in the official documentation.
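As a rough illustration of aggregating on the client side after a read request, here is a minimal Python sketch. It does not call the Bigtable client library: the `cells` list is a hypothetical stand-in for the cells a prefix read of the 'cc' column might return, and the function name, tuple layout and sample data are assumptions made for this example (timestamps are taken to be microseconds since the epoch).

```python
from collections import Counter
from datetime import datetime, timezone

def count_by_day_and_value(cells, rowkey_prefix="profile"):
    """Count cells per (day, value), mirroring the GROUP BY in the question.

    `cells` is an iterable of (rowkey, timestamp_micros, value) tuples;
    this layout is an assumption, not the Bigtable client's row format.
    """
    counts = Counter()
    for rowkey, timestamp_micros, value in cells:
        # Equivalent of: rowkey LIKE 'profile%'
        if not rowkey.startswith(rowkey_prefix):
            continue
        seconds = timestamp_micros // 1_000_000
        day = datetime.fromtimestamp(seconds, tz=timezone.utc).date()
        counts[(day.isoformat(), value)] += 1
    return counts

# Hypothetical sample data standing in for the result of a read request.
cells = [
    ("profile#1", 1_600_000_000_000_000, "US"),  # 2020-09-13 UTC
    ("profile#2", 1_600_000_100_000_000, "US"),  # same day
    ("profile#3", 1_600_086_400_000_000, "DE"),  # one day later
    ("session#1", 1_600_000_000_000_000, "US"),  # filtered out by the prefix
]
counts = count_by_day_and_value(cells)
for (day, value), n in sorted(counts.items()):
    print(day, value, n)
```

In a real application the prefix filter would be pushed into the read request itself so that only matching rows leave the server; the loop above only shows the grouping step.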

Related

Will/can Solr, Elasticsearch and Kibana outperform SQL Server's cube technologies?

This trio of products came up as an alternative to SQL Server for searching and presenting analytics over a survey-based pattern of about 100 million data points. A survey pattern is basically questions x answers x forms x studies, and in our case it is very QA-oriented, about how people did their jobs. About 7% of our data points cannot be quantified because they are comments.
So, can this community envision (perhaps provide a link to a success story) leveraging these products for slicing and dicing metrics (via drag and drop) along with comments over 100 million data points and outperforming SQL Server? Our metrics can be $'s, scores, counts or hours, depending on the question. We have at least two hierarchies, one over people and the other over depts. Both are temporal in that, depending on the date, they have different relationships (aka changing dimensions). In all there are about 90 dimensions for each data point, depending on how you count the hierarchy levels.
You can't compare a SQL engine with Elasticsearch/Solr.
It depends on how you want to query it: joins or not, full-text search or not, etc.
Like Thomas said, it depends. It depends on your data and how you want to query it. In general, for text-oriented data NoSQL will be better and provide more functionality than SQL. However, if I understand correctly, only 7% of your data is text-focused (the comments), so I assume the rest is structured.
In terms of performance, it depends what kind of text analysis you want to do and what kind of queries you're wanting to recreate. For example, joining is usually much simpler and quicker in SQL than its non-relational equivalent. You could set up a basic Solr instance, recreate some of your text related SQL queries in Solr SQL equivalents, and see how it performs on your data in comparison.
While overall, NoSQL is usually touted as better at scaling, it's highly dependent on your data and requirements as to which of the two would be better in certain situations.

What type of database is suited for realtime aggregated operations on millions of rows

I need to store 15-30 million rows of data. Most of the queries will be group by operations (aggregations). I'm currently using Teradata as the database backend, but the response time is not real-time (some queries take about 30 seconds). I was looking into Cassandra as a substitute, but in some documentation I found that if there are group by operations, then Cassandra is not the best option.
What database would be best suited to my use case, given that a maximum of 100 users will use the application at a time (along with data updates happening in parallel)? Can any traditional RDBMS handle this kind of requirement?
Any help would be appreciated. Thanks in advance.
Cassandra itself is not so good for aggregation; consider Cassandra + Storm/Spark.
Teradata is designed to handle very large datasets with parallelism in mind and should scale mostly linearly. In other words, add more horsepower to your resource-bound queries and get better performance.
What bottlenecks do you have with your current 30-second queries? Can you post a sample query with an EXPLAIN to look at? It could be that a quick optimization will speed it up -- statistics, index selection, join indexes, PPI (table partitioning), etc.

Google Bigtable vs BigQuery for storing large number of events

Background
We'd like to store our immutable events in a (preferably) managed service. The average size of one event is less than 1 KB and we have between 1 and 5 events per second. The main reason for storing these events is to be able to replay them (perhaps using table scanning) once we create future services that might be interested in these events. Since we're in the Google Cloud we're obviously looking at Google's services as first choice.
I suspect that Bigtable would be a good fit for this, but according to the price calculator it'll cost us more than 1400 USD per month (which to us is a big deal).
Looking at something like BigQuery gives a price of 3 USD per month (if I'm not missing something essential).
Even though a schema-less database would be better suited for us we would be fine with essentially storing our events as a blob with some metadata.
Questions
Could we use BigQuery for this instead of Bigtable to reduce costs? For example BigQuery has something called streaming inserts which to me seems like something we could use. Is there anything that'll bite us in the short or long term that I might not be aware of if going down this route?
Bigtable is great for large (>= 1TB) mutable data sets. It has low latency under load and is managed by Google. In your case, I think you're on the right track with BigQuery.
FYI
Cloud Bigtable is not a relational database; it does not support SQL queries or joins, nor does it support multi-row transactions.
Also, it is not a good solution for small amounts of data (< 1 TB).
Consider these cases:
- If you need full SQL support for an online transaction processing (OLTP) system, consider Google Cloud SQL.
- If you need interactive querying in an online analytical processing (OLAP) system, consider Google BigQuery.
- If you need to store immutable blobs larger than 10 MB, such as large images or movies, consider Google Cloud Storage.
- If you need to store highly structured objects, or if you require support for ACID transactions and SQL-like queries, consider Cloud Datastore.
The overall cost boils down to how often you will query the data. If it's a backup and you don't replay events too often, it'll be dirt cheap. However, if you need to replay it once daily, you can easily start running up the $5/TB scanned charge. We were also surprised by how cheap inserts and storage were, but this is of course because Google expects you to run expensive queries on the data at some point. You'll have to design around a few things, though. E.g., AFAIK streaming inserts have no guarantee of being written to the table, and you have to poll the tail of the table frequently to see whether a row was really written. Tailing can be done efficiently with a time-range table decorator, though (so you don't pay for scanning the whole dataset).
If you don't care about order, you can even list a table for free. No need to run a 'query' then.
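To put rough numbers on this, here is a back-of-envelope sketch in Python. It uses only the figures from this thread (at most 1 KB per event, up to 5 events per second, $5 per TB scanned); the decimal terabyte and the price itself are assumptions that may not match current BigQuery billing, and the function names are made up for the example.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12  # decimal terabyte; an assumption about the billing unit

def stored_bytes(events_per_second, event_bytes, days):
    """Total bytes accumulated over `days` at a constant event rate."""
    return events_per_second * event_bytes * SECONDS_PER_DAY * days

def full_scan_cost_usd(total_bytes, usd_per_tb=5.0):
    """Cost of one query that scans `total_bytes`, at an assumed $/TB rate."""
    return total_bytes / BYTES_PER_TB * usd_per_tb

# Upper bound from the question: 5 events/s of 1,000 bytes each, for one year.
year_bytes = stored_bytes(5, 1_000, 365)
one_replay = full_scan_cost_usd(year_bytes)
print(f"{year_bytes / 10**9:.2f} GB stored, {one_replay:.4f} USD per full scan")
```

Under these assumptions, a year of events is about 158 GB and a single full scan of it costs roughly 0.79 USD; replaying the whole table every day multiplies that by the number of scans.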
This flowchart may help in deciding between the different Google Cloud storage offerings (disclaimer: I copied this image from Google Cloud's page).
If your use case is a live database (let's say, the backend of a website), BigTable is what you need (though it's still not really an OLTP system). If it is more of a data analytics / data warehouse kind of purpose, then BigQuery is what you need.
Think of OLTP vs OLAP; Or if you are familiar with Cassandra and Hadoop, BigTable roughly equates to Cassandra, BigQuery roughly equates to Hadoop (Agreed, not a fair comparison, but you get the idea)
https://cloud.google.com/images/storage-options/flowchart.svg
Please keep in mind that Bigtable is not a relational database; it's a NoSQL solution without SQL features like JOIN. If you want an RDBMS for OLTP, you might need to look at Cloud SQL (MySQL/PostgreSQL) or Spanner.
Cloud Spanner is relatively young, but is powerful and promising. At least, Google marketing claims that its features are the best of both worlds (traditional RDBMS and NoSQL).
Cost Aspect
The cost aspect is already covered nicely here: https://stackoverflow.com/a/34845073/6785908
I know this is a very late answer, but I'm adding it anyway in case it helps somebody else in the future.
Hard to summarize better than it is already done by Google.
I think you need to figure out how you are going to use (replay) your data (events), and this can help you in making the final decision.
So far, BigQuery looks like the best choice for you.
Bigtable is a distributed (runs on clusters) database for applications that manage massive data. It's designed for massive unstructured data, scales horizontally and is made of column families. It stores data in key-value pairs, as opposed to relational or structured databases.
BigQuery is a data warehouse application. That means it provides connections to several data sources or streams so that data can be extracted, transformed and loaded into BigQuery tables for further analysis. Unlike Bigtable, it does store data in structured tables and supports SQL queries.
Use cases: if you want to do analytics or business intelligence by deriving insights from data collected from different sources (applications, research, surveys, feedback, logs, etc.) in your organisation, you may want to pull all this information into one location. This location will most likely be a BigQuery data warehouse.
If you have an application that collects big data, in other words massive amounts of information (high volume) at high speed (high velocity) and in unstructured, inconsistent forms with different data types such as audio, text, video and images (variety and veracity), then your probable choice of database for this app would be Bigtable.

HBase or Hive - web requests

Is either HBase or Hive a suitable replacement for a traditional (non)relational database? Will they be able to serve web requests from web clients and respond in a timely manner? Are HBase/Hive only suitable for large dataset analysis? Sorry, I'm a noob at this subject. Thanks in advance!
Hive is not at all suitable for any real-time need such as timely web responses. You can use HBase, though. But don't think of either HBase or Hive as a replacement for traditional RDBMSs. Both were meant to serve different needs. If your data is not huge, you are better off with an RDBMS. RDBMSs are still the best choice (if they fit your requirements). Technically speaking, HBase is really more a data store than a database because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers and advanced query languages.
And the most important thing that could strike a newbie is HBase's lack of SQL support, since it belongs to the NoSQL family of stores.
And HBase/Hive are not the only options for handling large datasets. You have several options, like Cassandra, Hypertable, MongoDB, Accumulo, etc. But each one is meant for solving some specific problem. For example, MongoDB is used for handling document data. So you need to analyze your use case first and, based on that, choose the data store that suits your requirements.
You might find this list useful which compares different NoSQL datastores.
HTH
Hive is a data warehouse tool, and it is mainly used for batch processing.
HBase is a NoSQL database which allows random access based on the rowkey (primary key). It is used for transactional access. It doesn't have indexing support, which could be a limitation for your needs.
Thanks,
Dino
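To make the rowkey point concrete, here is a toy Python model of a store sorted by row key. It is not the HBase API: the class, method names and sample rows are hypothetical, and the only purpose is to show why lookups and prefix scans on the key are cheap while filtering on anything else means reading every row.

```python
import bisect

class SortedKeyStore:
    """Toy in-memory stand-in for a store sorted by row key (not HBase itself)."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, rowkey):
        # Random access by row key.
        return self._rows.get(rowkey)

    def prefix_scan(self, prefix):
        # Jump to the first key >= prefix, then read until the prefix stops matching.
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result

    def find_by_value(self, predicate):
        # No secondary index: every row has to be examined.
        return [(k, self._rows[k]) for k in self._keys if predicate(self._rows[k])]

store = SortedKeyStore({
    "user#2": {"city": "Oslo"},
    "user#1": {"city": "Rome"},
    "order#9": {"city": "Oslo"},
})
users = store.prefix_scan("user#")
in_oslo = store.find_by_value(lambda row: row["city"] == "Oslo")
```

This is why row key design matters so much in this family of stores: queries that can be phrased as a key or key-prefix lookup stay fast, and everything else tends toward a full scan.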

Why does Google App Engine restrict GQL queries?

I was reading about App Engine on Wikipedia and came across some GQL restrictions:
JOIN is not supported
can SELECT from at most one table at a time
can put at most 1 column in the WHERE clause
What are the advantages of these restrictions?
Are these restrictions common in other places where scalability is a priority?
The datastore that GQL talks to:
is not a relational database like MySQL or PostgreSQL
is a column-oriented DBMS called BigTable
One reason to have a database like this is to have a very high-performance database that you can scale across hundreds of servers.
GQL is not SQL; it is SQL-like.
Here are some references:
http://en.wikipedia.org/wiki/Column-oriented_DBMS
http://en.wikipedia.org/wiki/BigTable
http://code.google.com/appengine/docs/datastore/overview.html
http://code.google.com/appengine/docs/datastore/gqlreference.html
I believe the answer is in fact to do with the underlying technology of the datastore rather than any kind of restriction on what is available. Google aren't using a relational database under the hood, but BigTable instead; they have just added a nice API which uses SQL-like queries to limit the learning curve for those who are used to a relational database. Those who are more used to ORMs will take to it like a duck to water.
The existing answers do a good job with the high-level question.
One additional note: the third restriction you mention isn't actually true. GQL queries can include as many columns in the WHERE clause as you like. There are a few caveats, but the number of columns is not explicitly limited. More:
http://code.google.com/appengine/docs/python/datastore/queries.html

Resources