I am able to connect Tableau to my database, but the table is really large. Every time I try to load the table into Tableau, it crashes and I cannot find a workaround. The table size varies from 10 million to 400 million rows. How should I approach this issue? Any suggestions?
You don't "load" data into Tableau; you point Tableau at an external data source. Tableau then sends a query to that data source requesting only the summary information (i.e. the query results) needed to create the visualization you designed.
So, for an extreme example, if you place CNT(Number of Records) on the Columns shelf, Tableau will send a short, simple query to the external database asking it to report the number of records, something along the lines of "select count(*) from xxx".
So even if there are billions of rows in the external database, Tableau will send a small amount of information to the database (a query) and receive back a small amount of information (the query results) to display. This allows Tableau to be very fast on its end, and performance depends on how fast the external database can respond to the query. Tuning your database depends on all kinds of factors: type and amount of memory and disk, how indices are set up, etc.
So the first step is to make sure that the database can perform as needed, regardless of Tableau.
That's the purist response. Now for a few messy details. It is possible to design a very complex visualization in Tableau that sends a complex query asking for a very large result set. For instance, you can design a dashboard that draws a dot on the map for every row in the database and then refreshes a large volume of data every time you wave the mouse over the marks on the map.
If you have millions or billions of data rows and you want high performance, don't do that. No user can read 60 million dots anyway, and they certainly don't want to wait for them to be sent over the wire. Instead, first plot aggregate values (min, max, sum, avg, etc.) and then drill down into more detail on demand.
As others suggest, you can use a Tableau extract to offload workload and cache data in a form for fast use by Tableau. An extract is similar to an optimized materialized view stored in Tableau. Extracts are very helpful with speeding up Tableau, but if you want high performance, filter and aggregate your extracts to contain only the data and level of detail needed to support your views. If you blindly make an extract of your entire database, you are simply copying all your data from one form of database to another.
I found a simple solution for optimising Tableau to work with very large datasets (1 billion+ rows): Google BigQuery, which is essentially a managed data warehouse.
Upload data to BigQuery (you can append multiple files into a single table).
Link that table to Tableau as an external data source.
Tableau then sends SQL-like commands to BigQuery whenever a new 'view' is requested. The queries are processed quickly on Google's computing hardware, which then sends a small amount of information back to Tableau.
This method allowed me to visualise a 100 GB mobile call record dataset with ~1 billion rows on a MacBook.
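To make the upload step concrete, here is a minimal sketch of appending daily files into a single BigQuery table with the google-cloud-bigquery Python client. The project, dataset, table, and file names are made up for illustration; adjust them to your setup.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # Appending (rather than truncating) is what lets you merge many files into one table.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

with open("call_records_2020_01.csv", "rb") as f:  # hypothetical local export
    load_job = client.load_table_from_file(
        f, "my-project.telecom.call_records", job_config=job_config
    )
load_job.result()  # block until the load job finishes
```

Once the table exists, you point Tableau at it through its BigQuery connector and let BigQuery do the heavy lifting, as described above.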
There are two ways to interpret this question:
The data source (which might be a single table, a view, etc.) has 10M to 400M rows and Tableau is crashing at some point in the load process. In that case, I suggest you contact Tableau tech support. They really like to hear about situations like that and to help people through them.
You are trying to create a visualization (such as a text table or crosstab) that results in 10M to 400M displayed rows. In that case, you're in territory that Tableau isn't designed for. A text table with 10M rows is not going to be useful for much of anything other than exporting to something else, and for that there are better tools than Tableau (such as the export/import tools built into most databases).
Not really sure what your use case is, but I find it unlikely that you need all that data for one Tableau view.
You can pare down / aggregate the data using a view in the database or custom SQL in your Tableau connection. Also, try to use extracts rather than live database connections, as they will perform faster.
I like to use views in the database and then use those views to refresh my Tableau extracts on Tableau Server.
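For illustration, here is a hedged sketch of that approach: a pre-aggregated view created from Python over a generic ODBC connection (pyodbc), which a Tableau extract can then be pointed at. The table, columns, and connection details are hypothetical.

```python
import pyodbc

# Hypothetical connection string; substitute your own driver, server, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=sales;UID=user;PWD=secret"
)
cur = conn.cursor()

# Roll a hypothetical 100M+ row `orders` table up to one row per customer per day,
# which is usually all a dashboard needs.
cur.execute("""
    CREATE VIEW dbo.v_orders_daily AS
    SELECT customer_id,
           CAST(order_ts AS DATE) AS order_date,
           COUNT(*)               AS order_count,
           SUM(amount)            AS total_amount
    FROM   dbo.orders
    GROUP BY customer_id, CAST(order_ts AS DATE)
""")
conn.commit()
```

Pointing the Tableau data source (or the extract refresh) at v_orders_daily instead of the raw table keeps the extract small and the refresh fast.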
For background, I collect API usage logs (request, response, latency, userId, etc.) for an application. A typical day accumulates 200-300 million records. This data is currently stored on S3 in Parquet format, and I use AWS Athena for ad-hoc querying. I'd like to move towards building a web-based dashboard that would display per-customer metrics; an example query would be request volume by customer by hour for the past 6 hours. I'll only need that kind of detailed usage data for the previous 30 days.
Ideally, I continue to utilize the AWS ecosystem for this solution. What I'm trying to determine is a general direction. Can Redshift efficiently compute those types of queries against the raw log data, on the fly, within 1s or so to make it usable on the web? Is there a better tool? Or should I be looking at running ETLs and rollup type operations to generate those metrics, populate a different table (perhaps in redshift) and then use that to serve the dashboard?
Any thoughts, or even suggested readings, are welcome - thanks.
There are a lot of approaches you can take to this kind of problem. I'll try to detail some of the products you could use based on the problem you describe above.
Preprocess anything you can rather than calculating it on the fly. For example, summarise your hourly metrics into a key-value store rather than computing across large numbers of raw records per request. You could store these metrics in DynamoDB and retrieve them efficiently.
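As a rough sketch of that point, assuming a hypothetical DynamoDB table keyed by customer and hour (all names are made up), dashboard reads become cheap key lookups instead of scans over raw logs:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("api_usage_hourly")  # hypothetical table: PK customer_id, SK hour

def write_hourly_metric(customer_id, hour_iso, request_count, avg_latency_ms):
    """Store one pre-aggregated hourly data point (computed upstream by a batch job)."""
    table.put_item(Item={
        "customer_id": customer_id,
        "hour": hour_iso,                 # e.g. "2023-05-01T13:00Z"
        "request_count": request_count,
        "avg_latency_ms": avg_latency_ms,
    })

def read_recent_hours(customer_id, since_iso):
    """Fetch pre-aggregated points for the dashboard with a single key-range query."""
    resp = table.query(
        KeyConditionExpression=Key("customer_id").eq(customer_id) & Key("hour").gte(since_iso)
    )
    return resp["Items"]
```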
Redshift can return data quickly depending on your schema design (distribution keys, sort keys); however, it is not efficient at writing individual transactions, so you will want to load data in bulk per period. Treat it as a near-real-time solution rather than a real-time one.
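If you do go with Redshift, the bulk-load step might look like this hedged sketch: one COPY per hour of Parquet logs from S3, issued over a standard psycopg2 connection. The cluster endpoint, credentials, IAM role, and S3 prefix are placeholders.

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="loader", password="secret",
)
with conn, conn.cursor() as cur:
    # One bulk COPY per period instead of row-by-row INSERTs.
    cur.execute("""
        COPY api_logs
        FROM 's3://my-log-bucket/parquet/2023/05/01/hour=13/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)
```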
Dashboards that require heavy computation but do not need to be live (e.g. hourly or daily stats) can be pre-generated and stored in S3, so they load quickly without hitting the database every time a user views them.
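A short sketch of that pattern, with made-up bucket and key names: compute the stats on a schedule and publish them as a small JSON object the dashboard fetches directly.

```python
import json
import boto3

s3 = boto3.client("s3")

def publish_daily_stats(stats, day):
    """Write pre-computed dashboard stats to S3 so page loads never touch the database."""
    s3.put_object(
        Bucket="my-dashboard-artifacts",          # hypothetical bucket
        Key=f"daily-stats/{day}.json",
        Body=json.dumps(stats),
        ContentType="application/json",
    )

publish_daily_stats({"requests": 231554012, "p95_latency_ms": 87}, day="2023-05-01")
```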
Athena is for querying a data lake; if you use it for large volumes of near-real-time data, it will not be as efficient at getting results back to you. That said, if you use Redshift you can join against your data lake using Redshift Spectrum.
My last couple of questions have been about how to connect to Snowflake and add and read data with the Python connector in an IPython notebook. However, I am having trouble with the next best step to create a report with the data I want to visualize.
I would like to upload all of the data, store it, then analyze it, kind of like a homemade dashboard.
So what I have done so far is a small version:
Stage my data from a local file, and add new data each time I open the notebook.
Then use the Python connector to call any data from storage.
Create visualizations with NumPy objects in the local notebook.
My data will start out very small, but over time I would imagine I would have to move computation to the cloud to minimize the memory used locally for the small dashboard.
My question is this: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
Put the raw data into Snowflake. Use tasks and procedures to aggregate it and store the result. Or better yet, don't do any aggregations except for when you want the data - let Snowflake do the aggregations in real-time off the raw data.
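As a rough sketch of that flow using the Snowflake Python connector (connection details, database/warehouse/table names, and JSON field names are all hypothetical, and the summary table is assumed to already exist): land the raw JSON in a VARIANT column, then let a scheduled task maintain the aggregate.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="LOADER", password="secret",      # placeholders
    warehouse="LOAD_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# Land the raw JSON untouched in a VARIANT column (the ELT approach).
cur.execute("CREATE TABLE IF NOT EXISTS API_EVENTS (payload VARIANT)")
cur.execute("PUT file:///tmp/api_dump_2023_05_01.json @%API_EVENTS")  # hypothetical local dump
cur.execute("COPY INTO API_EVENTS FROM @%API_EVENTS FILE_FORMAT = (TYPE = 'JSON')")

# A scheduled task that rebuilds a monthly summary table inside Snowflake.
cur.execute("""
    CREATE OR REPLACE TASK REFRESH_MONTHLY_SUMMARY
      WAREHOUSE = LOAD_WH
      SCHEDULE = 'USING CRON 0 3 1 * * UTC'
    AS
      INSERT OVERWRITE INTO ANALYTICS.MARTS.MONTHLY_SUMMARY
      SELECT DATE_TRUNC('month', payload:event_ts::timestamp) AS month,
             COUNT(*)                                         AS events,
             AVG(payload:value::float)                        AS avg_value
      FROM   API_EVENTS
      GROUP BY 1
""")
cur.execute("ALTER TASK REFRESH_MONTHLY_SUMMARY RESUME")  # tasks are created suspended
```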
I think what you might be asking is whether you should ETL your data or ELT your data:
ETL: Extract, Transform, Load (in that order) - Extract data from your API. Transform it locally on your computer. Load it into Snowflake.
ELT: Extract, Load, Transform (in that order) - Extract data from your API. Load it into Snowflake. Transform it after it's in Snowflake.
Both ETL and ELT are valid. Many companies use both approaches with Snowflake interchangeably. But Snowflake was built to be, in a sense, your data lake - the idea being, "Just throw all your data up here and then use our awesome compute and storage resources to transform it quickly and easily."
Do a Google search on "Snowflake ELT" or "ELT vs ETL" for more information.
Here are some considerations either way off the top of my head:
Tools you're using: Some tools, like SSIS, were built with ETL in mind - transformation of the data before you store it in your warehouse. That's not to say you can't do ELT with them, but they weren't built with ELT in mind. More modern tools - like Fivetran, or even Snowpipe - assume you're going to land all your data in Snowflake and then transform it once it's up there. I really like the ELT paradigm - i.e. just get your data into the cloud, then transform it quickly once it's up there.
Size and growth of your data: If your data is growing, it becomes harder and harder to manage on local resources. It might not matter while your data is in gigabytes or millions of rows, but as you get into billions of rows or terabytes of data, the scalability of the cloud can't be matched. If you feel like this might happen and you don't think putting it into the cloud is a premature optimization, I'd load your raw data into Snowflake and transform it after it's up there.
Compute and Storage Capacity: Maybe you have a massive amount of storage and compute at your fingertips. Maybe you have an on-prem cluster you can provision resources from at the drop of a hat. Most people don't have that.
Short-Term Compute and Storage Cost: Maybe you have some modest resources you can use today and you'd rather not pay Snowflake while your modest resources can do the job. Having said that, it sounds like the compute to transform this data will be pretty minimal, and you'll only be doing it once a day or once a month. If that's the case, the compute cost will be very minimal.
Data Security or Privacy: Maybe you need to anonymize data before moving it to the public cloud. If this is important to you, you should look into Snowflake's security features; but if you're in an organization where it's super difficult to get a security review and you need to move forward with something, transforming the data on-prem while waiting for the security review is a good alternative.
Data Structure: Do you have duplicates in your data? Do you need access to other data in Snowflake to join on in order to perform your transformations? As you put more and more data into Snowflake, it makes sense to transform it after it's in Snowflake - that's where all your other data is, and you will find it easier to join, query, and transform it there.
I would flatten your data in Python or Snowflake, depending on which you feel more comfortable using and how complex the data is. You can do everything against the straight JSON, although I would rarely design something that way myself (it's going to be the slowest to query).
As far as aggregating the data goes, I'd always do that in Snowflake. If you would like to slice and dice the data in various ways, you might design a data-mart-style data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should handle that well, but for additional speed, pre-aggregating it up to months may be a good idea too.
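For instance, assuming the raw JSON sits in a VARIANT column with a nested records array (table and field names invented for illustration), the flatten-and-aggregate step can run entirely in Snowflake so only the monthly summary comes back to the notebook:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345", user="REPORTER", password="secret",     # placeholders
    warehouse="QUERY_WH", database="ANALYTICS", schema="RAW",
)
cur = conn.cursor()

# Flatten the nested JSON and aggregate it up to months inside Snowflake,
# so only the small summarized result set is pulled back locally.
cur.execute("""
    SELECT DATE_TRUNC('month', r.value:timestamp::timestamp) AS month,
           r.value:category::string                          AS category,
           COUNT(*)                                          AS row_count,
           SUM(r.value:amount::float)                        AS total_amount
    FROM   API_EVENTS e,
           LATERAL FLATTEN(input => e.payload:records) r
    GROUP BY 1, 2
    ORDER BY 1, 2
""")
monthly = cur.fetchall()  # a handful of rows - cheap to plot locally
```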
You can probably mature your process from being driven by a local Python script to something like a serverless Lambda function, event-driven or run on a scheduler.
OK, so I'm working on an ASP MVC web application that queries a fairly large amount of data from a SQL Server 2008 database. When the application starts, the user is presented with a search mask that includes several fields. Using the search mask, the user can search for data in the database and also filter the search by specifying parameters in the mask. In order to speed up searching, I'm storing the result set returned by the database query in the server session. During subsequent searches I can then search the data I have in the session, thus avoiding unnecessary trips to the DB.
Since the amount of data that can be returned by a database query can be quite large, the scalability of the web application is severely limited. If there are, let's say, 100 users using the application at the same time, the server will keep search results in its session for each separate user, which will eventually eat up quite a bit of memory. My question is: what's the best alternative to storing the data in session? The database query can take quite a while at times, so, for now, I would like to avoid rerunning it on subsequent searches if the data I already retrieved contains the data that is now being searched for. One option I've considered is creating a temp table in the database as part of my search query, which stores the retrieved data and can be used for subsequent searches. The problem with that is that I don't have all that much experience with SQL Server, so I don't know whether SQL Server would create temp tables for each user if there are multiple users performing the search. Are there any other possibilities? Could the idea with the temp table in SQL Server work, or would it only lead to memory issues on the SQL Server? Thanks for the help! :)
Edit: Thanks a lot for the helpful and insightful answers, guys! However, I failed to mention a detail that's kind of important. When I query the database, the format of the result set can vary from user to user. This is because the user can decide which columns the result table has by selecting columns from a predefined multiselect box in the search mask. If user A wants ColA, ColB, and ColC to be displayed in his result table, he selects those values from the multiselect box in the search mask. User B, however, might select only ColA and ColC. Therefore, caching the results in a single table for all users might be a bit tricky, since the table columns are not necessarily going to be the same for all users. So I'm thinking I'll almost have to use an alternative that saves each user's cached table separately. The HTML5 Local Storage option mentioned below sounds interesting. Since this is an intranet application, it might be fair to assume (or require) that users have an up-to-date browser that supports HTML5. What do you guys think? Again, thanks for the help :)
If you want to cache query results, they'll have to be either on the web server or client in some form or another. All options will require memory, and since search results are user-specific, that memory usage will increase as a linear function of the number of current users.
My suggestions are to limit the number of rows returned from SQL (with TOP) and/or to look into optimizing your query on the SQL end. If your DB query takes a noticeable amount of time there's a good chance it can be optimized in SQL.
Have you already thought about NoSQL databases?
The idea of a NoSQL database is to store information that is optimized for reading or writing and is accessed with 'easy queries' (for example, a look-up on search terms). They scale horizontally with ease and would allow you to search through a whole lot of data very fast (think of Google's big data, for example!)
If HTML5 is a possibility, you could use Local Storage.
You could try turning on SQL session state.
http://support.microsoft.com/kb/317604
Plus: It's effortless, and you will find out whether this fixes the memory pressure and has acceptable performance (i.e. reading and writing the cache to the DB). If the perf is okay, then you may want to implement the sort of thing SQL session state does yourself, because there is a ...
Downside: If you aren't aggressive about removing it from session, it will be serialized and deserialized on each request, even on unrelated pages.
I work with an application that is switching from file-based data storage to database-based storage. It has a very large amount of code that was written specifically for the file-based system. To make the switch, I am implementing functionality that behaves like the old system; the plan is then to make more optimal use of the database in new code.
One problem is that the file-based system often read single records, and read them repeatedly for reports. This has turned into a lot of queries to the database, which is slow.
The idea I have been trying to flesh out is to use two datasets: one dataset to retrieve an entire table, and another dataset to query against the first, thereby decreasing communication overhead with the database server.
I've tried to look at the DataSource property of TADODataSet but the dataset still seems to require a connection, and it asks the database directly if Connection is assigned.
The reason I would prefer to get the result in another dataset, rather than navigating the first one, is that a good amount of logic for emulating the old system has already been implemented. This logic is based on having a dataset containing only the results as queried with the old interface.
The functionality only has to support reading data, not writing it back.
How can I use one dataset to supply values for another dataset to select from?
I am using Delphi 2007 and MSSQL.
You can use a ClientDataSet/DataSetProvider pair to fetch data from an existing DataSet. You can use filters on the source dataset, filters on the ClientDataSet and provider events to trim the dataset only to the interesting records.
I've used this technique with success in a couple of migration projects and to mitigate a similar situation where an old SQL Server 7 database was queried thousands of times to retrieve individual records, with painful performance costs. Querying it only once and then fetching individual records from the client dataset was, at the time, not only an elegant solution but a great performance boost for that particular application: the best example was an 8-hour process reduced to 15 minutes... the poor users loved me back then.
A ClientDataSet is just a TDataSet you can seamlessly integrate into existing code and UI.
What techniques/tips can you give regarding summarizing report data points so you don't have to store the raw data in the database?
For example, if I were storing page-view traffic for a website and my reports were accurate to the hour, I could roll up all database rows by the hour, and then possibly even create further summary tables at various increments like per day, per month, etc.
Any other tricks/tips along these lines?
You are talking about data warehousing / data mining. You still need the OLTP ("raw") data in a database, but you'd create an additional OLAP data warehouse with "pre-crunched" numbers for faster report access. However, this is an expensive venture in dollars and time - definitely not suitable for website statistics unless you are Google or Amazon. So you are better off keeping the setup that you have and using your queries to summarize the data.
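To make the hourly roll-up idea from the question concrete, here is a minimal, self-contained sketch using SQLite (table and column names are invented): raw page views are summarized to one row per URL per hour, which is what an hourly-accurate report would read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (url TEXT, viewed_at TEXT);                  -- raw OLTP rows
    CREATE TABLE page_views_hourly (hour TEXT, url TEXT, views INTEGER); -- summary table

    INSERT INTO page_views VALUES
        ('/home',    '2023-05-01 10:05:12'),
        ('/home',    '2023-05-01 10:47:03'),
        ('/pricing', '2023-05-01 11:02:44');

    -- Roll the raw rows up to one row per URL per hour; the raw rows can then be purged.
    INSERT INTO page_views_hourly (hour, url, views)
    SELECT strftime('%Y-%m-%d %H:00', viewed_at), url, COUNT(*)
    FROM page_views
    GROUP BY strftime('%Y-%m-%d %H:00', viewed_at), url;
""")

for row in conn.execute("SELECT * FROM page_views_hourly ORDER BY hour, url"):
    print(row)   # ('2023-05-01 10:00', '/home', 2) then ('2023-05-01 11:00', '/pricing', 1)
```

Daily or monthly summary tables can be built the same way from the hourly one.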