TPC-DS queries generator - database

I need to test my data warehouse using TPC-DS. How can I generate queries for my data warehouse using TPC-DS?
I tried to generate them, but the tool only generates queries for one specific data warehouse.
Thanks.

I'm not sure what you mean by "testing your data warehouse" using TPC-DS.
TPC-DS is a benchmark for database engines, focused on typical decision support access patterns; a data warehouse is an information systems concept that is usually built using a variety of database management systems (and other tools).
This being clarified, you can use TPC-DS to benchmark the database engine that you plan to use as a data store for your data warehouse. If that's your goal, you need to:
either generate the data using the official TPC-DS tool, or download the dataset if you can find it online (alternatively, maybe your database vendor provides it already).
load the data into the benchmark's model on the database you're testing.
run the benchmark (the queries) over the data model you created. You can find an example of the queries here (for Impala, in this case), but you may have to translate them into the SQL idiom used by whatever DBMS you're using.
The TPC-DS specification doc not only provides this information but it can also help you understand some essential concepts on this topic: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.11.0.pdf
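The "run the benchmark" step above can be sketched as a small timing harness. This is only an illustration under stated assumptions: SQLite (via Python's `sqlite3`) stands in for the engine under test, the table is a cut-down stand-in for a benchmark table, and the two queries are hypothetical placeholders rather than real TPC-DS queries.

```python
import sqlite3
import time


def run_benchmark(conn, queries):
    """Run each query once and return {name: (row_count, seconds)}."""
    results = {}
    for name, sql in queries.items():
        start = time.perf_counter()
        # Fetch every row so the engine has to do the full work.
        rows = conn.execute(sql).fetchall()
        elapsed = time.perf_counter() - start
        results[name] = (len(rows), elapsed)
    return results


# Stand-in for a loaded benchmark schema: SQLite in memory, one tiny table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (ss_item_sk INTEGER, ss_quantity INTEGER)")
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?)",
    [(i % 10, i) for i in range(1000)],
)

# Hypothetical placeholder queries; the real ones come from the TPC-DS toolkit.
queries = {
    "q_total": "SELECT SUM(ss_quantity) FROM store_sales",
    "q_by_item": "SELECT ss_item_sk, COUNT(*) FROM store_sales GROUP BY ss_item_sk",
}

results = run_benchmark(conn, queries)
for name, (row_count, seconds) in results.items():
    print(f"{name}: {row_count} rows in {seconds:.4f}s")
```

For another DBMS you would swap the connection for that engine's driver and use a cursor where the driver does not offer `conn.execute`. A real benchmark run also repeats queries and follows the run rules in the specification.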

Related

Where to download large data set sample for Oracle

Is there any place where I can download a large data set of dummy or real-life (anonymized) data for practicing performance tuning in Oracle?
I have only found StackOverflow dumps, and I am already downloading those (though they are allegedly for MySQL only)...
Do you have any idea where I could find such dumps?
Thank you for any help
EDIT:
Ok I found some data sources worth a try ...
https://openlibrary.org/developers/dumps
https://catalog.data.gov/dataset/bulk-download-facility
https://www.yelp.com/dataset/download (non-commercial use only)
Any data set anywhere can be used for this in Oracle.
The easiest types of data to load into Oracle are 'delimited' - the most famous of which is known as 'CSV'
You can then build a SQL*Loader scenario or an External Table to load a massive amount of data (hundreds of gigabytes or more), or you can use a GUI like Oracle SQL Developer to load the rows into a new or existing table.
I talk about both approaches here. It says 'Excel' but CSV works much the same way.
I've used my own data for this. I've gotten dumps from Twitter (all my Tweets), Untappd (all my beers), Strava (all my activities).
I've used public data for this - NHL stats (35,000 rows), Airports, etc. Just google 'open data' and look for CSV.
Of course I'd be remiss not to mention our own public sample data sets. We make these available on Github.
Customer Orders is the newest one (2019)
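If none of the public sets fits, a delimited file of made-up rows is easy to produce yourself. The sketch below is only an illustration: the column names and value ranges are hypothetical, and the output is plain CSV of the kind the loading approaches above accept.

```python
import csv
import io
import random


def write_dummy_orders(out, n_rows, seed=42):
    """Write n_rows of made-up order data as CSV to the file-like object out."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    writer = csv.writer(out)
    writer.writerow(["order_id", "customer_id", "status", "amount"])
    for order_id in range(1, n_rows + 1):
        writer.writerow([
            order_id,
            rng.randint(1, 10_000),
            rng.choice(["OPEN", "PAID", "SHIPPED", "CANCELLED"]),
            f"{rng.uniform(1, 500):.2f}",
        ])


# Small in-memory demonstration of the format.
buffer = io.StringIO()
write_dummy_orders(buffer, 5)
lines = buffer.getvalue().splitlines()
print(lines[0])
```

For a file large enough to practice tuning on, pass a real file opened with `open("orders.csv", "w", newline="")` and a row count in the millions.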
I would recommend the (free) data generator and load generator tool SwingBench by Dominic Giles:
http://www.dominicgiles.com/swingbench.html
It can create an OLTP schema and run a bunch of transactions and queries at the concurrency level you specify. Also you can create a "TPC-DS like" schema for data warehouse query testing (and it even supports running some transactions on the DW schema if you want to).

What are the approaches to the Big-Data problems? [closed]

Let us consider the following problem. We have a system containing a huge amount of data (Big Data), so in effect we have a database. As the first requirement, we want to be able to write to and read from the database quickly. We also want to have a web interface to the database (so that different clients can write to and read from it remotely).
But the system that we want should be more than a database. First, we want to be able to run different data-analysis algorithms on the data to find regularities, correlations, abnormalities and so on (as before, we care a lot about performance). Second, we want to attach machine learning machinery to the database, meaning that we want to run machine learning algorithms on the data to learn "relations" present in the data and, based on that, predict the values of entries that are not yet in the database.
Finally, we want to have a nice click-based interface that visualizes the data, so that users can see the data in the form of graphics, graphs and other interactive visualisation objects.
What are the standard and widely recognised approaches to the problem described above? What programming languages have to be used to deal with these problems?
I will approach your question like this: I assume you are firmly interested in big data database use already and have a real need for one, so instead of repeating textbooks upon textbooks of information about them, I will highlight some that meet your 5 requirements - mainly Cassandra and Hadoop.
1) The first requirement we want to be able to write to and to read from the database quickly.
You'll want to explore NoSQL databases, which are often used for storing “unstructured” Big Data. Some open-source options include Hadoop and Cassandra. Regarding Cassandra,
Facebook needed something fast and cheap to handle the billions of status updates, so it started this project and eventually moved it to Apache where it's found plenty of support in many communities (ref).
References:
Big Data and NoSQL: Five Key Insights
NoSQL standouts: New databases for new applications
Big data woes: Which database should I use?
Cassandra and Spark: A match made in big data heaven
List of NoSQL databases (currently 150)
2) We also want to have a web interface to the database
See the list of 150 NoSQL databases to see all the various interfaces available, including web interfaces.
Cassandra has a cluster admin, a web-based environment, a web-admin based on AngularJS, and even GUI clients.
References:
150 NoSQL databases
Cassandra Web
Cassandra Cluster Admin
3) We want to be able to run different data-analysis algorithm on the data
Cassandra, Hive, and Hadoop are well-suited for data analytics. For example, eBay uses Cassandra for managing time-series data.
References:
Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack
Cassandra at eBay - Cassandra Summit
An Introduction to Real-Time Analytics with Cassandra and Hadoop
4) We want to run machine learning algorithms on the data to be able to learn "relations"
Again, Cassandra and Hadoop are well-suited. Regarding Apache Spark + Cassandra,
Spark was developed in 2009 at UC Berkeley AMPLab, open sourced in
2010, and became a top-level Apache project in February, 2014. It has
since become one of the largest open source communities in big data, with over 200 contributors in 50+ organizations (ref).
Regarding Hadoop,
With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets.
References:
Getting Started with Apache Spark and Cassandra
What is Apache Mahout?
Data Science with Apache Hadoop: Predicting Airline Delays
5) Finally, we want to have a nice clicks-based interface that visualize the data.
Visualization tools (paid) that work with the above databases include Pentaho, JasperReports, and Datameer Analytics Solutions. Alternatively, there are several open-source interactive visualization tools such as D3 and Dygraphs (for big data sets).
References:
Data Science Central - Resources
Big Data Visualization
Start looking at:
What kind of data do you want to store in the database?
What kind of relationships exist between the data?
How will this data be accessed? (For instance, do you need to access a certain set of data quite often?)
Are they documents? Text? Something else?
Once you have an answer to all those questions, you can start looking at which NoSQL database would give you the best results for your needs.
You can choose between 4 different types: key-value, document, column-family stores, and graph databases.
Which one is the best fit can be determined by answering the questions above.
There is a ready-to-use stack that may really help you get started with your project:
Elasticsearch would be your database (it has a REST API that you can use to write data to the DB and to run queries and analysis).
Kibana is a visualization tool; it will allow you to explore and visualize your data. It is quite powerful and will be more than enough for most of your needs.
Logstash can centralize the data processing and help you process data and save it in Elasticsearch. It already supports quite a few sources of logs and events, and you can also write your own plugin.
Some people refer to them as the ELK stack.
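As a rough illustration of the REST API mentioned above, the sketch below builds a `_search` request with a terms aggregation, without sending it. The host `http://localhost:9200`, the index name `web-logs` and the field `status` are assumptions made up for this example.

```python
import json
import urllib.request


def build_search_request(base_url, index, field, size=10):
    """Build (but do not send) a _search request with a terms aggregation."""
    body = {
        "size": 0,  # return only the aggregation, not the matching documents
        "aggs": {
            "top_values": {"terms": {"field": field, "size": size}}
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, index and field names.
request = build_search_request("http://localhost:9200", "web-logs", "status")
print(request.get_method(), request.full_url)
```

With a cluster actually running, `urllib.request.urlopen(request)` would send it and return the JSON response. The official client libraries wrap the same REST calls.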
I don't believe you should worry about the programming language at this point. Try to select the tools first: sometimes the choice of language is limited by the tools you want to use, and you can still use a mixture of languages and make the effort only if and when it makes sense.
A common way to meet such requirements is to use Amazon Redshift and the ecosystem around it.
Redshift is a petabyte-scale data warehouse (it can also start at gigabyte scale) that exposes an ANSI SQL interface. Since you can put as much data as you like into the DWH and run any type of SQL you wish against this data, it is a good infrastructure on which to build almost any agile big data analytics system.
Redshift has many analytics functions, mainly using Window functions. You can calculate averages and medians, but also percentiles, dense rank etc.
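To give a feel for the window functions mentioned above, here is a minimal sketch. It runs on SQLite through Python's `sqlite3` (which needs SQLite 3.25 or later for window functions), not on Redshift, and the `sales` table and its rows are made up. The `DENSE_RANK() OVER (...)` and `AVG(...) OVER (...)` syntax is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "ann", 300),
        ("east", "bob", 300),
        ("east", "cy", 100),
        ("west", "dee", 200),
        ("west", "eli", 400),
    ],
)

# Rank reps within each region and show the regional average next to each row.
rows = conn.execute(
    """
    SELECT region,
           rep,
           amount,
           DENSE_RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk,
           AVG(amount)  OVER (PARTITION BY region) AS region_avg
    FROM sales
    ORDER BY region, rnk, rep
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike `GROUP BY`, the window functions keep every input row, so the detail and the aggregate can be compared side by side.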
You can connect almost any SQL client you want using JDBC/ODBC drivers. It can be R, RStudio, or psql, but also MS Excel.
AWS recently added a new service for machine learning. Amazon ML integrates nicely with Redshift. You can build predictive models based on data from Redshift by simply providing an SQL query that pulls the data needed to train the model, and Amazon ML will build a model that you can use for batch predictions as well as real-time predictions. You can check this blog post from the AWS big data blog that shows such a scenario: http://blogs.aws.amazon.com/bigdata/post/TxGVITXN9DT5V6/Building-a-Binary-Classification-Model-with-Amazon-Machine-Learning-and-Amazon-R
Regarding visualization, there are plenty of great visualization tools that you can connect to Redshift. The most common ones are Tableau, QlikView, Looker or YellowFin, especially if you don't have an existing DWH; if you do, you might want to keep using tools like JasperSoft or Oracle BI. Here is a link to a list of partners that provide a free trial of their visualization tools on top of Redshift: http://aws.amazon.com/redshift/partners/
BTW, Redshift also provides a free trial for 2 months that you can quickly test and see if it fits your needs: http://aws.amazon.com/redshift/free-trial/
Big Data is a tough problem primarily because it isn't one single problem. First, if your original database is a normal OLTP database that handles business transactions throughout the day, you will not want to do your big data analysis on the same system, since the analysis will interfere with the normal business traffic.
Problem #1 is what type of database you want to use for data analysis. You have many choices, including RDBMS, Hadoop, MongoDB, and Spark. If you go with an RDBMS, you will want to change the schema to be better suited to data analysis: you will want to create a data warehouse with a star schema. Doing this will make many tools available to you, because this method of data analysis has been around for a very long time. The other "big data" and data analysis databases do not have the same level of tooling available, but they are quickly catching up. You will need to research which one to use based on your problem set. If you have big batches of data, RDBMS and Hadoop will be good. If you have streaming types of data, you will want to look at MongoDB and Spark. If you are a Java shop, then RDBMS, Hadoop or Spark. If you are a JavaScript shop, MongoDB. If you are good with Scala, then Spark.
Problem #2 is getting your data from your transactional database into your big data storage. You will need to find a programming language that has libraries to talk to both databases and you will have to decide when and where you will be moving this data. You can use Python, Java or Ruby to do this work.
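As a rough sketch of Problems #1 and #2 together, the example below copies rows from a transactional table into a tiny star schema (one dimension table, one fact table). Both databases are in-memory SQLite stand-ins, and every table and column name is hypothetical.

```python
import sqlite3

# Source: a transactional table (stand-in for the OLTP database).
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, customer TEXT, order_date TEXT, amount REAL)")
oltp.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "acme", "2024-01-05", 120.0),
        (2, "acme", "2024-01-06", 80.0),
        (3, "globex", "2024-01-06", 50.0),
    ],
)

# Target: a tiny star schema with one dimension and one fact table.
dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT UNIQUE)")
dwh.execute("CREATE TABLE fact_sales (customer_key INTEGER, order_date TEXT, amount REAL)")


def customer_key(name):
    """Return the surrogate key for a customer, creating the dimension row if needed."""
    dwh.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (name,))
    return dwh.execute(
        "SELECT customer_key FROM dim_customer WHERE name = ?", (name,)
    ).fetchone()[0]


# Extract from the source, transform names into keys, load into the fact table.
for _id, customer, order_date, amount in oltp.execute(
    "SELECT id, customer, order_date, amount FROM orders ORDER BY id"
):
    dwh.execute(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        (customer_key(customer), order_date, amount),
    )
dwh.commit()

# A star join: facts aggregated by a dimension attribute.
totals = dwh.execute(
    """
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_key)
    GROUP BY d.name ORDER BY d.name
    """
).fetchall()
print(totals)
```

A production job would add incremental extraction (only new or changed rows) and run on a schedule outside business hours, which is the "when and where" decision mentioned above.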
Problem #3 is your UI. If you decide to go with an RDBMS, you can use many of the available tools or build your own. The other data storage solutions have tool support, but it isn't as mature as what is available for RDBMSs. You are most likely going to build your own here anyway, because your analysts will want tools built to their specifications. Java works with all of these storage mechanisms, but you can probably get Python to work too. You may want to provide a service layer built in Java that exposes a RESTful interface and then put a web layer in front of that service layer. If you do this, your web layer can be built in any language you prefer.
These three languages are most commonly used for machine learning and data mining on the Server side: R, Python, SQL. If you are aiming for heavy mathematical functions and graph generation, Haskell is very popular.

Building OLAP style applications with SalesForce/Apex

We are considering moving a planning and budgeting app to the Salesforce platform. The existing app is built on a dimensional data model, and has extensive ad-hoc query capability implemented through star joins.
We see how the platform will allow us to put together the data entry screens quickly, but the underlying data model and query languages do not seem suitable for our reporting requirements.
Is it possible to have fast and flexible reporting with this platform? If not, how cumbersome is it to extract the data on a regular basis to bring it into an analytical application?
Hmm - I guess I answer my own question? The relative silence on this (even with bounty- who wants to have anything to do with something that is ignored on stackoverflow?) is a kind of answer.
So - No, this platform is not well suited for applications that have any kind of ROLAP requirements. I guess shame on me for asking a silly question, but I welcome any responses...
Doing native, fast, OLAP-like queries: possible, but somewhat cumbersome since SFDC is basically a traditional-style RDBMS with somewhat limited joining capability within its native reporting. You can do OLAP-like things with custom code but it can get cumbersome if you are used to using established high-end OLAP solutions.
Extracting data from SFDC to use in other applications: really easy and supported across a number of technologies, the most common is extracting CSV files or using the data web service. There are tools like the SFDC data loader which also let you extract/load data via command line or UI. That's probably what I would recommend to a client who has pre-existing expertise in a given analysis tool.
I would not attempt to build an OLAP data model in Salesforce. The limitations in both the joins and the roll-up of data from child to parent make it difficult to implement a star schema with aggregations.
There are some products, such as IQ 20/20, that can integrate with Salesforce and provide near-real-time business intelligence functionality.
Analytical snapshots can also help, as they provide a way to build aggregate tables. The snapshots pull data from a report and can be scheduled to run periodically. The different Salesforce editions offer different scheduling features, so it is best to check the limits for your edition before going too far into the design.

Database selection for a web-scale analytics application

I want to build a web-application similar to Google-Analytics, in which I collect statistics on my customers' end-users, and show my customers analysis based on that data.
Characteristics:
High scalability, handle very large volume
Compartmentalized - Queries always run on a single customer's data
Support analytical queries (drill-down, slices, etc.)
Due to the analytical need, I'm considering using an OLAP/BI suite, but I'm not sure it's meant for this scale. A NoSQL database? Would a simple RDBMS do?
This is what I am using at work in a production environment, and it works like a charm.
I coupled three things:
PostgreSQL + LucidDB + Mondrian (more generally, the components of the whole Pentaho BI suite)
PostgreSQL: I am not going to describe PostgreSQL at length; it is a really strong open-source RDBMS that will almost certainly let you do everything you need. I use it to store my operational data.
LucidDB: LucidDB is an open-source column-store database. It is highly scalable and will provide a real gain in processing time compared to PostgreSQL when retrieving large amounts of data. It is optimized for intensive reads rather than transaction processing. This is my data warehouse database.
Mondrian: Mondrian is an open-source R-OLAP cube. LucidDB made it easy to connect those two programs together.
I would recommend that you look at the whole Pentaho BI Suite. It is worth it, and you might want to use some of its components.
Hope I could help,
There are two main architectures you could opt for to reach true web scale:
1. "BI" architecture
Event journaller (e.g. LWES Journaller) or immutable event store (e.g. HDFS) feeds
Analytics/column-store database (e.g. Greenplum, InfiniDB, LucidDB, Infobright) feeds
Business intelligence reporting tool (e.g. Microstrategy, Pentaho Business Analytics)
2. "NoSQL" architecture
(Optional) Event journaller or immutable event store feeds
NoSQL database (e.g. Cassandra, Riak, HBase) feeds
A custom analytics UI (e.g. using D3.js)
The immutable event store or journaller is there because in most cases you want to be batching your analytics events and doing bulk updates to your database (even with something like HDFS) - rather than doing an atomic write for every single page view etc.
For SnowPlow, our open-source analytics platform built on Hadoop and Hive, the event logs are all collected on S3 first before being batch loaded into Hive.
Note that the "NoSQL architecture" will involve a fair bit more development work. Remember that with either architecture, you can always shard by customer if the volumes grow truly epic (billions of rows per customer) - because there's no need (I'm guessing) for cross-customer analytics.
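The batching idea described above can be sketched as follows. This is a minimal illustration, not how SnowPlow or any of the named products do it: events are buffered in memory and written with one bulk insert per batch, SQLite stands in for the analytics store, and the class and table names are hypothetical.

```python
import sqlite3


class BatchingEventWriter:
    """Buffer events in memory and write them in bulk once the batch is full."""

    def __init__(self, conn, batch_size=1000):
        self.conn = conn
        self.batch_size = batch_size
        self.buffer = []
        self.flush_count = 0

    def record(self, customer_id, event_type):
        self.buffer.append((customer_id, event_type))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One transaction and one bulk insert per batch instead of one write per event.
        with self.conn:
            self.conn.executemany("INSERT INTO events VALUES (?, ?)", self.buffer)
        self.buffer.clear()
        self.flush_count += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, event_type TEXT)")

writer = BatchingEventWriter(conn, batch_size=100)
for i in range(250):
    writer.record(customer_id=i % 3, event_type="page_view")
writer.flush()  # write the final partial batch

stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(stored, writer.flush_count)
```

An in-memory buffer loses events if the process dies before a flush, which is one reason to put a durable journaller or event store in front of the database. The `customer_id` column is also the natural key to shard on.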
I'd say that having OLAP analysis in place is always nice, and it offers great potential for sophisticated data analysis using MDX.
What do you mean by large volume?
Where is your customer user information stored?
What kind of front-end and reporting are you going to use?
Cheers.
Disclaimer: I'll do some advertising for my own solution. Have a look at www.icCube.com and contact me for more details.

Updating data from several different sources

I'm in the process of setting up a database with customer information. The database will handle customer data (customer id, address, phone number etc.) as well as some basic information about which kinds of advertisement a specific customer has been subjected to, and how they reacted to them.
The data will be maintained from a central data warehouse, but additional information about customers and the advertisements will also be updated from other sources. For example, if an external advertisement agency runs a campaign, I want them to be able to feed back data about opt-outs, e-mail bounces etc. I guess what I need is an API which can be easily handed out to any number of agencies.
My first thought was to set up a web service API for all external sources, but since we'll probably be talking large amounts of data (millions of records per batch) I'm not sure a web service is the best option.
So my question is, what's the best practice here? I need a solution simple enough for advertisement agencies (likely with moderately skilled IT-people) to make use of. Simplicity is of the essence – by which I mean “simplicity over performance” in this case. If the set up gets too complex, it won't work.
The system will very likely be based on Microsoft technology.
Any suggestions?
The process you're describing is commonly referred to as Data Integration using ETL processes. ETL stands for Extract-Transform-Load. The idea is to build up your central data warehouse by extracting information from a lot of different data-sources, transform it and then load it into your data warehouse.
A variety of tools (including graphical ones) exist to implement such a process. Since you said you'll probably be running a Microsoft stack, I suggest having a look at SQL Server Integration Services (SSIS).
Regarding your suggestion to implement integration using a web service, I don't think that's a good idea either. Similarly, I don't think shifting the burden of data integration to your customers is a good idea. You should agree with your customers on some form of data exchange format; it could be as simple as a CSV file, or XML, Excel sheets, or Access databases. Use whatever suits your needs.
Any modern ETL tool like SSIS is capable of working with those different data sources.
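To make the "agree on a simple exchange format" idea concrete, here is a minimal sketch of validating a CSV feedback file before loading it. The three-column layout and the allowed event names are assumptions invented for this example. In an SSIS setup the same checks would live in the package itself.

```python
import csv
import io

EXPECTED_COLUMNS = ["customer_id", "event", "event_date"]  # hypothetical agreed format
ALLOWED_EVENTS = {"opt_out", "bounce"}


def parse_feedback(csv_file):
    """Split an agency feedback file into valid rows and rejected rows with reasons."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    valid, rejected = [], []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not (row["customer_id"] or "").isdigit():
            rejected.append((line_no, "customer_id is not numeric"))
        elif row["event"] not in ALLOWED_EVENTS:
            rejected.append((line_no, f"unknown event {row['event']!r}"))
        else:
            valid.append((int(row["customer_id"]), row["event"], row["event_date"]))
    return valid, rejected


sample = io.StringIO(
    "customer_id,event,event_date\n"
    "1001,opt_out,2024-03-01\n"
    "abc,bounce,2024-03-01\n"
    "1002,clicked,2024-03-02\n"
    "1003,bounce,2024-03-02\n"
)
valid, rejected = parse_feedback(sample)
print(len(valid), "valid;", len(rejected), "rejected")
```

Returning the rejected rows with line numbers gives the agency something actionable to fix, which keeps the exchange simple on their side.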
