Alteryx only pulls 1 million rows from Snowflake, max

I'm using Alteryx workflows to spool some data from Snowflake.
The problem is that for some reason, Alteryx only pulls 1 million records, tops.
The Input Data tool has no record limit set, so it should pull everything it finds.
This is odd, since I thought Alteryx was pretty much meant to be used with very large data sets. Is there some special configuration needed to pull over a million records, or is this just something Alteryx cannot do?
I looked through the Alteryx resources and couldn't find anything mentioning this issue. I don't know where to begin.

Related

SQL Server - Inserting new data worsens query performance

We have a 4-5 TB SQL Server database. The largest table is around 800 GB and contains 100 million rows; 4-5 other comparable tables are one-third to two-thirds of this size. We went through a process of creating new indexes to optimize performance. While performance certainly improved, we saw that the newly inserted data was the slowest to query.
It's a financial reporting application with a BI tool working on top of the database. The data is loaded overnight, continuing into the late morning, though the majority of the data is loaded by 7am. Users start to query data around 8am through the BI tool and are most concerned with the latest (daily) data.
I wanted to know whether newly inserted data causes indexes to become fragmented. Is there anything we can do to get performance on the newly inserted data at least as good as on the old data? I hope I have explained the issue well; let me know if any information is missing. Thanks.
Edit 1
Let me describe the architecture a bit.
I have a base table (let's call it Base) with (Date, Id) as the clustered index.
It has around 50 columns.
Then we have 5 derived tables (Derived1, Derived2, ...), one per metric type, which also have (Date, Id) as the clustered index and a foreign key constraint to the Base table.
Tables Derived1 and Derived2 have 350+ columns; Derived3, 4, and 5 have around 100-200 columns. There is one large view that joins all the data tables, due to limitations of the BI tool. (Date, Id) are the join columns for all the tables forming the view, hence the clustered index on those columns. The main concern is BI tool performance; the BI tool always uses the view and generally sends similar queries to the server.
There are other indexes as well on other filtering columns.
The main question remains - how to prevent performance from deteriorating.
In addition I would like to know
Whether a nonclustered index (NCI) on (Date, Id) on all tables would be a better bet, in addition to the clustered index on (Date, Id).
Does it make sense to have 150 included columns in the NCI for the derived tables?
You have about 100 million rows, growing every day with new batches, and those new batches are usually what gets queried. With those numbers I would use partitioned indexes rather than regular indexes.
Your solution within SQL Server would be partitioning. Take a look at SQL partitioning and see if you can adopt it. Partitioning is a form of clustering where groups of data share a physical block. If you partition by year and month, for example, all 2018-09 records share the same physical space and are easy to find, so if you filter on those columns (plus more), the table effectively behaves as if it only contained the 2018-09 records. That is not exactly accurate, but it is close. Be careful with the data values you partition on: unlike a standard PK cluster, where each value is unique, the partitioning column(s) should produce a manageable set of distinct combinations, i.e. partitions.
If you cannot use partitions, you have to create 'partitions' yourself using regular indexes. This will require some experimentation. The basic idea is a column (a number, say) indicating a wave, or set of waves, of imported data. For example, data imported today and over the next 10 days is wave '1'; the next 10 days are wave '2', and so on. By filtering on the latest 10 waves you work on the latest 100 days of imports and effectively skip all the rest. Roughly, if you divided your existing 100 million rows into 100 waves, then start at wave 101 and search for wave >= 90, you only have about 10 million rows to search, provided SQL Server is persuaded to use the new index first (it will, eventually).
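The 'wave' workaround above can be sketched with a plain index on a batch-number column. A minimal illustration using SQLite via Python; the table and column names are hypothetical and the row counts are scaled way down:

```python
import sqlite3

# Tag each load batch with a wave number, index it, and filter on
# recent waves so queries touch only a small slice of the table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Base (id INTEGER, load_date TEXT, wave INTEGER, amount REAL)")
rows = [(i, f"2018-{1 + i % 12:02d}-01", i // 100, float(i)) for i in range(1000)]
con.executemany("INSERT INTO Base VALUES (?, ?, ?, ?)", rows)
con.execute("CREATE INDEX ix_base_wave ON Base (wave)")

# 1000 rows split into waves 0-9; querying the latest two waves
# only has to look at 200 rows.
latest = con.execute("SELECT COUNT(*) FROM Base WHERE wave >= 8").fetchone()[0]
print(latest)  # 200
```

The same idea carries over to SQL Server: a narrow integer column populated by the load process, with a nonclustered index on it, lets the latest-data queries seek directly to the newest batches.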
This is a broad question, especially without knowing your system. But one thing I would try is manually updating your statistics on the indexes/tables once you are done loading data. With tables that big, it is unlikely that you will modify enough rows to trigger an auto-update. Without fresh statistics, SQL Server won't have an accurate histogram of your data.
Next, dive into your execution plans and see what operators are the most expensive.
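As a scaled-down illustration of refreshing statistics after a load, here is a sketch using SQLite's ANALYZE from Python (in SQL Server the equivalent step would be UPDATE STATISTICS or sp_updatestats; the table name here is made up):

```python
import sqlite3

# After a bulk load, refresh optimizer statistics so the planner
# sees the new data distribution.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, d TEXT)")
con.execute("CREATE INDEX ix_facts_d ON facts (d)")
con.executemany("INSERT INTO facts (d) VALUES (?)",
                [(f"2018-09-{(i % 30) + 1:02d}",) for i in range(3000)])

con.execute("ANALYZE")  # rebuild stats after the load

# ANALYZE populates sqlite_stat1 with per-index row estimates.
stats = con.execute("SELECT tbl, idx FROM sqlite_stat1").fetchall()
print(stats)
```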

Developing an optimal solution/design that sums many database rows for a reporting engine

Problem: I am developing a reporting engine that displays data about how many bees a farm detected (Bees is just an example here)
I have 100 devices that each minute count how many bees were detected on the farm. Here is what the DB looks like:
So there can be hundreds of thousands of rows in a given week.
The farmer wants a report that shows for a given day how many bees came each hour. I developed two ways to do this:
The server takes all 100,000 rows for that day from the DB and filters them down. The server uses a large amount of memory to do this, and I feel this is a brute-force solution.
I have a stored procedure that returns a temporary table totaling, for each hour, the number of bees collected per device. The server takes this table and doesn't need to process 100,000 rows.
This returns (24 * 100) rows. However, it takes much longer than I expected.
What are some good candidate solutions for consolidating and summing this data without taking 30 seconds just to sum a day of data (and I may need a month's worth, divided between days)?
If performance is your primary concern here, there's probably quite a bit you can do directly on the database. I would try indexing the table on time_collected_bees so it can filter down to 100K lines faster. I would guess that that's where your slowdown is happening, if the database is scanning the whole table to find the relevant entries.
If you're using SQL Server, you can try looking at your execution plan to see what's actually slowing things down.
Give database optimization more of a look before you architect something really complex and hard to maintain.
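A minimal sketch of the suggestion above, using SQLite via Python: index the timestamp column, then let the database do the hourly rollup. The table and column names (bee_counts, time_collected_bees) are assumed here, not taken from the actual schema:

```python
import sqlite3
from datetime import datetime, timedelta

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE bee_counts (
    device_id INTEGER, time_collected_bees TEXT, bees INTEGER)""")

# 10 devices, one reading per minute for one day (scaled down from 100).
start = datetime(2018, 9, 1)
rows = [(d, (start + timedelta(minutes=m)).isoformat(), 1)
        for d in range(10) for m in range(1440)]
con.executemany("INSERT INTO bee_counts VALUES (?, ?, ?)", rows)

# Index so the date filter is a range seek, not a full-table scan.
con.execute("CREATE INDEX ix_bees_time ON bee_counts (time_collected_bees)")

# Hourly totals per device, computed database-side.
hourly = con.execute("""
    SELECT strftime('%H', time_collected_bees) AS hour,
           device_id, SUM(bees) AS total
    FROM bee_counts
    WHERE time_collected_bees >= '2018-09-01'
      AND time_collected_bees <  '2018-09-02'
    GROUP BY hour, device_id
""").fetchall()
print(len(hourly))  # 24 hours * 10 devices = 240 rows
```

The server then receives only the 24 × devices summary rows instead of the full day of raw readings.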

Should I normalize a 1.3 million record flat file for analysis?

I've been handed an immense flat file of health insurance claims data. It contains 1.3 million rows and 154 columns. I need to do a bunch of different analyses on these data. This will be in SQL Server 2012.
The file has 25 columns for diagnosis codes (DIAG_CD01 through DIAG_CD_25), 8 for billing codes (ICD_CD1 through ICD_CD8), and 4 for procedure modifier codes (MODR_CD1 through MODR_CD4). It looks like it was dumped from a relational database. The billing and diagnosis codes are going to be the basis for much of the analysis.
So my question is whether I should split the file into a mock relational database. Writing analysis queries against a table like this will be a nightmare. If I split it into a parent table and three child tables (Diagnoses, Modifiers, and Bill_codes), my query code will be much easier. But if I do that, I'll have, on top of the 1.3 million parent records, up to 32.5 million diagnosis records, up to 10.4 million billing code records, and up to 5.2 million modifier records. On the other hand, a huge portion of the flat data across the three sets is null fields, which is said to hurt query performance.
What are the likely performance consequences of querying these data as a mock relational database vs. as the giant flat file? Reading about normalization it sounds like performance should be better, but the sheer number of records in a four table split gives me pause.
Seems like if you keep it denormalized you would have to repeat your query logic a whole bunch of times (25 times for the diagnosis columns), and even worse, you would have to somehow aggregate all those pieces together.
Do like you suggested and split the data into logical tables like Diagnosis Codes, Billing Codes, etc. and your queries will be much easier to handle.
If you have a decent machine these row counts should not be a performance problem for sql server. Just make sure you have indexes to help with your joins, etc.
Good luck!
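The split described above amounts to unpivoting the repeated columns into child tables. A small sketch in Python with SQLite, showing just 3 of the 25 diagnosis columns (column names follow the question; the Diagnoses layout and sample codes are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE claims_flat (
    claim_id INTEGER PRIMARY KEY,
    DIAG_CD01 TEXT, DIAG_CD02 TEXT, DIAG_CD03 TEXT)""")
con.executemany("INSERT INTO claims_flat VALUES (?, ?, ?, ?)", [
    (1, 'E11.9', 'I10',  None),
    (2, 'J45.909', None, None),
])

# Unpivot the wide diagnosis columns into (claim_id, pos, code) rows,
# dropping the nulls, so queries hit one narrow column instead of 25.
con.execute("""CREATE TABLE Diagnoses AS
    SELECT claim_id, 1 AS pos, DIAG_CD01 AS code
      FROM claims_flat WHERE DIAG_CD01 IS NOT NULL
    UNION ALL
    SELECT claim_id, 2, DIAG_CD02
      FROM claims_flat WHERE DIAG_CD02 IS NOT NULL
    UNION ALL
    SELECT claim_id, 3, DIAG_CD03
      FROM claims_flat WHERE DIAG_CD03 IS NOT NULL""")

n = con.execute("SELECT COUNT(*) FROM Diagnoses").fetchone()[0]
print(n)  # 3 non-null codes
```

In SQL Server 2012 the same shape can be produced with UNPIVOT or a chain of UNION ALL selects, and a query such as "claims with diagnosis X" becomes a single WHERE clause on Diagnoses.code.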

Recommended ETL approaches for large to big data

I've been reading up on this (via Google search) for a while now and am still not getting a clear answer, so I finally decided to post.
Am trying to get a clear idea on what is a good process for setting up automated ETL process. Let's take the following problem:
Data:
Product Code, Month, Year, Sales, Flag
15 million rows of data covering 5,000 products. Given this data, calculate whether the cumulative sales for a particular product exceed X, and set Flag = 1 at the point in time the threshold was exceeded.
How would people approach this task? My approach was to attempt it using SQL Server, but that was painfully slow at times. In particular, there was a step in the transformation that would have required me to write a stored procedure that created an index on a temp table on the fly in order to speed things up, all of which seemed like a lot of bother.
Should I have coded it in Java or Python? Should I have used Alteryx or Lavastorm? Is this something I should ideally be doing using Hadoop?
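For what it's worth, the cumulative-threshold flag described above can often be computed in a single pass with a window function, without a stored procedure or an on-the-fly temp-table index. A sketch using SQLite (3.25+) via Python, with an arbitrary threshold X = 100 and made-up sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, year INTEGER, month INTEGER, sales REAL)")
# One product, 30 units of sales per month for six months.
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [("A", 2018, m, 30.0) for m in range(1, 7)])

# Running total per product, ordered in time; flag = 1 once it
# exceeds the threshold X = 100.
rows = con.execute("""
    SELECT product, year, month,
           SUM(sales) OVER (PARTITION BY product
                            ORDER BY year, month) AS cum_sales,
           CASE WHEN SUM(sales) OVER (PARTITION BY product
                                      ORDER BY year, month) > 100
                THEN 1 ELSE 0 END AS flag
    FROM sales
    ORDER BY year, month
""").fetchall()
for r in rows:
    print(r)
```

SQL Server supports the same SUM(...) OVER (PARTITION BY ... ORDER BY ...) construct (2012+), so this kind of set-based pass is usually far faster than a cursor- or temp-table-based transformation.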

Database storage requirements and management for lots of numerical data

I'm trying to figure out how to manage and serve a lot of numerical data. Not sure an SQL database is the right approach. Scenario as follows:
10000 sets of time series data collected per hour
5 floating point values per set
About 5000 hours worth of data collected
So that gives me about 250 million values in total. I need to query this set of data by set ID and by time. If possible, also filter by one or two of the values themselves. I'm also continuously adding to this data.
This seems like a lot of data. Assuming 4 bytes per value, that's about 1 GB of raw data (250 million × 4 bytes). I don't know what a general "overhead multiplier" for an SQL database is; say it's 2, then that's 2 GB of disk space.
What are good approaches to handling this data? Some options I can see:
Single PostgreSQL table with indices on ID, time
Single SQLite table -- this seemed to be unbearably slow
One SQLite file per set -- lots of .sqlite files in this case
Something like MongoDB? Don't even know how this would work ...
Appreciate commentary from those who have done this before.
Mongo is a document store; it might work for your data, but I don't have much experience with it.
I can tell you that PostgreSQL would be a good choice. It will be able to handle that kind of data. SQLite is definitely not optimized for those use cases.
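A minimal sketch of the single-table approach with a composite index matching the "by set ID and by time" access pattern, shown with SQLite via Python for brevity (in PostgreSQL the DDL would be essentially the same; table and column names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per (set_id, timestamp), with the 5 float values as columns.
con.execute("""CREATE TABLE samples (
    set_id INTEGER, ts INTEGER,
    v1 REAL, v2 REAL, v3 REAL, v4 REAL, v5 REAL)""")
con.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?)",
                [(s, t, 1.0, 2.0, 3.0, 4.0, 5.0)
                 for s in range(20) for t in range(100)])

# Composite index: equality on set_id, then a range scan on ts.
con.execute("CREATE INDEX ix_samples_set_ts ON samples (set_id, ts)")

hits = con.execute(
    "SELECT COUNT(*) FROM samples WHERE set_id = 7 AND ts BETWEEN 10 AND 19"
).fetchone()[0]
print(hits)  # 10
```

The column order in the index matters: with (set_id, ts), the set-ID filter narrows first and the time range is a contiguous scan within it, which fits the stated queries. Filtering on the values themselves would need additional indexes.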
