Understanding the snowflake kafka connector config tuning parameters - snowflake-cloud-data-platform

I'm working on streaming ~2000 tables from Kafka to Snowflake using the Snowflake connector for the Kafka Connect platform. I would like to understand how to tune the parameters of the connector for the best throughput. Suggestions for Kafka and Kafka Connect settings are also welcome though my primary interest is understanding the connector parameters.
My topic sizes range from < 1GB to 100's of GB. We currently have only a single parition per topic and the topics are distributed across 30 connector tasks using the RoundRobin partitioner. Our max messages size across all topics is configured to 3MB and we are using AVRO with lz4 compression.
One of our largest topics has ~7 billion events on it and is only transferring to snowflake at a rate of ~2000 events/second. I imagine that increasing the number of partitions is my main lever but I also suspect 2000 events/second is lower than in could be with a change in configuration parameters.
The parameters I suspect should be tuned are:
buffer.count.records - default 10000 events
buffer.flush.time - default 120 seconds
buffer.size.bytes - default 5MB
Currently we are using the default values.
Any advice on how to use these parameters or others to increase our throughput?

I can't speak from experience with the Kafka connector, but Snowflake prefers file in the size range of 10 - 100 MB, so I would expect that bigger that the default would be better, and also try a bigger warehouse.

Related

What is the maximum throughput or upload limit on the amount of simultaneous data being transferred using an Azure IRT?

Is there a maximum throughput or upload limit on the amount of simultaneous data being transferred using an Azure IRT as part of an Azure data factory before a pipeline and/or activity may timeout or fail?
Good question. The answer is we cannot tell.
There is an interesting article about Hyperscale throughput and most of the research I did around throughput they don't give me a concrete answer.
On the same article you can find a comment that states:
Dont’t bother with Azure Data Factory. For some reason with type
casting from blob to Azure SQL, but also Azure SQL Database as a
source, the throughput is dramatic. From 6 MB/s to 13 MB/s on high
service tiers for transferring 1 table, 5GB in total. That is beyond
bad.
So there are too many factor to count in: where the data come from, where the data go, the products you are using, etc...
You might need to open a (paid) ticket with Azure support and ask if they hardcoded that trigs a timeout. But in my experience I've seen simple database migration from SQL Server to Azure SQL DB failing for no reason and then complete at a second try.

Nifi ExecuteSQLRecord: how to capture performance bottleneck

Objective
I have Apache Nifi Docker container on Azure VM with attached premium very high-throughput SSD disk. I have MSSQL Server 2012 database on AWS. Nifi to database communication happens through mssql jar v6.2, and through high-throughtput AWS Direct Connect MPLS network.
Within Nifi Flow only one processor is executed - ExecuteSQLRecord. It use only one thread/CPU and has 4 GB JVM Heap Space available. ExecuteSQLRecord execute query that return 1 million of rows, which equals to 60MB Flow File. Query is based on table indexes, so there is nothing to optimize on DB side. Query looks like: SELECT * FROM table WHERE id BETWEEN x AND y.
The issue
ExecuteSQLRecord with 1 thread/CPU, 1 query , retrieves 1M of rows (60MB) in 40 seconds.
In the same time, the same query run from MSSMS and database internal network takes 18 seconds.
In the same time query is already optimized on DB side (with indexes), and throughtput scale linearly with increasing number of threads/CPUs - network is not a bottleneck.
Questions
Is this performance okay for Nifi 1 CPU? Is it okay that Nifi spends 22 seconds (from 40) for retrieval and storing the results to Content Repository?
How does Nifi pull the data from MSSQL Server? Is this a pull approach? If yes, maybe we have to many roundtrips?
How can I check how much time Nifi spending on converting result set to CSV, and how much time for writting into Content Repository?
Are you using the latest Docker image (1.11.4)? If so you should be able to set the fetch size on the ExecuteSQLRecord processor (https://issues.apache.org/jira/browse/NIFI-6865)
I got a couple of different results when I searched for the default fetch size for the MSSQL driver, one site said 1 and another said 32. In your case for that many records I'd imagine you'd want it to be way higher (see https://learn.microsoft.com/en-us/previous-versions/sql/legacy/aa342344(v=sql.90)?redirectedfrom=MSDN#use-the-appropriate-fetch-size for setting the appropriate fetch size).
To add to Matt's answer, you can examine the provenance data for each flowfile and see the lineage duration (amount of time) it spent in each segment of the flow. You can also see the status history for every processor, so you can examine the data in/out by size and number of flowfiles, CPU usage, etc. for each processor.

SQL Server files configuration

I have a database that receives information per second from multiple devices via SNMP, GPS and others. The insertion rate of each device is 1sec and we speak of a group of devices ranging between 80 and 120.
This is generating a significant growth of the database, to the point that it is growing at a rate increasingly close to 1 gb daily.
What is the growth configuration and minimum sizes of the data server (SQL Server 2016) should have in its logs and data files so the db is as fluid as possible?
Any additional recommendations regarding capabilities, maintenance and best practices in relation to the scenario described above?
Thank you all!
You should never have auto-growth events. Ideally, you would want to grow your data files to a size large enough so that you will not be troubled by auto-growth events, ever.
See this post over at DBA Stack Exchange that relates to growing log files.

SQL server scalability question

We are trying to build an application which will have to store billions of records. 1 trillion+
a single record will contain text data and meta data about the text document.
pl help me understand about the storage limitations. can a databse SQL or oracle support this much data or i have to look for some other filesystem based solution ? What are my options ?
Since the central server has to handle incoming load from many clients, how will parallel insertions and search scale ? how to distribute data over multiple databases or tables ? I am little green to database specifics for such scaled environment.
initally to fill the database the insert load will be high, later as the database grows, search load will increase and inserts will reduce.
the total size of data will cross 1000 TB.
thanks.
1 trillion+
a single record will contain text data
and meta data about the text document.
pl help me understand about the
storage limitations
I hope you have a BIG budget for hardware. This is big as in "millions".
A trillion documents, at 1024 bytes total storage per document (VERY unlikely to be realistic when you say text) is a size of about 950 terabyte of data. Storage limitations means you talk high end SAN here. Using a non-redundant setup of 2tb discs that is 450 discs. Make the maths. Adding redundancy / raid to that and you talk major hardware invesment. An this assumes only 1kb per document. If you have on average 16kg data usage, this is... 7200 2tb discs.
THat is a hardware problem to start with. SQL Server does not scale so high, and you can not do that in a single system anyway. The normal approach for a docuemnt store like this would be a clustered storage system (clustered or somehow distributed file system) plus a central database for the keywords / tagging. Depending on load / inserts possibly with replciations of hte database for distributed search.
Whatever it is going to be, the storage / backup requiments are terrific. Lagre project here, large budget.
IO load is gong to be another issue - hardware wise. You will need a large machine and get a TON of IO bandwidth into it. I have seen 8gb links overloaded on a SQL Server (fed by a HP eva with 190 discs) and I can imagine you will run something similar. You will want hardware with as much ram as technically possible, regardless of the price - unless you store the blobs outside.
SQL row compression may come in VERY handy. Full text search will be a problem.
the total size of data will cross 1000
TB.
No. Seriously. It will be a bigger, I think. 1000tb would assume the documents are small - like the XML form of a travel ticket.
According to the MSDN page on SQL Server limitations, it can accommodate 524,272 terabytes in a single database - although it can only accommodate 16TB per file, so for 1000TB, you'd be looking to implement partitioning. If the files themselves are large, and just going to be treated as blobs of binary, you might also want to look at FILESTREAM, which does actually keep the files on the file system, but maintains SQL Server notions such as Transactions, Backup, etc.
All of the above is for SQL Server. Other products (such as Oracle) should offer similar facilities, but I couldn't list them.
In the SQL Server space you may want to take a look at SQL Server Parallel Data Warehouse, which is designed for 100s TB / Petabyte applications. Teradata, Oracle Exadata, Greenplum, etc also ought to be on your list. In any case you will be needing some expert help to choose and design the solution so you should ask that person the question you are asking here.
When it comes to database its quite tricky and there can be multiple components involved to get performance like Redis Cache, Sharding, Read replicas etc.
Bellow post describes simplified DB scalability.
http://www.cloudometry.in/2015/09/relational-database-scalability-options.html

Database solution for 200million writes/day, monthly summarization queries

I'm looking for help deciding on which database system to use. (I've been googling and reading for the past few hours; it now seems worthwhile to ask for help from someone with firsthand knowledge.)
I need to log around 200 million rows (or more) per 8 hour workday to a database, then perform weekly/monthly/yearly summary queries on that data. The summary queries would be for collecting data for things like billing statements, eg. "How many transactions of type A did each user run this month?" (could be more complex, but that's the general idea).
I can spread the database amongst several machines, as necessary, but I don't think I can take old data offline. I'll definitely need to be able to query a month's worth of data, maybe a year. These queries would be for my own use, and wouldn't need to be generated in real-time for an end-user (they could run overnight, if needed).
Does anyone have any suggestions as to which databases would be a good fit?
P.S. Cassandra looks like it would have no problem handling the writes, but what about the huge monthly table scans? Is anyone familiar with Cassandra/Hadoop MapReduce performance?
I'm working on a very similar process at the present (a web domain crawlling database) with the same significant transaction rates.
At these ingest rates, it is critical to get the storage layer right first. You're going to be looking at several machines connecting to the storage in a SAN cluster. A singe database server can support millions of writes a day, it's the amount of CPU used per "write" and the speed that the writes can be commited.
(Network performance also often is an early bottleneck)
With clever partitioning, you can reduce the effort required to summarise the data. You don't say how up-to-date the summaries need to be, and this is critical. I would try to push back from "realtime" and suggest overnight (or if you can get away with it monthly) summary calculations.
Finally, we're using a 2 CPU 4GB RAM Windows 2003 virtual SQL Server 2005 and a single CPU 1GB RAM IIS Webserver as our test system and we can ingest 20 million records in a 10 hour period (and the storage is RAID 5 on a shared SAN). We get ingest rates upto 160 records per second batched in blocks of 40 records per network round trip.
Cassandra + Hadoop does sound like a good fit for you. 200M/8h is 7000/s, which a single Cassandra node could handle easily, and it sounds like your aggregation stuff would be simple to do with map/reduce (or higher-level Pig).
Greenplum or Teradata will be a good option. These databases are MPP and can handle peta-scale data. Greenplum is a distributed PostgreSQL db and also has it own mapreduce. While Hadoop may solve your storage problem but it wouldn't be helpful for performing summary queries on your data.

Resources