Table in parallel environment - apache-flink

If I have a table in a parallelism = 2 environment, does it mean I have 2 separate tables?
For example, I use a table to calculate the counts (and average values) of messages received from Kafka over a window period (select count(x), avg(x) from TABLE(TUMBLE()) ...) and then use toChangelogStream(). If 1M messages are passed from Kafka to this table within a window period, is the result 2 messages in 2 streams, each with a count value of 500K? If so, is there a way to use the table with parallelism = 2 but still get the total count?
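For reference, a minimal sketch of the kind of windowed aggregation described above, with illustrative table and column names (kafka_messages, x, event_time):
-- Tumbling-window aggregation over the Kafka-backed table: one result row per
-- window, carrying that window's count and average
SELECT
    window_start,
    window_end,
    COUNT(x) AS msg_count,
    AVG(x)   AS msg_avg
FROM TABLE(
    TUMBLE(TABLE kafka_messages, DESCRIPTOR(event_time), INTERVAL '1' MINUTES)
)
GROUP BY window_start, window_end;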

Related

Snowflake query pruning by Column

in the Snowflake Docs it says:
First, prune micro-partitions that are not needed for the query.
Then, prune by column within the remaining micro-partitions.
What is meant by the second step?
Let's take the example table t1 shown in the link. In this example table I use the following query:
SELECT * FROM t1
WHERE
    Date = '11/3' AND
    Name = 'C'
Because of the Date = '11/3' it would only scan micro-partitions 2, 3 and 4. Because of the Name = 'C' it can prune even more and only scan micro-partitions 2 and 4.
So in the end only micro-partitions 2 and 4 would be scanned.
But where does the second step come into play? What is meant by "prune by column within the remaining micro-partitions"?
Does it mean that only rows 4, 5 and 6 on micro-partition 2 and row 1 on micro-partition 4 are scanned, because Date is my clustering key and is sorted, so you can prune even further with the date?
So in the end only 4 rows would be scanned?
Benefits of Micro-partitioning:
Columns are stored independently within micro-partitions, often referred to as columnar storage.
This enables efficient scanning of individual columns; only the columns referenced by a query are scanned.
It is recommended to avoid SELECT * and specify required columns explicitly.
It simply means to only select the columns that are required for the query. So in your example it would be:
SELECT col_1, col_2 FROM t1
WHERE
    Date = '11/3' AND
    Name = 'C'

Is hash based ordering possible in akka-stream-kafka?

I am exploring akka-stream-kafka for one of my use cases and going through this documentation. According to the documentation, the producer sink divides the payload, i.e. the data records, equally across all the Kafka partitions, which is logical. However, I want control over which partition a message goes to. My use case: I will get millions of rows keyed by record_id, and I want all the records with the same record_id (let's assume 1234) to go to the same partition (let's assume partition number 10). To summarize, let's say I have 10,000 records and 10 partitions, and 3,700 of those records have record_id 1234. Say Kafka sent that record_id to partition number 1; I then want all 3,700 of those records to go through partition 1, because I want to maintain the order of those records. Similarly for other record_ids. The plainSink implementation in the documentation divides the records evenly across all partitions.
Is there a way I can control the record flow based on hashing of keys?
When you create the ProducerRecord, you have the chance to provide a partition index where you want it to end up.
To calculate the partition index you can simply use recordId % numberOfPartitions, which ensures that all messages with the same recordId end up in the same partition.
Example below:
val source: Source[Record, NotUsed] = ???
source
  .map { record =>
    // recordId % 10 keeps every message with the same recordId in the same partition
    val partition = (record.recordId % 10).toInt
    new ProducerRecord[Array[Byte], Record]("topic1", partition, null, record)
  }
  .runWith(Producer.plainSink(producerSettings))

SQL join running slow

I have 2 SQL queries doing the same thing; the first takes 13 seconds to execute while the second takes 1 second. Any reason why?
Not all of the IDs in ProcessMessages will necessarily have data in ProcessMessageDetails.
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0

-- takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
                            Where Id = 4 and Isdone = 0 )
I have a clustered index on t1.ProcessMessageId (PK) and a non-clustered index on t2.ProcessMessageId (FK).
I would need the actual execution plans to tell you exactly what SQL Server is doing behind the scenes, but I can tell you these queries aren't doing exactly the same thing.
The first query finds all of the rows that meet the conditions for t1, finds all of the rows for t2, and then works out which ones match and joins them together.
The second one says: first find all of the rows from t1 that meet my criteria, and then find the rows in t2 that have one of those IDs.
Depending on your statistics, available indexes, hardware and table sizes, SQL Server may decide to use different types of scans or seeks to pick up data for each part of the query, and it may also decide to join the data together in a particular way.
The answer to your question is fairly simple: the first query generates more rows than the second, so it takes more time to work through them. That is why your first query took 13 seconds and the second only one.
It is therefore generally suggested that you apply your conditions before making the join; otherwise the number of rows increases and the joined result takes longer to process, as sketched below.
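As a concrete sketch of that suggestion, the first query can be rewritten so that ProcessMessages is filtered before the join (equivalent in spirit to the second query from the question):
-- filter ProcessMessages first, then join only the qualifying rows
Select d.*
from ( Select ProcessMessageId
       from dbo.ProcessMessages
       Where Id = 4 and Isdone = 0 ) m
join dbo.ProcessMessageDetails d on d.ProcessMessageId = m.ProcessMessageId
Whether this actually changes the plan depends on the statistics and indexes mentioned above.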

SSIS- Split output into multiple files

I am using SSIS Data Tools to create data extracts from a legacy system.
Our new system needs the files that it imports to be split into 5MB files.
Is there any way I can split the output into separate files?
I'm thinking that because the data is already in the database, I can use a loop, or something similar, that will select a certain number of records at a time.
Any input appreciated!
If your source is SQL, use the ROW_NUMBER function against the table key to allocate a number per row, e.g.
Row_number() OVER (Order by Customer_Id) as RowNumber
and then wrap your query in a CTE or make it a subquery with a WHERE clause that gives you the number of rows that equates to a 5MB file, e.g.
WHERE RowNumber >= 5000 and RowNumber <10000
You will need to call this source query several times (with different RowStart and RowEnd values), so it's probably best to:
Find the total number of records in the control flow and set a TotalRows parameter
Create a loop in your control flow
Set 3 parameters in your control flow to iterate through each set of records and store the data in separate files, e.g. the first loop iteration would set
RowStart = 0
RowEnd = 5000
FileName = MyFile_[date]_0_to_4999
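Putting the pieces together, the source query for each loop iteration could look roughly like this (dbo.Customers is a stand-in for your actual table, and @RowStart/@RowEnd would be supplied from the loop's RowStart and RowEnd values):
-- number every row once, then return only the slice for this iteration's file
;WITH Numbered AS (
    SELECT  c.*,
            ROW_NUMBER() OVER (ORDER BY c.Customer_Id) AS RowNumber
    FROM    dbo.Customers c
)
SELECT *
FROM   Numbered
WHERE  RowNumber >= @RowStart
  AND  RowNumber <  @RowEnd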

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows, but when we are returning >10-20k rows we are getting SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing it increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more detail:
We have a table with 2.6 million 4-digit alphanumeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table and issued to each customer as a bar-code, and the bar-code scanning app at the door will have the same list of 5000. The reason for using a 4-digit alphanumeric code (and not a stupidly long number like a GUID) is that it's easy for people to write the number down (or SMS it to a friend) and just bring the number to have it entered manually, so we don't want a large number of characters. Customers love the last bit, btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
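For illustration (not the article's exact code, and with hypothetical table and column names), the filter-based idea looks roughly like this: CHECKSUM(NEWID()) produces a fresh pseudo-random integer per row, so the WHERE clause keeps roughly the desired fraction of the 2.6M rows without sorting the whole table first.
-- 100k out of 2.6M is about 3.85%; the threshold is set slightly higher and
-- TOP trims the surplus
SELECT TOP (100000) Code
FROM   dbo.Codes
WHERE  ABS(CHECKSUM(NEWID()) % 100000) < 4500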
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers between 0 and 1,679,615 (36^4 = 1,679,616 is the number of unique four-digit alphanumeric codes, ignoring case, so 2.6 million rows must contain some duplicates) and convert them to your four-digit codes.
You don't have to sort.
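As a rough illustration of the conversion step mentioned above (hypothetical T-SQL; @n stands in for one of the generated numbers):
DECLARE @n int
DECLARE @alphabet char(36)
SET @n = 123456
SET @alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
-- one base-36 digit per character position (46656 = 36^3, 1296 = 36^2)
SELECT SUBSTRING(@alphabet, (@n / 46656) % 36 + 1, 1)
     + SUBSTRING(@alphabet, (@n / 1296)  % 36 + 1, 1)
     + SUBSTRING(@alphabet, (@n / 36)    % 36 + 1, 1)
     + SUBSTRING(@alphabet,  @n          % 36 + 1, 1) AS Code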
DECLARE @RandomNumber int
DECLARE @Threshold float
SELECT @RandomNumber = COUNT(*) FROM customers
-- 50000.0 forces float division; @Threshold is the fraction of rows wanted
SELECT @Threshold = 50000.0 / @RandomNumber
-- RAND(CHECKSUM(NEWID())) is evaluated per row, unlike a bare rand()
SELECT TOP 50000 * FROM customers
WHERE RAND(CHECKSUM(NEWID())) < @Threshold
ORDER BY newid()
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
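For concreteness, the two variants side by side (table and column names are illustrative):
-- original: sorts by a fresh GUID per row
SELECT TOP (100000) Code FROM dbo.Codes ORDER BY NEWID()
-- suggested: sorts by an integer checksum of the GUID instead
SELECT TOP (100000) Code FROM dbo.Codes ORDER BY CHECKSUM(NEWID())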
One thought is to break the process down into steps. Add a GUID column to the table, then run an UPDATE statement to populate it with GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode, 1, 1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode, 2, 1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode, 3, 1)) as varchar(3)) +
            CAST(ASCII(SUBSTRING(venuecode, 4, 1)) as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code takes all of your alphanumeric venue codes, converts them to an integer and then splits the entire table into 500,000 buckets, of which you take the top 50,000 that fall between 0 and 50,000. You can play with the number after the % sign (500,000) and you can play with the BETWEEN range. This should randomize it for you. Not sure if the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY, there is no guarantee of the order (if you have multiple CPUs and threading).
