Is hash based ordering possible in akka-stream-kafka?

Is hash based ordering possible in akka-stream-kafka? - akka-stream

I am exploring akka-stream-kafka for one of my use case and going through this documentation. According to the documentation, the producer sink divides the payload i.e. data records into all the Kafka partitions equally, which is logical. However I wanted to have control over the partition where a message is going. My use case is, I will get millions of rows with key as record_id, now I want to send all the records for the same record_id lets assume 1234 to same partition lets assume partition number 10. So to summarized it lets say I have 1000 records and 10 partitions. Out of those 1000 records 3700 are with record_id 1234. Lets say kafka sent that record_id to partition number 1. So I want all those 3700 records to go through partition 1 as I want to maintain the order of those records. Similarly for other record_id. The plainsink implementation that the documentation have divides the records evenly to all partition.
Is there a way I can control the record flow based on hashing of keys?

When you create the ProducerRecord, you have the chance to provide a partition index where you want it to end up.
To calculate the partition index you can simply use recordId % numberOfPartitions, and you will make sure all messages with the same recordId will end up in the same partition.
Example below:
val source: Source[Record, NotUsed] = ???
source
.map { record =>
val partition = record.recordId % 10
new ProducerRecord[Array[Byte], Record]("topic1", partition, null, record)
}
.runWith(Producer.plainSink(producerSettings))

Related

Is "offset-fetch and order by" ordering the whole table or partial table?

In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server work? What strategy does SQL server use?
1. Do a sorting on the whole table first and then select the 1 million rows we need
2. Do a sorting on partial table and then return the 1 million rows we need.
I assume it is 2nd option. If so, how does SQL server decide which range of the table to be sorted?
Edit 1:
I am asking this question to understand what could cause the query slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found the second query can take me about 30 minutes to finish while the first takes almost 0 second.
So I am curious on what causes this difference? If the two have same time used for order by (or does it even really do a sorting on the whole table? The id is the clustered indexed column of the table. I cannot imagine that it takes 0 second to finish sorting on a terabyte table.)
Then if the sorting takes same time, only difference would be the clustered-index scan. For first query, it only needs to scan first 1 or 10 (a small number) of rows. While for the second query, it needs to scan a much bigger number of rows ( >1000000000 ). But I am not quite sure if this is correct.
Thank you for your help!

Let me take a simple example..
order by id
offset 50 rows fetch 25 rows only
For the above query,the steps would be
1.Table should be sorted by id (if not pay penalty of sort,there is no partial sort,always a full sort)
2.Then scan 50+25 rows(paying cost of 75 rows) and return 25 rows only..
Below is an example of orders table i have(orderid is Pk,so sorted),you can see even though, we are getting only 20 rows ,you are paying cost of 120 rows...
Coming to your question,there is no partial sort (Which implies first option regarding sort only),even you try to return one row like below..
select top 1* from table
order by orderid

Postgres ordering table by element in large data set

I have a tricky problem trying to find an efficient way of ordering a set of objects (~1000 rows) that contain a large (~5 million) number of indexed data points. In my case I need a query that allows me to order the table by a specific datapoint. Each datapoint is a 16-bit unsigned integer.
I am currently solving this problem by using an large array:
Object Table:
id serial NOT NULL,
category_id integer,
description text,
name character varying(255),
created_at timestamp without time zone NOT NULL,
updated_at timestamp without time zone NOT NULL,
data integer[],
GIST index:
CREATE INDEX object_rdtree_idx
ON object
USING gist
(data gist__intbig_ops)
This index is not currently being used when I do a select query, and I am not certain it would help anyway.
Each day the array field is updated with a new set of ~5 million values
I have a webserver that needs to list all objects ordered by the value of a particular data point:
Example Query:
SELECT name, data[3916863] as weight FROM object ORDER BY weight DESC
Currently, it takes about 2.5 Seconds to perform this query.
Question:
Is there a better approach? I am happy for the insertion side to be slow as it happens in the background, but I need the select query to be as fast as possible. In saying this, there is a limit to how long the insertion can take.
I have considered creating a lookup table where every value has it's own row - but I'm not sure how the insertion/lookup time would be affected by this approach and I suspect entering 1000+ records with ~5 million data points as individual rows would be too slow.
Currently inserting a row takes ~30 seconds which is acceptable for now.
Ultimately I am still on the hunt for a scalable solution to the base problem, but for now I need this solution to work, so this solution doesn't need to scale up any further.
Update:
I was wrong to dismiss having a giant table instead of an array, while insertion time massively increased, query time is reduced to just a few milliseconds.
I am now altering my generation algorithm to only save a datum if it non-zero and changed from previous update. This has reduced insertions to just a few hundred thousands values which only takes a few seconds.
New Table:
CREATE TABLE data
(
object_id integer,
data_index integer,
value integer,
)
CREATE INDEX index_data_on_data_index
ON data
USING btree
("data_index");
New Query:
SELECT name, coalesce(value,0) as weight FROM objects LEFT OUTER JOIN data on data.object_id = objects.id AND data_index = 7731363 ORDER BY weight DESC
Insertion Time: 15,000 records/second
Query Time: 17ms

First of all, do you really need a relational database for this? You do not seem to be relating some data to some other data. You might be much better off with a flat-file format.
Secondly, your index on data is useless for the query you showed. You are querying for a datum (a position in your array) while the index is built on the values in the array. Dropping the index will make the inserts considerably faster.
If you have to stay with PostgreSQL for other reasons (bigger data model, MVCC, security) then I suggest you change your data model and ALTER COLUMN data SET TYPE bytea STORAGE external. Since the data column is about 4 x 5 million = 20MB it will be stored out-of-line anyway, but if you explicitly set it, then you know exactly what you have.
Then create a custom function in C that fetches your data value "directly" using the PG_GETARG_BYTEA_P_SLICE() macro and that would look somewhat like this (I am not a very accomplished PG C programmer so forgive me any errors, but this should help you on your way):
// Function get_data_value() -- Get a 4-byte value from a bytea
// Arg 0: bytea* The data
// Arg 1: int32 The position of the element in the data, 1-based
PG_FUNCTION_INFO_V1(get_data_value);
Datum
get_data_value(PG_FUNCTION_ARGS)
{
int32 element = PG_GETARG_INT32_P(1) - 1; // second argument, make 0-based
bytea *data = PG_GETARG_BYTEA_P_SLICE(0, // first argument
element * sizeof(int32), // offset into data
sizeof(int32)); // get just the required 4 bytes
PG_RETURN_INT32_P((int32*)data);
}
The PG_GETARG_BYTEA_P_SLICE() macro retrieves only a slice of data from the disk and is therefore very efficient.
There are some samples of creating custom C functions in the docs.
Your query now becomes:
SELECT name, get_data_value(data, 3916863) AS weight FROM object ORDER BY weight DESC;

Using a sort order column in a database table

Let's say I have a Product table in a shopping site's database to keep description, price, etc of store's products. What is the most efficient way to make my client able to re-order these products?
I create an Order column (integer) to use for sorting records but that gives me some headaches regarding performance due to the primitive methods I use to change the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was creating a procedure which checks whether there is a record in the target order (3) and updates the order of the row (ID=26) if not. If there is a record in target order the procedure executes itself sending that row's ID with target order + 1 as parameters.
That causes to update every single record after the one I want to change to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the order column of an item to be enough for sorting with no secondary keys involved. Order column alone must specify a unique place for its record.
In addition to all, I wonder if I can implement something like of a linked list: A 'Next' column instead of an 'Order' column to keep the next items ID. But I have no idea how to write the query that retrieves the records with correct order. If anyone has an idea about this approach as well, please share.

Update product set order = order+1 where order >= #value changed
Though over time you'll get larger and larger "spaces" in your order but it will still "sort"
This will add 1 to the value being changed and every value after it in one statement, but the above statement is still true. larger and larger "spaces" will form in your order possibly getting to the point of exceeding an INT value.
Alternate solution given desire for no spaces:
Imagine a procedure for: UpdateSortOrder with parameters of #NewOrderVal, #IDToChange,#OriginalOrderVal
Two step process depending if new/old order is moving up or down the sort.
If #NewOrderVal < #OriginalOrderVal --Moving down chain
--Create space for the movement; no point in changing the original
Update product set order = order+1
where order BETWEEN #NewOrderVal and #OriginalOrderVal-1;
end if
If #NewOrderVal > #OriginalOrderVal --Moving up chain
--Create space for the momvement; no point in changing the original
Update product set order = order-1
where order between #OriginalOrderVal+1 and #NewOrderVal
end if
--Finally update the one we moved to correct value
update product set order = #newOrderVal where ID=#IDToChange;
Regarding best practice; most environments I've been in typically want something grouped by category and sorted alphabetically or based on "popularity on sale" thus negating the need to provide a user defined sort.

Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.

Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE #id int = 26
DECLARE #sortOrder int = 3
;WITH Sorted AS (
SELECT Id,
ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
FROM Product
WHERE Id <> #id
)
UPDATE p
SET p.SortOrder =
(CASE
WHEN p.Id = #id THEN #sortOrder
WHEN s.RowNumber >= #sortOrder THEN s.RowNumber + 1
ELSE s.RowNumber
END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id

It is very simple. You need to have "cardinality hole".
Structure: you need to have 2 columns:
pk = 32bit int
order = 64bit bigint (BIGINT, NOT DOUBLE!!!)
Insert/UpdateL
When you insert first new record you must set order = round(max_bigint / 2).
If you insert at the beginning of the table, you must set order = round("order of first record" / 2)
If you insert at the end of the table, you must set order = round("max_bigint - order of last record" / 2)
If you insert in the middle, you must set order = round("order of record before - order of record after" / 2)
This method has a very big cardinality. If you have constraint error or if you think what you have small cardinality you can rebuild order column (normalize).
In maximality situation with normalization (with this structure) you can have "cardinality hole" in 32 bit.
It is very simple and fast!
Remember NO DOUBLE!!! Only INT - order is precision value!

One solution I have used in the past, with some success, is to use a 'weight' instead of 'order'. Weight being the obvious, the heavier an item (ie: the lower the number) sinks to the bottom, the lighter (higher the number) rises to the top.
In the event I have multiple items with the same weight, I assume they are of the same importance and I order them alphabetically.
This means your SQL will look something like this:
ORDER BY 'weight', 'itemName'
hope that helps.

I am currently developing a database with a tree structure that needs to be ordered. I use a link-list kind of method that will be ordered on the client (not the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading on returning a random row set last year, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows. But when we are getting >10-20k rows we are getting SQL time outs, the Execution planned tells me that 76% of my query cost comes from this line. and removing this line increase the speed by an order of magnitude when we have a large amount of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4 digit alpha-numeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5000 capacity, a random set of 5000 of these will be drawn from the table then issued to the each customer as a bar-code, then the bar-code scanning app at the door with have the same list of 5000. The reason for using a 4 digit alpha numeric code (and not a stupidly long number like a GUID) is that it easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want large amount of characters. Customers love the last bit btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 mil?
Oh, and we are using MS SQL 2005.
Thanks,
Jo

There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.

If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,616 (which is the number of unique four-digit alphanumeric codes, ignoring case - 2.6 million rows must have some duplicates) and convert them to your four-digit codes.

You don't have to sort.
DECLARE #RandomNumber int
DECLARE #Threshold float
SELECT #RandomNumber = COUNT(*) FROM customers
SELECT #Threshold = 50000 / #RandomNumber
SELECT TOP 50000 * FROM customers WHERE rand() > #Threshold ORDER BY newid()

Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())

One thought is to break down the process into steps. Add a column in the table for a GUID then do an update statement into the table adding the GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an orderby on the GUID column to recieve the results the same way.

Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,2,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,3,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,4,1))as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code will take all of your alpha numeric venues and convert them to an integer and then split the entire table into 500,000 buckets of which you are taking the top 50000 that fall between 0 and 50000. You can play with the number after the % since (500,000) and you can play with the between. This should randomize it for you. Not sure if the where clause will bite you on performance, but it's worth a shot. Also, without an order by, there is no guarantee of the order (if you have multiple cpus and threading).

Get "surrounding" rows in NHibernate query

I am looking for a way to retrieve the "surrounding" rows in a NHibernate query given a primary key and a sort order?
E.g. I have a table with log entries and I want to display the entry with primary key 4242 and the previous 5 entries as well as the following 5 entries ordered by date (there is no direct relation between date and primary key). Such a query should return 11 rows in total (as long as we are not close to either end).
The log entry table can be huge and retrieving all to figure it out is not possible.
Is there such a concept as row number that can be used from within NHibernate? The underlying database is either going to be SQlite or Microsoft SQL Server.
Edited Added sample
Imagine data such as the following:
Id Time
4237 10:00
4238 10:00
1236 10:01
1237 10:01
1238 10:02
4239 10:03
4240 10:04
4241 10:04
4242 10:04 <-- requested "center" row
4243 10:04
4244 10:05
4245 10:06
4246 10:07
4247 10:08
When requesting the entry with primary key 4242 we should get the rows 1237, 1238 and 4239 to 4247. The order is by Time, Id.
Is it possible to retrieve the entries in a single query (which obviously can include subqueries)? Time is a non-unique column so several entries have the same value and in this example is it not possible to change the resolution in a way that makes it unique!

"there is no direct relation between date and primary key" means, that the primary keys are not in a sequential order?
Then I would do it like this:
Item middleItem = Session.Get(id);
IList<Item> previousFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Le("Time", middleItem.Time))
.AddOrder(Order.Desc("Time"))
.SetMaxResults(5);
IList<Item> nextFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Gt("Time", middleItem.Time))
.AddOrder(Order.Asc("Time"))
.SetMaxResults(5);
There is the risk of having several items with the same time.
Edit
This should work now.
Item middleItem = Session.Get(id);
IList<Item> previousFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Le("Time", middleItem.Time)) // less or equal
.Add(Expression.Not(Expression.IdEq(middleItem.id))) // but not the middle
.AddOrder(Order.Desc("Time"))
.SetMaxResults(5);
IList<Item> nextFiveItems = Session.CreateCriteria((typeof(Item))
.Add(Expression.Gt("Time", middleItem.Time)) // greater
.AddOrder(Order.Asc("Time"))
.SetMaxResults(5);

This should be relatively easy with NHibernate's Criteria API:
List<LogEntry> logEntries = session.CreateCriteria(typeof(LogEntry))
.Add(Expression.InG<int>(Projections.Property("Id"), listOfIds))
.AddOrder(Order.Desc("EntryDate"))
.List<LogEntry>();
Here your listOfIds is just a strongly typed list of integers representing the ids of the entries you want to retrieve (integers 4242-5 through 4242+5 ).
Of course you could also add Expressions that let you retrieve Ids greater than 4242-5 and smaller than 4242+5.

Stefan's solution definitely works but better way exists using a single select and nested Subqueries:
ICriteria crit = NHibernateSession.CreateCriteria(typeof(Item));
DetachedCriteria dcMiddleTime =
DetachedCriteria.For(typeof(Item)).SetProjection(Property.ForName("Time"))
.Add(Restrictions.Eq("Id", id));
DetachedCriteria dcAfterTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyGt("Time", dcMiddleTime));
DetachedCriteria dcBeforeTime =
DetachedCriteria.For(typeof(Item)).SetMaxResults(5).SetProjection(Property.ForName("Id"))
.Add(Subqueries.PropertyLt("Time", dcMiddleTime));
crit.AddOrder(Order.Asc("Time"));
crit.Add(Restrictions.Eq("Id", id) || Subqueries.PropertyIn("Id", dcAfterTime) ||
Subqueries.PropertyIn("Id", dcBeforeTime));
return crit.List<Item>();
This is NHibernate 2.0 syntax but the same holds true for earlier versions where instead of Restrictions you use Expression.
I have tested this on a test application and it works as advertised