How to attach schema to a Flink DataStream - on the fly? - apache-flink

I am dealing with a stream of database mutations, i.e., a change log stream. I want to be able to transform the values using a SQL query.
I am having difficulty putting together the following three concepts:
RowTypeInfo, Row, and DataStream.
NOTE: I don't know the schema beforehand. I construct it on the fly using the data within the Mutation object (Mutation is a custom type).
More specifically, I have code that looks like this:
val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(execEnv)
// Mutation is a custom type
val mutationStream: DataStream[Mutation] = ...
// toRows returns an object of type org.apache.flink.types.Row
val rowStream:DataStream[Row] = mutationStream.flatMap({mutation => toRows(mutation)})
tableEnv.registerDataStream("spinal_tap_table", rowStream)
tableEnv.sql("select col1 + 2")
NOTE: The Row object is positional and doesn't have a placeholder for column names.
I couldn't find a place to attach the schema to the DataStream object.
I want to pass some sort of a struct similar to Row that contains the complete information {columnName: String, columnValue: Object, columnType: TypeInformation[_]} for the query.

In Flink SQL a table schema is mandatory when the Table is defined. It is not possible to run queries on dynamically typed records.
Regarding the concepts of RowTypeInfo, Row and DataStream:
Row is the actual record that holds the data.
RowTypeInfo is a schema description for Rows. It contains names and TypeInformation for each field of a Row.
DataStream is a logical stream of records. A DataStream[Row] is a stream of rows. Note that this is not the actual stream but just an API concept for representing a stream.
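To make the relationship concrete, here is a minimal Java sketch (the field names col1/col2, the example values, and the job name are illustrative assumptions, not part of the question) showing how a RowTypeInfo is attached to a DataStream<Row> with returns(), since a Row by itself carries no field names:

import java.util.Arrays;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;

public class RowSchemaSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The schema (a name and a TypeInformation for each positional field of Row)
        // must be known when the stream is typed; it cannot be derived per record at runtime.
        RowTypeInfo rowType = new RowTypeInfo(
                new TypeInformation<?>[] { Types.STRING, Types.INT },
                new String[] { "col1", "col2" });

        DataStream<Row> rows = env
                // fromCollection accepts an explicit TypeInformation, so no type extraction is attempted.
                .fromCollection(Arrays.asList(Row.of("a", 1), Row.of("b", 2)), rowType)
                .map(r -> Row.of(r.getField(0), (Integer) r.getField(1) + 1))
                // After a map/flatMap the schema has to be re-attached, otherwise Flink
                // falls back to a generic (RAW) type for Row.
                .returns(rowType);

        rows.print();
        env.execute("row-schema-sketch");
    }
}

When such a stream is registered with the StreamTableEnvironment (registerDataStream in the Flink version used in the question), the field names from the RowTypeInfo become the column names of the resulting table.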

Related

Flink interprets a mapped Row as a single RAW

I'm able to sink a static Row to a database:
DataStream<Row> staticRows = environment.fromElements(Row.of("value1"), Row.of("value2"));
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(environment); // convert to table API
Table inputTable = tableEnv.fromDataStream(staticRows);
tableEnv.executeSql(myDDLAndSinkProperties);
inputTable.executeInsert("MYTABLE");
But mapping an unbounded stream to a Row like this:
DataStream<Row> kafkaRows = kafkaEvents.map(new MyKafkaRecordToRowMapper());
Throws an error saying that the input and sink schemas do not match when the DB insert is attempted:
Query schema: [f0: RAW('org.apache.flink.types.Row', '...')]
The same code works for a POJO and a Tuple, but I have more than 25 columns and the POJO doesn't serve any other purpose - so I'm hoping it can be replaced by a general-purpose sequence of fields (which Row claims to be).
How do I use Row as input to a database? The given examples only show it used for static data streams and for database output.
I think this will work better if you change it to something like this (after adjusting the column names and types, of course):
DataStream<Row> kafkaRows = kafkaEvents
    .map(new MyKafkaRecordToRowMapper())
    .returns(Types.ROW_NAMED(
        new String[] {"id", "quota", "ts", ...},
        Types.STRING,
        Types.LONG,
        TypeInformation.of(Instant.class),
        ...));

Flink: trigger event based on JSON dynamic input data (like map object data)

I would like to know whether Flink can support my requirement. I have gone through a lot of articles but am not sure whether my case can be solved or not.
Case:
I have two input sources: a) Event, b) ControlSet.
Event sample data is:
event 1:
{
  "id": 100,
  "data": {
    "name": "abc"
  }
}
event 2:
{
  "id": 500,
  "data": {
    "date": "2020-07-10",
    "name": "event2"
  }
}
As you can see, event 1 and event 2 have different attributes in "data". So consider "data" a free-form field whose attribute names may be the same or different across events.
ControlSet gives us the instruction to execute a trigger. For example, a trigger condition could be:
(id = 100 && name = abc) OR (id = 500 && date = "2020-07-10")
Please tell me whether this kind of scenario can be handled in Flink and what the best way would be. I don't think CEP patterns or SQL can help here, and I'm not sure whether the event DataStream can hold JSON objects and be queried with something like a JSON path.
Yes, this can be done with Flink. And CEP and SQL don't help, since they require that the pattern is known at compile time.
For the event stream, I propose to key this stream by the id, and to store the attribute/value data in keyed MapState, which is a kind of keyed state that Flink knows how to manage, checkpoint, restore, and rescale as necessary. This gives us a distributed map, mapping ids to hash maps holding the data for each id.
For the control stream, let me first describe a solution for a simplified version where the control queries are of the form
(id == key) && (attr == value)
We can simply key this stream by the id in the query (i.e., key), and connect this stream to the event stream. We'll use a RichCoProcessFunction to hold the MapState described above, and as these queries arrive, we can look to see what data we have for key, and check if map[attr] == value.
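As a rough illustration of this simplified case, here is a minimal Java sketch using a KeyedCoProcessFunction (the keyed variant of the co-process function described above); the Event and Query types and all field names are invented for the example:

import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// The surrounding job would key both streams before connecting them, e.g.
// events.keyBy(e -> e.id).connect(queries.keyBy(q -> q.key)).process(new SimpleTriggerFunction()).
public class SimpleTriggerFunction
        extends KeyedCoProcessFunction<Long, SimpleTriggerFunction.Event, SimpleTriggerFunction.Query, String> {

    // Placeholder types, just enough to make the sketch self-contained.
    public static class Event { public long id; public Map<String, String> data; }
    public static class Query { public long key; public String attr; public String value; }

    // Keyed MapState: for the current id, attribute name -> latest value seen.
    private transient MapState<String, String> attrs;

    @Override
    public void open(Configuration parameters) {
        attrs = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("attrs", String.class, String.class));
    }

    @Override
    public void processElement1(Event event, Context ctx, Collector<String> out) throws Exception {
        // Remember every attribute/value pair carried in the free-form "data" field.
        for (Map.Entry<String, String> e : event.data.entrySet()) {
            attrs.put(e.getKey(), e.getValue());
        }
    }

    @Override
    public void processElement2(Query query, Context ctx, Collector<String> out) throws Exception {
        // The control stream is keyed by query.key, so this state belongs to the matching id.
        String stored = attrs.get(query.attr);
        if (stored != null && stored.equals(query.value)) {
            out.collect("trigger fired for id=" + query.key);
        }
    }
}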
To handle more complex queries, like the one in the question
(id1 == key1 && attr1 == value1) OR (id2 == key2 && attr2 == value2)
we can do something more complex.
Here we will need to assign a unique id to each control query.
One approach would be to broadcast these queries to a KeyedBroadcastProcessFunction that once again is holding the MapState described above. In the processBroadcastElement method, each instance can use applyToKeyedState to check on the validity of the components of the query for which that instance is storing the keyed state (the attr/value pairs derived from the data field in the event stream). For each keyed component of the query where an instance can supply the requested info, it emits a result downstream.
Then after the KeyedBroadcastProcessFunction we key the stream by the control query id, and use a KeyedProcessFunction to assemble together all of the responses from the various instances of the KeyedBroadcastProcessFunction, and to determine the final result of the control/query message.
It's not really necessary to use broadcast here, but I found this scheme a little more straightforward to explain. But you could instead route keyed copies of the query to only the instances of the RichCoProcessFunction holding MapState for the keys used in the control query, and then do the same sort of assembly of the final result afterwards.
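To sketch the broadcast variant in Java (again, every type and field name here is an assumption made for illustration, and the job wiring plus the downstream KeyedProcessFunction that groups the partial results by query id are not shown): the processBroadcastElement method uses applyToKeyedState to test each predicate of the query against whatever keyed state this parallel instance holds, and emits a PartialResult for every match.

import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class QueryEvaluator extends
        KeyedBroadcastProcessFunction<Long, QueryEvaluator.Event, QueryEvaluator.ControlQuery, QueryEvaluator.PartialResult> {

    // Placeholder types, just enough to make the sketch self-contained.
    public static class Event { public long id; public Map<String, String> data; }
    public static class Predicate { public long key; public String attr; public String value; }
    public static class ControlQuery { public String queryId; public List<Predicate> predicates; }
    public static class PartialResult {
        public String queryId; public long key;
        public PartialResult(String queryId, long key) { this.queryId = queryId; this.key = key; }
    }

    // The same descriptor is used for the keyed state access and for applyToKeyedState.
    private static final MapStateDescriptor<String, String> ATTRS =
            new MapStateDescriptor<>("attrs", String.class, String.class);

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<PartialResult> out)
            throws Exception {
        // As before: keep the latest attribute/value pairs for this id in keyed MapState.
        MapState<String, String> attrs = getRuntimeContext().getMapState(ATTRS);
        for (Map.Entry<String, String> e : event.data.entrySet()) {
            attrs.put(e.getKey(), e.getValue());
        }
    }

    @Override
    public void processBroadcastElement(ControlQuery query, Context ctx, Collector<PartialResult> out)
            throws Exception {
        // Every parallel instance sees every broadcast query; each instance checks the
        // predicates against the keyed state (the ids) that it happens to hold.
        ctx.applyToKeyedState(ATTRS, (Long id, MapState<String, String> attrs) -> {
            for (Predicate p : query.predicates) {
                if (p.key == id && p.value.equals(attrs.get(p.attr))) {
                    out.collect(new PartialResult(query.queryId, id));
                }
            }
        });
    }
}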
That may have been hard to follow. What I've proposed involves composing two techniques I've coded up before in examples: https://github.com/alpinegizmo/flink-training-exercises/blob/master/src/main/java/com/ververica/flinktraining/solutions/datastream_java/broadcast/TaxiQuerySolution.java is an example that uses broadcast to trigger the evaluation of query predicates across keyed state, and https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 is an example that uses a unique id to re-assemble a single response after doing multiple enrichments in parallel.

How to find a MoveTo destination filled by database?

I could use some help with an AnyLogic model.
Model (short): a manufacturing scenario in which orders move along individual routes. The workplaces (WP) are created dynamically at simulation start. Their names, quantities and other parameters are stored in a database (Excel import). The orders are also created according to an import. The agent population "order" has a collection routing which contains the workplaces it has to stop at, in the specified order.
Target: I want a moveTo block in main which finds the next destination of the agent order.
Problem and solution paths:
I set the destination type to agent, and in the Agent field I typed a function agent.getDestination(). This function is in order and returns the next entry of the collection: WP destinationName = routing.get(i). With this I get a datatype error (at run time, not while compiling). I guess it's because the database does not save the entries as the WP type but only as String.
Is there a possibility to create a collection with agents from an Excel file?
After this I tried to have getDestination return a String and then find, via findFirst, the WP matching the returned name and return it as a WP: WP targetWP = findFirst(wps, w -> w.name == destinationName);
Of course wps (the population of workplaces) couldn't be found.
How can I search the population?
Maybe with an AgentLink?
I think it is not that difficult, but I can't find an answer or a solution. As you can tell I'm a beginner... I hope the description is good and someone can help me or give me a hint :)
Thanks
Is there a possibility to create a collection with agents from an Excel file?
Not directly using the collection's properties and, as you've seen, you can't have database (DB) column types which are agent types.[1]
But this is relatively simple to do directly via Java code (and you can use the Insert Database Query wizard to construct the skeleton code for you).
After this I tried to have getDestination return a String and then find, via findFirst, the WP matching the returned name and return it as a WP
Yes, this is one approach. If your order details are in Excel/the database, they are presumably referring to workplaces via some String ID (which will be a parameter of the workplace agents you've created from a separate Excel worksheet/database table). You need to use the Java equals method to compare strings though, not == (which is for comparing numbers or whether two objects are the same object).
I want a moveTo block in main which finds the next destination of the agent order
So the general overall solution is
Create a population of Workplace agents (let's say called workplaces in Main) from the DB, each with a String parameter id or similar mapped from a DB column.
Create a population of Order agents (let's say called orders in Main) from the DB and then, in their on-startup action, set up their collection of workplace IDs (type ArrayList, element class String; let's say called workplaceIDsList) using data from another DB table.
Order probably also needs a working variable storing the next index in the list that it needs to go to (so let's say an int variable nextWorkplaceIndex which starts at 0).
Write a function in Main called getWorkplaceByID that has a single String argument id and returns a Workplace. This gets the workplace from the population that matches the ID; a one-line way similar to yours is findFirst(workplaces, w -> w.id.equals(id)).
The MoveTo block (which I presume is in Main) needs to move the Order to an agent defined by getWorkplaceByID(agent.workplaceIDsList.get(nextWorkplaceIndex++)). (The ++ bit increments the index after evaluating the expression so it is ready for the next workplace to go to.)
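A minimal sketch of the getWorkplaceByID function body described above (assuming the population is called workplaces and the Workplace parameter is called id, as in these steps; adjust the names to your own model):

// Function defined in Main: return type Workplace, one String argument "id".
Workplace match = findFirst(workplaces, w -> w.id.equals(id));
if (match == null) {
    // Fail fast instead of handing null to the MoveTo block.
    error("No workplace found for id " + id);
}
return match;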
For populating the collection, you'd have two tables, something like the below (assuming using strings as IDs for workplaces and orders):
orders table: columns for parameters of your orders (including some String id column) other than the workplace-list. (Create one Order agent per row.)
order_workplaces table: columns order_id, sequence_num and workplace_id (so with multiple rows specifying the sequence of workplace IDs for an order ID).
In the On startup action of Order, use the Insert Database Query wizard to set up the skeleton query code (we want to loop through all rows for this order's ID and do something --- we'll change the skeleton code to add entries to the collection instead of just printing values via traceln as the skeleton does).
Then we edit the skeleton code to look like the below. (Note we add an orderBy clause to the initial query so we ensure we get the rows in ascending sequence number order.)
List<Tuple> rows = selectFrom(order_workplaces)
.where(order_workplaces.order_id.eq(id))
.orderBy(order_workplaces.sequence_num.asc())
.list();
for (Tuple row : rows) {
workplaceIDsList.add(row.get(order_workplaces.workplace_id));
}
[1] The AnyLogic database is a normal relational database --- HSQLDB in fact --- and databases only understand their own specific data types like VARCHAR, with AnyLogic and the libraries it uses translating these to Java types like String. In the user interface, AnyLogic makes it look like you set the column types as int, String, etc., but these are really the Java types that the columns' contents will ultimately be translated into.
AnyLogic does support columns which have option list types (and the special Code type column for columns containing executable Java code), but these are special cases using special logic under the covers to translate the column data (which is ultimately still a string of characters) into the appropriate option list instance or (for Code columns) into compiled-on-the-fly-and-then-executed Java.
Welcome to Stack Overflow :) To create a population via Excel import you have to create a method and call code like this. You also need an empty population.
int n = excelFile.getLastRowNum(YOUR_SHEET_NAME);
for (int i = FIRST_ROW; i <= n; i++) {
    String name = excelFile.getCellStringValue(YOUR_SHEET_NAME, i, 1);
    double SEC_PARAMETER_TO_READ = excelFile.getCellNumericValue(YOUR_SHEET_NAME, i, 2);
    WP workplace = add_wps(name, SEC_PARAMETER_TO_READ);
}
Now if you want to get a workplace by name, you have to create a method similar to your attempt.
Function body:
WP workplaceToFind = findFirst(wps, w -> w.name.equals(destinationName));
if (workplaceToFind != null) {
    // do whatever you want
}

Django Query Optimisation

I am currently working on a telecom analytics project and am a newbie in query optimisation. Showing the result in the browser takes a full minute, although only 45,000 records need to be accessed. Could you please suggest ways to reduce the time it takes to show results?
I wrote the following query to find the call duration for persons of an age group:
sigma = 0
popn = len(Demo.objects.filter(age_group=age))
card_list = [Demo.objects.filter(age_group=age)[i].card_no
             for i in range(popn)]
for card in card_list:
    dic = Fact_table.objects.filter(card_no=card).aggregate(Sum('duration'))
    sigma += dic['duration__sum']
avgDur = sigma / popn
The above code is inside a for loop that iterates over the age groups.
Model is as follows:
class Demo(models.Model):
    card_no = models.CharField(max_length=20, primary_key=True)
    gender = models.IntegerField()
    age = models.IntegerField()
    age_group = models.IntegerField()

class Fact_table(models.Model):
    pri_key = models.BigIntegerField(primary_key=True)
    card_no = models.CharField(max_length=20)
    duration = models.IntegerField()
    time_8bit = models.CharField(max_length=8)
    time_of_day = models.IntegerField()
    isBusinessHr = models.IntegerField()
    Day_of_week = models.IntegerField()
    Day = models.IntegerField()
Thanks
Try this:
sigma = 0
demo_by_age = Demo.objects.filter(age_group=age)
popn = demo_by_age.count()  # One
card_list = demo_by_age.values_list('card_no', flat=True)  # Two
dic = Fact_table.objects.filter(card_no__in=card_list).aggregate(Sum('duration'))  # Three
sigma = dic['duration__sum']
avgDur = sigma / popn
A statement like card_list=[Demo.objects.filter(age_group=age)[i].card_no for i in range(popn)] will generate popn separate queries and database hits. The query in the for loop will also hit the database popn times. As a general rule, you should try to minimize the number of queries you use, and you should only select the records you need.
With a few adjustments to your code this can be done in just one query.
There's generally no need to manually specify a primary_key, and in all but some very specific cases it's even better not to define any. Django automatically adds an indexed, auto-incremental primary key field. If you need the card_no field as a unique field, and you need to find rows based on this field, use this:
class Demo(models.Model):
    card_no = models.SlugField(max_length=20, unique=True)
    ...
SlugField automatically adds a database index to the column, essentially making selections by this field as fast as when it is a primary key. This still allows other ways to access the table, e.g. foreign keys (as I'll explain in my next point), to use the (slightly) faster integer field specified by Django, and will ease the use of the model in Django.
If you need to relate an object to an object in another table, use models.ForeignKey. Django gives you a whole set of new functionality that not only makes it easier to use the models, it also makes a lot of queries faster by using JOIN clauses in the SQL query. So for your example:
class Fact_table(models.Model):
    card = models.ForeignKey(Demo, related_name='facts')
    ...
The related_name fields allows you to access all Fact_table objects related to a Demo instance by using instance.facts in Django. (See https://docs.djangoproject.com/en/dev/ref/models/fields/#module-django.db.models.fields.related)
With these two changes, your query (including the loop over the different age_groups) can be changed into a blazing-fast one-hit query giving you the average duration of calls made by each age_group:
age_groups = Demo.objects.values('age_group').annotate(duration_avg=Avg('facts__duration'))
for group in age_groups:
    print "Age group: %s - Average duration: %s" % (group['age_group'], group['duration_avg'])
.values('age_group') selects just the age_group field from the Demo's database table. .annotate(duration_avg=Avg('facts__duration')) takes every unique result from values (thus each unique age_group), and for each unique result will fetch all Fact_table objects related to any Demo object within that age_group, and calculate the average of all the duration fields - all in a single query.

Best database design (model) for user tables

I'm developing a web application using Google App Engine and Django, but I think my problem is more general.
Users have the ability to create tables; note that these tables are not represented as TABLES in the database. Let me give you an example:
First form:
Name of the table: __________
First column name: __________
Second column name: _________
...
The number of columns is not fixed, but there is a maximum (100 for example). The type in every column is the same.
Second form (after choosing a particular table the user can fill the table):
column_name1: _____________
column_name2: _____________
....
I'm using this solution, but it's wrong:
class Table(db.Model):
    name = db.StringProperty(required=True)

class Column(db.Model):
    name = db.StringProperty(required=True)
    number = db.IntegerProperty()
    table = db.ReferenceProperty(Table, collection_name="columns")

class Value(db.Model):
    time = db.TimeProperty()
    column = db.ReferenceProperty(Column, collection_name="values")
When I want to list a table, I take its columns and from every column I take its values:
data = []
for column in table.columns:
    column_data = []
    for value in column.values:
        column_data.append(value.time)
    data.append(column_data)
data = zip(*data)
I think the problem is the order of the values, because it is not guaranteed that the order for one column is the same as for the others. I'm expecting this bug (although until now I have never seen it):
Table as I want:      As I will get:
a z c                 a e c
d e f                 d h f
g h i                 g z i
Better solutions? Maybe using ListProperty?
Here's a data model that might do the trick for you:
class Table(db.Model):
    name = db.StringProperty(required=True)
    owner = db.UserProperty()
    column_names = db.StringListProperty()

class Row(db.Model):
    values = db.ListProperty(yourtype)
    table = db.ReferenceProperty(Table, collection_name='rows')
My reasoning:
You don't really need a separate entity to store column names. Since all columns are of the same data type, you only need to store the name, and the fact that they are stored in a list gives you an implicit order number.
By storing the values in a list in the Row entity, you can use an index into the column_names property to find the matching value in the values property.
By storing all of the values for a row together in a single entity, there is no possibility of values appearing out of their correct order.
Caveat emptor:
This model will not work well if the table can have columns added to it after it has been populated with data. To make that possible, every time a column is added, every existing row belonging to that table would have to have a value appended to its values list. If it were possible to efficiently store dictionaries in the datastore, this would not be a problem, but lists can really only be appended to.
Alternatively, you could use Expando...
Another possibility is to define the Row model as an Expando, which allows you to dynamically create properties on an entity. You could set column values only for the columns that have values in them, which means you could also add columns to the table after it has data in it without breaking anything:
class Row(db.Expando):
    table = db.ReferenceProperty(Table, collection_name='rows')

    @staticmethod
    def __name_for_column_index(index):
        return "column_%d" % index

    def __getitem__(self, key):
        # Allows one to get at the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   col1 = first_row[1]
        #   col12 = first_row[12]
        value = None
        try:
            value = self.__dict__[Row.__name_for_column_index(key)]
        except KeyError:
            # The given column is not defined for this Row
            pass
        return value

    def __setitem__(self, key, value):
        # Allows one to set the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   first_row[5] = "New values for column 5"
        self.__dict__[Row.__name_for_column_index(key)] = value
        # In order to allow efficient multiple column changes,
        # the put() can go somewhere else.
        self.put()
Why don't you add an IntegerProperty to Value for rowNumber and increment it every time you add a new row of values? Then you can reconstruct the table by sorting by rowNumber.
You're going to make life very hard for yourself unless your user's 'tables' are actually stored as real tables in a relational database. Find some way of actually creating tables and use the power of an RDBMS, or you're reinventing a very complex and sophisticated wheel.
This is the conceptual idea I would use:
I would create two classes for the data-store:
table: this would serve as a dictionary, storing the structure of the pseudo-tables your app would create. It would have three fields: table_name, column_name, column_order, where column_order gives the position of the column within the table.
data: this would store the actual data in the pseudo-tables. It would have four fields: row_id, table_name, column_name, column_data. row_id would be the same for data pertaining to the same row and would be unique for data across the various pseudo-tables.
Put the data in a LongBlob.
The power of a database is being able to search and organise data so that you get only the part you want, for performance and simplicity: you don't want the whole database, you just want a part of it, and you want it fast. But from what I understand, when you retrieve a user's data, you retrieve it all and display it. So you don't need to store the data in a normal "database" way.
What I would suggest is to simply format and store the whole data from a single user in a single column with a suitable type (LongBlob for example). The format would be an object with a list of columns and a list of rows of your chosen type, and you define the object in whatever language you use to communicate with the database.
The columns in your (real) database would be: User int, TableNo int, Table LongBlob.
If user 8 has 3 tables, you will have the following rows:
8, 1, objectcontainingtable1;
8, 2, objectcontainingtable2;
8, 3, objectcontainingtable3;

Resources