Flink interprets a mapped Row as a single RAW - apache-flink

I'm able to sink a static Row to a database:
DataStream<Row> staticRows = environment.fromElements(Row.of("value1"), Row.of("value2"));
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(environment); // convert to table API
Table inputTable = tableEnv.fromDataStream(staticRows);
tableEnv.executeSql(myDDLAndSinkProperties);
inputTable.executeInsert("MYTABLE");
But mapping an unbounded stream to a Row like this:
DataStream<Row> kafkaRows = kafkaEvents.map(new MyKafkaRecordToRowMapper());
throws a schema-mismatch error between the query and the sink when the database insert is attempted:
Query schema: [f0: RAW('org.apache.flink.types.Row', '...')]
The same code works for a POJO and a Tuple, but I have more than 25 columns and the POJO doesn't serve any other purpose, so I'm hoping it can be replaced by a general-purpose sequence of fields (which Row claims to be).
How do I use Row as input to a database sink? The available examples only show it used with static data streams and for database output.

I think this will work better if you change it to something like this (after adjusting the column names and types, of course):
DataStream<Row> kafkaRows = kafkaEvents
    .map(new MyKafkaRecordToRowMapper())
    .returns(Types.ROW_NAMED(
        new String[] {"id", "quota", "ts", ...},
        Types.STRING,
        Types.LONG,
        TypeInformation.of(Instant.class),
        ...));
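For reference, here is a minimal sketch of what such a mapper might look like (MyKafkaRecord and its getters are hypothetical; the real extraction logic depends on your record type):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.types.Row;

// Hypothetical mapper: builds a positional Row whose fields line up
// with the names/types declared in Types.ROW_NAMED above.
public class MyKafkaRecordToRowMapper implements MapFunction<MyKafkaRecord, Row> {
    @Override
    public Row map(MyKafkaRecord record) {
        return Row.of(
                record.getId(),        // "id"    -> Types.STRING
                record.getQuota(),     // "quota" -> Types.LONG
                record.getTimestamp()  // "ts"    -> Instant
        );
    }
}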

Related

How to attach schema to a Flink DataStream - on the fly?

I am dealing with a stream of database mutations, i.e., a change log stream. I want to be able to transform the values using a SQL query.
I am having difficulty putting together the following three concepts: RowTypeInfo, Row, and DataStream.
NOTE: I don't know the schema beforehand. I construct it on the fly using the data within the Mutation object (Mutation is a custom type).
More specifically I have code that looks like this.
val execEnv = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(execEnv)
// Mutation is a custom type
val mutationStream: DataStream[Mutation] = ...
// toRows returns an object of type org.apache.flink.types.Row
val rowStream:DataStream[Row] = mutationStream.flatMap({mutation => toRows(mutation)})
tableEnv.registerDataStream("spinal_tap_table", rowStream)
tableEnv.sql("select col1 + 2")
NOTE: the Row object is positional and doesn't have a placeholder for column names.
I couldn't find a place to attach the schema to the DataStream object.
I want to pass some sort of a struct similar to Row that contains the complete information {columnName: String, columnValue: Object, columnType: TypeInformation[_]} for the query.
In Flink SQL, a table schema is mandatory when the Table is defined. It is not possible to run queries on dynamically typed records.
Regarding the concepts of RowTypeInfo, Row and DataStream:
Row is the actual record that holds the data
RowTypeInfo is a schema description for Rows. It contains names and TypeInformation for each field of a Row.
DataStream is a logical stream of records. A DataStream[Row] is a stream of rows. Note that this is not the actual stream itself, just an API concept for representing a stream.
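To make this concrete, here is a sketch of attaching a RowTypeInfo when the DataStream[Row] is created (the two-column schema is hypothetical; it assumes the schema can be computed from the Mutation metadata before the job is submitted, and that toRows returns a collection of Rows):

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.streaming.api.scala._
import org.apache.flink.types.Row

// Hypothetical schema, built once before the job graph is created.
val rowTypeInfo = new RowTypeInfo(
  Array[TypeInformation[_]](Types.STRING, Types.LONG),
  Array("col1", "col2"))

// The Scala DataStream API takes the result TypeInformation as an
// implicit parameter; passing rowTypeInfo explicitly attaches the
// schema, so the registered table gets named, typed columns.
val rowStream: DataStream[Row] =
  mutationStream.flatMap({ mutation => toRows(mutation) })(rowTypeInfo)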

OrientDB - find "orphaned" binary records

I have some images stored in the default cluster of my OrientDB database. I stored them by following the documentation's code for using multiple ORecordBytes (for large content): http://orientdb.com/docs/2.1/Binary-Data.html
So I have two types of objects in my default cluster: binary records, and ODocuments whose 'data' field lists the different binary records.
Some of the ODocument records' RIDs are referenced by other classes, but the other records are orphaned and I would like to be able to retrieve them.
My idea was to use
select from cluster:default where #rid not in (select myField from MyClass)
But the problem is that this also returns the binary records themselves, and I only want the documents with the 'data' field.
In addition, I would prefer a prettier query, because I don't think the "not in" clause is really something that should be encouraged. Is there something like a JOIN which returns records that are not joined to anything?
Can you help me please?
To resolve my problem, I did the following. However, I don't know if it is the right (most optimized) way to do it:
I used the following SQL request:
SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []
In Java, I execute it using the OCommandSQL/OCommandRequest pair and retrieve an OrientDynaElementIterable. I just iterate over it to retrieve an OrientVertex, contained in another OrientVertex, from which I read the RID of the orphan.
Now, here is some code if it can help someone, assuming that you have an OrientGraphNoTx or an OrientGraph for the 'graph' variable :)
String cmd = "SELECT rid FROM (FIND REFERENCES (SELECT FROM CLUSTER:default)) WHERE referredBy = []";
List<String> orphanedRid = new ArrayList<String>();
OCommandRequest request = graph.command(new OCommandSQL(cmd));
OrientDynaElementIterable objects = request.execute();
Iterator<Object> iterator = objects.iterator();
while (iterator.hasNext()) {
    OrientVertex obj = (OrientVertex) iterator.next();
    // The 'rid' property of each result row points at the orphaned record.
    OrientVertex orphan = obj.getProperty("rid");
    orphanedRid.add(orphan.getIdentity().toString());
}
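For context, a minimal sketch of wiring this up end to end (the database URL is a placeholder):

import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

OrientGraphNoTx graph = new OrientGraphNoTx("plocal:/path/to/db"); // placeholder URL
try {
    // ... run the FIND REFERENCES snippet above ...
} finally {
    graph.shutdown(); // always release the graph instance
}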

store strings of arbitrary length in Postgresql

I have a Spring application which uses JPA (Hibernate), initially created with Spring Roo. I need to store Strings of arbitrary length, so for that reason I've annotated the field with @Lob:
public class MyEntity {

    @NotNull
    @Size(min = 2)
    @Lob
    private String message;
    ...
}
The application works OK on localhost, but I've deployed it to an external server and a problem with encoding has appeared. For that reason I'd like to check whether the data stored in the PostgreSQL database is OK or not. The application creates/updates the tables automatically, and for that field (message) it has created a column of type:
text NOT NULL
The problem is that after storing data, if I browse the table or just SELECT that column, I can't see the text but numbers. Those numbers seem to be identifiers to "somewhere" where that information is stored.
Can anyone tell me exactly what these identifiers are, and whether there is any way to see the data stored in a @Lob column from pgAdmin or a SELECT clause?
Is there any better way to store Strings of arbitrary length in JPA?
Thanks.
I would recommend skipping the @Lob annotation and using columnDefinition like this:
@Column(columnDefinition="TEXT")
See if that helps with viewing the data while browsing the database itself.
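Applied to the entity from the question, that would look something like this (a sketch; the validation annotations are kept from the original):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Size;

@Entity
public class MyEntity {

    @NotNull
    @Size(min = 2)
    @Column(columnDefinition = "TEXT") // plain text column, readable in pgAdmin
    private String message;
}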
Use the @Lob definition; it is correct. The table stores an OID that points into the pg_largeobject system catalog table. The binary data is stored there efficiently, and JPA will correctly get the data out and store it for you, with this as an implementation detail.
Old question, but here is what I found when I encountered this:
http://www.solewing.org/blog/2015/08/hibernate-postgresql-and-lob-string/
Relevant parts below.
@Entity
@Table(name = "note")
@Access(AccessType.FIELD)
class NoteEntity {

    @Id
    private Long id;

    @Lob
    @Column(name = "note_text")
    private String noteText;

    public NoteEntity() { }

    public NoteEntity(String noteText) { this.noteText = noteText; }
}
The Hibernate PostgreSQL9Dialect stores @Lob String attribute values by explicitly creating a large object instance, and then storing the UID of that object in the column associated with the attribute. So the text of our notes isn't really in the column; if we use some PostgreSQL large object functions, we can retrieve the text itself.
Use this to query:
SELECT id,
convert_from(loread(
lo_open(note_text::int, x'40000'::int), x'40000'::int), 'UTF-8')
AS note_text
FROM note

How to select out of a one to many property

I have an App Engine app which has been running for about a year now. I have mainly been using JDO queries until now, but I am trying to collect stats and the queries are taking too long. I have the following entity (Device):
public class Device implements Serializable {
    ...
    @Persistent
    private Set<Key> feeds; // Keys of the Feed entities
    ...
}
So I want to get a count of how many Devices have a certain Feed. I was doing it in JDOQL before as such (uses javax.jdo.Query):
Query query = pm.newQuery("select from Device where feeds.contains(:feedKey)");
Map<String, Object> paramsf = new HashMap<String, Object>();
paramsf.put("feedKey",feed.getId());
List<Device> results = (List<Device>) query.executeWithMap(paramsf);
This code times out now, though. I was trying to use the Datastore API so I could set the chunk size, etc., to see if I could speed the query up or use a cursor, but I am unsure how to search in a Set field. I was trying this (uses com.google.appengine.api.datastore.Query):
Query query = new Query("Device");
query.addFilter("feeds", FilterOperator.IN, feed.getId());
query.setKeysOnly();
final FetchOptions fetchOptions = FetchOptions.Builder.withPrefetchSize(100).chunkSize(100);
QueryResultList<Entity> results = dss.prepare(query).asQueryResultList(fetchOptions);
Essentially I am unsure how to search the one-to-many field (feeds) for a single key. Is it possible to index it somehow?
Hope it makes sense.
The items of lists (and of other things that are implemented as lists, like sets) are indexed individually. As a result, you can simply use an equality filter in your query, the same as if you were filtering on a single value rather than a list. A record will be returned if any of the items in the list match.
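Applied to the question's code, a sketch might look like this (using an equality filter and a keys-only count, since only the count is needed):

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

DatastoreService dss = DatastoreServiceFactory.getDatastoreService();

// An equality filter on a multi-valued property matches an entity
// if ANY of its list items equals the given value.
Query query = new Query("Device");
query.addFilter("feeds", Query.FilterOperator.EQUAL, feed.getId());
query.setKeysOnly();

int count = dss.prepare(query)
               .countEntities(FetchOptions.Builder.withChunkSize(100));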

Best database design (model) for user tables

I'm developing a web application using Google App Engine and Django, but I think my problem is more general.
Users have the ability to create tables, but note: these tables are not represented as TABLEs in the database. I'll give you an example:
First form:
Name of the table: __________
First column name: __________
Second column name: _________
...
The number of columns is not fixed, but there is a maximum (100, for example). The type of every column is the same.
Second form (after choosing a particular table, the user can fill it in):
column_name1: _____________
column_name2: _____________
....
I'm using this solution, but it's wrong:
class Table(db.Model):
    name = db.StringProperty(required=True)

class Column(db.Model):
    name = db.StringProperty(required=True)
    number = db.IntegerProperty()
    table = db.ReferenceProperty(Table, collection_name="columns")

class Value(db.Model):
    time = db.TimeProperty()
    column = db.ReferenceProperty(Column, collection_name="values")
When I want to list a table, I take its columns, and from every column I take its values:
data = []
for column in table.columns:
    column_data = []
    for value in column.values:
        column_data.append(value.time)
    data.append(column_data)
data = zip(*data)
I think the problem is the order of the values, because it is not guaranteed that the order for one column is the same as for the others. I'm expecting this bug (though I haven't seen it yet):
Table as I want:    As I might get:
a z c               a e c
d e f               d h f
g h i               g z i
Better solutions? Maybe using ListProperty?
Here's a data model that might do the trick for you:
class Table(db.Model):
    name = db.StringProperty(required=True)
    owner = db.UserProperty()
    column_names = db.StringListProperty()

class Row(db.Model):
    values = db.ListProperty(yourtype)
    table = db.ReferenceProperty(Table, collection_name='rows')
My reasoning:
You don't really need a separate entity to store column names. Since all columns are of the same data type, you only need to store the name, and the fact that they are stored in a list gives you an implicit order number.
By storing the values in a list in the Row entity, you can use an index into the column_names property to find the matching value in the values property.
By storing all of the values for a row together in a single entity, there is no possibility of values appearing out of their correct order.
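For illustration, reading a table back with this model might look like the following sketch (it assumes table names are unique):

# Fetch a table by name, then pair each row's values with the
# column names by position -- no cross-entity ordering issues.
table = Table.all().filter('name =', 'my_table').get()
for row in table.rows:
    for col_name, value in zip(table.column_names, row.values):
        print col_name, value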
Caveat emptor:
This model will not work well if the table can have columns added to it after it has been populated with data. To make that possible, every time a column is added, every existing row belonging to that table would have to have a value appended to its values list. If it were possible to efficiently store dictionaries in the datastore, this would not be a problem, but lists can really only be appended to.
Alternatively, you could use Expando...
Another possibility is to define the Row model as an Expando, which allows you to dynamically create properties on an entity. You would set column values only for the columns that have values in them, and you could also add columns to the table after it has data in it without breaking anything:
class Row(db.Expando):
    table = db.ReferenceProperty(Table, collection_name='rows')

    @staticmethod
    def __name_for_column_index(index):
        return "column_%d" % index

    def __getitem__(self, key):
        # Allows one to get at the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   col1 = first_row[1]
        #   col12 = first_row[12]
        value = None
        try:
            value = self.__dict__[Row.__name_for_column_index(key)]
        except KeyError:
            # The given column is not defined for this Row
            pass
        return value

    def __setitem__(self, key, value):
        # Allows one to set the columns of Row entities with
        # subscript syntax:
        #   first_row = Row.get()
        #   first_row[5] = "New value for column 5"
        self.__dict__[Row.__name_for_column_index(key)] = value
        # In order to allow efficient multiple column changes,
        # the put() can go somewhere else.
        self.put()
Why not add an IntegerProperty to Value for rowNumber, increment it every time you add a new row of values, and then reconstruct the table by sorting on rowNumber?
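A sketch of that suggestion on top of the question's model (row_number is the added property, and 'column' is a Column entity as in the question):

class Value(db.Model):
    time = db.TimeProperty()
    row_number = db.IntegerProperty()  # position of this value's row
    column = db.ReferenceProperty(Column, collection_name='values')

# Rebuild one column in row order:
ordered_values = column.values.order('row_number')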
You're going to make life very hard for yourself unless your users' 'tables' are actually stored as real tables in a relational database. Find some way of actually creating tables and use the power of an RDBMS, or you're reinventing a very complex and sophisticated wheel.
This is the conceptual idea I would use:
I would create two classes for the datastore:
table: this would serve as a dictionary, storing the structure of the pseudo-tables your app would create. It would have three fields: table_name, column_name, column_order, where column_order gives the position of the column within the table.
data: this would store the actual data in the pseudo-tables. It would have four fields: row_id, table_name, column_name, column_data. row_id would be the same for data pertaining to the same row and would be unique for data across the various pseudo-tables.
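As a sketch (the property names follow the description above; TableStructure and TableData are hypothetical names for the 'table' and 'data' classes):

class TableStructure(db.Model):
    table_name = db.StringProperty()
    column_name = db.StringProperty()
    column_order = db.IntegerProperty()  # position within the table

class TableData(db.Model):
    row_id = db.StringProperty()    # shared by all values of one row
    table_name = db.StringProperty()
    column_name = db.StringProperty()
    column_data = db.StringProperty()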
Put the data in a LongBlob.
The power of a database is being able to search and organize data so that you get only the part you want, for performance and simplicity: you don't want the whole database, you just want a part of it, and you want it fast. But from what I understand, when you retrieve a user's data, you retrieve it all and display it. So you don't need to store the data in a normal "database" way.
What I would suggest is to simply format and store the whole data from a single user in a single column with a suitable type (LongBlob, for example). The format would be an object holding the list of columns and the rows of values, and you define that object in whatever language you use to communicate with the database.
The columns in your (real) database would be: User int, TableNo int, Table LongBlob.
If user 8 has 3 tables, you will have the following rows:
8, 1, objectcontainingtable1;
8, 2, objectcontainingtable2;
8, 3, objectcontainingtable3;
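A sketch of the corresponding MySQL-style schema (user_tables is a placeholder name; Table is a reserved word, hence the backticks):

CREATE TABLE user_tables (
    `User`    INT,
    `TableNo` INT,
    `Table`   LONGBLOB  -- serialized table object
);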
