Apache Storm - Accessing database from SPOUT - connection pooling - database

I have a spout that, on each tick, queries a PostgreSQL database and reads one additional row. The spout code looks as follows:
class RawDataLevelSpout extends BaseRichSpout implements Serializable {
private int counter;
SpoutOutputCollector collector;
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("col1", "col2"));
}
@Override
public void open(Map map, TopologyContext context, SpoutOutputCollector spoutOutputCollector) {
collector = spoutOutputCollector;
}
private Connection initializeDatabaseConnection() {
try {
Class.forName("org.postgresql.Driver");
Connection connection = null;
connection = DriverManager.getConnection(
DATABASE_URI,"root", "root");
return connection;
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
return null;
}
@Override
public void close() {
}
@Override
public void nextTuple() {
List<String> values = new ArrayList<>();
PreparedStatement statement = null;
try {
Connection connection = initializeDatabaseConnection();
statement = connection.prepareStatement("SELECT * FROM table1 ORDER BY col1 LIMIT 1 OFFSET ?");
statement.setInt(1, counter++);
ResultSet resultSet = statement.executeQuery();
resultSet.next();
ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
int totalColumns = resultSetMetaData.getColumnCount();
for (int i = 1; i <= totalColumns; i++) {
String value = resultSet.getString(i);
values.add(value);
}
connection.close();
} catch (SQLException e) {
e.printStackTrace();
}
collector.emit(new Values(values.stream().toArray(String[]::new)));
}
}
What is the standard way to approach connection pooling in spouts in Apache Storm? Furthermore, is it possible to somehow synchronize the counter variable across multiple running instances within the cluster topology?

Regarding connection pooling, you could pool connections via a static variable if you wanted, but since you aren't guaranteed to have all spout instances running in the same JVM, I don't think there's any point.
No, there is no way to synchronize the counter. The spout instances may be running on different JVMs, and you don't want them all blocking while the spouts agree what the counter value is. I don't think your spout implementation makes sense though. If you wanted to just read one row at a time, why would you not just run a single spout instance instead of trying to synchronize multiple spouts?
You seem to be trying to use your relational database as a queue system, which is probably a bad fit. Consider e.g. Kafka instead. I think you should be able to use either one of https://www.confluent.io/product/connectors/ or http://debezium.io/ to stream data from your Postgres to Kafka.
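To make the point about static variables concrete, here is a minimal sketch of what sharing through a static field amounts to. The names SharedPool and FakeSpout are hypothetical, and a plain Object stands in for a real connection pool so that the sketch runs without Storm or a database. Instances in the same JVM see the same object; a spout scheduled into another worker JVM would get its own copy, so the sharing is only ever per JVM.

```java
import java.util.function.Supplier;

// Hypothetical per-JVM holder: the static field is shared by every instance
// loaded in this JVM, and by nothing outside it.
final class SharedPool {
    private static Object pool; // stand-in for a real pool, e.g. a DataSource

    private SharedPool() {}

    // Creates the pool on first use; synchronized so two spouts opening at
    // the same time cannot create two pools.
    static synchronized Object get(Supplier<Object> factory) {
        if (pool == null) {
            pool = factory.get();
        }
        return pool;
    }
}

// Stand-in for a spout; a real one would call SharedPool.get(...) in open().
final class FakeSpout {
    private Object pool;

    void open() {
        pool = SharedPool.get(Object::new);
    }

    Object pool() {
        return pool;
    }
}

class SharedPoolDemo {
    public static void main(String[] args) {
        FakeSpout a = new FakeSpout();
        FakeSpout b = new FakeSpout();
        a.open();
        b.open();
        // Both spouts live in this JVM, so they hold the same pool object.
        System.out.println(a.pool() == b.pool());
    }
}
```

Nothing in this sketch reaches across JVMs, which is the limitation the answer describes.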

Related

Skip message in Kafka deserialization schema if any problems occur

I have a simple Apache Flink job that ends with a Kafka sink. I'm using a KafkaRecordSerializationSchema<CustomType> to handle the message from the previous (RichFlatMap) operator:
public final class CustomTypeSerializationSchema implements KafkaRecordSerializationSchema<CustomType> {
private static final long serialVersionUID = 5743933755381724692L;
private final String topic;
public CustomTypeSerializationSchema(final String topic) {
this.topic = topic;
}
@Override
public ProducerRecord<byte[], byte[]> serialize(final CustomType input, final KafkaSinkContext context,
final Long timestamp) {
final var result = new CustomMessage(input);
try {
return new ProducerRecord<>(topic,
JacksonJsonMapper.writeValueAsString(result).getBytes(StandardCharsets.UTF_8));
} catch (final Exception e) {
logger.warn("Unable to serialize message [{}]. This was the reason:", result, e);
}
return new ProducerRecord<>(topic, new byte[0]);
}
}
The problem I'm trying to avoid is sending an "empty" ProducerRecord, like the one that is returned by default if something goes wrong inside the try block. Basically, I'm looking for a behavior similar to KafkaRecordDeserializationSchema, where what's put in the collector is what's going to be received in subsequent operators, and the rest is discarded.
Is there a way to achieve this with another *SerializationSchema type?

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to do this by generating data externally and writing to Kafka, or by listening to a public source; however, I am trying to solve it with minimal dependencies (like starting with GenerateFlowFile in NiFi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public class IncrementMapFunction implements MapFunction<Long, Long> {
@Override
public Long map(Long record) throws Exception {
return record + 1 ;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
@Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
@Override
public void cancel() {
cancelled = true;
}
}
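The source above emits values as fast as it can. For the one-value-per-second stream the question asks for, the loop would need a pause between emissions. The sketch below shows that loop shape in plain Java, without Flink, so it runs on its own. ThrottledGenerator and its Consumer<Long> parameter are hypothetical stand-ins for the source and for ctx.collect, and the interval is an assumption. In a real source the pause would presumably sit outside the block that holds the checkpoint lock.

```java
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical stand-in for a source: emits one random long per interval
// until cancel() is called, mirroring the run()/cancel() pair above.
final class ThrottledGenerator {
    private final long intervalMillis;
    private final Random random = new Random();
    private volatile boolean cancelled = false;

    ThrottledGenerator(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Blocks the calling thread until cancelled or interrupted.
    void run(Consumer<Long> out) {
        while (!cancelled) {
            out.accept(random.nextLong());
            try {
                Thread.sleep(intervalMillis); // the pause that sets the emission rate
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
        }
    }

    void cancel() {
        cancelled = true;
    }
}

class ThrottledGeneratorDemo {
    public static void main(String[] args) throws InterruptedException {
        ThrottledGenerator generator = new ThrottledGenerator(1000);
        Thread worker = new Thread(() -> generator.run(System.out::println));
        worker.start();
        Thread.sleep(3500); // let a few values through
        generator.cancel();
        worker.join();
    }
}
```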

How to log SqlDataAdapter exceptions asynchronously?

I have an ASP.NET MVC application and I'm using a SqlDataAdapter for handling SQL Server calls. The method below returns nothing when an exception occurs, so I don't have any information when a SQL Server command fails.
public int ExecuteNonQuery(string pSql, List<SqlParameter> prm)
{
SqlCommand cmd = new SqlCommand(pSql, sqlConnection);
int result = 0;
if (sqlConnection.State == ConnectionState.Closed)
{
sqlConnection.Open();
}
try
{
cmd.Parameters.AddRange(prm.ToArray());
result = cmd.ExecuteNonQuery();
return result;
}
catch (Exception ex)
{
return 0;
}
finally
{
sqlConnection.Close();
}
}
How can I log exceptions asynchronously without disrupting the user experience? Additionally, I want to log to the database, so if the exception is a timeout, the attempt to log it may itself time out.
There is no SqlDataAdapter code in your sample. That being said, there are three DataAdapter events that you can use to respond to changes made to data at the data source.
RowUpdating
RowUpdated
FillError
There is also a Status property to determine if an error has occurred during the operation and, if desired, to control the actions against the current and resulting rows.
You can asynchronously log exceptions in those events when there is an error.
Examples from MSDN:
protected static void OnRowUpdated(
object sender, SqlRowUpdatedEventArgs args)
{
if (args.Status == UpdateStatus.ErrorsOccurred)
{
args.Row.RowError = args.Errors.Message;
args.Status = UpdateStatus.SkipCurrentRow;
Task.Run(() => LogError());
}
}
protected static void FillError(object sender, FillErrorEventArgs args)
{
if (args.Errors.GetType() == typeof(System.OverflowException))
{
// Code to handle precision loss.
//Add a row to table using the values from the first two columns.
DataRow myRow = args.DataTable.Rows.Add(new object[]
{args.Values[0], args.Values[1], DBNull.Value});
//Set the RowError containing the value for the third column.
args.RowError =
"OverflowException Encountered. Value from data source: " +
args.Values[2];
args.Continue = true;
Task.Run(() => LogError());
}
}
private static void LogError() // static so the static handlers above can call it
{
// logging code
}

ChannelFactory method call increases memory

I have a WinForms application that consumes a Windows service, and I use ChannelFactory to connect to the service. The problem is that when I call a service method through the channel, memory usage increases, and it does not go down after the method finishes (even after the form is closed). Calling GC.Collect makes no difference.
Channel creation class:
public class Channel1
{
List<ChannelFactory> chanelList = new List<ChannelFactory>();
ISales salesObj;
public ISales Sales
{
get
{
if (salesObj == null)
{
ChannelFactory<ISales> saleschannel = new ChannelFactory<ISales>("SalesEndPoint");
chanelList.Add(saleschannel);
salesObj = saleschannel.CreateChannel();
}
return salesObj;
}
}
public void CloseAllChannels()
{
foreach (ChannelFactory chFac in chanelList)
{
chFac.Abort();
((IDisposable)chFac).Dispose();
}
salesObj = null;
}
}
base class
public class Base:Form
{
public Channel1 channelService = new Channel1();
public Channel1 CHANNEL
{
get
{
return channelService;
}
}
}
winform class
Form1:Base
private void btnView_Click(object sender, EventArgs e)
{
DataTable _dt = new DataTable();
try
{
gvAccounts.AutoGenerateColumns = false;
_dt = CHANNEL.Sales.GetDatatable();
gvAccounts.DataSource = _dt;
}
catch (Exception ex)
{
MessageBox.Show("Error Occurred while processing...\n" + ex.Message, "Warning", MessageBoxButtons.OK, MessageBoxIcon.Warning);
}
finally
{
CHANNEL.CloseAllChannels();
_dt.Dispose();
//GC.Collect();
}
}
You're on the right track in terms of using ChannelFactory<T>, but your implementation is a bit off.
ChannelFactory<T> creates a factory for generating channels of type T. This is a relatively expensive operation (as compared to just creating a channel from the existing factory), and is generally done once per life of the application (usually at start). You can then use that factory instance to create as many channels as your application needs.
Generally, once I've created the factory and cached it, when I need to make a call to the service I get a channel from the factory, make the call, and then close/abort the channel.
Using your posted code as a starting point, I would do something like this:
public class Channel1
{
ChannelFactory<ISales> salesChannel;
public ISales Sales
{
get
{
if (salesChannel == null)
{
salesChannel = new ChannelFactory<ISales>("SalesEndPoint");
}
return salesChannel.CreateChannel();
}
}
}
Note that I've replaced the salesObj with salesChannel (the factory). This will create the factory the first time it's called, and create a new channel from the factory every time.
Unless you have a particular requirement to do so, I wouldn't keep track of the different channels, especially if you follow the open/do method/close approach.
In your form, it'd look something like this:
private void btnView_Click(object sender, EventArgs e)
{
DataTable _dt = new DataTable();
// Declared outside the try block so the catch block can reach it.
ISales client = null;
try
{
gvAccounts.AutoGenerateColumns = false;
client = CHANNEL.Sales;
_dt = client.GetDatatable();
gvAccounts.DataSource = _dt;
((ICommunicationObject)client).Close();
}
catch (Exception ex)
{
// client is still null if creating the channel itself failed.
if (client != null)
{
((ICommunicationObject)client).Abort();
}
MessageBox.Show("Error Occurred while processing...\n" + ex.Message, "Warning", MessageBoxButtons.OK, MessageBoxIcon.Warning);
}
}
The code above gets a new ISales channel from the factory in CHANNEL, executes the call, and then closes the channel. If an exception happens, the channel is aborted in the catch block.
I would avoid using Dispose() out of the box on the channels, as the implementation in the framework is flawed and will throw an error if the channel is in a faulted state. If you really want to use Dispose() and force the garbage collection, you can - but you'll have to work around the WCF dispose issue. Google will give you a number of workarounds (google WCF Using for a start).

Session-Per-Request with SqlConnection / System.Transactions

I've just started using Dapper for a project, having mostly used ORMs like NHibernate and EF for the past few years.
Typically in our web applications we implement session per request, beginning a transaction at the start of the request and committing it at the end.
Should we do something similar when working directly with SqlConnection / System.Transactions?
How does StackOverflow do it?
Solution
Taking the advice of both @gbn and @Sam Saffron, I'm not using transactions. In my case I'm only doing read queries, so it seems there is no real requirement to use transactions (contrary to what I've been told about implicit transactions).
I create a lightweight session interface so that I can use a connection per request. This is quite beneficial to me as with Dapper I often need to create a few different queries to build up an object and would rather share the same connection.
The work of scoping the connection per request and disposing it is done by my IoC container (StructureMap):
public interface ISession : IDisposable {
IDbConnection Connection { get; }
}
public class DbSession : ISession {
private static readonly object @lock = new object();
private readonly ILogger logger;
private readonly string connectionString;
private IDbConnection cn;
public DbSession(string connectionString, ILogger logger) {
this.connectionString = connectionString;
this.logger = logger;
}
public IDbConnection Connection { get { return GetConnection(); } }
private IDbConnection GetConnection() {
if (cn == null) {
lock (@lock) {
if (cn == null) {
logger.Debug("Creating Connection");
cn = new SqlConnection(connectionString);
cn.Open();
logger.Debug("Opened Connection");
}
}
}
return cn;
}
public void Dispose() {
if (cn != null) {
logger.Debug("Disposing connection (current state '{0}')", cn.State);
cn.Dispose();
}
}
}
This is what we do:
We define a static called DB on an object called Current
public static DBContext DB
{
get
{
var result = GetContextItem<DBContext>(itemKey);
if (result == null)
{
result = InstantiateDB();
SetContextItem(itemKey, result);
}
return result;
}
}
public static T GetContextItem<T>(string itemKey, bool strict = true)
{
#if DEBUG // HttpContext is null for unit test calls, which are only done in DEBUG
if (Context == null)
{
var result = CallContext.GetData(itemKey);
return result != null ? (T)result : default(T);
}
else
{
#endif
var ctx = HttpContext.Current;
if (ctx == null)
{
if (strict) throw new InvalidOperationException("GetContextItem without a context");
return default(T);
}
else
{
var result = ctx.Items[itemKey];
return result != null ? (T)result : default(T);
}
#if DEBUG
}
#endif
}
public static void SetContextItem(string itemKey, object item)
{
#if DEBUG // HttpContext is null for unit test calls, which are only done in DEBUG
if (Context == null)
{
CallContext.SetData(itemKey, item);
}
else
{
#endif
HttpContext.Current.Items[itemKey] = item;
#if DEBUG
}
#endif
}
In our case InstantiateDB returns an L2S context; in your case it could be an open SqlConnection or whatever.
On our application object we ensure that our connection is closed at the end of the request.
protected void Application_EndRequest(object sender, EventArgs e)
{
Current.DisposeDB(); // closes connection, clears context
}
Then anywhere in your code where you need access to the db you simply call Current.DB and stuff automatically works. This is also unit test friendly due to all the #if DEBUG stuff.
We do not start a transaction per session. If we did, and had updates at the beginning of the session, we would get serious locking issues, because the locks would not be released until the end.
You'd only start a SQL Server transaction when you need one, for example with TransactionScope around a "write" call to the database.
See a random example in this recent question: Why is a nested transaction committed even if TransactionScope.Complete() is never called?
You would not open a connection and start a transaction per HTTP request; you would do it only on demand. I'm having difficulty understanding why some folk advocate opening a database transaction per session: sheer idiocy when you look at what a database transaction is.
Note: I'm not against the pattern per se. I am against unnecessary, overly long, client-side database transactions that invoke MSDTC.
