How to "submit" an ad-hoc SQL to Beam on Flink - apache-flink

I'm using Apache Beam with the Flink runner and the Java SDK. It seems that deploying a job to Flink means building an 80-megabyte fat jar that gets uploaded to the Flink job manager.
Is there a way to easily deploy a lightweight SQL query to run as Beam SQL? Maybe have a job deployed that can somehow receive and run ad hoc queries?

If I understand your question correctly, I don't think it's possible at the moment. Right now the Beam SDK always builds a fat jar that implements the pipeline and includes all of the pipeline's dependencies, and it cannot accept lightweight ad-hoc queries.
If you're interested in a more interactive experience in general, you can look at the ongoing efforts to make Beam more interactive, for example:
SQL shell: https://s.apache.org/beam-sql-packaging . This describes a work-in-progress Beam SQL shell, which should allow you to quickly execute small SQL queries locally in a REPL environment, so that you can interactively explore your data and design the pipeline before submitting a long-running job. It does not change how the job gets submitted to Flink (or any other runner), though, so after you submit the long-running job you will likely still have to use the normal job-management tools you currently have to control it.
Python: https://s.apache.org/interactive-beam . This describes an approach for wrapping an existing runner in an interactive wrapper.
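In the meantime, one workaround (it does not avoid the fat jar itself) is to build the jar once and pass the SQL text in as a pipeline option at submission time, so the same jar can serve different ad-hoc queries. Below is a minimal, hedged sketch of that idea using the Beam SQL extension (beam-sdks-java-extensions-sql); the class name, the --query option, and the toy in-memory input are hypothetical, and the runner is still selected the usual way (e.g. --runner=FlinkRunner).

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;

public class AdHocSqlJob {

  /** Hypothetical options interface carrying the ad-hoc query string. */
  public interface AdHocSqlOptions extends PipelineOptions {
    @Description("SQL to run against the input collection")
    String getQuery();
    void setQuery(String query);
  }

  public static void main(String[] args) {
    AdHocSqlOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(AdHocSqlOptions.class);
    Pipeline p = Pipeline.create(options);

    // Toy input; a real job would read from Kafka, files, etc.
    Schema schema = Schema.builder().addStringField("name").addInt32Field("amount").build();
    PCollection<Row> input =
        p.apply(Create.of(
                    Row.withSchema(schema).addValues("a", 1).build(),
                    Row.withSchema(schema).addValues("b", 2).build())
                .withRowSchema(schema));

    // The query text is only known at submission time, e.g. --query="SELECT name, SUM(amount) ...".
    PCollection<Row> result = input.apply(SqlTransform.query(options.getQuery()));

    // A real job would write `result` to a sink here.
    p.run().waitUntilFinish();
  }
}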

Related

What is the recommended way to automate Flink job submission on an AWS EMR cluster during pipeline deployment?

I am new to Flink and EMR cluster deployment. Currently we have a Flink job and we deploy it manually on an AWS EMR cluster via the Flink CLI stop/start-job commands.
I want to automate this process (updating the Flink job jar on every deployment that comes through the pipelines, using savepoints) and need some recommendations on possible approaches that could be explored.
One option is to automate this process via the Flink REST API, which supports all Flink job operations:
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
A sample project that uses the same approach: https://github.com/ing-bank/flink-deployer
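As a rough illustration of what that automation can look like, here is a hedged sketch that drives the two REST calls usually involved (stop the running job with a savepoint, then run the newly uploaded jar) using java.net.http. The endpoint shapes follow the Flink REST API documentation linked above, but verify them against your Flink version; the job-manager address, job id, jar id, and savepoint paths are placeholders.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkRestDeploy {
  public static void main(String[] args) throws Exception {
    String base = "http://flink-jobmanager:8081"; // placeholder job-manager address
    String jobId = args[0];                       // id of the running job to stop
    String jarId = args[1];                       // id returned earlier by POST /jars/upload

    HttpClient client = HttpClient.newHttpClient();

    // 1. Stop the running job and trigger a savepoint.
    HttpRequest stop = HttpRequest.newBuilder()
        .uri(URI.create(base + "/jobs/" + jobId + "/stop"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"targetDirectory\":\"s3://my-bucket/savepoints\",\"drain\":false}"))
        .build();
    System.out.println(client.send(stop, HttpResponse.BodyHandlers.ofString()).body());

    // In a real pipeline you would poll the returned trigger id until the savepoint
    // completes and read its final path from that response.

    // 2. Run the newly uploaded jar, restoring from the savepoint.
    HttpRequest run = HttpRequest.newBuilder()
        .uri(URI.create(base + "/jars/" + jarId + "/run"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(
            "{\"savepointPath\":\"s3://my-bucket/savepoints/savepoint-xyz\"}"))
        .build();
    System.out.println(client.send(run, HttpResponse.BodyHandlers.ofString()).body());
  }
}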

Entity Framework migration strategy for multiple instances

I have a .NET Core app which I'm running on AWS Elastic Container Service (ECS).
- The app runs on two different instances.
- The database is SQL Server.
The app runs the database migrations on startup, which has worked really well. But then I had to migrate a lot of data, which meant the migration took a long time, and this resulted in duplicates of the data being moved.
This happens because both instances first check the database to see whether the migration has been executed, both find that it hasn't, and then both start running the migration, which takes time. Only after it is done is the migration recorded in the database.
How do people solve this?
Possible solutions I and others have thought of:
Start with only one instance of the app, then scale up. This would work, but I would have to manually scale down and up each time there is a migration. (It is possible to do it automatically, but it would take time.)
Wrap long-running migrations in transactions and, at the start, mark the migration as done in the database. Check whether it is in the database before committing the change. If the transaction fails, remove the migration from the database.
Lock the database? (See "EF Core lock the database during migration".) Seems weird.
Make the migration part of the deployment process. This seems to be best practice, but it would mean the build server would need to know the database secrets. I'm not too afraid to give it those, but it would mean maintaining a duplicate set.
What do people out there do? Am I missing some obvious solution?
Thanks
We also used to have our applications perform the migration, but even Microsoft recommends avoiding this in a multi-instance environment:
We recommend production apps should not call Database.Migrate at application startup. Migrate shouldn't be called from an app in server farm. For example, if the app has been cloud deployed with scale-out (multiple instances of the app are running).
Database migration should be done as part of deployment, and in a controlled way.
Like everything, there are different ways to go about solving the problem. Our team is small, so we generate migration scripts through the EF CLI tooling and then run them manually as part of a deployment/maintenance routine. This could of course be automated if your process warrants it.

Automating (DevOps) the deployment of SQL Server Databases with Large Datasets

I'm in the middle of a DevOps project automating the deployment of apps with varying code stacks and DBs, using a variety of DevOps tools. I am seeking advice on automating an MS SQL DB deployment and its subsequent updates.
The current approach is to build the DB VM from a VM blueprint using Terraform or Cloudify. I currently have a VM with MS SQL Server configured and can script SQL files against it to instantiate my DB. However, once I get to scripting the raw data, I often run out of memory because of its size. I know I can manually increase the memory in the server properties or run a SQL file from the CLI, and I also know of tools like DTS and BCP, but what I am really looking for is advice on the best methods for automatically deploying an MS SQL DB via a DevOps pipeline. The intention is to use Jenkins and deploy scripts via PowerShell.
My initial thought was to require DB owners to provide a .bak/.mdf file, with subsequent updates delivered as scripts. But I'd really appreciate counsel on the best approaches used in industry, especially if you have done this at large scale, e.g. thousands of apps.
If it helps, the approach I take needs to work both for small DBs of around 100 MB and for larger ones, say up to 1-2 TB.
Another approach (as of early 2018) is to use SQL Server database cloning, which restores a full byte copy of a set of databases into a Windows Virtual Hard Disk (VHD) and then supports delivery of clones (differencing disks) as mountable replicas. Clones can be mounted to conventional SQL Server instances or containers; check out the cloning support from Redgate and Windocks.
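For the .bak route mentioned in the question, the heavy lifting is a single T-SQL RESTORE statement, which a Jenkins stage can issue through PowerShell (e.g. Invoke-Sqlcmd) or any other SQL client. Purely as an illustration, here is a hedged sketch of that step issued over JDBC; the host, credentials, file paths, and logical file names are placeholders that depend on the backup being restored.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RestoreDatabaseStep {
  public static void main(String[] args) throws Exception {
    // Connect to the master database of the target instance (placeholder host/credentials).
    String url = "jdbc:sqlserver://dbhost:1433;databaseName=master";

    // RESTORE is plain T-SQL; the MOVE targets and logical names depend on the supplied .bak file.
    String restoreSql =
        "RESTORE DATABASE [AppDb] "
      + "FROM DISK = N'D:\\backups\\AppDb.bak' "
      + "WITH MOVE 'AppDb' TO N'D:\\data\\AppDb.mdf', "
      + "MOVE 'AppDb_log' TO N'D:\\data\\AppDb_log.ldf', "
      + "REPLACE, STATS = 10";  // STATS reports progress, which helps for multi-GB restores

    try (Connection conn = DriverManager.getConnection(url, "deploy_user", "secret");
         Statement stmt = conn.createStatement()) {
      stmt.execute(restoreSql);  // long-running for large backups; keep client timeouts generous
    }
  }
}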

Keep Derby Network Server running after build

We are using a dependency to start a Derby network server.
How can I start the Derby network server so that it keeps running? Right now the server stops after the build.
As described in my answer here, you can use Derby as your database via the derby-maven-plugin which I wrote and is available on GitHub and via Maven Central.
See here for details on how to use it. This will basically remove the need for you to start Derby through your tests and it will keep it up and running while the tests are executing. In combination with the sql-maven-plugin, you could have a reasonably decent testing environment.
To further clarify: the server does not keep running after the build has finished. However, under target/derby you can find your database, which you can open yourself if you need to investigate the produced data.
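If you do need a server that outlives the build, you can also start the network server yourself against that same target/derby directory. A hedged sketch (the port is Derby's default, the path is a placeholder, and the derby/derbynet jars must be on the classpath):

import java.io.PrintWriter;
import java.net.InetAddress;
import org.apache.derby.drda.NetworkServerControl;

public class StandaloneDerbyServer {
  public static void main(String[] args) throws Exception {
    // Point Derby at the databases produced by the build (placeholder path).
    System.setProperty("derby.system.home", "target/derby");

    // Start the network server on Derby's default port.
    NetworkServerControl server =
        new NetworkServerControl(InetAddress.getByName("localhost"), 1527);
    server.start(new PrintWriter(System.out, true));

    server.ping();                   // throws if the server is not up yet (a short retry loop is more robust)
    Thread.currentThread().join();   // keep the JVM, and thus the server, alive
  }
}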

Import data from HDFS to SQL Server or export data from HDFS to SQL Server

I have been trying to figure out the best approach for moving data from HDFS to SQL Server:
Do I import data from Cloudera Hadoop using the Sqoop Hadoop Connector for SQL Server 2008 R2, or
do I export data from Cloudera Hadoop into SQL Server using Sqoop?
I am sure that both are possible, based on the links I have read through:
http://www.cloudera.com/blog/2011/10/apache-sqoop-overview/
http://www.microsoft.com/en-in/download/details.aspx?id=27584
But when I look for possible issues that could arise at the configuration and maintenance level, I don't find proper answers.
I strongly feel that I should go for import, but I am not comfortable troubleshooting and maintaining the issues that could come up every now and then.
Can someone share their thoughts on which would be the better choice?
Both of your options use the same mechanism: Apache Sqoop's export utility. Using the licensed Microsoft connector/driver jar can be expected to yield better performance for the task than the generic connector offered by Apache Sqoop.
In terms of maintenance, there should be none once you have it working. As long as the version of SQL Server in use is supported by the driver jar, it should continue to work as expected.
In terms of configuration, you may initially have to tune the -m value manually to find the right degree of parallelism for the export MapReduce job launched by the export tool. Too high a value will cause problems on the DB side, while too low a value will not give you ideal performance. Some trial and error is required to arrive at the right -m value, along with knowledge of your DB's load periods, in order to set the parallelism correctly.
The Apache Sqoop (v1) documentation for the export tool also lists a set of common reasons for export-job failure; you may want to review those.
On the MapReduce side, you may also want to dedicate a defined scheduler pool or queue to such external-writing jobs, since they may be business critical; schedulers such as the FairScheduler and CapacityScheduler let you define SLA guarantees per pool or queue so that the jobs get adequate resources when they're launched.
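To make those knobs concrete, here is a hedged sketch of an export invocation. It calls Sqoop's programmatic entry point from Java, but the same arguments apply on a plain sqoop export command line; the connection string, credentials, table, and HDFS path are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class HdfsToSqlServerExport {
  public static void main(String[] args) {
    String[] sqoopArgs = {
        "export",
        "--connect", "jdbc:sqlserver://dbhost:1433;databaseName=warehouse",  // placeholder
        "--username", "etl_user",
        "--password", "secret",                       // prefer --password-file or -P in practice
        "--table", "SALES_FACT",                      // target SQL Server table (placeholder)
        "--export-dir", "/data/warehouse/sales_fact", // HDFS directory to export (placeholder)
        "--fields-terminated-by", "\t",
        "-m", "4"                                     // parallelism: start low, tune while watching DB load
    };
    int exitCode = Sqoop.runTool(sqoopArgs, new Configuration());
    System.exit(exitCode);
  }
}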
