We read a large file into a database (actually, we convert Excel to CSV and then load the CSV into a DB table). Once this is done, I need to dump the results of several SELECT statements into files.
These files are then distributed by mail.
How can I schedule Camel to start the SELECTs only after the entire file has been loaded into the DB table?
There are probably several ways to accomplish this. Camel will do this by default if the routes with the SELECT statements pick up where the route that loaded the data left off.
In the toy example below, the first route places a flag file in one output directory (in lieu of reading a file into a DB) and then sends the message to two endpoints. From these endpoints, two routes are started that copy the file to two different output directories.
The .multicast() line can be left out; it's not clear to me when it is necessary.
public class DivergentRouteBuilder extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        final String FLAG_URI = "file:data/output?doneFileName=${file:name}.flag";
        final String OUTPUT = "file:data/output";
        final String OUTPUT2 = "file:data/output2";
        final String REPORT1 = "direct:one";
        final String REPORT2 = "direct:two";
        final Route2IsScheduled isScheduled = new Route2IsScheduled();

        from("file:data/input")
            .to("log:?level=INFO&showBody=true&showHeaders=true")
            .to(FLAG_URI)
            .multicast()
            .to(REPORT1, REPORT2);

        from(REPORT1)
            .to(OUTPUT);

        from(REPORT2)
            .choice()
                .when(isScheduled)
                    .to(OUTPUT2)
                .otherwise()
                    .stop()
            .end();
    }
}
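As far as I can tell, the difference is this (a small sketch of my own, not part of the example above): without .multicast() the two endpoints form a pipeline, so whatever comes back from the first endpoint is fed into the second, whereas with .multicast() each endpoint receives its own copy of the original message.

// Pipeline: the message is sent to REPORT1, and the resulting exchange
// is then sent on to REPORT2.
from("direct:pipelined")
    .to(REPORT1)
    .to(REPORT2);

// Multicast: REPORT1 and REPORT2 each receive a copy of the same message.
from("direct:multicasted")
    .multicast()
    .to(REPORT1, REPORT2);

In the toy example the endpoints don't change the message, so leaving .multicast() out makes no visible difference.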
However, when reading the file into the DB and writing the SELECT results run on different schedules, this code gets ugly. Say the data warehouse is updated daily but some report is only required monthly; then the corresponding route has to check whether it is scheduled to run. That is what isScheduled does in the example.
I have similar requirements, so in the end I chose to schedule my routes individually via Quartz. The route that stores the data is scheduled one hour earlier than the routes that read it. The latter have to check whether the data is up to date.
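For illustration, a rough sketch of what this individual scheduling can look like (assuming Camel's quartz2 component; the cron expressions, direct: endpoints and the DataFreshnessCheck bean are placeholders, not code from my project):

public class ScheduledReportRouteBuilder extends RouteBuilder {

    @Override
    public void configure() throws Exception {
        // Load the data warehouse every day at 05:00.
        from("quartz2://dwh/load?cron=0+0+5+*+*+?")
            .to("direct:loadDataIntoDb");

        // Run the SELECTs and mail the report one hour later; the route
        // first verifies that today's load has actually completed.
        from("quartz2://dwh/report?cron=0+0+6+*+*+?")
            .choice()
                .when(method(DataFreshnessCheck.class, "isUpToDate"))
                    .to("direct:runSelectsAndMailReport")
                .otherwise()
                    .log("Data not up to date, skipping report")
            .end();
    }
}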
While integration testing, I am attempting to test the execution of a stored procedure. To do so, I need to perform the following steps:
Insert some setup data
Execute the stored procedure
Find the data in the downstream repo that is written to by the stored proc
I am able to complete all of this successfully; however, after completion of the test only the rows written by the stored procedure are rolled back. The rows inserted via the JdbcAggregateTemplate are not rolled back. Obviously I can delete them manually at the end of the test, but I feel like I must be missing something in my configuration (perhaps in the @Transactional or @Rollback annotations).
@SpringBootTest
@Transactional
@Rollback
@TestInstance(TestInstance.Lifecycle.PER_CLASS)
class JobServiceIntegrationTest @Autowired constructor(
    private val repo: JobExecutorService,
    private val template: JdbcAggregateTemplate,
    private val generatedDataRepo: GeneratedDataRepo,
) {
    @Nested
    inner class ExecuteMyStoredProc {
        @Test
        fun `job is executed`() {
            // arrange
            val supportingData = supportingData()

            // act
            // this data does not get rolled back but I would like it to
            val expected = template.insert(supportingData)

            // this data does get rolled back
            repo.doExecuteMyStoredProc()

            val actual = generatedDataRepo.findAll().first()

            assertEquals(expected.supportingDataId, actual.supportingDataId)
        }
    }

    fun supportingData(): SupportingData {
        ...
    }
}
If this were all done as part of a single physical database transaction, I would expect the inner transactions to be rolled back when the outer transaction rolls back. Obviously that's not what is happening here, but that's the behavior I'm hoping to emulate.
I've written plenty of integration tests and all of them roll back as I expect, but typically I'm just applying some business logic and writing to a database, nothing as involved as this. The only thing unique about this test compared to my other tests is that I'm executing a stored proc (and the stored proc contains transactions).
I'm writing this data to a SQL Server DB, and I'm using Spring JDBC with Kotlin.
Making my comment into an answer since it seemed to have solved the problem:
I suspect the transaction in the SP commits the earlier changes. Could you post the code for a simple SP that causes the problems you describe, so I can play around with it?
My Google App Engine application (Python 3, standard environment) serves requests from users: if the wanted record does not exist in the database, it creates it.
Here is the problem with database overwriting:
When one user (via a browser) sends a request to the database, the running GAE instance may temporarily fail to respond, and a new process is created to handle the request. The result is that two instances respond to the same request. Both instances query the database at almost the same time, each finds that the wanted record does not exist, and each creates a new record. This results in two duplicate records.
Another scenario is that, for some reason, the user's browser sends the request twice within less than 0.01 seconds. The two requests are processed by two instances on the server side, and again duplicate records are created.
I am wondering how one instance can temporarily lock the database to prevent another instance from overwriting it.
I have considered the following schemes but have no idea whether they are efficient:
For Python 2, Google App Engine provides "memcache", which can be used to mark the status of a query for the purpose of database locking. But for Python 3, it seems that one has to set up a Redis server to rapidly exchange database status among different instances. So, how efficient is database locking using Redis?
Using Flask's session module. The session module can be used to share data (in most cases, the login status of users) among different requests and thus different instances. I am wondering how fast the data is exchanged between instances.
Appended information (1)
I followed the advice to use a transaction, but it did not work.
Below is the code I used to verify the transaction.
The reason for the failure may be that the transaction only works for the CURRENT client. For multiple simultaneous requests, the GAE server side creates different processes or instances to respond, and each process or instance has its own independent client.
@staticmethod
def get_test(test_key_id, unique_user_id, course_key_id, make_new=False):
    client = ndb.Client()
    with client.context():
        from google.cloud import datastore
        from datetime import datetime
        client2 = datastore.Client()
        print("transaction started at: ", datetime.utcnow())
        with client2.transaction():
            print("query started at: ", datetime.utcnow())
            my_test = MyTest.query(MyTest.test_key_id == test_key_id, MyTest.unique_user_id == unique_user_id).get()
            import time
            time.sleep(5)
            if make_new and not my_test:
                print("data to create started at: ", datetime.utcnow())
                my_test = MyTest(test_key_id=test_key_id, unique_user_id=unique_user_id, course_key_id=course_key_id, status="")
                my_test.put()
                print("data to created at: ", datetime.utcnow())
        print("transaction ended at: ", datetime.utcnow())
        return my_test
Appended information (2)
Here is new information about the usage of memcache (Python 3).
I have tried the following code to lock the database using memcache, but it still failed to prevent overwriting.
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.gets(cache_key_id)
        if result is None or result == "":
            client.cas(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            print("failed after 500 tries.")
            break
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.cas(cache_key_id, "")
    memcache.delete(cache_key_id)
If the problem is duplication rather than overwriting, maybe you should specify the data ID when creating new entries instead of letting GAE generate a random one for you. Then the application will write to the same entry twice instead of creating two entries. The data ID can be anything unique, such as a session ID, a timestamp, etc.
The problem with a transaction is that it prevents you from modifying the same entry in parallel, but it does not stop you from creating two new entries in parallel.
I used memcache in the following way (using get/set) and succeeded in locking the database writes.
It seems that gets/cas does not work well. In a test, I set the value with cas(), but it then failed to read the value with gets() later.
Memcache API: https://cloud.google.com/appengine/docs/standard/python3/reference/services/bundled/google/appengine/api/memcache
@user_student.route("/run_test/<test_key_id>/<user_key_id>/")
def run_test(test_key_id, user_key_id=0):
    from google.appengine.api import memcache
    import time
    cache_key_id = test_key_id + "_" + user_key_id
    print("cache_key_id", cache_key_id)
    counter = 0
    client = memcache.Client()
    while True:  # Retry loop
        result = client.get(cache_key_id)
        if result is None or result == "":
            client.set(cache_key_id, "LOCKED")
            print("memcache added new value: counter = ", counter)
            break
        time.sleep(0.01)
        counter += 1
        if counter > 500:
            return "failed after 500 tries of memcache checking."
    my_test = MyTest.get_test(int(test_key_id), current_user.unique_user_id, current_user.course_key_id, make_new=True)
    client.delete(cache_key_id)
    ...
Transactions:
https://developers.google.com/appengine/docs/python/datastore/transactions
When two or more transactions simultaneously attempt to modify entities in one or more common entity groups, only the first transaction to commit its changes can succeed; all the others will fail on commit.
You should be updating your values inside a transaction. App Engine's transactions will prevent two updates from overwriting each other as long as your read and write are within a single transaction. Be sure to pay attention to the discussion about entity groups.
You have two options:
Implement your own logic for transaction failures (how many times to retry, etc.)
Instead of writing to the datastore directly, create a task to modify an entity. Run a transaction inside the task. If it fails, App Engine will retry the task until it succeeds.
I have two DBs, Building and Contact, and a stored proc, building_sp, that executes from the Building DB.
building_sp needs to update a table, TblContact, within Contact, and the way I have been referencing it is
[Contact].dbo.[TblContact]
Since the Contact database can be named arbitrarily, I need to remove this dependency.
The options NOT available to me are:
Moving the stored proc logic to code (i.e. a .NET controller, where the Contact DB name could be set in a web service config file).
Storing the Contact DB name in a meta table/row in the Building DB.
Passing a string variable containing the Contact DB name into building_sp.
Any suggestions or help on this will be appreciated, even just rough ideas or partial answers. Thanks.
I would choose option 1: have two data access components, one for Building and one for Contact, and let a .NET component (controller) invoke operations on the DALs within a transaction (see the TransactionScope class).
Why?
Later, you might decide to move those two databases to different machines.
It keeps coupling low.
Later you might have a third DB involved, so you would have to pass two DB names or access many DBs from your SP.
You might need to call other services (e.g. sending a mail), and it is more natural to do that kind of operation in .NET.
It respects the open/closed principle: if the contact update logic changes, there is no need to touch the Building logic, so less impact, less time spent testing, and a lower chance of regressions.
I'll let others add more reasons here...
I ended up storing the Contact DB name in a web.config app setting
<configuration>
  …
  <appSettings>
    …
    <add key="ContactDBName" value="ContactDataBase"/>
  </appSettings>
  …
</configuration>
I then read this into a global static string from Application_Start()
public class WebApiApplication : System.Web.HttpApplication
{
    public static string ContactDBName;

    protected void Application_Start()
    {
        ...
        ReadContactDBNameFromConfig();
    }

    protected void ReadContactDBNameFromConfig()
    {
        System.Configuration.Configuration rootWebConfig = System.Web.Configuration.WebConfigurationManager.OpenWebConfiguration("/WebServiceName");
        System.Configuration.KeyValueConfigurationElement contactDBNameElement = rootWebConfig.AppSettings.Settings["ContactDBName"];
        ContactDBName = contactDBNameElement.Value;
    }
}
I also stored the stored procedure in a string, using the placeholder literal "{0}" for the DB name.
public static string BUILDING_SP =
    @"BEGIN TRANSACTION
        ...
        UPDATE [{0}].[dbo].[TblContact]
        SET param1 = @Param1,
            param2 = @Param2,
            ...
        WHERE record_id = @RecordID
        ...
    COMMIT";
I then use String.Format to fill the placeholder with the DB name:
string sql = String.Format(BUILDING_SP, WebApiApplication.ContactDBName);
This string is then executed via a SqlCommand object.
Pretty new to Camel, but it's good fun so far.
I am using it to process CSV files that are FTP'd to a particular directory, parse them, and then place the data in the relevant MySQL DB tables (all on the same host).
I have a simple CamelContext containing 4 routes, each of which processes a different file type and places the parsed data in its associated table. A number of these files can appear in the FTP directory every 15 minutes (on the clock: 15, 30, 45, 00 minutes).
The CamelContext registry has only one entry, the dataSource (a simple MySQL DataSource provided by org.mariadb.jdbc.MySQLDataSource... we are migrating over to MariaDB).
Each of the 4 routes follows the pattern below, differing only in the "from" file name pattern and the choice of Processor:
from("file:" + FilePath + "?include=(.*)dmm_all(.*)csv.gz&" + fileAction + "&moveFailed=" + FaultyFilePath)
.routeId("myRoute")
.unmarshal().gzip()
.unmarshal().csv()
.split().body()
.process(myProcessor)
.choice()
.when(simple("${header.myResult} != 'header removed'"))
.to("sql:" + insertQuery.trim() + "?dataSource=sqlds")
.end();
Each Processor just removes the file header (column names) and sets the body for insert into the associated MySQL table with the given insert query.
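To illustrate, each Processor does roughly the following (a simplified sketch, not the actual code; the header check and the column-to-parameter mapping are placeholders):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class DmmAllProcessor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        // After .split().body() each exchange carries one parsed csv row.
        List<?> row = exchange.getIn().getBody(List.class);
        if ("FIRST_COLUMN_NAME".equals(row.get(0))) {
            // Header row: flag it so the route's choice() skips the insert.
            exchange.getIn().setHeader("myResult", "header removed");
        } else {
            // Data row: map the csv columns to the parameters expected by the
            // insert query (a named-parameter Map body is assumed here).
            Map<String, Object> params = new HashMap<>();
            params.put("col1", row.get(0));
            params.put("col2", row.get(1));
            exchange.getIn().setBody(params);
            exchange.getIn().setHeader("myResult", "row mapped");
        }
    }
}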
For incoming files it works like a charm.
But when the application/Camel process is not up, the FTP directory can fill up with a lot of files.
Starting the application in this situation is when the performance problems arise.
I get the following exception(s) once a certain number of files is exceeded:
Cause: org.springframework.jdbc.CannotGetJdbcConnectionException:
Could not get JDBC Connection;
nested exception is java.sql.SQLNonTransientConnectionException:
Could not connect to localhost:3306 : Cannot assign requested address
I have tried adding a delay on each route, and it goes smoothly, but painfully slowly.
I am guessing (wildly, I might add) that there is contention between the route threads (if that is how Camel works, with a thread for each route?) and the single DB connection for the dataSource.
Is this the case, and if so, what's the best strategy to solve it (a connection pool? a single shared thread for all the routes? or...)?
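To make the connection-pool idea concrete, something like the following is what I have in mind (just a sketch; commons-dbcp2 and Camel's SimpleRegistry are assumed, and the URL, credentials, pool size and the MyRouteBuilder class name are placeholders):

// uses org.apache.commons.dbcp2.BasicDataSource, org.apache.camel.impl.SimpleRegistry
// and org.apache.camel.impl.DefaultCamelContext
BasicDataSource pooledDs = new BasicDataSource();
pooledDs.setDriverClassName("org.mariadb.jdbc.Driver");
pooledDs.setUrl("jdbc:mariadb://localhost:3306/mydb");
pooledDs.setUsername("user");
pooledDs.setPassword("secret");
pooledDs.setMaxTotal(8); // cap the connections shared by all routes

SimpleRegistry registry = new SimpleRegistry();
registry.put("sqlds", pooledDs); // same name the sql: endpoints already reference
CamelContext context = new DefaultCamelContext(registry);
context.addRoutes(new MyRouteBuilder()); // the 4 routes above
context.start();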
What other strategies are there to improve performance in the Camel scenario I use above?
Very grateful for any input and feedback.
cheers.
I need to aggregate a number of inbound CSV files in memory, resequencing them if necessary, on Mule ESB CE 3.2.1.
How could I implement this kind of logic?
I tried message-chunking-aggregator-router, but it fails on startup because the XSD schema does not allow such a configuration:
<message-chunking-aggregator-router timeout="20000" failOnTimeout="false">
    <expression-message-info-mapping correlationIdExpression="#[header:correlation]"/>
</message-chunking-aggregator-router>
I've also tried attaching my own correlation IDs to the inbound messages and then processing them with a custom-aggregator, but I've found that Mule internally uses a key made up of:
Serializable key = event.getId() + event.getMessage().getCorrelationSequence(); // EventGroup:264
The internal ID is different every time (even if the correlation sequence is correct): this way, Mule does not use only the correlation sequence as I expected, and the same message is processed many times.
Finally, I could write a custom aggregator, but I would prefer to use a more established technique.
Thanks in advance,
Gabriele
UPDATE
I've tried message-chunk-aggregator, but it doesn't fit my requirements, as it allows duplicates.
Let me detail the scenario I need to cover:
Mule polls an SFTP location.
File 1 "FIXEDPREFIX_1_of_2.zip" is detected and kept in memory somewhere (as an open SFTPStream, that's ok).
Some correlation info is maintained for grouping: group, sequence, group size.
File 1 "FIXEDPREFIX_1_of_2.zip" is detected again, but must not be inserted because it would be a duplicate.
File 2 "FIXEDPREFIX_2_of_2.zip" is detected and correctly added.
Once the group size has been reached, Mule routes a MessageCollection with the correct set of messages.
About point 2: I'm lucky enough to be able to extract the info from the filename and put it into the MuleMessage::correlation* properties, so that subsequent components can use them.
I did that, but duplicates are processed all the same.
Thanks again
Gabriele
Here is the right router to use with Mule 3: http://www.mulesoft.org/documentation/display/MULE3USER/Routing+Message+Processors#RoutingMessageProcessors-MessageChunkAggregator