Neo4j store is not cleanly shut down; Recovering from inconsistent db state from interrupted batch insertion - database

I was importing ttl ontologies to dbpedia following the blog post http://michaelbloggs.blogspot.de/2013/05/importing-ttl-turtle-ontologies-in-neo4j.html. The post uses BatchInserters to speed up the task. It mentions
Batch insertion is not transactional. If something goes wrong and you don't shutDown() your database properly, the database becomes inconsistent.
I had to interrupt one of the batch insertion tasks because it was taking much longer than expected, which left my database in an inconsistent state. I get the following message:
db_name store is not cleanly shut down
How can I recover my database from this state? Also, for future imports, is there a way to commit after importing each file so that reverting to the last good state would be trivial? I thought of git, but I am not sure it would help with a binary file like index.db.

There are some cases where you cannot recover from an unclean shutdown when using the batch inserter API. Note that its package name, org.neo4j.unsafe.batchinsert, contains the word "unsafe" for a reason: the batch inserter is designed to run as fast as possible, not to be safe.
If you want to guarantee a clean shutdown, you should use try/finally:
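BatchInserter batch = BatchInserters.inserter(<dir>);
try {
    // do some operations potentially throwing exceptions
} finally {
    // runs whether or not the operations above threw
    batch.shutdown();
}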
Another alternative for special cases is registering a JVM shutdown hook. See the following snippet as an example:
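// must be final so the anonymous Thread subclass can capture it
final BatchInserter batch = BatchInserters.inserter(<dir>);
// register the hook first so it also covers failures in the operations below
Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
        batch.shutdown();
    }
});
// do some operations potentially throwing exceptions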
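To see why the try/finally form matters, here is a minimal, self-contained sketch. FakeInserter is a hypothetical stand-in for BatchInserter (it is not part of Neo4j) so the example runs without the library; only the try/finally shape is meant to carry over to real code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CleanShutdownSketch {
    // Hypothetical stand-in for a BatchInserter: records whether shutdown() ran.
    static class FakeInserter {
        final List<String> inserted = new ArrayList<>();
        boolean shutDown = false;

        void insert(String item) {
            if (item.isEmpty()) {
                throw new IllegalArgumentException("empty item");
            }
            inserted.add(item);
        }

        void shutdown() {
            shutDown = true;
        }
    }

    // Runs the import and always shuts the inserter down, even on failure.
    // Returns true if every item was inserted.
    static boolean importAll(FakeInserter inserter, List<String> items) {
        try {
            for (String item : items) {
                inserter.insert(item);
            }
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        } finally {
            inserter.shutdown();
        }
    }

    public static void main(String[] args) {
        FakeInserter inserter = new FakeInserter();
        // The second item is invalid, so the import fails part-way through.
        boolean ok = importAll(inserter, Arrays.asList("a", "", "c"));
        System.out.println("import ok: " + ok + ", shut down: " + inserter.shutDown);
        // prints "import ok: false, shut down: true"
    }
}
```

The failing item aborts the loop, but the finally block still runs, which is the property you want around a real batch insertion. It does not help if the JVM itself is killed, which is where the shutdown hook comes in.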


Flink Running out of Memory

I have some fairly simple stream code that aggregates data via time windows. The windows are on the large side (1 hour, with a 2 hour bound), and the values in the streams are metrics coming from hundreds of servers. I keep running out of memory, so I added the RocksDBStateBackend. This caused the JVM to segfault. Next I tried the FsStateBackend. Neither of these backends ever wrote any data to disk; they simply created a directory with the JobID. I'm running this code in standalone mode, not deployed. Any thoughts as to why the state backends aren't writing data, and why it runs out of memory even when provided with 8GB of heap?
final SingleOutputStreamOperator<Metric> metricStream =
    objectStream.map(node -> new Metric(node.get("_ts").asLong(), node.get("_value").asDouble(), node.get("tags"))).name("metric stream");
final WindowedStream<Metric, String, TimeWindow> hourlyMetricStream = metricStream
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Metric>(Time.hours(2)) { // set how long metrics can come late
        @Override
        public long extractTimestamp(final Metric metric) {
            return metric.get_ts() * 1000; // needs to be in ms since Java epoch
        }
    })
    .keyBy(metric -> metric.getMetricName()) // key the stream so we can run the windowing in parallel
    .timeWindow(Time.hours(1)); // setup the time window for the bucket
// create a stream for each type of aggregation
hourlyMetricStream.sum("_value") // we want to sum by the _value
    .addSink(new MetricStoreSinkFunction(parameters, "sum"))
    .name("hourly sum stream")
    .setParallelism(6);
hourlyMetricStream.aggregate(new MeanAggregator())
    .addSink(new MetricStoreSinkFunction(parameters, "mean"))
    .name("hourly mean stream")
    .setParallelism(6);
hourlyMetricStream.aggregate(new ReMedianAggregator())
    .addSink(new MetricStoreSinkFunction(parameters, "remedian"))
    .name("hourly remedian stream")
    .setParallelism(6);
env.execute("flink test");
It is tough to say why you would run out of memory unless you have a very large number of metric names (that is the only explanation I can come up with based on the code you posted).
With respect to the disk writing, RocksDB by default uses a temporary directory for its actual database files. You can also pass an explicit directory during configuration by calling state.setDbStoragePath(someDirectory).
Somewhat confusingly, the FsStateBackend only writes to disk during checkpointing; otherwise it is entirely heap based. So you likely did not see anything in the directory because you did not have checkpointing enabled. That would also explain why you still run out of memory when the FsStateBackend is used.
Assuming you do have the RocksDB (or any) state backend working, you can enable checkpointing by doing:
env.enableCheckpointing(5000); // value is in ms, so however frequently you want to checkpoint
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000); // helps ensure your job keeps making progress if checkpointing takes a while; for large state, checkpointing can take multiple seconds

Spring Batch FlatFileItemWriter does not write data to a file

I am new to Spring Batch. I am trying to use FlatFileItemWriter to write data to a file. The problem is that the application creates the file at the given path but does not write the actual content into it.
Following are details related to code:
List<String> dataFileList : This list contains the data that I want to write to a file
FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
writer.setResource(new FileSystemResource("C:\\Desktop\\test"));
writer.open(new ExecutionContext());
writer.setLineAggregator(new PassThroughLineAggregator<>());
writer.setAppendAllowed(true);
writer.write(dataFileList);
writer.close();
This is just generating the file at proper place but contents are not getting written into the file.
Am I missing something? Help is highly appreciated.
Thanks!
This is not the proper way to use a Spring Batch writer to write data. You need to declare the writer as a bean first.
Define Job Bean
Define Step Bean
Use your Writer bean in Step
Have a look at following examples:
https://github.com/pkainulainen/spring-batch-examples/blob/master/spring-boot/src/main/java/net/petrikainulainen/springbatch/csv/in/CsvFileToDatabaseJobConfig.java
https://spring.io/guides/gs/batch-processing/
You probably need to force a sync to disk. From the docs at https://docs.spring.io/spring-batch/trunk/apidocs/org/springframework/batch/item/file/FlatFileItemWriter.html,
setForceSync
public void setForceSync(boolean forceSync)
Flag to indicate that changes should be force-synced to disk on flush. Defaults to false, which means that even with a local disk changes could be lost if the OS crashes in between a write and a cache flush. Setting to true may result in slower performance for usage patterns involving many frequent writes.
Parameters:
forceSync - the flag value to set

Bro: Disable ALL log generation

I created a Bro script to extract all files for all possible protocols from a pcap file, but I don't want to write all the logs. Bro creates a log file for each protocol, for example 'http.log', 'smtp.log', etc. Even a 'weird.log' is generated. My pcap files are large (20 GB), so each log file contains over 30 MB of information. This log generation reduces the performance of the file extraction.
I can disable the 'conn.log' with the line Log::disable_stream(Conn::LOG) but, what about all protocol logging??
This is my script
@load base/files/extract
event bro_init()
    {
    Log::disable_stream(Conn::LOG);
    }
event file_sniff(f: fa_file, meta: fa_metadata)
    {
    local ext = "";
    if ( meta?$mime_type )
        ext = split_string(meta$mime_type, /\//)[1];
    local fname = fmt("%s-%s.%s", f$source, f$id, ext);
    Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
    }
You can use the none writer like this:
bro -r packets.pcap Log::default_writer=Log::WRITER_NONE
I'm not totally convinced that writing these logs is harming your performance in any real way though. Typically, writing the files to disk is what causes the biggest overhead.
Here's a way to turn off whatever logging's been turned on (prior to bro_init), without having to know which stream IDs are relevant:
event bro_init()
    {
    # We don't want any output other than from this script.
    for (id in Log::active_streams)
        Log::disable_stream(id);
    }
This construct makes me twitch a little about modifying a table while iterating over it, but it seems to work and I can't actually find any way to peek at one key from a table without doing an iteration. I suppose one could write
event bro_init()
    {
    while (|Log::active_streams|) {
        for (id in Log::active_streams) {
            Log::disable_stream(id);
            break;
        }
    }
    }
but that's hideous and I'm not going to use it unless I discover that I have to.
I achieved this with this line of code in main.bro:
Log::remove_filter(Conn::LOG, "default");

Connections with Entity Framework and Transient Fault Handling Block?

We're migrating SQL to Azure. Our DAL is Entity Framework 4.x based. We want to use the Transient Fault Handling Block to add retry logic for SQL Azure.
Overall, we're looking for the best 80/20 rule (or maybe more of a 95/5, but you get the point): we're not looking to spend weeks refactoring or rewriting code (there's a LOT of it). I'm fine re-implementing our DAL's framework, but I don't want to touch the code written and generated against it any more than we have to, since this is only here to address a minority case. Mitigation >>> elimination of this edge case for us.
Looking at the possible options explained here at MSDN, it seems Case #3 there is the "quickest" to implement, but only at first glance. Upon pondering this solution a bit, it struck me that we might have problems with connection management, since this circumvents Entity Framework's built-in processes for managing connections (i.e. always closing them). It seems to me that the "solution" is to make sure 100% of the Contexts we instantiate are wrapped in using blocks, but with our architecture this would be difficult.
So my question: Going with Case #3 from that link, are hanging connections a problem or is there some magic somewhere that's going on that I don't know about?
I've done some experimenting and it turns out that this brings us back to the old "managing connections" situation we're used to from the past, only this time the connections are abstracted away from us a bit and we must now "manage Contexts" similarly.
Let's say we have the following OnContextCreated implementation:
private void OnContextCreated()
{
    const int maxRetries = 4;
    const int initialDelayInMilliseconds = 100;
    const int maxDelayInMilliseconds = 5000;
    const int deltaBackoffInMilliseconds = initialDelayInMilliseconds;
    var policy = new RetryPolicy<SqlAzureTransientErrorDetectionStrategy>(maxRetries,
        TimeSpan.FromMilliseconds(initialDelayInMilliseconds),
        TimeSpan.FromMilliseconds(maxDelayInMilliseconds),
        TimeSpan.FromMilliseconds(deltaBackoffInMilliseconds));
    policy.ExecuteAction(() =>
        {
            try
            {
                Connection.Open();
                var storeConnection = (SqlConnection) ((EntityConnection) Connection).StoreConnection;
                new SqlCommand("declare @i int", storeConnection).ExecuteNonQuery();
                //Connection.Close();
                // throw new ApplicationException("Test only");
            }
            catch (Exception e)
            {
                Connection.Close();
                Trace.TraceWarning("Attempted to open connection but failed: " + e.Message);
                throw;
            }
        }
    );
}
In this scenario, we forcibly open the Connection (which was the goal here). Because of this, the Context keeps it open across many calls. Because of that, we must tell the Context when to close the connection. Our primary mechanism for doing that is calling the Dispose method on the Context. So if we just allow garbage collection to clean up our contexts, then we allow connections to remain hanging open.
I tested this by toggling the comments on the Connection.Close() in the try block and running a bunch of unit tests against our database. Without calling Close, we jumped up to ~275-300 active connections (from SQL Server's perspective). By calling Close, that number hovered at ~12. I then reproduced with a small number of unit tests both with and without a using block for the Context and reproduced the same result (different numbers - I forget what they were).
I was using the following query to count my connections:
SELECT s.session_id, s.login_name, e.connection_id,
s.last_request_end_time, s.cpu_time,
e.connect_time
FROM sys.dm_exec_sessions AS s
INNER JOIN sys.dm_exec_connections AS e
ON s.session_id = e.session_id
WHERE login_name='myuser'
ORDER BY s.login_name
Conclusion: If you call Connection.Open() with this work-around to enable the Transient Fault Handling Block, then you MUST use using blocks for all contexts you work with; otherwise you will leave connections hanging open (which, with SQL Azure, will cause your database to be "throttled" and ultimately taken offline for hours!).
The problem with this approach is it only takes care of connection retries and not command retries.
If you use Entity Framework 6 (currently in alpha) then there is some new in-built support for transient retries with Azure SQL Database (with a little bit of configuration): http://entityframework.codeplex.com/wikipage?title=Connection%20Resiliency%20Spec
I've created a library which allows you to configure Entity Framework to retry using the Fault Handling block without needing to change every database call - generally you will only need to change your config file and possibly one or two lines of code.
This allows you to use it for Entity Framework or Linq To Sql.
https://github.com/robdmoore/ReliableDbProvider

Transactions and Locking with Appengine

I have code similar to the following, and I'm trying to figure out how transaction locking works with it:
DAOT.repeatInTransaction(new Transactable() {
    @Override
    public void run(DAOT daot)
    {
        Points points = daot.ofy().find(Points.class, POINTS_ID);
        // do something with points
        takes_a_very_long_time_delay(); // perhaps 10 secs
        daot.ofy().put(points);
    }
});
The code above is executed from within a Java servlet, and the operation is expected to run for about 10 seconds. During that time, my test invokes another servlet that deletes the Points entity. I was expecting the delete operation to fail, or at least to delete the entity only after the transaction above had finished.
However, the entity was deleted while the code above was still executing. In my real application, I added exception handling that throws an exception when trying to access or edit an entity that does not exist.
With that in place, the application throws an "Entity not found" exception right after I run the servlet that deletes the entity used in the code above.
Although I am already using GAE transactions, I think I am still missing something, which is why my test fails.
Code for the delete transaction from within the delete servlet:
DAOT.repeatInTransaction(new Transactable() {
    @Override
    public void run(DAOT daot)
    {
        Points points = daot.ofy().find(Points.class, POINTS_ID);
        daot.ofy().delete(points);
    }
});
How can I ensure that a new operation, such as a delete, waits until the transaction currently operating on the entity has finished?
App Engine uses optimistic concurrency, not locking. That is, a transaction on a group of entities will not prevent other processes from modifying those entities while the transaction runs. Instead, when the transaction attempts to commit, it checks whether any modifications were made while it was executing; if so, it discards its changes and runs your function again from the beginning.
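The read, work, commit-or-retry cycle can be sketched in plain Java. This illustrates optimistic concurrency in general, not the datastore's implementation: Versioned, runInTransaction and the AtomicReference store are hypothetical names, and a compare-and-set stands in for the commit-time check.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntUnaryOperator;

public class OptimisticRetrySketch {
    // Immutable snapshot of an entity: a version counter plus its data.
    static final class Versioned {
        final int version;
        final int points;

        Versioned(int version, int points) {
            this.version = version;
            this.points = points;
        }
    }

    // Reads a snapshot, does the work, and commits only if nobody else wrote
    // in the meantime; otherwise it starts over. Returns the number of attempts.
    static int runInTransaction(AtomicReference<Versioned> store, IntUnaryOperator work) {
        int attempts = 0;
        while (true) {
            attempts++;
            Versioned before = store.get();
            int newPoints = work.applyAsInt(before.points); // may take a long time
            Versioned after = new Versioned(before.version + 1, newPoints);
            // Commit check: succeeds only if the store still holds our snapshot.
            if (store.compareAndSet(before, after)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        AtomicReference<Versioned> store = new AtomicReference<>(new Versioned(0, 10));
        boolean[] interfered = {false};
        int attempts = runInTransaction(store, points -> {
            if (!interfered[0]) {
                // Simulate another process writing while the first attempt is running.
                interfered[0] = true;
                Versioned current = store.get();
                store.set(new Versioned(current.version + 1, 100));
            }
            return points + 1;
        });
        Versioned result = store.get();
        System.out.println("attempts=" + attempts + " points=" + result.points + " version=" + result.version);
        // prints "attempts=2 points=101 version=2"
    }
}
```

Nothing blocks the interfering write; the first attempt simply loses at commit time and the work function runs again on fresh data. In the question's scenario the same pattern would likely explain the "Entity not found" error: the delete commits first, the long-running transaction is retried, and the retry no longer finds the entity.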
I assume you use Objectify to work with the datastore.
First, make sure daot.ofy() returns an Objectify instance with an explicit transaction set (ObjectifyFactory.beginTransaction()) rather than one from ObjectifyFactory.begin(). Then make sure you use the same Objectify instance for both the find() and delete() calls (as well as for find()/put() pairs).
