I am using a REST API that is supposed to import data from CSV files. The uploading and mapping-to-object part works, but saveAll() does not: it takes ages to save ~130,000 rows to the database (which runs on an MS SQL Server), and it should handle much bigger files in less time.
This is what my data class looks like:
@Entity
data class Street(
@Id
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "streetSeq")
@SequenceGenerator(name = "streetSeq", sequenceName = "streetSeq", allocationSize = 1)
val id: Int,
val name: String?,
val municipalityId: Int?
)
And I am just using the saveAll() method. Everything in the import method runs relatively fast (about 10 seconds) until the saveAll():
override fun import(file: MultipartFile) {
val inputStream = file.inputStream
var import: List<Street> = listOf()
tsvReader.open(inputStream) {
val csvContents = readAllWithHeaderAsSequence()
val dataClasses = grass<ImportStreet>().harvest(csvContents)
dataClasses.forEach { row ->
import = import + toStreet(row)
}
println("Data Converted")
}
streetRepository.saveAll(import)
inputStream.close()
}
I have already tried to adjust the application.yml, but it does not make a big difference.
jpa:
properties:
hibernate:
ddl-auto: update
dialect: org.hibernate.dialect.SQLServer2012Dialect
generate_statistics: true
order_inserts: true
order_updates: true
jdbc:
batch_size: 1000
As you can see in the implementation of SimpleJpaRepository, it does not do any kind of batched save; it just saves each entity one by one, which is why it is slow:
https://github.com/spring-projects/spring-data-jpa/blob/d35ee1a82bf0fdf2de2724a02619eea1cf3c98bd/src/main/java/org/springframework/data/jpa/repository/support/SimpleJpaRepository.java#L584
Assert.notNull(entities, "Entities must not be null!");
List<S> result = new ArrayList<S>();
for (S entity : entities) {
result.add(save(entity));
}
return result;
So try to implement the batch save without Spring Data JPA, for example using Spring Batch.
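Another option, if you want to stay with plain JPA instead of Spring Batch, is to write through the EntityManager and flush/clear in chunks so Hibernate can actually apply the configured jdbc.batch_size. This is only a sketch under a few assumptions: the class name StreetBatchWriter is made up, the chunk size of 1000 simply mirrors the batch_size above, and Street is the entity from the question.

import javax.persistence.EntityManager
import javax.persistence.PersistenceContext
import org.springframework.stereotype.Service
import org.springframework.transaction.annotation.Transactional

@Service
class StreetBatchWriter {

    @PersistenceContext
    private lateinit var entityManager: EntityManager

    // Persist in chunks and clear the persistence context in between,
    // so the session does not keep ~130k managed entities in memory.
    @Transactional
    fun saveInBatches(streets: List<Street>, chunkSize: Int = 1000) {
        streets.forEachIndexed { index, street ->
            entityManager.persist(street)
            if ((index + 1) % chunkSize == 0) {
                entityManager.flush()
                entityManager.clear()
            }
        }
        entityManager.flush()
        entityManager.clear()
    }
}

Note that with allocationSize = 1 Hibernate still makes one round trip to the sequence per row, so raising the allocation size (and the sequence's INCREMENT BY to match) is usually part of the same change.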
As the other comments/answers suggested, I tried to use Spring Batch, but it did not work out for me (mostly because I did not know how to pull it off).
After trying out more things I found JdbcTemplate, which worked out perfectly for me; the inserts are now much faster than with saveAll().
override fun import(file: MultipartFile) {
val inputStream = file.inputStream
var import: List<Street> = listOf()
tsvReader.open(inputStream) {
val csvContents = readAllWithHeaderAsSequence()
val dataClasses = grass<ImportStreet>().harvest(csvContents)
dataClasses.forEach { row ->
import = import + toStreet(row)
}
println("Data Converted")
}
batchInsert(import)
inputStream.close()
}
The batchInsert method uses the jdbcTemplate.batchUpdate() function:
fun batchInsert(streets: List<Street>): IntArray? {
return jdbcTemplate.batchUpdate(
"INSERT INTO street (name, municipalityId) VALUES (?, ?)",
object: BatchPreparedStatementSetter {
@Throws(SQLException::class)
override fun setValues(ps: PreparedStatement, i: Int) {
ps.setString(1, streets[i].name)
streets[i].municipalityId?.let { ps.setInt(2, it) } ?: ps.setNull(2, java.sql.Types.INTEGER)
}
override fun getBatchSize(): Int {
return streets.size
}
})
}
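A small follow-up on the same idea: JdbcTemplate also has a batchUpdate overload that takes the collection together with an explicit batch size, so the driver sends the rows in fixed-size chunks instead of one huge batch. A minimal sketch against the same table (the function name batchInsertChunked and the chunk size of 1000 are just examples, and it assumes the same injected jdbcTemplate):

fun batchInsertChunked(streets: List<Street>, chunkSize: Int = 1000): Array<IntArray> =
    jdbcTemplate.batchUpdate(
        "INSERT INTO street (name, municipalityId) VALUES (?, ?)",
        streets,
        chunkSize
    ) { ps, street ->
        ps.setString(1, street.name)
        // Same NULL handling as above for the nullable municipalityId.
        street.municipalityId?.let { ps.setInt(2, it) } ?: ps.setNull(2, java.sql.Types.INTEGER)
    }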
A Flink streaming job was developed with a filter that performs deduplication based on the id of the event, using key-value state backed by the RocksDB state backend.
Application Code
env.setStateBackend(new RocksDBStateBackend(checkpoint, true).asInstanceOf[StateBackend])
val stream = env
.addSource(kafkaConsumer)
.keyBy(_.id)
.filter(new Deduplication[Stream]("stream-dedup", Time.days(30))).uid("stream-filter")
Deduplication Code
class Deduplication[T](stateDescriptor: String, time: Time) extends RichFilterFunction[T] {
val ttlConfig: StateTtlConfig = StateTtlConfig
.newBuilder(time)
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupFullSnapshot
.build
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
deduplicationStateDescriptor.enableTimeToLive(ttlConfig)
lazy val deduplicationState: ValueState[Boolean] = getRuntimeContext.getState(deduplicationStateDescriptor)
override def filter(value: T): Boolean = {
if (deduplicationState.value) {
false
} else {
deduplicationState.update(true)
true
}
}
}
All of this works just fine. My goal with this question is to understand how I can read all of this state using the State Processor API. So I started to write some code based on the available documentation.
Savepoint Reading Code
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val savepoint = Savepoint
.load(env, savepointPath, new RocksDBStateBackend("file:/tmp/rocksdb", true))
savepoint
.readKeyedState("stream-filter", new DeduplicationStateReader("stream-dedup")).print()
Reader Function Code
class DeduplicationStateReader(stateDescriptor: String) extends KeyedStateReaderFunction[String, String] {
var state: ValueState[Boolean] = _
override def open(parameters: Configuration): Unit = {
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
state = getRuntimeContext.getState(deduplicationStateDescriptor)
}
override def readKey(key: String, ctx: KeyedStateReaderFunction.Context, out: Collector[String]): Unit = {
out.collect("IT IS WORKING")
}
}
Whenever I try to read the state, I get a serialization error.
Is there anything wrong? Did I misunderstand all of this?
I am trying to run transactions against a PostgreSQL database (the table exists) using the Exposed framework for Kotlin, but an error prevents me from doing so. The error appears on the line SchemaUtils.create(tableTest).
Source code:
import org.jetbrains.exposed.dao.id.IntIdTable
import org.jetbrains.exposed.sql.*
import org.jetbrains.exposed.sql.transactions.transaction
fun main(args: Array<String>) {
val db = Database.connect("jdbc:postgresql://localhost:5432/testBase", driver = "org.postgresql.Driver", user = "user", password = "123")
println("Database name: ${db.name}")
transaction {
addLogger(StdOutSqlLogger)
SchemaUtils.create(tableTest)
println("People: ${tableTest.selectAll()}")
}
}
object tableTest: Table() {
val id = integer("id")
val name = text("name")
val surname = text("surname")
val height = integer("height")
val phone = text("phone")
override val primaryKey = PrimaryKey(id)
}
The error:
Exception in thread "main" java.lang.ExceptionInInitializerError
at MainKt$main$1.invoke(main.kt:12)
at MainKt$main$1.invoke(main.kt)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt$inTopLevelTransaction$1.invoke(ThreadLocalTransactionManager.kt:170)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt$inTopLevelTransaction$2.invoke(ThreadLocalTransactionManager.kt:211)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.keepAndRestoreTransactionRefAfterRun(ThreadLocalTransactionManager.kt:219)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.inTopLevelTransaction(ThreadLocalTransactionManager.kt:210)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt$transaction$1.invoke(ThreadLocalTransactionManager.kt:148)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.keepAndRestoreTransactionRefAfterRun(ThreadLocalTransactionManager.kt:219)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:120)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction(ThreadLocalTransactionManager.kt:118)
at org.jetbrains.exposed.sql.transactions.ThreadLocalTransactionManagerKt.transaction$default(ThreadLocalTransactionManager.kt:117)
at MainKt.main(main.kt:10)
Caused by: java.lang.IllegalStateException: javaClass.`package` must not be null
at org.jetbrains.exposed.sql.Table.<init>(Table.kt:306)
at org.jetbrains.exposed.sql.Table.<init>(Table.kt:303)
at tableTest.<init>(main.kt:30)
at tableTest.<clinit>(main.kt:30)
... 12 more
build.gradle.kts:
import org.jetbrains.kotlin.gradle.tasks.KotlinCompile
plugins {
kotlin("jvm") version "1.4.0"
application
}
group = "me.amd"
version = "1.0-SNAPSHOT"
repositories {
mavenCentral()
jcenter()
}
dependencies {
testImplementation(kotlin("test-junit"))
implementation("org.jetbrains.exposed", "exposed-core", "0.26.2")
implementation("org.jetbrains.exposed", "exposed-dao", "0.26.2")
implementation("org.jetbrains.exposed", "exposed-jdbc", "0.26.2")
implementation("org.postgresql:postgresql:42.2.16")
implementation("org.slf4j", "slf4j-api", "1.7.25")
implementation("org.slf4j", "slf4j-simple", "1.7.25")
implementation("org.xerial:sqlite-jdbc:3.30.1")
}
tasks.withType<KotlinCompile>() {
kotlinOptions.jvmTarget = "1.8"
}
application {
mainClassName = "MainKt"
}
I tried doing it like this:
transaction {
addLogger(StdOutSqlLogger)
val schema = Schema("tableTest", authorization = "postgres", password = "123456")
SchemaUtils.setSchema(schema)
println("People: ${tableTest.selectAll()}")
}
but the error moved to the line println("People: ${tableTest.selectAll()}").
I also tried sending queries to SQLite — everything behaves the same way.
How can I fix this error and still send a query to the database? I hope for your help!
Add a package statement above your import statements. Furthermore, put your main method in a class.
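For illustration, a minimal sketch of the first suggestion applied to the question's main.kt (the package name me.amd.example is only an example; the point is that tableTest no longer sits in the default package, where javaClass.`package` is null, as the stack trace shows, and Table's constructor fails):

package me.amd.example

import org.jetbrains.exposed.sql.*
import org.jetbrains.exposed.sql.transactions.transaction

object tableTest : Table() {
    val id = integer("id")
    val name = text("name")
    val surname = text("surname")
    val height = integer("height")
    val phone = text("phone")
    override val primaryKey = PrimaryKey(id)
}

fun main() {
    val db = Database.connect(
        "jdbc:postgresql://localhost:5432/testBase",
        driver = "org.postgresql.Driver",
        user = "user",
        password = "123"
    )
    println("Database name: ${db.name}")
    transaction {
        addLogger(StdOutSqlLogger)
        SchemaUtils.create(tableTest)
        println("People: ${tableTest.selectAll()}")
    }
}

If you run it through Gradle, remember that mainClassName then becomes "me.amd.example.MainKt".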
When using the Floor database, can we write a check that returns a boolean, like the example below?
Future<bool> isItAdded(int userId) async {
var dbClient = await db;
List<Map> list = await dbClient.rawQuery('SELECT * FROM Users WHERE user_id = ?', [userId]);
return list.length > 0;
}
You can write a DAO object:
@dao
abstract class UsersDao {
@Query('SELECT * FROM users WHERE user_id = :userId')
Future<List<User>> findUsers(int userId);
}
Before that you need to create the entity:
@Entity(tableName: 'users')
class User {
@PrimaryKey(autoGenerate: true)
final int id;
@ColumnInfo(name: 'user_id')
final int userId;
User(this.id, this.userId);
}
Also you need to create the database class:
part 'database.g.dart'; // the generated code will be there
@Database(version: 1, entities: [User])
abstract class AppDatabase extends FloorDatabase {
UsersDao get usersDao;
}
Then generate the additional code with this command:
flutter packages pub run build_runner build
And then write the check function using the DAO:
Future<bool> isItAdded(int userId) async {
List<User> list = await usersDao.findUsers(userId);
return list.length > 0;
}
The best solution, though, is not to add a user_id column at all and to use only the unique id column.
I am experimenting with how to propagate back-pressure correctly when I have ConnectedStreams as part of my computation graph. The problem: I have two sources, and one ingests data faster than the other; think of replaying some data where one source has rare events that we use to enrich the other source. These two sources are then connected into a stream that expects them to be at least somewhat synchronized, merges them together somehow (making tuples, enriching, ...) and returns a result.
With single-input streams it's fairly easy to implement backpressure: you simply spend a long time in the processElement function. With connected streams my initial idea was to have some logic in each of the process functions that waits for the other stream to catch up. For example, I could have a buffer that is limited in time span (a span large enough to fit a watermark), and the function would not accept events that would push this span past the threshold. For example:
leftLock.aquire { nonEmptySignal =>
while (queueSpan() > capacity.toMillis && lastTs() < ctx.timestamp()) {
println("WAITING")
nonEmptySignal.await()
}
queueOp { queue =>
println(s"Left Event $value recieved ${Thread.currentThread()}")
queue.add(Left(value))
}
ctx.timerService().registerEventTimeTimer(value.ts)
}
The full code of my example is below (it is written with two locks, assuming access from two different threads, which I think is not actually the case):
import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}
import java.util.concurrent.locks.{Condition, ReentrantLock}
import scala.collection.JavaConverters._
import com.google.common.collect.MinMaxPriorityQueue
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.environment.LocalStreamEnvironment
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.util.Collector
import scala.collection.mutable
import scala.concurrent.duration._
trait Timestamped {
val ts: Long
}
case class StateObject(ts: Long, state: String) extends Timestamped
case class DataObject(ts: Long, data: String) extends Timestamped
case class StatefulDataObject(ts: Long, state: Option[String], data: String) extends Timestamped
class DataSource[A](factory: Long => A, rate: Int, speedUpFactor: Long = 0) extends RichSourceFunction[A] {
private val max = new AtomicLong()
private val isRunning = new AtomicBoolean(false)
private val speedUp = new AtomicLong(0)
private val WatermarkDelay = 5 seconds
override def cancel(): Unit = {
isRunning.set(false)
}
override def run(ctx: SourceFunction.SourceContext[A]): Unit = {
isRunning.set(true)
while (isRunning.get()) {
val time = System.currentTimeMillis() + speedUp.addAndGet(speedUpFactor)
val event = factory(time)
ctx.collectWithTimestamp(event, time)
println(s"Event $event sourced $speedUpFactor")
val watermark = time - WatermarkDelay.toMillis
if (max.get() < watermark) {
ctx.emitWatermark(new Watermark(time - WatermarkDelay.toMillis))
max.set(watermark)
}
Thread.sleep(rate)
}
}
}
class ConditionalOperator {
private val lock = new ReentrantLock()
private val signal: Condition = lock.newCondition()
def aquire[B](func: Condition => B): B = {
lock.lock()
try {
func(signal)
} finally {
lock.unlock()
}
}
}
class BlockingCoProcessFunction(capacity: FiniteDuration = 20 seconds)
extends CoProcessFunction[StateObject, DataObject, StatefulDataObject] {
private type MergedType = Either[StateObject, DataObject]
private lazy val leftLock = new ConditionalOperator()
private lazy val rightLock = new ConditionalOperator()
private var queueState: ValueState[MinMaxPriorityQueue[MergedType]] = _
private var dataState: ValueState[StateObject] = _
override def open(parameters: Configuration): Unit = {
super.open(parameters)
queueState = getRuntimeContext.getState(new ValueStateDescriptor[MinMaxPriorityQueue[MergedType]](
"event-queue",
TypeInformation.of(new TypeHint[MinMaxPriorityQueue[MergedType]]() {})
))
dataState = getRuntimeContext.getState(new ValueStateDescriptor[StateObject](
"event-state",
TypeInformation.of(new TypeHint[StateObject]() {})
))
}
override def processElement1(value: StateObject,
ctx: CoProcessFunction[StateObject, DataObject, StatefulDataObject]#Context,
out: Collector[StatefulDataObject]): Unit = {
leftLock.aquire { nonEmptySignal =>
while (queueSpan() > capacity.toMillis && lastTs() < ctx.timestamp()) {
println("WAITING")
nonEmptySignal.await()
}
queueOp { queue =>
println(s"Left Event $value recieved ${Thread.currentThread()}")
queue.add(Left(value))
}
ctx.timerService().registerEventTimeTimer(value.ts)
}
}
override def processElement2(value: DataObject,
ctx: CoProcessFunction[StateObject, DataObject, StatefulDataObject]#Context,
out: Collector[StatefulDataObject]): Unit = {
rightLock.aquire { nonEmptySignal =>
while (queueSpan() > capacity.toMillis && lastTs() < ctx.timestamp()) {
println("WAITING")
nonEmptySignal.await()
}
queueOp { queue =>
println(s"Right Event $value recieved ${Thread.currentThread()}")
queue.add(Right(value))
}
ctx.timerService().registerEventTimeTimer(value.ts)
}
}
override def onTimer(timestamp: Long,
ctx: CoProcessFunction[StateObject, DataObject, StatefulDataObject]#OnTimerContext,
out: Collector[StatefulDataObject]): Unit = {
println(s"Watermarked $timestamp")
leftLock.aquire { leftSignal =>
rightLock.aquire { rightSignal =>
queueOp { queue =>
while (Option(queue.peekFirst()).exists(x => timestampOf(x) <= timestamp)) {
queue.poll() match {
case Left(state) =>
dataState.update(state)
leftSignal.signal()
case Right(event) =>
println(s"Event $event emitted ${Thread.currentThread()}")
out.collect(
StatefulDataObject(
event.ts,
Option(dataState.value()).map(_.state),
event.data
)
)
rightSignal.signal()
}
}
}
}
}
}
private def queueOp[B](func: MinMaxPriorityQueue[MergedType] => B): B = queueState.synchronized {
val queue = Option(queueState.value()).
getOrElse(
MinMaxPriorityQueue.
orderedBy(Ordering.by((x: MergedType) => timestampOf(x))).create[MergedType]()
)
val result = func(queue)
queueState.update(queue)
result
}
private def timestampOf(data: MergedType): Long = data match {
case Left(y) =>
y.ts
case Right(y) =>
y.ts
}
private def queueSpan(): Long = {
queueOp { queue =>
val firstTs = Option(queue.peekFirst()).map(timestampOf).getOrElse(Long.MaxValue)
val lastTs = Option(queue.peekLast()).map(timestampOf).getOrElse(Long.MinValue)
println(s"Span: $firstTs - $lastTs = ${lastTs - firstTs}")
lastTs - firstTs
}
}
private def lastTs(): Long = {
queueOp { queue =>
Option(queue.peekLast()).map(timestampOf).getOrElse(Long.MinValue)
}
}
}
object BackpressureTest {
var data = new mutable.ArrayBuffer[DataObject]()
def main(args: Array[String]): Unit = {
val streamConfig = new Configuration()
val env = new StreamExecutionEnvironment(new LocalStreamEnvironment(streamConfig))
env.getConfig.disableSysoutLogging()
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(1)
val stateSource = env.addSource(new DataSource(ts => StateObject(ts, ts.toString), 1000))
val dataSource = env.addSource(new DataSource(ts => DataObject(ts, ts.toString), 100, 100))
stateSource.
connect(dataSource).
keyBy(_ => "", _ => "").
process(new BlockingCoProcessFunction()).
print()
env.execute()
}
}
The problem with connected streams is that it seems you cannot simply block in one of the process functions when its stream is too far ahead, since that blocks the other process function as well. On the other hand, if I simply accepted all events in this job, the process function would eventually run out of memory, since it would buffer the whole stream that is ahead.
So my question is: is it possible to propagate backpressure into each of the streams in ConnectedStreams separately, and if so, how? Or alternatively, is there any other nice way to deal with this issue? Possibly by having all the sources communicate somehow to keep them mostly at the same event time?
From my reading of the code in StreamTwoInputProcessor, it looks to me like the processInput() method is responsible for implementing the policy in question. Perhaps one could implement a variant that reads from whichever stream has the lower watermark, so long as it has unread input. Not sure what impact that would have overall, however.
I have a table:
trait Schema {
val db: Database
class CoffeeTable(tag: Tag) extends Table[Coffee](tag, "coffees") {
def from = column[String]("from")
def kind = column[String]("kind")
def sold = column[Boolean]("sold")
// default projection, assuming a case class Coffee(from: String, kind: String, sold: Boolean)
def * = (from, kind, sold) <> (Coffee.tupled, Coffee.unapply)
}
protected val coffees = TableQuery[CoffeeTable]
}
I want to update entries which are sold. Here is a method I end up with:
def markAsSold(soldCoffees: Seq[Coffee]): Future[Int] = {
val cmd = coffees
.filter { coffee =>
soldCoffees
.map(sc => coffee.from === sc.from && coffee.kind === sc.kind)
.reduceLeftOption(_ || _)
.getOrElse(LiteralColumn(false))
}
.map(coffee => coffee.sold)
.update(true)
db.run(cmd)
}
While it works for a small soldCoffees collection, it fails badly with an input of hundreds of elements:
java.lang.StackOverflowError
at slick.ast.TypeUtil$$colon$at$.unapply(Type.scala:325)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.expr(JdbcStatementBuilderComponent.scala:311)
at slick.jdbc.H2Profile$QueryBuilder.expr(H2Profile.scala:99)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.$anonfun$expr$8(JdbcStatementBuilderComponent.scala:381)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.$anonfun$expr$8$adapted(JdbcStatementBuilderComponent.scala:381)
at slick.util.SQLBuilder.sep(SQLBuilder.scala:31)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.expr(JdbcStatementBuilderComponent.scala:381)
at slick.jdbc.H2Profile$QueryBuilder.expr(H2Profile.scala:99)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.$anonfun$expr$8(JdbcStatementBuilderComponent.scala:381)
at slick.jdbc.JdbcStatementBuilderComponent$QueryBuilder.$anonfun$expr$8$adapted(JdbcStatementBuilderComponent.scala:381)
at slick.util.SQLBuilder.sep(SQLBuilder.scala:31)
So the question is: is there another way to do such an update?
What comes to my mind is using a plain SQL query, or introducing an artificial column that holds the concatenated values of the from and kind columns and filtering against it.