I am trying to calculate hex md5 checksum at in scala js incrementally. The checksum will be verified at server side once file is transferred.
I tried using spark-md5 scala js web jar dependency:
libraryDependencies ++= Seq("org.webjars.npm" % "spark-md5" % "2.0.2")
jsDependencies += "org.webjars.npm" % "spark-md5" % "2.0.2" / "spark-md5.js"
scala js Code:-
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val spark = scala.scalajs.js.Dynamic.global.SparkMD5.ArrayBuffer
reader.onload = (e: Event) => {
spark.prototype.append(e.target)
print("Checksum - > " + spark.end)
}
Error:-
Uncaught TypeError: Cannot read property 'buffer' of undefined
at Object.SparkMD5.ArrayBuffer.append (sampleapp-jsdeps.js:596)
at FileReader. (SampleApp.scala:458)
I tried google but most of the help is available are for javascript, couldn't find anything on how to use this library in scala js.
Sorry If I missed something very obvious, I am new to both javascript & scala js.
From spark-md5 readme, I read:
var spark = new SparkMD5.ArrayBuffer();
spark.append(e.target.result);
var hexHash = spark.end();
The way you translate that in Scala.js is as follows (assuming you want to do it the dynamically typed way):
import scala.scalajs.js
import scala.scalajs.js.typedarray._
import org.scalajs.dom.{FileReader, Event}
val SparkMD5 = js.Dynamic.global.SparkMD5
val spark = js.Dynamic.newInstance(SparkMD5.ArrayBuffer)()
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
val hexHashDyn = spark.end()
val hexHash = hexHashDyn.asInstanceOf[String]
Integrating that with your code snippet yields:
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val SparkMD5 = js.Dynamic.global.SparkMD5
val spark = js.Dynamic.newInstance(SparkMD5)()
reader.onload = (e: Event) => {
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
print("Checksum - > " + spark.end().asInstanceOf[String])
}
If that's the only use of SparkMD5 in your codebase, you can stop there. If you plan to use it several times, you should probably define a facade type for the APIs you want to use:
import scala.scalajs.js.annotation._
#js.native
object SparkMD5 extends js.Object {
#js.native
class ArrayBuffer() extends js.Object {
def append(chunk: js.typedarray.ArrayBuffer): Unit = js.native
def end(raw: Boolean = false): String = js.native
}
}
which you can then use much more naturally as:
val reader = new FileReader
reader.readAsArrayBuffer(data) // data is javascript blob object
val spark = new SparkMD5.ArrayBuffer()
reader.onload = (e: Event) => {
val fileContent = e.target.asInstanceOf[FileReader].result.asInstanceOf[ArrayBuffer]
spark.append(fileContent)
print("Checksum - > " + spark.end())
}
Disclaimer: not tested. It might need small adaptations here and there.
Related
I did this in Glance to get a bitmap of the online URI for display, but it failed.
Does Glance still support weight ratios?
val context = LocalContext.current val imageLoader = ImageLoader(context)
var bitmap = currentState<Bitmap>()
imageLoader.enqueue(
ImageRequest.Builder(context)
.data(parcelItem.artUri)
.error(R.drawable.music)
.target(
onSuccess = {
bitmap = it.toBitmap()
},
onError = {
it?.let {
bitmap = it.toBitmap()
}
}
)
.build()
)
Image(
provider = ImageProvider(bitmap),
modifier = GlanceModifier.size(50.dp).cornerRadius(10.dp).padding(8.dp),
contentDescription = ""
)
It is not possible to use the Coil directly, because the Glance component is required, and the ImageProvider(URI) does not support online resolution.
I just need to get the Bitmap before the update and pass it to the Widget
A Flink Streaming was developed with a filter that does the deduplication based on the id of the event using a key-value state based on RocksDB state backend.
Application Code
env.setStateBackend(new RocksDBStateBackend(checkpoint, true).asInstanceOf[StateBackend])
val stream = env
.addSource(kafkaConsumer)
.keyBy(_.id)
.filter(new Deduplication[Stream]("stream-dedup", Time.days(30))).uid("stream-filter")
Deduplication Code
class Deduplication[T](stateDescriptor: String, time: Time) extends RichFilterFunction[T] {
val ttlConfig: StateTtlConfig = StateTtlConfig
.newBuilder(time)
.setUpdateType(StateTtlConfig.UpdateType.OnReadAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupFullSnapshot
.build
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
deduplicationStateDescriptor.enableTimeToLive(ttlConfig)
lazy val deduplicationState: ValueState[Boolean] = getRuntimeContext.getState(deduplicationStateDescriptor)
override def filter(value: T): Boolean = {
if (deduplicationState.value) {
false
} else {
deduplicationState.update(true)
true
}
}
}
All of this works just fine. My goal with this question is to understand how I can read all the state using state processor api. So I started to write some code based on the documentation available.
Savepoint Reading Code
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val savepoint = Savepoint
.load(env, savepointPath,new RocksDBStateBackend("file:/tmp/rocksdb", true))
savepoint
.readKeyedState("stream-filter", new DeduplicationStateReader("stream-dedup")).print()
Reader Function Code
class DeduplicationStateReader(stateDescriptor: String) extends KeyedStateReaderFunction[String, String] {
var state: ValueState[Boolean] = _
override def open(parameters: Configuration): Unit = {
val deduplicationStateDescriptor = new ValueStateDescriptor[Boolean](stateDescriptor, classOf[Boolean])
state = getRuntimeContext.getState(deduplicationStateDescriptor)
}
override def readKey(key: String, ctx: KeyedStateReaderFunction.Context, out: Collector[String]): Unit = {
out.collect("IT IS WORKING")
}
}
Whenever I try to read the state, a serialization error appears to me.
Is there anything wrong? Did I misunderstand all of this?
I am using a rest-api which is supposed to import data out of csv files. The uploading and mapping to object part is working, but not the saveAll(), it just takes years to save 130000~ rows to the database (which is running on a mssqlserver) and it should be working with much bigger files in less time.
This is what my data class looks:
#Entity
data class Street(
#Id
#GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "streetSeq")
#SequenceGenerator(name = "streetSeq", sequenceName = "streetSeq", allocationSize = 1)
val id: Int,
val name: String?,
val municipalityId: Int?
)
And im just using the saveAll methode. Everything in the import methode works relativ fast (like 10 seconds) untlis the saveAll()
override fun import(file: MultipartFile) {
val inputStream = file.inputStream
var import: List<Street> = listOf()
tsvReader.open(inputStream) {
val csvContents = readAllWithHeaderAsSequence()
val dataClasses = grass<ImportStreet>().harvest(csvContents)
dataClasses.forEach { row ->
import = import + toStreet(row)
}
println("Data Converted")
}
streetRepository.saveAll(import)
inputStream.close()
}
I already tried to adjust the application.yml but its not making a big diffrence.
jpa:
properties:
hibernate:
ddl-auto: update
dialect: org.hibernate.dialect.SQLServer2012Dialect
generate_statistics: true
order_inserts: true
order_updates: true
jdbc:
batch_size: 1000
As you can see in the implementaiton fo SimpleJpaRepository it's not doing any kind of batching save, it just saves each entity one by one, that's why it is slow
https://github.com/spring-projects/spring-data-jpa/blob/d35ee1a82bf0fdf2de2724a02619eea1cf3c98bd/src/main/java/org/springframework/data/jpa/repository/support/SimpleJpaRepository.java#L584
Assert.notNull(entities, "Entities must not be null!");
List<S> result = new ArrayList<S>();
for (S entity : entities) {
result.add(save(entity));
}
return result;
So try to implement batch save without usage of spring-data-jpa, for example using spring-batch
As the other comments / awnsers suggested i tried to use spring-batch, but it didnt work out for me (mostly because i didnt know how to pull it off with it)
After trying out more stuff i found jdbcTemplate, which perfectly worked out for me, the inserts are now much faster then with the saveAll()
override fun import(file: MultipartFile) {
val inputStream = file.inputStream
var import: List<Street> = listOf()
tsvReader.open(inputStream) {
val csvContents = readAllWithHeaderAsSequence()
val dataClasses = grass<ImportStreet>().harvest(csvContents)
dataClasses.forEach { row ->
import = import + toStreet(row)
}
println("Data Converted")
}
batchInsert(import)
inputStream.close()
}
The batchInstert methode is using the jdbcTemplate.batchUpdate() fun
fun batchInsert(streets: List<Street>): IntArray? {
return jdbcTemplate.batchUpdate(
"INSERT INTO street (name, municipalityId) VALUES (?, ?)",
object: BatchPreparedStatementSetter {
#Throws(SQLException::class)
override fun setValues(ps: PreparedStatement, i: Int) {
ps.setString(1, streets[i].name)
streets[i].municipalityId?.let { ps.setInt(2, it) }
}
override fun getBatchSize(): Int {
return streets.size
}
})
}
I am reading data from Kafka and trying to write it to the HDFS file system in ORC format. I have used the below link reference from their official website. But I can see that Flink write exact same content for all data and make so many files and all files are ok 103KB
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/streamfile_sink.html#orc-format
Please find my code below.
object BeaconBatchIngest extends StreamingBase {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
def getTopicConfig(configs: List[Config]): Map[String, String] = (for (config: Config <- configs) yield (config.getString("sourceTopic"), config.getString("destinationTopic"))).toMap
def setKafkaConfig():Unit ={
val kafkaParams = new Properties()
kafkaParams.setProperty("bootstrap.servers","")
kafkaParams.setProperty("zookeeper.connect","")
kafkaParams.setProperty("group.id", DEFAULT_KAFKA_GROUP_ID)
kafkaParams.setProperty("auto.offset.reset", "latest")
val kafka_consumer:FlinkKafkaConsumer[String] = new FlinkKafkaConsumer[String]("sourceTopics", new SimpleStringSchema(),kafkaParams)
kafka_consumer.setStartFromLatest()
val stream: DataStream[DataParse] = env.addSource(kafka_consumer).map(new temp)
val schema: String = "struct<_col0:string,_col1:bigint,_col2:string,_col3:string,_col4:string>"
val writerProperties = new Properties()
writerProperties.setProperty("orc.compress", "ZLIB")
val writerFactory = new OrcBulkWriterFactory(new PersonVectorizer(schema),writerProperties,new org.apache.hadoop.conf.Configuration);
val sink: StreamingFileSink[DataParse] = StreamingFileSink
.forBulkFormat(new Path("hdfs://warehousestore/hive/warehouse/metrics_test.db/upp_raw_prod/hour=1/"), writerFactory)
.build()
stream.addSink(sink)
}
def main(args: Array[String]): Unit = {
setKafkaConfig()
env.enableCheckpointing(5000)
env.execute("Kafka_Flink_HIVE")
}
}
class temp extends MapFunction[String,DataParse]{
override def map(record: String): DataParse = {
new DataParse(record)
}
}
class DataParse(data : String){
val parsedJason = parse(data)
val timestamp = compact(render(parsedJason \ "timestamp")).replaceAll("\"", "").toLong
val event = compact(render(parsedJason \ "event")).replaceAll("\"", "")
val source_id = compact(render(parsedJason \ "source_id")).replaceAll("\"", "")
val app = compact(render(parsedJason \ "app")).replaceAll("\"", "")
val json = data
}
class PersonVectorizer(schema: String) extends Vectorizer[DataParse](schema) {
override def vectorize(element: DataParse, batch: VectorizedRowBatch): Unit = {
val eventColVector = batch.cols(0).asInstanceOf[BytesColumnVector]
val timeColVector = batch.cols(1).asInstanceOf[LongColumnVector]
val sourceIdColVector = batch.cols(2).asInstanceOf[BytesColumnVector]
val appColVector = batch.cols(3).asInstanceOf[BytesColumnVector]
val jsonColVector = batch.cols(4).asInstanceOf[BytesColumnVector]
timeColVector.vector(batch.size + 1) = element.timestamp
eventColVector.setVal(batch.size + 1, element.event.getBytes(StandardCharsets.UTF_8))
sourceIdColVector.setVal(batch.size + 1, element.source_id.getBytes(StandardCharsets.UTF_8))
appColVector.setVal(batch.size + 1, element.app.getBytes(StandardCharsets.UTF_8))
jsonColVector.setVal(batch.size + 1, element.json.getBytes(StandardCharsets.UTF_8))
}
}
With bulk formats (such as ORC), the StreamingFileSink rolls over to new files with every checkpoint. If you reduce the checkpointing interval (currently 5 seconds), it won't write so many files.
I use Source.queue to queue up HttpRequests and throttle it on the client side to download files from a remote server. I understand that Source.queue is not threadsafe and we need to use MergeHub to make it threadsafe. Following is the piece of code that uses Source.queue and uses cachedHostConnectionPool.
import java.io.File
import akka.actor.Actor
import akka.event.Logging
import akka.http.scaladsl.Http
import akka.http.scaladsl.client.RequestBuilding
import akka.http.scaladsl.model.{HttpResponse, HttpRequest, Uri}
import akka.stream._
import akka.stream.scaladsl._
import akka.util.ByteString
import com.typesafe.config.ConfigFactory
import scala.concurrent.{Promise, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}
class HttpClient extends Actor with RequestBuilding {
implicit val system = context.system
val logger = Logging(system, this)
implicit lazy val materializer = ActorMaterializer()
val config = ConfigFactory.load()
val remoteHost = config.getString("pool.connection.host")
val remoteHostPort = config.getInt("pool.connection.port")
val queueSize = config.getInt("pool.queueSize")
val throttleSize = config.getInt("pool.throttle.numberOfRequests")
val throttleDuration = config.getInt("pool.throttle.duration")
import scala.concurrent.ExecutionContext.Implicits.global
val connectionPool = Http().cachedHostConnectionPool[Promise[HttpResponse]](host = remoteHost, port = remoteHostPort)
// Construct a Queue
val requestQueue =
Source.queue[(HttpRequest, Promise[HttpResponse])](queueSize, OverflowStrategy.backpressure)
.throttle(throttleSize, throttleDuration.seconds, 1, ThrottleMode.shaping)
.via(connectionPool)
.toMat(Sink.foreach({
case ((Success(resp), p)) => p.success(resp)
case ((Failure(error), p)) => p.failure(error)
}))(Keep.left)
.run()
// Convert Promise[HttpResponse] to Future[HttpResponse]
def queueRequest(request: HttpRequest): Future[HttpResponse] = {
val responsePromise = Promise[HttpResponse]()
requestQueue.offer(request -> responsePromise).flatMap {
case QueueOfferResult.Enqueued => responsePromise.future
case QueueOfferResult.Dropped => Future.failed(new RuntimeException("Queue overflowed. Try again later."))
case QueueOfferResult.Failure(ex) => Future.failed(ex)
case QueueOfferResult.QueueClosed => Future.failed(new RuntimeException("Queue was closed (pool shut down) while running the request. Try again later."))
}
}
def receive = {
case "download" =>
val uri = Uri(s"http://localhost:8080/file_csv.csv")
downloadFile(uri, new File("/tmp/compass_audience.csv"))
}
def downloadFile(uri: Uri, destinationFilePath: File) = {
def fileSink: Sink[ByteString, Future[IOResult]] =
Flow[ByteString].buffer(512, OverflowStrategy.backpressure)
.toMat(FileIO.toPath(destinationFilePath.toPath)) (Keep.right)
// Submit to queue and execute HttpRequest and write HttpResponse to file
Source.fromFuture(queueRequest(Get(uri)))
.flatMapConcat(_.entity.dataBytes)
.via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 10000, allowTruncation = true))
.map(_.utf8String)
.map(d => s"$d\n")
.map(ByteString(_))
.runWith(fileSink)
}
}
However, when I use MergeHub, it returns Sink[(HttpRequest, Promise[HttpResponse]), NotUsed]. I need to extract the response.entity.dataBytes and write the response to a file using a filesink. I am not able figure out how to use MergeHub to achieve this. Any help will be appreciated.
val hub: Sink[(HttpRequest, Promise[HttpResponse]), NotUsed] =
MergeHub.source[(HttpRequest, Promise[HttpResponse])](perProducerBufferSize = queueSize)
.throttle(throttleSize, throttleDuration.seconds, 1, ThrottleMode.shaping)
.via(connectionPool)
.toMat(Sink.foreach({
case ((Success(resp), p)) => p.success(resp)
case ((Failure(error), p)) => p.failure(error)
}))(Keep.left)
.run()
Source.Queue is actually thread safe now. If you want to use MergeHub:
private lazy val poolFlow: Flow[(HttpRequest, Promise[HttpResponse]), (Try[HttpResponse], Promise[HttpResponse]), Http.HostConnectionPool] =
Http().cachedHostConnectionPool[Promise[HttpResponse]](host).tail.head, port, connectionPoolSettings)
val ServerSink =
poolFlow.toMat(Sink.foreach({
case ((Success(resp), p)) => p.success(resp)
case ((Failure(e), p)) => p.failure(e)
}))(Keep.left)
// Attach a MergeHub Source to the consumer. This will materialize to a
// corresponding Sink.
val runnableGraph: RunnableGraph[Sink[(HttpRequest, Promise[HttpResponse]), NotUsed]] =
MergeHub.source[(HttpRequest, Promise[HttpResponse])](perProducerBufferSize = 16).to(ServerSink)
val toConsumer: Sink[(HttpRequest, Promise[HttpResponse]), NotUsed] = runnableGraph.run()
protected[akkahttp] def executeRequest[T](httpRequest: HttpRequest, unmarshal: HttpResponse => Future[T]): Future[T] = {
val responsePromise = Promise[HttpResponse]()
Source.single((httpRequest -> responsePromise)).runWith(toConsumer)
responsePromise.future.flatMap(handleHttpResponse(_, unmarshal))
)
}
}