I am trying to upsert a large amount of data into Salesforce. I found there are two methods available for this.
1. Use UpsertBulk. This upserts the data in a single shot.
2. Use batching: create an upsert job, then create batches for the upsert operation.
What is the difference between these two methods?
What is the best way to do bulk upsert?
The best way to insert/update/upsert data in bulk is via the recently released Bulk API 2.0. Please see this link: https://resources.docs.salesforce.com/210/latest/en-us/sfdc/pdf/api_bulk_v2.pdf
Bulk API 2.0 simplifies uploading large amounts of data by breaking the data into batches automatically. All you have to do is upload a CSV file with your record data and check back when the results are ready.
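A minimal sketch of what that looks like in practice, assuming a hypothetical Account upsert keyed on an External_Id__c field (the endpoint paths in the comments follow the Bulk API 2.0 guide; the object and field names here are illustrative):

```python
import csv
import io
import json

API_VERSION = "v52.0"  # any API version that supports Bulk API 2.0

def build_ingest_job_request(object_name, external_id_field):
    """Build the JSON body for POST /services/data/<ver>/jobs/ingest.

    Bulk API 2.0 handles batching internally, so this is the only
    job-level setup needed before uploading the CSV."""
    return json.dumps({
        "object": object_name,
        "operation": "upsert",
        "externalIdFieldName": external_id_field,
        "contentType": "CSV",
        "lineEnding": "LF",
    })

def records_to_csv(records, field_names):
    """Serialize records into the CSV payload uploaded via
    PUT /services/data/<ver>/jobs/ingest/<jobId>/batches."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

body = build_ingest_job_request("Account", "External_Id__c")
payload = records_to_csv(
    [{"External_Id__c": "A-1", "Name": "Acme"}],
    ["External_Id__c", "Name"],
)
```

After the upload you mark the job UploadComplete and Salesforce splits the CSV into batches for you; there is no batch management on the client side.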
Both methods use the Bulk API.
Differences:
Upsert bulk - a single operation offered by the Salesforce connector that creates a new Job in Salesforce and creates a Batch within that Job. After the batch is processed, you need to make sure the Job is closed.
Create Job -> Create Batch - two separate operations of the Salesforce connector that create a new Job and add a new Batch within that Job. After the batch(es) are processed, you need to make sure the Job is closed.
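The Create Job -> Create Batch flow boils down to the sketch below. FakeBulkClient is a stand-in for whatever connector or SDK you use; the method names are illustrative, not a real API:

```python
class FakeBulkClient:
    """In-memory stand-in for a Bulk API client (hypothetical API)."""
    def __init__(self):
        self.jobs = {}

    def create_job(self, operation, sobject):
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = {"operation": operation, "sobject": sobject,
                             "batches": [], "state": "Open"}
        return job_id

    def add_batch(self, job_id, records):
        self.jobs[job_id]["batches"].append(records)

    def close_job(self, job_id):
        # Closing tells Salesforce no more batches are coming; until
        # then the job stays open and counts against your limits.
        self.jobs[job_id]["state"] = "Closed"

def upsert_in_batches(client, sobject, records, batch_size=10_000):
    """Create one job, split the records into batches, always close."""
    job_id = client.create_job("upsert", sobject)
    try:
        for i in range(0, len(records), batch_size):
            client.add_batch(job_id, records[i:i + batch_size])
    finally:
        client.close_job(job_id)  # close even if a batch upload fails
    return job_id
```

The try/finally is the point of the sketch: whichever variant you use, an unclosed job is the usual failure mode.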
Both options are pretty similar.
Which one is better for you? I would check the Bulk API limits first and then decide -> Salesforce Bulk API Limits
Depending on the volume of data you are going to process you may want to create one or multiple batches (or maybe use multiple jobs).
We have a Kafka consumer service to ingest data into our DB. Whenever we receive a message from the topic, we compose an INSERT statement to insert that message into the DB. We use a DB connection pool to handle the insertion, and so far so good.
Currently, we need to add a filter to select only the relevant messages from Kafka and insert those. There are two options in my mind to do this.
Option 1: Create a config table in the DB to define our filtering condition.
Pros
No need to make code changes or redeploy services
Just insert new filters to config table, service will pick them the next run
Cons
Need to query the DB every time we receive new messages.
Say we receive 100k new messages daily and need to filter out 50k. In total we only need to run 50k INSERT commands, but we need to run 100k SELECT queries to check the filter condition for every single Kafka message.
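(The per-message SELECT cost in this option can be amortized by caching the filter rows for a short interval; a minimal sketch, where `load_fn` is a hypothetical stand-in for the real config-table query:)

```python
import time

class CachedFilters:
    """Load filter rows from the config table on a fixed interval
    instead of issuing one SELECT per Kafka message."""
    def __init__(self, load_fn, ttl_seconds=60.0, clock=time.monotonic):
        self.load_fn = load_fn        # runs the real DB query
        self.ttl = ttl_seconds
        self.clock = clock            # injectable for testing
        self.filters = set()
        self.loaded_at = None

    def matches(self, key):
        now = self.clock()
        if self.loaded_at is None or now - self.loaded_at >= self.ttl:
            # One query per TTL window, regardless of message volume.
            self.filters = set(self.load_fn())
            self.loaded_at = now
        return key in self.filters
```

With a 60-second TTL, 100k daily messages cost at most ~1,440 SELECTs instead of 100k, at the price of filters taking up to one TTL to take effect.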
Option 2: Use a hardcoded config file to define those filters.
Pros
Only need to read the filters once when the consumer starts running
Puts no burden on the DB layer
Cons
This is not a scalable approach since we plan to add a lot of filters; every time, we would need to change the config file and redeploy the consumer.
My question is: is there a better option to achieve this goal? That is, find the filters without using a hardcoded config file and without increasing the concurrency of DB queries.
Your filters could be in another Kafka topic.
Start your app and read the topic to the end, and only then start doing database inserts. Store each consumed record in some local structure such as a ConcurrentHashMap, SQLite, RocksDB (provided by Kafka Streams), or the recently popular DuckDB.
When you add a new filter, your consumer would need to temporarily pause its database operations.
If you use Kafka Streams, you could look up data from the incoming topic against your filters "table" statestore using the Processor API and drop records from the stream.
This way, you separate your database reads and writes once you start inserting 50k+ records, and your app won't be blocked trying to read any "external config".
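The core of the filters-in-a-topic approach can be sketched in a few lines; here plain lists stand in for the (compacted) filter topic and the message stream, and a None value plays the role of a compacted-topic tombstone:

```python
def apply_filter_updates(filters, records):
    """Fold records from a compacted filter topic into an in-memory
    set. A record value of None is a tombstone that removes the
    filter, mirroring compacted-topic semantics."""
    for key, value in records:
        if value is None:
            filters.discard(key)
        else:
            filters.add(key)
    return filters

def process(messages, filters):
    """Keep only messages whose customer id is in the filter set;
    the kept ones would then be batched into DB INSERTs."""
    return [m for m in messages if m["customer_id"] in filters]

# Filter for c1 was added, then removed via a tombstone; c2 remains.
filters = apply_filter_updates(set(), [("c1", "on"), ("c2", "on"), ("c1", None)])
kept = process([{"customer_id": "c1"}, {"customer_id": "c2"}], filters)
```

Adding a filter is then just producing one record to the filter topic; no redeploy and no extra DB reads on the hot path.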
You could also use Zookeeper, as that's one of its use cases
I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to constantly re-query the state on a periodic basis, which is not ideal for my use case (the per-user state can be large sometimes but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?
I've got a periodically triggered batch job which writes data into a MongoDB. The job needs about 10 minutes, and after that I would like to receive this data and do some transformations with Apache Flink (mapping, filtering, cleaning...). There are some dependencies between the records, which means I have to process them together. For example, I would like to transform all records from the latest batch job where the customer id is 45666. The result would be one aggregated record.
Are there any best practices or ways to do that without implementing everything myself (get distinct customer ids from the latest job, select and transform the records for each customer, flag the transformed customers, etc.)?
I'm not able to stream it because I have to transform multiple records together and not one by one.
Currently I'm using Spring Batch, MongoDB, Kafka and thinking about Apache Flink.
Conceivably you could connect the MongoDB change stream to Flink and use that as the basis for the task you describe. The fact that 10-35 GB of data is involved doesn't rule out using Flink streaming, as you can configure Flink to spill to disk if its state can't fit on the heap.
I would want to understand the situation better before concluding that this is a sensible approach, however.
I am consuming a Kafka topic as a DataStream and using a FlatMapFunction to process the data. The processing consists of enriching the instances that come from the stream with more data that I get from a database by executing a query, but it feels like this is not the best approach.
Reading the docs, I know that I can create a DataSet from a database query, but I only saw examples for batch processing.
Can I perform a merge/reduce (or another operation) with a DataStream and a DataSet to accomplish that?
Can I get any performance improvement using a DataSet instead of accessing the database directly?
There are various approaches one can take for accomplishing this kind of enrichment with Flink's DataStream API.
(1) If you just want to fetch all the data on a one-time basis, you can use a stateful RichFlatMapFunction that does the query in its open() method.
(2) If you want to do a query for every stream element, you could do that synchronously in a FlatMapFunction, or look at Flink's Async I/O for a more performant approach.
(3) For best performance while also getting up-to-date values from the external database, look at streaming in the database change stream and doing a streaming join with a CoProcessFunction. Something like http://debezium.io/ could be useful here.
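To make approach (1) concrete, here is a plain-Python sketch that mirrors the shape of Flink's RichFlatMapFunction (this is not PyFlink; `query_fn` is a hypothetical stand-in for the real database query):

```python
class EnrichingFlatMap:
    """Approach (1): load the reference data once in open(), then
    enrich each stream element from the in-memory copy."""
    def __init__(self, query_fn):
        self.query_fn = query_fn   # runs the real DB query
        self.reference = {}

    def open(self):
        # One bulk query at operator startup instead of one per element.
        self.reference = dict(self.query_fn())

    def flat_map(self, element):
        extra = self.reference.get(element["key"])
        if extra is None:
            return []  # drop elements we cannot enrich
        return [{**element, "extra": extra}]

fn = EnrichingFlatMap(lambda: [("k1", "details-1")])
fn.open()
out = fn.flat_map({"key": "k1"})
```

The trade-off versus approaches (2) and (3) is staleness: the reference data is frozen at open() time, which is fine for slowly changing lookup tables but not for fast-moving state.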
I have an AppEngine application that processes files from Cloud Storage and inserts them into BigQuery.
Because now, and also in the future, I would like to know the sanity/performance of the application, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumers insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later, Cloud Datastore will scale more if you have a lot of data to store. Also, in the Datastore, if you want to update a stats entity transactionally, you will need to shard or you will be limited to 5 updates per second.
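The sharding trick in a nutshell, as a plain-Python sketch (the list stands in for N separate counter entities in Datastore; names and shard count are illustrative):

```python
import random

NUM_SHARDS = 10  # more shards => more sustained writes per second

def increment(shards, amount=1, rng=random):
    """Pick a random shard and update only that entity, so concurrent
    transactional updates contend on different rows instead of one."""
    shards[rng.randrange(len(shards))] += amount

def total(shards):
    # Reading the stat means fetching and summing all shards.
    return sum(shards)

shards = [0] * NUM_SHARDS
for _ in range(100):
    increment(shards)
```

Writes scale with the number of shards; the cost is that reads must aggregate across all of them.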
If the data to insert is small (one row to insert in BQ or one entity in the Datastore), then you can do it with a direct call, but you must accept that the call may fail. If you want to retry on failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks, you must be cautious because they can run more than once.
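Because tasks can be delivered more than once, the handler should be idempotent; a minimal sketch, where `seen_ids` and `store` stand in for durable storage and the field names are illustrative:

```python
def handle_task(seen_ids, store, task):
    """Key each task by a stable id and skip duplicate deliveries,
    so re-running the task cannot double-insert the data."""
    if task["id"] in seen_ids:
        return False  # duplicate delivery, nothing to do
    seen_ids.add(task["id"])
    store.append(task["payload"])
    return True

seen, store = set(), []
handle_task(seen, store, {"id": "t1", "payload": "row-1"})
handle_task(seen, store, {"id": "t1", "payload": "row-1"})  # redelivered
```

In a real system the dedup check and the insert would need to happen in one transaction (or use the target system's own dedup, e.g. BigQuery insert ids) for this to be safe.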