Spring Batch for DB2 to SQL Server migration in batches of keys

I have a DB2 table with about 150K records and a SQL Server table with the same columns. One of the columns - let's call it code - holds a unique value and is indexed. I am using Spring Batch. Periodically, I get a file with a list of codes, for example 5K codes. For each code in the file, I need to read the records from the DB2 table whose code column matches that code and insert a few columns from those records into the SQL Server table. I want to use SQL rather than JPA, and I believe there is a limit (let's say 1000) on how many values can appear in a SQL IN clause. Should this be my chunk size?
How should the Spring Batch application for this be designed? I have considered the strategies below but need help deciding which one (or any other) is better.
1) Single-step job with a reader reading codes from the file, a processor using a JdbcTemplate to get rows for a chunk of codes, and a writer writing the rows using JdbcBatchItemWriter - it seems like the JdbcTemplate would hold an open DB connection throughout the job execution.
2) JdbcPagingItemReader - the Spring Batch documentation cautions that databases like DB2 have pessimistic locking strategies and suggests using a driving query instead.
3) Driving query - is there an example? How does the processor convert the key to a full object here? How long does the connection stay open?
4) Chaining readers - is this possible? The first reader would read from the file, the second from DB2, followed by the processor and writer.

I would go with your option #1. Your file containing unique codes effectively becomes your driving query.
Your ItemReader will read the file and emit each code to your ItemProcessor.
The ItemProcessor can either use a JdbcTemplate directly, or delegate to a separate data access object (DAO) in your project; either way, each invocation of the process method pulls one record from your DB2 table. You can do whatever other processing is necessary there before emitting the appropriate object for your ItemWriter, which then inserts or updates the necessary record(s) in your SQL Server table.
Here's an example from a project where I used an ItemReader<Integer> as my driving query to collect the IDs of devices whose configuration data I needed to process. Each ID was then passed to my ItemProcessor, which dealt with one configuration at a time:
public class DeviceConfigDataProcessor implements ItemProcessor<Integer, DeviceConfig> {

    @Autowired
    MyJdbcDao myJdbcDao;

    @Override
    public DeviceConfig process(Integer deviceId) throws Exception {
        DeviceConfig deviceConfig = myJdbcDao.getDeviceConfig(deviceId);
        // process deviceConfig as needed
        return deviceConfig;
    }
}
You would swap out deviceId for code, and DeviceConfig for whatever domain object is appropriate to your project.
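For completeness, here's a minimal sketch of what a JdbcTemplate-backed DAO method such as getDeviceConfig might look like; the table name, column names, and the DeviceConfig constructor used here are illustrative assumptions, not taken from the original project:

import org.springframework.jdbc.core.JdbcTemplate;

public class MyJdbcDao {

    private final JdbcTemplate jdbcTemplate;

    public MyJdbcDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public DeviceConfig getDeviceConfig(Integer deviceId) {
        // Pulls exactly one row per call; table and column names are placeholders
        return jdbcTemplate.queryForObject(
                "SELECT id, name, payload FROM device_config WHERE id = ?",
                (rs, rowNum) -> new DeviceConfig(
                        rs.getInt("id"),
                        rs.getString("name"),
                        rs.getString("payload")),
                deviceId);
    }
}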
If you're using Spring Boot, you get a connection pool automatically, and your DAO pulls a single record at a time for processing, so you don't need to worry about persistent connections to the database, pessimistic locks, etc.
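If it helps, below is a rough sketch of how the whole step could be wired for your case. It assumes Spring Boot with Spring Batch 4 builders, a CodeRecord domain class, a MyJdbcDao.getRecordByCode method, and file, table, and column names that are purely illustrative:

import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class CodeCopyJobConfig {

    // Driving "query": one code per line from the incoming file (path is hypothetical)
    @Bean
    public FlatFileItemReader<String> codeReader() {
        return new FlatFileItemReaderBuilder<String>()
                .name("codeReader")
                .resource(new FileSystemResource("input/codes.txt"))
                .lineMapper((line, lineNumber) -> line.trim())
                .build();
    }

    // One DB2 lookup per code through the JdbcTemplate-backed DAO (method name is hypothetical)
    @Bean
    public ItemProcessor<String, CodeRecord> codeProcessor(MyJdbcDao myJdbcDao) {
        return myJdbcDao::getRecordByCode;
    }

    // Batched inserts into the SQL Server table, one JDBC batch per chunk
    @Bean
    public JdbcBatchItemWriter<CodeRecord> codeWriter(DataSource sqlServerDataSource) {
        return new JdbcBatchItemWriterBuilder<CodeRecord>()
                .dataSource(sqlServerDataSource)
                .sql("INSERT INTO target_table (code, col_a, col_b) VALUES (:code, :colA, :colB)")
                .beanMapped()
                .build();
    }

    @Bean
    public Step copyStep(StepBuilderFactory stepBuilderFactory,
                         FlatFileItemReader<String> codeReader,
                         ItemProcessor<String, CodeRecord> codeProcessor,
                         JdbcBatchItemWriter<CodeRecord> codeWriter) {
        return stepBuilderFactory.get("copyStep")
                .<String, CodeRecord>chunk(100) // commit interval; tune as needed
                .reader(codeReader)
                .processor(codeProcessor)
                .writer(codeWriter)
                .build();
    }
}

With this per-code lookup, the IN-clause limit you mention doesn't constrain your chunk size at all; the chunk size is simply the commit interval for the writes.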

Related

How to read from one DB but write to another using Snowflake's Snowpark?

I'm SUPER new to Snowflake and Snowpark, but I do have respectable SQL and Python experience. I'm trying to use Snowpark to do my data prep and eventually use it in a data science model. However, I cannot write to the database I'm pulling from -- I need to create all tables in a second DB.
I've created code blocks to represent both input and output DBs in their own sessions, but I'm not sure that's helpful, since I have to be in the first session in order to even get the data.
I use code similar to the following to create a new table while in the session for the "input" DB:
my_table = session.table("<SCHEMA>.<TABLE_NAME>")
my_table.toPandas()
table_info = my_table.select(col("<col_name1>"),
                             col("<col_name2>"),
                             col("<col_name3>").alias("<new_name>"),
                             col("<col_name4>"),
                             col("<col_name5>")
                             )
table_info.write.mode('overwrite').saveAsTable('MAINTABLE')
I need to save the table MAINTABLE to a secondary database that is different from the one where the data was pulled from. How do I do this?
It is possible to provide a fully qualified name:
table_info.write.mode('overwrite').saveAsTable('DATABASE_NAME.SCHEMA_NAME.MAINTABLE')
DataFrameWriter.save_as_table
Parameters:
table_name – A string or list of strings that specify the table name or fully-qualified object identifier (database name, schema name, and table name).

Best way to handle large amount of inserts to Azure SQL database using TypeORM

I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities, and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
connection.manager.save(ARRAY OF ENTITIES);
What would be the best scalable solution to handle this? I've got a couple of ideas, but I have no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
One way you can massively scale is to let the DB deal with that problem, e.g. by using external tables. The DB does the parsing; your code only orchestrates.
For example:
Make the data to be inserted available in ADLS (Data Lake), in one of two ways:
Instead of calling your REST API with all the data (in the body or query params as an array), the caller writes the data to an ADLS location as a CSV/JSON/Parquet/... file. OR
The caller remains unchanged, and your Azure Function writes the data to a CSV/JSON/Parquet/... file in an ADLS location (instead of writing to the DB).
Then make the DB read and load the data from ADLS.
First create an external table over the ADLS location (the data source and file format are set up separately):
CREATE EXTERNAL TABLE tmpExtTable (...) WITH (LOCATION = '<ADLS-location>', DATA_SOURCE = <data_source>, FILE_FORMAT = <file_format>);
Then load it into the real table:
INSERT INTO actualTable SELECT * FROM tmpExtTable;
See formats supported by EXTERNAL FILE FORMAT.
You need not drop and re-create the external table each time: whenever you run a SELECT on it, the DB parses whatever data is currently in ADLS. But that's a choice.
I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal way, but at least I got away from the "too many parameters" error.
// Save all data in chunks of 100 entities
connection.manager.save(ARRAY OF ENTITIES, { chunk: 100 });

How can I minimize validation intervals when changing the SQL in ADO.NET Source tasks?

Part of an SSIS package is a data import from an external database via a SQL command embedded in an ADO.NET Source component in the data flow. Whenever I make even the slightest adjustment to the query (such as changing a column name), it takes ages (in this case 1-2 hours) until the program has finished validation. The query itself returns around 30,000 rows with 20 columns each.
Is there any way to cut these long intervals or is this something I have to live with?
I usually store the source queries in a table. The first part of my package executes a SELECT and stores the query returned from the table in a package variable, which is then used by the ADO.NET Source in the data flow. For the variable's design-time default value I use the query that is stored in the database with "where 1=2" appended, so during design time the query does execute but returns only the column metadata. Let me know if you have any questions.

Combining Hadoop MapReduce and database queries

A certain job I'm running needs to collect some metadata from a DB (MySQL, although that's not especially relevant) before processing some large HDFS files. This metadata will be added to the data in the files and passed on to the later map/combine/reduce stages.
I was wondering where the "correct" place to put this query might be. I need the metadata to be available when the mapper begins, but placing the query there seems redundant, as every mapper would execute the same query. How can I (if at all) perform this query once and share its results across all the mappers? Is there a common way to share data between all the nodes performing a task (other than writing it to HDFS)? Thanks.
You can run your MySQL query in your main function and store the result in a string. Then set that value on the Hadoop job configuration object; variables set in the Configuration object can be accessed by all mappers.
Your main class would look like this:
JobConf conf = new JobConf(Driver.class);
String metaInfo = "<your metadata goes here>"; // e.g. the result of the MySQL query
conf.set("metadata", metaInfo);
In your Mapper class you can then access the metadata value as follows:
// Old (mapred) API; the key/value types shown here are illustrative
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private String sMetaInfo = "";

    @Override
    public void configure(JobConf job) {
        // Read the metadata value from the job configuration object
        sMetaInfo = job.get("metadata");
    }

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // map function; sMetaInfo is available to every call
    }
}
I would use Sqoop for ease if you have the Cloudera distribution. I usually program with Cascading in Java and, for DB sources, use DBMigrate as a source "tap", making DBs a first-class citizen. When using primary keys with DBMigrate, the performance has been adequate.

Populate SQL database from textfile on a background thread constantly

Currently, I would like to provide this as an option to the user when storing data to the database:
Save the data to a file and use a background thread to read data from the textfile into SQL Server.
Flow of my program:
- A stream of data comes in from a server constantly (100 per second).
- As another user option, I want to store the data in a textfile and use a background thread to constantly copy data from the textfile to the SQL database.
Has this been done before?
Cheers.
Your question is indeed a bit confusing.
I'm guessing you mean that:
100 rows per second come from a certain source or server (eg. log entries)
One option for the user is textfile caching: the rows are stored in a textfile, and periodically the contents of the textfile are incrementally copied into (an) SQL Server table(s).
Another option for the user is direct insert: the data is stored directly in the database as it comes in, with no textfile in between.
Am I right?
If yes, then you should do something along the lines of:
Create a trigger on an INSERT action to the table
In that trigger, check which user is inserting. If the user has textfile caching disabled, then the insert can go on. Otherwise, the data is redirected to a textfile (or a caching table)
Create a stored procedure that checks the caching table or text file for new data, copies the new data into the real table, and deletes the cached data.
Create an SQL Server Agent job that runs above stored procedure every minute, hour, day...
Since the interface from T-SQL to textfiles is not very flexible, I would recommend using a caching table instead. Why a textfile?
And for that matter, why cache the data before inserting it into the table? Perhaps we can suggest a better solution, if you explain the context of your question.
