A certain job I'm running needs to collect some metadata from a DB (MySQL, although that's not as relevant) before processing some large HDFS files. This metadata will be added to data in the files and passed on to the later map/combine/reduce stages.
I was wondering where the "correct" place to put this query might be. I need the metadata to be available when the mapper begins, but placing it there seems redundant, as every mapper would execute the same query. How can I (if at all) perform this query once and share its results across all the mappers? Is there a common way to share data between all the nodes performing a task (other than writing it to HDFS)? Thanks.
You can run your MySQL query in your main function and store the result of the query in a string. Then set that variable on the Hadoop job Configuration object. Variables set in the Configuration object can be accessed by all mappers.
Your main class would look something like this:
JobConf conf = new JobConf(Driver.class);
String metainfo = <your metadata info goes here>;
conf.set("metadata", metainfo);
So in your Map class you can access the metadata value as follows:
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private String sMetaInfo = "";

    @Override
    public void configure(JobConf job) {
        // Read the metadata value from the job Configuration object
        sMetaInfo = job.get("metadata");
    }

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // map function
    }
}
I would use Sqoop for ease if you have the Cloudera distribution. I usually program with Cascading in Java, and for DB sources I use DBMigrate as a source "tap", making DBs a first-class citizen. When using primary keys with DBMigrate, the performance has been adequate.
I have an API created with Azure Functions (TypeScript). These functions receive arrays of JSON data, convert them to TypeORM entities and insert them into an Azure SQL database. I recently ran into an issue where the array had hundreds of items, and I got an error:
The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request
I figured that saving all of the data at once using the entity manager causes the issue:
const connection = await createConnection();
await connection.manager.save(arrayOfEntities);
What would be the best scalable solution to handle this? I've got a couple of ideas, but I've no idea if they're any good, especially performance-wise.
Begin transaction -> Start saving the entities individually inside forEach loop -> Commit
Split the array into smaller arrays -> Begin transaction -> Save the smaller arrays individually -> Commit (sketched below)
Something else?
Right now the array sizes are in the tens or hundreds, but occasional arrays with 10k+ items are also a possibility.
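To be concrete, idea 2 would look roughly like this with TypeORM's transaction API (saveInChunks is a hypothetical helper; the chunk size of 100 is arbitrary):
import { Connection } from "typeorm";

// Save the entities in slices inside one transaction so that each
// generated INSERT stays under the server's parameter limit.
async function saveInChunks<T>(connection: Connection, entities: T[], chunkSize = 100): Promise<void> {
  await connection.transaction(async (manager) => {
    for (let i = 0; i < entities.length; i += chunkSize) {
      await manager.save(entities.slice(i, i + chunkSize));
    }
  });
}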
One way you can massively scale is to let the DB deal with that problem, e.g. by using external tables. The DB does the parsing; your code only orchestrates.
For example:
Make the data to be inserted available in ADLS (Data Lake):
instead of calling your REST API with all the data (as an array in the body or query params), the caller writes the data to an ADLS location as a csv/json/parquet/... file. OR
The caller remains unchanged. Your Azure Function writes the data to some csv/json/parquet/... file in an ADLS location (instead of writing to the DB).
Make the DB read and load the data from ADLS:
First: CREATE EXTERNAL TABLE tmpExtTable ... WITH (LOCATION = '<ADLS-location>', ...)
Then: INSERT INTO actualTable SELECT * FROM tmpExtTable (see the fuller sketch below)
See the formats supported by EXTERNAL FILE FORMAT.
You need not drop and re-create the external table each time; whenever you run a SELECT on it, the DB will go and parse the data in ADLS. But that's your choice.
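A rough sketch of the DB side, assuming Synapse-style external table syntax (the data source, file format and column names are hypothetical and must match your setup):
-- One-time setup (names are hypothetical)
CREATE EXTERNAL TABLE tmpExtTable (
    Col1 INT,
    Col2 NVARCHAR(100)
)
WITH (
    LOCATION = '/incoming/batch.csv',   -- the ADLS path the caller wrote to
    DATA_SOURCE = MyAdlsDataSource,     -- pre-created EXTERNAL DATA SOURCE
    FILE_FORMAT = MyCsvFileFormat       -- pre-created EXTERNAL FILE FORMAT
);

-- Each load: the DB parses the file in ADLS and copies the rows over
INSERT INTO actualTable
SELECT * FROM tmpExtTable;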
I ended up doing this the easy way, as TypeORM already provides the ability to save in chunks. It might not be the most optimal way, but at least I got rid of the "too many parameters" error: with a chunk size of 100, each statement stays under the 2100-parameter limit as long as an entity binds at most 21 columns.
// Save all data in chunks of 100 entities
await connection.manager.save(arrayOfEntities, { chunk: 100 });
I have a DB2 table with about 150K records. I have another SQL Server table with the same columns. One of the columns - let's call it code - is a unique value and is indexed. I am using Spring Batch. Periodically, I get a file with a list of codes, for example a file with 5K codes. For each code in the file, I need to read the records from the DB2 table whose code column matches the code in the file and insert a few columns from those records into the SQL Server table. I want to use SQL and not JPA, and I believe there is a limit (let's say 1000) on how many values can be in a SQL IN clause. Should this be my chunk size?
How should the Spring Batch application be designed to do this? I have considered the strategies below but need help deciding which one (or any other) is better.
1) Single-step job with a reader reading codes from the file, a processor using a JdbcTemplate to get the rows for a chunk of codes, and a writer writing the rows using JdbcBatchItemWriter - it seems like the JdbcTemplate would hold an open DB connection throughout job execution.
2) JdbcPagingItemReader - the Spring Batch documentation cautions that databases like DB2 have pessimistic locking strategies and suggests using a driving query instead.
3) Driving query - Is there an example? How does the processor convert the key into a full object here? How long does the connection stay open?
4) Chaining readers - is this possible? The first reader would read from the file, the second from DB2, and then the processor and writer would run.
I would go with your option #1. Your file containing unique codes effectively becomes your driving query.
Your ItemReader will read the file and emit each code to your ItemProcessor.
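For the file side, a minimal reader sketch might look like this (the file location and the one-code-per-line format are assumptions):
@Bean
public FlatFileItemReader<String> codeReader() {
    return new FlatFileItemReaderBuilder<String>()
            .name("codeReader")
            .resource(new FileSystemResource("/data/codes.txt")) // the periodic file of codes
            .lineMapper((line, lineNumber) -> line.trim())       // one code per line
            .build();
}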
The ItemProcessor can either use a JdbcTemplate directly, or you can delegate to a separate data access object (DAO) in your project. Either way, with each invocation of the process method a new record is pulled in from your DB2 database table. You can do whatever other processing is necessary there before emitting the appropriate object to your ItemWriter, which then inserts or updates the necessary record(s) in your SQL Server database table.
Here's an example from a project where I used an ItemReader<Integer> as my driving query to collect the IDs of devices for which I needed to process configuration data. I then passed each ID on to my ItemProcessor, which dealt with one configuration at a time:
public class DeviceConfigDataProcessor implements ItemProcessor<Integer, DeviceConfig> {

    @Autowired
    MyJdbcDao myJdbcDao;

    @Override
    public DeviceConfig process(Integer deviceId) throws Exception {
        DeviceConfig deviceConfig = myJdbcDao.getDeviceConfig(deviceId);
        // process deviceConfig as needed
        return deviceConfig;
    }
}
You would swap out deviceId for code, and DeviceConfig for whatever domain object is appropriate to your project.
If you're using Spring Boot, you get a connection pool automatically, and your DAO will pull a single record at a time for processing, so you don't need to worry about persistent connections to the database, pessimistic locks, etc.
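If it helps, the SQL Server side can be the JdbcBatchItemWriter you mentioned; a rough sketch (the table and column names are hypothetical):
@Bean
public JdbcBatchItemWriter<DeviceConfig> writer(DataSource sqlServerDataSource) {
    return new JdbcBatchItemWriterBuilder<DeviceConfig>()
            .dataSource(sqlServerDataSource)
            .sql("INSERT INTO device_config (code, payload) VALUES (:code, :payload)")
            .beanMapped() // binds :code and :payload to DeviceConfig properties
            .build();
}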
Is there a way to use multimapping in Dapper in a generic way, without using custom SQL embedded in C# code?
See for example
Correct use of Multimapping in Dapper
Is there a generic way to query the data from two related entities, where common fields are determined automatically for the join?
Don't do this. Don't even think this way! Databases are long-lasting and normalized; objects are perishable and frequently denormalized, and transitioning between the two is something to do thoughtfully, when you're writing your SQL. This is really not a step to automate. Long, painful experience has convinced many of us that database abstractions (tables and joins) should not just be sucked into (or generated out of) code. If you're not yet convinced, then use an established ORM.
If, on the other hand, you absolutely want to be in control of your SQL, but it's the "embedding" in string literals in C# that bugs you, then I couldn't agree more. Can I suggest QueryFirst, a Visual Studio extension that generates the C# wrapper for your queries. Your SQL stays in a real SQL file, syntax validated, DB references checked, and on each save QueryFirst generates a wrapper class with Execute() methods and a POCO for the results.
By multi-mapping, I presume you want to fill a graph of nested objects. A nice way to do this is to use one QueryFirst .sql file per class in your graph, then in the partial class of the parent add a List of the children. (QueryFirst-generated POCOs are split across two partial classes; you control one of them and the tool generates the other.)
So, for a graph of Customers and their orders...
In the parent sql
select * from customers where name like @custName
The child sql
select * from orders where customerId = @customerId
In the parent partial class, for eager loading...
public List<Orders> orders;
public void OnLoad()
{
    orders = new GetOrders().Execute(customerId); // customerId is a property of the parent POCO
}
or for lazy loading...
private List<Orders> _orders;
public List<Orders> orders
{
    get
    {
        return _orders ?? (_orders = new GetOrders().Execute(customerId));
    }
}
Five lines of code, not counting brackets, and you have a nested graph, lazy- or eager-loaded as you prefer, with the interface discoverable in code (IntelliSense for the input parameter and the result). There might be hundreds of columns in those tables whose names you will never need to re-type, and whose data types will flow transparently into your C#.
Clean separation of responsibilities. Total control. Disclaimer: I wrote QueryFirst :-)
Multimapping with Dapper is a way of mapping the columns of a single query (typically a join) across multiple objects.
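For reference, a typical multimapping call looks something like this (the Customer/Order types and the join are hypothetical):
var sql = @"SELECT c.*, o.*
            FROM Customers c
            JOIN Orders o ON o.CustomerId = c.Id";

var customers = connection.Query<Customer, Order, Customer>(
    sql,
    (customer, order) => { customer.Orders.Add(order); return customer; },
    splitOn: "Id"); // the column at which the Order part of each row begins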
In the context of this question, multimapping is not even relevant: you're asking for a way to automatically generate a SQL query from the given objects, creating the correct joins, which would result in a single SQL query; that is not multimapping.
I suspect what you're looking for is something along the lines of the Entity Framework. There are a couple of Dapper extension projects you may want to look into which will generate some of your SQL. See: Dapper.Rainbow VS Dapper.Contrib
I have been developing programs in VB.NET for a few years and am familiar with it. The area where I do not have a lot of exposure is databases.
I am writing a program (for my personal use) called Movie Manager. It will store information on movies I have. I have selected the SQL Server Compact Edition database. Assume I have a database with two tables, namely Info and Cast. The Info table has a few columns such as movie_name, release_date and so on. The Cast table has a few columns such as first_name, last_name, etc.
Right now I have created a DataSet which reads all the table data from the database (opens a connection, fills the tables, closes the connection). This way I have a snapshot of the database in a global variable.
My questions:
Once I have the data with me, every time I need to add, edit or delete a record I have to open a connection, fire a SQL statement and close the connection, right? Is there a way to do this without using SQL? And is this concept okay?
Since I am not using structures, I need to create empty DataSets to store temp information. Is this convenient?
If I have to search for a specific thing in a DataSet table, do I have to loop through all the items, or can I use SQL on the DataSet, or is there an alternative?
1) Once I have the data with me, every time I need to add, edit or delete a record I have to open a connection, fire a SQL statement and close the connection, right? Is there a way to do this without using SQL? And is this concept okay?
No. To update a database, you have to use the database. Create a stored procedure in the database to handle your functionality and then call it from the code, passing in whatever data needs to be saved. DO NOT USE INLINE SQL. Parameterized stored procedures are the way to go.
2) Since I am not using structures, I need to create empty DataSets to store temp information. Is this convenient?
It depends on what you're doing. I would create an object model to retain my updated data and then pass its properties into the stored procedure when it's time to save my changes.
3) If I have to search for a specific thing in a DataSet table, do I have to loop through all the items, or can I use SQL on the DataSet, or is there an alternative?
You can loop through the rows, or you can use LINQ to pull out what you need. LINQ is really nice, as it's basically .NET-coded queries against a collection.
There are plenty of tutorials/guides out there to show you how to update via a stored proc call from your code. There are a ton of LINQ tutorials as well. Basically, a LINQ query against your table will look something like:
Dim result As List(Of String) =
    (From r In table.AsEnumerable()
     Where r.Field(Of String)("columnName") = "the value"
     Select r.Field(Of String)("columnName")).ToList()
This is LINQ to DataSet; the AsEnumerable extension comes from System.Data.DataSetExtensions.
Edit
Your Model:
Public Class EmployeeModel
    Public Property Id As Integer
    Public Property FirstName As String
    Public Property LastName As String
    Public Property JobCode As String

    Public Sub New(your params)
        ' set properties
    End Sub
End Class
Your DAL:
Public Class EmployeeDAL
    Public Shared Sub SaveEmployee(ByVal model As EmployeeModel)
        ' call your sp_SaveEmployee stored procedure and set its
        ' parameters from the model properties, e.g.:
        ' @Id = model.Id
        ' @JobCode = model.JobCode
        ' ...
    End Sub
End Class
I use VB every few months, so there may still be small syntax errors in there, but that's basically all you need to do. The function to save your data is in the DAL, not in the class itself. If you don't want to use a DAL, you can put the save functionality in your class instead; it'll work the same way, it's just not as clearly separated.
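For completeness, a minimal sketch of what the stored-procedure call inside SaveEmployee might look like, assuming full SQL Server and a hypothetical connection string (note that SQL Server Compact Edition does not support stored procedures, so there you would use a parameterized command instead):
Imports System.Data
Imports System.Data.SqlClient

Public Shared Sub SaveEmployee(ByVal model As EmployeeModel)
    Using conn As New SqlConnection("your connection string here")
        Using cmd As New SqlCommand("sp_SaveEmployee", conn)
            cmd.CommandType = CommandType.StoredProcedure
            ' Map model properties to stored procedure parameters
            cmd.Parameters.AddWithValue("@Id", model.Id)
            cmd.Parameters.AddWithValue("@JobCode", model.JobCode)
            conn.Open()
            cmd.ExecuteNonQuery()
        End Using
    End Using
End Sub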
On your questions:
Number 1: You have to connect to the database in order to store and retrieve data. There are lots of ways to deal with it; one of them is to keep the connection string in app.config, or you may simply create a function that opens the connection every time you need it.
Number 2: Since you are dealing with a DataSet, here are some tips you might want to look at: DataSet
Number 3: You can also try using a DataAdapter and a DataTable. I am not sure what you meant by your question number 3.
Cheers
I have a problem with the way you are using your database and your computer's RAM.
Problem 1: Since you already have a database holding the movies' information, why are you holding the same information in memory again, creating extra overhead? If your answer is performance, or "I have cheap memory", then why not use XML or a flat file instead? A database is not needed in that scenario.
Problem 2: You are like a soldier who doesn't know the weapon he uses, because you are asking basic questions. On your first question about opening connections: yes, you have to open the connection every time you save/read data, and close it as soon as possible.
On your second question about convenience: the answer is no. Instead, create a class with all the fields as properties and some methods for initialization, saving and deleting; that way you have to write less code. And suppose you have a movie named xyz: there can be another movie xyz, so how will you distinguish them? If you have the whole information in front of you, you can do it via release date, cast etc., but it will still be hard, so create a primary key for both of your tables.
And finally, your third question: it will be easier to use SQL queries than to loop through the DataSet (get rid of the DataSet as soon as possible).
Wish you good luck on the road to RDBMS.
I'm new to Hibernate and I am a fan of iBATIS, but my new work environment forces me to use Hibernate. In the current scenario at my workplace, there are many complex select queries that I feel Hibernate doesn't handle very well.
So I thought of writing my own native SQL select queries, with lots and lots of joins and conditions. So far so good.
I'm using the EntityManager to fire the queries and return the result set. The sample code will look like this:
String sqlQuery = "select COL_1, COL_2 from MY_TABLE_NAME";
Query q = em.createNativeQuery(sqlQuery);
List<Object[]> resultList = q.getResultList();
Please keep in mind this is just a code sample. I would like a way for the result set in this example (COL_1 and COL_2) to be mapped to a Java class of mine that has two (or more) fields.
Also note that this Java class cannot be an entity that is linked to a table, so I need a way to define any of my Java classes as a result class, the way we do in iBATIS.
Is there any way to do this, so that I just define a mapping somewhere for any Java class and the result set from the query gets mapped to it automatically?
String sqlQuery = "select COL_1, COL_2 from MY_TABLE_NAME";
Query q = em.createNativeQuery(sqlQuery, MyTableEntity.class);
List<MyTableEntity> resultList = q.getResultList();
Inspect the variable in your IDE's debug mode and you will see the mapped results. Note, though, that this overload of createNativeQuery only works when the class you pass in (MyTableEntity here) is a mapped @Entity; for a plain non-entity class, see the sketch below.
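Since the question asks for a non-entity result class, the standard JPA way is a constructor result mapping; a minimal sketch, assuming a plain MyDto class (all names here are hypothetical, and the @SqlResultSetMapping annotation has to be declared on some @Entity class):
@SqlResultSetMapping(
    name = "MyDtoMapping",
    classes = @ConstructorResult(
        targetClass = MyDto.class,
        columns = {
            @ColumnResult(name = "COL_1", type = String.class),
            @ColumnResult(name = "COL_2", type = String.class)
        }))

// MyDto is a plain class with a matching constructor:
public class MyDto {
    private final String col1;
    private final String col2;

    public MyDto(String col1, String col2) {
        this.col1 = col1;
        this.col2 = col2;
    }
}

// Usage: pass the mapping name instead of an entity class
List<MyDto> resultList = em.createNativeQuery(
        "select COL_1, COL_2 from MY_TABLE_NAME", "MyDtoMapping")
    .getResultList();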