How many times is the DBRecordReader getting created? - database

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split a record reader is created, which starts reading records from that split.
If there are n records in an input split, the map method of the mapper is called n times, each time reading one key-value pair via the record reader.
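To make that concrete, here is a minimal mapper sketch using the standard org.apache.hadoop.mapreduce API (the class is purely illustrative; with TextInputFormat the record reader hands map() one line per call):
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper over a text input split: the record reader produces
// (byte offset, line) pairs and map() is invoked once per record.
public class LineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Called n times for an input split containing n records.
        context.write(new Text("lines"), ONE);
    }
}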
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the connection parameters using DBConfiguration and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know:
How are the InputSplits getting created in the above case (database)?
What does the input split creation depend on: the number of records my SQL query generates, or the total number of records in the table (database)?
How many DBRecordReaders are getting created in the above case (database)?

How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
  DBInputSplit split;
  if ((i + 1) == chunks)
    split = new DBInputSplit(i * chunkSize, count);
  else
    split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
  splits.add(split);
}
That is the how; to understand what it depends on, let's take a look at how chunkSize is computed:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count, the result of the count query (by default SELECT COUNT(*) FROM tableName), and divides it by chunks, which is mapred.map.tasks, or 1 if that is not defined in the configuration.
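As a worked example (the numbers are purely illustrative), suppose the count query returns 70 rows and mapred.map.tasks is 3. Then chunkSize = 70 / 3 = 23 and the loop above produces splits covering rows [0, 23), [23, 46) and [46, 70), with the last chunk absorbing the remainder:
// Standalone illustration of the split arithmetic above; not Hadoop code itself.
public class SplitMathExample {
    public static void main(String[] args) {
        long count = 70;                 // rows reported by the count query
        int chunks = 3;                  // mapred.map.tasks
        long chunkSize = count / chunks; // 23

        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            long end = ((i + 1) == chunks) ? count : start + chunkSize;
            System.out.println("split " + i + ": rows [" + start + ", " + end + ")");
        }
        // Prints: split 0: rows [0, 23)
        //         split 1: rows [23, 46)
        //         split 2: rows [46, 70)
    }
}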
Then finally, each input split gets a RecordReader created to handle the type of database you are reading from, for instance MySQLDBRecordReader for a MySQL database.
For more info, check out the source.

It appears @Engineiro explained it well using the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So each map task's record reader has the meta information to construct the query that fetches its subset of data from the table. Each record reader executes a pagination type of SQL similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task and returns the rows between 6 and 13.
To generalize to any type of input format: the number of record readers is equal to the number of map tasks.
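As a rough, hypothetical sketch (these are not the real Hadoop or Oracle reader classes, only an illustration of the idea), each reader can wrap the user query with the ROWNUM bounds taken from its split:
// Hypothetical sketch of how a per-split reader derives its bounded query.
public class BoundedQuerySketch {

    static String boundedQuery(String userQuery, long start, long length) {
        long end = start + length;
        return "SELECT * FROM (SELECT a.*, ROWNUM dbif_rno FROM ( "
                + userQuery + " ) a WHERE rownum <= " + end
                + " ) WHERE dbif_rno >= " + start;
    }

    public static void main(String[] args) {
        // Second map task: start = 6, chunk length = 7, as in the SQL above.
        System.out.println(boundedQuery("select * from emp", 6, 7));
    }
}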

This post covers everything you are asking about: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/

Related

In Apache Flink, how to join certain columns from 2 dynamic tables (Kafka topic-based) using TableAPI

Here are the specs of my system:
2 Kafka topics:
Kafka Topic A: Contains {"key": string, "some_data1": [...], "timestamp": int}
Kafka Topic B: Contains {"key": string, "some_data2": [...], "timestamp": int}
Let's say both topics receive 5 messages per second. (Let's ignore delays for this example)
I want to add the some_data2 values from B into A for a certain duration using Hop windowing (let's say 2 seconds hop and length for the sake of this example).
I tried the following:
1 - Create a SQL VIEW
First I created a view that joins both topics like this:
CREATE TEMPORARY VIEW IF NOT EXISTS my_joined_view (
    key,
    some_data1,
    some_data2,
    timestamp
) AS SELECT
    A.key,
    A.timestamp,
    A.some_data1,
    B.some_data2
FROM
    A
    LEFT JOIN B ON A.key = B.key
2 - Continuous Query
Then I do my windowing on the joined view like this:
SELECT
    key,
    my_udf(key, some_data1, some_data2, timestamp),
    timestamp,
    HOP_START(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND)
FROM
    my_joined_view
GROUP BY
    HOP(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND), key
3 - My Expectations
My expectation was that the my_udf accumulator would receive 10 entries for some_data1 and some_data2, like:
class MyUDF(AggregateFunction):
    def accumulate(self, accumulator, *args):
        key = args[0]
        some_data1 = args[1]
        some_data2 = args[2]
        timestamp = args[3]
        assert len(some_data1) == 10
        assert len(some_data2) == 10
But in most cases I receive duplicates of each entry. It looks like the join mechanism is creating one row for each combination of values in the columns.
I am using a vectorized UDF, so I am dealing with pandas.Series as the argument types in the accumulator.
There is clearly something wrong with the way I am joining my tables. I don't understand why I receive duplicate entries in my UDF.
Thanks,

TDengine insertion using the taos_stmt APIs

After creating the super table and the tables, call taos_load_table_info to load the table information. Then initialize the statement by calling taos_stmt_init, and use taos_stmt_set_tbname to set the table name.
Create the TAOS_BIND object with the following attributes:
buffer_type = TSDB_DATA_TYPE_NCHAR
buffer_length = sizeof(str)
buffer = &str
length = sizeof(str)
Then call taos_stmt_bind_param and taos_stmt_add_batch, and finally execute with taos_stmt_execute.
The problem is that the insertion fails: when I check in the shell and use select * to look for the data, it only shows an empty column.
I strongly recommend you first try to insert a simple nchar value to check whether the problem lies with the taos_stmt APIs. If that insertion succeeds, then check whether the inserted nchar string really has the same length as the str variable. buffer_length can be greater than or equal to length, but if the actual size of your nchar data is less than the length value set in TAOS_BIND, TDengine will still parse the binding value together with the extra empty bytes and the insert will fail.

Split a query with too many arguments in SQLAlchemy

In my program, when I have not run a database update for a long time and then try to update my data,
my SQLAlchemy script generates a PostgreSQL upsert query with >4000 parameter sets, each with >8 items.
When the query is executed with databases.Database.execute(query) I end up with this error:
asyncpg.exceptions._base.InterfaceError: the number of query arguments cannot exceed 32767
My idea is to automatically split the query based on the number of arguments as threshold and execute it in two parts and merge the results.
Do you have an idea how to resolve that problem?
I ended up writing a check for the number of query arguments: I take the number of keys of the first dictionary in my list of argument dictionaries, since every list item has the same number of keys (= arguments per row):
args_per_row = len(args_dict_list[0])
PSQL_QUERY_ALLOWED_MAX_ARGS = 32767
allowed_args_per_query = int(math.floor(PSQL_QUERY_ALLOWED_MAX_ARGS/args_per_row))
Then I divided args_dict_list into parts of at most allowed_args_per_query items each:
query_args_sets = [
    args_dict_list[x:x + allowed_args_per_query] for x in range(
        0,
        len(args_dict_list),
        allowed_args_per_query
    )
]
and finally looped over the query_args_sets and generated and executed a separate query for each set:
for arg_set in query_args_sets:
    query = query_builder.build_upsert(values=arg_set)
    await database.execute(query)

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps when an event happened) that I have
e.g.
BodyData{}=
Column 1 2 3
'10:15:15.332' 'BASE05' ...
...
'10:17:33:230' 'BASE05' ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array of all the data between the two string values, plus or minus a small threshold of say .100 ms?
Can I also add another condition that the string values in column 2 must match a given value, otherwise ignore the row? For example, only return the timestamps between A and B if the second column is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can just use logical indexing: this is where you put a condition (i.e. that the second column is 'BASE02') inside the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo'; ...
            '10:15:16.332', 'BASE02', 'bar'; ...
            '10:15:17.332', 'BASE05', 'foo'; ...
            '10:15:18.332', 'BASE02', 'foo'; ...
            '10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.

matlab database toolbox fetch max row limit

I am executing the database toolbox fetch function as simply
curs = exec(conn,sqlQuery);
fetched = fetch(curs);
and therefore I should be getting the default max row limit of infinity; however, the function only returns 90,000 rows (there should be more than 180,000).
Does anyone know why the fetch function would be truncating my results?
Thanks.
You should try limiting the batch size:
bsize = 80000;                              % batch size, where bsize < the number of rows
setdbprefs('FetchInBatches','yes')          % fetch the result in smaller batches
setdbprefs('FetchBatchSize',num2str(bsize)) % set the size of each batch
curs = exec(conn,sqlQuery);
fetched = fetch(curs);
If it still doesn't work, it is possible that a single MATLAB variable cannot hold this amount of data, so you can replace the last line with:
vsize = 85000; %variable size
fetched = fetch(curs,vsize);
