Split a query with too many arguments in SQLAlchemy - database

In my program, when I have not run a database update for a long time and then try to update my data, my SQLAlchemy script generates a PostgreSQL upsert query with more than 4000 parameter sets, each containing more than 8 items.
When the query is executed with databases.Database.execute(query) I end up with this error:
asyncpg.exceptions._base.InterfaceError: the number of query arguments cannot exceed 32767
My idea is to automatically split the query into two parts, using the number of arguments as the threshold, execute both parts and merge the results.
Do you have an idea how to resolve this problem?

I ended up writing a check for the number of query arguments. Since every dictionary in my list of argument dictionaries has the same keys (one per argument), the length of the first dictionary gives the number of arguments per row:
import math

# every dict in the list has the same keys, so the first one gives the arguments per row
args_per_row = len(args_dict_list[0])
PSQL_QUERY_ALLOWED_MAX_ARGS = 32767
allowed_args_per_query = int(math.floor(PSQL_QUERY_ALLOWED_MAX_ARGS / args_per_row))
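As a quick sanity check of the arithmetic (the 8 arguments per row used here are an assumption taken from the rough figures in the question, not a measured value):
# hypothetical figures: roughly 8 bind parameters per row
args_per_row = 8
allowed_args_per_query = 32767 // args_per_row  # 4095 rows per statement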
Then I divided args_dict_list into chunks of at most allowed_args_per_query items each:
query_args_sets = [
    args_dict_list[x:x + allowed_args_per_query]
    for x in range(0, len(args_dict_list), allowed_args_per_query)
]
and finally looped over query_args_sets, generating and executing a separate query for each set:
for arg_set in query_args_sets:
    query = query_builder.build_upsert(values=arg_set)
    await database.execute(query)
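Putting the pieces together, a minimal self-contained sketch of the whole chunking step could look like the function below. database (a databases.Database) and query_builder.build_upsert are the objects from the snippets above and are assumed to exist in your code base; this is a sketch, not a drop-in implementation.
async def chunked_upsert(database, query_builder, args_dict_list):
    PSQL_QUERY_ALLOWED_MAX_ARGS = 32767
    # every row dict contributes the same number of bind parameters
    args_per_row = len(args_dict_list[0])
    allowed_args_per_query = PSQL_QUERY_ALLOWED_MAX_ARGS // args_per_row
    # slice the rows into chunks that stay below the asyncpg argument limit
    query_args_sets = [
        args_dict_list[x:x + allowed_args_per_query]
        for x in range(0, len(args_dict_list), allowed_args_per_query)
    ]
    # build and execute one upsert per chunk
    for arg_set in query_args_sets:
        query = query_builder.build_upsert(values=arg_set)
        await database.execute(query)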

Related

In Apache Flink, how to join certain columns from 2 dynamic tables (Kafka topic-based) using TableAPI

Here are the specs of my system:
2 Kafka topics:
Kafka Topic A: Contains {"key": string, "some_data1": [...], "timestamp": int}
Kafka Topic B: Contains {"key": string, "some_data2": [...], "timestamp": int}
Let's say both topics receive 5 messages per second. (Let's ignore delays for this example)
I want to add the some_data2 values from B into A for a certain duration using hop windowing (let's say a 2-second hop and window size for the sake of this example).
I tried the following:
1 - Create a SQL VIEW
First I created a view that joins both topics like this:
CREATE TEMPORARY VIEW IF NOT EXISTS my_joined_view (
    key,
    some_data1,
    some_data2,
    timestamp
) AS SELECT
    A.key,
    A.some_data1,
    B.some_data2,
    A.timestamp
FROM
    A
LEFT JOIN B ON A.key = B.key
2 - Continuous Query
Then I run my windowed query on the joined view like this:
SELECT
    key,
    my_udf(key, some_data1, some_data2, timestamp),
    timestamp,
    HOP_START(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND)
FROM
    my_joined_view
GROUP BY
    HOP(timestamp, INTERVAL '1' SECOND, INTERVAL '5' SECOND), key
3 - My Expectations
My expectation was that the my_udf accumulator would receive 10 entries for some_data1 and some_data2, like:
class MyUDF(AggregateFunction):
    def accumulate(self, accumulator, *args):
        key = args[0]
        some_data1 = args[1]
        some_data2 = args[2]
        timestamp = args[3]
        assert len(some_data1) == 10
        assert len(some_data2) == 10
But in most cases I receive duplicates of each entry. It looks like the join mechanism is creating one row for each combination of values in the columns.
I am using a vectorized UDF, so the arguments in my accumulator are pandas.Series objects.
There is clearly something wrong with the way I am joining my tables. I don't understand why I receive duplicate entries in my UDF.
Thanks,
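For reference, one common reason an unbounded LEFT JOIN fans out into one row per combination is that it never restricts how far apart in time the matching rows may be. A hedged sketch of bounding it with an interval join follows; this is not the original code, the table environment setup is minimal, A and B are assumed to be registered elsewhere with event-time attributes on timestamp, and the 2-second bound simply mirrors the example numbers above.
from pyflink.table import EnvironmentSettings, TableEnvironment

# minimal streaming table environment; the Kafka-backed tables A and B are
# assumed to be registered elsewhere with watermarked `timestamp` columns
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# `key` and `timestamp` are reserved words in Flink SQL, hence the backticks
t_env.execute_sql("""
    CREATE TEMPORARY VIEW my_joined_view AS
    SELECT
        A.`key`,
        A.`timestamp`,
        A.some_data1,
        B.some_data2
    FROM A
    LEFT JOIN B
        ON A.`key` = B.`key`
        AND B.`timestamp` BETWEEN A.`timestamp` - INTERVAL '2' SECOND
                              AND A.`timestamp` + INTERVAL '2' SECOND
""")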

order results by score of two combined sets of data (Solr)

I have docs with the following structure:
{
    id: 1
    type: 1
    prop1: "any value"
    prop2: "any value"
    ...
    ...
}
type can be 1 or 2
Now I would like to create a query which returns all documents of type 1 and a limited number (LIMIT = 100) of documents of type 2, filtering on props and ordering by score.
My attempt so far is as follows, but it isn't correct; in particular, the sorting by score is wrong.
I combine two queries:
prepare a first query to use in the main query: type:2 AND commonfilters, size=LIMIT, sort by score, id -> returns a list of ids
main query: (type:1 AND commonfilters) OR (id:[ids from first query]), sort by score, id
The order isn't correct (sort by score), because each of the two independent sets was sorted on its own rather than over all ids combined.
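For reference, the two-step attempt described above might look roughly like this with pysolr; the client library, the core URL, and the commonfilters string are placeholders for illustration, not part of the original question.
import pysolr

# hypothetical core URL and filter string
solr = pysolr.Solr("http://localhost:8983/solr/mycore")
common_filters = "prop1:foo AND prop2:bar"
LIMIT = 100

# first query: limited type:2 results, keeping only the ids
type2_hits = solr.search(
    f"type:2 AND ({common_filters})",
    fl="id",
    sort="score desc, id asc",
    rows=LIMIT,
)
type2_ids = " ".join(str(doc["id"]) for doc in type2_hits)

# main query: all type:1 results plus the selected type:2 ids
results = solr.search(
    f"(type:1 AND ({common_filters})) OR id:({type2_ids})",
    sort="score desc, id asc",
)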
What I need is something like the following SQL Query:
select * from data where commonfilters order by score, id MINUS (select * from data where rowcount > LIMIT)
Does anyone know how to achieve correct ordering for this case?

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500-odd rows of skeletal tracking data.
I need to extract all rows between two string values (these are timestamps of when an event happened) that I have, e.g.
BodyData{} =
Column:  1               2         3
         '10:15:15.332'  'BASE05'  ...
         ...
         '10:17:33.230'  'BASE05'  ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array of all the data between the two string values, plus or minus a small threshold of, say, 100 ms?
Also, can I add another condition saying that all string values in column 2 must be the same, and otherwise ignore the row? For example, only return the timestamps between A and B if column 2 is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In MATLAB this can be done quite painlessly with datenum.
For the second part you can just use logical indexing: this is where you put a condition (i.e. that the second column is 'BASE02') within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo'; ...
            '10:15:16.332', 'BASE02', 'bar'; ...
            '10:15:17.332', 'BASE05', 'foo'; ...
            '10:15:18.332', 'BASE02', 'foo'; ...
            '10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: the datenum manual page and the MATLAB help page on logical indexing.

Searching chars in database records?

I have a variable:
Dim s As String = "1,5,20,22,28,31,40"
And I have these records in the fieldnum column of my database (MS SQL):
3,5,19,20,26,28,33,40
1,2,5,18,20,22,31,38
1,7,12,22,23,24,31
Now, how do I find (SELECT ...?) the records where at least 4 numbers in fieldnum are equal to numbers in the s variable?
In this example, only these two rows should be selected:
3,5,19,20,26,28,33,40 (equal numbers: 5,20,28,40 = 4 nums)
1,2,5,18,20,22,31,38 (equal numbers: 1,5,20,22,31 = 5 nums)
It is not going to recognize them as separate numbers. You could try using a LIKE search in your SQL statement, though.
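To make the "at least 4 equal numbers" requirement concrete, here is a small client-side sketch in Python of the set-intersection check the question describes (the values are hard-coded from the question; how to express this in SQL itself is a separate matter):
s = "1,5,20,22,28,31,40"
rows = [
    "3,5,19,20,26,28,33,40",
    "1,2,5,18,20,22,31,38",
    "1,7,12,22,23,24,31",
]

wanted = set(s.split(","))
for fieldnum in rows:
    common = wanted & set(fieldnum.split(","))
    if len(common) >= 4:
        # only the first two rows are printed, matching the expected result
        print(fieldnum, "->", sorted(common, key=int))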

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split a record reader is created which starts reading records from the input split.
If there are n records in an input split, the map method of the mapper is called n times, each time reading a key-value pair using the record reader.
Now coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfigure and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query which generates 70 records in the output.
I would like to know:
How are the InputSplits getting created in the above case (database) ?
What does the input split creation depend on, the number of records which my sql query generates or the total number of records in the table (database) ?
How many DBRecordReaders are getting created in the above case (database) ?
How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
    DBInputSplit split;
    if ((i + 1) == chunks)
        split = new DBInputSplit(i * chunkSize, count);
    else
        split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
    splits.add(split);
}
That is the how; to understand what it depends on, let's take a look at how chunkSize is computed:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize takes count (the result of SELECT COUNT(*) FROM tableName) and divides it by chunks (the value of mapred.map.tasks, or 1 if it is not defined in the configuration).
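As a purely illustrative worked example (the numbers are assumptions: a count of 100 rows, as in the question's table, and mapred.map.tasks set to 4), the same arithmetic in a few lines of Python:
# hypothetical numbers: 100 rows from the count query, 4 map tasks
count = 100
chunks = 4
chunk_size = count // chunks           # 25

splits = []
for i in range(chunks):
    start = i * chunk_size
    # the last chunk is stretched to cover any remainder, as in the Java loop above
    end = count if i + 1 == chunks else start + chunk_size
    splits.append((start, end))

print(splits)                          # [(0, 25), (25, 50), (50, 75), (75, 100)]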
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from, for instance a MySQLDBRecordReader for a MySQL database.
For more info check out the source
It appears @Engineiro explained it well by walking through the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task, in the case where the child JVM is not reused for further map tasks. In other words, in the case of DBInputFormat the number of input splits equals the value of mapred.map.tasks. So each map task's record reader has the meta information needed to construct the query that fetches its subset of data from the table. Each record reader executes a pagination-style SQL query similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The above SQL is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format, the number of record readers equals the number of map tasks.
This post talks about all that you want: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
