Two queries to the same table - peewee

I need to get 25 rows from a table.
First I need to get all pinned threads. Then I need to get (25 - pinned) regular threads.
Is it possible at all to get these rows using one query?
Now I use two separate queries like this:
p = Thread.select().where(Thread.pinned).limit(25)
t = Thread.select().where(Thread.pinned >> None).limit(25-len(p))

Untested, but you could do:
from peewee import case, SQL

# CASE WHEN (pinned IS NULL) THEN 0 ELSE 1 END
case_stmt = case(None, (
    (Thread.pinned >> None, 0),
), 1)

query = (Thread
         .select(Thread, case_stmt.alias('pinned_first'))
         .order_by(SQL('pinned_first').desc())  # 1 = pinned, so descending puts pinned threads first
         .limit(25))
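For reference, this should compile to SQL roughly like the following (a sketch only; it assumes the table is named thread and that unpinned rows store NULL in pinned, and the exact quoting depends on your database backend):
SELECT thread.*, CASE WHEN (thread.pinned IS NULL) THEN 0 ELSE 1 END AS pinned_first
FROM thread
ORDER BY pinned_first DESC
LIMIT 25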

Related

BigQuery - Random numbers repeating when generated inside arrays

I made a BigQuery query which involves generating an array of random numbers for each row. I use the random numbers to decide which elements to include from an array that exists in my source table.
I had a lot of trouble getting the arrays of random numbers to not repeat themselves over every single row. I found a workaround, but is this expected behavior? I'll post two "methods" (one with desired results, one with bad results) below. Note that both methods work fine if you don't use an array, but just generate a single random number.
Method 1 (BAD Results):
SELECT
(
SELECT
ARRAY(
SELECT AS STRUCT
RAND() AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
)
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Method 2 (GOOD results):
SELECT
(
SELECT
ARRAY(
SELECT AS STRUCT
RAND() AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
FROM (SELECT NULL FROM UNNEST([0]))
)
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Example Results - Method 1 (BAD):
Row 1
0.5431173080158003
0.5585452983410205
...
Row 2
0.5431173080158003
0.5585452983410205
...
Example Results - Method 2 (GOOD):
Row 1
0.49639706531271377
0.1604380522058521
...
Row 2
0.7971869432989377
0.9815667330115473
...
EDIT: Following Yun Zhang's theory about subqueries, see below for some similar alternative examples. Your solution was useful for the problem I posted, but note that there are still some cases I find baffling. Also, although I agree that you are probably right that subqueries are tied to the problem: shouldn't a subquery (especially one without a FROM clause) be less likely to have its results reused than selecting a "normal" value? People sometimes complain about the performance of subqueries precisely because they are supposedly recalculated once per row, even when the results may be the same.
Do you agree that this seems like it may be a bug?
The examples below show that the problem is not necessarily even creating an array of randoms -- even a sub-select that just happens to contain an unrelated array can cause problems with RAND(). The problem goes away by eliminating the sub-select, by selecting only the random value from the sub-select, or by including a value inside the array that varies by row. Weird!!!
BAD
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT 1) AS a)
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
FIX #1 - No subselect
SELECT
STRUCT(RAND() AS r, ARRAY(SELECT 1) AS a)
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
FIX #2 - Select only r
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT 1) AS a).r
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
Fix #3 - Array contains "u"
SELECT
(SELECT AS STRUCT RAND() AS r, ARRAY(SELECT u) AS a).r
FROM UNNEST(GENERATE_ARRAY(0, 5, 1)) AS u
I haven't figured out why the first query doesn't work, but here is a simpler version that works for your case:
SELECT (
SELECT array_agg(RAND()) AS random
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))
Update: I later realized that the problem is with ARRAY(subquery). As long as you can avoid using it for your case (as in my query above), you should be fine.
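If you want to keep the array-of-structs shape from the original query, the same workaround should extend to aggregating a STRUCT instead of a bare value (an untested sketch, following the theory above that it is ARRAY(subquery) that triggers the repetition):
SELECT (
SELECT ARRAY_AGG(STRUCT(RAND() AS random))
FROM UNNEST(GENERATE_ARRAY(0, 10, 1)) AS _time
) AS random_for_times
FROM UNNEST(GENERATE_ARRAY(0, 10, 1))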

How to drop a RANGE partition?

The function dropPartition requires parameters like this:
dropPartition(dbHandle, partitionPaths, [tableName])
How do I specify the parameter partitionPaths if I want to drop several RANGE partitions with one call of the function? This way, I can only drop one partition at a time:
db=database(dbPath, RANGE, 0 5 10 15 20)
t=table(rand(20, 100) as ID,rand(1.0, 100) as x)
pt = db.createPartitionedTable(t, `tb1, `ID)
pt.append!(t)
dropPartition(db,"0_5")
To drop multiple partitions in one call, put the partition paths in an array:
dropPartition(db,["/0_5", "/10_15"])

SQL Server aggregating data that may contain multiple copies

I am working on some software where I need to do large aggregation of data using SQL Server. The software is helping people play poker better. The query I am using at the moment looks like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
where (H IN (1164, 1165, 1166) ) AND
(V IN (1260, 1311))
Group by H;
This works fine and is the fastest way I have found to do what I am trying to achieve. The problem is that I need to enhance the functionality so that the aggregation can include multiple instances of V. For example, in the above query, instead of including the data for 1260 and 1311 just once, it may need to include 1260 twice and 1311 three times. But obviously just saying
V IN (1260, 1260, 1311, 1311, 1311)
won't work because each unique value is only counted once in an IN clause.
I have come up with a solution to this problem that works but seems rather clunky. I created another lookup table that takes the values between 0 and 1325 and assigns them to a field called V1; for each V1 there are 100 V2 values, e.g. for V1 = 1260 there is a range from 126000 through 126099. Then in the main query I join to this table and do the lookup like this:
Select H, sum(WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush) / Count(WinStraightFlush) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join [FlopLookup].[dbo].[VillainJoinTable] on [FlopLookup].[dbo].[VillainJoinTable].[V1] = [FlopLookup].[dbo].[_5c3c3d].[V]
where (H IN (1164, 1165, 1166) ) AND
(V2 IN (126000, 126001, 131100, 131101, 131102) )
Group by H;
So although it works, it is quite slow. It feels inefficient because it adds the data multiple times, when what would probably be more appropriate is a way of doing this with multiplication, i.e. instead of passing in 126000, 126001, 126002, 126003, 126004, 126005, 126006, 126007, I would pass in 1260 in the original query and then multiply it by 8. But I have not been able to work out a way to do this.
Any help would be appreciated. Thanks.
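(One untested idea for the multiplication approach, sketched below: join against a small derived table of villain weights and multiply inside the aggregation. The derived table w(V, Cnt) and the example counts 2 for 1260 and 3 for 1311 are made up for illustration; the join on w replaces the V IN (...) filter, and Sum(w.Cnt * ...) / Sum(w.Cnt) plays the role of the original Sum(...) / Count(WinStraightFlush).)
Select H, Sum(w.Cnt * (WinHighCard + ChopHighCard + WinPair + ChopPair + Win2Pair + Chop2Pair + Win3OfAKind + Chop3OfAKind + WinStraight + ChopStraight + WinFlush + ChopFlush + WinFullHouse + ChopFullHouse + WinQuads + ChopQuads + WinStraightFlush + ChopStraightFlush)) / Sum(w.Cnt) as ResultTotal
from [FlopLookup].[dbo].[_5c3c3d]
inner join [FlopLookup].[dbo].[Lookup5c3c3d] on [FlopLookup].[dbo].[Lookup5c3c3d].[id] = [FlopLookup].[dbo].[_5c3c3d].[id]
inner join (values (1260, 2), (1311, 3)) as w(V, Cnt) on w.V = [FlopLookup].[dbo].[_5c3c3d].[V]
where H IN (1164, 1165, 1166)
Group by H;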
EDIT - Added more information at the request of Livius in the comments
H stands for "Hero" and is in the table _5c3c3d as a smallint representing the two cards the player is holding (e.g. AcKd, Js4h etc.). V stands for "Villain" and is similar to Hero but represents the cards the opponent is holding similarly encoded. The encoding and decoding takes place in the code. These two fields form the clustered index for the _5c3c3d table. The remaining field in this table is Id which is another smallint which is used to join with the table Lookup5c3c3d which contains all the equity information for the hero's hand against the villain's hand for the flop 5c3c3d.
V2 is just a field in a table I created to try to resolve the problem described above: VillainJoinTable has V1 (which maps directly to V in _5c3c3d via a join) and V2, which can contain up to 100 numbers per V1 (e.g. when V1 is 1260 it can contain 126000, 126001 ... 126099). This allows me to build an "IN" clause that effectively looks up the equity information for the same V multiple times.
Here are some screenshots:
Structure of the three tables
Some data from _5c3c3d
Some data from Lookup5c3c3d
Some data from VillainJoinTable

MATLAB Extract all rows between two variables with a threshold

I have a cell array called BodyData in MATLAB that has around 139 columns and 3500 odd rows of skeletal tracking data.
I need to extract all rows between two timestamp strings that I have (these mark when an event happened),
e.g.
BodyData{} =
Column 1           Column 2    Column 3
'10:15:15.332'     'BASE05'    ...
...
'10:17:33.230'     'BASE05'    ...
The two timestamps should match a value in the array but might also be within a few ms of those in the array e.g.
TimeStamp1 = '10:15:15.560'
TimeStamp2 = '10:17:33.233'
I have several questions!
How can I return an array with all the data between the two timestamps, plus or minus a small threshold of, say, 100 ms?
Also, can I add another condition that all string values in column 2 must be the same, and otherwise ignore the row? For example, only return the rows between A and B whose column 2 is 'BASE02'.
Many thanks,
The best approach to the first part of your problem is probably to change from strings to numeric date values. In Matlab this can be done quite painlessly with datenum.
For the second part you can just use logical indexing... this is where you put a condition (i.e. that the second column is 'BASE02') within the indexing expression.
A self-contained example:
% some example data:
BodyData = {'10:15:15.332', 'BASE05', 'foo';...
'10:15:16.332', 'BASE02', 'bar';...
'10:15:17.332', 'BASE05', 'foo';...
'10:15:18.332', 'BASE02', 'foo';...
'10:15:19.332', 'BASE05', 'bar'};
% create column vector of numeric times, and define start/end times
% (datenum values are in days, so a +/-100 ms tolerance could be applied
% by widening startTime/endTime by 0.100/86400)
dateValues = datenum(BodyData(:, 1), 'HH:MM:SS.FFF');
startTime = datenum('10:15:16.100', 'HH:MM:SS.FFF');
endTime = datenum('10:15:18.500', 'HH:MM:SS.FFF');
% select data in range, and where second column is 'BASE02'
BodyData(dateValues > startTime & dateValues < endTime & strcmp(BodyData(:, 2), 'BASE02'), :)
Returns:
ans =
'10:15:16.332' 'BASE02' 'bar'
'10:15:18.332' 'BASE02' 'foo'
References: datenum manual page, matlab help page on logical indexing.

How many times is the DBRecordReader getting created?

Below is my understanding of how the Hadoop framework processes text files. Please correct me if I am going wrong somewhere.
Each mapper acts on an input split which contains some records.
For each input split, a record reader is created, which starts reading records from the split.
If there are n records in an input split, the map method in the mapper is called n times, each time with a key-value pair read by the record reader.
Now, coming to the database perspective:
I have a database on a single remote node. I want to fetch some data from a table in this database. I would configure the parameters using DBConfigure and specify the input table using DBInputFormat. Now say my table has 100 records in all, and I execute an SQL query that generates 70 records in the output.
I would like to know :
How are the InputSplits getting created in the above case (database)?
What does the input split creation depend on: the number of records which my SQL query generates, or the total number of records in the table (database)?
How many DBRecordReaders are getting created in the above case (database)?
How are the InputSplits getting created in the above case (database)?
// Split the rows into n-number of chunks and adjust the last chunk
// accordingly
for (int i = 0; i < chunks; i++) {
  DBInputSplit split;
  if ((i + 1) == chunks)
    split = new DBInputSplit(i * chunkSize, count);
  else
    split = new DBInputSplit(i * chunkSize, (i * chunkSize) + chunkSize);
  splits.add(split);
}
That is the how; to understand what it depends on, let's take a look at chunkSize:
statement = connection.createStatement();
results = statement.executeQuery(getCountQuery());
results.next();
long count = results.getLong(1);
int chunks = job.getConfiguration().getInt("mapred.map.tasks", 1);
long chunkSize = (count / chunks);
So chunkSize is derived from count = SELECT COUNT(*) FROM tableName, divided by chunks = mapred.map.tasks (or 1 if that is not defined in the configuration).
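For example, if the count query returns 70 rows and mapred.map.tasks is set to 5, then chunkSize = 70 / 5 = 14 and the loop above produces five splits covering (roughly) rows 0-13, 14-27, 28-41, 42-55 and 56-69, each of which gets its own DBRecordReader.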
Then finally, each input split will have a RecordReader created to handle the type of database you are reading from for instance: MySQLDBRecordReader for MySQL database.
For more info check out the source
It appears @Engineiro explained it well by referring to the actual Hadoop source. Just to answer: the number of DBRecordReaders is equal to the number of map tasks.
To explain further, the Hadoop map-side framework creates an instance of DBRecordReader for each map task (in the case where the child JVM is not reused for further map tasks). In other words, the number of input splits equals the value of mapred.map.tasks in the case of DBInputFormat. So each map task's RecordReader has the meta information needed to construct the query that fetches its subset of data from the table. Each RecordReader executes a pagination-style SQL query similar to the one below.
SELECT * FROM (SELECT a.*,ROWNUM dbif_rno FROM ( select * from emp ) a WHERE rownum <= 6 + 7 ) WHERE dbif_rno >= 6
The SQL above is for the second map task, returning the rows between 6 and 13.
To generalize for any type of input format: the number of RecordReaders equals the number of map tasks.
This post covers everything you are asking about: http://blog.cloudera.com/blog/2009/03/database-access-with-hadoop/
