SQLite: horizontally combining large SQL tables based on a common entry

I have multiple tables I need to join horizontally based on a common entry. For the first set of tables, I need to join tables that look like this:
Table 1 (dimensions = 1,093,367 x 18):
ROW #      ID   TEMPERATURE   DESCR   ...   NUMB
1          32   23            Y       ...   23
2          47   54            N       ...   24
...        ...  ...           ...     ...   ...
1,093,367  78   12            Y       ...   45
Table 2 (dimensions = 1,093,367 x 648):
ROW #      ID   COLOR 1   COLOR 2   ...   COLOR 648
1          32   RED       BLUE      ...   GREEN
2          47   BLUE      PURPLE    ...   RED
...        ...  ...       ...       ...   ...
1,093,367  78   YELLOW    RED       ...   BLUE
And I need [Table 1 | Table 2]:
ROW #      ID   TEMPERATURE   DESCR   ...   NUMB   COLOR 1   COLOR 2   ...   COLOR 648
1          32   23            Y       ...   23     RED       BLUE      ...   GREEN
2          47   54            N       ...   24     BLUE      PURPLE    ...   RED
...        ...  ...           ...     ...   ...    ...       ...       ...   ...
1,093,367  78   12            Y       ...   45     YELLOW    RED       ...   BLUE
Is this possible to do in SQLite? I have only found solutions in which I would have to type out all 648 column names for Table 2. Is that the only way to do this in SQLite?

You don't have to write any column name in the SELECT statement if you do a join with the USING clause instead of the ON clause:
SELECT *
FROM Table1 INNER JOIN Table2
USING(id);
This will return all the columns of the two tables (first Table1's columns, then Table2's columns), but the columns used in the USING clause (in this case the column id, on which the join is based) will be returned only once.
You can find more about the USING clause in Determination of input data (FROM clause processing).
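As an illustration, here is a minimal runnable sketch using Python's built-in sqlite3 module; the table layouts are cut down to a few made-up columns, but the USING(id) join works the same with hundreds of them:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Table1 (id INTEGER, temperature INTEGER, descr TEXT);
    CREATE TABLE Table2 (id INTEGER, color1 TEXT, color2 TEXT);
    INSERT INTO Table1 VALUES (32, 23, 'Y'), (47, 54, 'N');
    INSERT INTO Table2 VALUES (32, 'RED', 'BLUE'), (47, 'BLUE', 'PURPLE');
""")

# USING(id) matches rows on id and returns that column only once,
# so no column list has to be typed out.
cur = con.execute("SELECT * FROM Table1 INNER JOIN Table2 USING (id)")
print([d[0] for d in cur.description])
# ['id', 'temperature', 'descr', 'color1', 'color2']
for row in cur:
    print(row)
To persist the combined result rather than just read it, the same SELECT can be wrapped in a CREATE TABLE ... AS SELECT statement.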

Related

Reshape a wide dataframe into a single column

I have a very long data frame like:
1 2 3 ... 109
1 2 3 ... 109
1 2 3 ... 109
. . . ... ...
109 109 109... 109
and I would like to have everything in one column as described below:
1
1
1
2
2
2
3
3
3
...
109
I tried converting the data frame into a numpy array and using:
arr_y = np.arange(11881).reshape(11881,1)
but it didn't work: it gave me a single vertical column (which is OK), but containing 0, 1, 2, 3, ... up to 11880 (109*109 = 11881 values) instead of my data.
Any idea how to do that?
With a pandas DataFrame, do the following:
import pandas as pd

df = pd.DataFrame({'Col1': [1,1,1], 'Col2': [2,2,2], 'Col3': [3,3,3], 'Col109': [109,109,109]})
# ravel('F') flattens the values in column-major (Fortran) order,
# keeping each column's entries together
df = pd.DataFrame(df.values.ravel('F'))
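As a quick sanity check of the column-major order (a sketch, trimmed to three of the answer's columns):
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1], 'Col2': [2, 2, 2], 'Col3': [3, 3, 3]})

# ravel('F') walks the underlying array column by column (Fortran order),
# so each column's values stay together in the output.
print(df.values.ravel('F'))    # [1 1 1 2 2 2 3 3 3]

# An equivalent pandas-native spelling: unstack() also emits column by column.
print(df.unstack().tolist())   # [1, 1, 1, 2, 2, 2, 3, 3, 3]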

SQLite optimization for MAX query on a non-leftmost column of an index

I noticed that the online page about SQLite's query optimizer guarantees that queries of the form SELECT MAX(colA) FROM table can be optimized if there is an index whose leftmost column is colA.
However, I'm less clear about what happens when an index is used to narrow the table based on an equality in WHERE clause, such that the next column in the index is the one that I'm taking a MAX on. Based on the structure of the index, the maximum value should be quickly accessible as the last row in the subset of the index satisfying the WHERE clause. For example, given an index on colA and colB, it should be possible to find SELECT MAX(colB) FROM SillyTable WHERE colA = 1 without scanning all 6 rows associated with colA = 1:
Index of SillyTable on (colA, colB):
colA  colB  rowid
1     1     4
1     2     5
1     4     2
1     5     8
1     6     3    # This is the one
2     1     1
2     5     6
2     8     7
Does SQLite actually optimize a query like this, or will it scan all the rows that satisfy the WHERE clause? If it does a scan, how can I change the query to make it run faster?
My specific use case is similar to the SillyTable example. I created the following table:
CREATE TABLE Product(
    ProductTypeID INTEGER NOT NULL,
    ProductID INTEGER NOT NULL,
    PRIMARY KEY(ProductTypeID, ProductID),
    FOREIGN KEY(ProductTypeID)
        REFERENCES ProductType(ProductTypeID)
);
ProductTypeID is not particularly selective for the table; I might have many rows with the same ProductTypeID but different ProductID. EXPLAIN QUERY PLAN tells me that my query uses an index automatically built for the composite primary key, but that is true whether it scans or binary-searches the subset of rows found with the index:
EXPLAIN QUERY PLAN SELECT MAX(ProductID) FROM Product
WHERE ProductTypeID = ?;
=>
SEARCH TABLE Product USING COVERING INDEX sqlite_autoindex_Product_1(ProductTypeID=?)
This is shown in the EXPLAIN output:
sqlite> EXPLAIN SELECT MAX(ProductID) FROM Product WHERE ProductTypeID = ?;
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 17 0 00 Start at 17
1 Null 0 1 2 00 r[1..2]=NULL
2 OpenRead 1 3 0 k(2,,) 02 root=3 iDb=0; sqlite_autoindex_Product_1
3 Variable 1 3 0 00 r[3]=parameter(1,)
4 IsNull 3 13 0 00 if r[3]==NULL goto 13
5 Affinity 3 1 0 D 00 affinity(r[3])
6 SeekLE 1 13 3 1 00 key=r[3]
7 IdxLT 1 13 3 1 00 key=r[3]
8 Column 1 1 4 00 r[4]=Product.ProductID
9 CollSeq 0 0 0 (BINARY) 00
10 AggStep0 0 4 1 max(1) 01 accum=r[1] step(r[4])
11 Goto 0 13 0 00 max() by index
12 Prev 1 7 0 00
13 AggFinal 1 1 0 max(1) 00 accum=r[1] N=1
14 Copy 1 5 0 00 r[5]=r[1]
15 ResultRow 5 1 0 00 output=r[5]
16 Halt 0 0 0 00
17 Transaction 0 0 1 0 01 usesStmtJournal=0
18 Goto 0 1 0 00
To make the code generator simpler, SQLite always creates a loop for the aggregation (addresses 6 to 12). However, for max(), this loop aborts after the first successful step (the Goto at address 11, commented "max() by index").
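If you want to reproduce that check yourself, here is a small sketch using Python's sqlite3 module (the foreign key is omitted for brevity, and the exact plan text varies between SQLite versions):
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Product(
        ProductTypeID INTEGER NOT NULL,
        ProductID INTEGER NOT NULL,
        PRIMARY KEY(ProductTypeID, ProductID)
    );
    INSERT INTO Product VALUES (1, 4), (1, 2), (1, 8), (2, 1), (2, 6);
""")

# The composite primary key provides an index on (ProductTypeID, ProductID),
# so MAX(ProductID) for a fixed ProductTypeID is a single seek, not a scan.
for row in con.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT MAX(ProductID) FROM Product WHERE ProductTypeID = ?", (1,)):
    print(row)

print(con.execute(
    "SELECT MAX(ProductID) FROM Product WHERE ProductTypeID = ?",
    (1,)).fetchone())  # (8,)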

Split Pandas Dataframe into separate pieces based on column values

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np
columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)
df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52
df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3
df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
1. Within a given s_id, produce separate dataframes for each c_id value, e.g. for s_id = 144 there will be 3 dataframes, while for s_id = 105 there will be 2.
2. Inner join the separate dataframes produced in step 1 on the elements column (c_col1). This is a little difficult to understand, so here is the dataframe I would like to get from this step:
index s_id c_id c_col1
0 144 1 O
1 144 2 O
2 144 3 O
3 105 1 O
4 105 2 F
5 52 1 N
6 52 2 N
7 52 3 N
As you can see, what I am looking for in step 2 is the following: within each s_id, I am looking for those c_col1 values that occur for all the c_id values, e.g. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2, 3. I have pointed to these entries with "<--" in the raw data. So I would like the dataframe to show O three times in the c_col1 column, with corresponding c_id entries of 1, 2, 3.
Condition:
the number of unique c_ids is not known ahead of time, i.e. for one particular s_id I do not know whether there will be c_ids 1, 2 and 3 or just 1 and 2. This means that if 1, 2 and 3 occur there will be two inner joins, while if only 1 and 2 occur there will be only one inner join.
How can this be done with Pandas?
Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id values, the inner keys are the c_id values, and the inner values are the data. You can do that with a fairly long but straightforward dict comprehension:
DF_dict = {s_id:
           {c_id: df[(df.s_id == s_id) & (df.c_id == c_id)]
            for c_id in df[df.s_id == s_id]['c_id'].unique()}
           for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. You want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id you might be better off exploring groupby options. Perhaps someone understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question...
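(As an aside, and assuming the df built in the question, the same nested split can be obtained from groupby directly — a sketch of what those groupby options look like:)
# Each (s_id, c_id) pair maps to its slice of the frame.
groups = dict(tuple(df.groupby(['s_id', 'c_id'])))
print(groups[(52, 2)])  # same rows as DF_dict[52][2] above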
##################EDIT
It seems to me that you should just go straight to problem 2, if problem 1 is simply a step you believe to be necessary to get to a problem 2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform the data according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear under every value of c_id.
You might write a function like this:
def c_id_overlap(df):
    common_vals = []  # container for values of c_col1 that are in every c_id subgroup
    c_ids = df.c_id.unique()  # get unique values of c_id
    c_col1_values = set(df.c_col1)  # get a set of c_col1 values
    # create a nested list of values; each inner list contains the c_col1 values for one c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    # iterate through the c_col1 values and see if they are in every nested list
    for val in c_col1_values:
        if all([True if val in elem else False for elem in nested_c_col_vals]):
            common_vals.append(val)
    # return a slice of the dataframe that only contains values of c_col1
    # that appear for every c_id
    return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index = False).apply(c_id_overlap)
which gives me the following output:
s_id c_id c_col1
0 31 52 1 N
34 52 2 N
40 52 3 N
1 16 105 1 O
17 105 1 F
18 105 1 Zn
20 105 2 O
21 105 2 F
23 105 2 Zn
2 3 144 1 O
6 144 2 O
11 144 3 O
Which seems to be what you are looking for.
###########EDIT: Additional Explanation:
So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.
Consider the group where s_id == 105. The first line of the function creates an empty list, common_vals, which will contain those elements that appear in every subgroup of the data (i.e. under each of the values of c_id).
The second line gets the unique values of 'c_id', in this case [1, 2] and stores them in an array called c_ids
The third line creates a set of the values of c_col1 which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure, nested_c_col_vals, where every inner list holds the unique c_col1 values associated with one of the elements in the c_ids array. In this case it looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
Now each of the elements in c_col1_values is iterated over, and for each of them the program determines whether that element appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is True, or rather whether it is truthy (non-zero). So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there; if it is, yield True, if it is not, False. So in this case this resolves to:
all([True, False])
which is of course False. Now, when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete the values inside common_vals are:
['O', 'F', 'Zn']
The return statement simply slices the data chunk according to whether the values of c_col1 are in the common_vals list, as per above.
This is then repeated for each of the remaining groups and the data are glued back together.
Hope this helps
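For completeness, the same rule can also be written without an explicit loop, using grouped nunique counts. This is a hedged alternative to the answer's function, not a replacement for it, and assumes a pandas version where groupby(...).transform('nunique') is available:
import pandas as pd

# Toy frame in the question's shape (values invented for illustration).
df = pd.DataFrame({
    's_id':   [144, 144, 144, 144, 105, 105],
    'c_id':   [1,   1,   2,   2,   1,   2],
    'c_col1': ['H', 'O', 'O', 'S', 'O', 'F'],
})

# How many distinct c_ids exist per s_id, and under how many distinct
# c_ids each c_col1 value appears; keep rows where the two counts match.
n_ids = df.groupby('s_id')['c_id'].transform('nunique')
n_hits = df.groupby(['s_id', 'c_col1'])['c_id'].transform('nunique')
print(df[n_hits == n_ids])  # only the 'O' rows of s_id 144 survive here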

Display result of query without primary key

I want to join two tables (each table has 200 columns), the purpose being to end up with a table with 400 columns. But how do I get the result without the primary key?
id a1 a2 a3 ... a200
-----------------------
1 23 4 5 7
2 24 6 8 17
3 13 14 52 73
...
id b1 b2 b3 ... b200
-----------------------
1 53 14 15 87
2 64 16 18 87
3 73 74 12 83
...
So the result I want is like:
a1 a2 a3 ... a200 b1 b2 b3 .... b200
--------------------------------------
23 4 5 7 53 14 15 87
24 6 8 17 64 16 18 87
13 14 52 73 73 74 12 83
...
I have this:
SELECT * FROM a AS T1 JOIN b AS T2 ON T1.id = T2.id;
There is no way to say SELECT * EXCEPT some_col, sorry. However, it is quite easy to generate the list by dragging the "Columns" node for each table from Object Explorer onto the query window and then simply removing the PK columns from the list. Click on the Columns node for a view or table, then drag it onto the query window. Voila!
You will have to specify each individual column in the SELECT statement:
SELECT a1, a2, a3, ..., a200, b1, ..., b200
FROM a AS T1
JOIN b AS T2 ON T1.id = T2.id
Clearly this is overly cumbersome.
I would take a look at why you have so many columns and whether your data is properly normalised. Alternatively, could you simply select only the columns you actually need closer to the UI (if there is one)?
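If typing 400 names is the real obstacle, the column list can also be generated programmatically. Here is a sketch for SQLite using PRAGMA table_info from Python (the answer above describes the SQL Server/SSMS route; on SQL Server you would read INFORMATION_SCHEMA.COLUMNS instead — the file name mydb.sqlite is hypothetical):
import sqlite3

con = sqlite3.connect("mydb.sqlite")  # hypothetical database file

def columns_except(table, excluded=('id',)):
    """Return the table's column names minus the excluded ones."""
    rows = con.execute("PRAGMA table_info(%s)" % table).fetchall()
    return [r[1] for r in rows if r[1] not in excluded]  # r[1] is the column name

cols = (["T1.%s" % c for c in columns_except("a")] +
        ["T2.%s" % c for c in columns_except("b")])
sql = "SELECT %s FROM a AS T1 JOIN b AS T2 ON T1.id = T2.id" % ", ".join(cols)
print(sql)  # paste or execute the generated statement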

Transact-SQL: sum each row and insert into another table

For a table on MS SQL 2000 containing the following columns and numbers:
S_id  J_id  Se_id  B_id  Status  Count  Multiply
63    1000  16     12    1       10     2
64    1001  12     16    1       9      3
65    1002  17     12    1       10     2
66    1003  16     12    1       6      3
67    1004  12     16    1       10     2
I want to generate a classic ASP script which will do the following for each row where Status = 1:
multiply column Count by column Multiply to get an answer, then total these answers for each se_id, like:
se_id  total
12     47
16     38
17     20
and display it on screen like:
Rank  se_id  total
1     12     47
2     16     38
3     17     20
Condition:
if there are multiple equal total values, give the lower-numbered se_id priority in the ranking and give the next higher-numbered se_id the next rank number.
Any sample code in classic ASP, or advice on how to get this accomplished, is welcome.
-- 'score' is the source table.
if (EXISTS (select * from INFORMATION_SCHEMA.TABLES where TABLE_NAME = 'result_table'))
begin
    drop table result_table;
end

select
    rank = IDENTITY(INT, 1, 1),
    se_id,
    sum(multiply * [count]) as total  -- [count] is bracketed because COUNT is a reserved word
into result_table
from score
where status = 1
group by se_id
order by total desc, se_id;
[Edit] Changed the query in response to the first comment.
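As a sanity check of the expected numbers, here is the same aggregation and tie-break ranking done in plain Python over the sample rows (a verification sketch only, not the classic ASP/T-SQL solution):
# (se_id, status, count, multiply) taken from the sample rows above
rows = [(16, 1, 10, 2), (12, 1, 9, 3), (17, 1, 10, 2),
        (16, 1, 6, 3), (12, 1, 10, 2)]

totals = {}
for se_id, status, count, multiply in rows:
    if status == 1:
        totals[se_id] = totals.get(se_id, 0) + count * multiply

# Sort by total descending, then by se_id ascending to break ties.
for rank, (se_id, total) in enumerate(
        sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])), start=1):
    print(rank, se_id, total)  # 1 12 47 / 2 16 38 / 3 17 20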
