I have a data set with 6 columns and 4.5 million rows. For every row, I want to compare the value in its first column with the last-column value of every other row, and append to that row each row whose last column matches its first column. The first and last columns are indexed, but neither contains integers.
I asked the same question on Stack Overflow and received a good answer based on NumPy arrays, but I am afraid it is too slow for a rather big data set.
Let's assume this is my data set (in the real data set, the first and last elements are not integers):
x = [['2', 'Jack', '8'],['1', 'Ali', '2'],['4' , 'sgee' , '1'],
['5' , 'gabe' , '2'],['100' , 'Jack' , '6'],
['7' , 'Ali' , '2'],['8' , 'nobody' , '20'],['9' , 'Al', '10']]
The result should look something like this:
[['2', 'Jack', '8', '1', 'Ali', '2', '5' , 'gabe' , '2','7' , 'Ali' , '2'],
['1', 'Ali', '2', '4' , 'sgee' , '1'],
['8' , 'nobody' , '20', '2', 'Jack', '8']]
I think I can use indexing to make the process faster, but my knowledge of databases is very limited. Does anybody have a solution (using indexes or any other tool)?
The NumPy solution is in my earlier question: How to compare two columns from the same data set?
Here is a link to a sample of the real data in SQLite: https://drive.google.com/open?id=11w-o4twH-hyRaX8KKvFLL6dQtkTKCJky
A potential SQL-based solution could go as follows (I'm using your big sample DB as a reference):
To make my proposed solution efficient I would do the following:
Create an index on the last column, plus a partial index on the first column that excludes rows where the first and last columns are the same. The partial index is optional: if you think it causes a problem, drop the matching WHERE condition from the later query and create a full index on col 0 instead. All three are included here for completeness.
CREATE INDEX [index_my_tab_A] ON [tab]([0]);
CREATE INDEX [index_my_tab_B] ON [tab]([5]);
CREATE INDEX [index_my_tab_AB] ON [tab]([0]) where [0] != [5];
ANALYZE;
Then I would take advantage of join behavior to generate the listing you need to produce the result you are after. By joining the table to itself you can get multiple return rows for each row considered.
SELECT * from tab t1
JOIN tab t2 on t2.[5] = t1.[0]
WHERE t1.[0] != t1.[5]
AND t2.[5] != 'N/A' -- Optional
ORDER by t1.[0];
Running that SQL against your big sample database (after the ANALYZE step had completed) took 0.2 seconds on my machine. It produced three matching rows, which I presume to be correct.
It may not be immediately obvious what the resulting table means so here is the result the above query gives when run against the small sample you gave in your original post. (that SQL was modified ever so slightly to deal with the reduced number of columns) … when run it produced the following result which is equivalent to your original desired result:
1 Ali 2 4 sgee 1
2 Jack 8 1 Ali 2
2 Jack 8 5 gabe 2
2 Jack 8 7 Ali 2
8 nobody 20 2 Jack 8
All you would have to do is run through this resulting list and combine the rows to produce the list you specified. The general idea here is to add the second trio of entries to the first trio of entries until the first trio of entries changes but only include the first trio of entries once.
So, starting with the first line, you would combine the Ali trio with the sgee trio, giving you ['1', 'Ali', '2', '4' , 'sgee' , '1'].
You would then combine the three Jack rows, giving ['2', 'Jack', '8', '1', 'Ali', '2', '5' , 'gabe' , '2','7' , 'Ali' , '2'].
The final row then combines to form ['8' , 'nobody' , '20', '2', 'Jack', '8'].
This matches the three arrays you specified (although not in the same order).
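The combining step described above can be sketched in Python with the standard sqlite3 module and itertools.groupby. This is only an illustration on the small sample: the column names c0, c1, c2 and the index name are made up for the sketch (the real table uses [0] ... [5]), and t1.rowid is selected so that consecutive result rows belonging to the same left-hand row can be grouped.

```python
import sqlite3
from itertools import groupby

# Small sample from the question. The table name `tab` follows the answer;
# the column names c0..c2 are assumptions made for this sketch.
rows = [('2', 'Jack', '8'), ('1', 'Ali', '2'), ('4', 'sgee', '1'),
        ('5', 'gabe', '2'), ('100', 'Jack', '6'),
        ('7', 'Ali', '2'), ('8', 'nobody', '20'), ('9', 'Al', '10')]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tab (c0 TEXT, c1 TEXT, c2 TEXT)')
conn.executemany('INSERT INTO tab VALUES (?, ?, ?)', rows)
conn.execute('CREATE INDEX idx_tab_last ON tab (c2)')

# Self-join: each t1 row is paired with every t2 row whose last column
# equals t1's first column.
joined = conn.execute("""
    SELECT t1.rowid, t1.c0, t1.c1, t1.c2, t2.c0, t2.c1, t2.c2
    FROM tab t1
    JOIN tab t2 ON t2.c2 = t1.c0
    WHERE t1.c0 != t1.c2
    ORDER BY t1.c0, t1.rowid, t2.rowid
""").fetchall()

# Keep the first trio once, then append every matching second trio.
result = []
for _, group in groupby(joined, key=lambda r: r[0]):
    group = list(group)
    combined = list(group[0][1:4])
    for r in group:
        combined.extend(r[4:])
    result.append(combined)

for row in result:
    print(row)
```

Because the result is ordered by the left-hand row, groupby only ever needs to look at adjacent rows, so the combining pass stays linear in the size of the join output.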
Note: Your original question did not indicate what result you expected when the first and last columns match within the same row, e.g. [3, George, 3], so the WHERE clause eliminates two kinds of entries. First, I noticed in your big sample data that there were many rows where col 0 and col 5 were the same, so the WHERE clause removes these rows from consideration. Second, many rows have 'N/A' in col 5, so I removed those from consideration too.
I am moving a query from SQL Server to Snowflake. Part of the query creates a pivot table. The pivot table part works fine (I have run it in isolation, and it pulls numbers I expect).
However, the following parts of the query rely on the pivot table, and those parts fail. Some of the fields come back as a string type. I believe the problem is that Snowflake is having trouble converting string data to numeric data. I have tried CAST and TRY_TO_DOUBLE/NUMBER, but these just return 0.
I will put the code down below, and I appreciate any insight as to what I can do!
CREATE OR REPLACE TEMP TABLE ATTR_PIVOT_MONTHLY_RATES AS (
SELECT
Market,
Coverage_Mo,
ZEROIFNULL(TRY_TO_DOUBLE('Starting Membership')) AS Starting_Membership,
ZEROIFNULL(TRY_TO_DOUBLE('Member Adds')) AS Member_Adds,
ZEROIFNULL(TRY_TO_DOUBLE('Member Attrition')) AS Member_Attrition,
((ZEROIFNULL(CAST('Starting Membership' AS FLOAT))
+ ZEROIFNULL(CAST('Member Adds' AS FLOAT))
+ ZEROIFNULL(CAST('Member Attrition' AS FLOAT)))-ZEROIFNULL(CAST('Starting Membership' AS FLOAT)))
/ZEROIFNULL(CAST('Starting Membership' AS FLOAT)) AS "% Change"
FROM
(SELECT * FROM ATTR_PIVOT
WHERE 'Starting Membership' IS NOT NULL) PT)
I realize this is a VERY big question with a lot of moving parts... So my main question is: How can I successfully change the data type to numeric value, so that hopefully the formulas work in the second half of the query?
Thank you so much for reading through it all!
Edited to shorten the query by removing unneeded syntax.
What I have tried: CAST(), TRY_TO_DOUBLE(), TRY_TO_NUMBER(). I have also put the fields (Starting Membership, Member Adds) in single and double quotation marks.
Unless you are quoting your field names in this post just to highlight them for some reason, the way you've written this query would indicate that you are trying to cast a string value to a number.
For example:
ZEROIFNULL(TRY_TO_DOUBLE('Starting Membership'))
This is simply trying to cast a string literal value of Starting Membership to a double. This will always be NULL. And then your ZEROIFNULL() function is turning your NULL into a 0 (zero).
Without seeing the rest of your query that defines the column names, I can't provide you with a correction, but try using field names, not quoted string values, in your query and see if that gives you what you need.
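The difference between a quoted string and a quoted column name can be tried out locally. The sketch below uses Python's sqlite3 module rather than Snowflake, so it only illustrates the literal-versus-identifier distinction; the table attr_pivot, its single row, and the helpers try_to_double and zero_if_null are hypothetical stand-ins for the Snowflake functions, not their real implementations.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Hypothetical one-column stand-in for the pivot table.
conn.execute('CREATE TABLE attr_pivot ("Starting Membership" TEXT)')
conn.execute("INSERT INTO attr_pivot VALUES ('120')")

# Single quotes make a string literal; double quotes name the column.
literal, column = conn.execute(
    """SELECT 'Starting Membership', "Starting Membership" FROM attr_pivot"""
).fetchone()


def try_to_double(value):
    """Rough emulation: return a float, or None when the text is not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def zero_if_null(value):
    """Rough emulation: replace None with 0."""
    return 0 if value is None else value


print(literal, '->', zero_if_null(try_to_double(literal)))
print(column, '->', zero_if_null(try_to_double(column)))
```

The literal is just the text of the column name, which can never be parsed as a number, while the double-quoted form reads the stored value.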
Your first mistake is that all your single-quoted column names are being treated as string literals.
For example, take your inner select:
with ATTR_PIVOT(id, studentname) as (
select * from values
(1, 'student_a'),
(1, 'student_b'),
(1, 'student_c'),
(2, 'student_z'),
(2, 'student_a')
)
SELECT *
FROM ATTR_PIVOT
WHERE 'Starting Membership' IS NOT NULL
There is no "Starting Membership" column, yet the query runs and we get all the rows, because the WHERE clause only tests a string literal, which is never NULL:
ID  STUDENTNAME
1   student_a
1   student_b
1   student_c
2   student_z
2   student_a
So you need to change 'Starting Membership' to "Starting Membership", and likewise for the other quoted column names.
As Mike mentioned, the 0 results occur because TRY_TO_DOUBLE always fails on a string literal, so the resulting NULL is always turned into zero.
Now, with real "string" values in real named columns:
with ATTR_PIVOT(Market, Coverage_Mo, "Starting Membership", "Member Adds", "Member Attrition") as (
select * from values
(1, 10 ,'student_a', '23', '150' )
)
SELECT
Market,
Coverage_Mo,
ZEROIFNULL(TRY_TO_DOUBLE("Starting Membership")) AS Starting_Membership,
ZEROIFNULL(TRY_TO_DOUBLE("Member Adds")) AS Member_Adds,
ZEROIFNULL(TRY_TO_DOUBLE("Member Attrition")) AS Member_Attrition
FROM ATTR_PIVOT
WHERE "Starting Membership" IS NOT NULL
We get what we would expect ('student_a' is not numeric, so it becomes 0):
MARKET  COVERAGE_MO  STARTING_MEMBERSHIP  MEMBER_ADDS  MEMBER_ATTRITION
1       10           0                    23           150
Okay, so here is the first question on the assignment. I just don't know where to start with the problem. If anyone could help me get started, I'd probably be able to figure out the rest. Thanks.
Set two variable values as follows:
#minEnrollment = 10
#maxEnrollment = 20
Determine the number of courses with enrollments between the values assigned to #minEnrollment and #maxEnrollment. If there are courses with enrollments between these two values, display a message in the form
“There is/are __ class(es) with enrollments between __ and __.”
If there are no classes within the defined range, display a message in the form
“There are no classes with an enrollment between __ and __ students.”
.....
And here is the database to use:
CREATE TABLE Faculty
(Faculty_ID INT PRIMARY KEY IDENTITY,
LastName VARCHAR (20) NOT NULL,
FirstName VARCHAR (20) NOT NULL,
Department VARCHAR (10) SPARSE NULL,
Campus VARCHAR (10) SPARSE NULL);
INSERT INTO Faculty VALUES ('Brown', 'Joe', 'Business', 'Kent');
INSERT INTO Faculty VALUES ('Smith', 'John', 'Economics', 'Kent');
INSERT INTO Faculty VALUES ('Jones', 'Sally', 'English', 'South');
INSERT INTO Faculty VALUES ('Black', 'Bill', 'Economics', 'Kent');
INSERT INTO Faculty VALUES ('Green', 'Gene', 'Business', 'South');
CREATE TABLE Course
(Course_ID INT PRIMARY KEY IDENTITY,
Ref_Number CHAR (5) CHECK (Ref_Number LIKE '[0-9][0-9][0-9][0-9][0-9]'),
Faculty_ID INT NOT NULL REFERENCES Faculty (Faculty_ID),
Term CHAR (1) CHECK (Term LIKE '[A-C]'),
Enrollment INT NULL DEFAULT 0 CHECK (Enrollment < 40))
INSERT INTO Course VALUES ('12345', 3, 'A', 24);
INSERT INTO Course VALUES ('54321', 3, 'B', 18);
INSERT INTO Course VALUES ('13524', 1, 'B', 7);
INSERT INTO Course VALUES ('24653', 1, 'C', 29);
INSERT INTO Course VALUES ('98765', 5, 'A', 35);
INSERT INTO Course VALUES ('14862', 2, 'B', 14);
INSERT INTO Course VALUES ('96032', 1, 'C', 8);
INSERT INTO Course VALUES ('81256', 5, 'A', 5);
INSERT INTO Course VALUES ('64321', 2, 'C', 23);
INSERT INTO Course VALUES ('90908', 3, 'A', 38);
Your request is how to get started, so I'm going to focus on that instead of any specific code.
Start by getting the results that are being asked for, then move on to formatting them as requested.
First, work with the Course table and your existing variables, #minEnrollment = 10 and #maxEnrollment = 20, to get the list that meets the enrollment requirements. Hint: WHERE and BETWEEN. (The Faculty table you have listed doesn't factor into this at all.) After you're sure you have the right results in that list, use the COUNT function to get the number you need for your answer, and assign that value to a new variable.
Now, to the output. IF your COUNT variable is >0, CONCATenate a string together using your variables to fill in the values in the sentence you're supposed to write. ELSE, use the variables to fill in the other sentence.
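The two stages above (get the count, then format the message) can be sketched as follows. This is a Python and SQLite illustration rather than the T-SQL the assignment asks for, so the variable syntax differs; the Course table here is trimmed to the columns from the question without its constraints, and the names min_enrollment, max_enrollment and course_count are chosen for the sketch.

```python
import sqlite3

min_enrollment, max_enrollment = 10, 20

conn = sqlite3.connect(':memory:')
# Simplified copy of the Course table (constraints and identity column omitted).
conn.execute('CREATE TABLE Course (Ref_Number TEXT, Faculty_ID INTEGER, '
             'Term TEXT, Enrollment INTEGER)')
conn.executemany('INSERT INTO Course VALUES (?, ?, ?, ?)', [
    ('12345', 3, 'A', 24), ('54321', 3, 'B', 18), ('13524', 1, 'B', 7),
    ('24653', 1, 'C', 29), ('98765', 5, 'A', 35), ('14862', 2, 'B', 14),
    ('96032', 1, 'C', 8), ('81256', 5, 'A', 5), ('64321', 2, 'C', 23),
    ('90908', 3, 'A', 38),
])

# Stage 1: count the courses inside the range (BETWEEN is inclusive).
(course_count,) = conn.execute(
    'SELECT COUNT(*) FROM Course WHERE Enrollment BETWEEN ? AND ?',
    (min_enrollment, max_enrollment),
).fetchone()

# Stage 2: pick the sentence and fill in the variables.
if course_count > 0:
    message = (f'There is/are {course_count} class(es) with enrollments '
               f'between {min_enrollment} and {max_enrollment}.')
else:
    message = (f'There are no classes with an enrollment between '
               f'{min_enrollment} and {max_enrollment} students.')

print(message)
```

Changing the two bounds to a range with no courses (for example 0 and 4) is a quick way to check that the ELSE branch produces the second sentence.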
Part of the problem is that you've actually got three or so questions in your post. So instead of trying to post a full answer, I'm going to try to get you started with each of the subquestions.
Subquestion #1 - How to assign variables.
You'll need to do some googling on 'how to declare a variable in SQL' and 'how to set a variable in SQL'. This one won't be too hard.
Subquestion #2 - How to use variables in a query
Again, you'll need to google how to do this - something like 'How to use a variable in a SQL query'. You'll find this one is pretty simple as well.
Subquestion #3 - How to use IF in SQL Server.
Not to beat a dead horse, but you'll need to google this. However, one thing I would like to note: I'd test this one first. Ultimately, you're going to want something that looks like this:
IF 1 = 1 -- note, this is NOT the correct syntax (on purpose.)
STUFF
ELSE
OTHERSTUFF
And then switch it to:
IF 1 = 2 -- note, this is NOT the correct syntax (on purpose.)
STUFF
ELSE
OTHERSTUFF
... to verify the 'STUFF' happens when the case is true, and that it otherwise does the 'OTHERSTUFF'. Only after you've gotten it down, should you try to integrate it in with your query (otherwise, you'll get frustrated not knowing what's going on, and it'll be tougher to test.)
One step at a time. Let me give you some help:
Set two variable values as follows: #minEnrollment = 10 #maxEnrollment = 20
Translated to SQL, this would look like:
Declare #minEnrollment integer = 10
Declare #maxEnrollment integer =15
Declare #CourseCount integer = 0
Determine the number of courses with enrollments between the values assigned to #minEnrollment and #maxEnrollment.
Now you have to query your tables to determine the count:
SET #CourseCount = (SELECT Count(Course_ID) from Courses where Enrollment > #minEnrollment
This doesn't answer your questions exactly (ON PURPOSE). Hopefully you can spot the mistakes and fix them yourself. The other answers gave you some helpful hints as well.
In my table I've got a column facebook where I store Facebook data (comment count, share count, etc.), and it's an array. For example:
{{total_count,14},{comment_count,0},{comment_plugin_count,0},{share_count,12},{reaction_count,2}}
Now I'm trying to SELECT the rows whose facebook total_count is between 5 and 10. I've tried this:
SELECT * FROM pl where regexp_matches(array_to_string(facebook, ' '), '(\d+).*')::numeric[] BETWEEN 5 and 10;
But I'm getting an error:
ERROR: operator does not exist: numeric[] >= integer
Any ideas?
There is no need to convert the array to text and use regexp. You can access a particular element of the array, e.g.:
with pl(facebook) as (
values ('{{total_count,14},{comment_count,0},{comment_plugin_count,0},{share_count,12},{reaction_count,2}}'::text[])
)
select facebook[1][2] as total_count
from pl;
total_count
-------------
14
(1 row)
Your query may look like this:
select *
from pl
where facebook[1][2]::numeric between 5 and 10
Update. You could avoid the troubles described in the comments if you used the word null instead of empty strings ('').
with pl(id, facebook) as (
values
(1, '{{total_count,14},{comment_count,0}}'::text[]),
(2, '{{total_count,null},{comment_count,null}}'::text[]),
(3, '{{total_count,7},{comment_count,10}}'::text[])
)
select *
from pl
where facebook[1][2]::numeric between 5 and 10
id | facebook
----+--------------------------------------
3 | {{total_count,7},{comment_count,10}}
(1 row)
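For readers more used to application code, the same selection can be mimicked in Python to make the indexing and the null behaviour explicit. This is only an analogy, not how Postgres evaluates the query: nested lists stand in for the text[] column, None stands in for null, and the helper total_count is a name made up for the sketch.

```python
# Nested lists standing in for the two-dimensional text[] values above.
rows = [
    (1, [['total_count', '14'], ['comment_count', '0']]),
    (2, [['total_count', None], ['comment_count', None]]),
    (3, [['total_count', '7'], ['comment_count', '10']]),
]


def total_count(facebook):
    """Mimic facebook[1][2]::numeric: Postgres arrays are 1-based, Python is 0-based."""
    value = facebook[0][1]
    return None if value is None else float(value)


matches = []
for row_id, facebook in rows:
    count = total_count(facebook)
    # A null never satisfies BETWEEN, so such rows are simply skipped.
    if count is not None and 5 <= count <= 10:
        matches.append(row_id)

print(matches)
```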
However, it would be unfair to leave your problems without an additional comment. The case is suitable as an example for the lecture How not to use arrays in Postgres. You have at least a few better options. The most performant and natural is to simply use regular integer columns:
create table pl (
...
facebook_total_count integer,
facebook_comment_count integer,
...
);
If for some reason you need to separate this data from others in the table, create a new secondary table with a foreign key to the main table.
If for some mysterious reason you have to store the data in a single column, use the jsonb type, for example:
with pl(id, facebook) as (
values
(1, '{"total_count": 14, "comment_count": 0}'::jsonb),
(2, '{"total_count": null, "comment_count": null}'::jsonb),
(3, '{"total_count": 7, "comment_count": 10}'::jsonb)
)
select *
from pl
where (facebook->>'total_count')::integer between 5 and 10
hstore can be an alternative to jsonb.
All these ways are much easier to maintain and much more efficient than your current model. Time to move to the bright side of power.
I have a large dataset, x, consisting of 16201 x 49 cells. The first row contains labels, e.g.:
'Entry1label' 'Entry2label', 'Entry3label', 'Entry4label'
'stimuli 1' 'stimuli 2' 0.1 10
'stimuli 1' 'stimuli 3' 0.1 10
'stimuli 2' 'stimuli 1' 0.1 40
Column 4 consists of cells with values of either 10, 20, 40 or 60. All of the columns have repeated entries (but in different combinations across the columns). I want to filter the cell array for all entries in, e.g., 'Entry4label' that equal, e.g., 10.
I've tried:
x([x{2:end, 4}] == 10, :)
This almost works; however, about every twenty cells there's a cell with value 40 left in! Similarly, if I try with 20, I get spurious occurrences of 10. If I use 40, I get spurious occurrences of 20, and finally for 60 I get some (but very few) 40s.
Any idea as to what is going on?
Code
out = x(find(cell2mat(x(2:end,4))==10)+1,:)
Output
out =
'stimuli 1' 'stimuli 2' [0.1] [10]
'stimuli 1' 'stimuli 3' [0.1] [10]
The first element of the fourth column is a string (the label), so the comparison has to skip the header row; the +1 then maps the matching positions back to rows of x.
Here: x([x{2:end, 4}] == 10, :)
Because you're finding the locations within a subset of the column (rows 2 to end), the logical index is offset by one row relative to x, so each match selects the row above it. I guess that your values in that column are mostly in blocks with an occasional change, which makes it look as if it's matching most of them.
You can put the offset back:
x(find([x{2:end, 4}]==40)+1,:)
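The off-by-one effect is easy to reproduce outside MATLAB. The sketch below is a Python analogy, not MATLAB semantics in full: a list of rows stands in for the cell array, a fourth data row was added to the question's sample so that the stray 40 shows up, and zip is used to mimic applying a mask that is one element shorter than the table.

```python
# Header row plus data rows; the last data row is an addition for illustration.
table = [
    ['Entry1label', 'Entry2label', 'Entry3label', 'Entry4label'],
    ['stimuli 1', 'stimuli 2', 0.1, 10],
    ['stimuli 1', 'stimuli 3', 0.1, 10],
    ['stimuli 2', 'stimuli 1', 0.1, 40],
    ['stimuli 3', 'stimuli 1', 0.1, 10],
]

# The mask is computed on the data rows only (header skipped).
mask = [row[3] == 10 for row in table[1:]]

# Buggy: the mask is applied to the whole table, so position i of the mask
# selects the row one above the one it was computed from.
buggy = [row for row, keep in zip(table, mask) if keep]

# Fixed: apply the mask to the same rows it was computed from.
fixed = [row for row, keep in zip(table[1:], mask) if keep]

print([row[3] for row in buggy])
print([row[3] for row in fixed])
```

A wrong value only leaks in where the column changes from one block of values to the next, which is consistent with seeing a stray entry roughly once per block.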
For example, if I have these 2 Documents:
id: 1
multifield: 2, 5
id: 2
multifield: 2, 5, 9
Then say I have a set that I'm querying with, which is {2, 5, 7}. What I would want is document 1 returned because 2 and 5 are both contained in the set. But document 2 should not be returned because 9 is not in the set.
Both the multivalued field and my set are of arbitrary length. Hopefully that makes sense.
Figured this out. This was the inspiration, specifically the answer suggesting to use Function Queries.
Using the same data in the question, I will add a calculated field to my documents which contains the number of values in my multivalued field.
id: 1
multifield: 2, 5
nummultifield: 2
id: 2
multifield: 2, 5, 9
nummultifield: 3
Then I'll use an frange with some function queries. For each item in my set, I'll use the termfreq function which will return 1 or 0. I will then sum up all of these values. Finally, if that sum equals the calculated field nummultifield, then I know that for that document, every value in the document is present in the set. Remember my set is 2,5,7 so my function query will look something like this:
fq={!frange l=0 u=0}sub( nummultifield, sum( termfreq(multifield,2), termfreq(multifield,5), termfreq(multifield,7)))
If we fill in the values for Document 1 and 2, it will look like this:
Document 1: sub( 2, sum( 1,1,0 ) ) = 0 ' in my range of {0,0} so Doc 1 is returned
Document 2: sub( 3, sum( 1,1,0 ) ) = 1 ' not in the range of {0,0} so not returned
I've tested it out and it works great. You need to make sure you don't duplicate any values in multifield or you'll get weird results. Incidentally, this trick of using frange could be used whenever you want to fake a boolean result from one or more function queries.
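The counting logic behind this filter can be checked offline with a small Python sketch. It emulates termfreq with a plain list count and is not Solr code; build_fq is a hypothetical helper that only assembles the filter string in the shape shown above, using the field names multifield and nummultifield from this answer.

```python
def termfreq(values, term):
    """Emulate termfreq: how many times the term occurs in the field."""
    return values.count(term)


def matches_subset(doc, query_set):
    """True when every value of the document's multifield is in the query set."""
    hits = sum(termfreq(doc['multifield'], term) for term in query_set)
    return doc['nummultifield'] - hits == 0


def build_fq(query_set):
    """Assemble the frange filter string for an arbitrary query set."""
    terms = ','.join(f'termfreq(multifield,{term})' for term in sorted(query_set))
    return f'{{!frange l=0 u=0}}sub(nummultifield,sum({terms}))'


docs = [
    {'id': 1, 'multifield': [2, 5], 'nummultifield': 2},
    {'id': 2, 'multifield': [2, 5, 9], 'nummultifield': 3},
]
query_set = {2, 5, 7}

returned = [doc['id'] for doc in docs if matches_subset(doc, query_set)]
print(returned)
print(build_fq(query_set))
```

The emulation also shows why duplicates are a problem: a document with multifield [2, 2, 9] and nummultifield 3 would score two hits for the value 2 and end up at a difference of 1 or 0 depending on the set, which no longer reflects whether all its values are in the set.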
Faceting may be what you are looking for.
http://wiki.apache.org/solr/SolrFacetingOverview
http://www.lucidimagination.com/devzone/technical-articles/faceted-search-solr
how to search for more than one facet in solr?
I adapted this from the Lucid Imagination link.
Choose all documents that have values 2 or 5 or 7:
http://localhost:8983/solr/select?q=*
&facet=on
&facet.field=multifield
&fq=multifield:(2 OR 5 OR 7)
Incomplete: I don't know of any option to exclude all other values.