The purpose of the query is to compute, for a given tag X, the probability that a tag Y also appears on the same post.
For example:
Table ListTagsFromPost:
id | Tags
---+--------------
1  | <A><B><C><D>
2  | <A><B>
3  | <A><C>
4  | <B><D>
5  | <A><D>
6  | <A><B><C>
7  | <A><C>
8  | <A><D>
9  | <B><D>
The query returns:
  | A   | B            | C            | D
--+-----+--------------+--------------+--------------
A | 100 | 0.4285714286 | 0.5714285714 | 0.4285714286
B | 0.6 | 100          | 0.4          | 0.6
C | 100 | 0.5          | 100          | 0.25
D | 0.6 | 0.6          | 0.2          | 100
The problem is that I have 50,000 records in the ListTagsFromPost table.
That means there can be many distinct tags (not just A, B, C, D as in the example).
I need a simple query that produces this result.
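To make the target numbers concrete, here is a minimal Python sketch of the calculation (not the SQL query being asked for). It assumes the Tags column always has the form <A><B>...; the function name cooccurrence_probabilities is made up for illustration. It returns fractions, so a tag that always co-occurs comes out as 1.0.

```python
import re
from collections import Counter
from itertools import permutations

# Sample data copied from the ListTagsFromPost example.
posts = {
    1: "<A><B><C><D>", 2: "<A><B>", 3: "<A><C>",
    4: "<B><D>", 5: "<A><D>", 6: "<A><B><C>",
    7: "<A><C>", 8: "<A><D>", 9: "<B><D>",
}

def cooccurrence_probabilities(tag_strings):
    """Return {(x, y): P(y appears | x appears)} for every pair seen together."""
    tag_counts = Counter()   # number of posts containing each tag
    pair_counts = Counter()  # number of posts containing both x and y
    for tags in tag_strings:
        unique = set(re.findall(r"<([^<>]+)>", tags))
        tag_counts.update(unique)
        # Ordered pairs of distinct tags on the same post.
        pair_counts.update(permutations(unique, 2))
    return {(x, y): n / tag_counts[x] for (x, y), n in pair_counts.items()}

probs = cooccurrence_probabilities(posts.values())
print(probs[("A", "B")])  # share of posts tagged A that are also tagged B
```

Each post is visited once, so the work grows with the number of posts times the square of the tags per post, not with the square of the number of distinct tags.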
I am using Scala and Spark.
I have two data frames.
The first one is like following:
+------+------+-----------+
| num1 | num2 | arr |
+------+------+-----------+
| 25 | 10 | [a,c] |
| 35 | 15 | [a,b,d] |
+------+------+-----------+
In the second one the data frame headers are
num1, num2, a, b, c, d
I have created a case class containing all the possible header columns.
Now, matching on the columns num1 and num2, I want to check whether
the array in the arr column contains each header of the second data frame.
If so, the value should be 1, otherwise 0.
So the required output is:
+------+------+---+---+---+---+
| num1 | num2 | a | b | c | d |
+------+------+---+---+---+---+
| 25 | 10 | 1 | 0 | 1 | 0 |
| 35 | 15 | 1 | 1 | 0 | 1 |
+------+------+---+---+---+---+
If I understand correctly, you want to transform the array column arr into one column per possible value, that would contain whether or not the array contains that value.
If so, you can use the array_contains function like this:
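import org.apache.spark.sql.functions.{array_contains, col}
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq((25, 10, Seq("a","c")), (35, 15, Seq("a","b","d")))
  .toDF("num1", "num2", "arr")
val values = Seq("a", "b", "c", "d")
df
  .select(Seq("num1", "num2").map(col) ++
    // array_contains returns a Boolean; cast it to int to get 1/0
    values.map(x => array_contains(col("arr"), x).cast("int") as x) : _*)
  .show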
+----+----+---+---+---+---+
|num1|num2| a| b| c| d|
+----+----+---+---+---+---+
| 25| 10| 1| 0| 1| 0|
| 35| 15| 1| 1| 0| 1|
+----+----+---+---+---+---+
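For readers who want to see the logic without Spark, here is a small plain-Python sketch of the same per-value membership test. The function name multi_hot and the tuple layout of the rows are assumptions made for this illustration only.

```python
def multi_hot(rows, values):
    """Turn (num1, num2, arr) tuples into dicts with one 0/1 flag per value."""
    encoded = []
    for num1, num2, arr in rows:
        present = set(arr)  # set membership keeps each check cheap
        flags = {v: int(v in present) for v in values}
        encoded.append({"num1": num1, "num2": num2, **flags})
    return encoded

rows = [(25, 10, ["a", "c"]), (35, 15, ["a", "b", "d"])]
values = ["a", "b", "c", "d"]
for record in multi_hot(rows, values):
    print(record)
```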
My Google Sheets document contains data like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | | c | | x | |
+---+---+---+---+---+---+
| 2 | | r | | 4 | |
+---+---+---+---+---+---+
| 3 | | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
Columns B and D contain data provided by the IMPORTRANGE function, and the source data is stored in different files.
I would like to fill column A with the first non-empty value in each row. In other words, the desired result must look like this:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | r | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
I tried the ISBLANK function, but apparently a cell in an imported column is not considered blank even when its value is empty, so this function doesn't work in my case. Then I tried the QUERY function in two variants:
1) =QUERY({B1;D1}; "select Col1 where Col1 is not null limit 1"; 0), but the result is wrong when the row contains cells with numbers. The result with this query is the following:
+---+---+---+---+---+---+
| | A | B | C | D | E |
+---+---+---+---+---+---+
| 1 | c | c | | x | |
+---+---+---+---+---+---+
| 2 | 4 | r | | 4 | |
+---+---+---+---+---+---+
| 3 | m | | | m | |
+---+---+---+---+---+---+
| 4 | | | | | |
+---+---+---+---+---+---+
2) =QUERY({B1;D1};"select Col1 where Col1 <> '' limit 1"; 0) / =QUERY({B1;D1};"select Col1 where Col1 != '' limit 1"; 0), and this doesn't work at all; the result is always #N/A.
I would also like to avoid nested IFs and JavaScript scripts if possible, as a solution with the QUERY function suits my case best: it is easy to extend to other columns without any deeper programming knowledge. Is there a simple way to do this with QUERY alone that I am just missing, or do I have to use IFs/JavaScript?
try:
=ARRAYFORMULA(SUBSTITUTE(INDEX(IFERROR(SPLIT(TRIM(TRANSPOSE(QUERY(
TRANSPOSE(SUBSTITUTE(B:G, " ", "♦")),,99^99))), " ")),,1), "♦", " "))
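The formula is dense, so here is a rough Python sketch of the steps it appears to perform on each row: protect spaces inside cells with a placeholder, join the cells with spaces, trim, split, take the first piece, and restore the spaces. The function name first_non_empty is hypothetical, and this models the logic only, not how Sheets evaluates it.

```python
def first_non_empty(row, placeholder="♦"):
    """Return the first non-empty cell of a row, or "" if all cells are empty."""
    # SUBSTITUTE(..., " ", "♦"): hide spaces that live inside a cell.
    protected = ["" if cell is None else str(cell).replace(" ", placeholder)
                 for cell in row]
    # QUERY(TRANSPOSE(...),,99^99) joins the cells of a row with spaces.
    joined = " ".join(protected)
    # TRIM + SPLIT(" "): empty cells collapse away, leaving only real values.
    parts = joined.split()
    # INDEX(..., , 1) takes the first piece; the outer SUBSTITUTE restores spaces.
    return parts[0].replace(placeholder, " ") if parts else ""

sheet = [
    ["c", "", "x", ""],
    ["r", "", 4, ""],
    ["", "", "m", ""],
    ["", "", "", ""],
]
for row in sheet:
    print(repr(first_non_empty(row)))
```

Because every cell is turned into text before joining, a row that mixes text and numbers is handled the same way as a text-only row.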
selective columns:
I am having a problem with a Neo4j query. Suppose I have a Node type called App. The App nodes have the fields "m_id" and "info". I want to build a query that creates a relationship between nodes whose "info" field is equal.
This is the query:
MATCH (a:App {m_id:'SOME_VALUE' }),(b:App {info: a.info})
WHERE ID(a)<>ID(b) AND NOT (b)-[:INFO]->(a)
MERGE (a)-[r:INFO]->(b)
RETURN b.m_id;
I also have indexes for both fields:
CREATE CONSTRAINT ON (a:App) ASSERT a.m_id IS UNIQUE;
CREATE INDEX ON :App(info);
But the queries are very slow, and every App node record is accessed.
This is the profile of the query:
+---------------+--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+---------------+--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +ColumnFilter | 0 | 0 | b.m_id | keep columns b.m_id |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +Extract | 0 | 0 | a, b, b.m_id, r | b.m_id |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +Merge(Into) | 0 | 1 | a, b, r | (a)-[r:INFO]->(b) |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +Eager | 0 | 0 | a, b | |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +Filter | 0 | 2000000 | a, b | Ands(b.info == a.info, NOT(IdFunction(a) == IdFunction(b)), NOT(nonEmpty(PathExpression((b)-[anon[104]:INFO]->(a), true)))) |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 184492 | 1000000 | a, b | { AUTOSTRING0}; :App(m_id) |
| | +--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
| +NodeByLabel | 184492 | 1000001 | b | :App |
+---------------+--------+---------+-----------------+--------------------------------------------------------------------------------------------------------------------------------+
Try finding a by itself, using a WITH clause to put a.info into a temporary variable that is used by a separate MATCH clause for b, as in:
MATCH (a:App { m_id:'SOME_VALUE' })
WITH a, a.info AS a_info
MATCH (b:App { info: a_info })
WHERE a <> b AND NOT (b)-[:INFO]->(a)
MERGE (a)-[r:INFO]->(b)
RETURN b.m_id;
It seems that indices tend not to be used when comparing the properties of 2 nodes. The use of a_info removes that impediment.
If the profile of the above shows that one or both indices are not being used, you can try adding index hints:
MATCH (a:App { m_id:'SOME_VALUE' })
USING INDEX a:App(m_id)
WITH a, a.info AS a_info
MATCH (b:App { info: a_info })
USING INDEX b:App(info)
WHERE a <> b AND NOT (b)-[:INFO]->(a)
MERGE (a)-[r:INFO]->(b)
RETURN b.m_id;
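As a loose analogy only (plain Python dictionaries, not Neo4j internals), the difference between scanning every App node and looking b up through an index on info can be sketched like this. All names and sample values below are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for App nodes.
apps = [
    {"m_id": "app-1", "info": "x"},
    {"m_id": "app-2", "info": "x"},
    {"m_id": "app-3", "info": "y"},
    {"m_id": "app-4", "info": "x"},
]

# Stand-ins for the unique constraint on m_id and the index on info.
by_m_id = {app["m_id"]: app for app in apps}
by_info = defaultdict(list)
for app in apps:
    by_info[app["info"]].append(app)

def matches_by_scan(m_id):
    """Compare a.info with every node: touches all records."""
    a = by_m_id[m_id]
    return [b["m_id"] for b in apps if b["info"] == a["info"] and b is not a]

def matches_by_lookup(m_id):
    """Resolve a first, then use its info value as a plain lookup key."""
    a = by_m_id[m_id]
    a_info = a["info"]  # plays the role of `WITH a, a.info AS a_info`
    return [b["m_id"] for b in by_info[a_info] if b is not a]

print(matches_by_lookup("app-1"))
```

Both functions return the same nodes; only the second avoids walking the full list.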
I figured out a solution using OPTIONAL MATCH:
MATCH (a:App {m_id:'SOME_VALUE' })
OPTIONAL MATCH (a),(b:App {info: a.info})
WHERE ID(a)<>ID(b) AND NOT (b)-[:INFO]->(a)
MERGE (a)-[r:INFO]->(b)
RETURN b.m_id;
This is the profile of the query:
+----------------+------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+----------------+------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +ColumnFilter | 0 | 0 | b.m_id | keep columns b.m_id |
| | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +Extract | 0 | 0 | a, b, b.m_id, r | b.m_id |
| | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +Merge(Into) | 0 | 1 | a, b, r | (a)-[r:INFO]->(b) |
| | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +Eager | 0 | 0 | a, b | |
| | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +OptionalMatch | 0 | 0 | a, b | |
| |\ +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| | +Filter | 0 | 0 | a, b | Ands(NOT(IdFunction(a) == IdFunction(b)), NOT(nonEmpty(PathExpression((b)-[anon[109]:INFO]->(a), true)))) |
| | | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| | +SchemaIndex | 0 | 0 | a, b | a.info; :App(info) |
| | | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| | +Argument | 0 | 0 | a | |
| | +------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
| +SchemaIndex | 0 | 1 | a | { AUTOSTRING0}; :App(m_id) |
+----------------+------+---------+-----------------+------------------------------------------------------------------------------------------------------------+
I have a Hive table with the following schema:
hive>desc books;
gen_id int
author array<string>
rating double
genres array<string>
hive>select * from books;
| gen_id | rating | author    | genres    |
+--------+--------+-----------+-----------+
| 1      | 10     | ["A","B"] | ["X","Y"] |
| 2      | 20     | ["C","A"] | ["Z","X"] |
| 3      | 30     | ["D"]     | ["X"]     |
Is there a SELECT query that returns one row per array element, like this:
| gen_id | rating | SplitData
+-------------+---------------+-------------
| 1 | 10 | "A"
| 1 | 10 | "B"
| 1 | 10 | "X"
| 1 | 10 | "Y"
| 2 | 20 | "C"
| 2 | 20 | "A"
| 2 | 20 | "Z"
| 2 | 20 | "X"
| 3 | 30 | "D"
| 3 | 30 | "X"
Can someone guide me on how to get this result? Thanks in advance for any kind of help.
You need LATERAL VIEW with explode, applied to each array column separately and combined with UNION ALL. Chaining both lateral views in a single SELECT would pair every author with every genre and produce duplicate rows. For example:
SELECT
    gen_id,
    rating,
    SplitData
FROM (
    SELECT gen_id, rating, ex_author AS SplitData
    FROM books
    LATERAL VIEW explode(author) exploded_authors AS ex_author
    UNION ALL
    SELECT gen_id, rating, ex_genre AS SplitData
    FROM books
    LATERAL VIEW explode(genres) exploded_genres AS ex_genre
) tab;
I had no chance to test it, but it should show you the general path. Add an ORDER BY if you need the rows in a particular order. GL!
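To make the target semantics concrete, here is a plain-Python sketch of what unrolling two array columns into a single SplitData column means: each array is unrolled on its own, and authors are never paired with genres. The function name explode_columns is hypothetical, and the row order shown here comes from the Python loop, not from anything Hive guarantees.

```python
# Sample rows copied from the books table in the question.
books = [
    {"gen_id": 1, "rating": 10, "author": ["A", "B"], "genres": ["X", "Y"]},
    {"gen_id": 2, "rating": 20, "author": ["C", "A"], "genres": ["Z", "X"]},
    {"gen_id": 3, "rating": 30, "author": ["D"], "genres": ["X"]},
]

def explode_columns(rows, array_columns):
    """Emit one (gen_id, rating, value) tuple per element of each array column."""
    out = []
    for row in rows:
        for column in array_columns:
            for value in row[column]:
                out.append((row["gen_id"], row["rating"], value))
    return out

split_rows = explode_columns(books, ["author", "genres"])
for split_row in split_rows:
    print(split_row)
```

The result has one row per array element (ten for the sample data), which is the count to check any Hive query against.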
I found a macro (courtesy of Jerry Beaucaire) that splits up one worksheet into many based on unique values in a given column. This works great. However...
The client has supplied a differently formatted worksheet which needs some gentle massaging to get into the format we need.
First, let me show you a snippet of JB's code:
MyArr = Application.WorksheetFunction.Transpose _
(ws.Columns(iCol).SpecialCells(xlCellTypeConstants))
From what I can tell (and I'm a total VB newbie, so what do I know?), this populates an array with the selected row values.
And this:
For Itm = 2 To UBound(MyArr)
    ...(code removed)
    ws.Range("A" & TitleRow & ":A" & LR).EntireRow.Copy _
        Sheets(MyArr(Itm) & "").Range("A1")
    ...(code removed)
Next Itm
...seems to do the copying.
Alright. ...fine so far.
The problem is that I need to add a step to the process. This will be tricky to explain. Please bear with me...
Title row is row 1
Data starts in row 2
Each row has 9 columns:
colA: identifier
colB-colD: x,y,z values (for top of item)
colE-colG: x,y,z values (for bottom of item)
colH and colI: can be ignored
These x, y and z values define points that are used to plot lines in a 3D modelling program. Each row in the worksheet actually defines a line (well, a start point and an end point: "top" and "bottom"). Unfortunately, the data (worksheet) we have received defines two sets of data for each line, both having the same start point but different end points. Put another way, starting with rows 3 and 4, the data in columns B-D is the same for both rows. This also applies to rows 5 & 6, 7 & 8, etc.
Since all we need are a set of data POINTS, we can safely use the values from cols E-G.
HOWEVER... and this is where I need help... We need the first row of the newly created worksheet to start with the values from row 2, cols B-D (i.e., we can use the end points as our coordinates, but we still need the first start point). All the rest is fine the way it is.
For example:
Source Data:
| A | B | C | D | E | F | G |
1 | id | x-top | y-top | z-top | x-bottom | y-bottom | z-bottom |
2 | H1 | 101.2 | 0.525 | 54.25 | 110.25 | 0.625 | 56.75 |
3 | H1 | 110.25| 0.625 | 56.75 | 121.35 | 2.125 | 62.65 |
4 | H1 | 110.25| 0.625 | 56.75 | 134.85 | 3.725 | 64.125 | B,C,D same as row 3
5 | H1 | 134.85| 3.725 | 64.125| 141.25 | 4.225 | 66.75 |
6 | H1 | 134.85| 3.725 | 64.125| 148.85 | 5.355 | 69.85 | B,C,D same as row 5
What I need:
| A | B | C | D | E | F | G |
1 | id | x-top | y-top | z-top | x-bottom | y-bottom | z-bottom |
2 | H1 | | | | 101.2 | 0.525 | 54.25 |
3 | H1 | 101.2 | 0.525 | 54.25 | 110.25 | 0.625 | 56.75 |
4 | H1 | 110.25| 0.625 | 56.75 | 121.35 | 2.125 | 62.65 |
5 | H1 | 110.25| 0.625 | 56.75 | 134.85 | 3.725 | 64.125 |
6 | H1 | 134.85| 3.725 | 64.125| 141.25 | 4.225 | 66.75 |
7 | H1 | 134.85| 3.725 | 64.125| 148.85 | 5.355 | 69.85 |
So... What's the best way to do this? Can I add to the existing macro to perform this operation? If so, better to modify the array? ...better to modify the Copy routine? ...and how??
Thanks in advance for your help and please don't suggest doing it manually. There are 70,000+ rows to parse!
If you need more info, let me know!
The full macro is available for free to all at this location
To achieve your connecting points, these additions should do it:
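For Itm = 2 To UBound(MyArr)
    ...(code removed)
    ws.Range("A" & TitleRow & ":A" & LR).EntireRow.Copy _
        Sheets(MyArr(Itm) & "").Range("A1")
    With Sheets(MyArr(Itm) & "")
        ' Insert a blank row between the title row and the first data row
        .Rows(2).Insert xlShiftDown
        ' Carry the identifier into the new row, as in the "What I need" table
        .Range("A2").Value = .Range("A3").Value
        ' Copy the first start point (B3:D3) into the end-point columns (E2:G2)
        .Range("E2").Resize(, 3).Value = .Range("B3").Resize(, 3).Value
    End With
    ...(code removed)
Next Itm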
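The intended row transformation, independent of the macro, can be sketched in Python as follows. It assumes each data row is a list of [id, x-top, y-top, z-top, x-bottom, y-bottom, z-bottom]; the function name prepend_start_point is made up, and the identifier is carried into the new row because the "What I need" table shows it there.

```python
def prepend_start_point(rows):
    """Add a leading row whose end-point columns hold the first start point."""
    if not rows:
        return []
    first = rows[0]
    # id, three blank start-point cells, then the first row's x/y/z-top values.
    lead = [first[0], None, None, None, *first[1:4]]
    return [lead] + [list(row) for row in rows]

# Sample data copied from the "Source Data" table.
source = [
    ["H1", 101.2, 0.525, 54.25, 110.25, 0.625, 56.75],
    ["H1", 110.25, 0.625, 56.75, 121.35, 2.125, 62.65],
    ["H1", 110.25, 0.625, 56.75, 134.85, 3.725, 64.125],
    ["H1", 134.85, 3.725, 64.125, 141.25, 4.225, 66.75],
    ["H1", 134.85, 3.725, 64.125, 148.85, 5.355, 69.85],
]
result = prepend_start_point(source)
for row in result:
    print(row)
```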