Finding bigram in a location index - sql-server

I have a table which indexes the locations of words in a bunch of documents.
I want to identify the most common bigrams in the set.
How would you do this in MSSQL 2008?
The table has the following structure:
LocationID -> DocID -> WordID -> Location
I have thought about trying some kind of complicated join, but I'm just doing my head in.
Is there a simple way of doing this?
I think I had better edit this on Monday in order to bump it up in the questions.
Sample Data
LocationID DocID WordID Location
21952 534 27 155
21953 534 109 156
21954 534 4 157
21955 534 45 158
21956 534 37 159
21957 534 110 160
21958 534 70 161

It's been years since I've written SQL, so my syntax may be a bit off; however, I believe the logic is correct.
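SELECT CAST(i.WordID AS varchar(10)) + '|' + CAST(j.WordID AS varchar(10)) AS bigram,
       COUNT(*) AS freq
FROM [index] AS i                  -- "index" is a placeholder table name; it is a reserved word, hence the brackets
JOIN [index] AS j
  ON j.DocID = i.DocID             -- both words are in the same document
 AND j.Location = i.Location + 1   -- j is the word immediately after i
GROUP BY i.WordID, j.WordID        -- SQL Server cannot group by a select-list alias
ORDER BY freq DESC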
You can also add the actual word IDs to the select list if that's useful, and add a join to whatever table you've got that dereferences WordID to actual words.
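The self-join above can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module as a stand-in for SQL Server, so the dialect differs slightly. The table name word_location and the two extra rows for document 700 are made up for illustration, so that one bigram occurs twice.

```python
import sqlite3

# SQLite stands in for SQL Server here; word_location is a hypothetical table name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_location "
    "(LocationID INTEGER, DocID INTEGER, WordID INTEGER, Location INTEGER)"
)
rows = [
    (21952, 534, 27, 155),
    (21953, 534, 109, 156),
    (21954, 534, 4, 157),
    (21955, 534, 45, 158),
    (21956, 534, 37, 159),
    (21957, 534, 110, 160),
    (21958, 534, 70, 161),
    # Made-up rows in a second document so that the bigram (27, 109) repeats.
    (30001, 700, 27, 10),
    (30002, 700, 109, 11),
]
conn.executemany("INSERT INTO word_location VALUES (?, ?, ?, ?)", rows)

# Pair every word with the word at the next location in the same document,
# then count how often each (first, second) pair occurs.
bigrams = conn.execute(
    """
    SELECT i.WordID AS first_word, j.WordID AS second_word, COUNT(*) AS freq
    FROM word_location AS i
    JOIN word_location AS j
      ON j.DocID = i.DocID
     AND j.Location = i.Location + 1
    GROUP BY i.WordID, j.WordID
    ORDER BY freq DESC, first_word, second_word
    """
).fetchall()

for first_word, second_word, freq in bigrams:
    print(first_word, second_word, freq)
```

Grouping on the two WordID columns rather than on a concatenated string keeps the pair usable for a later join to a word lookup table.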

Related

when duplicate values found then

I want a query that selects all duplicate values in a column. If those values meet certain conditions, I'd like the query to return only those values.
Class Student_ID Location
Biology 511 4A
Biology 512 15B
Biology 513 15B
English 514 6A
Biology 521 6A
Spanish 522 6A
Spanish 523 15B
Chemistry 524 4A
English 531 15B
Biology 532 4A
Chemistry 534 4A
Select all duplicate values in the Class column, and if among those values there are locations in both 4A and 15B, assign 1.
CASE WHEN count(class) > 1 AND (Location = '4A' AND Location = '15B') THEN 1
ELSE 0 END
What matters most is how to select duplicate values as a group and then check the condition (the locations must include both 4A and 15B). The query must first group the duplicated values from the Class column and then check whether the values within each group meet the location condition. For example, grouping the Class column gives 5x Biology; this is treated as a group, and if within this group there is one row with location 4A AND one row with location 15B, then and only then assign the value 1 for Biology. Almost all the values in the Class column have duplicates.
Desired Output
Class Location
Biology 1
Chemistry 0
English 0
Spanish 0
As an alternative to Tim Schmelter's answer, you can also do this with a LEFT JOIN.
SELECT yt1.Class, IIF(COUNT(yt2.Class) > 0, 1, 0) AS IsMatch
FROM YourTable yt1
LEFT JOIN YourTable yt2 ON yt1.Location = '4A' AND yt2.Location = '15B' AND
yt2.Class = yt1.Class
GROUP BY yt1.Class
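As a rough check of the LEFT JOIN approach, the sketch below loads the sample rows into an in-memory SQLite database (a stand-in for SQL Server) and runs the join with CASE in place of IIF, since IIF is not available everywhere. It also runs a conditional-aggregation variant, which is a different technique from the answer above, to confirm that both give the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (Class TEXT, Student_ID INTEGER, Location TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [
        ("Biology", 511, "4A"), ("Biology", 512, "15B"), ("Biology", 513, "15B"),
        ("English", 514, "6A"), ("Biology", 521, "6A"), ("Spanish", 522, "6A"),
        ("Spanish", 523, "15B"), ("Chemistry", 524, "4A"), ("English", 531, "15B"),
        ("Biology", 532, "4A"), ("Chemistry", 534, "4A"),
    ],
)

# LEFT JOIN version: a 4A row only finds a partner if the same class has a 15B row.
left_join = conn.execute(
    """
    SELECT yt1.Class,
           CASE WHEN COUNT(yt2.Class) > 0 THEN 1 ELSE 0 END AS IsMatch
    FROM YourTable yt1
    LEFT JOIN YourTable yt2
      ON yt1.Location = '4A' AND yt2.Location = '15B' AND yt2.Class = yt1.Class
    GROUP BY yt1.Class
    ORDER BY yt1.Class
    """
).fetchall()

# Conditional aggregation: count 4A rows and 15B rows per class in a single pass.
conditional = conn.execute(
    """
    SELECT Class,
           CASE WHEN SUM(CASE WHEN Location = '4A' THEN 1 ELSE 0 END) > 0
                 AND SUM(CASE WHEN Location = '15B' THEN 1 ELSE 0 END) > 0
                THEN 1 ELSE 0 END AS IsMatch
    FROM YourTable
    GROUP BY Class
    ORDER BY Class
    """
).fetchall()

print(left_join)
print(conditional)
```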

NUMERIC and VARCHAR

I am using SQL Server 2008 R2 to run queries and I have come across a database where it stores numeric values as varchar(4). For example:
SELECT [num]
FROM [TABLE1]
WHERE num > '95'
I get the below results
96
97
98
99
999
However when I run the same query without the '' i.e.
SELECT [num]
FROM [TABLE1]
WHERE num > 95
then I get
100
101
102
103
104
105
106
107
108
109
110
111
112
113
116
117
120
7001
7002
7003
7004
7005
7006
7007
96
97
98
99
999
In either case, I am not getting the numbers in order, i.e. 95, 96, 97, 98, 99. I understand this is because they are stored as varchar(4), i.e. as strings. Can someone please explain what happens in each situation and how the strings are compared in the two cases above?
Also, can someone help me write the code to convert these varchar(4) values to numeric on the fly so that I can order them properly?
Much appreciated.
When you use > '95', the values are compared as strings, character by character, which is why 100 is excluded ('1' sorts before '9') while 999 is included. When you use > 95, the column is implicitly converted to a number, which is why the result is different.
To be sure what actually happens, you should do the cast yourself. And of course you should not store numbers as varchars.
The correct ordering would be with
order by convert(int, num)
but it will fail if the column contains any non-numeric values.
The > operator does a lexicographical comparison on strings, not a numeric one, so the output follows string order.
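The difference between the two comparisons can be reproduced with a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in, with an explicit CAST instead of SQL Server's implicit conversion or convert(int, num). The handful of values is a subset chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TABLE1 (num TEXT)")  # numbers stored as strings
conn.executemany(
    "INSERT INTO TABLE1 VALUES (?)",
    [(v,) for v in ["95", "96", "99", "100", "120", "999", "7001"]],
)

# String comparison: '100', '120' and '7001' sort before '95', so they drop out.
as_text = [
    row[0]
    for row in conn.execute("SELECT num FROM TABLE1 WHERE num > '95' ORDER BY num")
]

# Numeric comparison and ordering, with the cast written out explicitly.
as_number = [
    row[0]
    for row in conn.execute(
        "SELECT num FROM TABLE1 "
        "WHERE CAST(num AS INTEGER) > 95 "
        "ORDER BY CAST(num AS INTEGER)"
    )
]

print(as_text)
print(as_number)
```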

Optimize MDX query

I have two requirements for my query.
First: a product list sorted by my measure, so that products with higher sales appear first.
ProductCode Sales
----------- ------------
123 18
332 17
245 16
656 15
Second: a cumulative sum over the presorted product list.
ProductCode Sales ACC
----------- ------------ ----
123 18 18
332 17 35
245 16 51
656 15 66
I wrote below MDX in order to achieve above goal:
WITH
  SET SortedProducts AS
    Order([DIMProduct].[ProductCode].[ProductCode].AllMembers, [Measures].[Sales], BDESC)
  MEMBER [Measures].[ACC] AS
    Sum
    (
      Head
      (
        [SortedProducts], Rank([DIMProduct].[ProductCode].CurrentMember, [SortedProducts])
      )
      , [Measures].[Sales]
    )
SELECT
  {[Measures].[Sales], [Measures].[ACC]}
  ON COLUMNS,
  SortedProducts
  ON ROWS
FROM [Model]
But it takes about 3 minutes to run. Any suggestions on how to optimize my code, or is this normal?
I have 9635 products in total.
If you do a quick search on Google, there are different ways to achieve this (many answers here as well).
That said, I will try a different way to calculate your running total:
MEMBER [Measures].[SortedRank] AS Rank([Product].[Product].CurrentMember, [SortedProducts])
MEMBER [Measures].[ACC2] AS SUM(TopCount([SortedProducts], [Measures].[SortedRank]) ,[Measures].[Internet Sales Amount])
I don't know whether TopCount will perform faster than Head in your case; for example, on my test machine your query against the AdventureWorks cube takes the same time with either Head or TopCount.
Hope this helps
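For reference, the target computation itself (sort by sales descending, then accumulate) is sketched below in Python rather than MDX. It is not a replacement for either query, only an illustration of the expected result. One likely reason the Head and TopCount versions are slow is that each row re-sums a prefix of the sorted set, whereas a single pass carries the total forward. The product codes and sales figures are the four sample rows from the question.

```python
from itertools import accumulate

# Sample data from the question: product code -> sales.
sales = {"245": 16, "123": 18, "656": 15, "332": 17}

# Step 1: sort products by sales, highest first.
sorted_products = sorted(sales.items(), key=lambda item: item[1], reverse=True)

# Step 2: carry a running total forward in a single pass.
running_totals = list(accumulate(amount for _, amount in sorted_products))

rows = [
    (code, amount, acc)
    for (code, amount), acc in zip(sorted_products, running_totals)
]

for code, amount, acc in rows:
    print(code, amount, acc)
```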

Aligning Data in SQL

I am using Sybase SQL.
I have two tables.
Table A:
Column1_A:
100
501
504
810
810
950
955
955
Table B:
Column1_B:
100
250
503
810
807
949
950
955
955
I want to achieve the following:
Column1_A Column1_B
100 NULL
501 250
504 503
810 503
810 503
950 949
955 950
955 950
So, basically, I want to align Column1_B from Table B with Column1_A from Table A so that each row shows the maximum Column1_B value that is less than Column1_A. It should give NULL if there is no such element in Table B.
The values in Column1_A and Column1_B are for illustration only. The real values are like 1000, 1500, 2504, and the values in Column1_B are not necessarily Column1_A - 1.
Edit:
I modified the data so that logic can be generalized. I am using Sybase SQL.
Sorry, but it's not clear to me what you want to obtain. However, the final result that you presented could be obtained with:
SELECT Column1_A, Column1_B FROM A
LEFT JOIN B ON Column1_A = Column1_B -1
Edit.
You might try a correlated subquery then:
SELECT A.Column1_A, (SELECT MAX(B.Column1_B) FROM B WHERE B.Column1_B < A.Column1_A) AS Column1_B FROM A
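The correlated-subquery idea can be checked against the posted data with a short sketch. It uses SQLite via Python as a stand-in for Sybase and qualifies the outer column by table name inside the subquery. Note that with the data exactly as posted, the stated rule returns 807 for the 810 rows, because 807 is in Table B, rather than the 503 shown in the question's expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (Column1_A INTEGER)")
conn.execute("CREATE TABLE B (Column1_B INTEGER)")
conn.executemany(
    "INSERT INTO A VALUES (?)",
    [(v,) for v in [100, 501, 504, 810, 810, 950, 955, 955]],
)
conn.executemany(
    "INSERT INTO B VALUES (?)",
    [(v,) for v in [100, 250, 503, 810, 807, 949, 950, 955, 955]],
)

# For each row of A, look up the largest B value strictly below it.
# MAX over an empty set yields NULL, which covers the "no such element" case.
aligned = conn.execute(
    """
    SELECT A.Column1_A,
           (SELECT MAX(B.Column1_B)
              FROM B
             WHERE B.Column1_B < A.Column1_A) AS Column1_B
    FROM A
    ORDER BY A.Column1_A
    """
).fetchall()

for a_value, b_value in aligned:
    print(a_value, b_value)
```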

Formatting link lists using TSQL

Shog9 keeps on making my link lists look awesome.
Essentially, I write a bunch of queries that pull out results from the Stackoverflow data dump. However, my link lists look very ugly and are hard to understand.
Using some formatting magic Shog9 manages to make the link lists look a lot nicer.
So, for example, I will write a query that returns the following:
question id,title,user id, other info
4,When setting a form’s opacity should I use a decimal or double?,8,Eggs McLaren, some other stuff lots of text
And I want it to paste it into an answer on meta and make it look like this:
Question Id User Name Other Info
When setting a form’s opacity... Eggs Mclaren Some other stuff...
So assuming my starting point is the query that returns the start info.
What is the smallest number of steps I can run in Query Analyzer to turn the results into:
<h3> Question Id User Name Other Info </h3>
<pre>
When setting a form’s opacity... Eggs Mclaren Some other stuff...
</pre>
My initial thoughts are to insert the results into a temp table and then run a stored proc that will iron the data into my desired structure. Run the proc, cut and paste and be done.
Any candidate TSQL based solutions to this problem?
EDIT: Accepting my answer; it's the only solution with an implementation.
Not sure of your exact requirements, but have you considered selecting the data as XML and then applying an XSLT transform to the results?
I'll update this post with my progress as I refine my proc:
Example:
select top 20
UserId = u.Id,
UserName = u.DisplayName,
u.Reputation,
sum(case when p.ParentId is null then 1 else 0 end) as Questions,
sum(case when p.ParentId is not null then 1 else 0 end) as Answers
into #t
from Users u
join Posts p on p.OwnerUserId = u.Id
where p.CommunityOwnedDate is null and p.ClosedDate is null
group by u.Id, u.DisplayName, u.Reputation
having sum(case when p.ParentId is not null then 1 else 0 end) < sum(case when p.ParentId is null then 1 else 0 end) / 6
order by Reputation desc
exec spShog9
Results:
User Reputation Questions Answers
Edward Tanguay 8317 465 24
me 5767 311 29
Joan Venge 4844 226 14
Blankman 4546 310 1
acidzombie24 4359 371 32
Thanks 4350 416 21
Masi 4193 555 74
LazyBoy 3230 94 12
KingNestor 3187 92 11
Nick 2084 79 6
George2 1973 263 1
Xaisoft 1944 174 12
John 1929 160 24
danmine 1901 53 3
zsharp 1771 145 16
carrier 1742 56 8
JC Grubbs 1550 50 5
vg1890 1534 56 2
Coocoo4Cocoa 1514 143 0
Keand64 1513 83 5
The proc is on gist: http://gist.github.com/165544
You could do something like:
with
data (question_id,title,user_id, username ,other_info) as
(
select 4,'When setting a form''s opacity should I use a decimal or double?',8,'Eggs McLaren', 'some other stuff lots of text'
union all
select 5,'Another q title',9,'OtherUsername', 'some other stuff lots of text')
select
(select 'http://stackoverflow.com/questions/' + cast(question_id as varchar(10)) as [@href], title as [*] for xml path('a')) as questioninfo
,(select 'http://stackoverflow.com/users/' + cast(user_id as varchar(10)) + '/' + replace(username, ' ', '-') as [@href], username as [*] for xml path('a')) as userinfo
, other_info
from data
...but see how you go. I personally find that FOR XML PATH is very powerful for getting marked-up results in a way that suits me.
Rob
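To show the shape of the markup the query above is aiming for, here is a minimal sketch in Python that builds the same anchor elements from the two sample rows. The helper name link is made up for illustration, and this runs outside the database, so it only mirrors the output, not the FOR XML PATH mechanism.

```python
import xml.etree.ElementTree as ET


def link(href, text):
    """Build an <a href="...">text</a> element and return it as a string."""
    anchor = ET.Element("a", {"href": href})
    anchor.text = text
    return ET.tostring(anchor, encoding="unicode")


# Same sample rows as the CTE above: question id, title, user id, user name.
data = [
    (4, "When setting a form's opacity should I use a decimal or double?", 8, "Eggs McLaren"),
    (5, "Another q title", 9, "OtherUsername"),
]

lines = []
for question_id, title, user_id, username in data:
    question_link = link(f"http://stackoverflow.com/questions/{question_id}", title)
    user_link = link(
        f"http://stackoverflow.com/users/{user_id}/" + username.replace(" ", "-"),
        username,
    )
    lines.append((question_link, user_link))

for question_link, user_link in lines:
    print(question_link, user_link)
```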

Resources