How can I use TF-IDF Weights to Rank Relevancy?

I have a set of key terms and have calculated TF-IDF weights along with tag frequencies and term counts for each term, persisted in a database.
How can I use these DB values to produce a set of related terms, given a singular term?
I have read the Wikipedia page on TF-IDF and have consumed many Google search results having to do with cosine similarities, n-gram algorithms, and the like. My strengths are not really in linear algebra, IR, or calculus, so I'm struggling to make sense of those documents.
I'd like to know about the relationship of TF-IDF weights to relevancy. Is there a method to rank these values? Do I need to rank them in relation to the weight of a predefined term?
How can I use these numbers now that I have them?
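One common way to turn those stored weights into a ranking is cosine similarity: treat each term's TF-IDF values across documents (or tags) as a vector, and score every other term by how closely its vector points in the same direction. Below is a minimal sketch of that idea; the (term, document, weight) rows and all names are made up for illustration, not taken from your schema.

```python
import math
from collections import defaultdict

# Hypothetical rows pulled from the database: (term, document_id, tfidf_weight)
rows = [
    ("python", "doc1", 0.8), ("python", "doc2", 0.1),
    ("django", "doc1", 0.7), ("django", "doc3", 0.2),
    ("baking", "doc4", 0.9),
]

# Build a TF-IDF vector per term, keyed by document id
vectors = defaultdict(dict)
for term, doc, weight in rows:
    vectors[term][doc] = weight

def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts
    shared = set(a) & set(b)
    dot = sum(a[d] * b[d] for d in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related_terms(query, k=5):
    # Rank every other term by similarity to the query term's vector
    q = vectors[query]
    scores = [(cosine(q, v), t) for t, v in vectors.items() if t != query]
    return sorted(scores, reverse=True)[:k]

print(related_terms("python"))  # "django" ranks above "baking"
```

Terms that tend to carry high weights in the same documents score close to 1, while terms with no overlap score 0, so sorting by this score gives a "related terms" list for the query term.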

Related


Can I use a counter in a database Many-to-Many field to reduce lookups?

I am trying to figure out the fastest way to access data stored in a junction object. The example below is analogous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User, Product, and Rating. User has a many-to-many relationship to Product with Rating as the junction/'through' class.
The Rating object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product, how is the value of the Product, how user-friendly is the Product). For simplification assume every User rates every Product they buy.
Now here is the calculation I want to perform: for a User, calculate the average rating of all the Products they have bought (that is, the average over ratings from all Users, one of which will be this User's own rating). Then we can tell the user, "On average, you buy products rated 3/5 for value by all customers who bought that product".
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: on the Product class, keep a counter of the number of Ratings that selected each value (e.g. how many Users rated this product 3/5 for value). If you increment that counter every time a Product is rated, then computing the average for a given Product just requires checking the 5 counters for each Rating criterion.
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
This is normal. It is ultimately caching: encoding state redundantly to benefit some patterns of usage at the expense of others. Of course, it also adds complexity.
Just because an RDBMS's data structure is relations doesn't mean you can't rearrange how you encode state away from some straightforward form, e.g. by denormalization.
(Sometimes redundant designs (including ones like yours) are called "denormalized" when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes; see "Cross Table Dependency/Constraint in SQL Database". Indeed, one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies): start with a table holding a user's id and other columns, their ratings (a relation), and its counter. Then ratings functionally determines counter, since counter = select count(*) from ratings. Decompose into user-etc plus counter, i.e. table User, and user plus ratings, which ungroups to table Rating.)
Do you have a suggestion as to the best term to use when googling this?
A frequent comment of mine: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you discover them, with & without your specific names (of variables/databases/tables/columns/constraints/etc), e.g. 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be searches along the lines of optimizing SQL database designs for performance. There are entire books (add 'amazon isbn'), some online (add 'pdf'), though perhaps mostly about queries. Also investigate techniques relevant to data warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data (e.g. snapshot scheduling).
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)
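For concreteness, here is a minimal sketch of the counter approach described in the question above (class and method names are illustrative, not tied to any particular ORM): the per-score counters are updated in O(1) when a rating is written, and the average is recovered from the counters without scanning the Rating rows.

```python
from collections import defaultdict

class Product:
    def __init__(self, name):
        self.name = name
        # counters[criterion][score] -> how many Ratings gave this score for this criterion
        self.counters = defaultdict(lambda: defaultdict(int))

    def add_rating(self, criterion, score):
        # O(1) update performed whenever a new Rating row is written
        self.counters[criterion][score] += 1

    def average(self, criterion):
        # The average is recovered from the 5 counters instead of scanning Rating rows
        scores = self.counters[criterion]
        total = sum(score * count for score, count in scores.items())
        n = sum(scores.values())
        return total / n if n else None

p = Product("Widget")
for s in (3, 4, 5, 3):
    p.add_rating("value", s)
print(p.average("value"))  # 3.75
```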

Partition Key for CosmosDB Graph vertices and edges

I am creating a graph and initially used a partition key that seemed like the only logical one given the data. However, the number of vertices and edges ends up being too large for a single partition. I have not created a partitioned collection yet, only a single 10GB collection, and I ran it out of space because I wasn't sure how many vertices and edges I would have. The data is a set of categories with varying numbers of subcategories (and subcategories of those subcategories, down to an arbitrary depth). Each item is a category id and name, plus the market to which the category applies. The partition key is currently the market. Within a given market there are enough category/subcat/subcat/... entries to exhaust the 10GB partition for that market.
If all I have per vertex is a category id (which is unique), a category name, and a market, plus a parentOf edge that connects a parent category to its children, then what else would make sense as a partition key? If I have a parent category (vertex) with id 1 and market 'US', and it has 100 subcategories each with their own id, and the corresponding 100 parentOf edges all with the same market 'US', then the only option for a partition key other than the market is the category id. The issue is: how efficient would lookups and traversals be if the children, and the children of those children (and their edges), are in other partitions?
How do you build a very large graph with a scenario like this?
Given an arbitrary category id, what would the performance be like to find all the children and walk the edges down to find all the children in the hierarchy of those edges?
What would the partition key attribute for the edges need to be? The same partition key as the parent vertex or the same partition key as the child vertex?
Am I thinking about this wrong?
My recommendation for any non-trivial graph implementation is to make a super generic property that all your docs must include such as (quite literally) partitionKey. Then you're free to use the value for market in that field where it makes sense and something else to support a different query pattern.
The important thing to understand is that queries across multiple partitions are going to be slow. So as much as possible you should tailor your partition key to support the best balance between reads and writes.
Ask yourself "What queries will I need to perform against this data most often?" and then adjust the partitionKey for the various documents accordingly.
As for edges, when you add an edge between two vertices using Gremlin, Cosmos automatically places the edge document in the same partition as the out vertex.
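As an illustration of that generic-property idea, the Gremlin might look roughly like the following; the property names and values here are hypothetical, and the strings would be submitted with whatever Gremlin client you already use.

```python
# Illustrative Gremlin queries (as strings); "partitionKey" is the generic property
# suggested above, and the ids/names are made up for this example.
add_parent = (
    "g.addV('category')"
    ".property('partitionKey', 'US')"
    ".property('categoryId', '1')"
    ".property('name', 'Electronics')"
)
add_child = (
    "g.addV('category')"
    ".property('partitionKey', 'US')"
    ".property('categoryId', '2')"
    ".property('name', 'Televisions')"
)
# The parentOf edge is written into the partition of the out (parent) vertex.
add_edge = (
    "g.V().has('category', 'categoryId', '1')"
    ".addE('parentOf')"
    ".to(g.V().has('category', 'categoryId', '2'))"
)
```

Because the edge document lands in the parent's partition, reading a parent's outgoing parentOf edges stays within that partition; the traversal only crosses partitions when it hops to child vertices that live elsewhere.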

Relationship type, degree, cardinality, optionality: terms confusion

I am currently studying databases. I've seen degree and cardinality used as the same term, while in other places degree is defined as the number of entities involved in a relationship and further categorized as unary, binary, and ternary.
In some places degree is defined as follows: the degree of a relationship type concerns the number of entities within each entity type that can be linked by a given relationship type.
Cardinality is the minimum and maximum number of entity occurrences associated with one occurrence of the related entity.
Cardinality types are given as 1-to-1, 1-to-many, and many-to-many, or as min and max cardinality.
Minimum degree is called optionality and maximum degree is called cardinality.
What is the difference between degree and cardinality?
In another context, cardinality is the number of rows in a table and degree is the number of columns.
So what am I supposed to write if the question asked is "Define cardinality"?
Can somebody explain?
OK, here is the explanation.
1. Degree. This is the number of entities involved in the relationship. It is usually 2 (a binary relationship), but unary and higher-degree relationships can exist.
2. Cardinality. This specifies how many instances of each entity are involved in the relationship.
There are 3 types of cardinality for binary relationships:
one to one (1:1)
one to many (1:n)
many to many (n:m)
Hope this clears things up. Please get in touch for more information.
Degree - number of attributes (columns) in a relation (table)
Cardinality - number of tuples (rows) present in a table
To add to the first answer:
Simply
Degree of a Relation - Number of attributes in a relation
Cardinality of a Relation - Number of tuples in a relation.
I can't post the image to show you, but you can check out the book referenced below to read up more and get a better picture. There is also Connolly and Begg, Database Systems, 4th Edition.
Reference: Elmasri, R., Navathe, S.B., 2011. Fundamentals of Database Systems. 6th ed. United States of America: Pearson.
Degree of a Relationship: the number of participating entities in a relationship. This can be unary, binary, ternary, quaternary, etc.
Cardinality: the number of relationship instances an entity can participate in.
Ex: 1:1, 1:N, M:N
(Min, Max) notation: the minimum represents the participation constraint while the maximum stands for the cardinality ratio.
Degree of a Relation: the number of columns (attributes) in a relation (table).
Degree of a relationship is different from degree of a relation (table). Both definitions are likely to get mixed up and cause confusion.
Relation in this context (in relational databases) is a synonym for "table".
Whereas,
Relationship is a synonym for "a connection between tables (relations)".
We have to consider both of these characteristics, known as "degree" and "cardinality", for relations (tables) and for relationships separately.
1. in a relation (table)
i) Degree - Number of fields (columns) in relation.
ii) Cardinality - number of records (rows) in relation.
2. in a relationship
i) Degree - Number of entities (tables) involved in a relationship (unary, binary, ternary, n-ary)
ii) Cardinality - Number of connections that each record (row) of one entity might establish with the records of the other entity (one-to-one, one-to-many, many-to-many)
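A tiny, made-up example of the relation (table) case, with the relation modeled as a list of rows:

```python
# Toy relation (table): column names listed separately, each row is a tuple.
columns = ("emp_id", "name", "dept")           # attributes
rows = [
    (1, "Alice", "Sales"),
    (2, "Bob", "Engineering"),
    (3, "Carol", "Sales"),
]

degree = len(columns)       # degree of the relation = number of attributes -> 3
cardinality = len(rows)     # cardinality of the relation = number of tuples -> 3
print(degree, cardinality)
```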
It would be good to take note of a distinction when referring to this definition:
Degree of a Relationship differs from,
Degree of a Relation
Good definitions for both have been given above, just take note of this so that the different definitions don't end up confusing you.

Are attributes of a dimension in hierarchical order?

Do the different 'attributes' of a dimension of an OLAP cube have to have a hierarchical order? If not, would the corresponding cube store the results for each possible combination of the dimension attributes?
Let us assume a cube with only two dimensions: time and product.
Time (year, quarter, month, day)
Product (product channel [direct vs. indirect], product group)
While the attributes (what are these called technically?) of the time dimension are clearly strictly hierarchical, the two attributes of the product dimension are not. We may group either by channel then product group, or by product group then channel (depending on which comes first).
Is such a (non-hierarchical) dimension even possible? If so, which aggregations would the cube store? Each combination (aggregations grouped first by channel then by product group, and the other way around)?
I think Attributes is a perfectly fine name for them - I knew exactly what you meant.
Dimensions don't have to be hierarchical, and very often aren't.
As to which aggregations it will store, there is no simple answer. It will depend on what DBMS you are using, and what you tell it to do. For example with SQL Server (SSAS) you can tell it to precalculate a given percentage of results, from 0 to 100. However within that you can't tell it which ones: it'll do that itself; you can only tell it e.g. 50%. I usually specify 100%.
Other DBMSs will have different facilities.
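To make the non-hierarchical case concrete, here is a small illustrative sketch (using pandas; the data and column names are invented): grouping by channel then product group, and by product group then channel, yield the same aggregated cells, only the level order differs.

```python
import pandas as pd

# Toy fact table: two non-hierarchical product attributes plus a measure.
sales = pd.DataFrame({
    "channel":       ["direct", "direct", "indirect", "indirect"],
    "product_group": ["A", "B", "A", "B"],
    "amount":        [100, 150, 80, 120],
})

# Channel -> product group
by_channel_then_group = sales.groupby(["channel", "product_group"])["amount"].sum()
# Product group -> channel
by_group_then_channel = sales.groupby(["product_group", "channel"])["amount"].sum()

# Same aggregated cells either way; only the ordering of the levels differs.
print(by_channel_then_group)
print(by_group_then_channel)
```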
