google refine: use facet tools to infer map between two columns

I've been searching but haven't found how to do this in Refine.
I've got two columns of unique IDs. For each a in A, I want to find the top 10 closest matches in B.
My backup plan is to just iterate using Levenshtein distance, but Refine has such a nice interface and so many more algorithms implemented that I was hoping to do some of the work with it.
Or is there another tool for doing this?

Did you know you can use a clustering algorithm such as fingerprint or ngramFingerprint outside the clustering interface in Refine?
Using your IDs field, create a new column based on this column with the following expression: ngramFingerprint(value)
You can now cross with your other data set on this new column. This might help to get more matches.
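If you end up doing this outside Refine, the same idea can be sketched in Python. This is only an approximation of an n-gram fingerprint key (lowercase, strip non-alphanumerics, sorted unique n-grams), not Refine's exact implementation, and the function names are made up for illustration. The top_matches helper uses difflib's similarity ratio from the standard library as a stand-in for Levenshtein, which is an assumption on my part rather than what the question asked for.

```python
import difflib
import re
import unicodedata
from collections import defaultdict


def ngram_fingerprint(value, n=2):
    """Approximate n-gram fingerprint key (hypothetical helper, not Refine's code)."""
    # Fold accents to ASCII and lowercase the value.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii").lower()
    # Keep letters and digits only.
    text = re.sub(r"[^a-z0-9]", "", text)
    if len(text) < n:
        return text
    # Unique n-grams, sorted, joined into a single key.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


def cross_on_fingerprint(col_a, col_b, n=2):
    """Map each value of col_a to the col_b values that share its key."""
    index = defaultdict(list)
    for b in col_b:
        index[ngram_fingerprint(b, n)].append(b)
    return {a: index.get(ngram_fingerprint(a, n), []) for a in col_a}


def top_matches(a, col_b, k=10):
    """Return up to k values of col_b ranked by difflib similarity to a."""
    # cutoff=0.0 keeps every candidate so that exactly the best k are returned.
    return difflib.get_close_matches(a, col_b, n=k, cutoff=0.0)
```

The fingerprint cross only finds values that collapse to the same key, so top_matches is a fallback for the IDs that stay unmatched.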

Related

How do we give an identification to a relationship in OntoRefine RDF mapping?

I'm working on a transformation in which I need to convert a property graph dataset into an RDF dataset. There are many n-ary relationships that need to be treated as classes, but I do not know how to assign a unique identifier to these relationships. I tried using the row index, but this work involves more than one file, so that can't work. How do you assign a unique identifier to relationships? If a URI is the solution, how do we do this in an OntoRefine mapping? Thank you for your answers.
Lee

There are several ways to address this:
Ideally, use some characteristics of the related entities to make a deterministic URL. E.g., if you're making a position (membership) node between a person and an org that involves a mandatory role and start date, you could use a URL like org/<org_id>/person/<person_id>/role/<role_id>/date/<date>.
Use a blank node. In that case you don't need to worry about a URN.
Use the row index, prepended with the table/file name (as a constant).
Use the GREL function random(). It doesn't produce a globally unique identifier, but if you ask for a large enough range, the result will be unique with very high probability.
Use a Jython function, as shown in "How to create UUID in Openrefine based on the MD5 hash of the values".
If you do your mapping using SPARQL, then use the built-in uuid() function.
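To make the first and fifth options concrete, here is a minimal Python sketch. The function names, the base URL and the choice of separator are assumptions for illustration; this is not the code from the linked answer and not OntoRefine's API. Inside OpenRefine you would adapt the same logic to a Jython expression.

```python
import hashlib
import uuid
from urllib.parse import quote


def membership_uri(base, org_id, person_id, role_id, date):
    """Build a deterministic URL from the entities a relationship connects."""
    segments = ["org", org_id, "person", person_id, "role", role_id, "date", date]
    # Percent-encode each segment so spaces or slashes in IDs stay harmless.
    path = "/".join(quote(str(s), safe="") for s in segments)
    return base.rstrip("/") + "/" + path


def md5_uuid(*values):
    """Derive a UUID-shaped identifier from the MD5 hash of the values."""
    # A unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(v) for v in values)
    digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
    return str(uuid.UUID(hex=digest))
```

Both helpers return the same identifier every time they see the same inputs, so re-running the mapping over several files does not mint new identifiers for the same relationship.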

Is primary, secondary, tertiary etc. sorting possible on qooxdoo tables?

I can't find any info on this in the qooxdoo documentation (currently on version 4.01 for various reasons), so I'm assuming it is not supported out of the box. Has anyone been able to implement this using qooxdoo, or should I use some other third party table control, or perhaps implement one myself (which I'd rather not)?

I've not seen any implementation of multiple-column sorting in qooxdoo yet. What I've done is implicit multi-column sorting with the remote table model: when a user chose to sort on column A, the data was also sorted on column B, i.e. in SQL terms "order by A, B".
An implementation for multi column sorting would be very welcome.
Please join us on https://github.com/qooxdoo/qooxdoo/ and https://gitter.im/qooxdoo/qooxdoo
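For anyone implementing this themselves, the "order by A, B" behaviour can be sketched independently of qooxdoo. The Python below is only an illustration of the technique (repeated stable sorts, least significant column first); multi_sort and the row layout are hypothetical and not part of any qooxdoo API.

```python
from operator import itemgetter


def multi_sort(rows, order):
    """Sort rows by several columns.

    order is a list of (column, descending) pairs, most significant first,
    e.g. [("A", False), ("B", True)] for "order by A asc, B desc".
    """
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant column
    # first and the most significant column last gives the combined order.
    for column, descending in reversed(order):
        result.sort(key=itemgetter(column), reverse=descending)
    return result
```

The same approach works in a table model's sort callback in any language whose sort is stable; with an unstable sort you would compare the columns one after another inside a single comparator instead.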

Hi (sorry for my bad English).
This is a very old, humble example (adapted) of my own multiple-column sorting implementation.
Use the table and list context menus to generate a few sort formulas, then return to the table context menu to select the sort order. (Change the data table as convenient.)
Maybe it can give you some new ideas or perspective.
Playground example

Solr generate key

I'm working with Solr and indexing data from a DB.
When I import the data using an SQL query, I get some rows with the same key.
I need a way for Solr to generate a new field with a unique key.
How can I do that?
Thanks

I am not sure whether this is possible, but maybe you need to reconsider your logic here.
An indexing operation into Solr should be re-runnable. Imagine that one day you decide to change the schema of your core and have to re-import everything.
If you generate a new key every time you import a document, you will end up creating duplicate items when you re-run your data import.
Maybe you need to revisit your DB design to have a unique key, or maybe in the select query you can create a derived column whose value is calculated from multiple columns. But I am sure that pushing this problem to Solr is not the solution.
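To illustrate why a derived key keeps the import re-runnable, here is a small Python sketch. It uses a plain dict as a stand-in for the index, and the column names and helper functions are made up; it does not call Solr at all.

```python
def composite_key(row, columns, sep="|"):
    """Derive a stable key by joining the values of several columns."""
    return sep.join(str(row[c]) for c in columns)


def import_rows(index, rows, columns):
    """Upsert rows into a dict-based stand-in for the index."""
    for row in rows:
        # The same row always maps to the same key, so re-running the
        # import overwrites the existing entry instead of duplicating it.
        index[composite_key(row, columns)] = row
    return index


# Hypothetical rows where "id" alone is not unique but (id, type) is.
rows = [
    {"id": 1, "type": "book", "title": "A"},
    {"id": 1, "type": "dvd", "title": "B"},
]
index = {}
import_rows(index, rows, ["id", "type"])
import_rows(index, rows, ["id", "type"])  # second run, no duplicates
```

With a freshly generated key per import, the second run would have doubled the number of entries.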

Ideally the unique key should come from the DB (are you sure you cannot get one, e.g. by composing some columns?).
But if you cannot, Solr supports UUID generation for this; how it works depends on your Solr version.

How do you implement a fulltext search over multiple columns in sql server?

I am trying to implement a full-text search on two columns, VendorName and ProductName, which I created a view for. I have the full-text index etc. working, but the actual query is causing some issues for me.
I want users to be able to use some standard search conventions (AND, OR, NOT, and grouping of terms with parentheses), which is fine, but I want to apply the search over both columns. For example, if I run a query such as:
SELECT * FROM vw_Search
WHERE CONTAINS((VendorName, ProductName), 'Apple AND iTunes')
It seems to apply the query to each column individually, i.e. it checks the vendor name for both terms and then checks the product name for both terms, which won't match unless the vendor is called "Apple iTunes".
If I change the query to :
SELECT * FROM vw_Search
WHERE CONTAINS(VendorName, 'Apple OR iTunes')
AND CONTAINS(ProductName, 'Apple OR iTunes')
then it works, but it breaks other search queries (such as searching for just "apple"). It also makes little sense from the point of view of the user writing the query: they are likely to write AND, but it requires an OR to work.
What I want is for a row to be returned if the search expression is satisfied across the two columns combined, so it would match all vendors named Apple with a product named iTunes, for example.
Should I create a separate field in the view that concatenates the Vendor and Product fields and run the first query on that new field, or is there something I am missing?
Aside from that, does anyone know of an existing method of validating the queries?

In earlier versions of SQL Server, the queries matched across multiple columns.
However, this was considered a bug.
To match across multiple columns, you should concatenate them in a computed column and create an index over that column.
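The difference between the two behaviours can be shown with a toy Python simulation. It only models an AND of whole words and is in no way how SQL Server's full-text engine works internally; the function names are invented for the example.

```python
def contains_all(text, terms):
    """True if every term appears as a whole word in text (case-insensitive)."""
    words = set(text.lower().split())
    return all(term.lower() in words for term in terms)


def match_per_column(row, columns, terms):
    """Per-column semantics: some single column must contain all the terms."""
    return any(contains_all(row[c], terms) for c in columns)


def match_concatenated(row, columns, terms):
    """Concatenated semantics: the terms may be spread across the columns."""
    combined = " ".join(row[c] for c in columns)
    return contains_all(combined, terms)


row = {"VendorName": "Apple", "ProductName": "iTunes"}
cols = ["VendorName", "ProductName"]
per_column = match_per_column(row, cols, ["Apple", "iTunes"])
concatenated = match_concatenated(row, cols, ["Apple", "iTunes"])
```

Here per_column is False and concatenated is True, which is why indexing a concatenated computed column makes an "Apple AND iTunes" search behave the way users expect.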

Full-text Search on Joined, Hierarchical Records in SQL Server 2008

Probably a noob question, but I'll go for it nevertheless.
For the sake of example, I have a Person table, a Tag table and a ContactMethod table. A Person will have multiple Tag records and multiple ContactMethod records associated with them.
I'd like to have a forgiving search which will search among several fields from each table. So I can find a person by their email (via ContactMethod), their name (via Person) or a tag assigned to them.
As a complete noob to FTS, two approaches come to mind:
Build some complex query which addresses each field individually
Build some sort of lookup table which concatenates the fields I want to index and just do a full-text query on that derived table.
(Feel free to edit for clarity; I'm not in it for the rep points.)

If your SQL Server edition supports it, you can create an indexed view and full-text search that; you can use containstable(*,'"chris"') to search all the columns.
If it doesn't support it, then since the fields all come from different tables, I think, for scalability, that if you can easily populate the fields into a single row per record in a separate table, I would full-text search that table rather than the individual records. You will end up with a less complex FTS catalog, and your queries will not need to run 4 full-text searches at a time. Running lots of separate FTS queries over different tables at the same time is a ticket to query performance issues in my experience. The downside is that you lose the ability to search for Surname on its own; if you need that, you might need to look at an alternative.
In our app we found that the single table was quicker (we can't rely on customers having Enterprise SQL at hand), so we populate the data, separated with spaces, into an FTS table through an update SP, and our main contact lookup then runs a search over it. We have two separate searches: one for finding things with precision (i.e. names or phone numbers) and one for free text. The other nice thing about the table is that it is relatively easy and cheap to add further columns to the lookup (we were asked for social security number, for example; we just added the column to the update SP and we were away with little or no impact).

One possibility is to make a view which has these columns: PersonID, ContentType, Content. ContentType would be something like "Email", "PhoneNumber", etc., and Content would hold the corresponding value. You'd search on the Content column, and you'd be able to see the person's ID. I'm not 100% sure how full-text search works though, so I'm not sure whether you could use it on a view.
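The shape of that view can be sketched in Python. The record layout, the field names and the naive substring search are all assumptions made for illustration; a real setup would do the flattening in SQL and search with the full-text engine.

```python
def flatten(person):
    """Turn one person record into (PersonID, ContentType, Content) rows."""
    pid = person["id"]
    rows = [(pid, "Name", person["name"])]
    rows += [(pid, "Tag", tag) for tag in person["tags"]]
    rows += [(pid, m["type"], m["value"]) for m in person["contact_methods"]]
    return rows


def search(rows, term):
    """Return the IDs of people with any Content value containing term."""
    term = term.lower()
    return sorted({pid for pid, _, content in rows if term in content.lower()})


# Hypothetical sample record.
chris = {
    "id": 7,
    "name": "Chris Smith",
    "tags": ["customer", "vip"],
    "contact_methods": [{"type": "Email", "value": "chris@example.com"}],
}
rows = flatten(chris)
```

Because every searchable value ends up in the one Content column, a single query covers names, tags and contact methods, and the ContentType column tells you which kind of field matched.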

The FTS can search multiple fields out of the box. The CONTAINS predicate accepts a list of columns to search, as does CONTAINSTABLE.
