Spark GraphX: strongly connected components

I'm new to Spark and GraphX. I tried to run strongly connected components, but I only get the triplets that are connected.
What I'm looking for is every group of vertices that is strongly connected, including single nodes.
Example:
Vertex Edge description
1 2 rule1
1 3 rule1
2 3 rule1
3 4 rule1
4 5 rule1
5 6 rule1
5 7 rule1
9 10 rule2
10 11 rule2
10 12 rule2
Output of strongly connected components:
(1,2,3) - Rule1
(4) - Rule1
(5,6,7) - Rule1
(9,10,11,12) - Rule2
I believe I have explained the use case correctly. Please let me know if you need further details.
The final goal is to assign one user-defined ID to each strongly connected group.

I use PySpark, and if I run your example I would probably get the output as a DataFrame like the following:
+---+-----------+
|id | component |
+---+-----------+
|1  |rule1      |
|2  |rule1      |
|3  |rule1      |
|4  |rule1      |
|5  |rule1      |
|6  |rule1      |
|7  |rule1      |
|9  |rule2      |
|10 |rule2      |
|11 |rule2      |
|12 |rule2      |
+---+-----------+
As you may know, PySpark is the Python API for Spark and still calls into the Scala/JVM implementation underneath. I don't know why Scala prints the result on separate lines (maybe because Spark applies the algorithm in parallel and prints the output once convergence is reached). However, I think you can aggregate the result and say that 1,2,3,4,5,6,7 are strongly connected and belong to the rule1 group, and that 9,10,11,12 belong to rule2.
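The aggregation suggested above can be sketched in plain Python. This is only an illustration under assumptions: the rows are typed in by hand to mirror the (id, component) table rather than collected from a real DataFrame, and the helper names and the "G1", "G2" id scheme are hypothetical.

```python
from collections import defaultdict

# Rows mirroring the (id, component) table above, typed in by hand.
rows = [
    (1, "rule1"), (2, "rule1"), (3, "rule1"), (4, "rule1"),
    (5, "rule1"), (6, "rule1"), (7, "rule1"),
    (9, "rule2"), (10, "rule2"), (11, "rule2"), (12, "rule2"),
]


def group_by_component(rows):
    """Collect the vertex ids that share a component label."""
    groups = defaultdict(list)
    for vertex_id, component in rows:
        groups[component].append(vertex_id)
    return {component: sorted(ids) for component, ids in groups.items()}


def assign_group_ids(groups, prefix="G"):
    """Give each component a user-defined id (hypothetical G1, G2, ... scheme)."""
    ordered = sorted(groups)
    return {component: f"{prefix}{n}" for n, component in enumerate(ordered, start=1)}


groups = group_by_component(rows)
group_ids = assign_group_ids(groups)
```

Here `groups` maps each component label to its vertices and `group_ids` carries the one-id-per-group assignment the question asks for.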

Related

Matching and labeling

Good morning all. I'm trying to write some of my first scripts and I'm having a difficult time doing so. I'm trying to match data from one file to another and add a label to the matching row in the original.
I'm using two different data sources to accomplish this, and there are tens of thousands of rows to match. I'm trying to take a column of zip codes in data source one, match it to the same zip codes in data source two, and add a new column labeling the location in data source one. See the example below.
Data Source One:
| A     | B |
| 13329 | X |
| 22193 | X |
| 13211 | X |
Data Source Two:
| A     | B          |
| 13211 | Syracuse   |
| 22193 | D.C. Metro |
| 13329 | Utica Rome |
New Data Source One:
| A     | B | C          |
| 13329 | X | Utica-Rome |
| 22193 | X | D.C. Metro |
| 13211 | X | Syracuse   |
New Data Source One is the desired end state. Some rows will have no matching label; those can be labeled N/A or NA (either is fine). I hope I have explained the problem and the desired result well enough. Please help.
This is more commonly called joining than matching and labeling.
join <(sort DS1) <(sort DS2)
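For readers who prefer a script, the same join can be sketched in Python. This is a minimal sketch, assuming both sources have already been read into lists of tuples; the function name `label_rows` and the extra zip code 99999 are hypothetical and only there to show the N/A case.

```python
def label_rows(source_one, source_two, missing="N/A"):
    """Left join: keep every row of source one and append its label, if any."""
    lookup = dict(source_two)  # zip code -> location label
    return [
        (zip_code, value, lookup.get(zip_code, missing))
        for zip_code, value in source_one
    ]


ds1 = [
    ("13329", "X"),
    ("22193", "X"),
    ("13211", "X"),
    ("99999", "X"),  # hypothetical row with no match in source two
]
ds2 = [
    ("13211", "Syracuse"),
    ("22193", "D.C. Metro"),
    ("13329", "Utica Rome"),
]

labeled = label_rows(ds1, ds2)
for row in labeled:
    print(" | ".join(row))
```

Building the lookup dictionary once keeps the match at one pass over each source, which matters with tens of thousands of rows.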

How can I traverse a hierarchy to find circular references?

I have a table that maps employees for profit sharing. It allows a given employee to assign a percentage of their take to another employee. This means there isn't necessarily a single entry for a given employee, there could be 0 or more. Alice could give 5% to Bob and 3% to Charlie, so Alice would have two records in the table.
The table has three fields, FromEmployeeId of type int, ToEmployeeId of type int, PctShare of type float.
When a user attempts to add a new mapping record I need to make sure that it won't result in a circular reference.
The problem is that there are multiple hierarchy levels and each employee can also have multiple mappings. Initially I thought a recursive CTE might work, but I don't think it will cope with the multiple mappings.
Here's an example:
Bob maps to Charlie and Yusuf
Charlie maps to David and William
Yusuf has no mappings
David maps to Alice and Vince
Vince has no mappings
Alice maps to Tucker
Tucker has no mappings
or in tabular form:
---------------------
| From    | To      |
---------------------
| Bob     | Charlie |
| Bob     | Yusuf   |
| Charlie | David   |
| Charlie | William |
| David   | Alice   |
| David   | Vince   |
| Alice   | Tucker  |
---------------------
If a user attempted to add a mapping record from Alice -> Bob, I want to recognize that that results in a circular reference because Bob -> Charlie -> David -> Alice.
Is my only option something like a cursor and looping through each of the result sets?
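The check itself is a reachability question: adding From -> To closes a loop exactly when From can already be reached from To. The sketch below illustrates that traversal in Python rather than SQL, assuming the mapping rows have been loaded as (from, to) pairs; the function name `would_create_cycle` is hypothetical.

```python
from collections import defaultdict


def would_create_cycle(mappings, new_from, new_to):
    """Return True if adding new_from -> new_to would create a circular reference."""
    if new_from == new_to:
        return True
    # Adjacency list: one employee may map to several others.
    graph = defaultdict(list)
    for src, dst in mappings:
        graph[src].append(dst)
    # Depth-first walk from new_to; reaching new_from means a loop would close.
    stack, seen = [new_to], set()
    while stack:
        current = stack.pop()
        if current == new_from:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(graph.get(current, []))
    return False


mappings = [
    ("Bob", "Charlie"), ("Bob", "Yusuf"),
    ("Charlie", "David"), ("Charlie", "William"),
    ("David", "Alice"), ("David", "Vince"),
    ("Alice", "Tucker"),
]

print(would_create_cycle(mappings, "Alice", "Bob"))
print(would_create_cycle(mappings, "Yusuf", "Tucker"))
```

The `seen` set is what makes multiple mappings per employee harmless: each employee is expanded at most once, however many paths lead to them.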

Implementing a Model in a Relational Database

I have a super-class/subclass hierarchical relationship as follows:
Super-class: IT Specialist
Sub-classes: Databases, Java, UNIX, PHP
Given that each instance of a super-class may not be a member of a subclass and a super-class instance may be a member of two or more sub-classes, how would I go about implementing this system?
I haven't been given any attributes to assign to the entities so I find this very vague and I'm at a loss where to start.
To get started, you would have one table that contains all of your super-classes (in your example case, there would only be IT Specialist, but it could also contain things like Networking Specialist, or Digital Specialist). I've included these to give a bit more flavour:
ID | Name |
-----------------------------
1 | IT Specialist |
2 | Networking Specialist |
3 | Digital Specialist |
You also would have another table that contains all of your sub-classes:
ID | Name |
--------------------
1 | Databases |
2 | Java |
3 | UNIX |
4 | PHP |
For example, let's say that a Networking Specialist needs to know about Databases, and a Digital Specialist needs to know about both Java and PHP. An IT Specialist would need to know all four fields listed above.
There are two possible ways to go about this. One such way would be to set 'flags' in the sub-class table:
ID | Name | Is_IT | Is_Networking | Is_Digital
----------------------------------------------------
1 | Databases | 1 | 1 | 0
2 | Java | 1 | 0 | 1
3 | UNIX | 1 | 0 | 0
4 | PHP | 1 | 0 | 1
Keep in mind, this is only using a small number of skills. If you started to have a lot of super-classes, the columns in the sub-class table could get out of hand pretty quickly.
Fortunately, you can also use something known as a bridging table (also known as an associative entity). Essentially, a bridging table holds two foreign keys, each referencing the primary key of another table, which solves the problem of modelling a many-to-many relationship.
You would set this up by having a new table that associates which sub-classes belong with which super-classes:
ID | Sub-class ID | Super-class ID |
-------------------------------------
1 | 1 | 1 |
2 | 1 | 2 |
3 | 2 | 1 |
4 | 2 | 3 |
5 | 3 | 1 |
6 | 4 | 1 |
7 | 4 | 3 |
Note that there are 'duplicates' in both the sub-class ID and super-class ID fields, yet no duplicates in the ID field. This is because the bridging table has unique IDs, which it uses to make independent associations. Sub-class 1 (Databases) needs to be associated to two different groups (IT Specialist and Networking Specialist). Thus, two different associations need to be formed.
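As a rough illustration, the bridging table can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table and column names (`super_class`, `sub_class`, `class_link`) and the helper `sub_classes_for` are hypothetical, and the rows are copied from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE super_class (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sub_class (id INTEGER PRIMARY KEY, name TEXT);
-- Bridging table: one row per (sub-class, super-class) association.
CREATE TABLE class_link (
    id INTEGER PRIMARY KEY,
    sub_class_id INTEGER REFERENCES sub_class(id),
    super_class_id INTEGER REFERENCES super_class(id)
);
INSERT INTO super_class VALUES
    (1, 'IT Specialist'), (2, 'Networking Specialist'), (3, 'Digital Specialist');
INSERT INTO sub_class VALUES
    (1, 'Databases'), (2, 'Java'), (3, 'UNIX'), (4, 'PHP');
INSERT INTO class_link VALUES
    (1, 1, 1), (2, 1, 2), (3, 2, 1), (4, 2, 3), (5, 3, 1), (6, 4, 1), (7, 4, 3);
""")


def sub_classes_for(super_name):
    """List the sub-class names linked to one super-class."""
    rows = conn.execute(
        """
        SELECT sub_class.name
        FROM class_link
        JOIN sub_class ON sub_class.id = class_link.sub_class_id
        JOIN super_class ON super_class.id = class_link.super_class_id
        WHERE super_class.name = ?
        ORDER BY sub_class.id
        """,
        (super_name,),
    ).fetchall()
    return [name for (name,) in rows]


print(sub_classes_for("Digital Specialist"))
```

Adding a new super-class or a new association is then just another inserted row, with no change to the table layout.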
Both approaches above give the same 'result'. The only real difference here is that a bridging table will give you more rows, while setting multiple flags will give you more columns. Obviously, the way in which you craft your query will be different as well.
Which of the two approaches you choose to go with really depends on how much data you're dealing with, and how much scope the database is going to have for expansion in the future :)
Hope this helps! :)

PHP deleting records on page without deleting in Database

I would like to know whether there is any way to delete a record from the page without deleting it from the database.
I'm building a shopping cart and would like users to be able to view their past transactions.
Is there any way to delete the records in the cart after checkout while still being able to view past transactions?
I haven't written any code yet; I'm just looking for advice.
Yes. You can have a database with the products, a second with your users, and a third one to associate each user's id with the ids of the products they purchased.
You never need to delete an item from the database. You can store a status such as 'active/inactive' instead of actually deleting it.
Edit
When you have products in the cart, the list is usually saved temporarily in cookies, by the browser, with JavaScript, or in some other way (if your country forbids cookies, you will have to look for alternatives). Once the purchase is finished, the cart list is saved in the database and the cookie (or whatever you used) is cleared. You do not delete anything from the database.
Example
Users/ buyers database
+--------+----------+
| idUser | nameUser |
+--------+----------+
| 1 | Antony |
| 2 | Betty |
| 3 | Carl |
+--------+----------+
Products database
+-----------+--------------+-------+-----------+
| idProduct | nameProduct | price | available |
+-----------+--------------+-------+-----------+
| 1 | Apple dozen | 10.00 | yes |
| 2 | Banana unity | 20.00 | yes |
| 3 | Cherry kg | 30.00 | yes |
+-----------+--------------+-------+-----------+
Active/inactive example
Notice that you never need to delete a record. You can add an option such as available to products, as in the example above, and set it to yes or no. Your front end then shows only the products whose available value is yes, while your back end can still see all of them. This is very useful: if a product has dozens of fields and will only be unavailable for 3 weeks, deleting it means wasting time retyping and saving everything again later. It is much better to just mark it as shown or hidden. (Some stores choose to show everything and display an alert when a product is unavailable.)
Keep in mind that this is not a rule for everything. Always look for the best performance. If you have a database with 10k unavailable products that will never be available again and only 2 available items, it makes little sense to keep all those records alive.
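The available flag can be sketched with Python's built-in sqlite3 module (the question is about PHP, so treat this only as an illustration of the queries involved). The column names follow the products table above; the helper names `hide_product`, `storefront_products` and `total_products` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "idProduct INTEGER PRIMARY KEY, nameProduct TEXT, price REAL, available TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Apple dozen", 10.00, "yes"),
        (2, "Banana unity", 20.00, "yes"),
        (3, "Cherry kg", 30.00, "yes"),
    ],
)


def hide_product(product_id):
    """'Delete' a product from the page by flipping its flag; the row stays."""
    conn.execute(
        "UPDATE products SET available = 'no' WHERE idProduct = ?", (product_id,)
    )


def storefront_products():
    """What the front end shows: only products still marked available."""
    rows = conn.execute(
        "SELECT nameProduct FROM products WHERE available = 'yes' ORDER BY idProduct"
    ).fetchall()
    return [name for (name,) in rows]


def total_products():
    """What the back end sees: every row, hidden or not."""
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]


hide_product(2)
print(storefront_products(), total_products())
```

Making the product visible again is a single UPDATE back to 'yes', with none of its other fields lost.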
About price history
Price history database
+--------+--------+-----------+-----+----------+----------+
| idItem | idUser | idProduct | qty | subtotal | date |
+--------+--------+-----------+-----+----------+----------+
| 1 | 1 | 1 | 2 | 20 | 10/10/13 |
| 2 | 1 | 2 | 1 | 20 | 10/10/13 |
| 3 | 2 | 1 | 1 | 10 | 11/12/13 |
| 4 | 2 | 2 | 1 | 20 | 11/12/13 |
| 5 | 2 | 3 | 1 | 30 | 11/12/13 |
| 6 | 1 | 1 | 1 | 10 | 01/06/14 |
+--------+--------+-----------+-----+----------+----------+
It is always good to have a unique identifier for each row in a database. Well, in most cases. It is not very readable for humans, but it works well for machines. Look at this database. Lines 1 and 2 tell us that on 10/10/13 user 1 bought 2 of the item with id 1 and 1 of the item with id 2. Translating: Antony bought 2 Apple dozen and 1 Banana unity.
In this database you could filter rows by user and then group by date. You would see that Antony bought more apples on 01/06/14.
I personally like to save the subtotal in the database. If the apple price rises to 15.00 in 2015, you and the user can still see that he paid 10.00 in 2013, and you can divide the subtotal by the quantity to get the unit price.
If you read this third database carefully, you will see that Betty bought one of each item on 11/12/13 and Carl has never bought anything.
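The filtering, grouping and reverse calculation described above can be sketched in plain Python. This assumes the history rows have been loaded as tuples in the same column order as the table; the function names `purchases_by_date` and `unit_price` are hypothetical.

```python
from collections import defaultdict

# (idItem, idUser, idProduct, qty, subtotal, date), copied from the table above.
history = [
    (1, 1, 1, 2, 20, "10/10/13"),
    (2, 1, 2, 1, 20, "10/10/13"),
    (3, 2, 1, 1, 10, "11/12/13"),
    (4, 2, 2, 1, 20, "11/12/13"),
    (5, 2, 3, 1, 30, "11/12/13"),
    (6, 1, 1, 1, 10, "01/06/14"),
]


def purchases_by_date(history, user_id):
    """Filter rows by user, then group (idProduct, qty, subtotal) by date."""
    by_date = defaultdict(list)
    for _id_item, id_user, id_product, qty, subtotal, date in history:
        if id_user == user_id:
            by_date[date].append((id_product, qty, subtotal))
    return dict(by_date)


def unit_price(qty, subtotal):
    """Recover the price paid per unit from a stored subtotal."""
    return subtotal / qty


antony = purchases_by_date(history, 1)
carl = purchases_by_date(history, 3)
print(antony)
print(unit_price(2, 20))
```

Because the subtotal is stored with each purchase, a later price change in the products database does not alter what the history reports.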
I believe and hope this will help you.
It is just a simplified example of how the logic works. I matched products directly with buyers, but most stores add a fourth database just to list orders, so that users relate to orders and orders relate to products. Everything has its advantages and disadvantages.

Extjs custom grid row grouping

I ran into some problems creating a table with Ext JS. My table has a complicated structure:
-------------------------------------------|
| | | 4 |
| | 2 ---------|
| | | 5 |
| 1 |---------------------------|
| | | 6 |
| | 3 ---------|
| | | 7 |
-------------------------------------------|
The data from the server is as follows:
1 2 4
1 2 5
1 3 6
1 3 7
Each row is an array.
I need the rows to be grouped as in the picture above.
Any ideas?
You can use PivotGrid. Example: http://dev.sencha.com/deploy/ext-3.4.0/examples/pivotgrid/simple.html
Unfortunately it is only available in Ext JS 3. It should be available in Ext JS 4.1 though.
It looks like you need to group on the first two columns but unfortunately Ext.data.Store only supports one level of grouping. You'll have to extend Store to support more and Ext.grid.feature.Grouping to take advantage of it.
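Independent of the grid component, the two-level grouping itself is a small data transformation. The sketch below shows it in Python purely as an illustration of the shape the grid needs; it is not Ext JS code, and the function name `group_rows` is hypothetical.

```python
def group_rows(rows):
    """Nest rows by their first column, then by their second column."""
    tree = {}
    for first, second, leaf in rows:
        tree.setdefault(first, {}).setdefault(second, []).append(leaf)
    return tree


rows = [[1, 2, 4], [1, 2, 5], [1, 3, 6], [1, 3, 7]]
grouped = group_rows(rows)
print(grouped)
```

Each key of the outer dictionary corresponds to a cell spanning all of its rows, and each inner key to a cell spanning its leaf values.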
