Add a unique id for JSON data with R (nested arrays)

I am a beginner with R. I have a JSON dataset with a nested array; in the file, one institution looks like this:
{
  "website": "www.123.org",
  "programs": [
    {
      "website": "www.111.com",
      "contact": "Jim"
    },
    {
      "website": "www.222.com",
      "contact": "Han"
    }
  ]
}
Each institution may have one program or several. There are more than one hundred institutions and nearly two hundred programs in the JSON. I want to add an id for each institution and an idpr for each program. Finally, I hope to get a data.frame that looks like this:
id idpr website     websitepr   contactpr
1  1    www.123.org www.111.com Jim
1  2    www.123.org www.222.com Han
2  1    www.345.org www.aaa.com Lily
3  1    www.567.org www.bbb.com Jack
3  2    www.567.org www.ccc.com Mike
3  3    www.567.org www.ddd.com Minnie
.........
I tried to write a nested loop like this:
count <- 0
for (n in json_data) {
  count <- count + 1
  id <- c(id, count)
  website <- c(website, n$website)
  countpr <- 1
  for (i in n$programs) {
    id <- c(id, count)
    website <- c(website, n$website)
    idpr <- c(idpr, countpr)
    websitepr <- c(websitepr, i$website)
    contactpr <- c(contactpr, i$contact)
    countpr <- countpr + 1
  }
}
but this nested loop does not give me the result I want (id and website are appended in both the outer and the inner loop, so each institution gets one row too many). Thanks for helping me!

Try this:
# sample data
json.file <- textConnection('[{"website":"www.123.org","programs":[{"website":"www.111.com","contact":"Jim"},{"website":"www.222.com","contact":"Han"}]},{"website":"www.345.org","programs":[{"website":"www.aaa.com","contact":"Lily"}]},{"website":"www.567.org","programs":[{"website":"www.bbb.com","contact":"Jack"},{"website":"www.ccc.com","contact":"Mike"},{"website":"www.ddd.com","contact":"Minnie"}]}]')

# read the data into an R nested list
library(rjson)
raw.data <- fromJSON(file = json.file)

# a function that will process one institution
process.one <- function(id, institution) {
  website   <- institution$website
  websitepr <- sapply(institution$programs, `[[`, "website")
  contactpr <- sapply(institution$programs, `[[`, "contact")
  data.frame(id, idpr = seq_along(websitepr),
             website, websitepr, contactpr)
}

# run the function on all institutions and put the pieces together
do.call(rbind, Map(process.one, seq_along(raw.data), raw.data))
#   id idpr     website   websitepr contactpr
# 1  1    1 www.123.org www.111.com       Jim
# 2  1    2 www.123.org www.222.com       Han
# 3  2    1 www.345.org www.aaa.com      Lily
# 4  3    1 www.567.org www.bbb.com      Jack
# 5  3    2 www.567.org www.ccc.com      Mike
# 6  3    3 www.567.org www.ddd.com     Minnie
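As an aside (not part of the answer above), the jsonlite package can simplify this kind of nested JSON directly: its fromJSON turns the top-level array into a data.frame whose programs column is a list of per-institution data frames. A minimal sketch, shown with just the first institution of the sample data:
library(jsonlite)
# "programs" becomes a list column holding one data.frame per institution
raw <- fromJSON('[{"website":"www.123.org","programs":[{"website":"www.111.com","contact":"Jim"},{"website":"www.222.com","contact":"Han"}]}]')
do.call(rbind, lapply(seq_len(nrow(raw)), function(i) {
  pr <- raw$programs[[i]]
  data.frame(id = i, idpr = seq_len(nrow(pr)),
             website = raw$website[i],
             websitepr = pr$website, contactpr = pr$contact)
}))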

You can write a class:
class Website {
    // all fields as data members
    // programs as a list of Program objects
}
and use the Jackson API to convert it into a JSON string with ObjectMapper.writeValueAsString(websiteObject). The jars needed are:
1. jackson-core-2.0.2.jar
2. jackson-databind-2.0.2.jar
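For reference, a minimal sketch of that approach (the class and field names are my assumptions, mirroring the JSON shape in the question):
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.List;

// hypothetical POJOs mirroring the JSON shape
class Program {
    public String website;
    public String contact;
}

class Institution {
    public String website;
    public List<Program> programs;
}

public class ToJson {
    public static void main(String[] args) throws Exception {
        Program p = new Program();
        p.website = "www.111.com";
        p.contact = "Jim";
        Institution inst = new Institution();
        inst.website = "www.123.org";
        inst.programs = Arrays.asList(p);
        // serialize the object graph to a JSON string
        String json = new ObjectMapper().writeValueAsString(inst);
        System.out.println(json);
    }
}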

Related

Webscraping to a DataFrame

I am trying to get information from a website into a DataFrame, but I'm having some trouble.
I have extracted the data, and now I'm trying to merge two DataFrames and reshape them into one. Here is what I have:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.civilaviation.gov.in/'
resp = requests.get(url)
soup = BeautifulSoup(resp.content.decode(), 'html.parser')
div = soup.find('div', {'class':'airport-col vande-bharat-col'})
div2 = soup.find('div', {'class':'airport-col airport-widget'})
div['class'] = 'Domestic traffic'
div2['class'] = 'International traffic'
dom = div.get_text()
intl = div2.get_text()
def str2frame(estr, sep='\n', lineterm='\n\n\n\n\n', set_header=True):
    dat = [x.split(sep) for x in estr.split(lineterm)][0:-1]
    df = pd.DataFrame(dat)
    if set_header:
        df = df.T.set_index(0, drop=True).T  # flip, set index, flip back
    return df
df1 = str2frame(dom)
df2 = str2frame(intl)
df1.rename(columns={"अन्तर्देशीय यातायात Domestic traffic On 29 Jan 2023":"Domestic Traffic"}, inplace=True)
df2.rename(columns={"अंतर्राष्ट्रीय यातायात International traffic On 29 Jan 2023":"International Traffic"}, inplace=True)
So now I get two separate DataFrames with all the information I want, but not in the format I want. Each DataFrame has shape (6, 2) (one of the columns is blank); I need them merged into a single DataFrame of shape (2, 6). So basically, this is what I have:
  Domestic Traffic
1 Departing flights     2,967
2 Departing Pax      4,24,224
3 Arriving flights      2,960
4 Arriving Pax       4,18,697
5 Aircraft movements    5,927
6 Airport footfalls  8,42,921
I would like to see two rows, one for domestic and one for international traffic, and each column based on the given values. I apologize if my question or my coding is unclear. This is my first time asking a question on this forum. Thank you for your help.
Not sure if this is the expected result but you could transform and concat your data:
pd.concat([
    df1.set_index(df1.columns[0]).T,
    df2.set_index(df2.columns[0]).T
]).reset_index()
Output:

                                                      0  Departing flights  Departing Pax  Arriving flights  Arriving Pax  Aircraft movements  Airport footfalls
0      अन्तर्देशीय यातायात Domestic traffic On 30 Jan 2023              2,862       4,07,957             2,864      4,04,799               5,726           8,12,756
1  अंतर्राष्ट्रीय यातायात International traffic On 30 Jan 2023                433         90,957               516        82,036                 949           1,72,993
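If the verbose bilingual labels are not wanted in the result, one possible cleanup (the short labels here are my own, not from the page) is to drop the old index and insert a compact key column:
out = pd.concat([
    df1.set_index(df1.columns[0]).T,
    df2.set_index(df2.columns[0]).T
]).reset_index(drop=True)
# replace the long scraped headings with short, assumed labels
out.insert(0, 'Traffic', ['Domestic', 'International'])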

Extract from Dictionary having multiple lists

I have a dataframe of dictionaries with multiple lists, and I would like to create a new dataframe by extracting a certain element, 'Student'.
1: {"Student": ["Grad", "School"], "Comments": "Finished Education"}
2: {"Student": ["New"], "Comments": "Started Education", "Location": ["USA", "China", "Australia"]}
3: {"Student": ["Middle", "School"], "ID": ["1000", "2000"]}
4: {"Student": ["Med", "School"]}
Expected output:
Student
Grad, School
New
Middle, School
Med, School
I have tried to read the dataframe into a dictionary, but was unable to retrieve only the 'Student' element from the dictionary.
data_dict = {}
df = df.toPandas()
for column in df.columns:
    data_dict[column] = df[column].values.tolist()
Student = [data for data in data_dict.values()]
First of all, what you have are not dictionaries and lists. In a Spark dataframe, what you call a dictionary is a struct, and what you call a list is an array. Data types can be inspected with df.printSchema().
You can extract the "Student" field from each struct, collect the results into an array, and finally explode it.
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[((["Grad","School"], "Finished Education"),(['New'], "Started Education", ["USA", "China", "Australia"]),(["Middle", "School"], ["1000", "2000"]),(["Med","School"],))],
'`1` struct<Student:array<string>,Comments:string>, `2` struct<Student:array<string>,Comments:string,Location:array<string>>, `3` struct<Student:array<string>,ID:array<string>>, `4` struct<Student:array<string>>')
arr = F.array([F.col(f'{c}.Student') for c in df.columns])
df = df.select(F.explode(arr).alias('Student'))
df.show()
# +----------------+
# | Student|
# +----------------+
# | [Grad, School]|
# | [New]|
# |[Middle, School]|
# | [Med, School]|
# +----------------+
Another working option:
to_melt = [f"`{c}`.Student" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) Student")
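If the expected output should contain plain strings such as Grad, School rather than arrays, the array column can be joined afterwards; a small follow-up sketch:
# join each array of words into a single comma-separated string
df = df.select(F.concat_ws(', ', 'Student').alias('Student'))
df.show()
# +--------------+
# |       Student|
# +--------------+
# |  Grad, School|
# |           New|
# |Middle, School|
# |   Med, School|
# +--------------+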

Print results of a loop in a dataframe

I wrote code that splits a dataframe data according to a factor a and, for each level of that factor, returns an ANOVA table for the factor b.
for (i in 1:length(levels(data$a))) {
  levels <- levels(data$a)
  assign(paste("data_", levels[i], sep = ""), subset(data, a == levels[i]))
  print(levels[i])
  print(anova(lm(var ~ b, subset(data, a == levels[i]))))
}
The result is exactly what I want, but I would like to have all the ANOVA tables pooled and returned as a single list or data frame.
Can anyone help?
Apparently this code does the trick:
result_anova <- data.frame()
levels <- levels(data$a)
for (i in 1:length(levels)) {
  assign(paste("data_", levels[i], sep = ""), subset(data, a == levels[i]))
  result <- as.data.frame(anova(lm(var ~ b, subset(data, a == levels[i]))))
  result_anova[i, 1] <- levels[i]
  result_anova[i, 2] <- result[1, 1]
  result_anova[i, 3] <- result[1, 2]
  result_anova[i, 4] <- result[1, 3]
  result_anova[i, 5] <- result[1, 4]
  result_anova[i, 6] <- result[1, 5]
  result_anova[i, 7] <- result[2, 1]
  result_anova[i, 8] <- result[2, 2]
  result_anova[i, 9] <- result[2, 3]
  result_anova[i, 10] <- result[2, 4]
  result_anova[i, 11] <- result[2, 5]
}
colnames(result_anova) <- c("genotype", "Df_fac", "Sum_Sq_fac", "Mean_Sq_fac",
                            "F_value_fac", "Pr(>F)_fac", "Df_res", "Sum_Sq_res",
                            "Mean_Sq_res", "F_value_res", "Pr(>F)_res")
Please vote this answer or let me know if this code can be improved.
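One possible improvement (a sketch assuming data has the columns a, b and var used above): lapply plus do.call(rbind, ...) avoids the manual column bookkeeping, since data.frame() prefixes the column names of each ANOVA row automatically:
rows <- lapply(levels(data$a), function(lev) {
  # one ANOVA table per level of a, flattened into a single row
  res <- as.data.frame(anova(lm(var ~ b, subset(data, a == lev))))
  data.frame(genotype = lev, fac = res[1, ], res = res[2, ])
})
result_anova <- do.call(rbind, rows)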

Sort by a key, but value has more than one element using Scala

I'm very new to Scala on Spark and wondering how you might create key value pairs, with the key having more than one element. For example, I have this dataset for baby names:
Year, Name, County, Number
2000, JOHN, KINGS, 50
2000, BOB, KINGS, 40
2000, MARY, NASSAU, 60
2001, JOHN, KINGS, 14
2001, JANE, KINGS, 30
2001, BOB, NASSAU, 45
And I want to find the most frequently occurring name for each county, regardless of the year. How might I go about doing that?
I did accomplish this using a loop; see below. But I'm wondering if there is a shorter way to do this that makes better use of Spark and Scala (i.e., can I decrease computation time?).
val names = sc.textFile("names.csv").map(l => l.split(","))
val uniqueCounty = names.map(x => x(2)).distinct.collect
for (i <- 0 to uniqueCounty.length - 1) {
  val county = uniqueCounty(i).toString
  val eachCounty = names.filter(x => x(2) == county)
    .map(l => (l(1), l(3).toInt))
    .reduceByKey((a, b) => a + b)
    .sortBy(-_._2)
  println("County:" + county + eachCounty.first)
}
Here is a solution using RDDs. I am assuming you need the top occurring name per county.
val data = Array((2000, "JOHN", "KINGS", 50), (2000, "BOB", "KINGS", 40), (2000, "MARY", "NASSAU", 60), (2001, "JOHN", "KINGS", 14), (2001, "JANE", "KINGS", 30), (2001, "BOB", "NASSAU", 45))
val rdd = sc.parallelize(data)
// Sum the numbers for each unique county/name combo key
val uniqNamePerCountyRdd = rdd.map(x => ((x._3, x._2), x._4)).reduceByKey(_ + _)
// Group names per county
val countyNameRdd = uniqNamePerCountyRdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
// Sort descending and take the top name alone per county
countyNameRdd.mapValues(x => x.toList.sortBy(-_._2).take(1)).collect
Output:
res8: Array[(String, List[(String, Int)])] = Array((KINGS,List((JOHN,64))), (NASSAU,List((MARY,60))))
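A variation on the same idea (a sketch, not part of the original answer): reduceByKey can keep only the running maximum per county, which avoids materializing every name list in memory the way groupByKey does:
// keep only the highest-total name per county
uniqNamePerCountyRdd
  .map { case ((county, name), total) => (county, (name, total)) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)
  .collect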
You could use spark-csv and the DataFrame API. If you are using the newer version of Spark (2.0) it is slightly different, since Spark 2.0 has a native csv data source based on spark-csv.
Use spark-csv to load your csv file into a Dataframe.
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(new File(getClass.getResource("/names.csv").getFile).getAbsolutePath)
df.show
Gives output:
+----+----+------+------+
|Year|Name|County|Number|
+----+----+------+------+
|2000|JOHN| KINGS|    50|
|2000| BOB| KINGS|    40|
|2000|MARY|NASSAU|    60|
|2001|JOHN| KINGS|    14|
|2001|JANE| KINGS|    30|
|2001| BOB|NASSAU|    45|
+----+----+------+------+
DataFrames offer a set of operations for structured data manipulation; you can combine a few basic ones to obtain your result.
import org.apache.spark.sql.functions._
df.select("County","Number").groupBy("County").agg(max("Number")).show
Gives output:
+------+-----------+
|County|max(Number)|
+------+-----------+
|NASSAU|         60|
| KINGS|         50|
+------+-----------+
Is this what you are trying to achieve?
Notice the import org.apache.spark.sql.functions._ which is needed for the agg() function.
More information about Dataframes API
EDIT
For correct output:
df.registerTempTable("names")
//there is probably a better query for this
sqlContext.sql("SELECT * FROM (SELECT Name, County, count(1) AS Occurrence FROM names " +
  "GROUP BY Name, County ORDER BY count(1) DESC) n")
  .groupBy("County", "Name").max("Occurrence").limit(2).show
Gives output:
+------+----+---------------+
|County|Name|max(Occurrence)|
+------+----+---------------+
| KINGS|JOHN|              2|
|NASSAU|MARY|              1|
+------+----+---------------+
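A window function is arguably a cleaner way to express "top name per county" in the DataFrame API. A sketch assuming the same df, summing Number per name the way the question's loop does:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val totals = df.groupBy("County", "Name").agg(sum("Number").as("Total"))
val byCounty = Window.partitionBy("County").orderBy(desc("Total"))
// keep only the highest-total row within each county partition
totals.withColumn("rank", row_number().over(byCounty))
  .filter(col("rank") === 1)
  .drop("rank")
  .show()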

Padding printed output of tabular data

I know this is probably dead simple, but I've got some data such as this in one file:
Artichoke
Green Globe, Imperial Star, Violetto
24" deep
Beans, Lima
Bush Baby, Bush Lima, Fordhook, Fordhook 242
12" wide x 8-10" deep
that I'd like to be able to format into a nice TSV type of table, to look something like this:
Name      | Varieties | Container Data
----------|-----------|---------------
some data here nicely padded with even spacing and right aligned text
Try String#rjust(width):
"hello".rjust(20) #=> " hello"
I wrote a gem to do exactly this: http://tableprintgem.com
No one has mentioned the "coolest" / most compact way -- using the % operator -- for example: "%10s %10s" % [1, 2]. Here is some code:
xs = [
  ["This code", "is", "indeed"],
  ["very", "compact", "and"],
  ["I hope you will", "find", "it helpful!"],
]
m = Array.new(xs.first.length, 0)
xs.each { |_| _.each_with_index { |e, i| s = e.size; m[i] = s if s > m[i] } }
xs.each { |x| puts m.map { |_| "%#{_}s" }.join(" " * 5) % x }
Gives:
      This code          is          indeed
           very     compact             and
I hope you will        find     it helpful!
Here is the code made more readable:
max_lengths = Array.new(xs.first.length, 0)
xs.each do |x|
  x.each_with_index do |e, i|
    s = e.size
    max_lengths[i] = s if s > max_lengths[i]
  end
end

xs.each do |x|
  format = max_lengths.map { |_| "%#{_}s" }.join(" " * 5)
  puts format % x
end
This is a reasonably full example that assumes the following:
- Your list of products is contained in a file called veg.txt
- Your data is arranged across three lines per record, with the fields on consecutive lines
- I am a bit of a noob to Ruby, so there are undoubtedly better and more elegant ways to do this
#!/usr/bin/ruby
class Vegetable
  @@max_name ||= 0
  @@max_variety ||= 0
  @@max_container ||= 0

  attr_reader :name, :variety, :container

  def initialize(name, variety, container)
    @name = name
    @variety = variety
    @container = container
    @@max_name = set_max(@name.length, @@max_name)
    @@max_variety = set_max(@variety.length, @@max_variety)
    @@max_container = set_max(@container.length, @@max_container)
  end

  def set_max(current, max)
    current > max ? current : max
  end

  def self.max_name
    @@max_name
  end

  def self.max_variety
    @@max_variety
  end

  def self.max_container
    @@max_container
  end
end

products = []
File.open("veg.txt") do |file|
  while name = file.gets
    name = name.strip
    variety = file.gets.to_s.strip
    container = file.gets.to_s.strip
    veg = Vegetable.new(name, variety, container)
    products << veg
  end
end

format = "%#{Vegetable.max_name}s\t%#{Vegetable.max_variety}s\t%#{Vegetable.max_container}s\n"
printf(format, "Name", "Variety", "Container")
printf(format, "----", "-------", "---------")
products.each do |p|
  printf(format, p.name, p.variety, p.container)
end
The following sample file
Artichoke
Green Globe, Imperial Star, Violetto
24" deep
Beans, Lima
Bush Baby, Bush Lima, Fordhook, Fordhook 242
12" wide x 8-10" deep
Potatoes
King Edward, Desiree, Jersey Royal
36" wide x 8-10" deep
Produced the following output
       Name                                      Variety             Container
       ----                                      -------             ---------
  Artichoke         Green Globe, Imperial Star, Violetto              24" deep
Beans, Lima Bush Baby, Bush Lima, Fordhook, Fordhook 242 12" wide x 8-10" deep
   Potatoes           King Edward, Desiree, Jersey Royal 36" wide x 8-10" deep
another gem: https://github.com/visionmedia/terminal-table
Terminal Table is a fast and simple, yet feature rich ASCII table generator written in Ruby.
I have a little function to print a 2D array as a table. Each row must have the same number of columns for this to work. It's also easy to tweak to your needs.
def print_table(table)
  # Calculate the maximum width of each column
  widths = []
  table.each { |line|
    c = 0
    line.each { |col|
      widths[c] = (widths[c] && widths[c] > col.length) ? widths[c] : col.length
      c += 1
    }
  }
  # Left-align the last column
  last = widths.pop
  format = widths.collect { |n| "%#{n}s" }.join(" ")
  format += " %-#{last}s\n"
  # Print each line
  table.each { |line|
    printf format, *line
  }
end
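For example, with the question's data (an illustrative call):
print_table([
  ["Name", "Varieties", "Container Data"],
  ["Artichoke", "Green Globe, Imperial Star, Violetto", '24" deep'],
  ["Beans, Lima", "Bush Baby, Bush Lima, Fordhook, Fordhook 242", '12" wide x 8-10" deep']
])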
Kernel.sprintf should get you started.
