character(0): can get the first table but cannot get the second table on page - screen-scraping

Trying to web-scrape from HTML, using SelectorGadget to find the CSS selectors.
I have no problems with the 1st table on the page. With the 2nd table, I get character(0) or an empty nodeset.
library(rvest)
library(plyr)

date1 <- 20161011
gdf1 <- data.frame(matrix(0, ncol = 11, nrow = 1))
newdate <- date1

# HOCKEY
year <- 2015
# date1 <- newdate[d]
date1 <- 2010
# for (yr in 1:10) {
date1 <- date1 + 1
c <- paste("https://www.hockey-reference.com/leagues/NHL_", date1, ".html", sep = "")
nbc <- read_html(c)
nbc
# tables <- html_nodes(nbc, ".center , #games , .right:nth-child(5), .right:nth-child(3), #games a")
tables <- html_nodes(nbc, "#stats .right , #stats a")  # returns {xml_nodeset (0)}
g <- html_text(tables)                                 # returns character(0)
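A likely cause: on hockey-reference.com (and other sports-reference sites) tables after the first are delivered inside HTML comments, so CSS selectors such as "#stats .right" find nothing in the parsed document. A minimal sketch of one workaround, re-parsing the comment nodes (an assumption about this page's markup, not something stated in the question):
library(rvest)
library(xml2)
page <- read_html("https://www.hockey-reference.com/leagues/NHL_2011.html")
# Extract all comment nodes, re-parse their text as HTML,
# and pull any tables hidden inside them.
comment_nodes <- xml_find_all(page, "//comment()")
hidden_doc <- read_html(paste(xml_text(comment_nodes), collapse = ""))
hidden_tables <- html_table(hidden_doc, fill = TRUE)
length(hidden_tables)  # the missing second table should be among these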

Related

Incrementing over a URL variable

import urllib2
import pandas as pd
from bs4 import BeautifulSoup

x = 0
i = 1
data = []
while (i < 13):
    soup = BeautifulSoup(urllib2.urlopen(
        'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex=' % i, +str(x)).read(), 'html')
    tableStats = soup.find("table", ("class", "playerTableTable tableBody"))
    for row in tableStats.findAll('tr')[2:]:
        col = row.findAll('td')
        try:
            name = col[0].a.string.strip()
            opp = col[1].a.string.strip()
            rec = col[10].string.strip()
            yds = col[11].string.strip()
            dt = col[12].string.strip()
            pts = col[13].string.strip()
            data.append([name, opp, rec, yds, dt, pts])
        except Exception as e:
            pass
    df = pd.DataFrame(data=data, columns=[
        'PLAYER', 'OPP', 'REC', 'YDS', 'TD', 'PTS'])
    df
    i += 1
I have been working on a fantasy football program and am trying to increment over all weeks so I can create a dataframe of the top 40 players for each week.
I can get any week of my choice by manually entering the week number in the scoringPeriodId part of the URL, but I am trying to increment it programmatically over each week to make it easier. I have tried using PeriodId='+ i +' and PeriodId=%d, but I keep getting various errors about str and int concatenation and bad operands. Any suggestions or tips?
Try removing the comma between % i and +str(x) so that the two strings are concatenated, and see if that helps.
soup = BeautifulSoup(urllib2.urlopen('http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex='%i, +str(x)).read(), 'html')
should be:
soup = BeautifulSoup(urllib2.urlopen('http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%d&seasonId=2018&startIndex='%i +str(x)).read(), 'html')
If you have problems concatenating or formatting the URL, build it in a variable instead of writing it all on one line inside BeautifulSoup and urllib2.urlopen.
Use parentheses to format with multiple values, like "before %s is %s" % (1, 0):
url = 'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%s&seasonId=2018&startIndex=%s' % (i, x)
# or
#url = 'http://games.espn.com/ffl/tools/projections?&slotCategoryId=4&scoringPeriodId=%s&seasonId=2018&startIndex=0' % i
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
Splitting the code up like this will not affect performance.

Manipulating character arrays quickly in R data.table [duplicate]

This question already has answers here:
Faster way to read fixed-width files
(4 answers)
Closed 4 years ago.
I have a huge dataset (14 GB, 200 Mn rows) consisting of a single character column. I've fread it (that took > 30 mins on a 48-core, 128 GB server). The string contains concatenated information on various fields. For instance, the first row of my table looks like:
2014120900000001091500bbbbcompany_name00032401
where the first 8 characters represent the date in YYYYMMDD format, the next 8 characters the id, the next 6 the time in HHMMSS format, the next 16 the name (prefixed with b's), and the last 8 the price (2 decimal places).
I need to split the above one-column data.table into 5 columns: date, id, time, name, price.
For the above character vector that will turn out to be: date = "2014-12-09", id = 1, time = "09:15:00", name = "company_name", price = 324.01
I am looking for a (very) fast and efficient dplyr / data.table solution. Right now I am doing it using substr:
date = as.Date(substr(d, 1, 8), "%Y%m%d");
and it's taking forever to execute!
Update: With readr::read_fwf I am able to read the file in 5-10 mins. Apparently, the reading is faster than fread. Below is the code:
f = "file_name";
num_cols = 5;
col_widths = c(8,8,6,16,8);
col_classes = "ciccn";
col_names = c("date", "id", "time", "name", "price");
# takes 5-10 mins
data = readr::read_fwf(file = f, col_positions = readr::fwf_widths(col_widths, col_names), col_types = col_classes, progress = T);
setDT(data);
# object.size(data) / 2^30; # 17.5 GB
A possible solution:
library(data.table)
library(stringi)
widths <- c(8,8,6,16,8)
sp <- c(1, cumsum(widths[-length(widths)]) + 1)
ep <- cumsum(widths)
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
which gives:
V1 V2 V3 V4 V5
1: 20141209 00000001 091500 bbbbcompany_name 00032401
Including some additional processing to get the desired result:
DT[, lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))
][, .(date = as.Date(V1, "%Y%m%d"),
id = as.integer(V2),
time = as.ITime(V3, "%H%M%S"),
name = sub("^(bbbb)","",V4),
price = as.numeric(V5)/100)]
which gives:
date id time name price
1: 2014-12-09 1 09:15:00 company_name 324.01
But you are actually reading a fixed-width file, so you could also consider read.fwf from base R or read_fwf from readr, or write your own fread.fwf function like I did a while ago:
fread.fwf <- function(file, widths, enc = "UTF-8") {
  sp <- c(1, cumsum(widths[-length(widths)]) + 1)
  ep <- cumsum(widths)
  fread(file = file, header = FALSE, sep = "\n", encoding = enc)[
    , lapply(seq_along(sp), function(i) stri_sub(V1, sp[i], ep[i]))]
}
Used data:
DT <- data.table(V1 = "2014120900000001091500bbbbcompany_name00032401")
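A quick usage sketch of fread.fwf on the file from the question (assuming the file name and widths from the update, with library(data.table) and library(stringi) loaded as above):
widths <- c(8, 8, 6, 16, 8)
DT <- fread.fwf("file_name", widths)
setnames(DT, c("date", "id", "time", "name", "price"))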
Maybe your solution is not so bad.
I am using this data:
df <- data.table(text = rep("2014120900000001091500bbbbcompany_name00032401", 100000))
Your solution:
> system.time(df[, .(date = as.Date(substr(text, 1, 8), "%Y%m%d"),
+ id = as.integer(substr(text, 9, 16)),
+ time = substr(text, 17, 22),
+ name = substr(text, 23, 38),
+ price = as.numeric(substr(text, 39, 46))/100)])
user system elapsed
0.17 0.00 0.17
@Jaap's solution:
> library(data.table)
> library(stringi)
>
> widths <- c(8,8,6,16,8)
> sp <- c(1, cumsum(widths[-length(widths)]) + 1)
> ep <- cumsum(widths)
>
> system.time(df[, lapply(seq_along(sp), function(i) stri_sub(text, sp[i], ep[i]))
+ ][, .(date = as.Date(V1, "%Y%m%d"),
+ id = as.integer(V2),
+ time = V3,
+ name = sub("^(bbbb)","",V4),
+ price = as.numeric(V5)/100)])
user system elapsed
0.20 0.00 0.21
An attempt with read.fwf:
> setClass("myDate")
> setAs("character","myDate", function(from) as.Date(from, format = "%Y%m%d"))
> setClass("myNumeric")
> setAs("character","myNumeric", function(from) as.numeric(from)/100)
>
> ff <- function(x) {
+ file <- textConnection(x)
+ read.fwf(file, c(8, 8, 6, 16, 8),
+ col.names = c("date", "id", "time", "name", "price"),
+ colClasses = c("myDate", "integer", "character", "character", "myNumeric"))
+ }
>
> system.time(df[, as.list(ff(text))])
user system elapsed
2.33 6.15 8.49
All outputs are the same.
Maybe try using a matrix of numerics instead of a data.frame; aggregation should take less time.

using lookup tables to plot a ggplot and table

I'm creating a Shiny app and I'm letting the user choose what data should be displayed in a plot and a table. This choice is made through 3 different input variables that contain 14, 4, and 2 choices respectively.
ui <- dashboardPage(
  dashboardHeader(),
  dashboardSidebar(
    selectInput(inputId = "DataSource", label = "Data source",
                choices = c("Restoration plots", "all semi natural grasslands")),
    # choicesVariables definition is omitted here, because it's very long,
    # but it contains 14 string values
    selectInput(inputId = "Variabel", label = "Variable",
                choices = choicesVariables),
    selectInput(inputId = "Factor", label = "Factor",
                choices = c("Company type", "Region and type of application",
                            "Approved or not approved applications", "Age group"))
  ),
  dashboardBody(
    plotOutput("thePlot"),
    tableOutput("theTable")
  ))
This adds up to 73 choices (yes, I know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table, so I created one with every valid combination of choices, like this:
rad1 <- c(rep("Company type", 20), rep("Region and type of application", 20),
          rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2 <- choicesVariables[c(1:14, 1, 4, 5, 9, 10, 11, 1:14, 1, 4, 5, 9, 10, 11,
                           1:7, 9:14, 1:14, 1, 4, 5, 9, 10, 11)]
rad3 <- c(rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 14), rep("all semi natural grasslands", 6),
          rep("Restoration plots", 27), rep("all semi natural grasslands", 6))
rad4 <- 1:73
letaLista <- data.frame(rad1, rad2, rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now it's easy to use subset to get only the choice that the user made. But how do I use this information to produce the plot and table without a 73-line ifelse statement?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots), but I couldn't make it work. My experience with this kind of array is limited and this might be a simple issue, but any hints would be helpful!
My dataset, which is the foundation for the plots and tables, consists of a dataframe with 23 variables, factors and numerical. The plots and tables are then created using the following code for all 73 combinations:
s_A1 <- summarySE(Samlad_info, measurevar = "Dist_brukcentrum",
                  groupvars = "Companytype")
s_A1 <- s_A1[2:6, ]
p_A1 <- ggplot(s_A1, aes(x = Companytype, y = Dist_brukcentrum)) +
  geom_bar(position = position_dodge(), stat = "identity") +
  geom_errorbar(aes(ymin = Dist_brukcentrum - se, ymax = Dist_brukcentrum + se),
                width = .2, position = position_dodge(.9)) +
  scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, borrowed from Cookbook for R:
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
                      conf.interval=.95, .drop=TRUE) {
  # Requires plyr for ddply() and rename()
  # New version of length which can handle NA's: if na.rm==T, don't count them
  length2 <- function (x, na.rm=FALSE) {
    if (na.rm) sum(!is.na(x))
    else length(x)
  }
  # This does the summary. For each group's data frame, return a vector with
  # N, mean, and sd
  datac <- ddply(data, groupvars, .drop=.drop,
                 .fun = function(xx, col) {
                   c(N    = length2(xx[[col]], na.rm=na.rm),
                     mean = mean(xx[[col]], na.rm=na.rm),
                     sd   = sd(xx[[col]], na.rm=na.rm)
                   )
                 },
                 measurevar
  )
  # Rename the "mean" column
  datac <- rename(datac, c("mean" = measurevar))
  datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean
  # Confidence interval multiplier for standard error
  # Calculate t-statistic for confidence interval:
  # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
  ciMult <- qt(conf.interval/2 + .5, datac$N-1)
  datac$ci <- datac$se * ciMult
  return(datac)
}
The code in its entirety is a bit too large to include, but I hope this clarifies what I'm trying to do.
Well, thanks to Florian's comment I think I might have found a solution myself. I'll present it here but leave the question open, as there are probably far neater ways of doing it.
I gathered the plots (which ggplot creates as list objects) and the tables into lists:
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id, which selects the right plot and table from the lists:
output$thePlot <- renderPlot({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  plotList[[as.integer(plotValue[1, 4])]]
})
output$theTable <- renderTable({
  plotValue <- subset(letaLista, letaLista$Factor == input$Factor &
                        letaLista$Variabel == input$Variabel &
                        letaLista$rest_alla == input$DataSource)
  skriva <- tableList[[as.integer(plotValue[1, 4])]]
  print(skriva)
})
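A slightly neater variant of the same idea (a sketch, assuming letaLista, plotList and tableList as defined above): compute the lookup id once in a reactive and index both lists with it.
selectedId <- reactive({
  # One subset shared by the plot and the table outputs
  hit <- subset(letaLista, Factor == input$Factor &
                  Variabel == input$Variabel &
                  rest_alla == input$DataSource)
  as.integer(hit$id[1])
})
output$thePlot  <- renderPlot({ plotList[[selectedId()]] })
output$theTable <- renderTable({ tableList[[selectedId()]] })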

Remove garbage values (#, $) from any string, and drop records that contain only garbage values, with multiple occurrences in multiple columns

I tried the code below to drop records that contain garbage values, with multiple occurrences in multiple columns. But I also want to remove the garbage values from strings with multiple occurrences in multiple columns.
Sample code:
import pyspark.sql.functions as f
from itertools import chain
from functools import reduce  # builtin on Python 2; needed as an import on Python 3
from pyspark.sql.types import BooleanType

filter_list = ['$', '#', '%', '#', '!', '^', '&', '*', 'null']

def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list]
                                  for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)

filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
In this example, "original" is my original dataframe and "compulsory_fields" is my array (it stores multiple column names).
Sample Input :-
id name salary
# Yogita 1000
2 Neha ##
3 #Jay$deep## 8000
4 Priya 40$00&
5 Bhavana $$%&^
6 $% $$&&
Sample Output :-
id name salary
3 Jaydeep 8000
4 priya 4000
Your requirements are not completely clear to me, but it seems you want to output records that are valid after removing the "garbage" characters. You can achieve this by adding a clean_special_characters udf that removes the special characters before running your filter_udf:
import pyspark.sql.functions as f
from itertools import chain
from functools import reduce  # builtin on Python 2; needed as an import on Python 3
from pyspark.sql.functions import regexp_replace, col
from pyspark.sql.types import BooleanType, StringType
rdd = sc.parallelize((
('#','Yogita','1000'),
('2', 'Neha', '##'),
('3', '#Jay$deep##','8000'),
('4', 'Priya', '40$00&'),
('5', 'Bhavana', '$$%&^'),
('6', '$%','$$&&'))
)
original = rdd.toDF(['id','name','salary'])
filter_list = ['$','#','%','#','!','^','&','*','null']
compulsory_fields = ['id','name','salary']
def clean_special_characters(input_string):
    cleaned_input = input_string.translate({ord(c): None for c in filter_list if len(c) == 1})
    if cleaned_input == '':
        return 'null'
    return cleaned_input

clean_special_characters_udf = f.udf(clean_special_characters, StringType())
original = original.withColumn('name', clean_special_characters_udf(original.name))
original = original.withColumn('salary', clean_special_characters_udf(original.salary))

def filterfn(*x):
    remove_garbage = list(chain(*[[filter not in elt for filter in filter_list]
                                  for elt in x]))
    return reduce(lambda x, y: x and y, remove_garbage, True)
filter_udf = f.udf(filterfn, BooleanType())
original = original.filter(filter_udf(*[col for col in compulsory_fields]))
original.show()
This outputs:
+---+-------+------+
| id| name|salary|
+---+-------+------+
| 3|Jaydeep| 8000|
| 4| Priya| 4000|
+---+-------+------+

How to apply function by groups in array in R? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I have a 3-d array with latitude, longitude and datetime (Year_Month_Day_Hour) dimensions. What is the best way in R to apply a function over the array by groups (in this case by year, month, or day)? The result should be an array of mean values, whose third dimension is the year, month, or day.
str(data)
num [1:7, 1:7, 1:5] 977 994 1010 1020 1026 ...
- attr(*, "dimnames")=List of 3
..$ : chr [1:7] "60" "57.5" "55" "52.5" ...
..$ : chr [1:7] "-30" "-27.5" "-25" "-22.5" ...
..$ : chr [1:5] "2014_10_01_00" "2014_10_01_06" "2014_10_01_12" "2014_10_01_18" ...
Example (truncated):
dput(data)
structure(c(977.2, 994.4, 1009.8, 1020.1, 1026.4, 1029.4, 1029.2,
978.7, 995.7, 1010.2, 1020.5, 1026.5, 1028.8, 1028.3, 982, 997.5,
1011.3, 1021.2, 1026.1, 1027.4, 1027.1, 986.2, 999.9, 1013, 1021.7,
1025.1, 1025.7, 1026, 990.6, 1002.7, 1014.5, 1021.3, 1023.9,
1024.7, 1025.6, 995.1, 1005.7, 1015.2, 1019.9, 1022.6, 1024.5,
1025.9, 999.1, 1008, 1015.1, 1018.6, 1021.8, 1024.5, 1026.6,
982.1, 998.9, 1011.8, 1020.1, 1025.5, 1028.4, 1028.8, 981.9,
999.3, 1012.7, 1021.2, 1026.4, 1028.8, 1029, 983.9, 1000.2, 1013.5,
1022.1, 1027, 1028.9, 1028.9, 987.1, 1001.8, 1014.6, 1022.7,
1027.3, 1028.6, 1028.2, 990.9, 1004.1, 1016.1, 1023.3, 1027.2,
1027.9, 1027.4, 995.1, 1006.9, 1017.8, 1023.8, 1026.8, 1027,
1026.9, 999.5, 1010.1, 1019.1, 1023.8, 1025.9, 1026.1, 1026.9,
990.3, 1002.3, 1010.9, 1018.3, 1024, 1027.6, 1028.6, 990.6, 1004.1,
1013.2, 1020.8, 1026.2, 1029.3, 1029.8, 992.1, 1005.5, 1015.2,
1023, 1028, 1030.4, 1030.5, 994.5, 1007, 1017.2, 1024.7, 1029.4,
1031, 1030.3, 997.4, 1008.8, 1019, 1025.7, 1030, 1031, 1029.8,
1000.1, 1010.9, 1020.9, 1026.5, 1030, 1030.6, 1029.5, 1002.9,
1013.3, 1022.6, 1027.2, 1029.7, 1029.7, 1029.2, 993.6, 997.5,
1001.3, 1007.4, 1015.5, 1022.7, 1026.4, 996.1, 1001.1, 1005.8,
1012.7, 1020.1, 1025.6, 1027.9, 998.4, 1004.5, 1010.4, 1017.6,
1023.8, 1027.6, 1029.1, 1000.2, 1007.3, 1014.4, 1021.5, 1026.4,
1029, 1029.7, 1002, 1010, 1017.8, 1024.3, 1028.4, 1029.9, 1029.6,
1004.3, 1012.9, 1020.7, 1026.3, 1029.7, 1030.2, 1029.3, 1006.9,
1016, 1023.2, 1027.7, 1030.3, 1029.7, 1028.6, 987.9, 989.6, 995.1,
1002.9, 1010.8, 1018.9, 1025.1, 989.8, 990, 995.1, 1004.7, 1013.9,
1021.8, 1026.8, 993.1, 992.6, 998.1, 1008.8, 1018, 1024.6, 1028.3,
996.9, 997.3, 1003.9, 1014, 1021.9, 1026.8, 1029.1, 1000.3, 1003.1,
1010.5, 1019, 1025.2, 1028.5, 1029.6, 1003.6, 1008.7, 1016.4,
1023.1, 1027.8, 1029.8, 1029.9, 1007.3, 1013.7, 1020.8, 1026.3,
1029.8, 1030.2, 1029.6), .Dim = c(7L, 7L, 5L), .Dimnames = list(
c("60", "57.5", "55", "52.5", "50", "47.5", "45"), c("-30",
"-27.5", "-25", "-22.5", "-20", "-17.5", "-15"), c("2014_10_01_00",
"2014_10_01_06", "2014_10_01_12", "2014_10_01_18", "2014_10_02_00"
)))
SOLUTION:
group <- as.factor(as.Date(dimnames(data)[[3]], format = "%Y_%m_%d"))
aperm(apply(data, c(1, 2), by, group, mean), c(2, 3, 1))
Here apply() runs by(values, group, mean) on the time series in each lat/lon cell, and aperm() moves the resulting group dimension back into third position.
First I would recommend tidying up your data. Right now we can't really tell what it looks like.
For grouping, create columns for your dates. I'm not sure what date "2014_10_01_00" might be, but if 2014 is the year and the month is October, split these into two columns. I also don't think storing longitude and latitude as type character makes sense; numeric might be better.
Second, check out the data.table package. It makes manipulating data (especially large datasets) a breeze.
To use a function over the data table by different groups, do
my_dt[ , lapply(.SD, my_func), by = c("year", "month")]
where year and month are column names in your data table.
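For instance, a sketch of that tidying step applied to the array from the question (assuming data is the 3-d array shown above; the column names here are illustrative):
library(data.table)

# Melt the 3-d array into long format; as.table() carries the dimnames along.
dt <- as.data.table(as.table(data))
setnames(dt, c("lat", "lon", "datetime", "value"))

# Split the "Y_m_d_H" dimname into grouping columns, then aggregate.
dt[, c("year", "month", "day", "hour") := tstrsplit(datetime, "_")]
dt[, .(mean_value = mean(value)), by = .(year, month, day)]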
Just specify the dimension as the second argument of the apply function. For example, summing with the date dimension as the margin:
> apply(array, 3, sum)
# 2014_10_01_00 2014_10_01_06 2014_10_01_12 2014_10_01_18 2014_10_02_00
# 49691.3 49782.3 49919.6 49851.4 49639.0
If your dimensions have names, you can also pass the name as a character string as the second argument.
EDIT
The OP wants the results grouped by date. This function may give guidance toward the desired result:
myapply <- function(array, d, fun){
  # Apply "fun" to "array" (an array with 3 dimensions),
  # grouped by d, which is a number between 1 and 4:
  # 1: year, 2: month, 3: day, 4: hour
  d.name <- strsplit(dimnames(array)[[3]], "_")
  # make groups
  names <- lapply(d.name, function(x, d)
    paste(x[1:d], collapse = "_"), d = d)
  groups <- unique(names)
  # get the indices for the groups
  indices <- lapply(groups, function(x, names)
    which(unlist(names) %in% x), names = names)
  # compute the function on the groups
  results <- lapply(indices, function(ind, arr, fun)
    fun(as.vector(arr[, , ind])), arr = array, fun = fun)
  names(results) <- unlist(groups)
  return(results)
}
Results:
# mean grouping by day
myapply(array, 3, mean)
# $`2014_10_01`
# [1] 1016.554
#
# $`2014_10_02`
# [1] 1013.041
# mean, grouping by hour
myapply(array, 4, mean)
# $`2014_10_01_00`
# [1] 1014.108
#
# $`2014_10_01_06`
# [1] 1015.965
#
# $`2014_10_01_12`
# [1] 1018.767
#
# $`2014_10_01_18`
# [1] 1017.376
#
# $`2014_10_02_00`
# [1] 1013.041
