LOAD CSV WITH HEADERS FROM
'file:///epl_mataches.csv' as row
MATCH (c1:Club {name:row.`Team1`}), (c2:Club {name:row.`Team2`})
MERGE (c1) -[f:FEATURED{
round:toInteger(row.Round),
date:row.Date,
homeTeamFTScore: toInteger(split(row.FT,"-" [0])),
awayTeamFTScore: toInteger(split(row.FT,"-" [1])),
homeTeamHTScore: toInteger(split(row.HT,"-" [0])),
awayTeamHTScore: toInteger(split(row.HT,"-" [1]))
}] -> (c2)
The error occurs when I try to create the relationships and pull the required information from the data file:
Neo.ClientError.Statement.SyntaxError
Type mismatch: expected List<T> but was String (line 7, column 45 (offset: 248))
" homeTeamFTScore: toInteger(split(row.FT,"-" [0])),"
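There is a typo in your script. Instead of
homeTeamFTScore: toInteger(split(row.FT,"-" [0])),
use
homeTeamFTScore: toInteger(split(row.FT,"-") [0]),
Note that the closing parenthesis of split() goes before [0], not after it. As written, [0] is applied to the string "-" rather than to the list returned by split(), which is why Neo4j reports that it expected a List but got a String. The same fix applies to the other three score properties.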
For example:
RETURN toInteger(split("2-test","-") [0]) as sample
result:
╒════════╕
│"sample"│
╞════════╡
│ 2 │
└────────┘
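The same pitfall exists in most languages: the index has to be applied to the list returned by the split call, not to the delimiter argument. As a rough analogy only, here is a minimal Python sketch of the score parsing; the function name parse_score is hypothetical and is not part of the Cypher script.

```python
from typing import Tuple


def parse_score(score: str) -> Tuple[int, int]:
    """Split a 'home-away' score string such as '2-1' into two integers."""
    parts = score.split("-")  # the split call is closed here ...
    return int(parts[0]), int(parts[1])  # ... and only then indexed


home, away = parse_score("2-1")
print(home, away)
```

Writing score.split("-"[0]) instead would index the delimiter string, which mirrors the mistake in the original query.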
I have two dataframes; each has an array(string) column.
I am trying to create a new dataframe that keeps only the rows where at least one array element in a row matches an element of the other array.
#first dataframe
main_df = spark.createDataFrame([('1', ['YYY', 'MZA']),
('2', ['XXX','YYY']),
('3',['QQQ']),
('4', ['RRR', 'ZZZ', 'BBB1'])],
('No', 'refer_array_col'))
#second dataframe
df = spark.createDataFrame([('1A', '3412asd','value-1', ['XXX', 'YYY', 'AAA']),
('2B', '2345tyu','value-2', ['DDD', 'YFFFYY', 'GGG', '1']),
('3C', '9800bvd', 'value-3', ['AAA']),
('3C', '9800bvd', 'value-1', ['AAA', 'YYY', 'CCCC'])],
('ID', 'Company_Id', 'value' ,'array_column'))
df.show()
+---+----------+-------+---------------------+
| ID|Company_Id|  value|         array_column|
+---+----------+-------+---------------------+
| 1A|   3412asd|value-1|      [XXX, YYY, AAA]|
| 2B|   2345tyu|value-2|[DDD, YFFFYY, GGG, 1]|
| 3C|   9800bvd|value-3|                [AAA]|
| 3C|   9800bvd|value-1|     [AAA, YYY, CCCC]|
+---+----------+-------+---------------------+
Code I tried:
The main idea is to use rdd.toLocalIterator(), because other functions inside the same for loop depend on these filters.
for x in main_df.rdd.toLocalIterator():
    a = x["refer_array_col"]
    b = x["No"]
    some_x_filter = F.col('array_column').isin(b)
    final_df = df.filter(
        # filter 1
        some_x_filter &
        # second filter is to compare 'a' with array_column - I tried using F.array_contains
        (F.array_contains(F.col('array_column'), F.lit(a)))
    )
some_x_filter works in a similar way: it compares a string value against an array-of-strings column.
But a now contains a list of strings, and I am unable to compare it with array_column.
With my code I get an error from array_contains:
Error
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList ['YYY', 'MZA']
Can anyone tell me what I can use for the second filter instead?
From what I understood from our conversation in the comments, your requirement is essentially to compare an array column with a Python list.
You can turn the list into an array column like this (where b is the Python list to compare against):
df.withColumn("asArray", F.array(*[F.lit(x) for x in b]))
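The snippet above only turns the Python list into an array column; the filter itself still has to test whether the two arrays share an element. In Spark 2.4 and later, F.arrays_overlap applied to array_column and the newly built array column should express that directly, though that is an assumption about the Spark version in use. The plain-Python sketch below only illustrates the intended semantics on the rows from the question; rows_with_overlap is a hypothetical helper, not a Spark API.

```python
def rows_with_overlap(rows, reference):
    """Keep rows whose array (the last field) shares at least one element with reference."""
    wanted = set(reference)
    return [row for row in rows if wanted.intersection(row[-1])]


# Rows copied from the second dataframe in the question.
rows = [
    ("1A", "3412asd", "value-1", ["XXX", "YYY", "AAA"]),
    ("2B", "2345tyu", "value-2", ["DDD", "YFFFYY", "GGG", "1"]),
    ("3C", "9800bvd", "value-3", ["AAA"]),
    ("3C", "9800bvd", "value-1", ["AAA", "YYY", "CCCC"]),
]

# Reference list taken from the first row of main_df.
matches = rows_with_overlap(rows, ["YYY", "MZA"])
print([row[0] for row in matches])
```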
I have a dataframe that looks like this
Company CompanyDetails
A [{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]
B [{"companyId": 1437, "companyAddress": 'sampleaddress2', "numOfEmployees": 50}]
C [{"companyId": 1452, "companyAddress": 'sampleaddress3', "numOfEmployees": 10000}]
When I execute df.dtypes I find that both the Company and CompanyDetails columns are objects.
df[['CompanyDetails']].iloc[0, :] returns '[{"companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500}]' (note the quotes ' ' around the whole array, i.e. the value is a string).
I am trying to extract the details within the dictionary in the CompanyDetails column so that I can add new columns to my dataframe to look like this:
Company CompanyId CompanyAddress numOfEmployees
A 1482 'sampleaddress1' 500
B 1437 'sampleaddress2' 50
C 1452 'sampleaddress3' 10000
I tried something like this as I was trying to convert the CompanyDetails column to contain arrays for all my values so I can easily extract each property in the object.
import ast
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
However, the above code caused this error
ValueError: malformed node or string: <ast.Name object at 0x000002D73D0C13A0>
Would appreciate any help on this, thanks!
You're getting the error because it's actually not valid JSON: numOfEmployees is not quoted, and JSON requires ALL keys to be double-quoted. ast.literal_eval has the same problem, since it sees the bare numOfEmployees as a variable name (the ast.Name in the error message).
The easiest and least fragile way I can think of to fix this is to repair the string using a regular expression replace:
df['CompanyDetails'] = df['CompanyDetails'].str.replace(r',\s*(\w+)\s*:', r', "\1":', regex=True)
Then do your other stuff:
import ast
import pandas as pd
df['CompanyDetails'] = df['CompanyDetails'].apply(ast.literal_eval)
df = pd.concat([df.drop('CompanyDetails', axis=1), pd.json_normalize(df['CompanyDetails'].explode())], axis=1)
...or whatever you have in mind.
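To see what the replace does without pandas, here is a minimal sketch on a single string shaped like the one in the question. The helper name quote_bare_keys is hypothetical, and the pattern only handles bare keys that follow a comma, so an unquoted first key would still fail.

```python
import ast
import re

# One cell value as described in the question: the last key is not quoted.
raw = '''[{"companyId": 1482, "companyAddress": 'sampleaddress1', numOfEmployees: 500}]'''


def quote_bare_keys(text):
    """Wrap unquoted keys that follow a comma in double quotes."""
    return re.sub(r',\s*(\w+)\s*:', r', "\1":', text)


repaired = quote_bare_keys(raw)
details = ast.literal_eval(repaired)  # now parses to a list containing one dict
print(details[0]["numOfEmployees"])
```

Keys that are already quoted are left alone, because the character after the comma and whitespace is a quote rather than a word character.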
You can use
import pandas as pd
# Test dataframe
df = pd.DataFrame({'Company':['A'], 'CompanyDetails':[[{"companyId": 1482, "companyAddress": 'sampleaddress1', "numOfEmployees": 500}]]})
df['CompanyDetails'] = df['CompanyDetails'].str[0]
df = pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1)
# => >>> df
# Company companyId companyAddress numOfEmployees
# 0 A 1482 sampleaddress1 500
Note:
df['CompanyDetails'] = df['CompanyDetails'].str[0] gets the first item from each list since each of them only contains one item
pd.concat([df.drop(['CompanyDetails'], axis=1), df['CompanyDetails'].apply(pd.Series)], axis=1) does the actual expansion and merging with the current dataframe.
I would like to put my summary statistics into a table using the kable function, but I cannot because it comes up as an array.
```{r setup options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 5, echo = TRUE)
library(mosaic)
library(knitr)
```
```{r}
sum = summary(SwimRecords$time) # generic data set included with mosaic package
kable(sum) # I want this to be printed into a table
```
Any suggestions?
You can do so easily with the broom package which is built to "tidy" these stats-related objects:
# install.packages("broom")
broom::tidy(sum)
I want to get the filename of the latest file in a directory, where "latest" is based on creation time.
Currently I am stuck on sorting the 2-dimensional array. How should I sort it? I get the following error:
ERROR: LoadError: MethodError: no method matching isless(::Array{Any,1}, ::Array{Any,1})
The 2-dimensional array looks like this:
Any[
Any[1.47913e9,"foo.csv"],
Any[1.47913e9,"bar.csv"],
Any[1.47913e9,"foobar.csv"]
]
newestfile.jl
dfolder = "C:\\Users\\Foo\\Downloads"
cd( dfolder )
dfiles = readdir( "." )
files=[]
#println( dfiles )
for file in dfiles
    created = ctime( file )
    push!(files, [created, file] )
end
println( files )
# sort by timestamp
sort!( files ) # This throws an error
# grab the newest file and display the filename
How do I display the newest file in the directory?
Try:
julia> sort!( files, by = e -> e[1])
The last item is the newest one:
julia> files[end]
2-element Array{Any,1}:
1.48061e9 ".whatever"
The filename is:
julia> files[end][2]
".whatever"
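For comparison, the same idea (select by the timestamp, i.e. the first element of each pair) as a short Python sketch. The helper names newest_name and newest_file are hypothetical. Note that st_ctime is the creation time on Windows but the last metadata change time on most Unix systems.

```python
import os


def newest_name(pairs):
    """Return the name from the (timestamp, name) pair with the largest timestamp."""
    return max(pairs, key=lambda pair: pair[0])[1]


def newest_file(folder="."):
    """Return the name of the regular file with the largest st_ctime, or None if there is none."""
    pairs = [(entry.stat().st_ctime, entry.name)
             for entry in os.scandir(folder) if entry.is_file()]
    return newest_name(pairs) if pairs else None


print(newest_name([(1.0, "foo.csv"), (3.0, "bar.csv"), (2.0, "foobar.csv")]))
```

Using a key function avoids comparing the pairs themselves, which is what failed in the original sort! call.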
I am trying to use the oecosimu function from the vegan package.
library(sna)
library(permute)
library(lattice)
library(vegan)
library(bipartite)
bio<-read.csv("/home/vicar66/Raw_data/Jan12/98percents_April_16/otu_table/Chapter1/no_bloom_bio_data_L3.txt")
rownames(bio)<-bio[,1]
bio[,1]<-NULL
bio_m<-as.matrix(bio)
a<-oecosimu(bio_m,bipartite::C.score,"swap")
but I keep having this error message:
Attaching package: 'bipartite'
The following object is masked from 'package:vegan':
nullmodel
Error in rowSums(x) : 'x' must be an array of at least two dimensions
Calls: oecosimu -> nullmodel -> rowSums
Execution halted
demo data:
Ciliophora Dinoflagellates MALVs Choanoflagellida
DNAHIN5m 0.062469804 0.826323018 0.031084701 0.000323747
DNAHIN35m 0.045216826 0.595750636 0.187010719 0.000917053
DNAHIN150m 0.018434224 0.204865296 0.531016032 0.017009618
DNAHIN240m 0.016211333 0.889640227 0.04191846 0.03087407
Note: the first cell of the header row is empty, and the first column contains the row names.
Has anyone encountered this problem?
Thanks!