NebulaGraph Database: when submitting the algorithm package directly to run Louvain, it reports an error

Some details are as follows:
NebulaGraph version: 3.3.0
NebulaGraph Studio version: 3.5.0
Deployment: distributed
Data volume: the nodes are Comment, and the edges are the ones shown in the red box of the screenshot.
application.conf:
{
  # Spark configuration
  spark: {
    app: {
      name: louvain
      # spark.app.partitionNum
      partitionNum: 50
    }
    master: local
  }
  data: {
    # data source; one of nebula, csv, json
    source: nebula
    # data sink; the algorithm result will be written into this sink. One of nebula, csv, text
    sink: nebula
    # whether your algorithm needs a weight
    hasWeight: false
  }
  # NebulaGraph configuration
  nebula: {
    # the algorithm's data source from Nebula. If data.source is nebula, this nebula.read config takes effect.
    read: {
      # Nebula metad server addresses, separated by commas
      metaAddress: "192.168.200.100:9559,192.168.200.101:9559,192.168.200.111:9559"
      # Nebula space
      space: ldbc
      # Nebula edge types; multiple labels mean that data from multiple edge types will be unioned together
      labels: ["HAS_CREATOR","HAS_TAG","IS_LOCATED_IN","REPLY_OF"]
      # Nebula edge property name for each edge type; this property will be used as the weight column for the algorithm.
      # Make sure the weightCols correspond to the labels.
      weightCols: [""]
    }
    # the algorithm result sink into Nebula. If data.sink is nebula, this nebula.write config takes effect.
    write: {
      # Nebula graphd server addresses, separated by commas
      graphAddress: "192.168.200.100:9669,192.168.200.101:9669,192.168.200.111:9669,192.168.200.112:9669,192.168.200.114:9669"
      # Nebula metad server addresses, separated by commas
      metaAddress: "192.168.200.100:9559,192.168.200.101:9559,192.168.200.111:9559"
      user: root
      pswd: nebula
      # Nebula space name
      space: ldbc
      # Nebula tag name; the algorithm result will be written into this tag
      tag: Comment
      # whether the algorithm result is inserted as a new tag or updates the original tag. type: insert/update
      type: update
    }
  }
  local: {
    # the algorithm's data source from a local file. If data.source is csv or json, this local.read config takes effect.
    read: {
      filePath: "file:///tmp/edge_follow.csv"
      # srcId column
      srcId: "_c0"
      # dstId column
      dstId: "_c1"
      # weight column
      #weight: "col3"
      # whether the csv file has a header
      header: false
      # csv file delimiter
      delimiter: ","
    }
    # the algorithm result sink into a local file. If data.sink is csv or text, this local.write config takes effect.
    write: {
      resultPath: /tmp/count
    }
  }
  algorithm: {
    # the algorithm to execute; pick one from [pagerank, louvain, connectedcomponent,
    # labelpropagation, shortestpaths, degreestatic, kcore, stronglyconnectedcomponent, trianglecount,
    # betweenness, graphtriangleCount, clusteringcoefficient, bfs, hanp, closeness, jaccard, node2vec]
    executeAlgo: louvain
    # PageRank parameters
    pagerank: {
      maxIter: 10
      resetProb: 0.15 # default 0.15
    }
    # Louvain parameters
    louvain: {
      maxIter: 20
      internalIter: 10
      tol: 0.5
    }
    # ConnectedComponent parameters
    connectedcomponent: {
      maxIter: 20
    }
    # LabelPropagation parameters
    labelpropagation: {
      maxIter: 20
    }
    # ShortestPaths parameters
    shortestpaths: {
      # several vertices from which to compute the shortest path to all vertices
      landmarks: "1"
    }
    # vertex degree statistics parameters
    degreestatic: {}
    # KCore parameters
    kcore: {
      maxIter: 10
      degree: 1
    }
    # TriangleCount parameters
    trianglecount: {}
    # GraphTriangleCount parameters
    graphtrianglecount: {}
    # Betweenness centrality parameters. maxIter is the maximum number of iterations.
    betweenness: {
      maxIter: 5
    }
    # ClusteringCoefficient parameters. The type parameter has two choices: local or global.
    # local computes the clustering coefficient for each vertex and prints the average coefficient for the graph;
    # global computes only the graph-level clustering coefficient.
    clusteringcoefficient: {
      type: local
    }
    # Closeness parameters
    closeness: {}
    # BFS parameters
    bfs: {
      maxIter: 5
      root: "10"
    }
    # HANP parameters
    hanp: {
      hopAttenuation: 0.1
      maxIter: 10
      preference: 1.0
    }
    # Node2vec parameters
    node2vec: {
      maxIter: 10,
      lr: 0.025,
      dataNumPartition: 10,
      modelNumPartition: 10,
      dim: 10,
      window: 3,
      walkLength: 5,
      numWalks: 3,
      p: 1.0,
      q: 1.0,
      directed: false,
      degree: 30,
      embSeparate: ",",
      modelPath: "hdfs://127.0.0.1:9000/model"
    }
    # Jaccard parameters
    jaccard: {
      tol: 1.0
    }
  }
}
When I execute spark-submit --master "local" --class com.vesoft.nebula.algorithm.Main /opt/offline/nebula/nebula-algorithm-3.0.0.jar -p /opt/offline/nebula/application.conf, it reports the error in the screenshot.
How should I solve this error?

The value of the weightCols parameter in the application.conf file must not contain a blank when it is empty.

Check the application.conf: if the item weightCols is empty, use [""] with no blank inside the quotes.
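For example, the read block would keep the same labels and write the empty weight entry exactly as below. This is a sketch of the fix described above; per the answers, the key point is that there is nothing between the quotes:
labels: ["HAS_CREATOR","HAS_TAG","IS_LOCATED_IN","REPLY_OF"]
# no weights: an empty string with no blank between the quotes
weightCols: [""]
# not like this: a blank between the quotes
# weightCols: [" "]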

Related

Applying lapply to ANOVA and the post hoc test cld

I am new to R and I am trying to get my mind around the apply functions. So far I have managed to run my ANOVAs for all the variables in my data, and I got the pairwise comparisons.
varlist <- names(dt)[5:length(dt)]
# loop: fit a model for each variable
models <- lapply(X = varlist,
                 FUN = function(t) lm(formula = paste0("`", t, "` ~ block + irrigation * genotype"), data = dt))
# name the list of models by the column names
names(models) <- varlist
## apply anova to each model stored in the list, models
lapply(models, anova)
# marginal means for all variables
res.model1 <- lapply(models, function(x) pairs(emmeans(x, ~ genotype:irrigation)))
res.model1
So far so good. Now I want to create a compact letter display that I can use for plotting. Previously I used the following, but I can't work out how to wrap it in lapply:
CLD <- cld(res.model1,
           alpha = 0.05,
           Letters = letters,
           adjust = "tukey")
I use the CLD data to create graphs.
I managed to get the letters with the following code, but then I am not getting the full ANOVA table:
tx <- with(dt, interaction(irrigation, genotype)) # determine the factors
model2 <- lapply(varlist, function(x) {
  lm(substitute(i ~ block + tx, list(i = as.name(x))), data = dt)
}) # using the factors already in "tx"
lapply(model2, anova)
letters <- lapply(model2, function(m) HSD.test(m, "tx", alpha = 0.05, group = TRUE, console = TRUE))
Any suggestions on how to achieve what I need?
Thank you.
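One way to do this is to run cld inside lapply as well, once per fitted model. A minimal sketch, assuming the emmeans and multcomp packages are loaded (cld also needs multcompView installed for the letters); here cld is applied to each model's estimated marginal means rather than to the pairs() output:
library(emmeans)
library(multcomp)
# compact letter display for each model in the `models` list
cld_list <- lapply(models, function(m) {
  cld(emmeans(m, ~ genotype:irrigation),
      alpha = 0.05, Letters = letters, adjust = "tukey")
})
Each element of cld_list is then a data-frame-like object that can be passed to a plotting function.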

How to set the max sequence length with a Hugging Face SageMaker estimator?

I'd like to increase the max sequence length from 128 to 512 (the maximum DistilBERT can handle). I believe it's only using 128 tokens right now, because the training samples it prints out have an attention_mask with 128 values. This is my code:
import sagemaker
from sagemaker.huggingface import HuggingFace

# gets role for executing training job
role = sagemaker.get_execution_role()

hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'do_predict': True,
    'do_eval': True,
    'do_train': True,
    "train_file": "/opt/ml/input/data/train/train.csv",
    "validation_file": "/opt/ml/input/data/val/val.csv",
    "test_file": "/opt/ml/input/data/test/test.csv",
    "num_train_epochs": 50,
    "per_device_train_batch_size": 32
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.6.1'}

# creates Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='run_glue.py',
    source_dir='./examples/pytorch/text-classification',
    instance_type='ml.p3.8xlarge',
    instance_count=1,
    role=role,
    git_config=git_config,
    transformers_version='4.6.1',
    pytorch_version='1.7.1',
    py_version='py36',
    # use_spot_instances=True,
    # max_wait=24*60*60+1,
    hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit({'train': s3_input + "/train.csv",
                           'val': s3_input + "/val.csv",
                           'test': s3_input + "/test.csv"})
Inspecting run_glue.py, the input arguments are taken here:
model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
but the hyperparameters we can set only seem to impact training_args, and data_args is what sets max_seq_length later in that file. I don't see an option in the Hugging Face estimator to pass anything other than hyperparameters. I could fork v4.6.1 and manually set this value, but that seems like overkill; is there a proper way to just pass this value?
It turns out 'max_seq_length': 512 can just be plugged into the hyperparameters. I likely typo'd this before, as I was getting messages that the parameter wasn't being used.
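For reference, a sketch of the change: the key is added to the same hyperparameters dict shown above (run_glue.py parses it into its data arguments); other entries are abbreviated here:
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'output_dir': '/opt/ml/model',
    'do_train': True,
    'max_seq_length': 512,  # forwarded to run_glue.py, which reads it into data_args
}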

Querying a CSV::Table to find the item with the most sales between two given dates in a plain old Ruby script

I am trying to find the highest sales between two given dates.
This is my ad_report.csv file, with headers:
date,impressions,clicks,sales,ad_spend,keyword_id,asin
2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
...
Below is all the working code I have that returns the row with the highest value, but not between the given dates.
require 'csv'
require 'date'
# get directory of the current file
LIB_DIR = File.dirname(__FILE__)
# get the absolute path of the ad_report & product_report CSV
# and set to a var
AD_CSV_PATH = File.expand_path('data/ad_report.csv', LIB_DIR)
PROD_CSV_PATH = File.expand_path('data/product_report.csv', LIB_DIR)
# create a CSV::Table for the ad_report and product_report CSVs
ad_report_table = CSV.parse(File.read(AD_CSV_PATH), headers: true)
prod_report_table = CSV.parse(File.read(PROD_CSV_PATH), headers: true)
## finds the row with the highest sales
sales_row = ad_report_table.max_by { |row| row[3].to_i }
At this point I can get the row that has the greatest sales, and all the data from that row, but it is not within the expected range.
Below I am trying to use range with the preset dates.
## range of date for items between
first_date = Date.new(2017, 05, 02)
last_date = Date.new(2017, 05, 31)
range = (first_date...last_date)
puts sales_row
Below is pseudocode of what I feel I am supposed to do, but there is probably a better method.
## check for highest sales
## return sales if between date
## else reject col if
## loop this until it returns date between
## return result
You could do this by creating a range containing two dates and then using Range#cover? to test whether each row's date is in the range:
range = Date.new(2015, 1, 1)..Date.new(2020, 1, 1)
rows.select do |row|
  range.cover?(Date.parse(row['date']))
end.max_by { |row| row['sales'].to_i }
Note that Date.new takes the year, month, and day as separate arguments, and that with headers: true the date and sales fields can be fetched by name.
Although the Tin Man is completely right in that you should use a database instead.
You could obtain the desired value as follows. I have assumed that the field of interest ('sales') represents integer values. If not, change .to_i to .to_f below.
Code
require 'csv'
def greatest(fname, max_field, date_field, date_range)
  largest = nil
  CSV.foreach(fname, headers: true) do |csv|
    largest = { row: csv.to_a, value: csv[max_field].to_i } if
      date_range.cover?(csv[date_field]) &&
      (largest.nil? || csv[max_field].to_i > largest[:value])
  end
  largest.nil? ? nil : largest[:row].to_h
end
Examples
Let's first create a CSV file.
str = <<~END
  date,impressions,clicks,sales,ad_spend,keyword_id,asin
  2017-06-19,4451,1006,608,24.87,UVOLBWHILJ,63N02JK10S
  2017-06-18,5283,3237,1233,85.06,UVOLBWHILJ,63N02JK10S
  2017-06-17,0,0,0,21.77,UVOLBWHILJ,63N02JK10S
  2017-06-20,4451,1006,200000,24.87,UVOLBWHILJ,63N02JK10S
END
fname = 't.csv'
File.write(fname, str)
#=> 263
Now find the record within a given date range for which the value of "sales" is greatest.
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-19')
#=> {"date"=>"2017-06-18", "impressions"=>"5283", "clicks"=>"3237",
# "sales"=>"1233", "ad_spend"=>"85.06", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-17'..'2017-06-25')
#=> {"date"=>"2017-06-20", "impressions"=>"4451", "clicks"=>"1006",
# "sales"=>"200000", "ad_spend"=>"24.87", "keyword_id"=>"UVOLBWHILJ",
# "asin"=>"63N02JK10S"}
greatest(fname, 'sales', 'date', '2017-06-22'..'2017-06-25')
#=> nil
I read the file line by line (using CSV.foreach) to keep memory requirements to a minimum, which could be essential if the file is large.
Notice that, because the date is in "yyyy-mm-dd" format, it is not necessary to convert the two dates to Date objects to compare them; they can be compared as strings (e.g. '2017-06-17' <= '2017-06-18' #=> true).

Turning an Array into a Data Frame

I would like to put my summary statistics into a table using the kable function, but I cannot because it comes up as an array.
```{r setup options, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 5, echo = TRUE)
library(mosaic)
library(knitr)
```
```{r}
sum = summary(SwimRecords$time) # generic data set included with mosaic package
kable(sum) # I want this to be printed into a table
```
Any suggestions?
You can do so easily with the broom package, which is built to "tidy" these stats-related objects:
# install.packages("broom")
broom::tidy(sum)
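Since broom::tidy() returns a data frame here (a single row of summary statistics), it should then drop straight into kable; a small sketch under that assumption:
kable(broom::tidy(sum))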

Python 3: problems extracting info from a file

Once again I am asking for help. Before I start, there will be a lot of text, so please excuse that.
I have about 500 IP addresses of devices of 2 categories in an .xlsx workbook.
I want to:
Telnet to each device and check the device type (by its authentication prompt): type 1 or type 2.
If the device is type 1, get its firmware version in both partitions and write to the Excel file:
column 1 - IP address
column 2 - device type
column 3 - firmware version
column 4 - firmware version in the reserve partition
If it is type 2, write to the Excel file:
column 1 - IP address
column 2 - device type
If the device is down, or is type 3 (unknown), write to the Excel file:
column 1 - IP address
column 2 - result (EOF, TIMEOUT)
What I have done so far: I can telnet to a device, check the device type, and write to Excel with 2 columns (column 1 is the IP address, column 2 is the device type or the EOF/TIMEOUT result).
I am also writing the full session logs to files named IP_ADDRESS.txt for future diagnosis.
What can't I work out? I can't work out how to get the firmware version and put it in columns 3 and 4.
I can't work out how to work with the current session log in real time, so I've decided to copy the log from the main file (IP_ADDRESS.txt) to temp.txt and work with that.
I can't work out how to extract the information I need.
An example of the file output:
Trying 10.40.81.167...
Connected to 10.40.81.167.
Escape character is '^]'.
####################################
#                                  #
#  RADIUS authorization disabled   #
#   Enter local login/password     #
#                                  #
####################################
bt6000 login: admin
Password:
Please, fill controller information at first time (Ctrl+C to abort):
^C
Controller information filling canceled.
^Cadmin#bt6000# firmware info
Active boot partition: 1
Partition 0 (reserved):
Firmware: Energomera-2.3.1
Version: 10117
Partition 1 (active):
Firmware: Energomera-2.3.1_01.04.15c
Version: 10404M
Kernel version: 2.6.38.8 #2 Mon Mar 2 20:41:26 MSK 2015
STM32:
Version: bt6000 10083
Part Number: BT6024
Updated: 27.04.2015 16:43:50
admin#bt6000#
I need the values after the word "Energomera", like 2.3.1 for the reserved partition and 2.3.1_01.04.15c for the active partition.
I've tried working with line numbers and extracting substrings, but got no good result at all.
The full code of my script is below.
import pexpect
import pxssh
import sys        # not sure this is needed
import re         # parsing module
import os         # not sure this is needed
import getopt
import glob       # not sure this is needed
import xlrd       # Excel read module
import xlwt       # Excel write module
import telnetlib  # telnet module
import shutil
# open the Excel workbook
rb = xlrd.open_workbook('/samba/allaccess/Energomera_Eltek_list.xlsx')
# select the worksheet
sheet = rb.sheet_by_name('IPs')
# number of rows in the sheet
num_rows = sheet.nrows
# number of columns in the sheet
num_cols = sheet.ncols
# build a list of the IP addresses
ip_addr_list = [sheet.row_values(rawnum)[0] for rawnum in range(sheet.nrows)]
# create an Excel workbook with write permissions (xlwt module)
wb = xlwt.Workbook()
# create sheet "IP LIST" with cell-overwrite rights
ws = wb.add_sheet('IP LIST', cell_overwrite_ok=True)
# row counter
i = 0
# authorization details
port = "23"          # telnet port
user = "admin"       # telnet username
password = "12345"   # telnet password
# ask for the firmware version
def fw_info():
    print('asking for firmware')
    px.sendline('firmware info')
    px.expect('bt6000#')

# firmware update
def fw_send():
    print('sending firmware')
    px.sendline('tftp server 172.27.2.21')
    px.expect('bt6000')
    px.sendline('firmware download tftp firmware.ext2')
    px.expect('Updating')
    px.sendline('y')
    px.send(chr(13))
    ws.write(i, 0, host)
    ws.write(i, 1, 'Energomera')

# if an Eltek is found - skip it and record the result in the workbook
def eltek_found():
    print(host, "is Eltek. Skipping")
    ws.write(i, 0, host)
    ws.write(i, 1, 'Eltek')

# if the telnet connection on port 23 is refused - skip and record the result
def conn_refuse():
    print(host, "connection refused")
    ws.write(i, 0, host)
    ws.write(i, 1, 'Connection refused')

# authentication
def auth():
    print(host, "is up! Energomera found. Starting auth process")
    px.sendline(user)
    px.expect('assword')
    px.sendline(password)
# iterate over the IP addresses in the ip_addr_list list
for host in ip_addr_list:
    # spawn a pexpect telnet connection
    px = pexpect.spawn('telnet ' + host)
    px.timeout = 35
    # create a log file named IP.txt (10.1.1.1.txt, for example)
    fout = open('/samba/allaccess/Energomera_Eltek/{0}.txt'.format(host), "wb")
    # send pexpect's logfile_read output to the log file
    px.logfile_read = fout
    try:
        index = px.expect(['bt6000', 'sername', 'refused'])
        # if the device says bt6000 - authenticate
        if index == 0:
            auth()
            index1 = px.expect(['#', 'lease'])
            # if "#" - ask for the firmware version immediately
            if index1 == 0:
                print('seems the controller ID is already set')
                fw_info()
            # if "Please" - press Ctrl+C twice, then ask for the firmware version
            elif index1 == 1:
                print('trying Ctrl+C to skip the controller ID prompt')
                px.send(chr(3))
                px.send(chr(3))
                px.expect('bt6000')
                fw_info()
                # firmware update start (temporarily off)
                # fw_send()
        # Eltek found - run its handler
        elif index == 1:
            eltek_found()
        # connection refused - run its handler
        elif index == 2:
            conn_refuse()
        # print output to console (test purposes)
        print(px.before)
        px.send(chr(13))
        # copy the current log file to temp.txt for editing
        shutil.copy2('/samba/allaccess/Energomera_Eltek/{0}.txt'.format(host), '/home/bark/expect/temp.txt')
    # EOF result - skip the host, record the result in Excel
    except pexpect.EOF:
        print(host, "EOF")
        ws.write(i, 0, host)
        ws.write(i, 1, 'EOF')
        # print output to console (test purposes)
        print(px.before)
    # timeout result - skip the host, record the result in Excel
    except pexpect.TIMEOUT:
        print(host, "TIMEOUT")
        ws.write(i, 0, host)
        ws.write(i, 1, 'TIMEOUT')
        # print output to console (test purposes)
        print(px.before)
        # copy the current log file to temp.txt for editing
        shutil.copy2('/samba/allaccess/Energomera_Eltek/{0}.txt'.format(host), '/home/bark/expect/temp.txt')
    # advance the row counter so the Excel output stays aligned
    i += 1

# save the workbook
wb.save('/samba/allaccess/Energomera_Eltek_result.xls')
Do you have any suggestions or ideas, guys, on how I can do this?
Any help is greatly appreciated.
You can use regular expressions.
Example:
>>> import re
>>>
>>> str = """
... Trying 10.40.81.167...
...
... Connected to 10.40.81.167.
...
... Escape character is '^]'.
...
...
...
... ####################################
... #                                  #
... #  RADIUS authorization disabled   #
... #   Enter local login/password     #
... #                                  #
... ####################################
... bt6000 login: admin
... Password:
... Please, fill controller information at first time (Ctrl+C to abort):
... ^C
... Controller information filling canceled.
... ^Cadmin#bt6000# firmware info
... Active boot partition: 1
... Partition 0 (reserved):
... Firmware: Energomera-2.3.1
... Version: 10117
... Partition 1 (active):
... Firmware: Energomera-2.3.1_01.04.15c
... Version: 10404M
... Kernel version: 2.6.38.8 #2 Mon Mar 2 20:41:26 MSK 2015
... STM32:
... Version: bt6000 10083
... Part Number: BT6024
... Updated: 27.04.2015 16:43:50
... admin#bt6000#
... """
>>> re.findall(r"Firmware:.*?([0-9].*)\s", str)
['2.3.1', '2.3.1_01.04.15c']
>>> reserved_firmware = re.search(r"reserved.*\s*Firmware:.*?([0-9].*)\s", str).group(1)
>>> reserved_firmware
'2.3.1'
>>> active_firmware = re.search(r"active.*\s*Firmware:.*?([0-9].*)\s", str).group(1)
>>> active_firmware
'2.3.1_01.04.15c'
>>>
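To wire this into the script and fill columns 3 and 4, one option is to run the same searches over the session output captured after fw_info() returns. A minimal sketch, assuming px.before holds the text printed by firmware info (pexpect returns bytes in Python 3, hence the decode) and that ws, i, and host come from the script above:
output = px.before.decode('utf-8', errors='ignore')
m_active = re.search(r"active.*\s*Firmware:.*?([0-9].*)\s", output)
m_reserved = re.search(r"reserved.*\s*Firmware:.*?([0-9].*)\s", output)
# column 3 - active firmware version, column 4 - reserved partition version
ws.write(i, 2, m_active.group(1) if m_active else '')
ws.write(i, 3, m_reserved.group(1) if m_reserved else '')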
