Prepare data for scikit-learn

Prepare data for scikit-learn - dataset

I am working on a small NLP project of authorship attribution: I have some texts from two authors and I want to say who wrote them.
I have some pre-processed text (tokenized, pos-tagged, ect.) and I want to load it into sciki-learn.
The documents have this shape:
Testo - SPN Testo testare+v+indic+pres+nil+1+sing testo+n+m+sing O
: - XPS colon colon+punc O
" - XPO " quotation_mark+punc O
Buongiorno - I buongiorno buongiorno+inter buongiorno+n+m+_ O
a - E a a+prep O
tutti - PP tutto tutto+adj+m+plur+pst+ind tutto+pron+_+m+_+plur+ind O
. <eos> XPS full_stop full_stop+punc O
Ci - PP pro loc+pron+loc+_+3+_+clit pro+pron+accdat+_+1+plur+clit O
sarebbe - VI essere essere+v+cond+pres+nil+2+sing O
molto - B molto molto+adj+m+sing+pst+ind
So it's a tab separeted text file of 6 columns (word, end of sentence marker, part of speech, lemma, morphological information and named entity recognition marker).
Every file represents a document to classify.
What would be the best way to shape them for scikit learn?

The structure they use in scikit-learn example https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html# is described here
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html
Replace this
# Load some categories from the training set
if opts.all_categories:
categories = None
else:
categories = [
'alt.atheism',
'talk.religion.misc',
'comp.graphics',
'sci.space',
]
if opts.filtered:
remove = ('headers', 'footers', 'quotes')
else:
remove = ()
print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")
data_train = fetch_20newsgroups(subset='train', categories=categories,
shuffle=True, random_state=42,
remove=remove)
data_test = fetch_20newsgroups(subset='test', categories=categories,
shuffle=True, random_state=42,
remove=remove)
with your data load statements, for example:
# Load some categories from the training set
categories = [
'high',
'low',
]
print("loading dataset for categories:")
print(categories if categories else "all")
train_path='c:/Users/username/Documents/SciKit/train'
data_train = load_files(train_path, encoding='latin1')
test_path='c:/Users/username/Documents/SciKit/test'
data_test = load_files(test_path, encoding='latin1')
and in each of train and test directories create "high" and "low" subdirectories for your category files.

Related

Lapply function to anova and post hoc test cld

I am new to r and I am trying to get my mind around the apply function. So far I managed to run my anovas for all the the variables on my data and I got the pairwise comparison.
varlist <- names(dt)[5:length(dt)]
# loop
models <- lapply(X = varlist,
FUN = function(t) lm(formula = paste0("`", t, "` ~ block+irrigation*genotype"), data = dt))
#Name the list of models to the column name
names(models) = varlist
## apply anova to each model stored in the list, models
lapply(models, anova)
#marginal-means-all-variable}
res.model1 <- lapply(models, function(x) pairs(emmeans(x, ~genotype:irrigation)))
res.model1
So far so good, now I want to create a compact letter list so I can use to plot it. Previously I used the following but I can't work out how to apply an lapply function to the following code
CLD = cld(res.model1,
alpha=0.05,
Letters=letters,
adjust="tukey")
I use the CLD data to create graphs
I manage to get the letters with the following code but then I am not getting the full anova table.
tx <- with(dt, interaction(irrigation, genotype)) # determining the factors
model2 <- lapply(varlist, function(x) {
lm(substitute(i~block+tx, list(i = as.name(x))), data = dt)}) # using the factors already in "tx"
lapply(model2, anova)
letters = lapply(model2, function(m) HSD.test((m), "tx", alpha = 0.05, group = TRUE, console = TRUE))
Any suggestions to achieve what I need.
Thank you

How to speed up iteration through array in ruby

I have multiple csv files that have the name and the price of products. There may be or may not be products that are in both files. I have to find the highest and the lowest price across these files for each product.
I joined products from both files into one array:
Dir["./*.csv"].each do |file|
CSV.foreach(file, headers:true) do |row|
tmpRow = row.to_s.chomp + "," + file #saving name of the input file
list.push(tmpRow.chomp.split(","))
end
end
The array list looks like this:
[["5893105","2.38", "weightOrSomethingIrrelevant", "./FIAT_2.csv"]]
This is the main algorithm:
while list[0] do
if list[0] != nil
tmpPart = list[0][0]
tmpParts = list.select{ |part, price| part == tmpPart}
tmpParts.each do |tp|
tmpPrices.push(tp[1])
end
list[0][2].to_f != 0.0 ? tmpWeight = list[0][2].to_s : tmpWeight = "Undefined"
tmpMaxPrice = tmpParts.select{|part, price| part == tmpPart && price == tmpPrices.max}
tmpMinPrice = tmpParts.select{|part, price| part == tmpPart && price == tmpPrices.min}
result.push([tmpPart, tmpWeight, tmpPrices.max, tmpMaxPrice[0].last, tmpPrices.min, tmpMinPrice[0].last)
tmpPart = ""
list = list - tmpParts
tmpParts = []
tmpPrices = []
tmpMaxPrice = []
tmpMinPrice = []
tmpWeight = ""
end
end
The input files are huge (over 200 000 rows), so I am having problems with efficiency of my algorithm (as it processes one row in half a second).
I am wondering if there is any better way to write this app.

I would split this into several parts:
1) I suggest you have a table which represents files (the file name, location, line number etc) and connected to that a product table (the row data from that file)
2) script / function to ingest files and store rows as DB records
3) script / function to analyse rows and find products by name, using the DB and pulling price info out using Min/max.
This could later be improved to deal with naming inconsistencies products vs product occurrences etc.

using lookup tables to plot a ggplot and table

I'm creating a shiny app and i'm letting the user choose what data that should be displayed in a plot and a table. This choice is done through 3 different input variables that contain 14, 4 and two choices respectivly.
ui <- dashboardPage(
dashboardHeader(),
dashboardSidebar(
selectInput(inputId = "DataSource", label = "Data source", choices =
c("Restoration plots", "all semi natural grasslands")),
selectInput(inputId = "Variabel", label = "Variable", choices =
choicesVariables)),
#choicesVariables definition is omitted here, because it's very long but it
#contains 14 string values
selectInput(inputId = "Factor", label = "Factor", choices = c("Company
type", "Region and type of application", "Approved or not approved
applications", "Age group" ))
),
dashboardBody(
plotOutput("thePlot"),
tableOutput("theTable")
))
This adds up to 73 choices (yes, i know the math doesn't add up there, but some choices are invalid). I would like to do this using a lookup table so a created one with every valid combination of choices like this:
rad1<-c(rep("Company type",20), rep("Region and type of application",20),
rep("Approved or not approved applications", 13), rep("Age group", 20))
rad2<-choicesVariable[c(1:14,1,4,5,9,10,11, 1:14,1,4,5,9,10,11, 1:7,9:14,
1:14,1,4,5,9,10,11)]
rad3<-c(rep("Restoration plots",14),rep("all semi natural grasslands",6),
rep("Restoration plots",14), rep("all semi natural grasslands",6),
rep("Restoration plots",27), rep("all semi natural grasslands",6))
rad4<-1:73
letaLista<-data.frame(rad1,rad2,rad3, rad4)
colnames(letaLista) <- c("Factor", "Variabel", "rest_alla", "id")
Now its easy to use subset to only get the choice that the user made. But how do i use this information to plot the plot and table without using a 73 line long ifelse statment?
I tried to create some sort of multidimensional array that could hold all the tables (and one for the plots) but i couldn't make it work. My experience with these kind of arrays is limited and this might be a simple issue, but any hints would be helpful!
My dataset that is the foundation for the plots and table consists of dataframe with 23 variables, factors and numerical. The plots and tabels are then created using the following code for all 73 combinations
s_A1 <- summarySE(Samlad_info, measurevar="Dist_brukcentrum",
groupvars="Companytype")
s_A1 <- s_A1[2:6,]
p_A1=ggplot(s_A1, aes(x=Companytype,
y=Dist_brukcentrum))+geom_bar(position=position_dodge(), stat="identity") +
geom_errorbar(aes(ymin=Dist_brukcentrum-se,
ymax=Dist_brukcentrum+se),width=.2,position=position_dodge(.9))+
scale_y_continuous(name = "") + scale_x_discrete(name = "")
where summarySE is the following function, burrowed from cookbook for R
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=TRUE,
conf.interval=.95, .drop=TRUE) {
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
The code in it's entirety is a bit to large but i hope this may clarify what i'm trying to do.

Well, thanks to florian's comment i think i might have found a solution my self. I'll present it here but leave the question open as there is probably far neater ways of doing it.
I rigged up the plots (that was created as lists by ggplot) into a list
plotList <- list(p_A1, p_A2, p_A3...)
tableList <- list(s_A1, s_A2, s_A3...)
I then used subset on my lookup table to get the matching id of the list to select the right plot and table.
output$thePlot <-renderPlot({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
plotList[as.integer(plotValue[1,4])]
})
output$theTable <-renderTable({
plotValue<-subset(letaLista, letaLista$Factor==input$Factor &
letaLista$Variabel== input$Variabel & letaLista$rest_alla==input$DataSource)
skriva <- tableList[as.integer(plotValue[4])]
print(skriva)
})

How to return a dictionary from a database?

I have written functions to access a database.
The table called Books has:
- a `book_id` TEXT column,
- a title TEXT column, and
- an author TEXT column.
For the first one, run_query is a function that connects the database.
get_book_cnt_per_author is a function that returns a list of tuples in this form:
'author', number of books
I don't know how to use the run_query function through the loop. I always got None for what I wrote.
I don't know where my problem is. I only get one book for each author.
Please tell me what is the problem.
def get_books(db, book_cnt_list, book_cnt):
""" (str, list of tuple, int) -> list of str
Precondition: the elements in book_cnt_list are sorted
in ascending order by author name.
Return a list of all the book titles whose authors
each have book_cnt books in the database with name db
according to the book_cnt_list. The book titles should be in
ascending order for each author, but not for the entire list.
Follow ascending order across authors, that is, the order
authors appear in book_cnt_list that is already sorted by
author name.
>>> author_cnt_list = get_book_cnt_per_author("e7_database.db")
>>> books_list = get_books("e7_database.db", author_cnt_list, 10)
>>> books_list[0]
'A Christmas Carol'
>>> books_list[9]
'The Life and Adventures of Nicholas Nickleby'
>>> books_list[10]
'Disgrace'
>>> books_list[-1]
'Youth'
"""
# HINT: First figure out which authors have book_cnt books
# using the book_cnt_list. Then, access the database db
# to retrieve the required information for those authors.
# Do not call any other of your E7 functions other than run_query.
list1 = []
for i in book_cnt_list:
if i[1] == "book_cnt":
list1.append(i[0])
for j in list1:
return run_query(my_db, '''SELECT title FROM Books OREDER BY Books.title ASC WHERE Books.author = ? ''', (j))
def create_author_dict(db):
""" (str) -> dict of {str: list of str}
Return a dictionary that maps each author to the books they have written
according to the information in the Books table of the database
with name db.
>>> author_dict = create_author_dict('e7_database.db')
>>> author_dict['Isaac Asimov'].sort()
>>> author_dict['Isaac Asimov']
['Foundation', 'I Robot']
>>> author_dict['Maya Angelou']
['I Know Why the Caged Bird Sings']
"""
con = sqlite3.connect(db)
cur = con.cursor()
cur.execute('''SELECT author, title FROM Books WHERE Books.author = ?''')
new_list = cur.fetchall()
new_dict = {}
for i in new_list:
key = i[0]
value = i[1:]
new_dict.update({key: list(value)})
con.commit()
cur.close()
con.close()
return new_dict

try changing the:
new_dict.update({key: list(value)})
for a statement like:
new_dict[key] = new_dict.get(key, list()) + [value]

MatLab - Creating an array of images depending on their correlation

I've created a program for a project that tests images against one another to see whether or not it's the same image or not. I've decided to use correlation since the images I am using are styled in the same way and with this, I've been able to get everything working up to this point.
I now wish to create an array of images again, but this time, in order of their correlation. So for example, if I'm testing a 50 pence coin and I test 50 images against the 50 pence coin, I want the highest 5 correlations to be stored into an array, which can then be used for later use. But I'm unsure how to do this as each item in the array will need to have more than one variable, which will be the image location/name of the image and it's correlation percentage.
%Program Created By Ben Parry, 2016.
clc(); %Simply clears the console window
%Targets the image the user picks
inputImage = imgetfile();
%Targets all the images inside this directory
referenceFolder = 'M:\Project\MatLab\Coin Image Processing\Saved_Images';
if ~isdir(referenceFolder)
errorMessage = print('Error: Folder does not exist!');
uiwait(warndlg(errorMessage)); %Displays an error if the folder doesn't exist
return;
end
filePattern = fullfile(referenceFolder, '*.jpg');
jpgFiles = dir(filePattern);
for i = 1:length(jpgFiles)
baseFileName = jpgFiles(i).name;
fullFileName = fullfile(referenceFolder, baseFileName);
fprintf(1, 'Reading %s\n', fullFileName);
imageArray = imread(fullFileName);
imshow(imageArray);
firstImage = imread(inputImage); %Reading the image
%Converting the images to Black & White
firstImageBW = im2bw(firstImage);
secondImageBW = im2bw(imageArray);
%Finding the correlation, then coverting it into a percentage
c = corr2(firstImageBW, secondImageBW);
corrValue = sprintf('%.0f%%',100*c);
%Custom messaging for the possible outcomes
corrMatch = sprintf('The images are the same (%s)',corrValue);
corrUnMatch = sprintf('The images are not the same (%s)',corrValue);
%Looping for the possible two outcomes
if c >=0.99 %Define a percentage for the correlation to reach
disp(' ');
disp('Images Tested:');
disp(inputImage);
disp(fullFileName);
disp (corrMatch);
disp(' ');
else
disp(' ');
disp('Images Tested:');
disp(inputImage);
disp(fullFileName);
disp(corrUnMatch);
disp(' ' );
end;
imageArray = imread(fullFileName);
imshow(imageArray);
end

You can use struct() function to create structures.
Initializing an array of struct:
imStruct = struct('fileName', '', 'image', [], 'correlation', 0);
imData = repmat(imStruct, length(jpgFiles), 1);
Setting field values:
for i = 1:length(jpgFiles)
% ...
imData(i).fileName = fullFileName;
imData(i).image = imageArray;
imData(i).correlation = corrValue;
end
Extract values of correlation field and select 5 highest correlations:
corrList = [imData.correlation];
[~, sortedInd] = sort(corrList, 'descend');
selectedData = imData(sortedInd(1:5));

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Prepare data for scikit-learn - dataset

Related

Lapply function to anova and post hoc test cld

How to speed up iteration through array in ruby

using lookup tables to plot a ggplot and table

How to return a dictionary from a database?

MatLab - Creating an array of images depending on their correlation

Categories

Resources