Using Pandas to store spotipy output in csv (python) - spotipy

I wrote some code to store the output of spotipy in a pd.DataFrame:
import pandas as pd
import spotipy
sp = spotipy.Spotify()
from spotipy.oauth2 import SpotifyClientCredentials
cid ='XXXCIDXXX'
secret = 'XXXSECRETXXX'
client_credentials_manager = SpotifyClientCredentials(client_id=cid,
client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
sp.trace=False
playlist = sp.user_playlist_tracks('spotify', '37i9dQZF1DX5nwnRMcdReF')
songs = playlist['items']
df = pd.DataFrame(songs)
df.to_csv('Songs.csv', sep=';', encoding='utf-8', index=True)
But there's a lot of output there that I don't need, so I found some code to output only the data that I need:
for i, item in enumerate(playlist['items']):
    track = item['track']
    need = (i, track['artists'][0]['name'], track['name'], track['id'])
Now I can use print(need) to output exactly what I want, but I don't know how to store the data in a DataFrame.
If someone could help me that would be great.
Thank you.

Append the output you need to separate lists and then load those lists into a dataframe:
# lists to collect the fields we need
artist_name = []
track_name = []
track_id = []

# appending needed data to separate lists
for i, item in enumerate(playlist['items']):
    track = item['track']
    artist_name.append(track['artists'][0]['name'])
    track_name.append(track['name'])
    track_id.append(track['id'])

# loading lists into the dataframe
df = pd.DataFrame({'artist_name': artist_name, 'track_name': track_name, 'track_id': track_id})
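To get the CSV from the original question, the same to_csv call can be reused on this smaller DataFrame (a sketch; index=False simply drops the row index column):
# inspect the collected data, then write it out as in the original script
print(df.head())
df.to_csv('Songs.csv', sep=';', encoding='utf-8', index=False)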

Pytorch Dataloader for Image GT dataset

I am new to pytorch. I am trying to create a DataLoader for a dataset of images where each image has a corresponding ground truth image with the same name:
root:
--->RGB:
------>img1.png
------>img2.png
------>...
------>imgN.png
--->GT:
------>img1.png
------>img2.png
------>...
------>imgN.png
When I use the path to the root folder (which contains the RGB and GT folders) as input for torchvision.datasets.ImageFolder, it reads all of the images as if they were all inputs (classified as RGB and GT), and there seems to be no way to pair the RGB-GT images. I would like to pair the RGB-GT images, shuffle, and divide them into batches of a defined size. How can it be done? Any advice will be appreciated.
Thanks.
I think a good starting point is to use the VisionDataset class as a base. What we are going to use here is the DatasetFolder source code, so we are going to create something similar. You can notice this class depends on two other functions from the datasets.folder module: default_loader and make_dataset.
We are not going to modify default_loader, because it's already fine; it just helps us load images, so we will import it.
But we need a new make_dataset function that prepares the right pairs of images from the root folder. The original make_dataset pairs each image (each image path, to be more precise) with its root folder as the target class (class index), so we get a list of (path, class_to_idx[target]) pairs, but we need (rgb_path, gt_path) pairs. Here is the code for the new make_dataset:
import os

def make_dataset(root: str) -> list:
    """Reads a directory with data.
    Returns a dataset as a list of tuples of paired image paths: (rgb_path, gt_path)
    """
    dataset = []

    # Our dir names
    rgb_dir = 'RGB'
    gt_dir = 'GT'

    # Get all the filenames from the RGB folder
    rgb_fnames = sorted(os.listdir(os.path.join(root, rgb_dir)))

    # Compare file names from the GT folder to file names from RGB:
    for gt_fname in sorted(os.listdir(os.path.join(root, gt_dir))):
        if gt_fname in rgb_fnames:
            # if we have a match - create a pair of full paths to the corresponding images
            rgb_path = os.path.join(root, rgb_dir, gt_fname)
            gt_path = os.path.join(root, gt_dir, gt_fname)
            item = (rgb_path, gt_path)
            # append to the dataset list
            dataset.append(item)
        else:
            continue

    return dataset
What do we have now? Let's compare our function with the original one:
from torchvision.datasets.folder import make_dataset as make_dataset_original
dataset_original = make_dataset_original(root, {'RGB': 0, 'GT': 1}, extensions='png')
dataset = make_dataset(root)
print('Original make_dataset:')
print(*dataset_original, sep='\n')
print('Our make_dataset:')
print(*dataset, sep='\n')
Original make_dataset:
('./data/GT/img1.png', 1)
('./data/GT/img2.png', 1)
...
('./data/RGB/img1.png', 0)
('./data/RGB/img2.png', 0)
...
Our make_dataset:
('./data/RGB/img1.png', './data/GT/img1.png')
('./data/RGB/img2.png', './data/GT/img2.png')
...
I think it works great. It's time to create our Dataset class. The most important part here is the __getitem__ method, because it imports the images, applies the transformations and returns tensors that can be used by dataloaders. We need to read a pair of images (rgb and gt) and return a tuple of 2 tensor images:
from torchvision.datasets.folder import default_loader
from torchvision.datasets.vision import VisionDataset


class CustomVisionDataset(VisionDataset):

    def __init__(self,
                 root,
                 loader=default_loader,
                 rgb_transform=None,
                 gt_transform=None):
        super().__init__(root,
                         transform=rgb_transform,
                         target_transform=gt_transform)

        # Prepare dataset
        samples = make_dataset(self.root)

        self.loader = loader
        self.samples = samples
        # list of RGB images
        self.rgb_samples = [s[0] for s in samples]
        # list of GT images
        self.gt_samples = [s[1] for s in samples]

    def __getitem__(self, index):
        """Returns a data sample from our dataset.
        """
        # getting our paths to images
        rgb_path, gt_path = self.samples[index]

        # import each image using loader (by default it's PIL)
        rgb_sample = self.loader(rgb_path)
        gt_sample = self.loader(gt_path)

        # here go the transforms if needed
        # maybe we need different transforms for each type of image
        if self.transform is not None:
            rgb_sample = self.transform(rgb_sample)
        if self.target_transform is not None:
            gt_sample = self.target_transform(gt_sample)

        # now we return the right imported pair of images (tensors)
        return rgb_sample, gt_sample

    def __len__(self):
        return len(self.samples)
Let's test it:
from torch.utils.data import DataLoader
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt

bs = 4  # batch size
transforms = ToTensor()  # we need this to convert PIL images to Tensor
shuffle = True

dataset = CustomVisionDataset('./data', rgb_transform=transforms, gt_transform=transforms)
dataloader = DataLoader(dataset, batch_size=bs, shuffle=shuffle)

for i, (rgb, gt) in enumerate(dataloader):
    print(f'batch {i+1}:')
    # some plots
    for j in range(bs):
        plt.figure(figsize=(10, 5))
        plt.subplot(221)
        plt.imshow(rgb[j].squeeze().permute(1, 2, 0))
        plt.title(f'RGB img{j+1}')
        plt.subplot(222)
        plt.imshow(gt[j].squeeze().permute(1, 2, 0))
        plt.title(f'GT img{j+1}')
        plt.show()
Out:
batch 1:
...
Here you can find a notebook with the code and a simple dummy dataset.

Filter json array data in spark dataframe

I have a Spark dataframe that I am converting into JSON format:
json = df.toJSON().collect()
print(json)
['{"lot_number":"4f19-9deb-0ef861c1a6a1","recipients":[{"account":"45678765457876545678","code":"user1","status":"pending"},{"account":"12354567897545678","code":"error2","status":"pending"}]}',
'{"lot_number":"09ad-451e-8fb1-50bc185ef02f","recipients":[{"account":"4567654567876545678","code":"user3","status":"pending"},{"account":"12354567876545678","code":"user2","status":"pending"}]}']
I need to filter the data from the array, that is, all recipients whose code is "user1".
I'm expecting this result:
['{"lot_number":"4f19-9deb-0ef861c1a6a1","recipients":[{"account":"45678765457876545678","code":"user1","status":"pending"}'
]
Can anyone help to filter the data as shown above?
Firstly, you will need to convert the strings in the list to dict objects.
import json
json_rdd = df.toJSON().collect()
json_ls = [json.loads(x) for x in json_rdd]
# Now you can filter using "user1"
final_json_ls = [x for x in json_ls if x.get("recipients")[0].get("code") == "user1"]
If you have multiple recipients:
new_list = list()
for lot in json_ls:
    recs = lot.get('recipients')
    lot_recipients = [rec for rec in recs if rec.get("code") == "user1"]
    if lot_recipients:
        new_list.append({"lot_number": lot.get('lot_number'),
                         "recipients": lot_recipients})

# OUTPUT
# [{'lot_number': u'4f19-9deb-0ef861c1a6a1', 'recipients': [{u'status': u'pending', u'account': u'45678765457876545678', u'code': u'user1'}]}]
And since you want to convert it back to JSON to send the requests:
import requests

for ls in new_list:
    lot = ls.get("lot_number")
    url = "test.com/api/v1/notify/request/" + lot
    # headers is assumed to be defined elsewhere in your script
    response = requests.put(url, data=json.dumps(ls), headers=headers)
    print(response.text)
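If you'd rather keep the filtering inside Spark instead of collecting everything to the driver first, here is a minimal sketch (assuming Spark 2.4+ for the filter higher-order function, and the same lot_number/recipients schema as df above):
from pyspark.sql import functions as F

# keep only the recipients whose code is "user1", then drop lots with no match left
filtered = (df
            .withColumn("recipients",
                        F.expr("filter(recipients, r -> r.code = 'user1')"))
            .where(F.size("recipients") > 0))

filtered_json = filtered.toJSON().collect()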

array - list format input

I have the following question: how can I change the format of curve2 (a list)? I want something similar to curve:
curve = [0.0556, 0.0563]
curve2 = [[0.0159, 0.0178]]
Context: I'd like to apply a certain piece of code, but I don't get the result I expect since the input has a different format.
My code is something like:
import pandas as pd
import numpy as np

curve = [0.0556, 0.0563]
curve2 = [[0.0159, 0.0178]]

df = pd.DataFrame()

def SUM(curve):
    df['COl1'] = curve
    return df

print(SUM(curve))
PS: curve2 is a row extracted from an array (as a list):
[[ 0.01593353 0.01783041]
[ 0.00917833 0.00593893]
[ 0.00829569 0.02123637]
[-0.03057529 -0.04138836]
[ 0.05212978 0.03239212]]
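One way to get curve2 into the same flat shape as curve (a sketch, assuming the curve2 above and that only the single row is needed):
import numpy as np

curve2 = [[0.0159, 0.0178]]

# take the single row out of the nested list...
curve2_flat = curve2[0]              # [0.0159, 0.0178]
# ...or flatten it with numpy, which also works for longer rows
curve2_flat = list(np.ravel(curve2))

print(curve2_flat)
The flat list can then be passed to SUM just like curve.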

get data behind " , " to take phrase by python

I want to get the data in column D that comes after the " , " at the end of the sentence, reading from left to right, to get the phrase shown in the linked screenshot:
[1]:( http://prntscr.com/fye9hi) "here"
Can someone help me, please?
This is my code, but it doesn't do what I want.
import xlrd

file_location = "C:/Users/admin/DataKH.xlsx"
wb = xlrd.open_workbook(file_location)
sheet = wb.sheet_by_index(0)
print(sheet.nrows)
print(sheet.ncols)

for rows in range(sheet.nrows):
    row_0 = sheet.cell_value(rows, 0)

from xlwt import Workbook
import xlwt
from xlwt import Formula

workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
data = [sheet.cell_value(row, 3) for row in range(sheet.nrows)]
data1 = [sheet.cell_value(row, 4) for row in range(sheet.nrows)]

workbook = xlwt.Workbook()
sheet = workbook.add_sheet('test')
for index, value in enumerate(data):
    sheet.write(index, 0, value)
for index, value in enumerate(data1):
    sheet.write(index, 1, value)
workbook.save('output.xls')
How about using the split(",") method? It returns a list of phrases, so you can easily iterate through it.
@MinhTuấnNgô: I'm confused by xlrd's syntax, so I switched to pandas instead.
import pandas as pd
df = pd.read_excel('SampleData.xlsx')
df['Extracted Address'] = pd.Series((cell.split(',')[-1] for cell in df['Address']), index = df.index)
Not sure what you mean by 'getting the data after the comma' but this shows a way to manipulate the cell data.
After you've finished formatting the data, you can export it back to excel using df.to_excel(<filepath>)
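For example (a sketch, assuming the df built in the snippet above and that openpyxl or another Excel writer is installed):
# write the manipulated data back out to a new workbook
df.to_excel('output.xlsx', index=False)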
For xlrd, you can iterate through a specific column using this syntax:
for row in ws.col(2)[1:]:
This should skip the first row (as taken care of in the case of pandas anyway) and iterate all remaining rows.
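A small sketch of how that xlrd loop could be combined with the split(',') idea (assuming ws is a worksheet opened with xlrd and the addresses live in column D, i.e. index 3):
import xlrd

wb = xlrd.open_workbook("C:/Users/admin/DataKH.xlsx")
ws = wb.sheet_by_index(0)

# column D is index 3; [1:] skips the header row
for cell in ws.col(3)[1:]:
    # keep only the phrase after the last comma
    phrase = str(cell.value).split(',')[-1].strip()
    print(phrase)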

Saving users and items features to HDFS in Spark Collaborative filtering RDD

I want to extract users and items features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname#host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.
It is dumping the hash value of the array instead of the entire array. I did the following to print the desired values:
for (user <- users) {
  val (userId, lf) = user
  val str = "user:" + userId + "\t" + lf.mkString(" ")
  println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines also work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The array values are denoted by the symbols you see in the output, e.g. [D@3c3137b5: [D for an array of doubles, followed by @ and a hex code produced by the Java toString method for this type of object. More on that here.
val users: RDD[(Int, Array[Double])] = model.userFeatures
To solve that, you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
