Dataset for Apriori algorithm - database

I am going to develop an app for Market Basket Analysis (using the Apriori algorithm), and I found a dataset with more than 90,000 transaction records.
The problem is that the dataset doesn't contain the names of the items, only their barcodes.
I have just started the project and am researching the Apriori algorithm. Can anyone help me with this case: what is the best way to implement the algorithm using the following dataset?

These kinds of datasets are considered confidential information, and chain stores will not give them out, but you can generate a sample dataset yourself using SQL Server.

The algorithm is defined independently of the identifiers used for the items. Also, you didn't post the 'following data set' :P If your problem is that the algorithm expects your items to be numbered 0, 1, 2, ..., then just scan your data set and map each individual barcode to a number.
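A minimal sketch of that mapping step in Python (the barcode strings here are made up for illustration; your real transactions would come from the dataset):

```python
# Each transaction is a list of barcode strings (example values).
transactions = [
    ["4006381333931", "5901234123457"],
    ["5901234123457", "4006381333931", "9780201379624"],
]

id_of = {}    # barcode -> integer id, assigned in order of first appearance
encoded = []  # the same transactions, with barcodes replaced by 0, 1, 2, ...
for basket in transactions:
    encoded.append([id_of.setdefault(b, len(id_of)) for b in basket])

# encoded == [[0, 1], [1, 0, 2]]; keep id_of around to translate results back
```

After mining, you can invert `id_of` to translate the frequent itemsets back to barcodes (and to names, once you obtain a barcode-to-name lookup).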
If you're interested, there have been some papers on how to represent frequent item sets very efficiently: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.4827&rep=rep1&type=pdf

The algorithm does not need the name of the items.

Related

How to store FaceNet data efficiently?

I am using the FaceNet algorithm for face recognition and want to create an application based on it. The problem is that FaceNet returns an array of length 128, which is the face embedding per person.
For person identification, I have to find the Euclidean distance between two persons' face embeddings and check whether it is below a threshold: if it is, the persons are the same; if it is greater, they are different.
Let's say I have to find person X in a database of 10k persons. I have to calculate the distance against each and every person's embedding, which is not efficient.
Is there any way to store these face embeddings efficiently and search for a person with better efficiency?
I guess reading this blog will help others. It is detailed and also covers most aspects of the implementation:
Face recognition on 330 million faces at 400 images per second
I recommend storing them in Redis or Cassandra; they will outperform relational databases here.
Those key-value stores can store a multidimensional vector as a value.
You can find embedding vectors with deepface. I shared a sample code snippet below.
#!pip install deepface
from deepface import DeepFace

img_list = ["img1.jpg", "img2.jpg", ...]  # paths to your face images

# build the model once and reuse it for every image
model = DeepFace.build_model("Facenet")

for img_path in img_list:
    img_embedding = DeepFace.represent(img_path, model = model)
    # store img_embedding in Redis here
Sounds like you want a nearest neighbour search. You could have a look at the various space partitioning data structures like kd-trees
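A quick sketch of that idea with SciPy's kd-tree, on randomly generated stand-ins for the 128-d embeddings (note that kd-trees lose much of their advantage in high dimensions, so for 128-d vectors an approximate-nearest-neighbour library is often the more practical choice):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128))  # stand-in for 10k stored face embeddings

tree = cKDTree(embeddings)           # build once, query many times
query = embeddings[42] + 0.001       # a slightly perturbed copy of entry 42
dist, idx = tree.query(query, k=1)   # nearest stored embedding and its distance
# idx == 42: the tree finds the original entry the query was derived from
```

The returned distance can then be compared against the identification threshold exactly as in the brute-force approach.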
First make a dictionary with the 10,000 face encodings, as shown in the face_recognition sample, then store it as a pickle file. Once loaded into memory, it takes about a second to find the distances between face encoding X and those 10,000 pre-encoded ones. Take a look at how it works; I'm operating with millions of faces this way.
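The in-memory comparison described above can be vectorised with NumPy, which is what makes scanning thousands of 128-d encodings fast (random vectors stand in for real encodings here):

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.normal(size=(10_000, 128))  # stand-in for 10k pre-computed face encodings

x = db[7]                                  # the encoding we are searching for
dists = np.linalg.norm(db - x, axis=1)     # Euclidean distance to every entry at once
best = int(np.argmin(dists))               # index of the closest stored encoding
# best == 7, and dists[best] == 0.0 since x came from the database itself
```

In practice you would compare `dists[best]` against your identification threshold before accepting the match.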

How to repeat a command on different values of the same variable using SPSS LOOP?

Probably an easy question:
I want to run this piece of syntax:
SUMMARIZE
  /TABLES=AGENCY PIN AGE GENDER DISABILITY MAINSERVICE MRESAGENCY MRESSUPPORT
  /FORMAT=LIST NOCASENUM TOTAL
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.
for 264 different agencies which are all values contained in the variable 'AGENCY'.
I want to create a different table for each agency outlining the above information for them.
I think I can do this using a DO REPEAT or LOOP on SPSS.
Any advice would be much appreciated.
Thank you :)
note: I have Googled and read endless amounts on looping; I am just a little unsure which method is what I am looking for.
Take a look at SPLIT FILE, which meets your needs.
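A sketch of how that could look with your SUMMARIZE command (SPLIT FILE requires the data to be sorted by the grouping variable first, and SEPARATE BY prints each group's output as its own table):

```spss
SORT CASES BY AGENCY.
SPLIT FILE SEPARATE BY AGENCY.

SUMMARIZE
  /TABLES=AGENCY PIN AGE GENDER DISABILITY MAINSERVICE MRESAGENCY MRESSUPPORT
  /FORMAT=LIST NOCASENUM TOTAL
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.

SPLIT FILE OFF.
```

This produces one case-summary table per agency, with no LOOP or DO REPEAT needed.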

Return only single max value row in Power BI Desktop

I have the following table of Parts which are sold for a particular job, which is the Order Number.
I am trying to extract just the Description of the most expensive part so that I can put it onto a single value card.
I have tried for a day mucking around with CALCULATE, MAX, TOPN, and SELECTEDVALUE, and I can't seem to figure it out. I'm sure it is something simple too...
Would appreciate it if somebody can help me retrieve it in a way that I can see what I missed and learn for future.
My page is filtered by DrillThrough on the Order Number which filters the parts list for me.
Essentially, I just want the card to show 'PUMP,DTH,ELE'. My approach was to select the top row when the parts list is sorted descending by Amount in LC, but so far it has not been that simple :(
Should it be a calculated column or measure on my Order table which has that string?
You should be able to create a measure that does this and then place that measure on a card.
Most Expensive Part = LOOKUPVALUE(Parts[Description],Parts[Amount],MAX(Parts[Amount]))
The MAX(Parts[Amount]) piece gives you the maximal amount. Then you look up the description corresponding to that amount.
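One caveat: LOOKUPVALUE raises an error if two different descriptions share the same maximal amount. If ties are possible in your data, a variant like the following (measure name is just a suggestion) picks one description deterministically, since TOPN keeps all tied rows and MAX then chooses among them:

```dax
Most Expensive Part (ties-safe) =
CALCULATE (
    MAX ( Parts[Description] ),
    TOPN ( 1, Parts, Parts[Amount], DESC )
)
```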

Dataset help for TF-IDF and Vector Model

I want to compare TF-IDF, the vector space model, and some optimizations of the TF-IDF algorithm.
For that I need a dataset (at least 100 documents of English text), but I am not able to find one. Any suggestions?
It depends on the application you use TF-IDF for. For example, if you want to find keywords you could use the "Mendeley" dataset, or for tagging, the "Delicious" data.
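Whichever dataset you choose, the TF-IDF side of the comparison is a few lines with scikit-learn; the tiny corpus below is a placeholder for your 100+ documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [  # stand-in corpus; swap in your real documents
    "the cat sat on the mat",
    "the dog chased the cat",
    "tf idf gives rare terms higher weight",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # sparse (n_docs, n_terms) matrix of tf-idf weights
# terms occurring in many documents are down-weighted by the idf factor,
# which is exactly the behaviour you would compare against a plain vector model
```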

Finding a handwritten dataset with an already extracted features

I want to test my clustering algorithms on handwritten text data, so I'm searching for a dataset of handwritten text (e.g. words) with already extracted features (the goal is to test my clustering algorithms, not to extract features). Does anyone have any information on that?
Thanks.
There is a dataset of images of handwritten digits: http://yann.lecun.com/exdb/mnist/
Texmex has 128d SIFT vectors "to evaluate the quality of approximate nearest neighbors search algorithms on different kinds of data and varying database sizes", but I don't know what their images are of; you could try asking the authors.
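MNIST (linked above) needs a download, but as a quick offline stand-in, scikit-learn bundles a smaller handwritten-digits set whose 64 flattened pixel values can serve directly as pre-extracted features for a clustering test:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# 1797 8x8 digit images, already flattened to 64 numeric features per sample
X, y = load_digits(return_X_y=True)

# cluster into 10 groups, one per digit class, as a sanity check
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
# km.labels_ holds the cluster assignment for each sample,
# which you can compare against the true digit labels in y
```

The same pattern works unchanged once you substitute your own feature matrix.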
