I want to convert my pandas DataFrame into a Markov chain transition matrix:
import pandas as pd
dict1={'state_num_x': {0: 0, 1: 1, 2: 1,3: 1,4: 2,5: 2,6: 2,7: 3,8: 3,9: 4,10: 5,11: 5,
12: 5,13: 5,14: 5,15: 5,16: 6,17: 6,18: 6,19: 7,20: 7,21: 7},
'state_num_y': {0: 1,1: 1,2: 2,3: 5,4: 1,5: 4,6: 6,7: 1,8: 6,9: 1,10: 1,11: 2,
12: 3,13: 5,14: 6,15: 7,16: 1,17: 2,18: 5,19: 1,20: 4,21: 6},
'Sum_Prob': {0: 0.9999999999999999,1: 0.0369363131137667,2: 0.7408182206817178,
3: 0.22224546620451535,4: 0.0369363131137667,5: 0.7408182206817178,
6: 0.22224546620451535,7: 0.17028359283647593,8: 0.8297164071635239,
9: 0.9999999999999999,10: 0.003599493183089517,11: 0.08889818648180613,
12: 0.13334727972270924,13: 0.021335564755633474,14: 0.012001255175043838,
15: 0.7408182206817178,16: 0.015600748358133354,17: 0.8297164071635239,
18: 0.1546828444783427,19: 0.015600748358133354,20: 0.8297164071635239,21: 0.1546828444783427}}
df=pd.DataFrame.from_dict(dict1)
It looks like this:
state_num_x state_num_y Sum_Prob
0 1 1.000000
1 1 0.036936
1 2 0.740818
. . .
. . .
7 1 0.015601
7 4 0.829716
7 6 0.154683
Let's call the result array arr_tx:
arr_tx[0][1] should be equal to 1
arr_tx[1][1] should be equal to 0.036936
arr_tx[1][2] should be equal to 0.740818
It should be an 8x8 matrix, and missing values should be equal to zero.
So the final result should look like this:
0,1,0,0,0,0,0,0,
0,0.036936,0.740818,0,0,0.222245,0,0
.,.,.,.,.,.,.,.
It looks like you want a pivot_table:
df.pivot_table(index='state_num_x', columns='state_num_y',
values='Sum_Prob', fill_value=0)
output:
state_num_y 1 2 3 4 5 6 7
state_num_x
0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 0.036936 0.740818 0.000000 0.000000 0.222245 0.000000 0.000000
2 0.036936 0.000000 0.000000 0.740818 0.000000 0.222245 0.000000
3 0.170284 0.000000 0.000000 0.000000 0.000000 0.829716 0.000000
4 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
5 0.003599 0.088898 0.133347 0.000000 0.021336 0.012001 0.740818
6 0.015601 0.829716 0.000000 0.000000 0.154683 0.000000 0.000000
7 0.015601 0.000000 0.000000 0.829716 0.000000 0.154683 0.000000
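Note that the pivot above has only seven columns (1 to 7), because no row has state_num_y equal to 0, while the question asks for an 8x8 array. A minimal sketch of one way to close that gap, assuming the states are exactly the integers 0 to 7 (the `states` list is my assumption, and the data is a shortened copy of the question's frame), is to reindex both axes before converting to a NumPy array:

```python
import pandas as pd

# Shortened copy of the question's data (first four rows only), for illustration.
df = pd.DataFrame({
    "state_num_x": [0, 1, 1, 1],
    "state_num_y": [1, 1, 2, 5],
    "Sum_Prob": [1.0, 0.036936, 0.740818, 0.222245],
})

# Assumption: the chain has exactly the states 0..7.
states = list(range(8))

arr_tx = (
    df.pivot_table(index="state_num_x", columns="state_num_y",
                   values="Sum_Prob", fill_value=0)
      # Add the rows/columns that never occur in the data and fill them with 0.
      .reindex(index=states, columns=states, fill_value=0)
      .to_numpy()
)

print(arr_tx.shape)
print(arr_tx[1])
```

With the full frame from the question, arr_tx[0][1], arr_tx[1][1] and arr_tx[1][2] should then line up with the values listed there, since both axes are labelled 0 to 7.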
I am currently working on a C program on Debian. At startup, the program allocates several gigabytes of memory. The problem is that it keeps allocating memory after startup. I checked, and there is no malloc, calloc, etc. in the main loop of the program. I monitored the memory usage with the RES column in htop.
Then I decided to check the program's memory syscalls with strace. I attached strace after program startup using this command:
strace -c -f -e trace=memory -p $(pidof myprogram)
Here is the result:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000311 0 10392 mprotect
------ ----------- ----------- --------- --------- ----------------
100.00 0.000311 10392 total
So it is clear that there are no brk or mmap syscalls that could allocate memory.
Here is the list of all syscalls:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
33.00 1.446748 6156 235 67 futex
32.41 1.420658 8456 168 poll
17.35 0.760549 31 24459 nanosleep
16.24 0.712000 44500 16 select
1.00 0.044000 7333 6 2 restart_syscall
0.00 0.000000 0 80 40 read
0.00 0.000000 0 40 write
0.00 0.000000 0 184 mprotect
0.00 0.000000 0 33 rt_sigprocmask
0.00 0.000000 0 21 sendto
0.00 0.000000 0 47 sendmsg
0.00 0.000000 0 138 44 recvmsg
0.00 0.000000 0 7 gettid
------ ----------- ----------- --------- --------- ----------------
100.00 4.383955 25434 153 total
Do you have any idea why memory is still being allocated?
I have written a small C program. It reads some gzipped files, does some filtering, and then writes the output to gzipped files again.
I run gcc with -O3 -Ofast. Otherwise pretty standard.
If I do strace -c on my executable I get:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
46.01 0.077081 0 400582 read
42.73 0.071579 4771 15 munmap
9.34 0.015647 0 110415 brk
1.01 0.001688 32 52 openat
0.45 0.000746 3 228 mmap
0.20 0.000327 4 70 mprotect
0.15 0.000254 0 1128 write
0.06 0.000100 2 50 fstat
0.05 0.000087 1 52 close
0.00 0.000006 6 1 getrandom
0.00 0.000005 2 2 rt_sigaction
0.00 0.000004 2 2 1 arch_prctl
0.00 0.000003 3 1 1 stat
0.00 0.000003 1 2 lseek
0.00 0.000002 2 1 rt_sigprocmask
0.00 0.000002 2 1 prlimit64
0.00 0.000000 0 8 pread64
0.00 0.000000 0 1 1 access
0.00 0.000000 0 1 execve
0.00 0.000000 0 2 fdatasync
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.167534 512616 3 total
So my program is quite busy with reading the file. Now, I am not sure if I can get it faster. The relevant code is the following:
while (gzgets(file_pointer, line, LL) != Z_NULL) {
    linkage = strtok(line,"\t");
    linkage = strtok(NULL,"\t");
    linkage[strcspn(linkage, "\n")] = 0;
    add_linkage_entry(id_cnt, linkage);
    id_cnt++;
}
Do you see room for improvement here? Is it possible to intervene manually with gzread, or is gzgets already doing a good job of not reading char by char?
Any other advice? (Are the errors in the strace worrisome?)
EDIT:
add_linkage_entry adds an entry to a uthash hash table (https://troydhanson.github.io/uthash/)
I don't think that gzgets (and the related read system calls) are the bottleneck here.
The number of read calls is small for data that compresses well, and it increases for data that has more entropy (zlib then has to request compressed data from disk more frequently). E.g., for text data generated from urandom (via
base64 /dev/urandom | tr -- '+HXA' '\t' | head -n 10000000 | gzip
) I get about 70000 read calls for 10M lines, equalling about 140 lines/call. This nicely matches your experience of 100..1000 lines per call.
What is more, the CPU time for reading those lines is still negligible (about 2.5M lines/s, including the strtok calls). Highly compressed data requires about 40 times fewer read calls and can be read about 4 times as fast, but this factor of 4 can also be seen with raw decompression via gzip -d on the command line.
It thus appears that your function add_linkage_entry is the bottleneck here. In particular, the large number of brk calls looks unusual.
The errors in strace output look harmless.
I am trying to store a CSV file that has 430469 rows (excluding the header) and 41 columns (excluding the 2 columns at the beginning) in a matrix in C. For that, I have allocated the matrix dynamically.
The first row of the data contains the header. The first 2 columns contain the latitude and longitude, respectively, and the remaining columns contain the temperature recorded at that latitude and longitude in each year.
To read the data and store it in the matrix, I have used this code:
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    //counting the number of rows and columns
    FILE *mf=fopen("tdata.csv", "r");
    if(mf==NULL)
    {
        perror("Unable to open the file");
        exit(1);
    }
    char line[2000000];
    int r=0, c;
    while(fgets(line,sizeof(line), mf))
    {
        char *token;
        c=0;
        token=strtok(line, ",");
        while(token!=NULL)
        {
            c++;
            token= strtok(NULL, ",");
        }
        r++;
    }
    r=r-1; //since first row is header
    c=c-2; //since first 2 columns are latitude and longitude that we dont need
    printf("The number of row and column is is %d %d \n", r, c);
    //declaring the matrix of size m[r][c]
    double **mat = (double **)malloc(r * sizeof(double *));
    for (int i=0; i<r; i++)
        mat[i] = (double *)malloc(c * sizeof(double));
    // Storing the values in the matrix
    FILE *mf1=fopen("test.csv", "r");
    if(mf1==NULL)
    {
        perror("Unable to open the file");
        exit(1);
    }
    char line1[2000];
    int i=0,j=0;
    while(fgets(line1,sizeof(line1), mf1))
    {
        char *token1;
        char *ptr;
        double ret;
        j=0;
        token1=strtok(line1, ",");
        while(token1!=NULL)
        {
            if(i>0 && j>1)
            {
                double d;
                sscanf(token1, "%lf", &d);
                mat[i-1][j-2] = d;
            }
            token1= strtok(NULL, ",");
            j++;
        }
        i++;
    }
    //Printing the matrix
    for (int i = 0; i < r; i++)
    {
        for (int j = 0; j < c; j++)
        {
            printf("%lf ", mat[i][j]);
        }
        printf("\n");
    }
    //printf("%lf ", mat[23][0]);
    //finding the minimum in every year.
    //To find the minimum of every year, we must at first declare an array having the size as total number of columns
    double* arr;
    arr = (double*)malloc(c * sizeof(double));
    double min;
    for (int j = 0; j < c; j++)
    {
        min=1000;
        for (int i = 0; i < r; i++)
        {
            if(mat[i][j]<min)
                min=mat[i][j];
        }
        arr[j]=min;
    }
    printf("%lf \n", arr[0]);
}
But when I print the matrix, to my surprise, I see that the first few rows are only partially filled and that all values after the 10th row are 0, like this:
-5.670000 -8.260000 -10.700000 -14.130000 -5.150000 -9.850000 -13.160000 -6.830000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-14.230000 -9.640000 -12.350000 -16.820000 -10.760000 -17.540000 -24.110000 -16.380000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-0.870000 -0.770000 -4.720000 -8.000000 -1.220000 -2.900000 -5.060000 -0.110000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-25.190000 -21.960000 -20.880000 -19.400000 -34.940000 -28.290000 -22.710000 -26.630000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-0.940000 0.320000 1.570000 -3.560000 0.210000 0.340000 -3.770000 -2.640000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-19.100000 -19.000000 -14.250000 -21.250000 -18.250000 -17.890000 -19.900000 -16.640000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-2.380000 -1.880000 -2.930000 -6.430000 -5.340000 -5.200000 -10.960000 -7.860000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
4.120000 6.810000 4.890000 3.110000 5.630000 4.180000 5.350000 5.750000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-21.450000 -22.410000 -23.990000 -14.130000 -24.540000 -22.970000 -24.410000 -25.730000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
-7.940000 -5.170000 -9.780000 -7.790000 -6.750000 -3.810000 -5.640000 -3.090000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
Can someone help me and tell me what is wrong with my code that I am getting this behavior? Thank you.
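I am implementing the DCT using this formula (the 8x8 two-dimensional DCT, as used in the code below):
F(u,v) = \frac{1}{4} C(u) C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}, with C(0) = 1/\sqrt{2} and C(k) = 1 for k > 0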
But the results are incorrect. For this 8x8 input matrix:
0 0 0 0 0 0 0 0
210 210 210 210 210 210 210 210
255 255 255 255 255 255 255 255
210 210 210 210 210 210 210 210
0 0 0 0 0 0 0 0
210 210 210 210 210 210 210 210
255 255 255 255 255 255 255 255
210 210 210 210 210 210 210 210
The results I got after passing the data to dct transform function are:
1350.000000 0.000000 -0.000000 0.000000 0.000000 0.000000 -0.000000 -0.000000
-250.897627 -0.000000 0.000000 -0.000000 -0.000000 -0.000000 0.000000 0.000000
-0.000000 0.000000 0.000000 -0.000000 0.000000 -0.000000 0.000000 -0.000000
-461.931139 -0.000000 0.000000 -0.000000 -0.000000 -0.000000 0.000000 0.000000
-510.000000 0.000000 0.000000 -0.000000 -0.000000 -0.000000 0.000000 0.000000
156.770200 0.000000 -0.000000 0.000000 0.000000 0.000000 -0.000000 -0.000000
-0.000000 -0.000000 -0.000000 -0.000000 -0.000000 0.000000 0.000000 -0.000000
-260.946562 -0.000000 0.000000 -0.000000 -0.000000 -0.000000 0.000000 0.000000
(Only 1st column has non-zero values)
The problem is that I was told the correct result should have non-zero values only in the upper left corner of the matrix, and I am not sure what is wrong in my code. Can anyone help me? Thanks.
Here is my DCT code:
static double C(int val){
    if(val == 0)
        return 1.0 / sqrt(2.0);
    else
        return 1.0;
}
void dctTransform(int matrix[8][8], double dctMatrix[8][8]){
    int u, v, x, y;
    double temp;
    for(u=0; u<8; u++)
        for(v=0; v<8; v++){
            temp = 0;
            dctMatrix[u][v] = 0;
            for(x=0;x<8;x++){
                for(y=0;y<8;y++){
                    temp += matrix[y][x]*cos(((2*x+1)*u*M_PI) / 16)*cos(((2*y+1)*v*M_PI) / 16);
                }
            }
            dctMatrix[u][v] = C(u) * C(v) * 0.25 * temp;
        }
}
Aren't you supposed to first change the input values from 0..255 to -128..127? Haven't you also switched x and y in matrix[y][x]?
If I subtract 128 from the input values to center them around zero, and then switch x and y, your code at least gives me the correct values according to the example in Wikipedia.
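To make the two changes concrete, here is a rough Python port of the question's function, written only as a quick numerical check and not as the poster's actual fix. The names `dct_8x8` and `level_shift` are mine. It subtracts 128 from every sample and indexes the input as block[x][y], so that x pairs with u and y pairs with v:

```python
import math


def c(k):
    # Normalisation factor, same as C() in the question.
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_8x8(block, level_shift=128):
    """8x8 2-D DCT with the samples shifted from 0..255 to -128..127."""
    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    # block[x][y], not block[y][x]: x goes with u, y with v.
                    total += ((block[x][y] - level_shift)
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


# The input from the question: every row is constant.
rows = [0, 210, 255, 210, 0, 210, 255, 210]
block = [[value] * 8 for value in rows]
coeffs = dct_8x8(block)
print(coeffs[0][0])
```

The level shift changes only the DC term, which becomes 326 instead of 1350 for this input. Because the input varies along one axis only, the non-zero coefficients still sit in a single column (v = 0) even with both changes applied.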