JDG Lab Notebook

MSE Estimation Comparison

Alternative versions: LabNotebook, nbviewer, slideviewer, pdf

Summary of Results

Overview

The following notes address a practical issue that arose in my research: I had been using two different in-house tools for estimating MSE (multiscale entropy), inherited from two different people (i.e. I did not write them), and they were giving divergent results. One of these functions was written in matlab and one in python. I needed a python option, because the MSE computation forms part of a suite of jobs intended for a large compute cluster, and those jobs already use python to run TVB simulations; but I also needed the results to be directly comparable to previous results on empirical data obtained with the matlab code. So, I have ported the matlab code to python. The following sections present the new implementation, along with some tests and comparisons.

Test Data

We'll look at three test datasets:

  1. Random noise
  2. EEG data
  3. Synthetic neural data

Here's what the time series look like:

[Figure: Test Dataset 1 - time series (image link broken)]
[Figure: Test Dataset 2 - grand mean time series (image link broken)]
[Figure: Test Dataset 3 - grand mean time series (image link broken)]

MSE Curves

Random noise:

[Figure: Test Dataset 1 - MSE curves (image link broken)]

EEG data:

[Figure: Test Dataset 2 avtr - MSE curves (image link broken)]

Synthetic data:

[Figure: Test Dataset 3 avr - MSE curves (image link broken)]

Performance

The matlab function and the TVB sample entropy function are comparable in speed, but at the moment my port of Natasa's code is very slow.

all_durs = pd.concat({'td1': df_td1_1k_durs,
                      'td2 avtr': df_td2_avtr_durs,
                      'td3 avr': df_td3_avr_durs})

d(all_durs.unstack())
          duration
              nkml   nkpy    tvb
td1        00:1.32  02:27  00:02
td2 avtr   00:0.31  00:21  00:00
td3 avr    00:0.13  00:05  00:00

Conclusions

(to add...)

Analysis Documentation + Code

Notebook Setup

Enable matlab in the notebook

%load_ext pymatbridge
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge-f50fe9f8-965e-461c-a580-9bf758ce4e4b
Send 'exit' command to kill the server
...................MATLAB started and connected!

Define some variables

# put some system-specific variables in the namespace ('le' dict)
%run ~/set_localenv_vars.py

nb_name = 'mse_estimation_comparison'

outdir = '%s/notebooks/%s' %(le['data_dir'], nb_name)
!mkdir -p $outdir


# Analysis stuff

tvb_dir = le['code_dir'] + '/libraries_of_others/github/tvb-pack'

wfdb_dir = le['code_dir'] + '/libraries_of_others/misc'\
                             '/WFDB/wfdb-app-toolbox-0-9-9/mcode'

eeglab_dir = '/usr/share/matlab/site/m/eeglab11'

empdat_dir =  le['data_dir'] + '/copied_from_pales_jheisz_eeg_jen_Sternberg'\
                              '/Artifact_removal_results'

empdat_file_orig = empdat_dir  + '/st_01_exp10ym/del1.set'
empdat_file = outdir +  '/st_01_exp10ym_del1.mat'

simdat_dir =  le['data_dir'] + '/notebooks/mse_brute_force_exploration/g2do_sims'
simdat_file = simdat_dir + '/mbfe_#182_ad6d923b-13c8-4aac-abe3-59b91645bb7d_res.h5'

m2u = 2
r2u = 0.5
s2u = range(1,51)

mse_tools_file  = outdir + '/mse_tools.py'

# stuff for workdocs-cloudfiles

aws_key = 'drowssaperucesyreva'
aws_secret = '?teytidettopsuoyevah'  

Importage

# generic imports

import os,sys,glob,h5py,inspect,numpy as np,pandas as pd
from copy import copy,deepcopy
from datetime import datetime
from scipy.io import savemat,loadmat
from scipy.stats import zscore


# visualization stuff

%matplotlib inline
from matplotlib import pyplot as plt
from IPython.display import Image,HTML,IFrame,clear_output,display as d
import seaborn as sns

# tvb stuff

sys.path += [tvb_dir + '/' + t for t in ['tvb-library', 'tvb-data', 'tvb-framework',
                                         'matlab', 'external-geodesic']]
from tvb.analyzers.info import sampen



# aws api and workdocs-cloudfiles stuff

sys.path.append(le['ipynb_workdocs_dir'])
from new_utils import nb_fig,cloudfiles_nb

Grab Natasa's matlab function and put a copy in this notebook's output folder

nkmatfuncdir = le['data_dir'] + '/notebooks/reproducing_Heisz2015'
!cp $nkmatfuncdir/get_multiple_mse_curves_matlab.m $outdir/

Add eeglab and wfdb toolboxes, plus outdir with some custom .m files, to matlab path

%%matlab -i wfdb_dir,eeglab_dir,outdir
addpath(wfdb_dir)
addpath(genpath(eeglab_dir))
addpath(outdir)

Initialize aws api and workdocs-cloudfiles folder

cnb = cloudfiles_nb('aws', [aws_key,aws_secret])
cnb.initialize_folder(nb_name)

Calico document tools

%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools',
                        'calico-cell-tools');

Go to output folder

os.chdir(outdir)
%%matlab
cd(outdir)

Ok, let's get cracking...

Define some functions

Here is my python port of Natasa's get_multiple_mse_curves_matlab.m function. For reference, the original matlab code is also given below.

def get_multiple_mse_curves_py(data,m=2,r=0.5,scales=None,override=True):
    
  """
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Port of get_multiple_mse_curves_matlab.m by Natasa Kovacevic. 
  
  With additional annotations from JG explaining what's going on. 
  
  Usage:   mse, scales = get_multiple_mse_curves_py(data,m,r,scales) 
  
  Inputs:
  
     data     = time series data in (time, variable) format,
                e.g. (time, trial) or (time, voxel)
     m        = pattern length (default 2)
     r        = similarity criterion (default 0.5)
     scales   = vector of scales (default [1:(floor(size(data,1)/50))])
     override = True or False (default True in this port) - in some rare
                cases we want to override the min-50 rule (which caps the
                largest usable scale at floor(num_tpts/50))

  Outputs:
  
     mse      = entropy in (scale variable) format
     scales   = return vector of scales actually used by the program
     
     
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     
     
  JG Notes: 
  ---------
  
    ***Unfinished!***: not yet giving *exactly* the same outputs as the matlab code.
                       The error in my port is in the construction of the 'cont' array
                       (see the candidate fix sketched just after this function).

    Speed: currently the python function runs incredibly slowly compared to the
           matlab function. I don't see any obvious reason why this should be the
           case, so I need to identify where the slow parts are and improve them.

    Redesign: once this function is working satisfactorily, I will probably move to a
              modified version that separates the estimation of sample entropy from
              the loops over scales and input variables.
    
  """
    
    
  # Import numpy functions individually, so that the function calls
  # are as similar to the original matlab code as possible
  # (...so be warned, this port is going to be more 'matlabic' than 'pythonic'...)
    
  from numpy import (unique,floor,size,intersect1d,sort,
                     r_,std,mean,zeros,log,linspace,arange,round)
    
  num_tpts = size(data,0)
  num_vars = size(data,1)

  if scales is None: 
    scales = arange(1,floor(num_tpts/50)+1)  
    
  # make sure that scales are positive integers 
  scales = sort(unique(scales))  # sort scales in increasing order
      
  if sum(scales == round(scales)) != len(scales):
    raise ValueError('error: scales vector must contain positive integers') 
    
  if override is False:
     # make sure that scales are < (num_tpts/50)
     # (note: matlab's [1:floor(...)] is 1-based and inclusive, so the
     #  python equivalent runs from 1 to floor(...)+1)
     scales = sort(intersect1d(scales, r_[1:floor(num_tpts/50)+1]))
            
  # normalize data and r
  sd = std(data,axis=0)
  r_new = r*sd    

  mse = zeros([len(scales), num_vars])
            
  # Note:
  # The matlab code uses the pattern length variable 'm'
  # in two different ways: one related to array sizes,
  # and one related to array indexes. For the former,
  # the usage in python and matlab will be identical,
  # but for the latter python will use 'm-1' (because,
  # like most proper programming languages, python
  # uses 0-based indexing).
  # This looks a little unusual in the indexing of the
  # 'cont' array constructed at each scale, which has
  # size m+1, giving python array indices (but not
  # sizes) of 'm-1+1'. I have kept this rather than
  # simplifying it to 'm', for the sake of clarity when
  # comparing the matlab and python code versions.
            
  for s,sc in enumerate(scales):
    
    # coarse-grain time series at this scale
    num_cg_tpts = int(floor(num_tpts/sc))
    y = zeros([num_cg_tpts, num_vars])
    for t in r_[0:num_cg_tpts]:
      y[t,:] = mean(data[(t*sc + r_[0:sc]).astype(int),:],0)
        
    # calculate sample entropy of the coarse-grained time series y
    nlin_sc = num_cg_tpts - m
    cont = zeros([m+1,num_vars])
    
    # The i and l loops define shifting pairs of points, with incrementally
    # increasing starting points and gaps between them
    for var in r_[0:num_vars]:
      for i in r_[0:nlin_sc]:
        for l in r_[(i+1):nlin_sc]: # self-matches are not counted

          # count consecutive template matches for this (i,l) pair
          # (NB: this is the block that does not yet match the matlab code)
          k = -1
          while ((k <= m-1) & (abs(y[i+k,var] - y[l+k,var]) <= r_new[var])):
            k += 1
            cont[k,var] += 1
            
    # calculate mse at this scale
    for var in r_[0:num_vars]:
      if cont[m-1+1,var] == 0 or cont[m-1,var] == 0:
        mse[s,var] = -log(1.0/((nlin_sc)*(nlin_sc - 1)))
      else:
        mse[s,var] = -log(cont[m-1+1,var]/cont[m-1,var])
      
  return mse,scales
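Since the error noted above is known to lie in the construction of the 'cont' array, here is a minimal sketch of a candidate fix, translating the matlab matching loops directly into 0-based indexing. The matlab version starts at k = 0, counts matches of length 1..m into cont(1:m), and uses a separate if statement to count full (m+1)-length matches into cont(m+1); the port above instead starts at k = -1, so its first comparison reads y[i-1], and it never fills the final row of cont separately. The helper name count_matches is mine, and this sketch has not yet been verified against the matlab outputs:

def count_matches(y, m, r_new):

  """
  Candidate 0-based translation of the matlab matching loops (untested sketch).

  y     = coarse-grained data in (time, variable) format
  m     = pattern length
  r_new = per-variable similarity threshold (r * std of the original data)
  """

  from numpy import zeros

  num_cg_tpts, num_vars = y.shape
  nlin_sc = num_cg_tpts - m
  cont = zeros([m+1, num_vars])

  for var in range(num_vars):
    for i in range(nlin_sc):
      for l in range(i+1, nlin_sc): # self-matches are not counted
        k = 0
        # matches of length 1..m: matlab cont(1:m) -> python cont[0:m]
        while (k < m) and (abs(y[i+k,var] - y[l+k,var]) <= r_new[var]):
          cont[k,var] += 1
          k += 1
        # full (m+1)-length matches: matlab cont(m+1) -> python cont[m]
        if (k == m) and (abs(y[i+m,var] - y[l+m,var]) <= r_new[var]):
          cont[m,var] += 1

  return cont

With this counting, the final block of the port above (which reads cont[m-1+1] and cont[m-1]) would correspond exactly to the matlab cont(m+1)/cont(m) ratio.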
    
    
    
    
def gmmcs_nkpy(data,m=2,r=0.5,scales=None,override=True):
    
  from datetime import datetime
  from numpy import r_
    
  if scales is None: scales = r_[1:21]  # scale 0 would break the coarse-graining
    
  start = datetime.now()

  mse,scales = get_multiple_mse_curves_py(data,m=m,r=r,scales=scales,override=override)

  finish = datetime.now()
  dur = str(finish-start)      
        
  return mse,scales,dur
    
    
    
    
def gmmcs_tvb(data,m=2,r=0.5,scales=None,qse=False):    
  
  from datetime import datetime
  from numpy import r_,newaxis,ndim

  if scales is None: scales = r_[1:21]  # scale 0 would break the coarse-graining
    
  start = datetime.now()

  if ndim(data) == 1: data = deepcopy(data)[:,newaxis]    
        
  mse = np.array([sampen(data[:,roi], m=m,taus=scales,r=r,qse=qse)
                  for roi in range(data.shape[1])])
  
  finish = datetime.now()
  dur = str(finish-start)      
        
  return mse,scales,dur
For reference, the original matlab function:

function [mse, scales]= get_multiple_mse_curves_matlab(data,m,r,scales,override)

% $$$ Written by Natasha Kovacevic - based on M. Costa's C code - simplified - skipped all sorts of command line options  - here we simultaneously get mse curves across trials or voxels
% $$$ Usage:   [mse, scales]= get_multiple_mse_curves_matlab(data,m,r,scales) 
% $$$ Inputs:
% $$$   data = time sereis data in (time variable) format, e.g (time trial) or (time voxel)
% $$$   m = pattern length (default 2)
% $$$   r = similarity criterion (defualt 0.5)
% $$$   scales = vector of scales (default [1:(floor(size(data,1)/50))])
% $$$   override = 0 or 1 (default is 0) - in some rare cases we want to oveeride this min50-rule
% $$$ Outputs:
% $$$   mse = entropy in (scale variable) format
% $$$   scales = return vector of scales actually used by the program


    num_tpts = size(data,1);
    num_vars = size(data,2);
    if ~exist('m','var'), m=2; end
    if ~exist('r','var'), r=0.5; end
    if ~exist('scales','var'), scales = [1:(floor(num_tpts/50))] ; end

    % make sure that scales are positive integers 
    scales = sort(unique(scales)); % sort scales in increasing order
    if sum(scales == round(scales)) ~= numel(scales)
      error('scales vector must contain positive integers');
    end

    if exist('override','var') 
      if ~override      
        % make sure that scales are  < (num_tpts/50)
        scales = sort(intersect(scales,[1:(floor(size(data,1)/50))]));
      end
    end


    % mormalize data and r
    sd = std(data,[],1);
    r_new = r * sd;

    mse = zeros(numel(scales),num_vars); %(scale var)
    for s=1:numel(scales)
      sc = scales(s);

      % coarse grind time series at this scale
      num_cg_tpts = floor(num_tpts/sc);
      y = zeros(num_cg_tpts, num_vars);
      for t = 1:num_cg_tpts
        y(t,:) = mean(data((t-1)*sc + [1:sc],:),1);
      end 

      % calculate sample entropy of coarse ground time series y
      nlin_sc = num_cg_tpts - m;
      cont = zeros(m+1,num_vars);
      for var = 1:num_vars
        for i = 1:nlin_sc
          for l = (i+1):nlin_sc % self-matches are not counted
            k = 0;
            while ((k < m) & (abs(y(i+k,var) - y(l+k,var)) <= r_new(var)))
              k = k + 1;
              cont(k,var) = cont(k,var) + 1; 
            end
            if ((k == m) & (abs(y(i+m,var) - y(l+m,var)) <= r_new(var)))
              cont(m+1,var) = cont(m+1,var) + 1;
            end
          end
        end          
      end

      % calculate mse at this scale
      for var = 1:num_vars
        if (cont(m+1,var) == 0 | cont(m,var) == 0)
          mse(s,var) = -log(1/((nlin_sc)*(nlin_sc -1)));
        else
          mse(s,var) = -log(cont(m+1,var)/cont(m,var));
        end
      end 

    end % for s=1:numel(scales)

Test 1: Random sample

Define the data

td1_10k_dat = np.random.randn(3*10000)[:,np.newaxis]
df_td1_10k_dat = pd.DataFrame(td1_10k_dat)
td1_1k_dat = np.random.randn(3*1000)[:,np.newaxis]
df_td1_1k_dat = pd.DataFrame(td1_1k_dat)

Plot the data

[Figure: Test Dataset 1 time series. 1000 data points from the 10,000-point random test dataset (image link broken)]

Compute MSE in matlab with original NK function

td1_1k_nkml_matfile = outdir + '/td1_1k_nkml_test.mat'
%%matlab -i td1_1k_dat,td1_1k_nkml_matfile,m2u,r2u,s2u -o td1_1k_nkml_mse,td1_1k_nkml_scales,td1_1k_nkml_dur

tic
[td1_1k_nkml_mse,td1_1k_nkml_scales] = get_multiple_mse_curves_matlab(td1_1k_dat,m2u,r2u,s2u);
td1_1k_nkml_dur=toc;

save(td1_1k_nkml_matfile,'td1_1k_dat', 'td1_1k_nkml_mse', 'td1_1k_nkml_scales', 'td1_1k_nkml_dur')
df_td1_1k_nkml_mse = pd.DataFrame(td1_1k_nkml_mse.copy(),
                                  index=np.squeeze(td1_1k_nkml_scales.copy())).T
df_td1_1k_nkml_mse.columns.names = ['scale']

Compute MSE in python with ported NK function

td1_1k_nkpy_mse,td1_1k_nkpy_sc,td1_1k_nkpy_dur = gmmcs_nkpy(td1_1k_dat,m=m2u,
                                                            r=r2u,scales=s2u)


df_td1_1k_nkpy_mse = pd.DataFrame(td1_1k_nkpy_mse, index=td1_1k_nkpy_sc).T
df_td1_1k_nkpy_mse.columns.names = ['scale']

Compute MSE in python using TVB sampen function

td1_1k_tvb_mse,td1_1k_tvb_sc,td1_1k_tvb_dur = gmmcs_tvb(td1_1k_dat,m=m2u,r=r2u,scales=s2u)
df_td1_1k_tvb_mse = pd.DataFrame(td1_1k_tvb_mse,columns=td1_1k_tvb_sc)
df_td1_1k_tvb_mse.columns.names = ['scale']

Plot the three MSE curves together

[Figure: Test Dataset 1 - all MSE curves (image link broken)]

Durations:

# Collect the three timings into a common format (matlab's toc gives float
# seconds; the python wrappers return 'H:MM:SS.ffffff' strings)
tmp1 = '00:%.2f' %td1_1k_nkml_dur
tmp2 = ':'.join(td1_1k_nkpy_dur.split(':')[1:]).split('.')[0]
tmp3 = ':'.join(td1_1k_tvb_dur.split(':')[1:]).split('.')[0]
df_td1_1k_durs = pd.DataFrame(np.array([tmp1,tmp2,tmp3]), index=['nkml', 'nkpy', 'tvb'])
df_td1_1k_durs.columns = ['duration']
d(df_td1_1k_durs)
      duration
nkml   00:1.32
nkpy     02:27
tvb      00:02
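The same three-way timing table is rebuilt for test datasets 2 and 3 below; a small helper along the following lines (fmt_durs is a hypothetical name, not defined in this notebook) could consolidate the formatting. Note that the matlab tic/toc duration is a float number of seconds, while the two python wrappers return 'H:MM:SS.ffffff' strings from str(timedelta):

def fmt_durs(ml_secs, nkpy_dur, tvb_dur):

  """
  Collect the three timing results into a single-column 'duration' dataframe.

  ml_secs           = float seconds, from matlab's tic/toc
  nkpy_dur, tvb_dur = 'H:MM:SS.ffffff' strings from the gmmcs_* wrappers
  """

  strip = lambda s: ':'.join(s.split(':')[1:]).split('.')[0]  # -> 'MM:SS'

  df = pd.DataFrame(np.array(['00:%.2f' %ml_secs, strip(nkpy_dur), strip(tvb_dur)]),
                    index=['nkml', 'nkpy', 'tvb'])
  df.columns = ['duration']

  return df

# e.g.: df_td1_1k_durs = fmt_durs(td1_1k_nkml_dur, td1_1k_nkpy_dur, td1_1k_tvb_dur)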

Save to file

df_td1_1k_dat.to_pickle(outdir + '/df_td1_1k_dat.pkl')
df_td1_1k_nkml_mse.to_pickle(outdir + '/df_td1_1k_nkml_mse.pkl')
df_td1_1k_nkpy_mse.to_pickle(outdir + '/df_td1_1k_nkpy_mse.pkl')
df_td1_1k_tvb_mse.to_pickle(outdir + '/df_td1_1k_tvb_mse.pkl')
df_td1_1k_durs.to_pickle(outdir + '/df_td1_1k_durs.pkl')

Test 2: Empirical EEG Data

Grab the data, convert from original eeglab format, and save as a standard matlab array.

fdir,fname = os.path.split(empdat_file_orig)
%%matlab -i eeglab_dir,fdir,fname,empdat_file

addpath(genpath(eeglab_dir))

eeg = pop_loadset(fname,fdir);
eeg = eeg_checkset(eeg);
data = double(eeg.data);

save(empdat_file, 'data')
td2_dat = loadmat(empdat_file, struct_as_record=False,
                 squeeze_me=True)['data']

td2_avt_dat = np.mean(td2_dat,2).T
td2_avtr_dat = np.mean(td2_avt_dat,1)[:,np.newaxis]

df_td2_avt_dat = pd.DataFrame(td2_avt_dat)
df_td2_avtr_dat = pd.DataFrame(td2_avtr_dat)
df_td2_avtch1_dat = pd.DataFrame(td2_avt_dat[:,0])
# Z-scored ones; not sure if using...

Z_td2_dat = zscore(td2_dat,1)
Z_td2_avt_dat = np.mean(Z_td2_dat,2).T
Z_td2_avtr_dat = np.mean(Z_td2_avt_dat,1)[:,np.newaxis]

df_Z_td2_avt_dat = pd.DataFrame(Z_td2_avt_dat)
df_Z_td2_avtr_dat = pd.DataFrame(Z_td2_avtr_dat)
df_Z_td2_avtch1_dat = pd.DataFrame(Z_td2_avt_dat[:,0])
[Figure: EEG grand mean time series (image link broken)]

Compute MSE with matlab

td2_avtr_nkml_matfile = outdir + '/td2_avtr_nkml_test.mat'
%%matlab -i td2_avtr_dat,td2_avtr_nkml_matfile,m2u,r2u,s2u -o td2_avtr_nkml_mse,td2_avtr_nkml_scales,td2_avtr_nkml_dur

tic
[td2_avtr_nkml_mse,td2_avtr_nkml_scales] = get_multiple_mse_curves_matlab(td2_avtr_dat,m2u,r2u,s2u);
td2_avtr_nkml_dur=toc;

save(td2_avtr_nkml_matfile,'td2_avtr_dat', 'td2_avtr_nkml_mse', 'td2_avtr_nkml_scales', 'td2_avtr_nkml_dur', 'm2u', 'r2u', 's2u');
df_td2_avtr_nkml_mse = pd.DataFrame(td2_avtr_nkml_mse.copy(),index=np.squeeze(td2_avtr_nkml_scales.copy())).T
df_td2_avtr_nkml_mse.columns.names = ['scale']

Compute MSE with ported python function

td2_avtr_nkpy_mse,td2_avtr_nkpy_sc,td2_avtr_nkpy_dur = gmmcs_nkpy(td2_avtr_dat,m=m2u,
                                                            r=r2u,scales=s2u)

df_td2_avtr_nkpy_mse = pd.DataFrame(td2_avtr_nkpy_mse, index=td2_avtr_nkpy_sc).T
df_td2_avtr_nkpy_mse.columns.names = ['scale']

Compute MSE with TVB

td2_avtr_tvb_mse,td2_avtr_tvb_sc,td2_avtr_tvb_dur = gmmcs_tvb(td2_avtr_dat,m=m2u,r=r2u,scales=s2u)
df_td2_avtr_tvb_mse = pd.DataFrame(td2_avtr_tvb_mse,columns=td2_avtr_tvb_sc)
df_td2_avtr_tvb_mse.columns.names = ['scale']

Plot the three sets of MSE curves

[Figure: Test Dataset 2 grand mean MSE curves (image link broken)]

Durations

# Collect the three timings into a common format (as for test dataset 1)
tmp1 = '00:%.2f' %td2_avtr_nkml_dur
tmp2 = ':'.join(td2_avtr_nkpy_dur.split(':')[1:]).split('.')[0]
tmp3 = ':'.join(td2_avtr_tvb_dur.split(':')[1:]).split('.')[0]
df_td2_avtr_durs = pd.DataFrame(np.array([tmp1,tmp2,tmp3]), index=['nkml', 'nkpy', 'tvb'])
df_td2_avtr_durs.columns = ['duration']
d(df_td2_avtr_durs)
      duration
nkml   00:0.31
nkpy     00:21
tvb      00:00

Save to file

df_td2_avtr_dat.to_pickle(outdir + '/df_td2_avtr_dat.pkl')
df_td2_avtr_nkml_mse.to_pickle(outdir + '/df_td2_avtr_nkml_mse.pkl')
df_td2_avtr_nkpy_mse.to_pickle(outdir + '/df_td2_avtr_nkpy_mse.pkl')
df_td2_avtr_tvb_mse.to_pickle(outdir + '/df_td2_avtr_tvb_mse.pkl')
df_td2_avtr_durs.to_pickle(outdir + '/df_td2_avtr_durs.pkl')

Test 3: Simulated data

Load in the data

f = h5py.File(simdat_file, 'r')

td3_dat = f['tavg_dat'][:]
td3_ts = f['tavg_ts'][:]

f.close()


td3_avr_dat = np.mean(td3_dat,1)[:,np.newaxis]

df_td3_dat = pd.DataFrame(td3_dat,index=td3_ts)
df_td3_avr_dat = pd.DataFrame(td3_avr_dat,index=td3_ts)
[Figure: Test Dataset 3 time series (image link broken)]

Compute MSE with matlab

td3_avr_nkml_matfile = outdir + '/td3_avr_nkml_test.mat'
%%matlab -i td3_avr_dat,td3_avr_nkml_matfile,m2u,r2u,s2u -o td3_avr_nkml_mse,td3_avr_nkml_scales,td3_avr_nkml_dur

tic
[td3_avr_nkml_mse,td3_avr_nkml_scales] = get_multiple_mse_curves_matlab(td3_avr_dat,m2u,r2u,s2u);
td3_avr_nkml_dur=toc;

save(td3_avr_nkml_matfile,'td3_avr_dat', 'td3_avr_nkml_mse', 'td3_avr_nkml_scales', 'td3_avr_nkml_dur', 'm2u', 'r2u', 's2u');
df_td3_avr_nkml_mse = pd.DataFrame(td3_avr_nkml_mse.copy(),index=np.squeeze(td3_avr_nkml_scales.copy())).T
df_td3_avr_nkml_mse.columns.names = ['scale']

Compute MSE with python-ported function

td3_avr_nkpy_mse,td3_avr_nkpy_sc,td3_avr_nkpy_dur = gmmcs_nkpy(td3_avr_dat,m=m2u,
                                                            r=r2u,scales=s2u)

df_td3_avr_nkpy_mse = pd.DataFrame(td3_avr_nkpy_mse, index=td3_avr_nkpy_sc).T
df_td3_avr_nkpy_mse.columns.names = ['scale']

Compute MSE with TVB function

td3_avr_tvb_mse,td3_avr_tvb_sc,td3_avr_tvb_dur = gmmcs_tvb(td3_avr_dat,m=m2u,r=r2u,scales=s2u)
df_td3_avr_tvb_mse = pd.DataFrame(td3_avr_tvb_mse,columns=td3_avr_tvb_sc)
df_td3_avr_tvb_mse.columns.names = ['scale']

Plot the three sets of curves together

[Figure: Test Dataset 3 - all MSE curves (image link broken)]

Durations

# Collect the three timings into a common format (as for test dataset 1)
tmp1 = '00:%.2f' %td3_avr_nkml_dur
tmp2 = ':'.join(td3_avr_nkpy_dur.split(':')[1:]).split('.')[0]
tmp3 = ':'.join(td3_avr_tvb_dur.split(':')[1:]).split('.')[0]
df_td3_avr_durs = pd.DataFrame(np.array([tmp1,tmp2,tmp3]), index=['nkml', 'nkpy', 'tvb'])
df_td3_avr_durs.columns = ['duration']
d(df_td3_avr_durs)
      duration
nkml   00:0.13
nkpy     00:05
tvb      00:00

Save to file

df_td3_avr_dat.to_pickle(outdir + '/df_td3_avr_dat.pkl')
df_td3_avr_nkml_mse.to_pickle(outdir + '/df_td3_avr_nkml_mse.pkl')
df_td3_avr_nkpy_mse.to_pickle(outdir + '/df_td3_avr_nkpy_mse.pkl')
df_td3_avr_tvb_mse.to_pickle(outdir + '/df_td3_avr_tvb_mse.pkl')
df_td3_avr_durs.to_pickle(outdir + '/df_td3_avr_durs.pkl')

Benchmarking

%load_ext line_profiler

At the moment get_multiple_mse_curves_py is taking an obscenely long time to run.

To figure out where the problems are we can use the line_profiler module (and its corresponding notebook extension), which produces a report of how many times each line was hit, how long each hit took, and what percentage of the overall run time was spent on each line.

This requires the source code to be defined in a text file, rather than interactively in the notebook as we have done above.

So first let's use the inspect module to collect the source of our main functions of interest above and write it to a file.

_txt = """\
# mse tools
# =========
#
# (see mse_estimation_comparison notebook entry for 
#  further details on these functions )
#
#


# Imports

import numpy as np
import sys

tvb_dir = '/alexandra/mcintosh_lab/john/Code/libraries_of_mine/github/tvb-pack'
sys.path.append(tvb_dir)
from tvb.analyzers.info import sampen



"""

_txt += inspect.getsource(get_multiple_mse_curves_py) + '\n\n\n\n'
_txt += inspect.getsource(gmmcs_nkpy) + '\n\n\n\n'
_txt += inspect.getsource(gmmcs_tvb) + '\n\n\n\n'

with open(mse_tools_file, 'w') as f: f.write(_txt)

Now import the functions from the new source file, giving them different names (an _lp suffix) from the originals.

import mse_tools
reload(mse_tools)
gmmcs_tvb_lp = mse_tools.gmmcs_tvb
gmmcs_nkpy_lp = mse_tools.gmmcs_nkpy
get_multiple_mse_curves_py_lp = mse_tools.get_multiple_mse_curves_py
gmmcs_nkpy_avtr_lpreport = outdir + '/gmmcs_nkpy_avtr_lpreport.txt'
%lprun -f get_multiple_mse_curves_py_lp -T $gmmcs_nkpy_avtr_lpreport td2_avtr_nkpy_mse,td2_avtr_nkpy_sc,td2_avtr_nkpy_dur = gmmcs_nkpy_lp(td2_avtr_dat,m=m2u,r=r2u,scales=s2u) 
*** Profile printout saved to text file u'/alexandra/mcintosh_lab/john/Data/notebooks/mse_estimation_comparison/gmmcs_nkpy_avtr_lpreport.txt'. 
cat $gmmcs_nkpy_avtr_lpreport 
Timer unit: 1e-06 s

Total time: 39.6435 s
File: mse_tools.py
Function: get_multiple_mse_curves_py at line 21

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    21                                           def get_multiple_mse_curves_py(data,m=2,r=0.5,scales=None,override=True):
    22                                               
    23                                             """
    24                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    25                                             Port of get_multiple_mse_curves_matlab.m by Natasa Kovacevic. 
    26                                             
    27                                             With additional annotations from JG explaining what's going on. 
    28                                             
    29                                             Usage:   mse, scales = get_multiple_mse_curves_py(data,m,r,scales) 
    30                                             
    31                                             Inputs:
    32                                             
    33                                                data     = time series data in (time variable) format, 
    34                                                           e.g (time trial) or (time voxel)
    35                                                m        = pattern length (default 2)
    36                                                r        = similarity criterion (defualt 0.5)
    37                                                scales   = vector of scales (default [1:(floor(size(data,1)/50))])
    38                                                override = 0 or 1 (default is 0) - in some rare cases we want
    39                                                           to override this min50-rule
    40                                           
    41                                             Outputs:
    42                                             
    43                                                mse      = entropy in (scale variable) format
    44                                                scales   = return vector of scales actually used by the program
    45                                                
    46                                                
    47                                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    48                                                
    49                                                
    50                                             JG Notes: 
    51                                             ---------
    52                                             
    53                                               ***Unfinished!***: not yet giving *exactly* the same outputs as the matlab code. 
    54                                                                  The error in my port is in the construction of the 'cont' array. 
    55                                                                  
    56                                               Speed: currently the python function runs incredibly slow compared to the
    57                                                      matlab function. I don't see any reason why this should be the case; 
    58                                                      so need to identify where the slow parts are and improve. 
    59                                                      
    60                                               Redesign: once this function is working satisfactorily, I will probably move to a 
    61                                                         modified version that separate the estimation of sample entropy from 
    62                                                         the loops over scales and input variables. 
    63                                               
    64                                             """
    65                                               
    66                                               
    67                                             # Import numpy functions all individually so that the function calls
    68                                             # are as similar to the original matlab code as possible
    69                                             # (...so be warned this port is going to me more 'matlabic' than 'pythonic'...)
    70                                               
    71         1           35     35.0      0.0    from numpy import (unique,floor,size,intersect1d,sort,
    72                                                                r_,std,mean,zeros,log,linspace,arange,round)
    73                                               
    74         1           21     21.0      0.0    num_tpts = size(data,0)
    75         1            4      4.0      0.0    num_vars = size(data,1)
    76                                           
    77         1            2      2.0      0.0    if scales is None: 
    78                                               scales = arange(1,floor(num_tpts/50)+1)  
    79                                               
    80                                             # make sure that scales are positive integers 
    81         1          216    216.0      0.0    scales = sort(unique(scales)); # % sort scales in increasing order
    82                                                 
    83         1          231    231.0      0.0    if sum(scales == round(scales)) != len(scales):
    84                                               raise ValueError('error: scales vector must contain positive integers') 
    85                                               
    86         1            4      4.0      0.0    if override is False: 
    87                                                # make sure that scales are  < (num_tpts/50)
    88                                                scales = sort(intersect1d(scales,r_[0:floor(size(data,0)/50)]))
    89                                                       
    90                                             # normalize data and r
    91         1          220    220.0      0.0    sd = std(data,axis=0)
    92         1           13     13.0      0.0    r_new = r*sd    
    93                                           
    94         1            8      8.0      0.0    mse = zeros([len(scales), num_vars])
    95                                                       
    96                                             # Note: 
    97                                             # The matlab code uses the pattern length variable 'm' 
    98                                             # in two different ways: one related to array sizes, 
    99                                             # and one related to array indexes. For the former, 
   100                                             # the usage in python and matlab will be identicaly, 
   101                                             # but for the latter python will use 'm-1' (because 
   102                                             # like most proper programming langauges, python 
   103                                             # uses 0-based indexing). 
   104                                             # This looks a little unusal in the indexing of the 
   105                                             # 'cont' array constructed at each scale, which has 
   106                                             # size m+1, giving python array indices (but not 
   107                                             # sizes) 'm-1+1'. I has this rather than simplifying 
   108                                             # it to 'm' for the sake of clarity when comparing the 
   109                                             # matlab and python code versions. 
   110                                                       
   111        51          192      3.8      0.0    for s,sc in enumerate(scales):
   112                                               
   113                                               # coarse-grain time series at this scale
   114        50          629     12.6      0.0      num_cg_tpts = floor(num_tpts/sc)   
   115        50          715     14.3      0.0      y = zeros([num_cg_tpts, num_vars])
   116      5094        26937      5.3      0.1      for t in r_[0:num_cg_tpts]:
   117      5044       450524     89.3      1.1        y[t,:] = mean(data[(t*sc + r_[0:sc]).astype(int),:],0)
   118                                                   
   119                                               # calculate sample entropy of coarse grained time series y
   120        50          254      5.1      0.0      nlin_sc = num_cg_tpts - m       
   121        50          400      8.0      0.0      cont = zeros([m+1,num_vars]);  
   122                                               
   123                                               # The i and l loops define shifting pairs of points with incrementally increasing 
   124                                               # starting points and gaps between 
   125       100         1058     10.6      0.0      for var in r_[0:num_vars]:
   126      4994        20588      4.1      0.1        for i in r_[0:nlin_sc]:
   127   1021172      3781098      3.7      9.5          for l in r_[(i+1):nlin_sc]: # self-matches are not counted
   128                                                   
   129                                                     #% calculate sample entropy of coarse grained time series y
   130   1016228      2759305      2.7      7.0            k = -1;
   131   1664096     27532032     16.5     69.4            while ((k <= m-1) & (abs(y[i+k,var] - y[l+k,var]) <= r_new[var])):
   132                                                       # c1,c2,targ = 0,0,0
   133                                                       #  while c1 & c2 <= targ:
   134                                                       #  targ = r_new[var]
   135                                                       #  c1 = (k <= m-1)
   136                                                       #  c2 = abs(y[i+k,var] - y[l+k,var])
   137                                                       #  #while ((k <= m-1) & (abs(y[i+k,var] - y[l+k,var]) <= r_new[var])):
   138    647868      2012283      3.1      5.1              k+=1 
   139    647868      3054493      4.7      7.7              cont[k,var] +=1
   140                                                       
   141                                               # calculate mse at this scale
   142       100         1178     11.8      0.0      for var in r_[0:num_vars]:
   143        50          369      7.4      0.0        if cont[m-1+1,var] == 0 or cont[m-1,var] == 0:
   144                                                   mse[s,var] = -log(1/((nlin_sc)*(nlin_sc -1)));
   145                                                 else:
   146        50          702     14.0      0.0          mse[s,var] = -log(cont[m-1+1,var]/cont[m-1,var])
   147                                                 
   148         1            3      3.0      0.0    return mse,scales

So, the vast majority (69.4%) of the run time is spent on the line

while ((k <= m-1) & (abs(y[i+k,var] - y[l+k,var]) <= r_new[var])):

which is hit 1664096 times, at around 16.5 µs per hit (the timer unit is 1e-06 s), for a total of roughly 27.5 of the 39.6 seconds.
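This is exactly the python-level inner loop that the 'Redesign' note in the function's docstring anticipated replacing. Since only the cont(m) and cont(m+1) counts enter the entropy formula, and each of those is equivalent to counting template pairs whose Chebyshev distance is within tolerance, the loops could in principle be replaced by broadcast comparisons over template matrices. The following is my own untested sketch of that idea (sampen_cg and coarse_grain are hypothetical names, not part of any existing code here); it trades O(n^2) pairwise-distance memory for speed, and has not been verified to reproduce the matlab outputs:

import numpy as np

def coarse_grain(x, sc):
  # non-overlapping window means at scale sc (replaces the t loop above)
  n = (len(x)//sc)*sc
  return x[:n].reshape(-1, sc).mean(axis=1)

def sampen_cg(x, m=2, r_abs=1.0):
  # sample entropy of a 1D series x, with absolute tolerance r_abs
  # (i.e. the already-normalized r*std used above)
  n = len(x)
  nlin = n - m

  # template matrices: rows are the length-m and length-(m+1) subsequences
  tm  = np.array([x[i:i+m]   for i in range(nlin)])
  tm1 = np.array([x[i:i+m+1] for i in range(nlin)])

  # pairwise Chebyshev distances, upper triangle only (no self-matches)
  iu  = np.triu_indices(nlin, k=1)
  dm  = np.abs(tm[:,None,:]  - tm[None,:,:]).max(-1)[iu]
  dm1 = np.abs(tm1[:,None,:] - tm1[None,:,:]).max(-1)[iu]

  B = np.count_nonzero(dm  <= r_abs)  # all m elements match    (matlab cont(m))
  A = np.count_nonzero(dm1 <= r_abs)  # all m+1 elements match  (matlab cont(m+1))

  if A == 0 or B == 0:
    return -np.log(1.0/(nlin*(nlin - 1)))
  return -np.log(float(A)/B)

Looping coarse_grain and sampen_cg over scales and variables would then give the full set of MSE curves; whether this matches the matlab outputs, and how it compares to the TVB sampen for speed, remains to be checked.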

  • [1] Reproducing, not replicating; a replication would be if the study were repeated and the analysis run on different data. Here I am analyzing the same data and simply attempting to repeat some key analysis steps.
  • [2] Ghostbusters reference. Did you have to ask?