pysal / mgwr

Multiscale Geographically Weighted Regression (MGWR)

Home Page: https://mgwr.readthedocs.io/

Spatial Variation Monte Carlo

amedwick opened this issue · comments

Hi!

Is it possible to make the spatial variation Monte Carlo more efficient by using multiprocessing? I've been working on it myself, but I'm clearly doing something wrong, since each set of standard deviations on the Georgia data comes back exactly the same. Since it is important to be able to replicate runs using a seed, I tried taking the randomization of the coordinates out of the loop, creating one big list of permuted coordinates to feed to multiple processes, and then collecting the results. With the 8 cores on my PC, even 1,000 iterations should finish in a reasonable amount of time. Unfortunately, it just doesn't seem to be doing what I hoped. Any suggestions?

Thanks!
Allan

Hi @amedwick ,

GWR can be sped up by passing a multiprocessing.Pool() object to both the selector and the fitting function. This notebook has an example. I haven't actually checked whether that pool parameter can simply be reused inside the spatial variability function call. Could you try that? I can also try it later this week. Essentially, the Monte Carlo test is expected to run 1,000 parallelized GWRs, so we don't strictly need to parallelize the for loop in the MC test itself (though parallelizing across the 1,000 iterations, as you are doing, may perform even better).
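Roughly, the pattern looks like this (a minimal sketch, assuming a recent mgwr version in which search() and fit() both accept a pool keyword; coords, y, X stand in for your prepared data):

import multiprocessing as mp
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW

pool = mp.Pool(processes=mp.cpu_count() - 1)     # leave one core free for the OS

selector = Sel_BW(coords, y, X)
bw = selector.search(pool=pool)                  # parallel bandwidth search
results = GWR(coords, y, X, bw).fit(pool=pool)   # parallel fitting

pool.close()
pool.join()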

Hope this makes sense.

Ziqi

Passing a pool while running an MGWR works great. I should preface this by saying I am completely new to Python and am learning as I go here.

The first step was to pull the randomization of the coordinates out of the Monte Carlo loop, so a single seed generates the whole set of permutations up front:

n_iters = 5
np.random.seed(5536)  # fixed seed so the permutations are reproducible
A = []

for x in range(n_iters):
    temp_coords = np.random.permutation(mgwr_results.model.coords)
    A.append(temp_coords)

print(len(A))

Then I tried creating a new function based on the original spatial_variation function:

import copy

SDs = []

class monte:
    def __init__(self):
        self.data = []  # placeholder attribute; needed here to avoid an error message

    def my_func(self, coords, y=g_y, X=g_X, selector=mgwr_selector, sds=SDs):
        temp_sel = copy.deepcopy(selector)     # copy the fitted selector
        search_params = temp_sel.search_params
        temp_sel.coords = coords               # swap in the permuted coordinates
        temp_sel.search(**search_params)       # re-run the bandwidth search
        temp_params = temp_sel.params
        temp_sd = np.std(temp_params, axis=0)  # SD of each parameter surface
        sds.append(temp_sd)
        return temp_sd

monte().my_func(A[0])

which worked:
Backfitting: 4% 8/200 [00:06<02:54, 1.10it/s]
[0.06891862 0.20525906 0.1436936 0.0603458 ]

My next step was to send the coordinates in A through the function, but in parallel rather than in series:

SDs = []
work = A

from multiprocessing import Pool

def pool_handler():
    p = Pool(8)
    p.map(monte().my_func, work)

if __name__ == '__main__':
    pool_handler()

This is where I am currently stuck. I've tried adapting some of the sample code in the multiprocessing documentation, but it wasn't working; the above is just one example of the things I tried.

Thanks!
Allan

Hi @amedwick,

I think there are some restrictions on passing functions defined inside a class to pool.map(). This thread on Stack Overflow may provide some help.
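If it helps, the usual workaround suggested there is to hand pool.map() a plain module-level function instead of a method on a class defined in the session, since the standard pickler serializes functions by reference and cannot find interactively defined classes in the worker processes. A rough, untested sketch reusing your mgwr_selector and the list A:

import copy
import numpy as np
from multiprocessing import Pool

def monte_worker(coords):
    temp_sel = copy.deepcopy(mgwr_selector)    # mgwr_selector from the earlier step
    temp_sel.coords = coords                   # swap in one set of permuted coordinates
    temp_sel.search(**temp_sel.search_params)  # re-run the bandwidth search
    return np.std(temp_sel.params, axis=0)     # SD of each parameter surface

if __name__ == '__main__':
    with Pool(7) as p:
        SDs = p.map(monte_worker, A)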

The Stack Overflow link was very helpful and pointed me in the right direction. I needed to use the pathos.multiprocessing library. The code now works and runs in what seems to be a reasonable amount of time for the task: 1,000 iterations on the Georgia data with a pool of 7 processes on an Intel i7-4770K took about 52 minutes. I had to leave one core out; otherwise, my computer was unusable while it was running. I just got an i7-11700K (8 physical cores) for a new build, but I'm thinking it might make even more sense to figure out a way to do this on AWS.

The Georgia dataset is relatively small with only 159 observations. The dataset I want to use this on has 3,108 observations, so that is the next test.

Thanks!
Allan

Output from the 1,000-iteration run (parent process time | wall-clock time):

Elapsed time during the whole program in seconds: 0.859375 | 3145.4924421310425
Average time per iteration: 0.000859375 | 3.1454924421310424

#!/usr/bin/env python
# coding: utf-8

# In[1]: import libraries

import numpy as np
import pandas as pd
import libpysal as ps
import geopandas as gp
import matplotlib.pyplot as plt
import matplotlib as mpl

from mgwr.gwr import GWR, MGWR
from mgwr.sel_bw import Sel_BW
from mgwr.utils import compare_surfaces, truncate_colormap

import sys
import time

import pathos.multiprocessing as pm

# In[2]: import Georgia data

georgia_data = pd.read_csv(ps.examples.get_path('GData_utm.csv'))
georgia_shp = gp.read_file(ps.examples.get_path('G_utm.shp'))
g_y = georgia_data['PctBach'].values.reshape((-1,1))
g_X = georgia_data[['PctFB', 'PctBlack', 'PctRural']].values
u = georgia_data['X']
v = georgia_data['Y']
g_coords = list(zip(u,v))
g_X = (g_X - g_X.mean(axis=0)) / g_X.std(axis=0)
g_y = g_y.reshape((-1,1))
g_y = (g_y - g_y.mean(axis=0)) / g_y.std(axis=0)

# In[3]: run MGWR bandwidth selection

mgwr_selector = Sel_BW(g_coords, g_y, g_X, multi=True)
mgwr_bw = mgwr_selector.search(multi_bw_min=[2])
print(mgwr_bw)

# In[4]: fit MGWR

mgwr_results = MGWR(g_coords, g_y, g_X, mgwr_selector).fit()

# In[5]: store parameters

init_sd = np.std(mgwr_results.params, axis=0)
print(init_sd)

# In[6]: randomize coordinates

n_iters = 1000

np.random.seed(5536)

A = []

print("Size of empty list:", sys.getsizeof(A), "bytes")

for x in range(n_iters):
    temp_coords = np.random.permutation(mgwr_results.model.coords)
    A.append(temp_coords)
    
print("Number of items in list:", len(A))
print("Size of full list:", sys.getsizeof(A), "bytes")

# In[7]: define Monte Carlo function

SDs = []

def monte(coords, y=g_y, X=g_X, selector=mgwr_selector, sds=SDs):
    import copy       # local imports so the function is self-contained in the worker processes
    import numpy as np
    temp_sel = copy.deepcopy(selector)     # copy the fitted selector
    search_params = temp_sel.search_params
    temp_sel.coords = coords               # swap in one set of permuted coordinates
    temp_sel.search(**search_params)       # re-run the bandwidth search
    temp_params = temp_sel.params
    temp_sd = np.std(temp_params, axis=0)  # SD of each parameter surface
    sds.append(temp_sd)  # appended inside the worker; the parent collects results via p.map() below
    #print(temp_sd)
    return temp_sd

# In[8]: Check size of SDs list

SDs = []
print(len(SDs))
print(SDs)

# In[ ]: Parallel process randomized coordinates

work = A

p = pm.Pool(7)

t1_start = time.process_time() 
start = time.time()

print("Time Started:", time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime()))

SDs = p.map(monte, work)

t1_stop = time.process_time()
end = time.time()

print("Time Finished:", time.strftime('%a, %d %b %Y %H:%M:%S GMT', time.gmtime()))

print("Length of Returned List:", len(SDs))

print("Elapsed time:", t1_stop, t1_start, "|", start, end) 
   
print("Elapsed time during the whole program in seconds:", t1_stop-t1_start, "|", end-start) 

print("Average time per iteration:", (t1_stop - t1_start) / n_iters, "|", (end-start) / n_iters)

# In[ ]: Calculate p-values

p_vals = (np.sum(np.array(SDs) > init_sd, axis=0) / float(n_iters))
print(p_vals)
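For what it's worth, those p-values are just the share of permuted standard deviations that exceed the observed ones, so small values point to significant spatial variability in a surface. A quick, purely illustrative check against a 5% level (the labels assume the default intercept plus the three covariates in the order they were stacked):

for name, p in zip(['Intercept', 'PctFB', 'PctBlack', 'PctRural'], p_vals):
    print(name, round(p, 3), '(significant)' if p < 0.05 else '')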

Looks promising! I hope I can get an i7-11700K as well...