lee-group-cmu / RFCDE

Random Forests for Conditional Density Estimation

When x_new doesn't have the same dimensionality as the training covariates.

colorlace opened this issue · comments

First off: Great package and paper. Sometimes a point estimate just ain't gonna cut it. Anyway...

In the docstring for predict() in core.py it states...

x_new: numpy array/matrix
The covariates for the new observations. Each row/value
corresponds to an observation. Must have the same
dimensionality as the training covariates.

However, I am able to produce a density for inputs that do not match dimensionally with the training covariates. I adapted your RFCDE.ipynb code to provide an example.

import rfcde
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)


def generate_data(n):
    x_relevant = np.random.uniform(0, 1, (n, 10))
    x_irrelevant = np.random.uniform(0, 1, (n, 10))
    z = np.random.normal(np.sum(x_relevant, axis=1), 1, n)
    return np.hstack([x_relevant, x_irrelevant]), z


n_train = 100
n_test = 3


# This training set has 20 covariates. (10 relevant + 10 irrelevant)
x_train, z_train = generate_data(n_train)

n_trees = 1000
mtry = 4
node_size = 20
n_basis = 15

forest = rfcde.RFCDE(n_trees=n_trees, mtry=mtry, node_size=node_size, n_basis=n_basis)
forest.train(x_train, z_train)

# Prediction
bandwidth = 0.2
n_grid = 100
z_grid = np.linspace(0, 10, n_grid)

# !!! Here's a test set of 3 observations with only 1 covariate,
# instead of the 20 (10 relevant + 10 irrelevant) from the training set.
funky_x_test = np.array([[.1], [.2], [.3]])

density = forest.predict(funky_x_test, z_grid, bandwidth)

# no problem: weights and densities are still produced; plots may follow

I'm wondering if I am interpreting your docstring correctly. Are you aware of this? If so, how are the weights being produced when the x_new data is missing covariates? I am looking through your cpp files, but I'm not very experienced with C++, so in the meantime I thought I would ask whether you know the answer off the top of your head (or whether this behavior surprises you!). Thanks.

So I'm pretty confused about why it's not segfaulting. The C++ code is just getting a pointer to each row and using the column offsets to traverse the individual trees. So at some point it should trigger an out-of-bounds memory access, because the rest of the variables in the row are missing (well, I guess the pointer would just move on to the next row, but even then you should get an out-of-bounds read for the last row). Instead it just seems to be reading the memory at those locations and splitting on those junk values.
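The offset arithmetic described above can be illustrated with plain NumPy (a sketch, not the package's actual C++ code; `n_train_cols` is just the dimensionality from the example above):

```python
import numpy as np

n_train_cols = 20  # dimensionality the trees were trained on

# The C++ side treats the test matrix as one flat buffer and steps
# through it with row-pointer + column-offset arithmetic.
funky_x_test = np.array([[.1], [.2], [.3]])  # 3 rows, only 1 column
flat = funky_x_test.ravel()                  # [0.1, 0.2, 0.3]

# A tree asking for, say, column 5 of row 0 computes offset 0*20 + 5,
# which lands past the 3 values that actually exist. In C++ that read
# returns whatever happens to sit in memory there instead of erroring.
row, col = 0, 5
offset = row * n_train_cols + col
print(offset, offset >= flat.size)  # 5 True -> out-of-bounds in C++
```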

Obviously this isn't desirable. I think the easiest fix is just introducing an assertion that the number of dimensions has to be the same, because otherwise the predictions don't make sense.
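On the Python side, that check could look roughly like this (a sketch only; `predict_with_check` and the `n_train_features` parameter are hypothetical, not part of the RFCDE API):

```python
import numpy as np

def predict_with_check(forest, x_new, z_grid, bandwidth, n_train_features):
    """Hypothetical wrapper: refuse to predict on mismatched dimensions."""
    x_new = np.atleast_2d(x_new)
    if x_new.shape[1] != n_train_features:
        raise ValueError(
            f"x_new has {x_new.shape[1]} covariates, but the forest "
            f"was trained on {n_train_features}.")
    return forest.predict(x_new, z_grid, bandwidth)
```

With a guard like this, the `funky_x_test` example above would raise a clear `ValueError` instead of silently splitting on junk memory.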

Thanks for the bug report and using the package! :)

Thanks for your help. Follow-up question: should I expect to see this change reflected in the published version on PyPI?

Should be updated now!