bambinos / formulae

I could not find any reference on the grammar that is supported. How does it compare to formulaic?

You're right, there's no formal grammar reference. The closest we have is the Getting started section in our docs.

Nevertheless, formulae and formulaic should agree in the design matrices that are constructed from the same formula. The internals are different, but the results should match.

One of the most relevant differences is that formulae implements the | operator to construct design matrices for random effects, while formulaic does not (yet).

Thank you.

How does it compare to R? We are in the process of migrating an R package to Python, so it would be useful to know the differences.

Could you tell me which features you're interested in? That could help to point you to relevant differences.

Formulae is very similar to model formulas in R. It still lacks the double pipe || operator, but maybe it is not relevant for you.

I was wondering about diff or lag operators for example. It is not clear to me how vector operators are handled.

I think this is going to be clearer if you share a concrete R example with both the input and the output. In particular, how do you use diff() or lag() in an R formula?

Something like the following:

log(y) ~ lag(log(y)) + diff(log(x))

As you can see diff and lag apply to the whole 'vector', they are not scalar functions.

@teucer could you share a reproducible example? That's what I mean by both input and output. I understand the formula, but I don't know what the expected behavior of that formula is.

What I would like to have is the following

Sample input data
Sample formula
How the formula is used (e.g. is it used within lm() or within which modeling function?)
Sample output

As far as I know, the result of lag(x) is of length n, but the result of diff(x) is of length n - 1 so it is not clear to me how these are handled.

> x <- 1:5
> length(lag(x))
# [1] 5
> length(diff(x))
# [1] 4
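For comparison, here is a minimal NumPy sketch of one way to reconcile the two lengths on the Python side: pad the differenced series with a leading NaN so that it has length n, like the lagged one. The helpers `lag` and `padded_diff` are hypothetical names for illustration; they are not part of formulae or R.

```python
import numpy as np


def lag(x, k=1):
    """Shift x forward by k >= 0 positions, filling the first k slots with NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    if k < len(x):
        out[k:] = x[: len(x) - k]
    return out


def padded_diff(x):
    """First difference of x, padded with a leading NaN to keep length n."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([np.nan], np.diff(x)))


x = np.arange(1, 6)
print(len(np.diff(x)))      # 4, same as diff() in R
print(len(padded_diff(x)))  # 5, aligned with the original rows
print(len(lag(x)))          # 5
```

With both results at length n, each row of the design matrix still corresponds to one row of the input data; the open question is then only what to do with the rows that contain NaN.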

Sorry for the lengthy discussion, I am not the author of the package that we are trying to convert. It seems that people have implemented their own lag and diff functions. The general question remains though, would something like below work?

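# this is not a scalar function
import numpy as np
from formulae import design_matrices

def lag(arr, num=1, fill_value=np.nan):
    # Shift arr by num positions and pad the vacated slots with fill_value.
    # Slicing with len(arr) - num (instead of -num) also handles num == 0.
    if num >= 0:
        return np.concatenate((np.full(num, fill_value), arr[:len(arr) - num]))
    else:
        return np.concatenate((arr[-num:], np.full(-num, fill_value)))

xs = np.arange(10)
print(lag(xs)) # > [nan  0.  1.  2.  3.  4.  5.  6.  7.  8.]

# data: a DataFrame with columns y1 and x (not shown here)
dm = design_matrices("y1 ~ lag(x)", data)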

Now, it would be useful if we could pass our own additional transformations. I think you are using Environment to do that. What about passing a dictionary of transformations? It might be the case that we are defining them in another file.

# we want to avoid star import: "from .utils import *"
from .utils import lag 

def myfun(x):
    return x + 1

trans_dict = {"myfun":  myfun, "lag": lag}
 
dm = design_matrices("y1 ~ myfun(x)", data, transformations=trans_dict)

Thanks for the comments clarifying my question!

Formulae should work fine with array valued functions.
The problem with the first lag() function is that it returns np.nan. Currently, Formulae either raises an error or drops rows containing missing values. We should add a case where we just want to keep missing values. See

formulae/formulae/matrices.py

Lines 524 to 533 in 38d7f42

    
    if incomplete_rows_n > 0:
        if na_action == "drop":
            _log.info(
                "Automatically removing %s/%s rows from the dataset.",
                incomplete_rows_n,
                data.shape[0],
            )
            data = data[~incomplete_rows]
        else:
            raise ValueError(f"'data' contains {incomplete_rows_n} incomplete rows.")
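To make the missing case concrete, here is a rough sketch of how a third policy that keeps incomplete rows could sit next to the two existing branches. This is not formulae's code: the function `handle_missing` and the option name `"pass"` are assumptions made for illustration, and the sketch uses a plain pandas DataFrame.

```python
import logging

import numpy as np
import pandas as pd

_log = logging.getLogger(__name__)


def handle_missing(data, na_action="drop"):
    """Apply a missing-value policy to a DataFrame (hypothetical helper)."""
    if na_action not in ("drop", "error", "pass"):
        raise ValueError(f"Unknown na_action: {na_action!r}")

    # Boolean mask with one entry per row: True if any column is missing
    incomplete_rows = data.isna().any(axis=1)
    incomplete_rows_n = int(incomplete_rows.sum())

    if incomplete_rows_n > 0:
        if na_action == "drop":
            _log.info(
                "Automatically removing %s/%s rows from the dataset.",
                incomplete_rows_n,
                data.shape[0],
            )
            data = data[~incomplete_rows]
        elif na_action == "error":
            raise ValueError(f"'data' contains {incomplete_rows_n} incomplete rows.")
        # na_action == "pass": fall through and keep the incomplete rows
    return data


df = pd.DataFrame({"y": [1.0, 2.0, 3.0], "x_lag": [np.nan, 1.0, 2.0]})
kept = handle_missing(df, na_action="pass")
dropped = handle_missing(df, na_action="drop")
print(len(kept), len(dropped))  # 3 2
```

Validating the option up front means a typo in `na_action` fails loudly instead of silently behaving like one of the other policies.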

At the moment, it is not possible to pass a dictionary of transformations to design_matrices(). Maybe you can modify the following dictionary

formulae/formulae/transforms.py

Lines 401 to 412 in 38d7f42

    
    TRANSFORMS = {
        "B": binary,
        "binary": binary,
        "C": C,
        "I": I,
        "offset": offset,
        "p": proportion,
        "prop": proportion,
        "proportion": proportion,
        "S": S,
        "T": T,
    }

which is what Formulae consumes when looking for internal transformations. However, I don't think this is a clean solution.

How many transformations do you have? If there are only a few, you could just do from .utils import lag, diff, etc., and that would solve the problem.

2: It would be good to keep them. We could ignore the rows with np.nan during estimation, but would need them later on.
3: The only way to do it is to "monkey patch" (?). Not my preferred approach, hence the suggestion. We could take the additional transformations and concatenate them with TRANSFORMS.
4: That is feasible, but I don't know how many we will end up with.
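As an illustration of the "concatenate with TRANSFORMS" idea in point 3, the sketch below merges a user dictionary with a built-in table at lookup time, without mutating the built-in table. Everything here is hypothetical: `BUILTIN_TRANSFORMS` is a one-entry stand-in for the real table, and `lookup_transform` is not a formulae function. Letting user entries override built-ins is a design choice of this sketch, not something the library is known to do.

```python
def identity(x):
    """Stand-in for a built-in transformation."""
    return x


# One-entry stand-in for the library's built-in table
BUILTIN_TRANSFORMS = {"I": identity}


def lag(values, num=1, fill_value=None):
    """Shift a list forward by num >= 1 positions, padding with fill_value."""
    values = list(values)
    return [fill_value] * num + values[: len(values) - num]


def lookup_transform(name, extra=None):
    """Resolve a transformation by name; user entries override built-ins."""
    table = {**BUILTIN_TRANSFORMS, **(extra or {})}
    try:
        return table[name]
    except KeyError:
        raise KeyError(f"Unknown transformation: {name!r}") from None


trans_dict = {"lag": lag}
lagged = lookup_transform("lag", trans_dict)([1, 2, 3])
print(lagged)  # [None, 1, 2]
```

Because the merged table is built per call, the built-in dictionary stays untouched, which avoids the monkey patching mentioned above.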

PS: could do PRs if required.

It would be good if you could try to work on a PR for 2 (allowing missing values to be kept) and 3 (allowing a dictionary of transformations to be passed).

For 3, have a look at

formulae/formulae/matrices.py

Line 513 in 38d7f42

env = Environment.capture(env, reference=1)

and https://github.com/bambinos/formulae/blob/master/formulae/environment.py

and feel free to ask questions.

As a pointer, I would create a new Environment holding the transforms you want and/or use .with_outer_namespace().

@teucer thanks for the contributions in #69 and #70. Is the development version working for your needs now?
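The general idea behind capturing an environment and adding an outer namespace can be sketched in plain Python. This is only a conceptual illustration, not formulae's Environment class: `capture_namespace` and `evaluate` are hypothetical names, and the lookup order (data first, then the caller's locals and globals, then the extra namespace last) is an assumption of this sketch.

```python
import sys
from collections import ChainMap


def capture_namespace(depth=0, extra=None):
    """Build a lookup chain from the frame `depth` levels above the caller."""
    frame = sys._getframe(depth + 1)
    # Earlier maps win: caller locals, caller globals, then the extra namespace
    return ChainMap(dict(frame.f_locals), dict(frame.f_globals), dict(extra or {}))


def evaluate(expr, data, depth=0, extra=None):
    """Evaluate expr with data columns shadowing the captured namespace."""
    namespace = capture_namespace(depth + 1, extra)
    return eval(expr, {}, namespace.new_child(dict(data)))


def demo():
    def myfun(x):  # visible only through the captured local scope
        return x + 1

    extra = {"shift": lambda v: v * 10}  # supplied as a dictionary instead
    return evaluate("myfun(x) + shift(x)", {"x": 2}, extra=extra)


result = demo()
print(result)  # 23
```

Here `myfun` is found because it lives in the scope that called `evaluate`, while `shift` is found only through the extra dictionary, which is the case a dictionary of transformations defined in another file would cover.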

Yes. Can we do a release?

The development version has a lot of breaking changes. I want to double check a couple of things before doing a release.
Can you install from github in the meantime?

Ok. Is there an ETA?

PS: My main use case is behind corporate barriers, it would be difficult to install from github. We have an internal pypi proxy.

I've been thinking about the release and I think it's not going to be a problem to have a release now. We need to update the Changelog first.

I just wanted to add tests for some features we don't have covered yet and test whether this development version was OK for what I want to do in Bambi. I guess I can do another release if I need to include more changes for Bambi.

Thank you for the support. Looking forward to the new release.

PS: I would be happy to contribute further if you have more tasks. If you add tasks and tag them (e.g. need contributor), it would be easier for me to pick them up and send PRs.

I'm having a problem building the docs, see https://github.com/bambinos/formulae/runs/5511845996?check_suite_focus=true

I don't know what's going on. I tried to reproduce the problem locally (same version of everything) but I couldn't.

Edit: It drove me nuts.

I fixed the error by changing how formulae is installed within the environment where the action runs. See ed81e6e and https://github.com/bambinos/formulae/actions/runs/1969203706.

@teucer there's a new release now :)


Grammar reference