Grammar reference
teucer opened this issue · comments
I could not find any reference on the grammar that is supported. How does it compare to formulaic?
You're right there's no formal grammar reference. The closest we have is the Getting started section in our docs.
Nevertheless, formulae and formulaic should agree in the design matrices that are constructed from the same formula. The internals are different, but the results should match.
One of the most relevant differences is that formulae implements the |
operator to construct design matrices for random effects, while formulae does not (yet).
Thank you.
How does it compare to R? We are in the process of migrating an R package to python, it would be useful to know them.
Could you tell me which features you're interested in? That could help to point you to relevant differences.
Formulae is very similar to model formulas in R. It still lacks the double pipe ||
operator, but maybe it is not relevant for you.
I was wondering about diff or lag operators for example. It is not clear to me how vector operators are handled.
I think this is going to be clearer if you share a concrete R example with both the input and the output. In particular, how do you use diff()
or lag()
in a R formula?
Some thing like below:
log(y) ~ lag(log(y)) + diff(log(x))
As you can see diff and lag apply to the whole 'vector', they are not scalar functions.
@teucer could you share a reproducible example? That's what I mean with both input and output. I understand the formula, but I don't know what is the expected behavior from that formula.
What I would like to have is the following
- Sample input data
- Sample formula
- How the formula is used (e.g. is it used within
lm()
or within which modeling function?) - Sample output
As far as I know, the result of lag(x)
is of length n
, but the result of diff(x)
is of length n - 1
so it is not clear to me how these are handled.
> x <- 1:5
> length(lag(x))
# [1] 5
> length(diff(x))
# [1] 4
Sorry for the lengthy discussion, I am not the author of the package that we are trying to convert. It seems that people have implemented their own lag
and diff
functions. The general question remains though, would something like below work?
# this is not a scalar function
import numpy as np
def lag(arr, num=1, fill_value=np.nan):
if num >= 0:
return np.concatenate((np.full(num, fill_value), arr[:-num]))
else:
return np.concatenate((arr[-num:], np.full(-num, fill_value)))
xs = np.arange(10)
print(lag(xs)) # > [nan 0. 1. 2. 3. 4. 5. 6. 7. 8.]
dm = design_matrices("y1 ~ lag(x)", data)
Now, it would be useful if we could pass our own additional transformations. I think you are using Environment to do that. What about passing a dictionary of transformations? It might be the case that we are defining them in another file.
# we want to avoid star import: "from .utils import *"
from .utils import lag
def myfun(x):
return x + 1
trans_dict = {"myfun": myfun, "lag": lag}
dm = design_matrices("y1 ~ myfun(x)", data, transformations=trans_dict)
Thanks for the comments clarifying my question!
- Formulae should work fine with array valued functions.
- The problem with the first
lag()
function is that it returnsnp.nan
. Currently, Formulae either raises an error or drops rows containing missing values. We should add a case where we just want to keep missing values. See
Lines 524 to 533 in 38d7f42
- At the moment, it is not possible to pass a dictionary of transformations to
design_matrices()
. Maybe you can modify the following dictionary
formulae/formulae/transforms.py
Lines 401 to 412 in 38d7f42
which is what Formulae consumes when looking for internal transformations. However I don't think this is a clean solution.
- How many transformations do you have? If they are few, you could just do
from .utils import lag, diff, etc
and that would solve the problem.
- 2: It would be good to keep them. We could ignore the rows with
np.nan
during estimation, but would need them later on. - 3: The only way to do is to "monkey patch" (?). Not my preferred approach, hence the suggestion. We could take the additional transformations and concatenate with
TRANSFORMS
. - 4: That is feasible, but I don't know how many we will end up with.
PS: could do PRs if required.
It would be good if you could try to work on a PR for 2 (allowing to keep missing values) and 3, allowing to pass a dictionary with transformations.
For 3, have a look at
Line 513 in 38d7f42
and https://github.com/bambinos/formulae/blob/master/formulae/environment.py
and feel free to ask questions.
As a pointer, I would create a new Environment holding the transforms you want and./or use .with_outer_namespace()
.
Yes. Can we do a release?
The development version as a lot of breaking changes. I want to double check a couple of things before doing a release.
Can you install from github in the meantime?
Ok. Is there an ETA?
PS: My main use case is behind corporate barriers, it would be difficult to install from github. We have an internal pypi proxy.
I've been thinking about the release and I think it's not going to be a problem to have a release now. We need to update the Changelog first.
I just wanted to add tests for some features we don't have covered yet and test whether this development version was OK for what I want to do in Bambi. I guess I can do another release if I need to include more changes for Bambi.
Thank you for the support. Looking forward to the new release.
PS: I would be happy to further contribute if you have further tasks. If you add tasks and tag them (e.g. need contributor), it would be easier for me to pickup and send PRs.
I'm having a problem building the docs, see https://github.com/bambinos/formulae/runs/5511845996?check_suite_focus=true
I don't know what's going on. I tried to reproduce the problem locally (same version of everything) but I couldn't.
Edit It drove me nuts.
I fixed the error by changing how formulae
is installed within the environment where the action runs. See ed81e6e and https://github.com/bambinos/formulae/actions/runs/1969203706.
@teucer there's a new release now :)