Discussion on initial draft of Zipf's Law package

cwickham opened this issue 3 years ago

Charlotte Wickham commented 3 years ago

Moving discussion over from: merely-useful/r-rse#54

At this point it's probably useful to:

review the vignette, which runs through all the functionality, and
glance over the body of the functions in R/

Questions for Discussion

I think everything is up for discussion at this point, but some particular questions might be:

Does the overall interface look OK? I.e. is this too many/too few functions? Are these the right names for the functions? Are the argument names and types OK? Are the output types OK?
How do you feel about the complexity of the body of the functions?

Madeleine Bonsma-Fisher · Answer 1 · Sat Mar 13 2021 03:49:42 GMT+0800 (China Standard Time)

A few thoughts (happy to submit as a PR if you prefer):

Looks great overall, the functions look like they're doing exactly what they should with no unnecessary padding, and I think it's a good amount of functionality.
I found the names of the functions dpower and nllpower in fit.R a bit confusing: maybe we could make them more readable, i.e. nllpower -> nlog_likelihood (same as in Py-RSE)? Looks like dpower doesn't have an exact counterpart in the Python one but maybe something like frequency or frequency_power_law instead?
How about n instead of x as the argument for dpower and nllpower?
Can we actually define alpha = 1/(beta_hat - 1) at the end of fit_zipfs for clarity?
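
As a minimal sketch of that last suggestion: the helper name `zipf_alpha()` and the `fit` list below are hypothetical, not taken from the package. Only the relation alpha = 1/(beta_hat - 1) comes from the thread.

```r
# Hypothetical helper: convert a fitted power-law exponent (beta_hat)
# into the Zipf exponent alpha via alpha = 1 / (beta_hat - 1).
zipf_alpha <- function(beta_hat) {
  # The relation is only meaningful for beta_hat > 1.
  stopifnot(is.numeric(beta_hat), all(beta_hat > 1))
  1 / (beta_hat - 1)
}

# Stand-in for whatever fit_zipfs() gets back from its optimizer.
fit <- list(beta = 2.5)

# Name alpha explicitly so readers do not have to derive it themselves.
fit$alpha <- zipf_alpha(fit$beta)
```

Returning both values under explicit names would let a caller read `fit$alpha` directly instead of recomputing it from `fit$beta`.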

Charlotte Wickham · Answer 2 · Sat Mar 13 2021 05:25:18 GMT+0800 (China Standard Time)

@mbonsma Thanks for taking a look!

I found the names of the functions dpower and nllpower in fit.R a bit confusing: maybe we could make them more readable, i.e. nllpower -> nlog_likelihood (same as in Py-RSE)? Looks like dpower doesn't have an exact counterpart in the Python one but maybe something like frequency or frequency_power_law instead?

How about n instead of x as the argument for dpower and nllpower?

dpower was named that way to be consistent with the other density functions in R, e.g. dnorm, dunif, dchisq etc. That's the rationale behind using x (as opposed to n) as well, since these functions all have the data vector go in as x. It may also be my statistician perspective, but data is always x or y for me, with n reserved for sample size. But then again, I don't actually follow this convention with plot_zipfs() or fit_zipfs(), which both use n for the word count vector. (We could look at that too: would counts be a better argument name when the function expects a vector of word counts?)
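
To illustrate the convention being described, here is a short sketch. The `dnorm`, `dunif`, and `dchisq` calls are base R; the body of `dpower()` is an assumption (one common discrete power-law form), since the package's actual definition is not shown in this thread.

```r
# Base R density functions share a "d" prefix and take the data as `x`.
x <- c(0.2, 0.5, 0.8)
dnorm(x, mean = 0, sd = 1)   # normal density
dunif(x, min = 0, max = 1)   # uniform density
dchisq(x, df = 2)            # chi-squared density

# Hypothetical dpower() following the same naming and argument convention.
# ASSUMPTION: p(x) = (1/x)^(beta - 1) - (1/(x + 1))^(beta - 1), x = 1, 2, ...
# This form is illustrative; the real function body may differ.
dpower <- function(x, beta) {
  (1 / x)^(beta - 1) - (1 / (x + 1))^(beta - 1)
}

dpower(c(1, 2, 10), beta = 2)
```

Under the assumed form, the probabilities over x = 1, 2, ... telescope to a total of 1 for any beta > 1, so it behaves like a proper probability mass function.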

I think I'd like to keep the naming convention for dpower and the x argument, but it is definitely worth a callout in the book, i.e. let's make a point of saying the naming here was chosen to be consistent with other existing R functions. We might also make the point that since these functions are not user-facing, the naming is more about making our lives (i.e. those of whoever is writing the package) easier, rather than those of our users (the people using the package). (Of course, we still need to think about our readers.)

I definitely agree that nllpower is a bit cryptic, and it doesn't have a natural analog in base R, so the consistency argument doesn't really matter here. nlog_likelihood works, although it would be nice to be explicit that this is just for the power law likelihood. Is nlog_likelihood_power too long?

Can we actually define alpha = 1/(beta_hat - 1) at the end of fit_zipfs for clarity?

Yes, this is a great idea.

Madeleine Bonsma-Fisher · Answer 3 · Thu Mar 25 2021 23:45:55 GMT+0800 (China Standard Time)

dpower was named that way to be consistent with the other density functions in R, e.g. dnorm, dunif, dchisq etc. That's the rationale behind using x (as opposed to n) as well, since these functions all have the data vector go in as x. It may also be my statistician perspective, but data is always x or y for me, with n reserved for sample size. But then again, I don't actually follow this convention with plot_zipfs() or fit_zipfs(), which both use n for the word count vector. (We could look at that too: would counts be a better argument name when the function expects a vector of word counts?)

I agree with this reasoning for keeping dpower the way you wrote it. I think if we're dropping the convention in plot_zipfs and fit_zipfs, I support using counts instead of n for maximum clarity.

I think I'd like to keep the naming convention for dpower and the x argument, but it is definitely worth a callout in the book, i.e. let's make a point of saying the naming here was chosen to be consistent with other existing R functions. We might also make the point that since these functions are not user-facing, the naming is more about making our lives (i.e. those of whoever is writing the package) easier, rather than those of our users (the people using the package). (Of course, we still need to think about our readers.)

This is a great point and definitely worth mentioning that function naming is for the creator as well as the user!

I definitely agree that nllpower is a bit cryptic, and it doesn't have a natural analog in base R, so the consistency argument doesn't really matter here. nlog_likelihood works, although it would be nice to be explicit that this is just for the power law likelihood. Is nlog_likelihood_power too long?

Now I'm having second thoughts about the n part of the name too - it may be confusing especially if we keep n as the count argument elsewhere. How about neg_ll_power? It seems like abbreviating log-likelihood as ll is pretty common. But if this were my code, I would generally go for a long name when in doubt (negative_log_likelihood_power). Since it's internal, no one has to call it so the length matters less?

Luke W. Johnston · Answer 4 · Fri Mar 26 2021 00:23:18 GMT+0800 (China Standard Time)

Completely agree here with Maddie's suggestion about negative_log_likelihood_power. Verbose is fine considering we have autocompletion, and it helps others who look at the code to know what it is. The code is much more self-documenting.

Luke W. Johnston · Answer 5 · Fri Mar 26 2021 00:27:49 GMT+0800 (China Standard Time)

I think the same reasoning applies to dpower. If it's internal, I'd prefer a longer, more descriptive name over a shorter one, even if it goes against base R conventions. Plus, let's be honest, a lot of base R functions should have been named better. For example, I always read runif() as "run if" and always forget what it does. If it were named "random_uniform", that would be clearer! It doesn't help that ?runif isn't the most friendly documentation for users or non-statisticians.

Charlotte Wickham · Answer 6 · Fri Mar 26 2021 02:43:25 GMT+0800 (China Standard Time)

Ok, I'm on board. Let's do negative_log_likelihood_power(), density_power() and use counts as the argument. I'll make the edits.
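
A rough sketch of how the renamed internals might look. The function names and the `counts` argument follow the decision above; the function bodies, the example data, and the search interval are assumptions for illustration, not the package's actual code.

```r
# ASSUMPTION: discrete power-law form, used here only for illustration.
density_power <- function(counts, beta) {
  (1 / counts)^(beta - 1) - (1 / (counts + 1))^(beta - 1)
}

# Negative log-likelihood of beta given a vector of word counts.
# beta comes first so optimize() can search over it.
negative_log_likelihood_power <- function(beta, counts) {
  -sum(log(density_power(counts, beta)))
}

# Made-up word counts, purely for demonstration.
counts <- c(120, 60, 40, 30, 24, 20, 17, 15, 13, 12)

# One-dimensional minimization over beta; the interval is an arbitrary choice.
fit <- optimize(
  negative_log_likelihood_power,
  interval = c(1.01, 5),
  counts = counts
)
beta_hat <- fit$minimum
```

With the longer names, the call to `optimize()` reads close to a sentence, which is the self-documenting quality argued for earlier in the thread.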