Discussion on initial draft of Zipf's Law package
cwickham opened this issue
Moving discussion over from: merely-useful/r-rse#54
At this point it's probably useful to:
- review the vignette, which runs through all the functionality, and
- glance over the body of the functions in R/
Questions for Discussion
I think everything is up for discussion at this point, but some particular questions might be:
- Does the overall interface look OK? I.e. is this too many/too few functions? Are these the right names for the functions? Are the argument names and types OK? Are the output types OK?
- How do you feel about the complexity of the body of the functions?
A few thoughts (happy to submit as a PR if you prefer):
- Looks great overall, the functions look like they're doing exactly what they should with no unnecessary padding, and I think it's a good amount of functionality.
- I found the names of the functions dpower and nllpower in fit.R a bit confusing: maybe we could make them more readable, i.e. nllpower -> nlog_likelihood (same as in Py-RSE)? Looks like dpower doesn't have an exact counterpart in the Python one but maybe something like frequency or frequency_power_law instead?
- How about n instead of x as the argument for dpower and nllpower?
- Can we actually define alpha = 1/(beta_hat - 1) at the end of fit_zipfs for clarity?
@mbonsma Thanks for taking a look!
> - I found the names of the functions dpower and nllpower in fit.R a bit confusing: maybe we could make them more readable, i.e. nllpower -> nlog_likelihood (same as in Py-RSE)? Looks like dpower doesn't have an exact counterpart in the Python one but maybe something like frequency or frequency_power_law instead?
> - How about n instead of x as the argument for dpower and nllpower?
dpower was named that way to be consistent with the other density functions in R, e.g. dnorm, dunif, dchisq etc. That's the rationale behind using x (as opposed to n) as well, since these functions all have the data vector go in as x. It may also be my statistician perspective, but data is always x or y for me, with n reserved for sample size. But then again, I don't actually follow this convention with plot_zipfs() or fit_zipfs(), which both use n for the word count vector. (We could look at that too: would counts be a better argument name when the function expects a vector of word counts?)
I think I'd like to keep the naming convention for dpower and the x argument, but it is definitely worth a call-out in the book, i.e. let's make a point of saying the naming here was chosen to be consistent with other existing R functions. We might also make the point that since these functions are not user-facing, the naming is more about making our lives (i.e. those of whoever is writing the package) easier, rather than those of our users (the people using the package). (Of course we still need to think about our readers.)
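For readers who want to check the convention being referred to, here is a small sketch that uses only base R's stats package (nothing from the Zipf's Law package). It inspects the first argument of a few density functions, and of runif() for contrast:

```r
# Density functions in the stats package follow the d<distribution> naming
# pattern and take the data vector as their first argument, named x.
density_functions <- list(dnorm = dnorm, dunif = dunif, dchisq = dchisq)

first_args <- vapply(
  density_functions,
  function(f) names(formals(f))[1],
  character(1)
)
print(first_args)

# By contrast, the random-generation functions (r<distribution>) use n
# for the number of observations to draw.
runif_first_arg <- names(formals(runif))[1]
print(runif_first_arg)
```

This matches the reasoning above: x for the data vector, n for a sample size.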
I definitely agree that nllpower is a bit cryptic, and it doesn't have a natural analog in base R, so the consistency argument doesn't really matter here. nlog_likelihood works, although it would be nice to be explicit that this is just for the power law likelihood. Is nlog_likelihood_power too long?
> - Can we actually define alpha = 1/(beta_hat - 1) at the end of fit_zipfs for clarity?
Yes, this is a great idea.
> dpower was named that way to be consistent with the other density functions in R, e.g. dnorm, dunif, dchisq etc. That's the rationale behind using x (as opposed to n) as well, since these functions all have the data vector go in as x. It may also be my statistician perspective, but data is always x or y for me, with n reserved for sample size. But then again, I don't actually follow this convention with plot_zipfs() or fit_zipfs(), which both use n for the word count vector. (We could look at that too: would counts be a better argument name when the function expects a vector of word counts?)
I agree with this reasoning for keeping dpower the way you wrote it. I think if we're dropping the convention in plot_zipfs and fit_zipfs, I support using counts instead of n for maximum clarity.
> I think I'd like to keep the naming convention for dpower and the x argument, but it is definitely worth a call-out in the book, i.e. let's make a point of saying the naming here was chosen to be consistent with other existing R functions. We might also make the point that since these functions are not user-facing, the naming is more about making our lives (i.e. those of whoever is writing the package) easier, rather than those of our users (the people using the package). (Of course we still need to think about our readers.)
This is a great point, and it's definitely worth mentioning that function naming is for the creator as well as the user!
> I definitely agree that nllpower is a bit cryptic, and it doesn't have a natural analog in base R, so the consistency argument doesn't really matter here. nlog_likelihood works, although it would be nice to be explicit that this is just for the power law likelihood. Is nlog_likelihood_power too long?
Now I'm having second thoughts about the n part of the name too - it may be confusing, especially if we keep n as the count argument elsewhere. How about neg_ll_power? It seems like abbreviating log-likelihood as ll is pretty common. But if this were my code, I would generally go for a long name when in doubt (negative_log_likelihood_power). Since it's internal, no one has to call it, so the length matters less?
Completely agree here with Maddie's suggestion about negative_log_likelihood_power. Verbose is fine considering we have autocompletion, and it helps others who look at the code to know what it is. The code is much more self-documenting.
The same reasoning, I think, applies to dpower. If it's internal, I'd prefer a longer, more descriptive name over a shorter one, even if it goes against base R functions. Plus, let's be honest, a lot of base R functions should have been named better. I always read runif() as "run if", and always forget what it does. If it was named "random_uniform", that would be clearer! It doesn't help that ?runif isn't the most user-friendly or non-statistician-friendly documentation.
Ok, I'm on board. Let's do negative_log_likelihood_power(), density_power() and use counts as the argument. I'll make the edits.
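To make the agreed names concrete, here is a rough sketch of how the renamed functions could look. This is not the package's actual code: the bodies in R/fit.R are not shown in this thread, so the power-law form below is an assumption (the same form as the Python nlog_likelihood in Py-RSE), and the example counts and search interval are made up for illustration. Only the function names, the counts argument, and alpha = 1/(beta_hat - 1) come from the discussion above.

```r
# Hypothetical sketch: the real bodies live in R/fit.R and may differ.
# Assumed probability of a word appearing `counts` times:
#   (1 / counts)^(beta - 1) - (1 / (counts + 1))^(beta - 1)
density_power <- function(counts, beta) {
  (1 / counts)^(beta - 1) - (1 / (counts + 1))^(beta - 1)
}

# Negative log-likelihood of beta given a vector of word counts.
# beta comes first so that optimize() varies it.
negative_log_likelihood_power <- function(beta, counts) {
  -sum(log(density_power(counts, beta)))
}

# Made-up word counts, for illustration only.
counts <- c(10, 5, 3, 2, 2, 1, 1, 1, 1)

# Minimise over beta on an assumed search interval.
fit <- optimize(
  negative_log_likelihood_power,
  interval = c(1.01, 4),
  counts = counts
)
beta_hat <- fit$minimum

# Defined explicitly for clarity, as suggested earlier in the thread.
alpha <- 1 / (beta_hat - 1)
```

With the long names, the optimize() call reads almost like a sentence, which is the self-documenting effect argued for above.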