PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement

Home Page: https://prql-lang.org

Feedback wanted: contributing to the compiler

max-sixty opened this issue

As discussed with @aljazerzen:

How could we better enable folks to make their first few PRs to the compiler?

PRQL has had many successes over the past year, and there's much we can be happy about and grateful for. But one thing we've done a bad job at is enabling folks to contribute to the compiler.

We're not doing well / the trend is negative

For context, over the past six months the patterns of contribution to the project have diverged:

  • Overall, code contributions to PRQL have increased a lot and the number of contributors has broadened 🙌. But the broadening is focused on the parts around the compiler — such as bindings, docs, integrations such as prql-vscode
  • The concentration of code contributions to the compiler has narrowed — I have done less, and @aljazerzen has done much much more.

My mental model of contributions is a funnel: folks see a project, think it's good, maybe try using it, maybe try contributing to docs, maybe try adding some code. Then as they contribute more, they build the context to make increasingly significant contributions. But if we can't enable them to start contributing to the compiler, they can't build the context to contribute more there. So our biggest challenge is to enable folks to make their 1st / 2nd / 3rd compiler PR.

This matters

The current state is bad for the project for a few reasons:

  • Code progress: the project doesn't make the progress it needs to. We've got a lot to do (see our roadmap), and even before the big feature development, we still have some bugs / bad error messages / etc.
  • Context building: beyond losing the code of those who can't contribute, we're losing their ideas. There's a dramatic difference in the speed, and to some extent the quality, of discussion between those who have the deep context of the language and those who don't. We still welcome ideas from all, and there are other ways to build that context, but having written the actual code is likely primary.
  • Bus factor: it puts stress on the existing contributor. While I hope we're each taking joy from the project, it's less joyful if lots of folks are waiting for a bug fix and you're the only person who can fix it.
  • Ratcheting: there's a self-reinforcing effect. When one person writes all the code, the code becomes fitted to their style, and the incentives to make the code engageable (e.g. comments explaining the code, modularity, explicit interfaces) are weaker. Multiple contributors bring inherent robustness.
  • Diversity: it reduces the diversity of the team and of the project, which has externalities beyond just the project.
  • Sharing: we're not sharing the joy of open-source (at least for me, if other projects were less accessible, I wouldn't have started contributing to them eight years ago, and I'd have missed out on the tremendous opportunities that this world has brought me).

Examples

I'm going to list a couple of examples of issues where folks were interested in contributing, but didn't get to submit a PR for these (though they did for other things). It's important these are perceived as examples of the core team failing to make the codebase sufficiently accessible, not as the fault of the potential contributors! Hopefully the context of this issue makes that clear.

(My guess is that there's an order of magnitude more of these where folks didn't comment on an issue they found, or didn't find an issue, despite having an interest in contributing)

And it's worth flagging some successes, so hopefully we can get feedback on what was difficult / could have been better.

Questions

Some questions we'd love feedback on

  • If you tried to make a contribution (even for ten minutes) and didn't manage: what was difficult? What could we have done to make that easier? What projects have you successfully contributed to which we could learn from? Do you recognize something they did which was effective?
  • If you did make a contribution — similar questions to how we could improve. Is there something that made it more difficult to make more contributions?
  • If you know of a project that has done this well — what's the project and what could we learn from them?

What we've done / what we could do

Some things we've done so far:

Some things we could do:

  • Offer "office-hours" where we go through an issue live with someone who's looking to start contributing?
  • Stream / record some of our own contributions?
  • Work on breaking up the compiler into more modular pieces?
    • We've had some good progress on this with PL & RQ, but many of our tests are still end-to-end, so it can be difficult to know where to start. Even an issue such as #1355 requires changes throughout the codebase.
  • Deliberately leave some easier issues? (I know I often want to take a break from work for 30 minutes and get something done, and often that means grabbing an easy issue. But then the ones remaining are the gnarlier ones.)
  • Find areas where there are "repeatable" PRs?
    • I was chatting with @charliermarsh about how Ruff has been so successful in enabling contributions. One advantage Ruff has is that adding a new lint to Ruff is a similar contribution to the lints that have already been added. So folks can look at an existing PR, think about the differences, and then add their own. Maybe we could do something similar with #1420?
  • Others?

I realize this issue is now very long, and getting engagement might have been optimized by just posting "PRQL is easy to contribute to" and seeing people point out why it's wrong (Cunningham's Law). Hopefully the issue at least highlights how much we care about this dimension.

If you can build on the questions above, it's fine to have a short or curt response! Greatly appreciate your feedback, thank you.

I got to the compiler by starting with the code on the binding side:

  1. Create R bindings (Of course, this is because the JS and Python bindings were so simple that I felt it would be easy to create R bindings)
  2. Investigate how to realize the functionality I want in the R binding (convert from string to enum)
  3. Think it would be better to implement it on the compiler side, so try to add the functionality on the compiler side.

My thought is that it is easier to start by fixing the parts of each language binding and the CLI that users can run immediately.
So why not create issues for each language binding and label them as Good First Issue?
For example, something like #1838, which adds a function that was in prql-js but not in prql-python.

Find areas where there are "repeatable" PRs?

In this repository, adding functionality to the bindings for each language would be equivalent to this.
Once we become familiar with the repository, it will be easier for us to contribute to the compiler too.

@max-sixty - Is this a good agenda item for the Dev Call tomorrow? Thanks.

Yes, though tbc, I'm hoping to understand the empirical experiences of those who have arrived new to the project & attempted to contribute to the compiler. So I'm keen to hear from folks who have done that — ideally before we jump to potential solutions. I'll add the label.

Sorry, it seems I accidentally unpinned this issue.

Building on the broader thoughts in #1420 (comment) here:


OK cool — that said, I still think that a different unmerged branch introduces a bunch more context. Maybe I try and get that merged with an initial example after I'm done on some of the current priority issues?

I'm not trying to criticize at all — clearly this work is much much better than doing nothing — I'm more trying to work through why efforts such as this or the precursor to #2251 weren't successful — open to other ideas ofc.

And I guess I'm open to "it's not worth our time to create these issues, easier to do it ourselves". My view might be overly biased by how I got involved with open source eight years ago, and by seeing a few people start-but-then-pause in this repo. And wanting us to compound our time and contributions...

Closing as stale