Allow modifying remote/project urls of imported manifests
jobasto opened this issue · comments
Introduction
We use west to checkout all source files necessary for building a firmware and use the import feature as pictured below. This way we don't need to specify the exact version of the modules we want to use because that information is taken from the imported manifests.
project_repository (internal git server)
├─ west.yml
⇓ imports (filtered by name-allowlist)
company_library_repository (internal git server)
├─ west.yml
⇓ imports
zephyr_repository (https://github.com/zephyrproject-rtos/zephyr)
├─ west.yml
⇓ imports
zephyr_module_1 (https://github.com/zephyrproject-rtos/zephyr_module_1)
zephyr_module_2 (https://github.com/zephyrproject-rtos/zephyr_module_2)
zephyr_module_3 (https://github.com/zephyrproject-rtos/zephyr_module_3)
Requirement
To ensure long term reproducibility we want to mirror all externally hosted repositories on our internal git server and use this mirror server for cloning.
Possible solutions
User/system specific git config
One way to achieve this would be to use the git url.<base>.insteadOf config option (see also #28 (comment)). However, this option is user and/or system specific but we want that the actually used servers are deducible by examining the project_repository and its imports.
Specifying git config options in manifest
Another way would be to introduce a new manifest file key that allows setting arbitrary git config options. However, west probably shouldn't set user or system wide config options so it would have to set the config option in all imported repositories.
Adding a rewrite-url
key to the manifest
Probably the cleanest way would be to introduce a new key to the manifest file, e.g. rewrite-url
. An example could look like
manifest:
defaults:
remote: zephyr
rewrite-url:
- from: "default from regexp"
to: "default to regexp "
remotes:
- name: zephyr
url-base: https://github.com/zephyrproject-rtos
projects:
- name: zephyr
revision: xxxxxx
path: modules/zephyr
import:
rewrite-url:
- from: "from regexp 1"
to: "to regexp 1"
- from: "from regexp 2"
to: "to regexp 2"
- name: module
revision: yyyyyyyyyyy
path: modules/lib/module
import:
true # uses default rewrite-url
rewrite-url are a list of rewrite rules that will be applied to a url in order. If manifests are imported multiple levels deep the rewrite rules on the lowest level are applied first.
Open points:
- Is this full fledged implementation necessary or is there a more simple solution?
- Should we support full regexp or just match the start of the url as git does?
- Do we need project specific rewrite-urls? Would it make the implementation if we supported only rewrite-url in the defaults section?
Suggestions?
Did you take a look at
Thanks for pointing that out. I've overlooked this issue. But I'm not quite sure what to make of it.
The solution proposed is to use the url.<base>.insteadOf config option of git, which I had considered above. But if I see it correctly this mandates that every developer has to configure their git installation accordingly. And this changes the servers used to checkout without that fact being reflected in some west.yml
.
But I'm not quite sure what to make of it.
I believe the salient point is "anything 'dynamic' in imports instantly becomes insanely complicated". Quoting @mbolivar-nordic in #565
I thought about how to treat remotes as first class objects you can inherit from sub-manifests that you import. It gets really messy and I couldn't come up with a clean set of rules for it. So the decision I made is that when you import a project, its fetch URL is completely resolved at import time.
Quoting him again in #608
in the presence of this feature, we can't count on a complete workspace, so all commands which require a manifest to work would need to be revisited and correct semantics decided upon (e.g. if I do west init + west update --skip foo + west list, and 'foo' has some imports in it, the manifest will fail to load and west list won't have all the information it needs to do its job. Should it fail? Do we want something else? Etc.)
To be fair, mirroring does not change any git version so it's not actually "dynamic". I'm afraid mirroring is unfortunate "collateral damage" of the more general complexity of changing imports on the fly.
My 2 cents, I'm merely trying (and probably failing) to follow @mbolivar-nordic's expert thoughts on this.
EDIT: https://github.com/zephyrproject-rtos/west/issues?q=label%3A%22Partial+imports%22
Is this full fledged implementation necessary or is there a more simple solution?
I'm instinctively not a fan of regular expressions and "rewrites" for this. It seems vastly overkill to just point at a different git clone. git submodules
somewhat support mirrors and they don't need all that. A much simpler layer of remote
indirection looks like it would be enough do to the job. It would also solve #565 and #608.
It could/should be something as simple git's insteadOf
but implemented in workspace/.west/config
to make it project-specific and not pollute unrelated work.
but we want that the actually used servers are deducible by examining the project_repository and its imports.
I don't understand why. Mirroring should make absolutely no functional difference; it should only affect build performance. If mirroring makes any functional difference then it's a bug.
but we want that the actually used servers are deducible by examining the project_repository and its imports.
Wait, this requirement is actually much worse than I thought: what if I want to use my own, personal/internal mirrors for maximum performance and fault tolerance? Or because my network is isolated for security reasons? I do NOT want to touch any manifest files in that case because comparing manifests is the very first step to make sure I'm building exactly the same thing as someone else - including for security reasons but not just[*]
but we want that the actually used servers are deducible by examining the project_repository and its imports.
I don't understand why. Mirroring should make absolutely no functional difference; only performance and build reliability. If mirror makes a difference then it's a bug.
It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries. So I want to make it obvious that our internal mirror is used and I want to make it the default by putting the rewrite rule (or whatever method) on the company_library_repository level (because every application project is importing this repository).
It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries.
The biggest innovation brought by decentralized version control systems like git and mercurial is that the server does not matter. All clones are equivalent, they all have the same content (except of course for brand new work in progress not synchronized yet).
Hosting and content are now unrelated to each other.
Mirroring and backups come "for free".
In fact this "content-addressable distributed storage" idea is so good that it's being extended beyond git to pretty much everything: https://en.wikipedia.org/wiki/InterPlanetary_File_System
To "ensure you have everything necessary" specific server names do not and must not matter.
So I want to make it obvious that our internal mirror is used and...
If that mattered then there would be a flaw in your design.
It's meant to ensure that we have everything necessary to build a project in-house so that in a couple of years from now no matter what happens we are able to reproduce the binaries.
The biggest innovation brought by decentralized version control systems like git and mercurial is that the server does not matter. All clones are equivalent, they all have the same content (except of course for brand new work in progress not synchronized yet).
That's true as long as one can find the repository somewhere. One could argue that this will be always the case for any valuable piece of source code. But I don't want to rely on somebody else to do the backup for me. Maybe we have different standpoints regarding this subject which explains our discussion.
So how do we proceed? Should we wait for @mbolivar-nordic to have a look at this issue?
That's true as long as one can find the repository somewhere.
Yes, and? I don't understand the point you're trying to make here sorry.
But I don't want to rely on somebody else to do the backup for me.
To perform backups I can (and I do) clone my own repos as many times as I like, I don't have to wait for someone else to do it.
I try to summarize it. I hope this makes my thoughts clear.
- I want to make sure that the application is buildable without relying on any external servers.
- How do I do that? -> I mirror everything using our own servers where I know a backup process is in place.
- How do I make sure that every repository I (or any colleague) need is on the mirror server? -> I make using the mirror server the default so that referencing a repo that is not mirrored will fail.
- How do I make using the mirror the default (also for colleagues and new application projects)? -> I put a rewrite rule in the company_library_repository .
Thanks for making sure there is no misunderstanding.
My point is: 3. and 4. must not be in a manifest. The final mirroring decision must belong to a local configuration file. Having default values and sample configuration files in git is fine but the final mirroring decision must be in an (optional) local file that is not tracked in git.
Even when you and I build from different mirrors we must keep the ability to compare what we build and make sure it's the same thing all the way. This is becoming a legal requirement for obvious security reasons: https://www.cisa.gov/sbom
The very first step (not the only one) to compare our builds is to make sure we build from the same manifests and that our west status
and git status
are clean. Polluting manifests with mirroring details would change either git hashes or git status
and break the ability to compare builds easily.
git (and mercurial, etc.) were a bit of a revolution because they totally decoupled location and version control thanks to the magic of checksumming everything. "Download from anywhere; I don't care". Git submodules kept that critical decoupling: you can mirror submodules and download them from anywhere and they still work. Git submodules come with a default location but it can be redirected in .git/config
without polluting anything tracked in git.
West manifests should certainly not regress this wonderful design decision.
I make using the mirror server the default so that referencing a repo that is not mirrored will fail.
This does not prove anything. You could still have a build script downloading something without west and git knowing anything. For instance CMake's "ExternalProject" does that routinely. I'm not 100% sure but I think it has happened in Zephyr: https://github.com/zephyrproject-rtos/zephyr/issues?q=externalproject. There are many other ways to download from the Internet.
The only way to make really sure is to test your mirror while having no internet access.
This does not prove anything. You could still have a build script downloading something without west and git knowing anything.
Just remembered the Zephyr project has a west blob
command (I never used it)
EDIT: here's another example of code being downloaded at build time. It's wrong but not uncommon
Happy new year!
@marc-hb thank you for collecting all the backlogs!
I believe the salient point is "anything 'dynamic' in imports
instantly becomes insanely complicated".
Yep, the way import works internally is more complicated than people
usually expect.
The internal representations west uses to keep everything consistent
and well-defined when information (like project URLs) can come from
multiple places in the manifest or imported manifests (like 'remote'
or 'url' keys in an individual project definition, or the default
remote if neither is provided) can make some things harder to do than
people expect in particular.
To be fair, mirroring does not change any git version so it's not
actually "dynamic". I'm afraid mirroring is unfortunate "collateral
damage" of the more general complexity of changing imports on the
fly.
I thought about it a bit and we may be able to salvage some sort of
project URL override to satisfy this use case.
manifest:
defaults:
remote: zephyr
rewrite-url:
- from: "default from regexp"
to: "default to regexp "
This does not seem to address the core problem, which as I understand
it is how to selectively rewrite the URLs of imported projects, while
leaving everything else the same. I'm also not clear about the
intended semantics. I am inclined to say "no" to this part of a
proposal without a better description of what problem it solves and
how it solves it.
projects:
- name: zephyr
revision: xxxxxx
path: modules/zephyr
import:
rewrite-url:
- from: "from regexp 1"
to: "to regexp 1"
- from: "from regexp 2"
to: "to regexp 2"
I think that something like this is probably doable, though I share
@marc-hb 's concerns about using regular expressions.
Let me start by saying that while I don't have time to implement this
myself (my main tasks are devicetree-related at the present time), I
would be happy to provide some guidance on how I would like to see
this done.
The function that needs to be hooked is Manifest._load_projects
:
Line 2073 in fc278d4
This is what's responsible for taking the parsed YAML data and
creating a west.manifest.Project object, which has the final url
field that @jobasto wants to override.
We need to track any context-specific information that we may have
inherited from a "parent" manifest (so e.g.
project_repository/west.yml could provide a URL override, which then
needs to be passed downwards as we recursively import
company_library_repository/west.yml and finally
zephyr_repository/west.yml). At this last step, we will call
_load_projects
on the manifest data defining the project, and we
need the "parent" rewrite data in project_repository/west.yml to be
available.
The correct place to put this information is _import_ctx
:
Line 266 in fc278d4
The current import context is always available as
west.manifest.Manifest._ctx
. You should therefore:
-
add a field to
_import_ctx
with an internal representation for all
the rewrite rules you need -
merge it with any new rewrite rules as you recursively resolve
manifests (look at how_compose_imap_filters()
is defined and used
for examples of how west updates its import context as it resolves
imports) -
make sure
_load_projects
consults this context when it creates the
Project object, using the rewrite as necessary
@jobasto is this something you would be willing to take on?
Happy new year!
manifest: defaults: remote: zephyr rewrite-url: - from: "default from regexp" to: "default to regexp "
This does not seem to address the core problem, which as I understand it is how to selectively rewrite the URLs of imported projects, while leaving everything else the same. I'm also not clear about the intended semantics. I am inclined to say "no" to this part of a proposal without a better description of what problem it solves and how it solves it.
The goal is to have a "catch-all" rewrite rule that applies to all imports if no other rewrite rule is specified.
One could also imagine to put a rule in the remotes section:
manifest:
defaults:
remote: upstream
remotes:
- name: upstream
url-base: https://github.com/zephyrproject-rtos
url-rewrite: [...]
However, I don't need it for my use case as I've only one import where I want to rewrite urls. So we can keep things simple and leave that out.
But I'm not sure how far we should think ahead. For example, I only need one rewrite rule. So
rewrite-url:
from: "from regexp"
to: "to regexp"
would be enough for me. But I can imagine that maybe somebody would need multiple rules. So I would prefer to have a list of rules even if I put only one rule in that list.
I think that something like this is probably doable, though I share @marc-hb 's concerns about using regular expressions.
Regarding regular expressions the rationale was a bit like before: Keep it flexible. But I don't need it at the moment. So we could replace the beginning of the url as git does. If the need to have regular expressions should ever arise we could always extend the syntax, e.g.
rewrite-url:
from: "from regexp"
to: "to regexp"
regexp: true
@jobasto is this something you would be willing to take on?
Yes, I would. And thanks a lot for offering guidance.
@marc-hb :
My point is: 3. and 4. must not be in a manifest. The final mirroring decision must belong to a local configuration file. Having default values and sample configuration files in git is fine but the final mirroring decision must be in an (optional) local file that is not tracked in git.
I think this is a separate feature request from what @jobasto is asking for. West has no features for populating configuration variables to certain values by default. Putting this in the manifest is the only way to guarantee that other users are a west init
+ west update
away from getting the repositories from exactly the mirrors you want.
You could certainly additionally add configuration file support for overriding URLs, the same way we have configuration file support for group filters in addition to manifest file syntax for specifying group filters. However that does not seem to address the concrete use case here.
Furthermore upon reflection I am a bit leery of adding configuration file support for this. Anything that allows users to silently get something different when they run west update
is a potential footgun and I am very protective of making sure that west update
won't randomly get you different code. I see group filters as different because they are useful for getting a subset of a workspace, which is different than getting some slightly different use case workspace. Therefore I'd be leery of accepting such a feature without a compelling use case.
The goal is to have a "catch-all" rewrite rule that applies to all imports if no other rewrite rule is specified.
This feels like overkill to me unless you have some examples where it really saves a lot of typing. If the idea of restricting these rules to live inside an import: foo: <list of rules>
map will solve your use case (I think it does, right?), I'd prefer we start there, as long as the design doesn't limit us from adding catch-all rules later.
Remember west is:
Conservative about features: no features will be accepted without strong and compelling motivation.
https://docs.zephyrproject.org/latest/develop/west/why.html#west-design-constraints
So I would prefer to have a list of rules even if I put only one rule in that list.
Agreed; anywhere that one rewrite rule can appear, we should ideally allow several. Though then the semantics of what happens when multiple rules apply needs to be clearly specified. I suggest "first match wins, subsequent matches are ignored". Thoughts?
If the need to have regular expressions should ever arise we could always extend the syntax
Agreed -- please feel free to design a syntax that is future-proof enough to make regular expressions as a future extension, but let's limit this to what you need, and what git can do, at least for now / until we have a clear use case.
@jobasto is this something you would be willing to take on?
Yes, I would. And thanks a lot for offering guidance.
Great, thanks! I'm watching the repository and will review. Feel free to ask questions or chat further on the #west
channel in the zephyr discord instance (https://discord.com/invite/Ck7jw53nU2).
I suggest "first match wins, subsequent matches are ignored". Thoughts?
Note that this is a little subtle since we can inherit rules from multiple levels of parents.
West has no features for populating configuration variables to certain values by default.
I'm afraid we're talking across each other. The default value for mirroring is obviously "no mirroring", so there's no need for west to provide default mirroring configuration variables.
I think this is a separate feature request from what @jobasto is asking for.
Putting this in the manifest is the only way to guarantee that other users are a west init + west update away from getting the repositories from exactly the mirrors you want.
@jobasto and I have clearly different mirroring objectives but it's still mirroring in both cases.
@jobasto wants to make sure different people using the same top-level manifest hit "exactly the mirrors specified".
I believe there is "strong and compelling" (and very different) use case for "invisible mirroring" where mirroring is a local decision; making sure only THE CONTENT is the same no matter where it was downloaded from. In fact it's not me believing that but basically anyone working on content-addressable storage, including the people who designed git and friends. See all the references above.
I think these two different mirroring objectives are not mutually exclusive and I suspect they could share most of their "remote rewriting" implementation besides the user interface.
You could certainly additionally add configuration file support for overriding URLs, the same way we have configuration file support for group filters in addition to manifest file syntax for specifying group filters. However that does not seem to address the concrete use case here.
OK so maybe we're not talking across each other?
Furthermore upon reflection I am a bit leery of adding configuration file support for this. Anything that allows users to silently get something different when they run west update is a potential footgun and I am very protective of making sure that west update won't randomly get you different code.
I don't understand this. If the revision
is a checksum (true content-addressable storage) then rewriting remotes cannot get any different content. That's how git is designed.
If on the other hand the revision
is a moving branch then two people running west update
at a different time will get different content anyway before any kind of mirror gets involved at all.
So mirroring will never make any content difference one way or the other - as expected.
If on the other hand the
revision
is a moving branch then two people runningwest update
at a different time will get different content anyway before any kind of mirror gets involved at all.
These two cases are strikingly different when you are on a support team and you are positive what's in your company's fork of a repository, but have no idea what's in a random customer's mirror.
So there are two manifest types: the deterministic ones that provide true content-addressable storage, reproducible builds, safe supply chains etc. and the other ones with a non-zero number of "moving parts". I understand there are some situations where you want to use the latter instead of the former but these should be considered the exception and not the rule. In such cases you probably want to ask users to (re)move ALL their custom configuration anyway for good measure - including git's insteadOf
which already exists[*], has already been recommended here and that no west design decision can block.
In support situations (I've been there too) you also want to ask for some nice and short west list
anyway to be completely sure there hasn't been any "operator error", misunderstanding or unexpected bug.
[*] ... but is not workspace-specific.
including git's
insteadOf
which already exists, has already been recommended here and that no west design decision can block[*].
That's a fair counterpoint. OK, I'm fine with "someone" (again, not me) adding configuration options for this as well, but I don't think it should block an initial implementation which doesn't have them, especially if that's all the author has time to submit. We can always add more features later as concrete use cases arise.
I am at least a little sympathetic to the "I want to hack something together with my own mirrors and my local config files" -- but if that's the case, insteadOf
already works, so it's still not clear what the added value of a west config option is. We already made the decision not to implement any west-specific support for managing secrets to access private repositories since git already natively supports this. This kind of feels similar.
Thanks, glad we could find some common ground.
but I don't think it should block an initial implementation which doesn't have them, especially if that's all the author has time to submit. We can always add more features later as concrete use cases arise.
Agreed; I'm certainly not expecting anyone to implement something they don't need themselves.
On the other hand, I expect the implementation of one feature not to get in the way of an other, future and very similar feature with which it's likely to share most of its code. That's what I care about.
insteadOf already works, so it's still not clear what the added value of a west config option is.
It's simply global = in the wrong place for some use cases. It's not because you want to use a mirror for some git clones that you want to use a mirror for all git clones / projects / workspaces.
OK, yep, good point. Thanks.
I just want to let everybody know that I've written a little code and will open a pull request as soon as I can (which might still take some time).
Please share code as soon as you can but don't forget @mbolivar-nordic will request some new tests and coverage eventually :-)
A use case we are trying to figure out would be to have "remotes-override" section as part of your initial manifest repository. This would replace project URLs with a different URL.
The use case is when you want to try to retain a 1-to-1 alignment with some internal repositories with external ones that you will periodically push to. But you have a corporate security restriction to not expose internal server names which would normally be in the shared project's west.yml. So your internal repository west.yml will point to external URLs. For internal developers, you would have them start with a different internal manifest repository that would import the primary project but would replace the key URLs with internal server names.
@jobasto did you ever put together code to do this?