alan-if / alan-docs

Alan IF Documentation Project

Home Page: https://git.io/alan-docs

Storing generated PDF creates rebase conflicts

thoni56 opened this issue · comments

I know there are some advantages to storing a generated version of the PDFs in the repo, e.g. there is always a version that can be pointed to from the website.

But having just been through the hassle of rebasing the dev-man branch onto a recent chunk of text changes, and having been forced to handle a conflict for each and every commit (and probably also forcing a lot of that onto whoever wants to pull now), I think we need to re-think this. (I had no other problems...)

I propose that we figure out a way to upload the generated files to repo releases or somewhere else (website), so that we can get rid of the generated binary files from the repo to streamline the editing of the two branches.

I quickly looked for a git option to always ignore conflicts for some files but didn't find much. Possibly we could use a (custom?) merge-driver. Never heard about it before so that's why I added the "research" tag...

After some googling I found https://gist.github.com/tmaybe/4c9d94712711229cd506 which explains merge drivers a bit, and also indicates how to use one for exactly this purpose. So I'm gonna try this out on the dev-man branch and see what I can learn from it.
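A minimal sketch of the merge-driver idea, not necessarily what the linked gist does: Git lets a `.gitattributes` entry route a path pattern to a named merge driver, and a driver whose command is `true` exits successfully without touching the file, so the current side's version is kept and no conflict is reported. The `*.pdf` pattern and the driver name `ours` are assumptions for illustration.

```python
from typing import List


def gitattributes_lines(patterns: List[str], driver: str = "ours") -> List[str]:
    """Return .gitattributes entries routing each pattern to a merge driver."""
    return [f"{pattern} merge={driver}" for pattern in patterns]


def driver_config_command(driver: str = "ours") -> List[str]:
    """Return the git command that defines the driver.

    The driver program is `true`: it exits with status 0 and leaves the
    file untouched, so Git keeps the version from the current side.
    """
    return ["git", "config", f"merge.{driver}.driver", "true"]


if __name__ == "__main__":
    # Hypothetical pattern; the real repository layout may differ.
    print("\n".join(gitattributes_lines(["*.pdf"])))
    print(" ".join(driver_config_command()))
```

Two caveats: the driver is only consulted when both sides changed the file, and during a rebase the "current side" is the branch being rebased onto, so the generated files would still need a rebuild afterwards.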

I propose that we figure out a way to upload the generated files to repo releases or somewhere else (website), so that we can get rid of the generated binary files from the repo to streamline the editing of the two branches.

We could keep the generated PDFs only in master branch. I remember that the reason we were also updating them in the dev branches was to check that everything was working fine, but maybe now we no longer need that.

I quickly looked for a git option to always ignore conflicts for some files but didn't find much.

I think that git rerere (reuse recorded resolution) might do the trick:

Thanks for the ideas. Yeah, we could do away with the PDF in the dev-branch.

The "branching" model we decided upon was, at least as I see it, to maintain two versions of the documentation

  • master that is aligned with the current official release but that would be updated with better and better content continuously
  • dev-man that is aligned with the development snapshots; as new features are added, this version of the documentation would also cover them

We would continuously rebase dev-man onto master to ensure that at release a smooth merge of it onto master could be done. Thanks to asciidoc that is practically feasible. And working.

(Maybe that needs to go somewhere, for posterity, and maybe myself ;-), the wiki?)

Removing the PDF from one, or both, branches would help. But that leaves us with no "dev documents" to point to.

There is a CI-job, which currently only does asciidoc validation as far as I can see. Would it be a possible route to expand on that to build the documentation too? Maybe that was our intention all along. Setting up the toolchain in Travis would be a hassle, esp. with asciidoctor-fopub, but it would be interesting to try.

Then we could possibly upload the result from the branches to two different "releases", like "Official" and "Snapshot". Then those would be stable places to point to. And would contain corresponding, and updated, versions of the various documents. (Ok, only the manual would actually need the "Snapshot" version...)

I think I tried something like this (uploading to github) in some other project but did not finish it, probably because of lack of time. I would be interested to dig into this.

Maybe a new issue for it?

ALAN Manual: Dev Snapshots

The "branching" model we decided upon was, at least as I see it, to maintain two versions of the documentation

  • master that is aligned with the current official release but that would be updated with better and better content continuously
  • dev-man that is aligned with the development snapshots; as new features are added, this version of the documentation would also cover them

Exactly. The point is that not every commit to dev-man would require rebuilding the docs; only significant changes to the documentation demand a rebuild, i.e. to give developers access to bleeding-edge features, or just to let them benefit from updated contents. Since this judgement is arbitrary in nature, we haven't so far attempted to automate the process and have just handled it by hand.

Removing the PDF from one, or both, branches would help. But that leaves us with no "dev documents" to point to.

The PDF is really only required in master branch; the dev docs could be just in HTML. After all, they are just a preview courtesy, so one format should suffice (also, the HTML is fully standalone, so it can be downloaded directly and consulted locally).

ALAN Manual: Dev Strategies

We would continuously rebase dev-man onto master to ensure that at release a smooth merge of it onto master could be done. Thanks to asciidoc that is practically feasible. And working.

After the first conflict problems, we've been careful to ensure that the dev branches are always rebaseable onto master, and we also did rebase dev-man whenever master was changed. So, it should be fine if we keep going this way.

Workflow Notes

(Maybe that needs to go somewhere, for posterity, and maybe myself ;-), the wiki?)

We did jot down some notes in CONTRIBUTING.md, but maybe the Wiki could be a better place for a full-blown article on the strategies for maintaining specific documents. It seems that we often lose track of our knowledgebase contents (e.g. the Wiki pages on the problems with ALAN Italian have been there for years); as a general rule, every problem we faced and solved was documented somehow and somewhere, the problem is remembering where!

Since we've been diligently using Issue labels, milestones and Dashboard Projects, it should be fairly easy to find any topic via filtered searches. But this only covers Issues, not in-repo documents or Wiki pages...

Indexing the Existing Knowledgebase

We should strengthen the Wiki of each repository, creating some sort of smart index that allows us to retrieve all these guidelines and notes. The Wiki of the alan-if/alan repository might deserve more attention in this respect, and could become the main "official Wiki", making it easier to use it as a directory to find all knowledgebase articles (or links to Issues); otherwise this knowledge will be scattered across many repositories (and their Wikis). The only problem here is redundancy: some articles naturally belong to the Wiki of their repository, but having to keep an Index in every Wiki of the alan-if org might become cumbersome, unless we can come up with some sort of database solution that can auto-generate and update all these Indexes.

It might seem overkill, but the problem here is that our projects are actually over-documented (whereas usually the opposite is true). Writing guidelines and tutorials takes time, so it's a pity if these articles are not put to use due to lack of visibility.

CI Automation

There is a CI-job, which currently only does asciidoc validation as far as I can see.

It only validates code style consistency via EditorConfig, actually.

Would it be a possible route to expand on that to build the documentation too? Maybe that was our intention all along. Setting up the toolchain in Travis would be a hassle, esp. with asciidoctor-fopub, but it would be interesting to try.

I remember the problem being the asciidoctor-fopub part, which only works on Windows (see #66) due to Win-specific paths in our custom configuration (for fonts or styles, don't remember). The fopub configurations are not very friendly, and quite hostile toward collaborative editing and version control. Probably the best solution would be to check Asciidoctor's native PDF backend, which has been updated a lot in the meantime and has probably solved all the issues that were preventing us from using it here (problems with footnotes inside tables, and a few other missing features, see #9).

ALAN Manual Dev Snapshots: Release Strategy

Then we could possibly upload the result from the branches to two different "releases", like "Official" and "Snapshot". Then those would be stable places to point to. And would contain corresponding, and updated, versions of the various documents. (Ok, only the manual would actually need the "Snapshot" version...)

To achieve this we'd need to define a snapshot release strategy.

Sometimes only one of the two formats needs updating (e.g. because we improved the CSS of the HTML doc, or the template of the PDF); other times there might be just typo fixes, which don't necessarily call for an updated version.

Possible solutions could be:

  1. Regular updates every so often, e.g. every three or six months.
  2. Update the snapshot docs whenever a new Alpha version of ALAN is released.
  3. Others?

The problem with (1) is that end users might have to wait to gain access to new juicy stuff; whereas with (2) the problem is ensuring that we update the contents before releasing the new Alpha, since the build jobs would be automated via some cross-repo CI communication, based on release tags or branch merges.

I think I tried something like this (uploading to github) in some other project but did not finish it, probably because of lack of time. I would be interested to dig into this.

Personally, I'm for the manual approach. After all, we are always aware when new juicy contents are added to the Manual, so each one of us is free to update the HTML snapshot (and even the PDF) based on whether he thinks it's worth sharing it with bleeding-edge authors. It just boils down to deciding whether to add the built docs to the commit stage or not, so no big deal there. Since the snapshot preview links point to the dev-man branch, and the doc names don't change, these links will always show the latest document that was pushed.

Maybe a new issue for it?

We've had a long discussion on this in #6 (now closed); we could reuse some of the text material from that thread if it saves us some typing.

About PDF Rebase Conflicts

Back to the original question of this Issue, which type of conflicts does rebasing on master bring about?

What I usually do when these conflicts come up (inevitably for both HTML and PDF builds, due to date changes or just Asciidoctor template changes) is to simply resolve the conflict with either "ours" or "theirs" and then rebuild both docs on the fly and add them to the stage before committing the rebase. Bear in mind that when rebasing or merging we should always rebuild the docs from scratch (if we include them) because of the way dates and version numbers are handled, and because Asciidoctor updates often introduce CSS changes.

So the best strategy is to always manually rebuild the docs.

This can probably be achieved by some Git hooks/filters (when carrying out specific operations on dev-man, for example).

Thanks for digging up #6, and some good thoughts on information handling, although the concrete approach for that is still up for discovery, decision and implementation ;-)

ALAN Manual: Dev Snapshots

The "branching" model we decided upon was, at least as I see it, to maintain two versions of the documentation

  • master that is aligned with the current official release but that would be updated with better and better content continuously
  • dev-man that is aligned with the development snapshots; as new features are added, this version of the documentation would also cover them

Exactly. The point is that not every commit to dev-man would require rebuilding the docs; only significant changes to the documentation demand a rebuild, i.e. to give developers access to bleeding-edge features, or just to let them benefit from updated contents. Since this judgement is arbitrary in nature, we haven't so far attempted to automate the process and have just handled it by hand.

Good. But I feel we have slightly differing ideas about how to manage the actual "results" of the branches. I read your comments as thinking around releases. I also think "releases" are important, but I also strive for "continuous deployment" to lessen the cognitive load on us to remember to create new "builds" only for "releases", and deliver the value of the change as soon as possible.

E.g. before extracting the manual to this alan-doc project, each development snapshot of Alan also contained a, mostly updated, version of the manual, consistent with the development snapshots functionality. In that model the "stable" documentation was actually that, stable. No improvements could be done in it until the next official release. We have now flipped this, and we can continuously improve the "stable" version. It would be A Good Thing™ (but not strictly required) if that change would immediately benefit readers/users.

...

ALAN Manual Dev Snapshots: Release Strategy

Then we could possibly upload the result from the branches to two different "releases", like "Official" and "Snapshot". Then those would be stable places to point to. And would contain corresponding, and updated, versions of the various documents. (Ok, only the manual would actually need the "Snapshot" version...)

To achieve this we'd need to define a snapshot release strategy.

Sometimes only one of the two formats needs updating (e.g. because we improved the CSS of the HTML doc, or the template of the PDF); other times there might be just typo fixes, which don't necessarily call for an updated version.

Possible solutions could be:

  1. Regular updates every so often, e.g. every three or six months.
  2. Update the snapshot docs whenever a new Alpha version of ALAN is released.
  3. Others?

As already indicated I'd suggest "generate new documentation on every commit".

To me, the release process is different from the continuous builds and deployment. The important thing is that the information in a document generated from master should always match functionality in the latest official release of Alan, and be marked with that version. Any kind of compatible improvement should go in master directly. Changes in functionality should result in a change in dev-man.

A release of Alan will thus also render some additional, manual, work when it comes to alan-doc, especially the manual. This work would then be

  • merge dev-man into master
  • on master: update release marking to the new release
  • on dev-man: update release marking to the next release, still with "development snapshot" label
  • done.

Just stating this, to ensure we are on the same page here, but I'm confident we are. (I'm ignoring the actual build here, since that is implied, being manual or not. I'm also ignoring the actual merge-branch of dev-man, as that is not important for the discussion.)

So when it comes to releases, to me, only the update of the documentation for the new functionality is important. Any other types of changes are inconsequential when it comes to releases and can be done, and published, at any time. An improvement should not need to wait for the next release.

But again, I think we think about this in slightly different ways. So let me put it this way:

What would be the worst thing that could happen if we build a new set of fully usable documentation on every commit?

It's not like there are API incompatibilities that break things, as for a software release. We also use SemVer-like semantics for the two branches, as does the Alan SDK with the official releases and the development snapshots, so the correlation between them is clear.

To be very clear, I'm not forcing the issue. Instead I think it is interesting that we seem to have differing reasoning here, and I'm interested in learning more about your viewpoint.

At the end of writing all this I realise that my concern is primarily the content, and even more so, specifically the manual. But you have worked hard, even struggling, with the toolchain, layout and such things, and also for the other documents. A random change in a tool configuration or version might actually trash things, warranting a "proof-reading" before actually releasing. Is this your concern? If so, I think I can feel where you are coming from...

Good. But I feel we have slightly differing ideas about how to manage the actual "results" of the branches. I read your comments as thinking around releases. I also think "releases" are important, but I also strive for "continuous deployment" to lessen the cognitive load on us to remember to create new "builds" only for "releases", and deliver the value of the change as soon as possible.

E.g. before extracting the manual to this alan-doc project, each development snapshot of Alan also contained a, mostly updated, version of the manual, consistent with the development snapshots functionality.

As far as I can remember, the reason we didn't manage to come up with a solution was a number of unsolved questions that were preventing a CI solution:

  1. asciidoctor-fopub preventing a cross platform build.
  2. Unsolved discussion regarding the Manual last changed date attribute.
  3. ALAN code examples and transcripts being dynamically generated to match the ALAN version for which the Manual is being built — i.e. ensuring that any CI builds are using the correct alan and ARun versions, according to branch.

Also, so far we had only a single Manual release on master, which happened when the preview release of the StdLib came out, and it was very close to the Christmas Holidays, so there hadn't been much follow up on that.

But let's recapitulate the problematic points of the above list...

asciidoctor-fopub Problems

Unless we can find a solution to the fopub problem, it's going to be hard to come up with any CI solution. The current problem prevents using the configuration files on both Windows and Linux, and we're now using Windows for the builds.

Possible solutions:

  1. Use a script to modify these configuration files so they work on Linux and the CI virtual machine (e.g. using sed).
  2. Create a Linux version of these settings in the repository (with a different extension), and have the CI initialization job run a script that replaces the Windows files with them.
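As a rough illustration of option 1, here is a sketch of a path rewrite that a CI initialization step could apply to a configuration file before building. The actual contents of the fopub configuration are not shown in this thread, so the root folders and the sample line below are purely hypothetical.

```python
import re


def to_linux_paths(config_text: str, win_root: str, linux_root: str) -> str:
    """Rewrite Windows paths under win_root into POSIX paths under linux_root."""
    # Match the root plus the rest of the path, stopping at whitespace,
    # quotes or angle brackets so surrounding markup is left alone.
    pattern = re.compile(re.escape(win_root) + r'([^\s"<>]*)')
    return pattern.sub(
        lambda m: linux_root + m.group(1).replace("\\", "/"),
        config_text,
    )


if __name__ == "__main__":
    # Hypothetical configuration line and folders.
    sample = 'font = "C:\\fopub\\fonts\\Example.ttf"'
    print(to_linux_paths(sample, "C:\\fopub\\fonts", "/usr/share/fonts"))
```

Rewriting at build time keeps a single configuration file under version control, at the cost of the CI build using a file that differs from the committed one.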

Alternatively, switch to Asciidoctor's native PDF backend (if it now supports all the needed features), although this would require some extra work before it becomes usable in CI production:

  1. Create an ALAN syntax and theme for the Rouge (Ruby) syntax highlighter, which is needed for this backend.
  2. Find a way to use our custom syntax with Rouge, because we might need to benefit from any changes in real time (i.e. can't wait for a PR to be merged into the upstream project). I have no idea if this can be done, or how.

Manual Date

I remember you proposed that the date attribute should be set by the build script, at conversion time, to avoid having to manually edit it in the source file each time. While this makes sense, there are also some undesired side effects to this:

It would mean that if the docs are rebuilt at every commit, the date would also change in the docs, regardless of whether contents have changed.

This matters especially for the PDF edition (which is usually intended for download, whereas the HTML doc is more for online consultation), because end users might end up downloading it again even if no real changes occurred (which translates to downloading and replacing their local copy, wherever that is stored).

IMO the last updated date should indicate when contents were last modified, and not change when the template was tweaked, or other non-meaningful cosmetic changes took place.
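One way to reconcile both wishes (a date set by the build script, yet tied to content changes) would be to derive it from the last commit that touched the document's sources rather than from the build time. The sketch below only assembles the commands; the file name `manual.asciidoc` is an assumption, while `revdate` is Asciidoctor's standard revision-date attribute.

```python
import subprocess
from typing import List


def last_change_command(paths: List[str]) -> List[str]:
    """git command printing the date (YYYY-MM-DD) of the last commit touching paths."""
    return ["git", "log", "-1", "--format=%cd", "--date=short", "--", *paths]


def asciidoctor_command(source: str, revdate: str) -> List[str]:
    """asciidoctor invocation that passes the revision date on the command line."""
    return ["asciidoctor", "-a", f"revdate={revdate}", source]


def content_revdate(paths: List[str]) -> str:
    """Run git and return the last content-change date (empty if no commit matches)."""
    result = subprocess.run(
        last_change_command(paths), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Hypothetical source file name.
    date = content_revdate(["manual.asciidoc"])
    print(" ".join(asciidoctor_command("manual.asciidoc", date)))
```

With this approach a rebuild triggered by a template or CSS tweak would leave the date unchanged, as long as the template files are not among the paths passed to `git log`.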

Dynamic Examples

Although right now the Manual doesn't use these (other docs here do!), in the near future we'll be adopting this approach more and more, since it proved successful for the StdLib Manual (see AnssiR66/AlanStdLib#82).

The idea is that ALAN code snippets in the docs should be extracted directly from a real source file (via include::) and that their output should be extracted from a real transcript generated by compiling the sample adventures with the matching alan and arun binaries which the document is being written for.

This would ensure that:

  1. All code examples are valid and compilable.
  2. All output samples match the real output generated by the current ALAN version.

This reduces the maintenance work of the examples, by allowing us to "set them and forget them", and have the toolchain automatically produce the correct results.

But it also means that we must ensure that we're using the correct ALAN binaries, both locally and on the CI server, which introduces the problem of having to use the latest Beta on master, and the latest Alpha on dev-man.


As already indicated I'd suggest "generate new documentation on every commit".

You mean that on every commit the docs should be rebuilt and committed to the dev-man branch, even if the commit doesn't alter the Manual and its assets?
This could quickly lead to a huge bloat in the repository size, especially if the build script injects the date, since it would mean that at every commit the generated docs will differ at least in the date value.
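A per-commit build would not have to rebuild unconditionally. As a sketch, a CI step could list the files changed by the commit (for example with `git diff --name-only HEAD~1 HEAD`) and rebuild only when one of them lies under a watched folder. The folder names used here are hypothetical.

```python
from typing import Iterable, Tuple


def needs_rebuild(changed_paths: Iterable[str], watched: Tuple[str, ...]) -> bool:
    """True if any changed path lies under one of the watched directory prefixes."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return any(path.startswith(watched) for path in changed_paths)


if __name__ == "__main__":
    # Hypothetical changed files and watched folder.
    changed = ["manual/intro.asciidoc", "README.md"]
    print(needs_rebuild(changed, ("manual/",)))
```

This does not address the date-injection concern by itself, since a rebuilt document would still differ if the build stamps the current date into it.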

A release of Alan will thus also render some additional, manual, work when it comes to alan-doc, especially the manual. This work would then be

  • merge dev-man into master
  • on master: update release marking to the new release
  • on dev-man: update release marking to the next release, still with "development snapshot" label
  • done.

Just stating this, to ensure we are on the same page here, but I'm confident we are.

I'm 100% with you on this, and I'm also a great fan of all things "auto-magic". It's just that I think that there are still some major problems with the Asciidoctor build that need to be addressed before we can set up a CI toolchain of this type.

Also, I believe you use Circle CI, which I don't know anything about (I use only Travis CI). In the meantime, GitHub Actions have also entered the scene, which seem an interesting way to handle CI tasks, especially with the Marketplace offering ready-made solutions which are maintained by the creators of these GH Actions, particularly when the actions are built by the creators of the tools involved.

What would be the worst thing that could happen if we build a new set of fully usable documentation on every commit?

  1. Huge size bloat of the repository, especially if you inject the date attribute via the build script, because then we could never have identical output of the HTML and PDF files, so they'll always end up in every commit. Also, it seems to me that this would pollute those commits that don't deal with document changes, and would interfere with cherry picking and other interactive Git operations — and of course, we'll never be sure of whether these documents contain real changes or are just the result of this policy.
  2. The whitespace diff bug (that has been afflicting the StdLib repo for years). We also have tests for the source adventures in this repo. Just imagine if one of these sources stumbled on whitespace creeping in at every run: a CI job might trigger an endless series of commits, rebuilding the docs every time, probably until the CI VM crashes or you hit the monthly CI free-minutes limit.
  3. I'm not sure this would be a good service for end users, who might expect new versions of these documents to contain real changes, especially if they are downloading them.

To be very clear, I'm not forcing the issue. Instead I think it is interesting that we seem to have differing reasoning here, and I'm interested in learning more about your viewpoint.

I don't think we have different visions on this; it's just that each of us is focusing on different problems that are preventing this from happening. Whereas you're more focused on how to interconnect this with the ALAN release cycle, I'm more focused on the current unsolved problems of the Asciidoctor toolchain (which prevent any reliable CI build, right now).

Bear in mind that I've been spending quite some time trying to find solutions to a number of Asciidoctor toolchain related problems, and how to come up with a good solution that would work across different repositories that use Asciidoctor for ALAN — so I tend to be more aware of how far these solutions are.

Just to mention briefly one problem: ISO-8859-1 validation!

ECLint simply fails to validate ISO-8859-1 files, raising false positives for valid files. There doesn't seem to be a bullet-proof way to validate files for ISO-8859-1 encoding; you can mostly prove they are not UTF-8, or that they are single-byte encoded. Yet we need assurances that our sources are valid ISO files, especially since modern editors tend to break ISO encoding with almost any paste operation.
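Since every byte sequence decodes as ISO-8859-1, a check can only be heuristic. The sketch below combines the two tests mentioned above with one extra assumption of mine: bytes in the range 0x80 to 0x9F are control codes in ISO-8859-1 and in practice usually indicate Windows-1252 text (e.g. curly quotes), so they are treated as a failure.

```python
def looks_like_latin1(data: bytes) -> bool:
    """Heuristic: True if data is plausibly ISO-8859-1 text."""
    # 0x80-0x9F are C1 control codes in ISO-8859-1; real text rarely uses them.
    if any(0x80 <= b <= 0x9F for b in data):
        return False
    # Pure ASCII is valid in both ISO-8859-1 and UTF-8.
    if all(b < 0x80 for b in data):
        return True
    # High bytes that form valid UTF-8 sequences most likely ARE UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(looks_like_latin1("café".encode("latin-1")))
    print(looks_like_latin1("café".encode("utf-8")))
```

A file that passes is not proven to be ISO-8859-1, only not obviously something else, so this is better suited as a CI warning on suspicious files than as a guarantee.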

The problem gets even worse when we don't have ALAN-specific file extensions for ALAN-related files (e.g. transcripts and solutions), because extensions like .log are usually associated with UTF-8 in most editors; hence my proposal to officially adopt .a3sol and .a3log, which I'm using in most projects anyhow. But the .i extension is also at risk of being corrupted by most editors, since it's a generic extension used by many languages for include files. Since neither Git nor GitHub offer much support for these legacy encodings, we really need a safe and trusted way to ensure they are correctly encoded.

A random change in a tool configuration or version might actually trash things, warranting a "proof-reading" before actually releasing. Is this your concern?

Not really, I mean ... the contents are usually well polished whenever we commit them, and the master branch should only contain finished work for a specific ALAN release. I'm more worried about the fact that there are so many different tools and standards involved that we need to make sure that every piece of the puzzle is solidly constructed, before handing all these to some automatic robot muncher.

If you had been struggling with the "spurious whitespace bug" that has afflicted the StdLib and Alan Italian repos (it suddenly disappeared in the latter, for unknown reasons) you would know how frustrating it can be to work with Git and ALAN sources when things go wrong — at every run the transcripts change, even if nothing was changed, so there's at play some complex interaction between the ISO encoding and Git's lack of support for it here, possibly due to a small bug that spits out a char sequence which to Git is broken UTF-8. The problem is that these changes show up in Git's work space, and these are a nightmare on any CI job, since they prevent many Git operations.

Publishing Dev Snapshots on an Orphan Branch

I've been giving some thought to the whole problem of how and where to "publish" the dev snapshots of the ALAN Manual (both PDF and HTML).

I think that committing them to the dev-man branch is only going to give us problems, both in terms of conflicts as well as in terms of size bloat.

A possible solution would be to tweak the build scripts so that they are branch aware (via a simple git query), so that when running on dev-man they output the PDF/HTML with a different name (or path) which is ignored in the repository. Then the script could commit the new documents to a new and separate branch, created especially for dev snapshot previews of the documents.
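A minimal sketch of such a branch-aware step, assuming a `dev_` prefix for snapshot builds and `manual` as the base file name (both are illustrative choices, not settled conventions):

```python
import subprocess


def current_branch() -> str:
    """Ask git for the checked-out branch name ("HEAD" when detached)."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def output_name(branch: str, base: str, ext: str) -> str:
    """Prefix snapshot builds so they never collide with the files on master."""
    prefix = "dev_" if branch == "dev-man" else ""
    return f"{prefix}{base}.{ext}"


if __name__ == "__main__":
    branch = current_branch()
    print(output_name(branch, "manual", "pdf"))
```

CI services often check out a detached HEAD, in which case the git query returns `HEAD`; there the branch name would have to come from the service's own environment (Travis CI, for instance, exposes it as `TRAVIS_BRANCH`).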

We could create this special branch as an orphan branch, which only stores snapshot preview documents and doesn't share any history with the main repo, so no possible conflicts could come from it, but we'd still be able to offer live preview links of the latest dev docs to end users.

This approach should lend itself well to the various CI tasks, and then we'd only have to focus on auto-rebuilding the Manual on master branch whenever a new ALAN release is out. I believe this is your main concern here, i.e. being able to synchronize automation between new ALAN releases and publishing the latest ALAN Manual on master.

Also, since it is an orphan branch, we could simply force-push a fresh commit at each documents build, effectively resetting the branch at every new snapshot, since we won't really be needing a commit history there. This means that its size would never bloat, even if we rebuild the docs at every single commit on dev-man. The only concern here would be that we might run out of the monthly free minutes of the CI server (or GitHub Actions), which would mean that the CI would stop working until the next month, unless there are funds for more operations/minutes.
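One common way to get this "reset at every snapshot" behaviour is to create a throwaway repository inside the build output folder, make a single root commit there and force-push it to the snapshot branch. The sketch below assembles those commands; the branch name `dev-snapshots`, the folder and the remote URL are hypothetical, and git would also need a committer identity configured on the CI machine.

```python
import subprocess
from typing import List


def publish_commands(remote: str, branch: str, message: str) -> List[List[str]]:
    """Commands to run inside a directory holding only the freshly built docs."""
    return [
        ["git", "init"],                   # new repository, no shared history
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],  # single root commit
        ["git", "push", "--force", remote, f"HEAD:refs/heads/{branch}"],
    ]


def publish(build_dir: str, remote: str, branch: str, message: str) -> None:
    """Run the commands in build_dir, stopping at the first failure."""
    for command in publish_commands(remote, branch, message):
        subprocess.run(command, cwd=build_dir, check=True)


if __name__ == "__main__":
    # Hypothetical values; printed rather than executed.
    for command in publish_commands("origin-url", "dev-snapshots", "Update dev snapshot"):
        print(" ".join(command))
```

Because each push replaces the branch with one parentless commit, the branch always holds exactly one snapshot, although the old objects remain on the server until it garbage-collects them.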

In any case, I think it's important that dev snapshots of the Manual should be built with different names from those on master, to avoid all these annoying conflicts and keep them clearly separated (we could even just add some prefix or suffix to their names, e.g. dev_manual.pdf/.html).

Does it make sense to you?