rust-lang / mdBook

Create book from markdown files. Like Gitbook but implemented in Rust

Home Page:https://rust-lang.github.io/mdBook/

Add multilingual support

azerupi opened this issue · comments

Add support for multiple languages.

multiple languages for document?

Yes, I think Gitbook does support something like that.

Instead of having the markdown files directly in the source folder you would have some sub folders like this:

src/
├── de
├── en
└── fr

And there would be an easy way to change the language in the rendered book.

It's definitely something I would like to add, but it's not the highest priority at the moment

Multiple designs possible:

  • One SUMMARY.md to rule them all

    pros:

    • Changes in structure are reflected in all languages immediately
    • 1 to 1 mapping of pages from one language to another, which would allow changing the language of the page directly from a menu button

    cons:

    • If one language is lagging behind it's going to get ugly
  • One SUMMARY.md for every language

    pros:

    • Every language can have its own pace

    cons:

    • Does not push all languages to be up to date / coherent
    • No 1 to 1 mapping guarantee and thus not possible to toggle the language from a page without having the risk that the page does not exist in the other language

I don't think one SUMMARY.md for everything is a good idea. I consider consistency within translated version more important than consistency with original. Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me, even in case there's single SUMMARY. Words have different length in different languages, and in Russian translation we consistently have sentences that are noticeably longer than original. But I'd love to have it so that one click can show the same point in text in original language.

I think this can be handled by tracking 1-to-1 mapping of paragraphs - sections aka markdown files are too big. Paragraphs also seem a good candidate because sentences get paraphrased and reordered sometimes, but the paragraphs stay in same order and have same gist.

Thanks for the input! I really appreciate the feedback :)

Otherwise, we can easily start having broken links because upstream renamed some chapter and translation didn't, yet. I believe a book that has no broken links is the minimum standard.

Moreover, 1 to 1 mapping of pages doesn't look straightforward to me

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane 😉).

Let's take a hypothetical situation with the Rust book. Let's say I am reading a blog post and it references some chapter in the Rust book, for example the chapter about ownership. But English is not my main language and it would be a lot easier to understand the chapter in my native language. If we have 1 to 1 mapping on page / chapter level the user could then select his language (if it is supported) from a dropdown menu and he would land on the exact same page in his chosen language.

However for this to work correctly we need a guarantee that every page in one language has an equivalent page in the other language. If you allow a different SUMMARY.md per language there is no way to know what pages are equivalent if any equivalent page even exists at all.

Also, I don't support the idea of "pushing" to be up-to-date. AFAIK, translations (not only ours) are done by enthusiasts and it's not always possible to keep up at all times.

Of course, I totally agree with you. But the SUMMARY.md is only about structure, so what order the chapters come in, not the content.

If there is one SUMMARY.md for all languages I think it will only cause trouble if:

  1. New chapters get added, as the equivalent chapters in other languages will just be blank until they are translated
  2. The markdown files get renamed. This should not happen often, and when it does it is not difficult to rename the files accordingly for every language
  3. The chapters are reordered in a way that breaks the continuity of the content. This too should not happen often, but it's more challenging to fix as it requires the translators to translate the text that changed

To be honest, once a book has its definitive structure the SUMMARY.md is not likely to change often unless there is a major rewrite being done.

I think both designs have advantages and drawbacks, we need to figure out which one we want / need the most.


Idea for Rust book workflow when translations are in tree

When / if translations are moved into the official repository we could create a more elaborate pull request process. This is only an idea, it may be flawed 😉

When a pull request is made that contains changes that need translation (i.e. not just typo fixes) we could wait to merge the pull request until translations have been made for all officially supported languages.

The pull request could track what translations have been made using a check list like this:

  • Russian
  • French
  • German

Once all the translations are ready the pull request is merged in.
Officially supported languages could be languages with a minimum number of "official" maintainers.

This would add a little / lot of overhead for the English version but it would solve the two big issues with translations.

  1. Translations would always be up to date!
  2. This is probably the easiest way to track changes

There may be organizational problems I haven't considered though. @steveklabnik

The biggest problem with blocking English changes on non-English changes is that I am paid for my work, but others are not. This places a big burden on them; I'm gonna want to land changes ASAP, and that's not fair to people who can't do this as a day job.

That's true, didn't think of that.
It could still be applied without blocking the English changes? Just for tracking. Not sure if it's worth the overhead though.

Anyways, do you have a preference for any of the two design choices (one vs. multiple SUMMARY.md)?

I think I prefer a single for the reasons you've stated, but since I'm not doing the translations themselves, I don't think my opinions matter much :)

And yeah, tracking might be different/better than actually blocking on them landing.

When I am talking about 1 to 1 mapping I am talking about page to page mapping, not sentence to sentence (that would be insane 😉).

Ok, I think what I was trying to say but couldn't get across is this: page-to-page mapping isn't enough for printed versions, as same pages will have different content. And if by page you meant a web page, that is not enough either. Some sections (pages) are tens of screens long, and to provide smooth transition from one version to another we should track smaller units than entire files (web pages).

I originally thought you were talking about printed pages and wrote the following, but I'm not sure now. For printed versions, depending on the length of the section and the sentence-length difference with the original, this can vary from "I see not the beginning of the paragraph that talks about Foo feature, but the end" to "I don't see the paragraph that talks about Foo feature on screen at all", when linked to "page 83 of PDF".

So let's clarify the terms before continuing as apparently I misunderstood something 😄

Ok yes, I will try to do my best to explain what I envision:

So in this issue I am not at all talking about tracking any changes for translations, only about how to support multiple languages in the same folder / book.

Before I continue, let's explain what the SUMMARY.md does exactly.

When you render the book (mdbook build) it is going to search for the SUMMARY.md and parse it. The SUMMARY dictates

  • The Order of the chapters.
  • The names of the chapters.
  • The markdown file corresponding to each chapter.

That is the "only" information we get from the SUMMARY.md
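To make those three pieces of information concrete, here is a minimal Python sketch that pulls them out of a small SUMMARY.md. It is only an illustration, not mdBook's actual parser: the function name `parse_summary`, the regular expression, and the assumption that nesting uses two spaces per level are all hypothetical.

```python
import re

# Matches a Markdown list item of the form "- [Title](path.md)".
# The leading whitespace is captured to estimate the nesting depth
# (assumption: two spaces per level).
LINK = re.compile(r"^(\s*)[-*]\s+\[([^\]]+)\]\(([^)]+)\)\s*$")

def parse_summary(text):
    """Return the chapters in order as (depth, title, path) tuples."""
    chapters = []
    for line in text.splitlines():
        match = LINK.match(line)
        if match:
            indent, title, path = match.groups()
            chapters.append((len(indent) // 2, title, path))
    return chapters

summary = """# Summary

- [hello world](hello-world.md)
- [second chapter](second-chapter.md)
"""

chapters = parse_summary(summary)
for depth, title, path in chapters:
    print(depth, title, path)
```

The order of the returned list is the chapter order, the second field is the chapter name, and the third is the markdown file for that chapter.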

If we want to support multiple languages for one book, there are two possible designs (that I thought of):

  • One SUMMARY.md at the root of the source directory that will be used for all languages.
  • One SUMMARY.md for every language

Let's see both in more detail.

One SUMMARY.md for all languages

Consider this SUMMARY.md for a book:

# Summary

- [hello world](hello-world.md)
- [second chapter](second-chapter.md)

and this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── fr
    │   ├── hello-world.md
    │   └── second-chapter.md
    ├── ru
    │   ├── hello-world.md
    │   └── second-chapter.md
    └── SUMMARY.md

As you can see here, every language has the same markdown files defined in the global SUMMARY.md. This means that the "hello world" chapter has a corresponding page in every language! (1 to 1 mapping)

Advantages

Having a guarantee that every chapter in one language has a corresponding chapter in another language gives us the possibility to change the language from any chapter and land on that same chapter in the other language.

Example: I am reading the "borrowing" chapter of the Rust book. I want to see that same chapter in French. I just select "French" from the dropdown button in the menu-bar and I will land on the French version of the chapter.
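With a single shared SUMMARY.md, switching language is just a matter of swapping the language folder in the path. A rough Python sketch of that idea follows; the helper names `pages_for_language` and `switch_language` and the `src/<lang>/...` layout are assumptions taken from the directory tree above, not an existing mdBook API.

```python
from pathlib import PurePosixPath

def pages_for_language(chapter_paths, lang, src="src"):
    """Map each chapter path from the shared SUMMARY.md to its file for `lang`."""
    return [str(PurePosixPath(src) / lang / path) for path in chapter_paths]

def switch_language(page, target_lang, src="src"):
    """Given e.g. 'src/en/borrowing.md', return the same chapter in `target_lang`."""
    parts = PurePosixPath(page).relative_to(src).parts
    # parts[0] is the current language folder; everything after it is kept.
    return str(PurePosixPath(src, target_lang, *parts[1:]))

print(pages_for_language(["hello-world.md", "second-chapter.md"], "ru"))
print(switch_language("src/en/borrowing.md", "fr"))
```

Because every language shares the same chapter list, the target path is guaranteed to be listed in the summary, although the file may still be blank until it is translated.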

Drawbacks

When the SUMMARY.md is modified it can cause some consistency problems in the translations because changes in the SUMMARY.md will be reflected immediately in all languages. However, changes in the SUMMARY.md should be relatively rare once the book has found its "final" structure.

Problems that could occur:

  • Chapter is moved: When a chapter is moved (the order of the chapters is rearranged) it could cause problems with text flow.
  • Markdown file is renamed: When a markdown file is renamed it should be renamed in all languages and in all the references to it. This should not be too big of a problem.
  • New chapter is added: When a new chapter is added it will appear blank in the other languages until it's translated.

Content is not modified by the SUMMARY.md, so neither of the designs here is going to cause any trouble with the content when the SUMMARY.md is modified.

Another drawback is that I am not sure yet how translations will give a translation for the chapter titles in the sidebar (SUMMARY.md). Maybe just take the first heading from the corresponding markdown file?

One SUMMARY.md for EVERY language

Let's consider this directory structure:

├── book
└── src
    ├── en
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── SUMMARY.md
    ├── fr
    │   ├── hello-world.md
    │   └── SUMMARY.md
    └── ru
        ├── hello-world.md
        ├── second-chapter.md
        └── SUMMARY.md

As you can see here, every language has its own SUMMARY.md and thus can define the order of their chapters and the markdown files as they wish.

There is absolutely no more guarantee that the French version contains the same chapters as the English version. No 1 to 1 mapping. Essentially every language is its own separate book, they could have exactly the same structure or they could have totally different chapters. There is no way for the program to know that.

It is thus impossible to change the language from a chapter. You would have to navigate to the French version manually and search for the chapter you were reading, if it exists in the French version at all!

Advantages

Translations have a lot more freedom, but this can also be seen as a drawback. Translations do not need to have the same structure, so when the SUMMARY.md is changed in the English version, absolutely nothing is going to change in the other languages. Every change in the translations has to be done manually.

Drawbacks

There is no guarantee that a chapter in one language has an equivalent in another language (no 1 to 1 mapping). The program can not know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.


I hope this made it more clear, if there is still something you don't understand I can elaborate more on some specific area. 😉

EDIT: A little quote from a response I made on Rust's internals forum:

And to be honest, if you have different TOCs you essentially have different books. There is little gain to support that, other than being able to group all the translations in one directory and build them in one go.

You can already group the multiple translations in one directory as different books, each with its own SUMMARY.md and book.json, and if you configure the source and destination directories correctly there should be minimal trouble integrating with automatic deployment scripts etc.

There is no guarantee that a chapter in one language has an equivalent in another language.

Regarding the Rust Book translation process, this is not a disadvantage of a particular solution, but simply a fact. I think that other projects that will use mdBook with multiple languages will have the same problem.

The program can not know what chapters are equivalent in the different languages and it would thus be impossible to change the language from a chapter to land on the same chapter in the other language.

Can we make it simple and assume that the files with the same name in different languages are the same chapter? Then we can give the opportunity to switch to another language. I think this approach will satisfy both cases:

  1. When there is complete consistency between all languages.
  2. When consistency between languages is not complete.

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English. I think we should not assume that the reader is familiar enough with the language of original to understand the chapter titles.

When consistency between languages is not complete.

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Also, I don't like the idea that when I read the book in Russian, I'll see TOC in English.

Of course that was not the plan, I just hadn't found a good solution for it yet so I didn't discuss it too much

How would you handle that? On some pages you can change the language and on others not? That would be really confusing for users I think.

Why not? We can clearly indicate that the translation for this chapter is not available yet. Another possible situation is that translation for some languages is available, but for other languages it's not.

Another example that I care about.

Let's compare the structure of the section "Getting started" in the nightly and stable books. As you can see, Steve joined 4 chapters into one. Imagine that not all the language versions supported this change yet. If we have common TOC, this means that there is no possibility to open "Installing Rust", "Hello World" and "Hello Cargo" chapters in non-English version of book, because they do not exist in the original TOC anymore.

Yes I totally agree with you! This would be a big problem. However I am not sure I want to settle with the solution Gitbook proposes either. Maybe we can come up with something better that combines all the advantages and none of the drawbacks? (even if it's a little more complex)

Gitbook uses the "one SUMMARY.md per language" method and to be honest I don't think it is real multilingual support. They essentially have one book per language, with no cross-linking between the different languages except on a landing page...

I think you could already achieve something very similar with mdBook with multiple books and configuring the source and output directories according to what you want. The only difference is that Gitbook makes it just a little bit easier to set up.

My suggestion is to have "one SUMMARY.md per language", but support page-to-page cross-linking between the different languages. The easiest way to do this is to consider that the files with the same name are the same chapters. In 99% this should work. A more complex way to do this is to add some kind of identifier to each file (something like UUID). If the identifiers of the files are identical, we can cross-link them.
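The same-file-name convention could be sketched roughly like this in Python. It is a hypothetical illustration only: `cross_links` and the `books` dictionary are made-up names, and the chapter lists stand in for what each language's own SUMMARY.md would contain.

```python
def cross_links(chapters_by_lang):
    """Map each chapter path to the sorted list of languages that have it.

    `chapters_by_lang` maps a language code to the chapter paths listed
    in that language's own SUMMARY.md.
    """
    links = {}
    for lang, paths in chapters_by_lang.items():
        for path in paths:
            links.setdefault(path, set()).add(lang)
    return {path: sorted(langs) for path, langs in links.items()}

# Hypothetical book where the French translation lacks the second chapter.
books = {
    "en": ["hello-world.md", "second-chapter.md"],
    "fr": ["hello-world.md"],
    "ru": ["hello-world.md", "second-chapter.md"],
}

links = cross_links(books)
for path, langs in links.items():
    print(path, langs)
```

A language menu on a given page would then offer only the languages listed for that path, so a missing translation simply means one fewer entry in the menu.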

Hmm yes that might be a good compromise. At least if the translations don't diverge too much from the original. I will try to think about this a little more and see if I can come up with other ideas.

Thanks for the valuable input! :)

FWIW, there are tools to handle translations which I didn't see mentioned here yet. For example, crowdin is used (or was when I was involved) over at freecad for document translation of their wiki. It was noteworthy that when an update was made to an English file, the plugin would notify you that the other translations need to be updated for that specific section or they would be out of date. The page linked above actually lists how complete each language translation is and maintains that information.

It is possible a tool like crowdin could just be added to the build process as a plugin which has been notified of which files require translating. Then it will maintain the database itself somewhere and you could tell mdbook where the translated files are located.

A solution like this seems worth the time exploring before spending effort creating a new ground up approach to solve the same problem.


EDIT: Also note they offer free support to open source projects

For your information, what about a single file for the source?

like

[es]
Esto es un ejemplo
[en]
This is an example
[fr]
Ceci est un exemple

[es]
Esto no
[fr]
Ce n'est pas

Well, just saying :) (I mean, for example, when making a book/tutorial with code examples it would be better to have only one copy of the source code, with the explanation in different languages.)

And sure, switching between languages could be possible, and if there is no paragraph, show the default language of the document.
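A rough Python sketch of how such a tagged single-file source could be read, with the fallback to the default language described above. The tag syntax (`[es]`, `[en]` on their own line, paragraphs separated by blank lines) is taken from the example; the function names and the two-letter-code assumption are hypothetical.

```python
import re

# Assumption: a language tag is a two-letter code in brackets on its own line.
TAG = re.compile(r"^\[([a-z]{2})\]$")

def parse_blocks(text):
    """Split the source into paragraphs; each paragraph is a dict {lang: text}."""
    blocks = []
    for chunk in text.strip().split("\n\n"):
        block, lang = {}, None
        for line in chunk.splitlines():
            match = TAG.match(line.strip())
            if match:
                lang = match.group(1)
                block[lang] = []
            elif lang is not None:
                block[lang].append(line)
        blocks.append({code: "\n".join(lines) for code, lines in block.items()})
    return blocks

def render(blocks, lang, default):
    """Pick `lang` for each paragraph, falling back to the `default` language."""
    return [block.get(lang, block.get(default, "")) for block in blocks]

source = """[es]
Esto es un ejemplo
[en]
This is an example

[es]
Esto no
"""

blocks = parse_blocks(source)
print(render(blocks, "en", default="es"))
```

Here the second paragraph has no English text, so the reader of the English version sees the Spanish paragraph instead.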

How about a src/SUMMARY.md specifying the default chapter structure expected for all languages that are up to date and forcing specialized src/*/SUMMARY.md for the languages that have not yet made similar changes? This puts the penalty on the translations, which have to keep a separate SUMMARY.md around for some time and do work to be up to date. The con is that the person updating the English original does a minor amount of work when, in essence, causing the translation to fork.

So the rule would be: src/*/SUMMARY.md has higher precedence than src/SUMMARY.md

├── book
└── src
    ├── SUMMARY.md
    ├── en
    │   ├── hello-world.md
    │   └── second-and-third-chapter-combined.md
    ├── fr
    │   ├── SUMMARY.md
    │   ├── hello-world.md
    │   ├── second-chapter.md
    │   └── third-chapter.md
    └── ru
        ├── hello-world.md
        └── second-and-third-chapter-combined.md

Consider e.g. the case you mentioned above where the original English book combined several chapters into one (or conversely split one into many). In this case the English original would need to update src/SUMMARY.md; at this point the English author copies the old src/SUMMARY.md into each translation not yet updated. Hopefully these src/*/SUMMARY.md only stay around for a short period of time until the translations are updated accordingly.

In the example above before the English original text combined its chapters, src/SUMMARY.md is copied into src/fr/SUMMARY.md and src/ru/SUMMARY.md, next the English original text combines src/en/second-chapter.md and src/en/third-chapter.md into src/en/second-and-third-chapter-combined.md and updates src/SUMMARY.md to refer to the new second-and-third-chapter-combined.md (which at this point only exists in en). Some time later perhaps src/ru/second-and-third-chapter-combined.md is created at which point src/ru/SUMMARY.md may be deleted. src/fr might not yet have been updated so its src/fr/SUMMARY.md stays around a bit longer. Once all languages are updated their specialized src/*/SUMMARY.md can all be deleted and all languages can again rely on the default src/SUMMARY.md.
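The precedence rule could look roughly like this in Python. It is only a sketch of the proposal: `summary_for` is a hypothetical helper, and the temporary directory just recreates a cut-down version of the tree above (fr has its own SUMMARY.md, ru does not).

```python
import tempfile
from pathlib import Path

def summary_for(src, lang):
    """Prefer src/<lang>/SUMMARY.md; fall back to the shared src/SUMMARY.md."""
    specialized = Path(src) / lang / "SUMMARY.md"
    return specialized if specialized.is_file() else Path(src) / "SUMMARY.md"

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    (src / "fr").mkdir(parents=True)
    (src / "ru").mkdir()
    (src / "SUMMARY.md").write_text("# Summary\n")
    (src / "fr" / "SUMMARY.md").write_text("# Summary\n")

    fr_summary = summary_for(src, "fr").relative_to(src)
    ru_summary = summary_for(src, "ru").relative_to(src)

print(fr_summary.as_posix())
print(ru_summary.as_posix())
```

Deleting src/fr/SUMMARY.md once the French translation has caught up would make `summary_for` return the shared file again without any other change.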

Do you think an approach like this is feasible and desirable?

I'm eager to do a translation of the Rust book, so I'd like for mdbook to resolve this bug and support translations, hence I'm trying to help you make progress. :)

Thank you for your input!

Do you think an approach like this is feasible and desirable?

Unfortunately, I don't think this will work well in practice because there is a lot of overhead for the author of the original text. Every time the original texts diverge, the burden is on the author to copy over the old summary to the translations before making a change. If he forgets, things will break; this seems very error-prone.

I am more in favour of having one summary per language, cross-link files with the same name. This approach is, in my opinion, simpler to understand and doesn't require any extra work when the original text and the translations diverge.

I hope to make progress on this issue in the "near" future, we are slowly reworking parts of the internals to make it possible.

I am more in favour of having one summary per language, cross-link files with the same name.

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language? I do agree about this design being less work for the original author of course. :)

If there is one SUMMARY.md per language, what forces the files containing chapters to be named the same way in every language?

Nothing, it would be a convention. A translation would keep the same file structure and just modify the content of the files. If the translations diverge, you lose cross-linking but everything still works.

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

If the translations diverge, you lose cross-linking but everything still works.

That's a good point. Maybe mdBook can warn if this is the case?
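Such a warning could be as simple as comparing the chapter paths of two summaries. A small hypothetical Python sketch (the function name and message wording are invented for illustration, and the file names are borrowed from the earlier example tree):

```python
def divergence_warnings(reference, translation, lang):
    """Compare chapter paths from two SUMMARY.md files and describe the differences."""
    ref, tr = set(reference), set(translation)
    warnings = [f"{lang}: missing {path}" for path in sorted(ref - tr)]
    warnings += [f"{lang}: no counterpart for {path}" for path in sorted(tr - ref)]
    return warnings

english = ["hello-world.md", "second-and-third-chapter-combined.md"]
french = ["hello-world.md", "second-chapter.md", "third-chapter.md"]

for warning in divergence_warnings(english, french, "fr"):
    print(warning)
```

An empty list would mean that every chapter can be cross-linked in both directions.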

I am open to alternative ideas, but I think we should go with something that has minimal friction. :)

Yes, absolutely. I was worried there was no progress because of a lack of design discussion, hence my suggestion to try to help you decide. I don't know the mdBook code base (or Rust) yet. :)

Hi!

I'm probably going to reiterate on some already discussed topics but I'd still like to describe this case hoping it's useful to define the best mechanism for book translations in mdbook.

So I've been trying to define a process we could recommend for a localisation team to tackle tasks such as The Rust Programming Language book translation.

One of the things is how to integrate translated contents with the build output. For this specific case, and after having asked the docs team for feedback, it should be easier to handle all of the book contents independently in its own directory, including SUMMARY.md. This would allow the book translators to work in a completely independent way by forking the book repository and probably integrating it back as git submodules in the original repo. There would not be any kind of enforcement on the document's internal structure nor on the phrase-level content of translations.

Another thing is how to link translated content in the output. It could be linked on a per-document basis by mapping translations using the exact file name, in which case we'd have folder structure enforcement, or it could be linked only on the front page, in which case translations would have complete freedom over the folder structure, and even the Tree/Table of Contents. In the latter case, the contents tree guidelines could be defined by maintainers but not enforced at all by the tooling.

These two features or mechanisms, however, might not work for people wanting to use tools such as crowdin, transifex or weblate to manage their translations, which are probably more adequate for software translation than for book translations. To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team. So of course I'm completely open to any kind of feedback and collaboration so we can lay out a plan to implement this.

Regards,

Hi @sebasmagri

Thank you for the input! I would love to work together with the concerned parties to end up with a strong design that is both useful for simple and more complex requirements.

Currently, the design we are considering is the following:

To make a book multi-lingual, you would have to add some information to the configuration file:

[languages]
en = { name = "English", default = true }
fr = { name = "Français" }
# OR alternatively
# [languages.en]
# name = "English"
# default = true
#
# [languages.fr]
# name = "Français"

For the example above, we would expect to have sub-folders in the src directory, matching the keycodes en and fr used in the config, containing the source files for each language.

We could imagine having an optional source = "path" key in the language tables for more flexibility. This would then allow the submodule scenario you described.

We also think it is better to have a SUMMARY.md file for each translation. This allows translations to diverge without breaking the build.

For the HTML output, we consider cross-linking chapters from different languages based on the file structure. An English chapter called src/en/chapter_2/lifetimes-in-a-nutshell.md would be mapped to all the same chapters in different languages src/*/chapter_2/lifetimes-in-a-nutshell.md. This has the advantage of being simple and degrading gracefully when translations diverge. So if authors want cross-language linking they would have to keep the same structure, but if they don't or the structure diverges, the books will still build fine with non-matching chapters pointing to the index when changing languages.
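To illustrate the proposed behaviour, here is a small Python sketch. Everything in it is an assumption layered on the description above: the `LANGUAGES` dictionary mirrors the proposed `[languages]` table, `source` is the optional key mentioned as a possibility, and `index.md` is just a placeholder name for "the index".

```python
# Hypothetical in-memory form of the proposed [languages] table.
LANGUAGES = {
    "en": {"name": "English", "default": True},
    "fr": {"name": "Français"},
}

def default_language(languages):
    """Return the code of the language marked `default = true`."""
    return next(code for code, table in languages.items() if table.get("default"))

def source_dir(code, languages, src="src"):
    """Use the optional `source` key if present, otherwise src/<code>."""
    return languages[code].get("source", f"{src}/{code}")

def switch_target(chapter, target_lang, chapters_by_lang, index="index.md"):
    """Same relative path in the target language, or that language's index."""
    if chapter in chapters_by_lang.get(target_lang, ()):
        return f"{target_lang}/{chapter}"
    return f"{target_lang}/{index}"

chapters = {
    "en": ["chapter_2/lifetimes-in-a-nutshell.md", "chapter_3/new.md"],
    "fr": ["chapter_2/lifetimes-in-a-nutshell.md"],
}

print(default_language(LANGUAGES))
print(source_dir("fr", LANGUAGES))
print(switch_target("chapter_2/lifetimes-in-a-nutshell.md", "fr", chapters))
print(switch_target("chapter_3/new.md", "fr", chapters))
```

The last call shows the graceful degradation: the English-only chapter has no French counterpart, so the language switch lands on the French index instead of a broken link.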

To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

This seems very complex? I am not very familiar with this issue but it seems to me that it would either require a lot of manual annotations for correct paragraph mapping or some heuristics. I would think this is (currently) out of scope for mdBook. Let's first focus on having basic but strong multi-lingual facilities and eventually expand from there. :)

Does that correspond to the requirements of the localisation team? If there is anything I missed or there are additional requirements that haven't been considered, please feel free to post 😉

I'm absolutely willing to dedicate some time to this feature since this could be one of the primary goals of the localisation team.

That would be wonderful, I am particularly interested in the perspective of the Rust project on this issue because I think they will be the ones using this feature the most.

Just to resurface what @mattico said at #687

This should be fairly straightforward:

  1. Add a config option to set the default language.
  2. Determine & document the folder structure used for the translations.
  3. Change index generation to ignore translations.
  4. Set the lang template parameter in hbs_renderer based on the page path + default config.
  5. Add a menu to the html template + stylus.
  6. In hbs_renderer, look for different versions of the current page and add them to the template parameter.
  7. Set the language used to generate the search index.
  8. Add a cargo feature to disable this functionality, since rustc can't have the search language support due to licensing issues.
    Edit:
    Links might also need to be adjusted so they point at the page for the current language. This might not be necessary if the correct relative links are used, I'd have to check.
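Step 4 in the list above (deriving the language from the page path plus the default from the config) might look something like the following. This is a Python illustration of the idea only, not the actual hbs_renderer code; the function name and the assumption that the language code is the first path component are hypothetical.

```python
def lang_for_page(page_path, language_codes, default_lang):
    """Return the language implied by the first path component, else the default."""
    first = page_path.split("/", 1)[0]
    return first if first in language_codes else default_lang

codes = {"en", "fr", "ru"}
print(lang_for_page("fr/chapter_2/lifetimes-in-a-nutshell.md", codes, "en"))
print(lang_for_page("index.md", codes, "en"))
```

A page outside any language folder falls back to the configured default language.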

Wouldn't it be much easier to simply abuse the branches for translations and a small bot to throw a "needs review" for each language if the English version got updated?
This isn't meant to be a permanent solution, more as a stopgap for now since this issue blocks all enthusiasm for any pending translations.

IMO now incoming:

  • Each translation is its own book and needs to be handled as one (it is a book, not documentation)
  • Cross-linking is unnecessary, a good landing page to select your preferred language is much more appropriate
  • A literal translation is the worst kind you could wish for. So give the native speakers some leeway.
    A prompt update may sound like a good idea but, depending on the translation, could be counterproductive
  • Also, yes... I'm slightly annoyed that this issue isn't resolved yet, for two reasons:
    First, I wouldn't mind helping the German translation, and second I need the German version to bring my "rusty" coworkers up to speed with this language since it will be the one used going forward.

How is the progress on this feature? Is it usable now?

Bump.

+1

Is anyone working on this? I know there was a PR a long time ago that looked great, but it was closed.

More information on the state of this feature would be welcome. :)

Any traction on this feature?

If anyone would like to try a version of mdbook with localisation support you can try my fork in #1201

I tried addressing this issue in #1306. I would appreciate feedback.

I'm still wondering if you should be able to build the book with all languages bundled in at once, and have a drop-down for switching the language of the current page.

FWIW, there are tools to handle translations which I didn't see mentioned here yet. For example, crowdin is used (or was when I was involved) over at freecad for document translation of their wiki. It was noteworthy that when an update was made to an English file, the plugin would notify you that the other translations need to be updated for that specific section or they would be out of date. The page linked above actually lists how complete each language translation is and maintains that information.

Thanks for bringing this up, @mdinger. This is a hugely important point and something @sebasmagri also touched upon:

To support this case mdbook might need to generate a paragraph level mapping of translations and probably support output to any standard internationalization format such as gettext's PO files or L20N.

[...] I would think this is (currently) out of scope for mdBook. Let's first focus on having basic but strong multi-lingual facilities and eventually expand from there. :)

Hey @azerupi, I've worked with translations in the past as part of the Mercurial project. We used the standard Gettext format for this: we had the command line tool itself translated, including the help texts. I used very similar infrastructure to translate a Mercurial guide I wrote.

My experience is that it's the tooling for the translators that is important. That is, it's not helpful to simply create multiple independent Markdown files in separate directories. The result of that is that the translations drift apart: an update to the original text may or may not make it into the translations and there is no system in place to track this.

Translating the software is easy enough: you integrate with Gettext and use its tools to extract the strings to .po files. You let translators translate these; there is tooling for this, such as https://poedit.net/ and various online tools. These PO editors can show the translators what changed since they touched the files last, which is invaluable when updating translations.

What we did to translate bigger pieces of text was to simply split it into paragraphs. So the fr.po file with the French translation found here looks like this:

#: src/index.txt:12
msgid ""
"====================\n"
"Mercurial Kick Start\n"
"===================="
msgstr ""
"====================\n"
"Tutoriels Mercurial\n"
"===================="

#: src/index.txt:19
msgid ""
"Welcome to the `aragost Trifork`_ Mercurial Kick Start. We have\n"
"prepared several different sets of exercises for you:"
msgstr ""
"Bienvenue dans les tutoriels Mercurial de `aragost Trifork`_. Nous avons "
"préparé des exercices pour vous :"

#: src/index.txt:22
msgid ""
"`Basic Mercurial`__:\n"
"  Install Mercurial and get started right away. We will show you the\n"
"  basic commands and show you how to work with others as a team."
msgstr ""
"`Premiers pas avec Mercurial`__:\n"
" Installer et faire ses premiers pas avec Mercurial. Nous vous montrerons "
"les premières commandes et comment travailler avec les autres membres d'une "
"équipe."

The source text for this was formatted in reStructuredText (a Markdown-like format popular with Python projects back then):

====================
Mercurial Kick Start
====================

.. image:: mercurial.png
   :align: right

Welcome to the `aragost Trifork`_ Mercurial Kick Start. We have
prepared several different sets of exercises for you:

`Basic Mercurial`__:
  Install Mercurial and get started right away. We will show you the
  basic commands and show you how to work with others as a team.

This would be the equivalent of the source Markdown file in mdBook.

This seems very complex? I am not very familiar with this issue but it seems to me that it would either require a lot of manual annotations for correct paragraph mapping or some heuristics.

As you can see, there are no markers in the source text: they stay as they are. The Gettext tooling is used to extract the text, and I wrote some extra Python scripts to split the files into individual paragraphs.

The mapping is done based on the content: if the source text is updated, the translation gradually becomes out of date. This is why I split the text into paragraphs: when I rephrased something in my guide, that individual paragraph would turn into English again until someone contributes an updated French translation. The same approach is typically used for software: if a menu item hasn't been translated, you show the original text instead.
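The content-based mapping described above can be sketched roughly as follows. This is only a minimal illustration, not the actual scripts: the function names `split_paragraphs` and `translate`, and the plain dict standing in for a Gettext catalog, are assumptions made for the example.

```python
def split_paragraphs(text):
    """Split a document into paragraphs separated by blank lines."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            paragraphs.append("\n".join(current))
            current = []
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


def translate(text, catalog):
    """Replace each paragraph with its translation.

    A paragraph with no (or an empty) catalog entry is kept in the
    original language, which is what happens after it is rephrased.
    """
    translated = [catalog.get(p) or p for p in split_paragraphs(text)]
    return "\n\n".join(translated)


# Hypothetical catalog: keys are the exact source paragraphs.
catalog = {"Welcome to the guide.": "Bienvenue dans le guide."}
source = "Welcome to the guide.\n\nThis paragraph was rephrased recently."
print(translate(source, catalog))
```

Because the lookup key is the paragraph text itself, any edit to a source paragraph makes the lookup miss, and only that paragraph falls back to the original language.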

In short: yes, this kind of infrastructure comes with a startup cost — and it's a ton of work for the translators to keep a translation up to date. However, with this approach they actually have a fighting chance: the tooling will flag outdated paragraphs and the editors can do fuzzy matching to find the previous translation. Also, note how this infrastructure is trivial for the authors of the source text: they just write it like normal.
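To give a feel for the fuzzy matching mentioned above, here is a small sketch using Python's standard `difflib` module. It is not how any particular PO editor works; the function name `suggest_previous`, the dict-based catalog, and the 0.6 cutoff are assumptions chosen for illustration.

```python
import difflib


def suggest_previous(new_paragraph, old_catalog, cutoff=0.6):
    """Find the closest old source paragraph and its translation.

    Returns a (old_source, old_translation) pair, or None when no old
    paragraph is similar enough to be a useful starting point.
    """
    matches = difflib.get_close_matches(
        new_paragraph, list(old_catalog), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], old_catalog[matches[0]]


# Hypothetical catalog from before the source text was edited.
old_catalog = {
    "Install Mercurial and get started right away.":
        "Installez Mercurial et commencez tout de suite.",
}
print(suggest_previous("Install Mercurial and get started immediately.", old_catalog))
```

A translator would then adjust the suggested old translation instead of starting from scratch.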

I would recommend implementing such a system for mdBook instead of letting people write what is essentially multiple books split across different directories.

Just a small update: I've started working on this and have written two small tools:

  • extract: takes a Markdown file and generates a .po file with a message for each paragraph. The .po file is the human-readable file shown above, which is the basis for the translation.
  • reconstruct: takes a Markdown file and a .mo file and will output a translated Markdown file. The .mo file is the compiled message catalog which is used to look up translations.

With those two tools, you can translate the Markdown files which make up your book.
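As a rough idea of what the extract step produces, the sketch below writes one PO entry per paragraph in the same layout as the fr.po excerpt earlier in the thread. It is not the actual tool: `po_escape`, `po_entry`, and this `extract` function are hypothetical names, and real Markdown handling would need a proper parser.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def po_entry(paragraph, path, line_number):
    """Format one paragraph as a PO entry with an empty translation."""
    lines = paragraph.split("\n")
    out = [f"#: {path}:{line_number}", 'msgid ""']
    for i, line in enumerate(lines):
        # Every line except the last keeps its trailing newline.
        suffix = "\n" if i < len(lines) - 1 else ""
        out.append(f'"{po_escape(line + suffix)}"')
    out.append('msgstr ""')
    return "\n".join(out)


def extract(text, path):
    """Return PO text with one entry per blank-line separated paragraph."""
    entries = []
    start, current = None, []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if not current:
                start = number  # remember where the paragraph begins
            current.append(line)
        elif current:
            entries.append(po_entry("\n".join(current), path, start))
            current = []
    if current:
        entries.append(po_entry("\n".join(current), path, start))
    return "\n\n".join(entries) + "\n"


print(extract("# Title\n\nHello\nworld\n", "src/index.md"))
```

The `#:` reference comment records where each paragraph came from, which is what lets PO editors show translators the surrounding context.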

While writing this, I realized that the translations can be done "outside" of mdBook: the input is a source Markdown file and the output is a translated Markdown file. All that's left to do is to glue the translation together with something like a language-picker on each page. @Ruin0x11, this means that translating with Gettext and these scripts is fully compatible with the approach in #1306!

I'll put up the scripts when I've tested them a bit more.

@mgeisler This sounds very interesting. Mainly because it converts the process of maintaining a translation into the traditional approach. :) So then what would be checked into the git repo? The original English language Markdown files and the translated .po files? I'm guessing that whenever the original book is updated, the corresponding .po files either have existing translation strings marked fuzzy or have entirely new strings added, and then translators would find those half-finished .po files and send pull requests to the golden git repo. Whom do you envision to be running extract and reconstruct? In other projects where I am a participating translator, tools like that are normally run by the maintainer, so that's what I'd assume.

Mainly because it converts the process of maintaining a translation into the traditional approach. :)

Yes, precisely! Glad you like it.

So then what would be checked into the git repo? The original English language Markdown files and the translated .po files?

Exactly, those would be the source files, the rest are derived and can be left out. The flow would be something like

$ extract src/*.md --output-file messages.pot      # extract all strings from the source Markdown files
$ msgmerge --update po/xx.po messages.pot          # merge the XX translation with new files
$ msgfmt po/xx.po --output-file xx.mo              # convert xx.po into a xx.mo file
$ reconstruct src/*.md xx.mo --output-dir src/xx/  # write translated Markdown files to src/xx/

Honestly, we probably don't need to compile the .po files into a .mo file — the whole workflow is offline so there's no real benefit in a precompiled catalog.

The translated Markdown files are derived from the .po and the source Markdown files. As such, they don't have to be checked in. However, as the source Markdown files change, the translations immediately become outdated. Checking in the translated Markdown files every time the translation is complete would allow you to still deploy all languages for a book and know that each language is the last complete version.

I'm guessing that whenever the original book is updated, the corresponding .po files either have existing translation strings marked fuzzy or have entirely new strings added, and then translators would find those half-finished .po files and send pull requests to the golden git repo. Whom do you envision to be running extract and reconstruct? In other projects where I am a participating translator, tools like that are normally run by the maintainer, so that's what I'd assume.

I think it could be either the maintainer (if you want to commit a messages.pot file to the repository) or the translators themselves. In the Mercurial project we had a small Makefile which made it easy to do make update-po xx to update the xx.po file with the latest strings. I would guess similar helpers could be built for mdBook (either as a script or as a new mdbook command).

Honestly, we probably don't need to compile the .po files into a .mo file — the whole workflow is offline so there's no real benefit in a precompiled catalog.

I agree, since even software that I contribute translations for does not appear to deal with .mo files until the software is built and installed on a system.

However, as the source Markdown files change, the translations immediately become outdated. Checking in the translated Markdown files every time the translation is complete would allow you to still deploy all languages for a book and know that each language is the last complete version.

Right, so then you'd have a few paragraphs in the translated language, then a section in English, and then trailing sections in the translated language? Until the translators do their magic, that is. Sounds like a good approach, and yeah, if the intent is to serve the Markdown files directly from the git repo onto a web server, perhaps it would be beneficial to actually check in the translated Markdown.

I think it could be either the maintainer (if you want to commit a messages.pot file to the repository) or the translators themselves. In the Mercurial project we had a small Makefile which made it easy to do make update-po xx to update the xx.po file with the latest strings. I would guess similar helpers could be built for mdBook (either as a script or as a new mdbook command).

If the maintainer updates the .po files whenever strings are changed/added then they'd know that all translations are up-to-date, albeit maybe not perfectly translated. And then the maintainers just poke the translators and wait.

This all makes sense to me, but I'm not affiliated with the project. Have you heard anything from them? Are they as eager to get extract and reconstruct as I am? If you direct me to your scripts I can take them for a spin and see if they work for me (too). :)

This all makes sense to me, but I'm not affiliated with the project. Have you heard anything from them? Are they as eager to get extract and reconstruct as I am?

Not sure 😄 I simply implemented the infrastructure which I expect to need myself in a few months. I hope it'll be useful for the project maintainers too since it can help solve a very long-standing issue for the community.

If you direct me to your scripts I can take them for a spin and see if they work for me (too). :)

I've put up a PR which adds new commands to mdbook to drive the translation process: #1864. Please give it a spin and send me feedback there! I might be slow to respond since I'm traveling to the US next week for RustConf, but I'll get to it eventually.

How close is this to shipping? I am currently assessing which docs generator I can use for an OSS project and really like mdbook, but I will need translations available. I know this is a volunteer effort (and thank you), but this issue has been open for seven years now, so I'm hoping it's close, right? :)

Please, please make it happen 👏🏻

@lukehinds and @aellwein, could you please try the code I put up in #1864 ? Please give me feedback on the PR, that way I can learn if that approach makes sense to more people than just me :-)

Hi, are there any updates? Thanks

Hi @fzyzcjy, you could try out the code I put up in #1864 and let me know how it works for you.

To @ehuss or anyone else who maintains mdBook, I would like to know what I can do to help add support for this feature. There is some code in #1864 that adds support for translator tooling to my original PR in #1306. However, the last time I rebased my original code I didn't seem to get much feedback. I would like to make sure that any work I contribute this time won't be in vain. Is there a roadmap for having some iteration of these features looked at in the near future?

Ideally I think it would be good to write out a proposal for what you would like to add. It would help to write some background information about the problems you are trying to solve. That is, why is additional tooling needed beyond just having multiple books in separate directories? Why can't translation tooling be managed as separate tools? What is the impact on the experience for readers? What is the expected translator experience like? Does it need to integrate with other translator tools? Have you discussed the problems and solutions with other groups, like the Rust teams working on localization now? What are some alternate solutions? The more you can provide in clear terms, the easier it will be to understand why a particular approach is being taken.

#1306 looks to be a massive PR. Once we agree on a particular proposal, I think it would be good to try to break out the approach into smaller pieces if possible.

Hello, everyone.
I've found a tool, mdbook-i18n.

It provides a content structure like the one below, which is not bad. You can try it with the example in the project.

One thing to keep in mind: when you want to browse localhost:3000/chapter_1, you should include the language identifier in the URL, such as localhost:3000/en/chapter_1 or localhost:3000/zh/chapter_1.

image

For now, the project has some problems:

  1. It lacks a language switcher. If you want to switch the language, you have to modify the URL.
  2. The project hasn't been updated for a long time.

Hope this helps.

Hi guys. I've found a solution to problem #1.

  1. It lacks a language switcher. If you want to switch the language, you have to modify the URL.
    Just add a theme. The language switcher is next to the print button.

I've also published multi-language docs on my GitHub Pages site.
Feel free to take a look.
https://chengyuejia.github.io/move/en/introduction.html

@ChengYueJia It looks great! Although some styles need adjustments.

image

@huangjj27 Yeah, there is a problem with the style. I've made a PR to the author.

The problem has been fixed by the author.
You can try the author's example: https://funkill.github.io/mdbook-i18n/en/

Hi all,

Based on the ideas in #5 (comment), I've put up google/comprehensive-rust#130 to add support for multi-lingual books. I wrote this for a Google Rust course, but it can be used with any book.

What this does:

  • Provides an mdbook renderer (output format) and a preprocessor. The first gathers strings to translate (mdbook-xgettext) and the second does the actual translation (mdbook-gettext).
  • Gives you a structure for the translations. @ehuss, you asked above why you cannot simply have different books in different directories: the answer is that it becomes a nightmare to maintain consistency between the books. You cannot merge in changes made to the original text (you'll get merge conflicts on every step). Using Gettext, you get a well-supported standard flow for this. There are tons of tools for editing Gettext .po files and there are many online platforms as well.

The code I put up does not yet give you a language selector — but I intend on building this next. This will be a small theme change, so it'll be non-invasive and won't require any mdbook changes.

The workflow I propose actually uses separate books, but only conceptually during the publishing step. I will publish them using a script like this:

export MDBOOK_PREPROCESSOR__GETTEXT__RENDERERS='["html"]'
export MDBOOK_PREPROCESSOR__GETTEXT__BEFORE='["svgbob"]'

for po_file in po/*.po; do
  lang=$(basename "${po_file%.po}")  # e.g. po/fr.po -> fr
  export MDBOOK_BOOK__LANGUAGE="$lang"
  export MDBOOK_PREPROCESSOR__GETTEXT__PO_FILE="$po_file"
  mdbook build -d "book/$lang"
done

This means that

  1. Searches are independent for each language (which I think you'll want)
  2. Pages have the correct lang="xx" HTML attribute.

I think this gives you the best of both worlds: simple publishing flow, no invasive changes to mdbook, full control over when to publish which languages (you might want to only publish a language if it's more than 90% translated).
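The "only publish a language if it's more than 90% translated" idea could be checked with something like the sketch below. It assumes the .po files have already been loaded into dicts mapping each msgid to its msgstr (the loading step is left out), and the names `translated_fraction` and `languages_to_publish` are hypothetical.

```python
def translated_fraction(catalog):
    """Fraction of messages that have a non-empty translation."""
    if not catalog:
        return 0.0
    done = sum(1 for msgstr in catalog.values() if msgstr.strip())
    return done / len(catalog)


def languages_to_publish(catalogs, threshold=0.9):
    """Languages whose catalog is more than `threshold` translated."""
    return sorted(
        lang
        for lang, catalog in catalogs.items()
        if translated_fraction(catalog) > threshold
    )


# Hypothetical catalogs: msgid -> msgstr, empty string means untranslated.
catalogs = {
    "fr": {"Hello": "Bonjour", "Goodbye": "Au revoir"},
    "de": {"Hello": "Hallo", "Goodbye": ""},
}
print(languages_to_publish(catalogs))
```

The resulting list could then drive which .po files the publishing loop builds.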

@Ruin0x11, @ChengYueJia, @huangjj27, and others, I would be interested in hearing your feedback on this approach. Do you think it would work for your use cases as well?

Hi all, I've published the plugins for a Gettext i18n translation workflow as a separate crate! You can install it with the usual

cargo install mdbook-i18n-helpers

Please see https://crates.io/crates/mdbook-i18n-helpers and let me know what you think in https://github.com/google/mdbook-i18n-helpers.

We've been using this infrastructure for 4 months now in the Comprehensive Rust 🦀 project. People have translated the course into Korean and Brazilian Portuguese and we have a few more languages in the pipeline.

What I like about this approach is that it's a very classic approach — Gettext is more than 30 years old now and there are a lot of tools out there which can help translators wrangle the .po files it uses.

This is indeed a good idea, but unfortunately, every time you switch languages some styles need to be reloaded, and maybe mdbook needs to make some changes to provide a better experience.

This is indeed a good idea, but unfortunately, every time you switch languages some styles need to be reloaded,

Are you talking about how the different languages are completely independent books (with their own assets such as stylesheets, images, etc)? I agree that it's a bit unfortunate.

and maybe mdbook needs to make some changes to provide a better experience.

Yes, it could certainly be made easier! One pain point right now is that I need to copy the index.hbs file to be able to add a language picker to it.

Hi again, just wanted to let people here know that I've released version 0.2 of mdbook-i18n-helpers. This version changes how the text is extracted: paragraphs are now unwrapped, headings are stripped of #, and tables are translated on a cell-by-cell basis.
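To illustrate what these extraction changes mean for the messages translators see, here is a rough sketch. It is not the code used by mdbook-i18n-helpers; `normalize_message` and `table_cells` are hypothetical helpers based on simple string handling rather than a real Markdown parser.

```python
import re


def normalize_message(text):
    """Unwrap a paragraph onto one line and strip a leading heading marker."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    unwrapped = " ".join(lines)
    return re.sub(r"^#{1,6}\s+", "", unwrapped)


def table_cells(row):
    """Split one Markdown table row into its cell texts.

    Escaped pipes inside cells are not handled in this sketch.
    """
    return [cell.strip() for cell in row.strip().strip("|").split("|")]


print(normalize_message("Welcome to the guide. We have\nprepared several exercises."))
print(normalize_message("## Getting Started"))
print(table_cells("| Command | Description |"))
```

With messages in this form, re-wrapping a paragraph or changing a heading level in the source would no longer invalidate the existing translation.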

A normalization tool is included to help you convert old translation files to the new format — we have ~18 translations now for Comprehensive Rust, so it's important for us to have a migration path for those files.

I would be very interested in feedback if you try it out! Thanks 🙂

As I haven't really made a lot of progress on this front besides setting up the template, I guess it makes more sense to reinitialize all the .po/.pot files with the new version. I'll do that when I can spend time on it. 👍🏽 Thanks for keeping us up-to-date! <3

I know you might want to use Rust here, but at KDE we have written and used a Python program to do i18n with gettext for our Hugo websites for some years. Recently I separated the Markdown handling from the Hugo-specific parts, so if you want to do i18n and l10n for individual Markdown files, markdown-gettext might be helpful for you. It is compliant with CommonMark, and has support for all core Markdown elements, as well as YAML front matter, tables, and definition lists. Support here means that only text is processed (internationalized/localized); all formatting characters (at block level) are ignored during i18n, but the file structure will be the same after l10n.

I understand this package might not be 100% fit with mdBook; however, writing an extension for the lib behind it is not difficult. I hope by using the package, you won't have to recreate the processing of common Markdown elements, and can focus on the differences.

Hi @PhuNH,

It is compliant with CommonMark, and has support for all core Markdown elements, as well as YAML front matter, tables, and definition lists. Support here means that only text is processed (internationalized/localized); all formatting characters (at block level) are ignored during i18n, but the file structure will be the same after l10n.

That sounds nice, and it sounds similar to the processing done by mdbook-i18n-helpers. An mdbook preprocessor can be written in any language: the manual has a Python example. It's probably very easy to create a wrapper around your library.

I recently found another tool for translating Markdown: https://github.com/mondeja/mdpo, also written in Python. There is also https://po4a.org/index.php.en, which handles even more formats.