Roam-migration

CAUTION: This is incomplete and probably broken. I am in the process of converting it from text processing on a directory of markdown files to reading the Roam Research database from an EDN export. Once it works, I’ll add more instructions here.

A tool to convert a database exported from Roam Research into a directory full of Org-roam files.

Usage

Start by exporting your Roam database.
1. Click the ellipsis (…) button at the top of your Roam window.
2. Select “Export All”.
3. In the dialog that pops up, select “EDN” as the export format.
4. Save the zip file that Roam produces, then unzip it.
Note that the unzipped file will not have the same name as the zip file that Roam gave you. The zip file will be named something like “Roam-Export-xxxx.zip”. The database file will be named for your database, and it will end with .edn.
Pick a directory for the .org files to be created in. It doesn’t have to be empty… but any files in there could be overwritten if you have a Roam page with the same name.
Run convert.sh -s _your-database.edn_ -d _destination-directory_
Check errors.txt for the db-id and page title of any pages that didn’t export correctly.
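Because the archive name and the database file name differ, it can help to locate the .edn file by extension rather than by name. The following is a minimal Python sketch of that step, not part of this tool; the function name `extract_edn` and its arguments are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_edn(zip_path, out_dir):
    """Unzip a Roam export and return the paths of the .edn files inside it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        # The archive name (Roam-Export-xxxx.zip) says nothing about the
        # database file name, so search by extension instead.
        names = [n for n in archive.namelist() if n.lower().endswith(".edn")]
        for name in names:
            archive.extract(name, out_dir)
    return [out_dir / name for name in names]
```

The returned path is what you would pass to convert.sh with the -s option.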

Errors are likely to be reported due to file name collision. There are two common cases for this:

Two Roam pages that differ only by capitalization.
Two Roam pages where one title has a hyphen and the other does not.
Either of these can easily occur as the result of inconsistent hashtags across multiple pages. (For example, if you’ve ever imported a Zotero library into Roam…)
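If you want to find likely collisions before running the conversion, a pre-check along these lines may help. This is a Python sketch under a loud assumption: the tool's real file-naming rule is not documented here, so `collision_key` is a hypothetical stand-in that lowercases the title and drops hyphens and whitespace. It covers the two cases above but may also flag pairs the tool would handle without trouble.

```python
from collections import defaultdict


def collision_key(title):
    """Hypothetical key: lowercase, with hyphens and whitespace removed."""
    return "".join(ch for ch in title.lower() if ch != "-" and not ch.isspace())


def find_collisions(titles):
    """Group page titles whose keys coincide; return only groups of two or more."""
    groups = defaultdict(list)
    for title in titles:
        groups[collision_key(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]
```

For example, a list containing "Zotero", "zotero", "open-source" and "open source" would yield two groups, one per case described above.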

Features

Works from a Roam database (EDN export), not markdown files. This allows more precision.
Handles daily notes pages, converting them to org-roam-dailies.
Converts internal links to org-roam links.
Mostly accurate conversion of markdown to org-mode syntax
Reports successful conversions in success.txt and errors in errors.txt

Markdown conversion

Works OK

Plain text
Roam style inline markup for:
- Bold (double earmuffs)
- Italic (double underscore)
- Highlight (double caret)
- Strikethrough (double tilde)
TODO and DONE macros are converted to org-mode task statuses
Backticks are recognized and converted to monospace in org
Source blocks (triple-backtick) with optional language are converted to begin_src blocks
Image links with optional alt text are converted to image links with CAPTION attributes
Internal links are converted to use org-roam “id:” targets
External links are converted from markdown format to org-mode format
LaTeX is passed through unconverted, so it works “by accident” when you later export from org-mode
View as Numbered List
View as Document
Embedded videos become hyperlinks
:hiccup [:hr] becomes an org-mode horizontal line
Roam bullets are converted to org headers if they have children or if the headline is short (fewer than 60 characters)
- This is a heuristic that gives me nice output of daily notes pages, especially when there are empty “heading” sections.
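The header heuristic above can be sketched as follows. This is an illustrative Python version, not the code in this repository: the `Block` type, the function names, and the choice to emit non-header bullets as plain paragraphs are all assumptions. Only the rule itself (has children, or shorter than 60 characters) comes from the description above.

```python
from dataclasses import dataclass, field

HEADLINE_LIMIT = 60  # "short" threshold taken from the description above


@dataclass
class Block:
    """Hypothetical stand-in for a Roam bullet and its nested bullets."""
    text: str
    children: list = field(default_factory=list)


def is_heading(block):
    """A bullet becomes an org header if it has children or its text is short."""
    return bool(block.children) or len(block.text) < HEADLINE_LIMIT


def render(block, depth=1):
    """Return org-mode lines for a block and its descendants."""
    lines = []
    if is_heading(block):
        lines.append("*" * depth + " " + block.text)
    else:
        # Assumption: long, childless bullets become plain paragraph text.
        lines.append(block.text)
    for child in block.children:
        lines.extend(render(child, depth + 1))
    return lines
```

A short bullet with no children still becomes a header, which is what keeps empty “heading” sections on daily notes pages looking like headings.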

Partially working, needs attention

Tables.
- Formula macros are not converted.

Does not work, but could

Block refs and embedded blocks
Page refs
Mermaid diagrams (could be exported as source blocks with suitable language)

Unlikely to ever work

Most of these have no org-mode equivalent:

Roam header formatting (h1, h2, etc.)
Spaced repetition
Pomodoro timer
Calculator macros
Word count macro
Iframe
Encrypted text
Kanban board
Mentions
Query macros
Attribute tables
Embedded PDFs
Nested tables
General hiccup formatting
General Roam diagrams (i.e., the `{{diagram}}` nodes)
Roam templates

A note on Markdown parsing

If you look into src/roam.clj you will probably be surprised, then worried. You will wonder “Why didn’t you just use Instaparse?” or “Don’t you know regular expressions?”

The problem is that Markdown cannot be parsed with either regexes or context-free grammars. It is made for humans to read, not computers. There’s no formal specification for markdown syntax and there’s no BNF grammar that works.

The parser implemented here is a cross between a state machine and a virtual machine. It maps the current state and next input to a list of register-manipulation instructions. When an input matches, the instructions are interpreted sequentially. This allows me to have a compact representation of some complex logic that includes “pseudo-backtracking” without using a pushdown parsing stack. (Although there is a stack of parser states to handle cases like “an inline code segment inside a bold text span”.)

The tough cases are things like incomplete hyperlinks: you think you’re parsing link text and an href, but the closing delimiter never arrives, and it turns out you were just accumulating a text span. For these cases, you’ll see the virtual machine accumulating text in two registers, one of which (usually z) is a fallback that gets used if the input turns out to be plain text. Instead of backtracking, the machine accumulates both options and decides at the end which one to use.
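To make the two-register idea concrete, here is a small sketch in Python. It is not the code in src/roam.clj (which is Clojure) and it only handles `[label](href)` links. The rule table, the instruction names, and the `label` and `href` registers are hypothetical; only the fallback register z is taken from the description above.

```python
# (state, input) -> list of instructions. "any" matches every character that
# has no explicit entry for the state. Register z always holds the raw text
# seen since the last emitted span, so it can serve as the plain-text fallback.
RULES = {
    ("text", "["): [("emit_z",), ("append", "z"), ("goto", "label")],
    ("text", "any"): [("append", "z")],
    ("label", "["): [("emit_z",), ("clear", "label"), ("append", "z")],
    ("label", "]"): [("append", "z"), ("goto", "after_label")],
    ("label", "any"): [("append", "label"), ("append", "z")],
    ("after_label", "("): [("append", "z"), ("goto", "href")],
    ("after_label", "["): [("clear", "label"), ("emit_z",), ("append", "z"),
                           ("goto", "label")],
    ("after_label", "any"): [("clear", "label"), ("append", "z"),
                             ("goto", "text")],
    ("href", ")"): [("emit_link",), ("goto", "text")],
    ("href", "any"): [("append", "href"), ("append", "z")],
}


def parse(source):
    """Split source into ("text", s) and ("link", label, href) spans."""
    regs = {"label": "", "href": "", "z": ""}
    spans = []
    state = "text"

    def emit_text(value):
        # Merge with a preceding text span so fallbacks read as one span.
        if not value:
            return
        if spans and spans[-1][0] == "text":
            spans[-1] = ("text", spans[-1][1] + value)
        else:
            spans.append(("text", value))

    for ch in source:
        instructions = RULES.get((state, ch)) or RULES[(state, "any")]
        for op, *args in instructions:
            if op == "append":
                regs[args[0]] += ch
            elif op == "clear":
                regs[args[0]] = ""
            elif op == "emit_z":
                emit_text(regs["z"])
                regs["z"] = ""
            elif op == "emit_link":
                spans.append(("link", regs["label"], regs["href"]))
                regs.update(label="", href="", z="")
            elif op == "goto":
                state = args[0]

    # End of input: whatever is left in z is plain text, including the raw
    # characters of a link that never closed.
    emit_text(regs["z"])
    return spans
```

With this sketch, `parse("see [docs](http://x) now")` produces a text span, a link span, and another text span, while the unterminated `parse("see [docs](http://x")` produces a single text span because only z is used at the end.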

This approach does not generalize to other languages, where you might have an unbounded amount of backtracking, but it works well enough when there are only two alternatives. It’s easy enough to support more alternatives if necessary… since each “register” is just a map key, I can always add more registers. The bookkeeping in the instruction lists would get increasingly hairy though.

Remaining tasks

[X] PDFs embedded in a page export as “pdfhttps://….” because the macro isn’t handled in org.clj
[X] Rewrite links for downloaded assets
- [X] Bug: link must be relative to the .org file. Currently it is generated relative to the execution directory.
[ ] Bug? Does Roam markdown inside table cells get rewritten?

Remaining features

[ ] Download Firebase (and maybe other locations) images & attachments to local folder
[X] Convert Roam tables to org-mode
[ ] Allow cli argument to specify partial processing (batches? entities?)

mtnygard / roam-migration