Roam-migration
CAUTION: This is incomplete and probably broken. I am in the process of converting it from text processing on a directory of markdown files to using the RoamResearch database from an EDN export. Once it works, I’ll add more instructions here.
A tool to convert a database exported from Roam Research into a directory full of Org-roam files.
Usage
- Start by exporting your Roam database.
- Click the ellipsis (…) button at the top of your Roam window.
- Select “Export All”
- In the dialog that pops up, select “EDN” as the export format.
- Save the zip file that Roam produces, then unzip it.
- Note that the unzipped file will be named differently than the zip file that Roam gave you. The zip file will be named something like “Roam-Export-xxxx.zip”. The database file will be named for your database, and it will end with .edn.
- Pick a directory for the `.org` files to be created in. It doesn’t have to be empty… but any files in there could be overwritten if you have a Roam page with the same name.
- Run `convert.sh -s _your-database.edn_ -d _destination-directory_`
- Check `errors.txt` for the db-id and page title of any pages that didn’t export correctly.
Errors are likely to be reported due to file name collision. There are two common cases for this:
- Two Roam pages that differ only by capitalization.
- Two Roam pages where one title has a hyphen and the other does not.
Either of these can easily occur as the result of inconsistent hashtags across multiple pages. (For example, if you’ve ever imported a Zotero library into Roam…)
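To spot likely collisions before running the conversion, something like the following sketch could help. It is written in Python for illustration and is not part of this tool. The normalization rule (fold case, drop hyphens) is an assumption about how titles collapse to the same file name, and `collision_key` and `find_collisions` are hypothetical names.

```python
from collections import defaultdict


def collision_key(title: str) -> str:
    """Assumed normalization: ignore capitalization and hyphens."""
    return title.casefold().replace("-", "")


def find_collisions(titles):
    """Return groups of titles that share the same normalized key."""
    groups = defaultdict(list)
    for title in titles:
        groups[collision_key(title)].append(title)
    # Only groups with more than one title are potential collisions.
    return [group for group in groups.values() if len(group) > 1]


titles = [
    "Machine Learning",
    "machine learning",
    "Zettelkasten",
    "self-hosting",
    "selfhosting",
]
collisions = find_collisions(titles)
for group in collisions:
    print("possible collision:", group)
```

Renaming or merging the pages in each reported group inside Roam before exporting should reduce the number of entries in `errors.txt`.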
Features
- Works from a Roam database (EDN export), not markdown files. This allows more precision.
- Handles daily notes pages, converting them to org-roam-dailies.
- Converts internal links to org-roam links.
- Mostly accurate conversion of markdown to org-mode syntax
- Reports successful conversions in `success.txt` and errors in `errors.txt`
Markdown conversion
Works OK
- Plain text
- Roam style inline markup for:
- Bold (double earmuffs)
- Italic (double underscore)
- Highlight (double caret)
- Strikethrough (double tilde)
- TODO and DONE macros are converted to org-mode task statuses
- Backticks are recognized and converted to `monospace` in org
- Source blocks (triple-backtick) with optional language are converted to `begin_src` blocks
- Image links with optional alt text are converted to image links with `CAPTION` attributes
- Internal links are converted to use org-roam “id:” targets
- External links are converted from markdown format to org-mode format
- LaTeX is not converted, so it works “by accident” when exporting from org-mode
- View as Numbered List
- View as Document
- Embedded videos become hyperlinks
- `:hiccup [:hr]` becomes an org-mode horizontal line
- Roam bullets are converted to org headers if they have children or the headline is short (less than 60 characters)
- This is a heuristic that gives me nice output of daily notes pages, especially when there are empty “heading” sections.
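The header heuristic above can be sketched as follows. This is a Python illustration, not the tool's Clojure code. The `Block` type, the `render` function, and the choice to emit non-heading bullets as plain paragraphs are all assumptions made for the example. Only the rule itself (has children, or shorter than 60 characters) comes from the description above.

```python
from dataclasses import dataclass, field

HEADLINE_LIMIT = 60  # "short" means fewer than 60 characters


@dataclass
class Block:
    """Hypothetical stand-in for a Roam bullet."""
    text: str
    children: list = field(default_factory=list)


def is_heading(block: Block) -> bool:
    """A bullet becomes a heading if it has children or its text is short."""
    return bool(block.children) or len(block.text) < HEADLINE_LIMIT


def render(block: Block, depth: int = 1) -> list:
    """Return org-mode lines for a block and its descendants."""
    if is_heading(block):
        lines = ["*" * depth + " " + block.text]
    else:
        # Assumption: long leaf bullets are written as plain paragraphs.
        lines = [block.text]
    for child in block.children:
        lines.extend(render(child, depth + 1))
    return lines


long_text = "x" * 80
page = Block("Meeting notes", [
    Block(long_text),
    Block("Action items", [Block("Email Bob")]),
])
print("\n".join(render(page)))
```

Note that an empty bullet such as a bare “Action items” section header still becomes a heading under this rule, which matches the daily-notes case mentioned above.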
Partially working, needs attention
- Tables.
- Formula macros are not converted.
Does not work, but could
- Block refs and embedded blocks
- Page refs
- Mermaid diagrams (could be exported as source blocks with suitable language)
Unlikely to ever work
Most of these have no org-mode equivalent:
- Roam header formatting (h1, h2, etc.)
- Spaced repetition
- Pomodoro timer
- Calculator macros
- Word count macro
- Iframe
- Encrypted text
- Kanban board
- Mentions
- Query macros
- Attribute tables
- Embedded PDFs
- Nested tables
- General hiccup formatting
- General Roam diagrams (i.e., the `{{diagram}}` nodes)
- Roam templates
A note on Markdown parsing
If you look into `src/roam.clj` you will probably be surprised, then worried. You will wonder “Why didn’t you just use Instaparse?” or “Don’t you know regular expressions?”
The problem is that Markdown cannot be parsed with either regexes or context-free grammars. It is made for humans to read, not computers. There’s no formal specification for markdown syntax and there’s no BNF grammar that works.
The parser implemented here is a cross between a state machine and a virtual machine. It maps the current state and next input to a list of register-manipulation instructions. When an input matches, the instructions are interpreted sequentially. This allows me to have a compact representation of some complex logic that includes “pseudo-backtracking” without using a pushdown parsing stack. (Although there is a stack of parser states to handle cases like “an inline code segment inside a bold text span”.)
The tough cases are things like incomplete hyperlinks, where you think you’re parsing the link text and href but, because the closing delimiter never arrives, it turns out you’re just accumulating a text span. For these cases, you’ll see the virtual machine accumulating text in two registers, one of which (usually `z`) is a fallback that gets used if it turns out to be plain text. Instead of backtracking, the machine accumulates both options and decides at the end which option to use.
This approach does not generalize to other languages, where you might have an unbounded amount of backtracking, but it works well enough when there are only two alternatives. It’s easy enough to support more alternatives if necessary… since each “register” is just a map key, I can always add more registers. The bookkeeping in the instruction lists would get increasingly hairy though.
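To make the two-register idea concrete, here is a much-simplified sketch in Python. It is not the code in the repository: the state names, the register names other than `z`, and the `convert_links` function are invented for illustration, and it handles only external links of the form `[text](href)`. While a candidate link is being read, the raw characters go into `z` and the parsed pieces go into `text` and `href`. Whichever version turns out to be right is appended to the output.

```python
def convert_links(src: str) -> str:
    """Rewrite [text](href) as [[href][text]]; leave incomplete links as text."""
    state = "text"
    reg = {"out": "", "text": "", "href": "", "z": ""}

    def reset(emit: str) -> None:
        # Commit one of the two accumulated alternatives, then start over.
        nonlocal state
        reg["out"] += emit
        reg["text"] = reg["href"] = reg["z"] = ""
        state = "text"

    for ch in src:
        if state == "text":
            if ch == "[":
                state = "link-text"
                reg["z"] = ch          # start the fallback copy
            else:
                reg["out"] += ch
        elif state == "link-text":
            reg["z"] += ch
            if ch == "]":
                state = "expect-paren"
            else:
                reg["text"] += ch
        elif state == "expect-paren":
            reg["z"] += ch
            if ch == "(":
                state = "href"
            else:
                reset(reg["z"])        # not a link: use the fallback
        elif state == "href":
            reg["z"] += ch
            if ch == ")":
                reset("[[" + reg["href"] + "][" + reg["text"] + "]]")
            else:
                reg["href"] += ch

    # Input ended mid-link (or z is empty): emit whatever raw text is pending.
    return reg["out"] + reg["z"]


print(convert_links("see [docs](https://example.com) now"))
print(convert_links("an [unfinished link"))
```

This sketch ignores nesting, escapes, and URLs that contain parentheses, so it only shows the decide-at-the-end mechanism, not a complete link parser.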
Remaining tasks
- [X] PDFs embedded in a page export as “pdfhttps://….” because the macro isn’t handled in `org.clj`
- [X] Rewrite links for downloaded assets
- [X] Bug: link must be relative to the .org file. Currently it is generated relative to the execution directory.
- [ ] Bug? Does Roam markdown inside table cells get rewritten?
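For the relative-link bug in the list above, the intended behavior can be illustrated with a short Python sketch. This is not the fix used in the tool. The `link_target` function and the example paths are hypothetical. The point is only that the link is computed from the directory of the `.org` file rather than from the directory where the script runs.

```python
import posixpath


def link_target(asset_path: str, org_file: str) -> str:
    """Return asset_path relative to the directory that contains org_file."""
    org_dir = posixpath.dirname(org_file) or "."
    return posixpath.relpath(asset_path, start=org_dir)


# Both paths are given relative to the same execution directory.
target = link_target("out/assets/diagram.png", "out/daily/2021-03-01.org")
print("[[file:" + target + "]]")
```

`posixpath` is used instead of `os.path` so that the result always uses forward slashes, which is what an org-mode `file:` link expects.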
Remaining features
- [ ] Download firebase (and maybe other locations) images & attachments to local folder
- [X] Convert Roam tables to org-mode
- [ ] Allow cli argument to specify partial processing (batches? entities?)