18F / omb-eregs

A tool to find, read, and maintain White House Office of Management and Budget (OMB) policy requirements

Home Page: https://policy-beta.cio.gov/

Serialize section title from editor

cmc333333 opened this issue

The ProseMirror editor isn't adding a title field to sections when it serializes them, meaning we lose that from the database.

This may be a good time to encode the associated schema as

sect: heading block+

to ensure we'll always have a heading to grab a title from. If we do that, though, we'll need to verify that the PDF parser emits that structure consistently.

As far as I can tell, it seems like the PDF parser is always emitting a sec before it emits a heading:

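    def begin_heading(self, heading):
        # Close sections until we are shallower than the new heading.
        while heading.level <= self.sec_level:
            self.cursor_stack.pop()
        # Open a sec for each level needed to reach the heading's depth.
        while heading.level > self.sec_level:
            self.cursor_stack.append(self.cursor.add_child('sec'))
        # The heading becomes a child of the sec just opened.
        self.cursor_stack.append(self.cursor.add_child('heading'))
        self.cursor.pdf_node = heading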

When the first while loop exits, heading.level > self.sec_level must hold, so the second while loop is guaranteed to execute at least once. That means at least one sec is always added immediately before a heading.
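To make that argument easier to check, here is a minimal standalone sketch of the same two-loop logic. It is not the real semdb.py code: the `Node` and `Builder` classes, the `end_heading` method, and the definition of `sec_level` as the number of sec nodes on the cursor stack are all assumptions made for illustration.

```python
class Node:
    """Hypothetical stand-in for a document tree node."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []

    def add_child(self, tag):
        child = Node(tag)
        self.children.append(child)
        return child


class Builder:
    """Simplified model of the parser's cursor stack (assumed, not the real class)."""

    def __init__(self):
        self.root = Node('doc')
        self.cursor_stack = [self.root]

    @property
    def cursor(self):
        return self.cursor_stack[-1]

    @property
    def sec_level(self):
        # Assumption: nesting depth is the number of secs on the stack.
        return sum(1 for node in self.cursor_stack if node.tag == 'sec')

    def begin_heading(self, level):
        while level <= self.sec_level:
            self.cursor_stack.pop()
        while level > self.sec_level:
            self.cursor_stack.append(self.cursor.add_child('sec'))
        self.cursor_stack.append(self.cursor.add_child('heading'))

    def end_heading(self):
        # Assumption: the heading is popped once it has been processed.
        self.cursor_stack.pop()


def iter_secs(node):
    """Yield every sec node in the tree, depth first."""
    if node.tag == 'sec':
        yield node
    for child in node.children:
        yield from iter_secs(child)


builder = Builder()
for level in [1, 2, 2, 1]:
    builder.begin_heading(level)
    builder.end_heading()

first_children = [sec.children[0].tag for sec in iter_secs(builder.root)]
print(first_children)  # ['heading', 'heading', 'heading', 'heading']
```

For this sequence of heading levels, every sec in the model starts with a heading. The sketch only covers headings that go at most one level deeper at a time.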

That second while loop is also the only place in semdb.py where a sec is created, so it's also true that there aren't any secs that don't have headings immediately after them. So I think that part is cool, though perhaps we could add tests that ensure this is the case--or perhaps we could add some sort of schema validator (see #907) that ensures that once we've imported all the PDFs we can, the sec: heading block+ rule is valid for all our imported documents.
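A validator along those lines could look roughly like the sketch below. It is only an illustration: it assumes documents are available as nested dicts with `tag` and `children` keys, it treats "block" as any non-heading child, and the function name `find_invalid_secs` and the `policy` and `para` tags are made up for the example.

```python
def find_invalid_secs(node, path="root"):
    """Return the paths of sec nodes that violate `heading block+`.

    Assumption: a node is a dict with a 'tag' and an optional
    'children' list, and any non-heading child counts as a block.
    """
    invalid = []
    children = node.get("children", [])
    if node["tag"] == "sec":
        tags = [child["tag"] for child in children]
        starts_with_heading = bool(tags) and tags[0] == "heading"
        has_blocks = len(tags) >= 2 and "heading" not in tags[1:]
        if not (starts_with_heading and has_blocks):
            invalid.append(path)
    for index, child in enumerate(children):
        child_path = f"{path}/{child['tag']}[{index}]"
        invalid.extend(find_invalid_secs(child, child_path))
    return invalid


good = {"tag": "policy", "children": [
    {"tag": "sec", "children": [{"tag": "heading"}, {"tag": "para"}]},
]}
bad = {"tag": "policy", "children": [
    {"tag": "sec", "children": [{"tag": "para"}]},
]}

print(find_invalid_secs(good))  # []
print(find_invalid_secs(bad))   # ['root/sec[0]']
```

Running something like this over every imported document would list the secs that need attention before the stricter schema is turned on in the editor.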

Excellent, thanks for verifying, @toolness !