jnoortheen / parser-experiments

Python parsing experiments

Overview

Experiments with different ways to parse Python source and generate an AST, comparing speed and memory usage. All of the parsers below handle the Python 3.10 grammar.

module                      total allocated size   time    peak memory
pgen2                       369 KiB                0.08s   684 KiB
pegen + regex (tokenizer)   1767 KiB               0.32s   2015 KiB
pegen                       1234 KiB               0.38s   2281.9 KiB
xonsh-ply                   8240.6 KiB             0.65s   10333.5 KiB
lark (lalr-cached)          3753.7 KiB             0.74s   9307.3 KiB
parso                       3542.7 KiB             0.80s   3690.2 KiB
treesitter                  9137.0 KiB             1.56s   9708.7 KiB
libcst                      21817 KiB              6.5s    23024.4 KiB
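The README does not say how these figures were collected. As a rough sketch, and assuming the standard-library `tracemalloc` and `time.perf_counter` are acceptable measuring tools, the three columns could be gathered for any parse callable as shown here. The `measure` helper is hypothetical, and `ast.parse` is only a stand-in for the parsers under test.

```python
import ast
import time
import tracemalloc


def measure(parse, source):
    """Return (retained_kib, seconds, peak_kib) for one call to parse(source).

    Hypothetical helper: the original benchmark setup is not documented.
    """
    tracemalloc.start()
    start = time.perf_counter()
    tree = parse(source)  # keep a reference so the result counts as retained
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # highest traced usage so far
    snapshot = tracemalloc.take_snapshot()     # what is still allocated now
    tracemalloc.stop()
    retained = sum(stat.size for stat in snapshot.statistics("lineno"))
    del tree
    return retained / 1024, elapsed, peak / 1024


if __name__ == "__main__":
    sample = "x = 1\n" * 1000
    retained, seconds, peak = measure(ast.parse, sample)
    print(f"retained {retained:.1f} KiB, {seconds:.2f}s, peak {peak:.1f} KiB")
```

Tracing slows execution down, so timings taken this way are best used only to compare parsers against each other.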

Both pgen2 and pegen look good, especially pgen2 in terms of memory usage and speed. Still, we can use pegen, as it has a separate PyPI package. We can also expect some stability from it, as Python may include more and more PEG-only grammar changes.

Conclusions

A. xonsh-ply

  • the existing parser is slow and uses more memory than pgen2 or pegen.
  • the ply codebase is a mess, though rply is good; with some care we can optimize it

B. pegen

  • it follows the official CPython parser, hence it is future-proof
  • generates an AST that we can feed directly to the interpreter
  • has a large peak memory size, but that memory gets released and it ends up at a reasonable size
    • when regex is used to tokenize, the peak memory is 2015 KiB
  • I found a PR that intends to use pegen in place of ply
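To illustrate what "feed directly to the interpreter" means, here is a minimal sketch that uses the standard-library `ast.parse` as a stand-in for a pegen-generated parser. It assumes the generated parser returns ordinary `ast` nodes; the source string and the `<example>` filename are made up for the illustration.

```python
import ast

# Stand-in for the output of a pegen-generated parser (assumption: it yields
# standard ast nodes, as ast.parse does).
source = "result = 6 * 7"
tree = ast.parse(source, mode="exec")

# The built-in compile() accepts an ast.Module directly, so no intermediate
# text or custom tree conversion is needed before execution.
code = compile(tree, filename="<example>", mode="exec")

namespace = {}
exec(code, namespace)
print(namespace["result"])  # prints 42
```

With a parser that builds its own tree type instead, an extra conversion step to `ast` nodes would be needed before `compile()`.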

C. pgen2

  • it comes from the lib2to3 package of CPython, which will be removed in Python 3.13 or so, so it is not very future-proof
    • but the black formatter has forked it, so it may stick around for some time. We can refer to these packages if we decide to base our parser on it
  • it uses the least memory and is the fastest of the tools tested here
  • Links

D. parso

  • it is a fork of pgen2
  • does error recovery of sorts, hence the higher memory usage
  • we can pick some pieces from this project if we decide to use pgen2

E. treesitter

  • even with the Python bindings, it ended up using more memory.
  • it seems the memory is not freed, as the peak memory is about the same as the total allocated size.

Next steps

  1. implement the completion-context parser in both pgen2 and pegen, and compare the
    1. development time
    2. performance
    3. memory usage

Links
