tree-sitter / py-tree-sitter

Python bindings to the Tree-sitter parsing library

Home Page: https://tree-sitter.github.io/py-tree-sitter/

Tree-sitter Fails with Core Dump on Processing Large Input Code File

rationalga opened this issue

Hi,
Whenever Tree-sitter encounters a large file (> 1 MB) in a codebase, it crashes with a core dump. Here is an example of such an input file:
https://github.com/redox-os/binutils-gdb/blob/master/opcodes/arc-tbl.h
Here is minimal code to reproduce the issue.

py-tree-sitter version: 0.21.1
python version: 3.7

import sys
from tree_sitter import Language, Parser

# Read the large input file as bytes, which is what Parser.parse expects.
file_to_parse = './arc-tbl.h'
with open(file_to_parse, 'rb') as f:
    code = f.read()

# Load the prebuilt C grammar from the shared library.
library_path = 'build/' + sys.platform + '/my-languages.so'
C_LANGUAGE = Language(library_path, 'c')
parser = Parser()
parser.set_language(C_LANGUAGE)

tree = parser.parse(code)

Thank you.

I cannot reproduce - I just built tree-sitter-c with cc parser.c -o parser.so -ltree-sitter -fPIC -shared

If you can provide more information, such as a backtrace or other debugging output, that would be helpful.
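One low-effort way to gather that kind of debugging output from the Python side is the standard-library faulthandler module, which prints the Python-level traceback when the process receives a fatal signal such as SIGABRT or SIGSEGV. The following is only a sketch: the helper name enable_crash_traces is hypothetical, and the output complements a native gdb backtrace rather than replacing it.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Dump Python tracebacks for all threads on fatal signals.

    `stream` must be a real file object with a file descriptor;
    it defaults to sys.stderr. Returns True once the handler is active.
    """
    target = stream if stream is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    # Call this before importing tree_sitter and parsing the large file,
    # so that an abort inside the C extension still leaves a traceback.
    enable_crash_traces()
```

The same effect is available without code changes by running the script as python -X faulthandler debug.py.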

Hi, thanks for taking this up. Here is the stack trace output. This time I ran the above code with Python 3.8. Could you please try the pip version of tree-sitter (pip install tree_sitter==0.20.1)? Thank you.

Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python debug.py'.
Program terminated with signal SIGABRT, Aborted.
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007f2634ea8859 in __GI_abort () at abort.c:79
#2 0x00007f2634f1326e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f263503d298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007f2634f1b2fc in malloc_printerr (str=str@entry=0x7f263503f1e0 "munmap_chunk(): invalid pointer") at malloc.c:5347
#4 0x00007f2634f1b54c in munmap_chunk (p=) at malloc.c:2830
#5 0x00007f2634306e14 in ts_subtree_release (self=..., pool=) at tree_sitter/core/lib/src/./subtree.c:609
#6 ts_subtree_release (pool=0x7fff06380c80, self=...) at tree_sitter/core/lib/src/./subtree.c:588
#7 0x00007f263431494c in ts_tree_delete (self=0x1124aa0) at tree_sitter/core/lib/src/./tree.c:31
#8 0x00007f26342f49e1 in tree_dealloc (self=0x7f26343f33d0) at tree_sitter/binding.c:568
#9 0x00000000005aced3 in ?? ()
#10 0x00000000005b0174 in ?? ()
#11 0x00000000006a180c in ?? ()
#12 0x00000000005afc89 in ?? ()
#13 0x000000000067a583 in PyImport_Cleanup ()
#14 0x000000000067423f in Py_FinalizeEx ()
#15 0x00000000006b418d in Py_RunMain ()
#16 0x00000000006b43fd in Py_BytesMain ()
#17 0x00007f2634eaa083 in __libc_start_main (main=0x4c4510, argc=2, argv=0x7fff06381038, init=, fini=, rtld_fini=, stack_end=0x7fff06381028) at ../csu/libc-start.c:308
#18 0x00000000005da67e in _start ()

Below is the complete minimal code to reproduce the error.

import sys
from tree_sitter import Language, Parser

# Compile the C grammar into a shared library.
Language.build_library(
    "build/my-languages.so",
    ["vendor/tree-sitter-c"],
)

# Read the large input file as bytes.
file_to_parse = './arc-tbl.h'
with open(file_to_parse, 'rb') as f:
    code = f.read()

library_path = 'build/my-languages.so'
C_LANGUAGE = Language(library_path, 'c')
parser = Parser()
parser.set_language(C_LANGUAGE)

tree = parser.parse(code)

Ah, I can reproduce with 0.20.1. Can you please use 0.20.4 instead? 0.20.1 is quite old.
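For anyone hitting the same crash, a quick check of the installed binding version can save a debugging session. The sketch below assumes that 0.20.4 is the first release without the crash, which this thread suggests but does not confirm. The helper names are hypothetical, and it is also an assumption that the distribution is registered under the name tree_sitter used in the pip command above.

```python
import itertools
from importlib import metadata

# Assumption taken from this thread: the crash reproduces on 0.20.1
# and 0.20.4 was suggested as the version to use instead.
SUGGESTED_VERSION = (0, 20, 4)


def parse_version(text):
    """Turn '0.20.1' into (0, 20, 1), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def predates_suggested(version_text, suggested=SUGGESTED_VERSION):
    """Return True if the version is older than the suggested one."""
    return parse_version(version_text) < suggested


def installed_version(dist_name="tree_sitter"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("tree_sitter is not installed")
    elif predates_suggested(found):
        print(f"tree_sitter {found} is older than 0.20.4; consider upgrading")
    else:
        print(f"tree_sitter {found} is at or above 0.20.4")
```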

Thanks, latest version works fine.