tzickel / chunkedbuffer

An attempt at making a more efficient buffered I/O data structure for Python

Add common string methods to the buffer

mattip opened this issue · comments

It would be nice if this implemented common string methods like `strip`, `replace`, or `encode`.

See this discussion on Python Ideas

@apalala

The problem with this idea is this:

$ python3 -c "from pprint import pprint; pprint(dir(str))"
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

The goal of this library is not to be a replacement for a generic bytes or str implementation. Such an undertaking would be huge, since you would need to reimplement all the low-level algorithms to handle bytes that are not consecutive in memory (and, for str, to also deal with the specific Unicode encoding and valid cut points), both for input and output.
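To illustrate the point, here is a hedged sketch (hypothetical helper, not this library's API) of what even a simple `find` has to do over non-contiguous chunks: a per-chunk `bytes.find()` misses matches that straddle a chunk boundary, so a small carry window must be kept between chunks.

```python
# Hypothetical sketch, not this library's API: find a pattern across a
# sequence of non-contiguous bytes chunks without joining them all.
def chunked_find(chunks, pattern):
    """Return the absolute offset of the first match, or -1."""
    offset = 0
    carry = b""  # tail of the previous chunk, to catch boundary-straddling matches
    for chunk in chunks:
        window = carry + chunk  # only a small window is copied, never the whole buffer
        pos = window.find(pattern)
        if pos != -1:
            return offset - len(carry) + pos
        # keep the last len(pattern) - 1 bytes for the next iteration
        carry = window[-(len(pattern) - 1):] if len(pattern) > 1 else b""
        offset += len(chunk)
    return -1
```

Multiply that boundary bookkeeping by every method in the `dir(str)` list above and the scope of the request becomes clear.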

I would be surprised if any language / library supports this use case.

The goal is to provide enough of an API to allow stream processing of bytes in the context of a stream protocol (such as HTTP over TCP) with zero or minimal memory allocation and copying.
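As a generic illustration of that zero-copy style (standard library only, not this library's API), `memoryview` lets you carve up a receive buffer without duplicating bytes; a chunked buffer generalizes the same idea across multiple non-contiguous chunks:

```python
# Standard-library illustration of zero-copy slicing, not this library's API.
buf = bytearray(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nhi")
view = memoryview(buf)

header_end = buf.find(b"\r\n\r\n") + 4
headers = view[:header_end]  # a view into buf, no bytes copied
body = view[header_end:]     # likewise, no copy

# Copying happens only when explicitly requested:
assert bytes(body) == b"hi"
```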

Workloads that handle many large payloads concurrently stand to benefit even more from this goal.

It's also important, if you really do hit bottlenecks, to understand what they are and where they come from: in my simple benchmark (assuming I did it right), CPython's interpreter overhead outweighed the optimizations this library provides.


I also support some helper functions such as split, strip, and splitlines (I can add more from bytearray) that are intended to be used on sub-parts of the buffer (thus still keeping in line with minimal copying and allocation), which can help with processing:

def split(self, sep=None, maxsplit=-1):
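A chunk-aware split along these lines could be sketched as follows (a hypothetical standalone function, not the library's implementation): it yields each field as a list of `memoryview` slices into the original chunks, so no byte data is copied unless the caller materializes a field.

```python
# Hypothetical sketch, not this library's implementation: split a sequence
# of chunks on a single-byte separator, yielding memoryview slices.
def chunked_split(chunks, sep):
    assert len(sep) == 1  # the sketch only handles one-byte separators
    part = []  # memoryview slices making up the current field
    for chunk in chunks:
        view = memoryview(chunk)
        start = 0
        while True:
            pos = chunk.find(sep, start)
            if pos == -1:
                break
            part.append(view[start:pos])
            yield part  # a field may span several chunks
            part = []
            start = pos + 1
        if start < len(chunk):
            part.append(view[start:])
    yield part  # trailing field (possibly empty)
```

A caller pays for a copy only when joining a field back into contiguous bytes, e.g. `b"".join(map(bytes, part))`.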