hplt-project / OpusCleaner

OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.

Home Page: https://pypi.org/project/opuscleaner/

A simple length ratio filter shouldn't require manually updating pyhash submodules

kpu opened this issue

I wanted to do a length ratio filter.

From the UI it looks like this is only available from opusfilter. So I ran pip install opusfilter, but that failed because pyhash won't compile.

      building '_pyhash' extension                                                                                                                                                                        
      x86_64-pc-linux-gnu-gcc -Wsign-compare -DNDEBUG -march=native -O3 -pipe -fPIC -DSUPPORT_INT128=1 -Isrc/pybind11/include -Isrc/highwayhash -I/tmp/pip-build-env-5gdw27bb/overlay/lib/python3.11/site-packages/pybind11/include -I/home/kpu/hplt/opuscleaner/include -I/usr/include/python3.11 -c src/Hash.cpp -o build/temp.linux-x86_64-cpython-311/src/Hash.o -std=c++17 -fvisibility=hidden -g0 -march=native -std=c++14
      In file included from src/pybind11/include/pybind11/cast.h:16,                                                                                                                                      
                       from src/pybind11/include/pybind11/attr.h:13,                                                                                                                                      
                       from src/pybind11/include/pybind11/pybind11.h:13,                                                                                                                                  
                       from src/Hash.h:8,                                                                                                                                                                 
                       from src/Hash.cpp:1:                                                                                                                                                               
      src/pybind11/include/pybind11/detail/type_caster_base.h: In function ‘std::string pybind11::detail::error_string()’:                                                                                
      src/pybind11/include/pybind11/detail/type_caster_base.h:482:26: error: invalid use of incomplete type ‘PyFrameObject’ {aka ‘struct _frame’}                                                         
        482 |             frame = frame->f_back;                                                                                                                                                          
            |                          ^~                                                                                                                                                                 
      In file included from /usr/include/python3.11/Python.h:42,                                                                                                                                          
                       from src/pybind11/include/pybind11/detail/common.h:186,                                                                                                                            
                       from src/pybind11/include/pybind11/pytypes.h:12,                                                                                                                                   
                       from src/pybind11/include/pybind11/cast.h:13:                                                                                                                                      
      /usr/include/python3.11/pytypedefs.h:22:16: note: forward declaration of ‘PyFrameObject’ {aka ‘struct _frame’}                                                                                      
         22 | typedef struct _frame PyFrameObject;                                                                                                                                                        
            |                ^~~~~~                                                                                                                                                                       
      In file included from /usr/include/python3.11/Python.h:38:                                                                                                                                          
      src/pybind11/include/pybind11/pybind11.h: In function ‘pybind11::function pybind11::detail::get_type_override(const void*, const type_info*, const char*)’:                                         
      src/pybind11/include/pybind11/pybind11.h:2348:54: error: ‘PyCodeObject’ {aka ‘struct PyCodeObject’} has no member named ‘co_varnames’; did you mean ‘co_names’?                                     
       2348 |                     locals, PyTuple_GET_ITEM(f_code->co_varnames, 0)                                                                                                                        
            |                                                      ^~~~~~~~~~~                                                                                                                            
      /usr/include/python3.11/pyport.h:24:38: note: in definition of macro ‘_Py_CAST’                                                                                                                     
         24 | #define _Py_CAST(type, expr) ((type)(expr))                                                                                                                                                 
            |                                      ^~~~                                                                                                                                                   
      /usr/include/python3.11/cpython/tupleobject.h:30:38: note: in expansion of macro ‘_PyTuple_CAST’                                                                                                    
         30 | #define PyTuple_GET_ITEM(op, index) (_PyTuple_CAST(op)->ob_item[index])                                                                                                                     
            |                                      ^~~~~~~~~~~~~                                                                                                                                          
      src/pybind11/include/pybind11/pybind11.h:2348:29: note: in expansion of macro ‘PyTuple_GET_ITEM’                                                                                                    
       2348 |                     locals, PyTuple_GET_ITEM(f_code->co_varnames, 0)                                                                                                                        
            |                             ^~~~~~~~~~~~~~~~                                                                                                                                                
      In file included from src/Halftime.h:9,                                                                                                                                                             
                       from src/Hash.cpp:16:                                                                                                                                                              
      src/halftime/halftime-hash.hpp: In instantiation of ‘struct halftime_hash::advanced::{anonymous}::RepeatWrapper<halftime_hash::advanced::{anonymous}::BlockWrapper256, 2>’:                         
      src/halftime/halftime-hash.hpp:842:9:   required from ‘void halftime_hash::advanced::{anonymous}::Hash(const uint64_t*, const char*, size_t, uint64_t*) [with BlockWrapper = RepeatWrapper<BlockWrapper256, 2>; unsigned int dimension = 5; unsigned int in_width = 3; unsigned int encoded_dimension = 9; unsigned int out_width = 5; uint64_t = long unsigned int; size_t = long unsigned int]’
      src/halftime/halftime-hash.hpp:1039:1:   required from ‘void halftime_hash::advanced::V4Avx2(const uint64_t*, const char*, size_t, uint64_t*) [with unsigned int dimension = 5; unsigned int in_width = 3; unsigned int encoded_dimension = 9; unsigned int out_width = 5; uint64_t = long unsigned int; size_t = long unsigned int]’
      src/halftime/halftime-hash.hpp:1092:1:   required from here
      src/halftime/halftime-hash.hpp:869:9: warning: ignoring attributes on template argument ‘halftime_hash::advanced::{anonymous}::BlockWrapper256::Block’ {aka ‘__m256i’} [-Wignored-attributes]
        869 |   using Block = Repeat<InnerBlock, count>;                                                                                                                                                  
            |         ^~~~~                                                                                                                                                                               
      src/halftime/halftime-hash.hpp: In static member function ‘static halftime_hash::advanced::{anonymous}::EhcBadger<BlockWrapper, dimension, in_width, encoded_dimension, out_width, fanout>::Block halftime_hash::advanced::{anonymous}::EhcBadger<BlockWrapper, dimension, in_width, encoded_dimension, out_width, fanout>::MixOne(Block, Block, uint64_t) [with BlockWrapper = halftime_hash::advanced::{anonymous}::RepeatWrapper<halftime_hash::advanced::{anonymous}::BlockWrapper256, 2>; unsigned int dimension = 6; unsigned int in_width = 3; unsigned int encoded_dimension = 7; unsigned int out_width = 2; unsigned int fanout = 8]’:
      src/halftime/halftime-hash.hpp:468:16: note: the ABI for passing parameters with 64-byte alignment has changed in GCC 4.6                                                                           
        468 |   static Block MixOne(Block accum, Block input, uint64_t entropy) {                                                                                                                         
            |                ^~~~~~                                                                                                                                                                       
      error: command '/usr/bin/x86_64-pc-linux-gnu-gcc' failed with exit code 1                                                                                                                           
      [end of output]                                                                                                                                                                                     
                                                                                                                                                                                            
  note: This error originates from a subprocess, and is likely not a problem with pip.                                                                                                                  
  ERROR: Failed building wheel for pyhash                                                                                                                                                               
Failed to build pyhash                                                                                                                                                                       

It appears pyhash bundles an ancient pybind11, so I updated that submodule.

git clone https://github.com/flier/pyfasthash
cd pyfasthash/
git submodule init
git submodule update
cd src/pybind11
git pull https://github.com/pybind/pybind11.git
cd ../..
pip3 install .

Seems a bit much for a length ratio filter.
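
For comparison, here is a rough sketch of what a dependency-free length ratio filter could look like. This is not how OpusCleaner or opusfilter implement it. The function names, the tab-separated two-column input, the whitespace tokenisation and the default threshold of 3.0 are all assumptions made for illustration.

```python
def length_ratio(src: str, trg: str) -> float:
    """Return longer/shorter length, counted in whitespace-separated tokens."""
    a, b = len(src.split()), len(trg.split())
    if min(a, b) == 0:
        # An empty side can never pass a finite threshold.
        return float("inf")
    return max(a, b) / min(a, b)


def filter_pairs(lines, max_ratio=3.0):
    """Yield tab-separated lines whose two columns pass the ratio check.

    max_ratio=3.0 is an arbitrary illustrative default, not a recommended value.
    """
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 2:
            # Skip malformed rows instead of guessing at the columns.
            continue
        if length_ratio(cols[0], cols[1]) <= max_ratio:
            yield line
```

To use it as a stdin/stdout filter, something like sys.stdout.writelines(filter_pairs(sys.stdin)) after importing sys should be enough. No compiled extension is involved.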