CogStack / MedCATtutorials

General tutorials for the setup and use of MedCAT.

medcat==1.8.0 doesn't install with latest Rust-1.73.0

mkorvas opened this issue

Following the commands from the MedCAT tutorial on my recently updated Arch Linux, I started by pip-installing medcat==1.8.0:

TMPDIR=$(realpath tmp) pip install medcat==1.8.0

and it failed while building the transitive dependency tokenizers-0.12.1:

         Compiling tokenizers v0.12.1 (/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/tokenizers-lib)
           Running `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts,future-incompat --crate-type lib --emit=dep-info,metadata,link -C opt-level=3 -C embed-bitcode=no --cfg 'feature="cached-path"' --cfg 'feature="clap"' --cfg 'feature="cli"' --cfg 'feature="default"' --cfg 'feature="http"' --cfg 'feature="indicatif"' --cfg 'feature="progressbar"' --cfg 'feature="reqwest"' -C metadata=6e744bd72fbca6b6 -C extra-filename=-6e744bd72fbca6b6 --out-dir /home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps -L dependency=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps --extern aho_corasick=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libaho_corasick-945b53c31d17d93a.rmeta --extern cached_path=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libcached_path-f08bff030f68babf.rmeta --extern clap=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libclap-e9d371f5e8d6a9a3.rmeta --extern derive_builder=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libderive_builder-fa11fc961fe52533.so --extern dirs=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libdirs-1a1d9e829264b7da.rmeta --extern esaxx_rs=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libesaxx_rs-85538497f74112a9.rmeta --extern indicatif=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libindicatif-d0d39a7cdd2548d8.rmeta --extern itertools=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libitertools-69eed52371d42a58.rmeta --extern lazy_static=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/liblazy_static-f66451aaeb61e431.rmeta --extern log=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/liblog-c574061a79b01b9c.rmeta --extern macro_rules_attribute=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libmacro_rules_attribute-fba70e287e0c3709.rmeta --extern onig=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libonig-e1ec9f287b0bb2a0.rmeta --extern paste=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libpaste-1e8a081fe8f77648.so --extern rand=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/librand-901f96c0508326da.rmeta --extern rayon=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/librayon-21e5476475f6123c.rmeta --extern rayon_cond=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/librayon_cond-abebb32de588b7d4.rmeta --extern 
regex=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libregex-e63a632912025278.rmeta --extern regex_syntax=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libregex_syntax-2ed5634723cf75a8.rmeta --extern reqwest=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libreqwest-31671eba5c38f195.rmeta --extern serde=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libserde-7eecf8cc84b5f85e.rmeta --extern serde_json=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libserde_json-486bdd6da639b7af.rmeta --extern spm_precompiled=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libspm_precompiled-38b82cabeec534fc.rmeta --extern thiserror=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libthiserror-b66f0526c1fb2f50.rmeta --extern unicode_normalization_alignments=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libunicode_normalization_alignments-2c588a19019b70cf.rmeta --extern unicode_segmentation=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libunicode_segmentation-76879f425c2b2d2d.rmeta --extern unicode_categories=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/deps/libunicode_categories-766047a35d8335eb.rmeta -L native=/usr/lib -L native=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/build/zstd-sys-39732ab2cbd6d2b3/out -L native=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/build/esaxx-rs-56e5ad34b63d614b/out -L native=/home/matej/proj/medcat/tmp/pip-install-ke2936w0/tokenizers_6e6a038a805943288e0da324d69fa299/target/release/build/onig_sys-ded1a183a7abd08d/out`
      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:265:21
          |
      265 |                 let mut target_node = &mut best_path_ends_at[key_pos];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`
          |
          = note: `#[warn(unused_mut)]` on by default

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/models/unigram/model.rs:282:21
          |
      282 |                 let mut target_node = &mut best_path_ends_at[starts_at + mblen];
          |                     ----^^^^^^^^^^^
          |                     |
          |                     help: remove this `mut`

      warning: variable does not need to be mutable
         --> tokenizers-lib/src/pre_tokenizers/byte_level.rs:200:59
          |
      200 |     encoding.process_tokens_with_offsets_mut(|(i, (token, mut offsets))| {
          |                                                           ----^^^^^^^
          |                                                           |
          |                                                           help: remove this `mut`

      error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |
          = note: `#[deny(invalid_reference_casting)]` on by default

      warning: `tokenizers` (lib) generated 3 warnings
      error: could not compile `tokenizers` (lib) due to previous error; 3 warnings emitted

      Caused by:
        process didn't exit successfully: `rustc --crate-name tokenizers --edition=2018 tokenizers-lib/src/lib.rs ...` (exit status: 1)
      error: `cargo rustc --lib --message-format=json-render-diagnostics --manifest-path Cargo.toml --release -v --features pyo3/extension-module --crate-type cdylib --` failed with code 101
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip

This was with Rust-1.73.0 installed on the system; after downgrading to Rust-1.72.1, the build worked. A post in the AUR discussion of the python-tokenizers package suggests that requiring tokenizers==0.14.1 instead should also make this work (with at least Rust-1.70.0).
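In case it helps others, here is a minimal sketch of the workaround that worked for me (it assumes rustup manages the Rust toolchain; adjust to your setup):

    # Build medcat under the older toolchain that still compiles tokenizers-0.12.1
    rustup install 1.72.1
    rustup default 1.72.1
    TMPDIR=$(realpath tmp) pip install medcat==1.8.0
    rustup default stable  # switch back once the wheel is built

The tokenizers==0.14.1 route suggested on AUR may avoid the toolchain juggling entirely, but I haven't verified that path myself.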

I am posting this issue here because it effectively breaks the tutorial's instructions, even though it's probably not something that can easily be fixed in the tutorial itself.

Thank you for letting us know! It's much appreciated!

Although v1.8.0 would probably not be installable in a clean environment anyway, due to the py2neo issue (see CogStack/MedCAT#356).

With that said, I'll try and bump the versions in the tutorial to the latest medcat version (1.9.3).
That release fixed the py2neo issue. Plus, it doesn't impose the same limitation on transformers versions: medcat~=1.8 specifies transformers>=4.19.2,<4.22.0, which in turn pins tokenizers>=0.11.1,!=0.11.3,<0.13.

I've created a PR for the above.
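For anyone following along, a quick way to double-check what the bump actually resolves to (a sketch; assumes a fresh virtual environment):

    # Install the bumped version and inspect the resolved dependency versions
    pip install "medcat==1.9.3"
    pip show transformers tokenizers | grep -E '^(Name|Version)'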

Given such a prompt and kind response, let me mention a few more hiccups I encountered while exploring the tutorial. Some of them may be issues on my end; I haven't checked very thoroughly, and I ran the commands locally in an IPython shell or the system shell (for the system commands included in the Jupyter notebooks). Here is the list, anyway:

  1. I had to pip-install seaborn manually -- the only install command I saw in the tutorial was for medcat==1.8.0, and that did not already pull in seaborn.
  2. I also had to pip-install PyQt5 to make the plt.show() calls do something visible. (I guess this is likely caused by me not running the tutorial in the context of Jupyter.)
  3. The cat.cdb.print_stats() calls in sections 3.2 and 3.3 of the tutorial didn't have any visible effect when I ran them, either. However, another similar method that I found in the MedCAT docs, make_stats(), did print something informative:
    In [72]: %cpaste -q
    # Now print statistics on the CDB after training
    cat.cdb.print_stats()
    --
    
    In [73]: cat.cdb.make_stats()
    Out[73]:
    {'Number of concepts': 34724,
     'Number of names': 92740,
     'Number of concepts that received training': 34724,
     'Number of seen training examples in total': 4098991,
     'Average training examples per concept': 118.04489690127865}
    
    In [74]: cat.cdb.print_stats()
    
    
  4. In part 3.2, the simple cat.train(...) method apparently worked (the later cat.get_entities call identified the entity in the test input sentence), but running results = cat.multiprocessing(in_data, nproc=2) yielded empty results (an empty tuple, I think); the call shape is sketched just after this list. Maybe I need a (stronger) GPU card for that to work? I just noticed an update to the MedCAT readme providing an alternative command for installing MedCAT in a CPU-only setup...
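For context, the call I ran looked roughly like this (the two documents below are placeholders, not the tutorial data; cat is the CAT instance trained earlier in the notebook):

    # Placeholder documents; the tutorial builds in_data from its own dataset.
    in_data = [
        (1, "Patient presents with chronic kidney disease."),
        (2, "No history of diabetes mellitus."),
    ]
    results = cat.multiprocessing(in_data, nproc=2)
    print(results)  # expected a dict keyed by document id; I got an empty result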

Thank you for the further feedback. We don't get much feedback for the tutorials so this is much appreciated!

With that said, our tutorials are (at least for the time being) targeting Jupyter Notebooks and/or Google Colab. Feature parity in other environments is not guaranteed.

  1. It's a little odd that this isn't caught by the smoke tests in the GitHub Actions. Nor has it really been an issue in Google Colab. In any case, since I recall having this issue myself (and I couldn't find anything that explicitly says that Jupyter Notebook provides seaborn out of the box), I've created a PR for this.
  2. That's most likely to do with your environment indeed.
  3. This works as intended in a notebook. The reason it doesn't work for you is that the output is logged rather than printed (which does make the method name somewhat misleading). It likely works in notebooks because they configure the root logger to output to the notebook; see the logging sketch after this list. With that said, your workaround is exactly what's used internally, so it should do fine as a replacement. PS: I also couldn't find the CDB.print_stats method being used in Part 3.3.
  4. The multiprocessing method works just fine on the Colab page, returning a dict as expected. You don't need a powerful GPU to run it, especially for the limited amount of data in the tutorial. The CPU-only setup is really only necessary if you wish to limit the size of your install (e.g. if you want a smaller Docker image); the default install should run on a system with or without a GPU. And if multiprocessing failed to run, it should raise an exception, so if it silently returned nothing for you, I'd expect something else to be wrong (e.g. empty input data).
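For completeness, if you're running outside a notebook, a minimal sketch for making the logged stats visible (this just configures the standard logging module; notebook environments do something equivalent for the root logger):

    import logging

    # Send log records to the console so the output of print_stats() becomes
    # visible, mirroring what notebook environments configure by default.
    logging.basicConfig(level=logging.INFO)

    cat.cdb.print_stats()  # the stats should now appear via the logger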

Thanks for the quick and extensive reply again!

Indeed, I am not finding any occurrences of "print_stats" in Part 3.3 of the tutorial; I hadn't even downloaded a copy of that one. However, FWIW, I notice it's titled "Part 3.2 - Extracting Diseases from Electronic Health Records.ipynb" although the URL has "Part_3_3_Model_technical_optimisations.ipynb" in it... probably a copy-paste error?

I see what you mean now. Though I'm not entirely sure where it grabs the title or how to change it.