tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone

Home Page: https://tensorflow.org

Update grpc dependency for glibc 2.30 compatibility

m01 opened this issue · comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version: 2.0.0
  • Python version: 3.7.4
  • Installed using virtualenv? pip? conda?: N/A
  • Bazel version (if compiling from source): bazel 0.29.1- (@non-git)
  • GCC/Compiler version (if compiling from source): gcc-8 (GCC) 8.3.0
  • CUDA/cuDNN version: 10.1.243-1/7.6.4.38-1
  • GPU model and memory: NVIDIA GeForce GTX 760 4GB

Describe the problem

When building TensorFlow 2.0.0 on a system with glibc 2.30, the build fails because of a function name clash in grpc: glibc 2.30 now declares gettid() itself, which conflicts with grpc's own static gettid() helper. This is already fixed upstream (in grpc/grpc#18950), and grpc releases containing the fix are available. I believe updating the grpc dependency should fix this issue in TensorFlow.
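As a rough sketch of how to tell in advance whether a machine is affected, the check below compares the installed glibc version against 2.30. The helper name version_ge is made up for illustration, and the script assumes a glibc-based system with GNU sort (for -V); on other systems it simply reports the version as unknown.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 >= version $2.
# Relies on "sort -V" (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On glibc systems, getconf GNU_LIBC_VERSION prints e.g. "glibc 2.30".
glibc_version=$(getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}')

if [ -n "$glibc_version" ] && version_ge "$glibc_version" 2.30; then
    echo "glibc $glibc_version declares gettid(); an unpatched grpc is likely to hit the clash"
else
    echo "glibc ${glibc_version:-unknown}: the gettid() clash is not expected"
fi
```

This only checks the version number; the build log below is still the authoritative sign of the problem.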

Provide the exact sequence of commands / steps that you executed before running into the problem

On Arch Linux:

cd $(mktemp -d /tmp/tensorflow-test-build-XXX)
curl -L -o PKGBUILD 'https://git.archlinux.org/svntogit/community.git/plain/trunk/PKGBUILD?h=packages/tensorflow'
makepkg

If you're not on Arch Linux, have a look at the build script in the PKGBUILD file - it contains all the environment variable definitions and build commands, and is very readable.

Any other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

End of build log:

...
INFO: From Compiling external/llvm/lib/DebugInfo/CodeView/TypeRecordMapping.cpp:
external/llvm/lib/DebugInfo/CodeView/TypeRecordMapping.cpp: In member function 'virtual llvm::Error llvm::codeview::TypeRecordMapping::visitKnownRecord(llvm::codeview::CVType&, llvm::codeview::VFTableShapeRecord&)':
external/llvm/lib/DebugInfo/CodeView/TypeRecordMapping.cpp:293:61: warning: 'Byte' may be used uninitialized in this function [-Wmaybe-uninitialized]
  293 |         Record.Slots.push_back(static_cast<VFTableSlotKind>(Byte >> 4));
      |                                                             ^~~~
ERROR: /tmp/bazel/michiel/output/41c10338046435fcb3c7d7f27ec34951/external/grpc/BUILD:507:1: C++ compilation of rule '@grpc//:gpr_base' failed (Exit 1)
external/grpc/src/core/lib/gpr/log_linux.cc:43:13: error: ambiguating new declaration of 'long int gettid()'
   43 | static long gettid(void) { return syscall(__NR_gettid); }
      |             ^~~~~~
In file included from /usr/include/unistd.h:1170,
                 from external/grpc/src/core/lib/gpr/log_linux.cc:41:
/usr/include/bits/unistd_ext.h:34:16: note: old declaration '__pid_t gettid()'
   34 | extern __pid_t gettid (void) __THROW;
      |                ^~~~~~
external/grpc/src/core/lib/gpr/log_linux.cc:43:13: warning: 'long int gettid()' defined but not used [-Wunused-function]
   43 | static long gettid(void) { return syscall(__NR_gettid); }
      |             ^~~~~~
INFO: Elapsed time: 744.187s, Critical Path: 41.65s
INFO: 3096 processes: 3096 local.
FAILED: Build did NOT complete successfully
==> ERROR: A failure occurred in build().
    Aborting...

What I've tried so far to fix this
I tried to emulate the grpc version update from 061c359 and bump grpc to v1.24.3. Since v1.19.x, which the currently referenced grpc version seems to be from, grpc has added a dependency on https://github.com/protocolbuffers/upb, and it wasn't clear to me how to add and initialise this dependency correctly in a way that's consistent with tensorflow's use of bazel. I specifically wasn't sure where to call grpc_deps()/upb_deps() from, or what equivalent action was required instead.

Context
In case it matters, I'm trying to build the Arch Linux Package from source, so that I can add 3.0 to TF_CUDA_COMPUTE_CAPABILITIES, which is required for my graphics card. I managed to do this for an earlier version of tensorflow some time ago without too much difficulty.

I'm having this exact issue. I've been trying to build version 1.13.0 for the past two days.

I am trying to build version 1.13.0 with the compile flag "-mno-avx" because my CPU doesn't support the AVX instruction set. I keep getting errors regarding gettid() being ambiguous. I found the post about the patch to change the function names but I couldn't actually find the patch file that he used.
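For anyone unsure whether they need "-mno-avx", here is a minimal sketch for checking the CPU flags. It assumes an x86 Linux system where /proc/cpuinfo has a "flags" line; has_cpu_flag is a made-up helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given flag is listed on the first
# "flags" line of a cpuinfo file (defaults to /proc/cpuinfo).
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # -w matches whole words only, so "avx" does not match "avx2".
    grep -m 1 '^flags' "$cpuinfo" 2>/dev/null | grep -qw -- "$flag"
}

if has_cpu_flag avx; then
    echo "CPU reports AVX support"
else
    echo "CPU does not report AVX; building with -mno-avx may be needed"
fi
```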

I also changed the function names by hand in the ~/.cache/bazel directory, but Bazel will not compile the source code that is in its cache. It redownloads it before starting every build.

I may have found a workaround. I downloaded the tarball for the version of grpc that the version of tensorflow that I'm compiling uses. I extracted the tarball to ~/grpc and manually changed gettid to sys_gettid in the source code. It was only three or four edits.

I set up a local repository of grpc and added it to the tensorflow WORKSPACE file as a local_repository. I commented out the grpc repository in the tensorflow/workspace.bzl file.

WORKSPACE: (added to bottom of file)

local_repository(
    name = "grpc",
    path = "/home/loophole/grpc",
)

tensorflow/workspace.bzl: (I had to change the sha256 hash and I removed the mirror.bazel.build mirror. That may not be necessary, but I downloaded the file from the python.org mirror and used the sha256 hash of that file; I didn't check the hash of the one on bazel.build.)

EDIT: I'm having some weird formatting issue. Change the sha256 value to "fcaec9796c8cc3618899b4aeb62d1a4741830b682b2d8db502a05f9b93c08937" and comment out or delete the bazel.build mirror in "filegroup_external" with the name of "org_python_license"
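A small sketch of how a sha256 value like the one above can be recomputed and compared for a downloaded file. The helper names file_sha256 and verify_sha256 are invented for illustration, and sha256sum from GNU coreutils is assumed to be installed.

```shell
#!/bin/sh
# Hypothetical helper: print only the hex digest of a file.
file_sha256() {
    sha256sum "$1" | awk '{print $1}'
}

# Hypothetical helper: succeed if the file's digest equals the expected value.
verify_sha256() {
    [ "$(file_sha256 "$1")" = "$2" ]
}

# Usage (the path is a placeholder for whatever file was downloaded):
#   file_sha256 /path/to/downloaded-file
#   verify_sha256 /path/to/downloaded-file "<expected sha256>" && echo "hash matches"
```

The printed digest is what goes into the sha256 attribute in tensorflow/workspace.bzl.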

These are the files that needed to have gettid changed to sys_gettid. First I used grep to see which files had gettid in them, then I used nano and pressed ctrl+w to search for gettid:

./src/core/lib/iomgr/ev_epollex_linux.cc
./grpc/src/core/lib/gpr/log_posix.cc
./grpc/src/core/lib/gpr/log_linux.cc
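The grep-then-edit steps above could be scripted roughly as follows. This is only a sketch: rename_gettid is a made-up function name, it assumes GNU grep and GNU sed (for -i and \b), and it renames every whole-word gettid in .cc files under the given directory, so it is worth reviewing the result before building.

```shell
#!/bin/sh
# Hypothetical helper: rename the identifier gettid to sys_gettid in all
# .cc files below the given directory.
rename_gettid() {
    src_dir=$1
    # -r: recurse, -l: print file names only, -w: match gettid as a whole
    # word, so __NR_gettid and sys_gettid are not matched.
    grep -rlw --include='*.cc' 'gettid' "$src_dir" | while IFS= read -r file; do
        # \b is a GNU sed word boundary; __NR_gettid is left untouched
        # because "_" counts as a word character.
        sed -i 's/\bgettid\b/sys_gettid/g' "$file"
        echo "patched: $file"
    done
}

# Usage, assuming the tarball was extracted to ~/grpc as described above:
#   rename_gettid "$HOME/grpc"
```

Running it a second time should be a no-op, since sys_gettid no longer matches as a whole-word gettid.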

It's still compiling, so I don't know if it's going to work yet.

Make sure you get the version of Bazel that was tested with the version of tensorflow you want to compile. There's a table at the bottom of this page: https://www.tensorflow.org/install/source

I used this command to remove the previous version of Bazel:

rm -rf ~/.bazel ~/.bazelrc ~/.cache/bazel ~/bin/bazel

I installed Bazel with --user, so if you installed without that flag then you may also need to remove: (before installing another version of Bazel)

/usr/local/bin/bazel

Target //tensorflow/tools/pip_package:build_pip_package up-to-date:
  bazel-bin/tensorflow/tools/pip_package/build_pip_package
INFO: Elapsed time: 12803.103s, Critical Path: 439.59s, Remote (0.00% of the time): [queue: 0.00%, setup: 0.00%, process: 0.00%]
INFO: 4414 processes: 4414 local.
INFO: Build completed successfully, 4757 total actions

Ah, that's a good workaround approach @l0ophole. There's another way to do that - letting bazel do the patching for us:

  1. Create .patch file with the gettid rename changes using your method, or:

git clone https://github.com/grpc/grpc.git && cd grpc
git checkout 4566c2a29ebec0835643b972eb99f4306c4234a3
git cherry-pick 57586a1ca7f17b1916aed3dea4ff8de872dbf853
# resolve conflicts
git diff [--cached] > /path/to/tensorflow-2.0.0/third_party/grpc/gettid.patch

  2. Make the following change to tensorflow/workspace.bzl:

--- a/tensorflow/workspace.bzl	2019-09-27 22:56:33.000000000 +0100
+++ b/tensorflow/workspace.bzl	2019-10-28 19:19:32.441547370 +0000
@@ -519,6 +519,7 @@
             "https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz",
             "https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz",
         ],
+        patch_file = clean_dep("//third_party/grpc:gettid.patch"),
     )
 
     tf_http_archive(

  3. Build.
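Before starting a long build, it can help to check that the patch applies cleanly to an extracted copy of the pinned grpc tree. The sketch below is one way to do that under some assumptions: patch_applies is a made-up helper, GNU patch is installed, and the patch uses the a/ and b/ path prefixes that git diff emits (hence -p1). It does not reproduce how Bazel itself applies patch_file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the patch would apply cleanly to the tree.
patch_applies() {
    tree=$1
    patch_file=$2
    # -d: run inside the tree; -p1: strip the leading a/ or b/ component;
    # --dry-run: change nothing; --forward: do not offer to reverse;
    # --silent: print only errors.
    patch -d "$tree" -p1 --dry-run --forward --silent < "$patch_file"
}

# Usage (both paths are placeholders):
#   patch_applies /path/to/extracted-grpc /path/to/gettid.patch \
#       && echo "patch applies cleanly"
```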

(I do think updating grpc would be cleaner)

Can you send a PR with this approach please?

Or, do you consider it worthwhile to upgrade grpc?

I would certainly prefer the grpc upgrade: currently tensorflow depends on a specific commit of grpc. This would be a good opportunity to change that dependency to a released version tag. Changing it to a dependency on a specific commit of grpc + patch feels like going in the wrong direction, unless there's a really good justification I'm not aware of. If I ever had to file a bug against grpc, I'd much rather do that against a released version than against what is effectively a custom branch.

There's the following note above the grpc dependency in workspace.bzl:

# WARNING: make sure ncteisen@ and vpai@ are cc-ed on any CL to change the below rule

Perhaps they know more (I'm not sure if those are github user ids).
(Apologies if your comment wasn't aimed at me)

@m01 thanks for the alternate approach. I had never used bazel before trying to compile this project so I'm sure there are much better ways of patching than the way I did it. :-)

Recent versions of grpc aren't affected by this issue, are they? I figured I was having issues because I was trying to compile an old version of tensorflow that had a dependency on an old version of grpc. I thought the gettid patch was already merged into the repository. I was compiling a tarball for an old version from the releases page.

I'll attempt a grpc update this week. It might take a while though.

I'm also on Arch and had the same problem when trying to build jax (which uses xla bits of tensorflow), but following @m01's approach, I managed to build it, thanks a lot!

Just in case other people want a quick temporary measure before @mihaimaruseac completes the grpc update, I made a patch here: master...hi-ogawa:grpc-backport-pr-18950.

Using GitHub's ".patch" magic url, probably something like this would work:

curl -L https://github.com/tensorflow/tensorflow/compare/master...hi-ogawa:grpc-backport-pr-18950.patch | git apply

Hi @mihaimaruseac, just curious if there is any update on this? I think as of today's master branch, I still run into this issue. Thanks!

@abcdabcd987

I was able to build tf 1.14.0 with python 3.8 by modifying the PKGBUILD in the tensorflow114 AUR package and adding a couple patches:

tensorflow114/PKGBUILD
https://pastebin.com/kQue1cps

tensorflow114/tensorflow-1.14-python3.8.diff
https://pastebin.com/MLrrGXKD

tensorflow114/src/Add-grpc-fix-for-gettid.patch
https://pastebin.com/XDvwPzdL

There was an attempt in 8497ae4 but that got rolled back.

It doesn't look likely that we can pin to a release. Let me try it again today.

It seems we cannot upgrade this yet as we need work to support upb (micro protocol buffers)

@mihaimaruseac Thanks for the effort. If an upgrade is not an option now, could you add the gettid patch to the bazel workspace?

The one mentioned at #33758 (comment), right?

Right.

I will attempt another grpc update using a different direction for patching, but it will take a few days.

@hi-ogawa thanks for the one-liner. @mihaimaruseac maybe we can merge that while you update to the latest grpc?

Apologies for the long delay. This is fixed now, I think (not my work, but that of a concurrent contributor).

I ran into this as well, @hi-ogawa's workaround worked. Which commit fixes it in the mainline tree?

Ah, good to know, thanks!