ofiwg / libfabric

Open Fabric Interfaces

Home Page:http://libfabric.org/

How to clean up the EFA installation

coderodyhpc opened this issue · comments

Hello,
The installation of the EFA drivers/kernel module was interrupted (OS is Ubuntu 22.04, aarch64). Afterwards, I tried to uninstall so that I could do a fresh installation, but the uninstaller does nothing:

$ sudo ./efa_installer.sh --uninstall
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.20.0 =

Please confirm that you would like to uninstall EFA [y/n]: y
Error: EFA is not installed, exiting.

When I try to use the installer v1.20, the output reads:

$ sudo ./efa_installer.sh
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.20.0 =

This script will install the EFA kernel driver and required user space packages.
Do you wish to continue? [y/n]: y
Error: unknown package found: DEBS/UBUNTU2204/aarch64/libfabric-aws-bin_1.16.0_arm64.deb
Please remove unknown packages from the RPMS/DEBS directory trees

If I switch to v1.19, the output reads:

$ sudo ./efa_installer.sh -y
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.19.0 =

== Installing EFA dependencies ==
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
linux-headers-5.15.0-1019-aws is already the newest version (5.15.0-1019.23).
0 upgraded, 0 newly installed, 0 to remove and 58 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Setting up efa (1.16.0-1.amzn1) ...
Error! DKMS tree already contains: efa-1.16.0
You cannot add the same module/version combo more than once.
dpkg: error processing package efa (--configure):
 installed efa package post-installation script subprocess returned error exit status 3
Errors were encountered while processing:
 efa
E: Sub-process /usr/bin/dpkg returned an error code (1)
Error: Failed to install packages.

==============================================================================
The kernel header of the current kernel version cannot be installed and is required
to build the EFA kernel module. Please install the kernel header package for your
distribution manually or build the EFA kernel driver manually and re-run the installer
with --skip-kmod.
==============================================================================

I have also tried installing the deb files individually, but I get the same kind of errors. It definitely looks like I need to clean up the interrupted installation before I can install anew, but I'm stuck on how to do that. Thanks.

The 1st error:

Error: unknown package found: DEBS/UBUNTU2204/aarch64/libfabric-aws-bin_1.16.0_arm64.deb
Please remove unknown packages from the RPMS/DEBS directory trees

is caused by extracting EFA installer 1.20 into the same aws-efa-installer directory that already held EFA installer 1.19.0, so the 1.20 installer finds the leftover 1.19 packages and treats them as unknown.
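
One way to avoid mixing two installer versions is to extract each tarball into its own directory. The following is only a sketch: the function name `extract_installer` and the `efa-installer-<version>` directory naming are made up for illustration, and the tarball path is whatever you downloaded.

```shell
# Hypothetical helper: extract an installer tarball into a directory named
# after its version so packages from different versions never get mixed.
extract_installer() {
    tarball="$1"   # path to the downloaded installer tarball
    version="$2"   # e.g. 1.20.0
    dest="efa-installer-$version"

    # Plain mkdir (no -p) fails if the directory already exists, which
    # stops us from unpacking on top of an older extraction.
    mkdir "$dest" || return 1
    tar -xf "$tarball" -C "$dest"
}
```

Called as `extract_installer <tarball> 1.20.0`, this leaves the installer under `efa-installer-1.20.0/`, and a second call with the same version refuses to overwrite it.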

The 2nd error is new to me, but the key part is:

Error! DKMS tree already contains: efa-1.16.0
You cannot add the same module/version combo more than once.

You mentioned that the EFA installation was interrupted, so it seems the EFA kernel module was added to DKMS but never fully installed. Maybe manually removing it from DKMS would help?

commented

I think we should detect the DKMS duplication during installation and either skip the add step or remove and re-add the module.
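
As a rough illustration of what such a check could look like (this is not the installer's actual logic), the sketch below reads `dkms status` output on stdin and reports whether a module is registered in the "added" state only. The function name `dkms_is_stale` is hypothetical, and the `module/version: state` line format is assumed from the `dkms status` output quoted later in this thread; other DKMS versions may print a different format.

```shell
# Hypothetical helper: succeed (exit 0) if the given module/version appears
# in `dkms status` output (read from stdin) with the state "added", meaning
# it was registered with DKMS but never built or installed.
dkms_is_stale() {
    # $1 = module/version, e.g. efa/1.16.0
    awk -v mod="$1" -F': ' '
        $1 == mod && $2 == "added" { found = 1 }
        END { exit !found }
    '
}
```

It could be used as `dkms status | dkms_is_stale efa/1.16.0 && echo "stale DKMS entry"`.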

@wzamazon You're correct: I tried installing v1.19 before v1.20 and used the same target directory.
P.S. I tried v1.20 in a different directory but it results in the same error as for v1.19.
P.S. (1): I tried manually removing the EFA component for DKMS but got an error as my reply to @jtamzn indicates.

Hi @jtamzn,
How would you do that?
I'm getting:

~$ dkms status
efa/1.16.0: added
eveusb/1.0.0, 5.15.0-1019-aws, aarch64: installed

Trying to remove it also results in an error:

~$ sudo dkms uninstall efa/1.16.0
Error! The module efa 1.16.0 is not currently installed.
This module is not currently ACTIVE for kernel 5.15.0-1019-aws (aarch64).

P.S. I've also tried:

$ sudo dkms remove efa/1.16.0
Error! There is no instance of efa 1.16.0 for kernel 5.15.0-1019-aws (aarch64) located in the DKMS tree.

@jtamzn and @wzamazon,
After much tinkering, the sequence that works seems to be a DKMS autoinstall, followed by an uninstall, followed by a remove.
P.S. After cleaning up DKMS with those 3 steps, the EFA components finally built.
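
The exact commands were not posted, so the following is only a sketch of that three-step sequence. It assumes "autoinstallation" means `dkms autoinstall`, and it reuses the `dkms uninstall` and `dkms remove` forms shown earlier in the thread. The function name `cleanup_dkms_module` and the `DKMS_CMD` override are hypothetical; the override exists so the sequence can be previewed without touching the system.

```shell
# Sketch of the reported cleanup: autoinstall, then uninstall, then remove.
# DKMS_CMD is a hypothetical override; set DKMS_CMD=echo for a dry run.
cleanup_dkms_module() {
    mod="$1"                        # module/version, e.g. efa/1.16.0
    dkms="${DKMS_CMD:-sudo dkms}"   # left unquoted below so "sudo dkms" splits into two words

    # Step 1 (assumed command): build and install the modules registered
    # with DKMS for the running kernel, so the module is no longer stuck
    # in the "added" state.
    $dkms autoinstall || return 1
    # Step 2: uninstall the module, which should now succeed.
    $dkms uninstall "$mod" || return 1
    # Step 3: remove the module from the DKMS tree.
    $dkms remove "$mod"
}
```

Running `DKMS_CMD=echo cleanup_dkms_module efa/1.16.0` prints the three dkms subcommands in order without executing them; dropping the override runs them through `sudo dkms`.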