neuronx-cc fails during fine-tuning attempt for pre-trained microsoft/layoutlm-base-uncased when using torchrun
vprecup opened this issue
First of all, let me mention that the compilation works successfully when the training script is run with python. When I run the same script in distributed mode with torchrun, though, the compilation fails after ~30 minutes.
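For reference, the script itself is identical in both cases; the only difference it sees at runtime is the distributed environment that torchrun injects. A minimal illustration (the script name and worker count are placeholders, not my exact command):

# Hypothetical launch commands -- the script name is a placeholder:
#   single process (compiles fine):   python train_layoutlm.py
#   distributed (compilation fails):  torchrun --nproc_per_node=2 train_layoutlm.py
import os

# torchrun sets these variables for each worker; in single-process mode they are absent.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
print(f"rank={rank} world_size={world_size}")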
The environment
An EC2 trn1.2xlarge instance running the latest Neuron Ubuntu AMI (amazon/Deep Learning AMI Neuron PyTorch 1.11.0 (Ubuntu 20.04) 20230215), with the following library versions installed:
- Neuron-related: latest (i.e. all from https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/setup/pytorch-update.html#pytorch-neuronx-update - PyTorch 1.13.0)
- transformers: 4.26.1
- numpy: 1.20.0
- protobuf: 3.20.3
The error
02/17/2023 04:18:15 PM WARNING 57168 [StaticProfiler]: matmul-based transposes inserted by penguin takes up 100.00 percent of all matmul computation
02/17/2023 04:18:15 PM INFO 57168 [StaticProfiler]: Finished (changed=False)
02/17/2023 04:18:15 PM INFO 57168 [sg0000/Tensorizer/StaticProfiler]: Exit time region: delta=0.057s
02/17/2023 04:18:15 PM INFO 57168 [sg0000/Tensorizer/SplitAPUnionSets]: Enter time region
02/17/2023 04:18:17 PM INFO 57168 [SplitAPUnionSets]: Finished (changed=True)
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SplitAPUnionSets]: Exit time region: delta=1.946s
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerGenericAccess]: Enter time region
02/17/2023 04:18:17 PM INFO 57168 [SundaLowerGenericAccess]: Finished (changed=False)
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerGenericAccess]: Exit time region: delta=0.007s
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerAPIndices]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [SundaLowerAPIndices]: Finished (changed=True)
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/SundaLowerAPIndices]: Exit time region: delta=0.264s
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/DumpGraphAndMetadata]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [DumpGraphAndMetadata]: Finished (changed=False)
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/DumpGraphAndMetadata]: Exit time region: delta=0.055s
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/BirCodeGenLoop]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/BirCodeGenLoop]: Exit time region: delta=0.098s
02/17/2023 04:18:18 PM ERROR 57168 [Tensorizer]: Transformation error on operator: _multiply.2
02/17/2023 04:18:18 PM INFO 57168 [root/Tensorizer/All]: Exit time region: delta=1454.982s
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: ***************************************************************
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: An Internal Compiler Error has occurred
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: ***************************************************************
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error message: 'TensorCopyOp' object has no attribute 'y_size'
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error class: AttributeError
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error location: Unknown
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Command line: /opt/aws_neuron_venv_pytorch/bin/neuronx-cc --target=trn1 compile --framework XLA /tmp/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.hlo.pb --output /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_14707670721571507784/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40/70460d4e-df05-4c13-b6d3-edc39296772e/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.neff --enable-experimental-O1 --verbose=INFO
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Internal details:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/CommandDriver.py", line 235, in neuronxcc.driver.CommandDriver.CommandDriver.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 1014, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 965, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 990, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/commands/CompileCommand.py", line 994, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/jobs/Frontend.py", line 591, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/driver/jobs/Frontend.py", line 387, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Frontend.py", line 168, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Frontend.py", line 243, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Frontend.py", line 244, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Frontend.py", line 266, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 183, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 185, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 225, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 106, in neuronxcc.starfish.penguin.Compile.generate_code_and_meta_data
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 301, in neuronxcc.starfish.penguin.Compile.codegen
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/Compile.py", line 307, in neuronxcc.starfish.penguin.Compile.codegenBIR
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1550, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.runOnFunction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 196, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 178, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 208, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 210, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 211, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 241, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 334, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 329, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 318, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 366, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 369, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1351, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformInstruction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1172, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.addInstToBir
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1169, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.dispatch_codegen
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 800, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.codegenTensorCopyOp
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 333, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.codegenDMATranspose
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Version information:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: NeuronX Compiler version 2.4.0.21+b7621be18
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: HWM version 2.4.0.1-90172456c
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: NEFF version Dynamic
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: TVM not available
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: NumPy version 1.20.0
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: MXNet not available
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Artifacts stored in: /home/ubuntu/eco-ml-sds-training/remote-training-scripts/neuronxcc-fwkota21
2023-02-17 16:18:40.000461: ERROR ||NCC_WRAPPER||: There was a compilation error for /tmp/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.hlo.pb graph. Returning with an errored graph
2023-02-17 16:18:40.491873: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504644: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-02-17 16:18:40.504671: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-02-17 16:18:40.504680: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] tsl::CurrentStackTrace()
2023-02-17 16:18:40.504689: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-02-17 16:18:40.504708: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-02-17 16:18:40.504716: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-02-17 16:18:40.504727: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-02-17 16:18:40.504734: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-02-17 16:18:40.504741: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-02-17 16:18:40.504749: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-02-17 16:18:40.504760: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] clone
2023-02-17 16:18:40.504770: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-02-17 16:18:40.504778: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]
2023-02-17 16:18:40.504786: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-02-17 16:18:40.504793: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-02-17 16:18:40.504796: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (0) INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504804: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-02-17 16:18:40.504813: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[XRTExecute_G15]]
2023-02-17 16:18:40.504822: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] (1) INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504829: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] [[{{node XRTExecute}}]]
2023-02-17 16:18:40.504836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-02-17 16:18:40.504846: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-02-17 16:18:40.504853: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-02-17 16:18:40.504861: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
Is there any configuration that I am missing when dealing with such a transformer model?
Thank you, we believe we have identified the issue and are working on a fix. We'll continue to update this ticket as the fix becomes available.
@aws-mvaria thank you. Much appreciated!
Hi @vprecup , Are you able to provide your training script so we can confirm we are able to resolve the issue you are seeing? Thanks.
Hi @aws-mvaria. I am simply using the Trainer script, thus deferring the XLA-related code to the Transformers library (note that I am using version 4.26.1 of Transformers, which I saw includes DistributedSampler support, the xm.optimizer_step(optimizer) and xm.mark_step() calls, etc.).
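Under the hood, when torch_xla is available, the Trainer's inner loop reduces to roughly the following. This is a minimal sketch, not the exact Transformers code; train_loader and the batch contents are placeholders:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()          # the NeuronCore exposed through XLA
model = model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

for batch in train_loader:        # placeholder DataLoader yielding dict batches
    batch = {k: v.to(device) for k, v in batch.items()}
    optimizer.zero_grad()
    loss = model(**batch).loss
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduces gradients across workers, then steps
    xm.mark_step()                # cuts the lazy graph; this is what triggers neuronx-cc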
So the script looks something like
from transformers import (
    AutoConfig,
    AutoTokenizer,
    AutoModelForTokenClassification,
    Trainer,
)

config = AutoConfig.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=...,
    label2id=...,
    id2label=...,
    finetuning_task="ner",
    cache_dir=...,
    revision="main",
    max_position_embeddings=512,
    max_2d_position_embeddings=2 * 512,
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    cache_dir=...,
    use_fast=True,
    revision="main",
)
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    config=config,
    cache_dir=...,
    revision="main",
)
trainer = Trainer(
    model=model,
    args=...,
    train_dataset=...,
    eval_dataset=...,
    tokenizer=tokenizer,
    data_collator=...,
    callbacks=...,
    compute_metrics=...,
)
It's quite similar to this one: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py.
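For completeness, the driver portion of my script also mirrors run_ner.py; roughly the following, with nothing Neuron-specific added on my side:

train_result = trainer.train()    # XLA compilation is triggered lazily during the first steps
trainer.save_model()
metrics = trainer.evaluate()
print(metrics)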
Thanks for your patience, @vprecup. The fix is not ready yet and we are still working on it. We will update once the fix becomes available.
Hi @vprecup , we have resolved the issue and you can expect the fix in an upcoming release. Will update you once the release is available.
Closing this issue. Please re-open if you encounter any further issues. Thanks.