aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost-effective, natively integrated into PyTorch and TensorFlow, and integrated with your favorite AWS services.

Home Page: https://aws.amazon.com/machine-learning/neuron/

neuronx-cc fails during fine-tuning attempt for pre-trained microsoft/layoutlm-base-uncased when using torchrun

vprecup opened this issue · comments

First of all, let me mention that the compilation works successfully when the training script is run with python.

When I run the training script in distributed mode with torchrun, however, the compilation fails after ~30 minutes.
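
For context, the only difference versus the single-process run is the distributed setup that torchrun triggers. A minimal sketch of that setup (the standard torch_xla pattern; the script name train.py and the world size of 2, i.e. the two NeuronCores of a trn1.2xlarge, are my assumptions), launched as torchrun --nproc_per_node=2 train.py:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' process-group backend

# Under torchrun, each worker reads RANK/WORLD_SIZE from the environment and
# joins an XLA process group. Gradient all-reduce ops are then traced into the
# graph, so neuronx-cc compiles a different HLO graph than in the plain
# python run, which is where the failure below shows up.
torch.distributed.init_process_group("xla")
device = xm.xla_device()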

The environment

An EC2 trn1.2xlarge instance running the latest Neuron Ubuntu AMI (amazon/Deep Learning AMI Neuron PyTorch 1.11.0 (Ubuntu 20.04) 20230215), with the following library versions installed:
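
The installed versions can be listed programmatically, e.g. with this snippet (the pip distribution names are my assumption of which packages matter here):

from importlib.metadata import PackageNotFoundError, version

# Assumed pip distribution names; adjust for your environment.
for pkg in ("neuronx-cc", "torch-neuronx", "torch-xla", "torch", "transformers"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")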

The error

02/17/2023 04:18:15 PM WARNING 57168 [StaticProfiler]: matmul-based transposes inserted by penguin takes up 100.00 percent of all matmul computation
02/17/2023 04:18:15 PM INFO 57168 [StaticProfiler]: Finished (changed=False)
02/17/2023 04:18:15 PM INFO 57168 [sg0000/Tensorizer/StaticProfiler]: Exit time region: delta=0.057s
02/17/2023 04:18:15 PM INFO 57168 [sg0000/Tensorizer/SplitAPUnionSets]: Enter time region
02/17/2023 04:18:17 PM INFO 57168 [SplitAPUnionSets]: Finished (changed=True)
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SplitAPUnionSets]: Exit time region: delta=1.946s
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerGenericAccess]: Enter time region
02/17/2023 04:18:17 PM INFO 57168 [SundaLowerGenericAccess]: Finished (changed=False)
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerGenericAccess]: Exit time region: delta=0.007s
02/17/2023 04:18:17 PM INFO 57168 [sg0000/Tensorizer/SundaLowerAPIndices]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [SundaLowerAPIndices]: Finished (changed=True)
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/SundaLowerAPIndices]: Exit time region: delta=0.264s
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/DumpGraphAndMetadata]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [DumpGraphAndMetadata]: Finished (changed=False)
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/DumpGraphAndMetadata]: Exit time region: delta=0.055s
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/BirCodeGenLoop]: Enter time region
02/17/2023 04:18:18 PM INFO 57168 [sg0000/Tensorizer/BirCodeGenLoop]: Exit time region: delta=0.098s
02/17/2023 04:18:18 PM ERROR 57168 [Tensorizer]: Transformation error on operator: _multiply.2
02/17/2023 04:18:18 PM INFO 57168 [root/Tensorizer/All]: Exit time region: delta=1454.982s
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: ***************************************************************
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:  An Internal Compiler Error has occurred
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: ***************************************************************
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: 
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error message:  'TensorCopyOp' object has no attribute 'y_size'
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: 
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error class:    AttributeError
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Error location: Unknown
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Command line:   /opt/aws_neuron_venv_pytorch/bin/neuronx-cc --target=trn1 compile --framework XLA /tmp/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.hlo.pb --output /var/tmp/neuron-compile-cache/USER_neuroncc-2.4.0.21+b7621be18/MODULE_14707670721571507784/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40/70460d4e-df05-4c13-b6d3-edc39296772e/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.neff --enable-experimental-O1 --verbose=INFO
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: 
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Internal details:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/CommandDriver.py", line 235, in neuronxcc.driver.CommandDriver.CommandDriver.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 1014, in neuronxcc.driver.commands.CompileCommand.CompileCommand.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 965, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 990, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/commands/CompileCommand.py", line 994, in neuronxcc.driver.commands.CompileCommand.CompileCommand.runPipeline
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/Pipeline.py", line 30, in neuronxcc.driver.Pipeline.Pipeline.runSingleInput
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 300, in neuronxcc.driver.Job.SingleInputJob.run
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/Job.py", line 326, in neuronxcc.driver.Job.SingleInputJob.runOnState
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 591, in neuronxcc.driver.jobs.Frontend.Frontend.runSingleInput
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/driver/jobs/Frontend.py", line 387, in neuronxcc.driver.jobs.Frontend.Frontend.runXLAFrontend
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 168, in neuronxcc.starfish.penguin.Frontend.tensorizeXla
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 243, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 244, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Frontend.py", line 266, in neuronxcc.starfish.penguin.Frontend.tensorizeXlaImpl
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 183, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 185, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 225, in neuronxcc.starfish.penguin.Compile.compile_cu
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 106, in neuronxcc.starfish.penguin.Compile.generate_code_and_meta_data
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 301, in neuronxcc.starfish.penguin.Compile.codegen
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/Compile.py", line 307, in neuronxcc.starfish.penguin.Compile.codegenBIR
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1550, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.runOnFunction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 196, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 178, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_with_exception_handling
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 208, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 210, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 211, in neuronxcc.starfish.penguin.DotTransform.DotTransform.timed_run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 240, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 241, in neuronxcc.starfish.penguin.DotTransform.DotTransform.run_
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 334, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformFunction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 329, in neuronxcc.starfish.penguin.DotTransform.DotTransform.runTransforms
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 318, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformStmts
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 366, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 369, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transformBasicBlock
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 106, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformAxis
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/DotTransform.py", line 133, in neuronxcc.starfish.penguin.DotTransform.DotTransform.transform
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1351, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.transformInstruction
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1172, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.addInstToBir
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 1169, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.dispatch_codegen
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 800, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.codegenTensorCopyOp
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   File "neuronxcc/starfish/penguin/targets/tonga/codegen/BirCodeGenLoop.py", line 333, in neuronxcc.starfish.penguin.targets.tonga.codegen.BirCodeGenLoop.BirCodeGenLoop.codegenDMATranspose
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: 
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Version information:
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   NeuronX Compiler version 2.4.0.21+b7621be18
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   HWM version 2.4.0.1-90172456c
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   NEFF version Dynamic
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   TVM not available
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   NumPy version 1.20.0
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]:   MXNet not available
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: 
02/17/2023 04:18:18 PM ERROR 57168 [neuronx-cc]: Artifacts stored in: /home/ubuntu/eco-ml-sds-training/remote-training-scripts/neuronxcc-fwkota21
2023-02-17 16:18:40.000461: ERROR ||NCC_WRAPPER||: There was a compilation error for /tmp/MODULE_2_SyncTensorsGraph.953_14707670721571507784_ip-172-31-96-209-4b79574d-56559-5f4e75285fb40.hlo.pb graph. Returning with an errored graph
2023-02-17 16:18:40.491873: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504644: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] StackTrace:
2023-02-17 16:18:40.504671: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** Begin stack trace ***
2023-02-17 16:18:40.504680: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	tsl::CurrentStackTrace()
2023-02-17 16:18:40.504689: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::ReportComputationError(tsl::Status const&, absl::lts_20220623::Span<xla::XlaComputation const* const>, absl::lts_20220623::Span<xla::Shape const* const>)
2023-02-17 16:18:40.504708: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20220623::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2023-02-17 16:18:40.504716: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-02-17 16:18:40.504727: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	xla::util::MultiWait::Complete(std::function<void ()> const&)
2023-02-17 16:18:40.504734: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-02-17 16:18:40.504741: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-02-17 16:18:40.504749: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	
2023-02-17 16:18:40.504760: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	clone
2023-02-17 16:18:40.504770: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] *** End stack trace ***
2023-02-17 16:18:40.504778: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 
2023-02-17 16:18:40.504786: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Status: INTERNAL: From /job:localservice/replica:0/task:0:
2023-02-17 16:18:40.504793: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 2 root error(s) found.
2023-02-17 16:18:40.504796: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (0) INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504804: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-02-17 16:18:40.504813: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[XRTExecute_G15]]
2023-02-17 16:18:40.504822: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   (1) INTERNAL: neuronx-cc compilation failed.
2023-02-17 16:18:40.504829: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 	 [[{{node XRTExecute}}]]
2023-02-17 16:18:40.504836: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 successful operations.
2023-02-17 16:18:40.504846: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] 0 derived errors ignored.
2023-02-17 16:18:40.504853: E tensorflow/compiler/xla/xla_client/xla_util.cc:90] Recent warning and error logs:
2023-02-17 16:18:40.504861: E tensorflow/compiler/xla/xla_client/xla_util.cc:90]   OP_REQUIRES failed at tpu_execute_op.cc:266 : INTERNAL: neuronx-cc compilation failed.

Is there any configuration that I am missing when dealing with such a transformer model?

Thank you, we believe we have identified the issue and are working on a fix. We'll continue to update this ticket as the fix becomes available.

@aws-mvaria thank you, much appreciated!

Hi @vprecup, are you able to provide your training script so we can confirm that we are able to resolve the issue you are seeing? Thanks.

Hi @aws-mvaria. I am simply using the Trainer class, thus deferring the XLA-related code to the Transformers library (note that I am using version 4.26.1 of Transformers, which I saw includes DistributedSampler support, the xm.optimizer_step(optimizer) and xm.mark_step() calls, etc.).

So the script looks something like this:

from transformers import (
    AutoConfig,
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
)

config = AutoConfig.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=...,
    label2id=...,
    id2label=...,
    finetuning_task="ner",
    cache_dir=...,
    revision="main",
    max_position_embeddings=512,
    max_2d_position_embeddings=2 * 512,
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    cache_dir=...,
    use_fast=True,
    revision="main",
)

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    config=config,
    cache_dir=...,
    revision="main",
)

trainer = Trainer(
    model=model,
    args=...,
    train_dataset=...,
    eval_dataset=...,
    tokenizer=tokenizer,
    data_collator=...,
    callbacks=...,
    compute_metrics=...,
)

It's quite similar to this one: https://github.com/huggingface/transformers/blob/main/examples/pytorch/token-classification/run_ner.py.
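
For context on the xm.* calls I mentioned above, the XLA-aware inner loop that Trainer defers to boils down to roughly this pattern (a simplified sketch, not the actual Transformers source; model, optimizer and train_dataloader are assumed to be set up as above):

import torch_xla.core.xla_model as xm

device = xm.xla_device()
model.to(device)
model.train()
for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    xm.optimizer_step(optimizer)  # all-reduce gradients across workers, then step
    optimizer.zero_grad()
    xm.mark_step()  # cut the lazy tensor graph; compilation happens at the cut

It is at that graph cut that the HLO graph is handed to neuronx-cc, which is where the internal compiler error above is raised.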

Thanks for your patience, @vprecup. The fix is not ready yet and we are still working on it. We will update once it becomes available.

Hi @vprecup, we have resolved the issue and you can expect the fix in an upcoming release. We will update you once the release is available.

@vprecup, the fix is available in the just-released Neuron 2.10. Please give it a try.

Closing this issue. Please re-open if you encounter any further issues. Thanks.