edouardlp / Mask-RCNN-CoreML

Mask-RCNN for Core ML

macOS compatibility?

vade opened this issue · comments

Hello

Firstly, thank you for this repo and your work. I'm able to run your examples on iOS without issue.

I am attempting to run your example code in a simple macOS test harness, but I am not getting the expected results. Prediction runs: I can load the model, the config, and the anchors, configure Vision and the request, and provide an image. However, the VNCoreMLRequest results always have a score of 0 inside Detection.detectionsFromFeatureValue, and so far I have been unable to debug why.

Verified:

  • properly configure MaskRCNNConfig.defaultConfig
  • load model
  • set up vision model
  • fetch image, make CIImage
  • set up request
  • set up handler
  • run predict
  • get results with 2 VNCoreMLFeatureValueObservation
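For reference, the steps above follow roughly this shape. This is a sketch, not the repo's exact harness: `MaskRCNN` is the repo's generated model class, and the handler/request wiring is standard Vision API usage.

```swift
import Vision
import CoreML
import CoreImage

// Hypothetical macOS harness mirroring the checklist above.
// `MaskRCNN` is the Core ML model class generated from the repo's .mlmodel.
func runPrediction(on ciImage: CIImage) throws {
    let model = try VNCoreMLModel(for: MaskRCNN().model)

    let request = VNCoreMLRequest(model: model) { request, _ in
        // Both model outputs arrive as VNCoreMLFeatureValueObservation.
        guard let observations = request.results as? [VNCoreMLFeatureValueObservation] else { return }
        for observation in observations {
            print(observation.featureValue)
        }
    }
    request.imageCropAndScaleOption = .scaleFill

    let handler = VNImageRequestHandler(ciImage: ciImage, options: [:])
    try handler.perform([request])
}
```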

However, I don't seem to get valid scores for what appear to be valid input images. I can verify that iOS, with the same image, works and makes a great prediction with bounding box and mask.

Have you been able to run this code on macOS?

macOS computed feature for image 'a'

MultiArray : Double 100 x 6 matrix
[0,0,0,0,0,0;
 0,0,0,0,0,0;
 ...
 0,0,0,0,0,0]   (all 100 rows zero)

iOS computed feature for the same image 'a'

Double 100 x 6 matrix
[0.2269468903541565,0.246346652507782,0.7716894745826721,0.9700236916542053,1,0.9998897314071655;
 0,0,0,0,0,0;
 ...
 0,0,0,0,0,0]   (remaining 99 rows zero)

Dropping directly into Core ML rather than Vision produces the same result. I manually resize a CGImageRef to 1024x1024, convert it to a CVPixelBufferRef, and pass it to MaskRCNN.prediction(image:), and I get the same empty MLFeatureValue as with Vision, shown above.
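The manual resize path described here can be sketched as follows: draw the CGImage into a 1024x1024 BGRA CVPixelBuffer via a CGContext backed by the buffer's memory. The function name and defaults are illustrative, not from the repo.

```swift
import CoreGraphics
import CoreVideo

// Sketch: render a CGImage into a 1024x1024 BGRA pixel buffer
// suitable for passing to the model's prediction(image:) input.
func makePixelBuffer(from image: CGImage, width: Int = 1024, height: Int = 1024) -> CVPixelBuffer? {
    var buffer: CVPixelBuffer?
    let attrs: [CFString: Any] = [
        kCVPixelBufferCGImageCompatibilityKey: true,
        kCVPixelBufferCGBitmapContextCompatibilityKey: true
    ]
    guard CVPixelBufferCreate(kCFAllocatorDefault, width, height,
                              kCVPixelFormatType_32BGRA,
                              attrs as CFDictionary, &buffer) == kCVReturnSuccess,
          let pixelBuffer = buffer else { return nil }

    CVPixelBufferLockBaseAddress(pixelBuffer, [])
    defer { CVPixelBufferUnlockBaseAddress(pixelBuffer, []) }

    // A CGContext writing directly into the pixel buffer's base address.
    guard let context = CGContext(data: CVPixelBufferGetBaseAddress(pixelBuffer),
                                  width: width, height: height,
                                  bitsPerComponent: 8,
                                  bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer),
                                  space: CGColorSpaceCreateDeviceRGB(),
                                  bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue
                                            | CGBitmapInfo.byteOrder32Little.rawValue)
    else { return nil }

    // Scales the image to fill the 1024x1024 target.
    context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
    return pixelBuffer
}
```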

Apologies for the monologue. Here is an interesting observation on the issue:

It appears that custom layers on Core ML models are loaded slightly differently on iOS than on macOS, at least with a sample size of 1 for this Mask-RCNN.

To see this, I added some debug logging to the custom layer initializers and functions.
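The logging amounts to print statements in each MLCustomLayer conformance. A minimal skeleton of the kind of layer that produced the traces below (the repo's real layers do actual work in evaluate; this class name is hypothetical and purely illustrative):

```swift
import CoreML

// Minimal MLCustomLayer skeleton with the debug logging that produced
// the "init(parameters:)" / "outputShapes(forInputShapes:)" /
// "evaluate(inputs:outputs:)" traces shown in this issue.
@objc(DebugLoggingLayer)
class DebugLoggingLayer: NSObject, MLCustomLayer {

    required init(parameters: [String: Any]) throws {
        print("init(parameters:)", parameters)
        super.init()
    }

    func setWeightData(_ weights: [Data]) throws {
        // This layer carries no weights.
    }

    func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
        // Real layers compute their output shapes here; logging the inputs
        // is what reveals the all-zero shapes seen on macOS below.
        print("outputShapes(forInputShapes:)", inputShapes)
        return inputShapes
    }

    func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
        print("evaluate(inputs:outputs:)", inputs.count, outputs.count)
    }
}
```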

When iOS loads the Core ML model and runs inference (i.e. prediction is called), we see:

2020-07-18 18:00:30.364779-0400 Example[5118:1071328] Metal GPU Frame Capture Enabled
2020-07-18 18:00:30.366195-0400 Example[5118:1071328] Metal API Validation Enabled
init(parameters:) ["nmsIOUThreshold": 0.7, "bboxStdDev_3": 0.2, "bboxStdDev_1": 0.1, "engineName": ProposalLayer, "preNMSMaxProposals": 6000, "maxProposals": 1000, "bboxStdDev_2": 0.2, "bboxStdDev_0": 0.1, "bboxStdDev_count": 4]
init(parameters:) ["imageHeight": 1024, "imageWidth": 1024, "engineName": PyramidROIAlignLayer, "poolSize": 7]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["maxDetections": 100, "bboxStdDev_1": 0.1, "engineName": DetectionLayer, "bboxStdDev_3": 0.2, "bboxStdDev_0": 0.1, "scoreThreshold": 0.7, "nmsIOUThreshold": 0.3, "bboxStdDev_2": 0.2, "bboxStdDev_count": 4]
init(parameters:) ["imageWidth": 1024, "engineName": PyramidROIAlignLayer, "poolSize": 14, "imageHeight": 1024]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
2020-07-18 18:00:30.935153-0400 Example[5118:1071289] [discovery] errors encountered while discovering extensions: Error Domain=PlugInKit Code=13 "query cancelled" UserInfo={NSLocalizedDescription=query cancelled}
init(parameters:) ["bboxStdDev_2": 0.2, "bboxStdDev_3": 0.2, "bboxStdDev_count": 4, "maxProposals": 1000, "nmsIOUThreshold": 0.7, "preNMSMaxProposals": 6000, "engineName": ProposalLayer, "bboxStdDev_0": 0.1, "bboxStdDev_1": 0.1]
init(parameters:) ["engineName": PyramidROIAlignLayer, "imageHeight": 1024, "imageWidth": 1024, "poolSize": 7]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["scoreThreshold": 0.7, "bboxStdDev_count": 4, "bboxStdDev_1": 0.1, "engineName": DetectionLayer, "bboxStdDev_2": 0.2, "maxDetections": 100, "bboxStdDev_0": 0.1, "nmsIOUThreshold": 0.3, "bboxStdDev_3": 0.2]
init(parameters:) ["imageHeight": 1024, "engineName": PyramidROIAlignLayer, "imageWidth": 1024, "poolSize": 14]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[1, 0, 0, 0, 0]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
evaluate(inputs:outputs:) 2 1
evaluate(inputs:outputs:) 5 1
evaluate(inputs:outputs:) 1 1
evaluate(inputs:outputs:) 2 1
evaluate(inputs:outputs:) 5 1
evaluate(inputs:outputs:) 2 1
[Example.Detection(index: 0, boundingBox: (0.24634665250778198, 0.2269468903541565, 0.7236770391464233, 0.5447425842285156), classId: 1, score: 0.9998897314071655, mask: Optional(<CGImage 0x168800680> (DP)
	<(null)>
		width = 28, height = 28, bpc = 8, bpp = 8, row bytes = 28 
		kCGImageAlphaNone | 0 (default byte order)  | kCGImagePixelFormatPacked 
		is mask? Yes, has masking color? No, has soft mask? No, has matte? No, should interpolate? No))]

This shows the initializers with the expected keys/values, along with the output sizes and shapes.

macOS, however, shows a two-pass initialization, with the first pass reporting very different (all-zero) input shapes:

init(parameters:) ["bboxStdDev_3": 0.2, "bboxStdDev_count": 4, "nmsIOUThreshold": 0.7, "bboxStdDev_0": 0.1, "maxProposals": 1000, "bboxStdDev_1": 0.1, "bboxStdDev_2": 0.2, "preNMSMaxProposals": 6000, "engineName": ProposalLayer]
init(parameters:) ["engineName": PyramidROIAlignLayer, "poolSize": 7, "imageHeight": 1024, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["bboxStdDev_0": 0.1, "bboxStdDev_3": 0.2, "bboxStdDev_2": 0.2, "engineName": DetectionLayer, "bboxStdDev_1": 0.1, "bboxStdDev_count": 4, "nmsIOUThreshold": 0.3, "scoreThreshold": 0.7, "maxDetections": 100]
init(parameters:) ["poolSize": 14, "imageHeight": 1024, "engineName": PyramidROIAlignLayer, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
init(parameters:) ["nmsIOUThreshold": 0.7, "preNMSMaxProposals": 6000, "maxProposals": 1000, "bboxStdDev_0": 0.1, "bboxStdDev_3": 0.2, "bboxStdDev_2": 0.2, "bboxStdDev_count": 4, "engineName": ProposalLayer, "bboxStdDev_1": 0.1]
init(parameters:) ["poolSize": 7, "imageHeight": 1024, "engineName": PyramidROIAlignLayer, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["bboxStdDev_3": 0.2, "bboxStdDev_count": 4, "maxDetections": 100, "engineName": DetectionLayer, "nmsIOUThreshold": 0.3, "scoreThreshold": 0.7, "bboxStdDev_2": 0.2, "bboxStdDev_1": 0.1, "bboxStdDev_0": 0.1]
init(parameters:) ["imageHeight": 1024, "imageWidth": 1024, "poolSize": 14, "engineName": PyramidROIAlignLayer]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
init(parameters:) ["bboxStdDev_count": 4, "preNMSMaxProposals": 6000, "maxProposals": 1000, "bboxStdDev_2": 0.2, "nmsIOUThreshold": 0.7, "bboxStdDev_3": 0.2, "bboxStdDev_0": 0.1, "engineName": ProposalLayer, "bboxStdDev_1": 0.1]
init(parameters:) ["engineName": PyramidROIAlignLayer, "poolSize": 7, "imageHeight": 1024, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["bboxStdDev_3": 0.2, "bboxStdDev_0": 0.1, "nmsIOUThreshold": 0.3, "bboxStdDev_2": 0.2, "engineName": DetectionLayer, "bboxStdDev_1": 0.1, "maxDetections": 100, "scoreThreshold": 0.7, "bboxStdDev_count": 4]
init(parameters:) ["imageHeight": 1024, "imageWidth": 1024, "poolSize": 14, "engineName": PyramidROIAlignLayer]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] (Function)
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] (Function)
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[0, 0, 0, 7, 7]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[0, 0, 0, 7, 7]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0]] [[0, 0, 1, 1, 6]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0]] [[0, 0, 1, 1, 6]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[100, 0, 6, 1, 1]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[100, 0, 6, 1, 1]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[0, 0, 0, 14, 14]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[0, 0, 0, 14, 14]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[1, 0, 0, 0, 0]]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[1, 0, 0, 0, 0]]
init(parameters:) ["preNMSMaxProposals": 6000, "bboxStdDev_3": 0.2, "bboxStdDev_0": 0.1, "bboxStdDev_1": 0.1, "bboxStdDev_count": 4, "nmsIOUThreshold": 0.7, "engineName": ProposalLayer, "bboxStdDev_2": 0.2, "maxProposals": 1000]
init(parameters:) ["poolSize": 7, "engineName": PyramidROIAlignLayer, "imageHeight": 1024, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["maxDetections": 100, "bboxStdDev_2": 0.2, "bboxStdDev_0": 0.1, "bboxStdDev_3": 0.2, "bboxStdDev_count": 4, "scoreThreshold": 0.7, "nmsIOUThreshold": 0.3, "engineName": DetectionLayer, "bboxStdDev_1": 0.1]
init(parameters:) ["poolSize": 14, "engineName": PyramidROIAlignLayer, "imageWidth": 1024, "imageHeight": 1024]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
init(parameters:) ["bboxStdDev_1": 0.1, "nmsIOUThreshold": 0.7, "engineName": ProposalLayer, "preNMSMaxProposals": 6000, "maxProposals": 1000, "bboxStdDev_3": 0.2, "bboxStdDev_0": 0.1, "bboxStdDev_count": 4, "bboxStdDev_2": 0.2]
init(parameters:) ["imageHeight": 1024, "imageWidth": 1024, "poolSize": 7, "engineName": PyramidROIAlignLayer]
init(parameters:) ["engineName": TimeDistributedClassifierLayer]
init(parameters:) ["engineName": DetectionLayer, "bboxStdDev_1": 0.1, "bboxStdDev_3": 0.2, "nmsIOUThreshold": 0.3, "bboxStdDev_count": 4, "bboxStdDev_0": 0.1, "bboxStdDev_2": 0.2, "scoreThreshold": 0.7, "maxDetections": 100]
init(parameters:) ["imageHeight": 1024, "engineName": PyramidROIAlignLayer, "poolSize": 14, "imageWidth": 1024]
init(parameters:) ["engineName": TimeDistributedMaskLayer]
outputShapes(forInputShapes:) [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]] [[1, 0, 0, 0, 0]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
outputShapes(forInputShapes:) [[261888, 1, 2, 1, 1], [261888, 1, 4, 1, 1]] (Function)
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[1000, 1, 256, 7, 7]]
outputShapes(forInputShapes:) [[1000, 1, 256, 7, 7]] [[1000, 1, 1, 1, 6]]
outputShapes(forInputShapes:) [[1000, 1, 4, 1, 1], [1000, 1, 1, 1, 6]] [[100, 1, 6, 1, 1]]
outputShapes(forInputShapes:) [[100, 1, 6, 1, 1], [1, 1, 256, 256, 256], [1, 1, 256, 128, 128], [1, 1, 256, 64, 64], [1, 1, 256, 32, 32]] [[100, 1, 256, 14, 14]]
outputShapes(forInputShapes:) [[100, 1, 256, 14, 14], [100, 1, 6, 1, 1]] [[1, 1, 100, 28, 28]]
evaluate(inputs:outputs:) 2 1
evaluate(inputs:outputs:) 5 1
evaluate(inputs:outputs:) 1 1
evaluate(inputs:outputs:) 2 1
evaluate(inputs:outputs:) 5 1
evaluate(inputs:outputs:) 2 1
[]

Closer:

The Pyramid ROI Align layer requires a retained command buffer as opposed to an unretained one, and on discrete GPUs, Metal resources require manual synchronization to read results back on the CPU.
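The two fixes above can be sketched as follows. This is an assumption-laden illustration, not the repo's actual code: the function name is hypothetical, and it shows the general pattern of using `makeCommandBuffer()` (retained references) rather than `makeCommandBufferWithUnretainedReferences()`, plus a blit-encoder synchronize for managed resources on discrete (non-unified-memory) GPUs.

```swift
import Metal

// Sketch: safely read GPU results back on the CPU on macOS.
// 1. A retained command buffer keeps its referenced resources alive
//    until the work completes (unlike the unretained variant).
// 2. On discrete GPUs, .managed resources must be explicitly
//    synchronized before their contents are valid on the CPU.
func readBack(_ buffer: MTLBuffer, device: MTLDevice, queue: MTLCommandQueue) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    if !device.hasUnifiedMemory, buffer.storageMode == .managed,
       let blit = commandBuffer.makeBlitCommandEncoder() {
        // Make the GPU's writes to this buffer visible to the CPU.
        blit.synchronize(resource: buffer)
        blit.endEncoding()
    }

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
    // buffer.contents() is now safe to read on the CPU.
}
```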

While I'm not matching iOS exactly, I am getting somewhat sensible results. A bit more to do.