LeanerCloud / AutoSpotting

Saves up to 90% of AWS EC2 costs by automating the use of spot instances on existing AutoScaling groups. Installs in minutes using CloudFormation or Terraform. Convenient to deploy at scale using StackSets. Uses tagging to avoid launch configuration changes. Automated spot termination handling. Reliable fallback to on-demand instances.

Home Page: https://autospotting.io


Instance types "Unavailable in this Availability Zone" with erroneous price of 0

gabegorelick opened this issue

Issue type

Bug Report

Build number

9b438dc

Configuration

allowed_instance_types: current, but not sure that matters.

Environment

  • AWS region: us-east-1
  • Type of environment: VPC

Summary

AutoSpotting is reporting that certain spot instance types have a price of 0.

Here's what I see in the logs:

instance.go:376: Comparing current type t2.small with price 0.023 with candidate t2.small with price 0

Since AutoSpotting interprets a spot price of $0 as meaning the instance type is unavailable, you get a message saying "Unavailable in this Availability Zone" and AutoSpotting refuses to use that instance type.
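For illustration, the zero-price interpretation described above can be sketched as follows. This is hypothetical code, not AutoSpotting's actual implementation; the type and function names are invented:

```go
package main

import "fmt"

// spotPrice pairs an instance type with its hourly spot price.
// A price of exactly 0 means the pricing data had no entry for this
// type/AZ combination, so the candidate is treated as unavailable.
type spotPrice struct {
	instanceType string
	price        float64
}

// isAvailable reports whether a candidate has usable price data.
func isAvailable(p spotPrice) bool {
	return p.price > 0
}

func main() {
	current := spotPrice{"t2.small", 0.023}
	candidate := spotPrice{"t2.small", 0}
	fmt.Printf("Comparing current type %s with price %g with candidate %s with price %g\n",
		current.instanceType, current.price, candidate.instanceType, candidate.price)
	if !isAvailable(candidate) {
		// This is the situation reported in the logs above.
		fmt.Println("Unavailable in this Availability Zone")
	}
}
```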

The weird thing is that calling describe-spot-price-history indicates that the instance type does exist and has a non-zero spot price:

aws ec2 describe-spot-price-history --instance-types t2.small --availability-zone us-east-1b --product-description Linux/UNIX
{
    "SpotPriceHistory": [
        {
            "AvailabilityZone": "us-east-1b",
            "InstanceType": "t2.small",
            "ProductDescription": "Linux/UNIX",
            "SpotPrice": "0.006900",
            "Timestamp": "2020-01-24T09:30:58.000Z"
        },

This is happening intermittently for a number of instance types. I have no issues manually creating spot requests with the same instance type and AZ (they are fulfilled without issue).

Steps to reproduce

Setting allowed_instance_types: current makes the issue more visible, since when an instance type is reported as unavailable, AutoSpotting won't substitute a replacement instance type.

Thanks for reporting this.

This price data is coming from the spot instance price history, and intermittent issues might indicate API throttling, especially considering that you seem to be trying to run multiple instances of AutoSpotting in parallel from the same account.

> This price data is coming from the spot instance price history, and intermittent issues might indicate API throttling

It seems to be happening more consistently for some instance types than others. E.g. I saw it for a few hours straight with t2.medium while other instance types returned data fine. So I'm not sure API throttling is the issue (or if it is, AWS is doing very strange throttling).

> you seem to be trying to run multiple instances of AutoSpotting in parallel from the same account.

I actually still only have one instance of AutoSpotting. Gotta figure out these issues before rolling it out more 😄 But that lends more evidence that something beyond throttling is going on.

I might be wrong, but I think there is a possibility that we cache the pricing information within the Lambda function, so you might see the same corrupt data for subsequent executions until the Lambda function is scheduled onto another host. Can you see any throttling errors in the CloudTrail logs?
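The caching concern raised here can be sketched with a toy example (hypothetical code, not AutoSpotting's actual implementation): a package-level variable survives across Lambda invocations when the execution environment is reused, so a bad value fetched once can keep being served.

```go
package main

import "fmt"

// priceCache is package-level state. In a Lambda, such state persists
// across invocations that land on the same warm container.
var priceCache = map[string]float64{}

// getPrice returns the cached price if present; otherwise it calls
// fetch, caches the result (even a bogus 0), and returns it.
func getPrice(instanceType string, fetch func(string) float64) float64 {
	if p, ok := priceCache[instanceType]; ok {
		return p // cached value reused, even if it was a bad 0
	}
	p := fetch(instanceType)
	priceCache[instanceType] = p
	return p
}

func main() {
	// First invocation hits a bad API response and caches 0.
	getPrice("t2.medium", func(string) float64 { return 0 })
	// A later invocation in the same container never re-fetches,
	// so the corrupt 0 sticks until the container is recycled.
	fmt.Println(getPrice("t2.medium", func(string) float64 { return 0.0139 }))
}
```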

@cristim I see no evidence of throttling in CloudTrail.


Could this be because DescribeSpotPriceHistory is a paginated endpoint, but AutoSpotting is not using the paginated version?

As far as I remember (without looking at the code), we just query the current price one instance type at a time, so we shouldn't need additional pages. I guess this could cause problems if we queried the prices for multiple instance types that don't fit in a single page.

@cristim Pretty sure it queries for all instance types in the region at once.
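The pagination pitfall under discussion can be illustrated with a self-contained sketch (a toy paginated API standing in for the real EC2 endpoint; all names here are invented): reading only the first page silently drops instance types that land on later pages, which then look like they have no price at all.

```go
package main

import "fmt"

// page mimics one response from a paginated API: some results plus a
// NextToken that is empty on the final page.
type page struct {
	prices    map[string]float64
	nextToken string
}

// fakeAPI simulates the paginated endpoint: t2.medium only appears on
// the second page.
func fakeAPI(token string) page {
	if token == "" {
		return page{prices: map[string]float64{"t2.small": 0.0069}, nextToken: "p2"}
	}
	return page{prices: map[string]float64{"t2.medium": 0.0139}}
}

// firstPageOnly mimics a non-paginated call that ignores NextToken.
func firstPageOnly() map[string]float64 {
	return fakeAPI("").prices
}

// allPages follows NextToken until the results are exhausted.
func allPages() map[string]float64 {
	out := map[string]float64{}
	token := ""
	for {
		p := fakeAPI(token)
		for k, v := range p.prices {
			out[k] = v
		}
		if p.nextToken == "" {
			return out
		}
		token = p.nextToken
	}
}

func main() {
	// Missing key in a Go map reads as the zero value: the "price 0"
	// that gets reported as "Unavailable in this Availability Zone".
	fmt.Println(firstPageOnly()["t2.medium"])
	fmt.Println(allPages()["t2.medium"])
}
```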

Regardless, I'll have a fix up shortly. Just adding some tests now.

Awesome, huge thanks!

See #408. Not sure if it will fix it, but worth a shot.

Haven't had any issues since upgrading to #408. I think we can close for now and reopen if it pops up again.

Great to hear that, thanks!