raptor-ml / raptor

Transform your pythonic research to an artifact that engineers can deploy easily.

Home Page:https://raptor.ml


[BUG] How to train a headless feature?

AlmogBaku opened this issue · comments

What is the bug?

Headless features don't have a data source, so we don't know which timestamps to run them against.

Steps to reproduce the bug

from datetime import datetime

import pandas as pd
from typing_extensions import TypedDict

# Raptor SDK symbols (data_source, feature, aggregation, freshness,
# AggregationFunction, Context) are assumed imported from the SDK.


@data_source(
    training_data=pd.read_parquet(
        'https://gist.github.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/emails.parquet'),
    keys=['id', 'account_id'],
    timestamp='event_at',
)
class Email(TypedDict('Email', {'from': str})):
    event_at: datetime
    account_id: str
    subject: str
    to: str


@feature(keys='account_id', data_source=Email)
@aggregation(function=AggregationFunction.Count, over='10h', granularity='1h')
def emails_10h(this_row: Email, ctx: Context) -> int:
    """email over 10 hours"""
    return 1

@data_source(
    training_data=pd.read_csv(
        'https://gist.githubusercontent.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/deals.csv'),
    keys=['id', 'account_id'],
    timestamp='event_at',
)
class Deal(TypedDict):
    id: int
    event_at: pd.Timestamp
    account_id: str
    amount: float

@feature(keys='account_id', data_source=Deal)
@aggregation(
    function=[AggregationFunction.Sum, AggregationFunction.Avg, AggregationFunction.Max, AggregationFunction.Min],
    over='10h',
    granularity='1m'
)
def deals_10h(this_row: Deal, ctx: Context) -> float:
    """sum/avg/min/max of deal amount over 10 hours"""
    return this_row['amount']


@feature(keys='account_id', data_source=None)
@freshness(target='-1', invalid_after='-1')
def emails_deals(_, ctx: Context) -> float:
    """emails/deal[avg] rate over 10 hours"""
    e, _ = ctx.get_feature('emails_10h+count')
    d, _ = ctx.get_feature('deals_10h+avg')
    if e is None or d is None:
        return None
    return e / d

What is the expected result?

The user needs to provide training timestamps somehow. The simplest way is to provide a DataFrame of timestamps (and the matching keys).
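As a sketch of that idea (the column names and shape here are illustrative assumptions, not an existing Raptor API), the user would hand the trainer one row per (key, timestamp) pair to evaluate the headless feature at:

```python
import pandas as pd

# Hypothetical "training timeline" for a headless feature: the trainer
# evaluates the feature once per (account_id, timestamp) row.
training_timestamps = pd.DataFrame({
    'account_id': ['acct_1', 'acct_1', 'acct_2'],
    'timestamp': pd.to_datetime([
        '2022-01-01 10:00', '2022-01-01 11:00', '2022-01-01 10:30',
    ]),
})

print(training_timestamps)
```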

Version

No response

Cluster version

No response

What else should we know?

Think ahead about something that can also work with the future train-on-scale version.

Also: where does it get its keys' values from?

Option 1: recover the data from the dependent sources.

  1. Fetch the timestamps and keys from the dependent features' sources.
  2. Merge them while respecting the freshness settings (drop duplicates within the freshness window).
  3. Calculate.

This can also be replaced by an explicit list of timestamps and keys.
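A minimal pandas sketch of those three steps, assuming each dependent source exposes a training DataFrame with a key column and a timestamp column (all names here are illustrative):

```python
import pandas as pd

def recover_training_timeline(sources, key='account_id', ts='event_at',
                              freshness=pd.Timedelta('1h')):
    """Illustrative sketch: build the (key, timestamp) pairs to train a
    headless feature against, from its dependent features' training data."""
    # 1. Fetch the timestamps and keys from the dependent sources.
    timeline = pd.concat([df[[key, ts]] for df in sources])
    timeline = timeline.sort_values([key, ts])
    # 2. Merge while respecting freshness: within each key, drop any
    #    timestamp inside the freshness window of the last kept one.
    kept = []
    for k, group in timeline.groupby(key):
        last = None
        for t in group[ts]:
            if last is None or t - last >= freshness:
                kept.append((k, t))
                last = t
    # 3. The result is the timeline to calculate the headless feature on.
    return pd.DataFrame(kept, columns=[key, ts])
```

Feeding it the emails and deals training frames would yield one merged, freshness-deduplicated timeline per account.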

Option 2: We don't have to replay the data at all. We can run the function upon request:

When the headless feature is requested as a dependency, calculate it right away.

When the headless feature is requested for training, calculate it as in option #1?
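That split could be dispatched roughly like this (everything below is a hypothetical sketch; `resolve_headless`, `RequestKind`, and the replay stub are not Raptor APIs):

```python
from enum import Enum

class RequestKind(Enum):
    SERVING = 'serving'    # requested as a dependency, at serving time
    TRAINING = 'training'  # requested while building a training set

def replay_against_recovered_timeline(feature_fn, ctx):
    # Placeholder for option #1: recover (key, timestamp) pairs from the
    # dependent sources and replay the feature against that timeline.
    raise NotImplementedError

def resolve_headless(feature_fn, ctx, kind):
    """Hypothetical dispatcher: a headless feature is never replayed from
    its own (nonexistent) data source."""
    if kind is RequestKind.SERVING:
        # Dependent-feature request: calculate right away with the live ctx.
        return feature_fn(None, ctx)
    # Training request: fall back to option #1's timeline recovery.
    return replay_against_recovered_timeline(feature_fn, ctx)
```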

Another idea:
Provide a flag that marks the feature as headless.

Then we use the training source to replay, and we don't strip the training source out of the manifest.

This is probably the easiest solution.

Another option is specifying a flag called sourceless_training_df.
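If that route is taken, the call site might look roughly like this. The decorator below is a stub written only to mimic the proposed signature; the real @feature decorator lives in the Raptor SDK and differs:

```python
import pandas as pd

# Stub of the proposed API, for illustration only.
def feature(keys, data_source=None, sourceless_training_df=None):
    def wrap(fn):
        fn.keys = keys
        fn.sourceless_training_df = sourceless_training_df
        return fn
    return wrap

# The user hands the trainer an explicit timeline instead of a data source.
timeline = pd.DataFrame({
    'account_id': ['acct_1', 'acct_2'],
    'event_at': pd.to_datetime(['2022-01-01 10:00', '2022-01-01 11:00']),
})

@feature(keys='account_id', sourceless_training_df=timeline)
def emails_deals(_, ctx):
    """emails/deal[avg] rate over 10 hours (body elided)."""
```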