[BUG] How to train a headless feature?
AlmogBaku opened this issue
What is the bug?
Headless features don't have a data source, so we don't know which timestamps to run them against.
Steps to reproduce the bug
```python
@data_source(
    training_data=pd.read_parquet(
        'https://gist.github.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/emails.parquet'),
    keys=['id', 'account_id'],
    timestamp='event_at',
)
class Email(TypedDict('Email', {'from': str})):
    event_at: datetime
    account_id: str
    subject: str
    to: str


@feature(keys='account_id', data_source=Email)
@aggregation(function=AggregationFunction.Count, over='10h', granularity='1h')
def emails_10h(this_row: Email, ctx: Context) -> int:
    """email over 10 hours"""
    return 1


@data_source(
    training_data=pd.read_csv(
        'https://gist.githubusercontent.com/AlmogBaku/a1b331615eaf1284432d2eecc5fe60bc/raw/deals.csv'),
    keys=['id', 'account_id'],
    timestamp='event_at',
)
class Deal(TypedDict):
    id: int
    event_at: pd.Timestamp
    account_id: str
    amount: float


@feature(keys='account_id', data_source=Deal)
@aggregation(
    function=[AggregationFunction.Sum, AggregationFunction.Avg, AggregationFunction.Max, AggregationFunction.Min],
    over='10h',
    granularity='1m'
)
def deals_10h(this_row: Deal, ctx: Context) -> float:
    """sum/avg/min/max of deal amount over 10 hours"""
    return this_row['amount']


@feature(keys='account_id', data_source=None)
@freshness(target='-1', invalid_after='-1')
def emails_deals(_, ctx: Context) -> float:
    """emails/deal[avg] rate over 10 hours"""
    e, _ = ctx.get_feature('emails_10h+count')
    d, _ = ctx.get_feature('deals_10h+avg')
    if e is None or d is None:
        return None
    return e / d
```
What is the expected result?
The user needs to provide the training timestamps somehow. The simplest way is to provide a DataFrame of timestamps.
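For illustration, such a DataFrame could look like the following (a minimal sketch; the column names and the idea that keys ride along with the timestamps are assumptions, not an existing API):

```python
import pandas as pd

# Hypothetical "training points" the user would hand to the framework:
# one row per (key, timestamp) at which the headless feature should be
# evaluated during training.
training_points = pd.DataFrame({
    'account_id': ['acc_1', 'acc_1', 'acc_2'],
    'timestamp': pd.to_datetime([
        '2022-01-01 10:00', '2022-01-01 11:00', '2022-01-01 10:30',
    ]),
})
```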
Version
No response
Cluster version
No response
What else should we know?
Think in advance about something that can also work with the future train-on-scale version.
Also: where does the headless feature get its key values from?
Option 1: recover the data from the dependent sources.
- Fetch the timestamps and keys from the dependent features' sources.
- Merge them while respecting the freshness settings (deduplicate within the freshness window).
- Calculate.
This could be replaced by an explicit list of timestamps and keys.
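The steps above can be sketched roughly as follows (all names here are assumptions for illustration, not the framework's API; "deduplicate by freshness" is approximated by keeping one point per key per freshness-sized time bucket):

```python
import pandas as pd

def recover_replay_points(sources: list, key: str, ts: str,
                          freshness: str) -> pd.DataFrame:
    """Union (key, timestamp) pairs from the dependent features' sources,
    then keep at most one point per key per freshness window."""
    # Union the (key, timestamp) pairs from all dependent sources.
    points = pd.concat([s[[key, ts]] for s in sources], ignore_index=True)
    points = points.sort_values(ts)
    # Within one freshness window per key, a single calculation is enough,
    # so bucket the timestamps and drop duplicates per (key, bucket).
    bucket = points[ts].dt.floor(freshness)
    return (points.assign(_bucket=bucket)
                  .drop_duplicates(subset=[key, '_bucket'])
                  .drop(columns='_bucket')
                  .reset_index(drop=True))
```

With the Email and Deal sources from the reproduction, this would yield the timestamps at which emails_deals needs to be replayed for training.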
Option 2: we don't have to replay the data at all. We can run the function upon request:
- When the headless feature is requested as a dependency, calculate it right away.
- When the headless feature is requested for training, calculate it as in option 1?
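The serving half of Option 2 could look roughly like this (a sketch under assumed names; FeatureSpec and resolve are not part of the framework):

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class FeatureSpec:
    """Minimal stand-in for a registered feature definition."""
    fn: Callable
    data_source: Optional[Any] = None

def resolve(name: str, registry: dict, ctx) -> Any:
    """Serve a feature value; headless features are computed on the spot."""
    feat = registry[name]
    if feat.data_source is None:
        # Headless: nothing to replay, so run the function right away.
        return feat.fn(None, ctx)
    # Sourced features would be served from the materialized store instead.
    raise NotImplementedError("lookup in the feature store")
```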
Another idea:
Provide a flag that marks the feature as headless.
Then we use the training source to replay, and don't opt out of the training source in the manifest.
This is probably the easiest solution.
Another option is specifying a flag called sourceless_training_df.
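In decorator form, that flag might look like this (purely hypothetical: sourceless_training_df and the column names are assumptions; a stub decorator stands in for the real @feature):

```python
import pandas as pd

def feature(**kwargs):
    """Stub standing in for the real @feature decorator, which would
    accept the proposed sourceless_training_df parameter."""
    def wrap(fn):
        fn.spec = kwargs
        return fn
    return wrap

@feature(keys='account_id', data_source=None,
         sourceless_training_df=pd.DataFrame({
             'account_id': ['acc_1', 'acc_2'],
             'timestamp': pd.to_datetime(['2022-01-01 10:00',
                                          '2022-01-01 10:30']),
         }))
def emails_deals(_, ctx) -> float:
    """emails/deal[avg] rate over 10 hours"""
    ...
```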