Proposal lacks clarity about stability of Cohort IDs

Question

AramZS opened this issue 3 years ago · comments

Aram Zucker-Scharff commented 3 years ago

Hi, I'm reviewing the FLoC proposal and there are some clarity issues about the Cohort IDs that are assigned to users.

I am operating on the assumption that there is a limited set of cohort IDs possible across all users at any one time. For most users, their cohort IDs will shift as their behavior shifts over time.

That is my interpretation now, hopefully correct. What is unclear is if the Cohort IDs themselves will shift over time in either their names or meaning based on user behavior across browsers or if they are relatively stable representatives of a user's interests.

Let's take an example:

Take a very boring individual user, UserA, who goes on their computer and only visits GamingSiteA dot com; that is the only site on the web UserA ever visits. UserA is assigned a single FLoC, [CohortID_A], that is shared by others who visit gaming sites. A consumer of FLoCs sees this Cohort ID with the value [CohortID_A] and sees its attachment to users on gaming sites, including GamingSiteA and others, and to conversions on gaming purchases, say 'mygamingproduct.com/checkout'. As an exchange, SSP, or publisher, I assign "gaming" as an interest to that FLoC ID and make it available to be targeted by marketers who want to sell gaming-adjacent items.

UserA keeps visiting the same site forever, never going to any other site. Does [CohortID_A] remain stable on their browser and get re-applied as it expires? Or, as other people's browsers shift into different behaviors over time, does that redefine the field of play and cause [CohortID_A] to shift its value to [CohortID_B], which would mean that even if UserA's behavior is completely static, their Cohort ID will shift over time based on the behavior of all users who participate in FLoCs? Or does [CohortID_A] represent a cohortA which remains stable in what behavior it is attached to, but whose name shifts over time, so cohortA is represented by [CohortID_A] today and by [CohortID_Z] next month?

This is very important to know because it means the allocation of computational resources may be continuous vs one-time. If the definition of FLoCs can be considered stable ([CohortID_A] is believed by my model to always represent "gaming-interested") then we can run ML to build those solutions and eventually end use of that resource once we understand the definitions of all FLoCs relevant to us. If [CohortID_A] is not a stable value we could bake a model that says 'behaviors x will always generate a cohort ID that means gaming-interested, so capture that ID and define it as gaming-interested'. Or if the meaning of [CohortID_A] shifts due to behavior across all browsers every week, then a machine learning system might have to operate very differently. In each case it might require a different level of resources dedicated to take advantage of FLoC.

I think this document needs to be clearer on this point. I propose we add a section describing the lifecycle of a FLoC Cohort ID. It should be clear that the values for timing are not currently set in stone (we're still at an experimentation phase) and that what might be defined as 'a week' may change. But if FLoC IDs themselves have a lifecycle outside of their assignment to users, that should be noted and made clear for those of us interested in experimentation, as it will be useful for budgeting and assigning computational resources.

The section should make the following things clear:

1. Do FLoC Cohort ID names expire? (Is [CohortID_A] expected to be present somewhere in the FLoC system forever?)
2. Does a FLoC Cohort ID have a stable meaning?
   a. If [CohortID_A] is persistent, does it always represent the same in-browser historical behavior even if who it is assigned to changes over time?
   b. If [CohortID_A] is not persistent and will eventually be replaced by [CohortID_B], can we expect the same model to apply? (All users on sites gamingX, gamingB, and gamingY who share a cohort ID can be understood to share a cohort ID that means 'gaming', even if today 'gaming' == [CohortID_A] and next week 'gaming' == [CohortID_B].)
3. If Cohort Name or Cohort Meaning or both are not stable over time (and therefore require regular recalculation), at what pace do shifts occur?
4. If there are one-time shifts in meaning (some issue in the calculation of FLoCs forces the methodology to change via a browser update), how are systems notified of that shift? Is there a browser-level signal like "version", or a notification on the Chrome Developer Blog, or something else? How much time is given between the notification of intent to change and the actual change? Do they get signaled in developer tools with a warning in Canary? Is there an expected pace to such updates?

Hopefully this is an opportunity for clarity that will help with ongoing conversations and make it easier for people to decide what to test, how to test it, and what the cost involved will be.

Thank you!

Michael Kleber · Answer 1 · Thu Mar 11 2021 04:09:40 GMT+0800 (China Standard Time)

Hi Aram:

When you call the document.interestCohort() API [draft spec], you'll get back both a cohort id (which cohort you're in, the "name" you referred to above) and a version that indicates what model was used to produce it (exactly as you said in your item 4). As long as the algorithm doesn't change, the cohort "means" the same thing. So the approach you describe, assigning "gaming" as an interest to that FLoC ID, is a good way to use FLoC, as long as it's based on both the id and the version.
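One way to act on this on the consumer side is to key every learned label on the (version, id) pair rather than on the id alone. The following is a minimal Python sketch under that assumption; the `CohortKey` type, the `labels` table, and the sample id "1234" are hypothetical and not part of FLoC or the draft spec.

```python
from typing import Dict, NamedTuple, Optional


class CohortKey(NamedTuple):
    """Hypothetical lookup key: a cohort id is only meaningful within its version."""
    version: str    # kept as an opaque string, e.g. "chrome.1.0"
    cohort_id: str


# Hypothetical table of labels learned from observed traffic.
labels: Dict[CohortKey, str] = {
    CohortKey("chrome.1.0", "1234"): "gaming",
}


def lookup_interest(version: str, cohort_id: str) -> Optional[str]:
    """Return the learned label, or None if this (version, id) pair is unknown."""
    return labels.get(CohortKey(version, cohort_id))


print(lookup_interest("chrome.1.0", "1234"))  # gaming
print(lookup_interest("chrome.1.1", "1234"))  # None: same id, different algorithm
```

With this layout, a version change leaves old labels untouched and simply yields unknown keys until the new version has been learned.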

You're quite right that if browsers change versions often, it will mean a lot of work for people. During the experimental Origin Trial stage, I expect Chrome will try multiple clustering techniques so that we can all learn what works best, and those would have different version strings of course. But once things are launched, I expect version changes to be infrequent.

Of course it's always possible that even if the browser keeps everything the same, a group's behavior can change — some new crocheting site becomes really popular, and there's a hash collision, and it turns out that the cohort that used to have lots of gamers in it is now full of both gamers and crocheters. (Ah — I guess this is plausible when cohort calculation is just based on domain names, as the initial Chrome experiment will be; maybe if we end up with a clustering technique based on topics in the future, that won't be much of a concern.)
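Because a cohort's makeup can drift even when the version is unchanged, a consumer might want to re-check learned labels from time to time. A minimal Python sketch, assuming you log a site category for each observation of a given (version, id) pair; the function names, the category strings, and the 0.5 threshold are illustrative assumptions only.

```python
from collections import Counter
from typing import Dict, Iterable


def category_shares(categories: Iterable[str]) -> Dict[str, float]:
    """Turn observed site categories into fractional shares."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def label_still_holds(label: str, recent: Iterable[str], min_share: float = 0.5) -> bool:
    """True if `label` still accounts for at least `min_share` of recent observations."""
    return category_shares(recent).get(label, 0.0) >= min_share


# Made-up observations for one cohort in two time windows.
last_month = ["gaming"] * 9 + ["news"]
this_month = ["gaming"] * 4 + ["crochet"] * 6  # a popular new site lands in the cohort

print(label_still_holds("gaming", last_month))  # True  (share 0.9)
print(label_still_holds("gaming", this_month))  # False (share 0.4)
```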

captify-mgruau · Answer 2 · Thu Mar 11 2021 23:18:30 GMT+0800 (China Standard Time)

Hi @michaelkleber,

Thanks for this. One other point of technicality, on the "version" field: the way the clustering works, will the version evolve at the same time for all cohorts? (e.g. from day N to day N+1, all calls would now return version = version+1)
Or will some stable cohorts remain the same with an unchanged version number, while others shift and have their version incremented by 1?

It impacts the technical design of how we should store this version number.

While we're on this topic, the version readable is currently "chrome.1.0". If other chromium-based browsers implement the cohort technology, will they have different version values (e.g. "edge.1.0","comodo.1.0"...)? If so, would that in effect mean that the clustering space is siloed from one browser to the other?

Thanks in advance.

Michael Kleber · Answer 3 · Thu Mar 11 2021 23:30:07 GMT+0800 (China Standard Time)

Hi Martin,

> One other point of technicality, on the "version" field: the way the clustering works, will the version evolve at the same time for all cohorts? (e.g. from day N to day N+1, all calls would now return version = version+1)

If we just update to a new version in a subsequent release of Chrome, then the change would roll out as people restart their browsers. So you would probably observe a migration of most people from chrome.1.0 to, say, chrome.1.1 over the course of a week. It wouldn't depend on what cohort someone is in.
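A rollout like this could be tracked by counting which version strings show up each day. A sketch in Python, assuming a hypothetical log of (day, version) observations; the data below is made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def version_share_by_day(observations: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Map each day to the fraction of observations seen per version string."""
    per_day = defaultdict(Counter)
    for day, version in observations:
        per_day[day][version] += 1
    return {
        day: {v: n / sum(counts.values()) for v, n in counts.items()}
        for day, counts in per_day.items()
    }


# Made-up observations: most traffic moves from chrome.1.0 to chrome.1.1.
obs = (
    [("day1", "chrome.1.0")] * 9 + [("day1", "chrome.1.1")] * 1
    + [("day7", "chrome.1.0")] * 2 + [("day7", "chrome.1.1")] * 8
)
shares = version_share_by_day(obs)
print(shares["day1"]["chrome.1.1"])  # 0.1
print(shares["day7"]["chrome.1.1"])  # 0.8
```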

However, we might want to have two different clustering algorithms running at the same time, for different people. In that case some people might have version chrome.1.0 and others might have chrome.2.0 at the same time. (We would do this in the hopes that it would put you in a good position to evaluate which algorithm seems more useful in a head-to-head comparison.)
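For a head-to-head comparison of this kind, results would need to be grouped by version string before comparing. A Python sketch, assuming a hypothetical list of (version, converted) records; using conversion rate as the metric is an assumption here, not something specified in this thread.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conversion_rate_by_version(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the conversion rate separately for each version string."""
    shown = defaultdict(int)
    converted = defaultdict(int)
    for version, did_convert in records:
        shown[version] += 1
        converted[version] += int(did_convert)
    return {v: converted[v] / shown[v] for v in shown}


# Made-up records for two algorithms running side by side.
records = (
    [("chrome.1.0", True)] * 2 + [("chrome.1.0", False)] * 8
    + [("chrome.2.0", True)] * 3 + [("chrome.2.0", False)] * 7
)
rates = conversion_rate_by_version(records)
print(rates)  # {'chrome.1.0': 0.2, 'chrome.2.0': 0.3}
```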

> While we're on this topic, the version readable is currently "chrome.1.0". If other chromium-based browsers implement the cohort technology, will they have different version values (e.g. "edge.1.0","comodo.1.0"...)?

The draft spec explains it this way:

> The string representation of the interest cohort version is implementation-defined. It’s recommended that the browser vendor name is part of the version (e.g. “chrome.2.1”, “v21/mozilla”), so that when exposed to the Web, there won’t be naming collisions across browser vendors. As an exception, if two browsers choose to deliberately use the same cohort assignment algorithm, they should pick some other way to give it an unambiguous name and avoid collisions.

captify-mgruau · Answer 4 · Thu Mar 11 2021 23:42:53 GMT+0800 (China Standard Time)

Thank you, very clear. We'll have to see in practice how the pace of version change impacts our ability to draw conclusions from cohort IDs. There is a balance to find between reactivity to the latest trends in browsing behaviour vs stability and ease of use, but I guess that's why we need to trial!

Michael Kleber · Answer 5 · Fri Mar 12 2021 06:49:27 GMT+0800 (China Standard Time)

Absolutely! Thanks for your willingness to experiment, and we look forward to hearing what you learn.

Aram Zucker-Scharff · Answer 6 · Fri Mar 12 2021 06:53:38 GMT+0800 (China Standard Time)

Thanks! This answers a number of questions that were floating around.

Aram Zucker-Scharff · Answer 7 · Thu Apr 01 2021 22:59:26 GMT+0800 (China Standard Time)

@michaelkleber These questions keep coming up in FLoC convos, I really think it would be helpful to add your response in some form to the README itself so it is very clear to all players.

Michael Kleber · Answer 8 · Sat Apr 03 2021 00:43:25 GMT+0800 (China Standard Time)

Good point, added.