AI-SDC / ACRO

Tools for the Automatic Checking of Research Outputs. These are tools for researchers to use as drop-in replacements for commands that produce outputs in Stata, Python and R.

Shape mismatch when there are two columns and the aggfunc is count or sum

mahaalbashir opened this issue

table = acro.crosstab(df.year, [df.grant_type, df.survivor], values=df.inc_grants, aggfunc="aggfunc", margins=True)
When the aggfunc is sum or count, this command raises the error ValueError: Array conditional must be same shape as self. The test for this case is the function test_crosstab_with_sum in the test-initial.

Explanation
When pd.crosstab is used, the result depends on the aggfunc:

  1. Sum or count
    columns that are all zeros are not deleted
  2. Mean or std
    columns that are all zeros are deleted

The threshold mask is created using the count function, which by default (in the pandas version) doesn’t delete a column if it is all zeros. So the threshold mask is originally created with the columns that have zeros in all cells. After the creation of the threshold mask, every column that is all zeros is deleted.
The p-ratio and nk-rule masks by default don’t include columns with zeros. When the command is run:

  1. If the agg function is mean or std, the resulting table has no all-zero columns, so the masks and the table have the same shape.
  2. If the agg function is sum or count, the resulting table keeps the all-zero columns, so the masks and the table do not have the same shape.
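The difference described above can be reproduced with plain pandas on a toy frame. This is only a sketch: the data below is made up (it is not the ACRO test dataset), the helper make_table is hypothetical, and the mask is a stand-in for a mask built without the all-zero column rather than ACRO's actual mask code. It assumes the all-zero column comes from a group whose values are all missing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): every value for grant_type "N" is missing.
df = pd.DataFrame(
    {
        "year": [2010, 2010, 2011, 2011],
        "grant_type": ["G", "N", "G", "N"],
        "inc_grants": [1.0, np.nan, 2.0, np.nan],
    }
)


def make_table(aggfunc):
    """Build a crosstab of inc_grants by year and grant_type."""
    return pd.crosstab(df.year, df.grant_type, values=df.inc_grants, aggfunc=aggfunc)


sum_table = make_table("sum")    # column "N" is kept and filled with zeros
mean_table = make_table("mean")  # column "N" is all NaN, so it is dropped

# Stand-in for a mask that was built without the all-zero column.
mask = mean_table.notna().to_numpy()

try:
    sum_table.where(mask)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(sum_table.shape, mean_table.shape)
print(error_message)
```

Note that the error only appears because the mask is passed as a NumPy array. If the mask were passed as a DataFrame, pandas would align it on labels instead of comparing shapes.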

I think the above explanation is not entirely accurate. I tested the same scenario with different columns. This command doesn't work and throws an error:
table = acro.crosstab(df.year, [df.grant_type, df.survivor], values=df.inc_grants, aggfunc="aggfunc", margins=True)
Replacing the second column, df.survivor, with df.status seems to work fine:
table = acro.crosstab(df.year, [df.grant_type, df.status], values=df.inc_grants, aggfunc="aggfunc", margins=True)

The suggested solution, deleting the columns with zeros from the table before applying the masks, will work and solve the issue, but I am not sure why different columns result in different behavior. @jim-smith
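A minimal sketch of that suggested fix, assuming the masks are NumPy arrays that already exclude the all-zero columns. The function name drop_all_zero_columns, the toy table and the toy mask are hypothetical and not taken from the ACRO code base.

```python
import numpy as np
import pandas as pd


def drop_all_zero_columns(table):
    """Return the table without columns whose cells are all zero."""
    keep = (table != 0).any(axis=0)
    return table.loc[:, keep]


# Toy table as a sum crosstab might return it: column "N" is all zeros.
table = pd.DataFrame({"G": [1.0, 2.0], "N": [0.0, 0.0]}, index=[2010, 2011])

# Toy suppression mask built without column "N" (True means suppress).
mask = np.array([[True], [False]])

trimmed = drop_all_zero_columns(table)
masked = trimmed.where(~mask)  # shapes now match, so no ValueError
print(masked)
```

One caveat with this approach: a column whose observed values are genuinely all zero would be dropped as well, so it may be safer to decide which columns to drop from the counts rather than from the aggregated values.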