swcarpentry / python-novice-gapminder

Plotting and Programming in Python

Home Page: http://swcarpentry.github.io/python-novice-gapminder/

Adding caveats for DataFrame.iloc under Pandas Dataframes

candemircan opened this issue

Hi!

For the Pandas Dataframes episode, under the DataFrame.iloc[..., ...] section, it might be worth mentioning the caveats of this method. For example, if you add new columns to your data later, position-based selection (as opposed to selection by column name) can silently pick up a different column than intended. If this is worth adding, I would be happy to make the edit and open a pull request.
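
A minimal sketch of the scenario described above, assuming a small hypothetical table (the column names and values are illustrative, not taken from the lesson data):

```python
import pandas as pd

# Hypothetical toy data, loosely modelled on a Gapminder-style table.
df = pd.DataFrame(
    {"gdp_1952": [1601.1, 6137.1], "gdp_1957": [1942.3, 8842.6]},
    index=["Albania", "Austria"],
)

# Position 0 currently refers to the 1952 column.
first_before = df.iloc[:, 0].name

# Later, a new column is inserted at the front of the table.
df.insert(0, "continent", ["Europe", "Europe"])

# The same positional expression now selects a different column, without any error.
first_after = df.iloc[:, 0].name

# Selection by column name still returns the 1952 values.
by_name = df.loc[:, "gdp_1952"]

print(first_before, first_after)  # prints: gdp_1952 continent
```

The positional expression keeps running after the change, which is what makes this kind of mistake easy to miss.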

Thanks,

Can

I think a 'caveats' section would be a great addition, @candemircan. The lesson workflow does a good job of introducing learners to using different approaches to slice data frames. However, I wonder if learners may run into trouble when applying some of the knowledge from this lesson (e.g., adding columns to data later and running into index-based selection problems, as you mention). Perhaps immediately after the section "Result of slicing can be used in further operations", a section could be added to demonstrate the caveats and how learners might run into trouble. After completing the lesson, if learners start adding on further operations to .iloc, they might run into a "SettingWithCopyWarning", and be unsure why it is happening. Maybe addressing this specific warning is beyond the scope of the lesson, but including a brief section demonstrating the caveats would be valuable.
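
One way such a demonstration might look is sketched below, assuming a hypothetical three-row table. The exact warning text depends on the pandas version (older releases emit SettingWithCopyWarning, newer ones may report the chained assignment differently), so the sketch silences warnings and only checks the data:

```python
import warnings

import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({"year": [1952, 1957, 1962], "gdp": [1.0, 2.0, 3.0]})
mask = df["year"] > 1952

# Chained selection: the first .loc returns a new object, and the
# assignment is applied to that temporary object rather than to df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning for this demo only
    df.loc[mask].loc[:, "gdp"] = 0.0

after_chained = df["gdp"].tolist()  # df is unchanged

# A single .loc call with both the row and the column selector updates df itself.
df.loc[mask, "gdp"] = 0.0

after_single = df["gdp"].tolist()

print(after_chained, after_single)
```

Selecting rows with a boolean mask is used here because it returns a copy; with other selectors, older pandas versions may behave differently, which is part of what makes chained assignment confusing.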

Hi there,

I agree, and I would be happy to write something up in that direction. However, I would like to discuss this further before making an attempt.
Before getting to the point, I want to suggest going even further and teaching the .loc method before .iloc because it is more pandasothic (in the sense of pythonic): if you reach for pandas instead of numpy, that should be because the rows and columns of your data carry a special meaning (beyond their positions), and if so, you should assign meaningful labels and use those.
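
A small sketch of what label-first selection could look like, assuming a hypothetical table indexed by country (names and values are illustrative only):

```python
import pandas as pd

# Hypothetical toy table; set_index gives each row a meaningful label.
df = pd.DataFrame(
    {
        "country": ["Albania", "Austria", "Belgium"],
        "gdp_1952": [1601.1, 6137.1, 8343.1],
    }
).set_index("country")

# Reordering the rows changes what each position refers to...
by_gdp = df.sort_values("gdp_1952", ascending=False)

pos_before = df.iloc[0].name     # label of the first row before sorting
pos_after = by_gdp.iloc[0].name  # label of the first row after sorting

# ...but a label refers to the same row in both tables.
label_before = df.loc["Austria", "gdp_1952"]
label_after = by_gdp.loc["Austria", "gdp_1952"]

print(pos_before, pos_after)  # prints: Albania Belgium
```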

Anyway, both methods have their caveats. A list off the top of my head (including yours) would be:

  1. Choose .loc over .iloc if possible because operations might change the index in a non-obvious way (see also the point above).
  2. Refrain from combining more than one .loc and/or .iloc due to the mentioned SettingWithCopyWarning. I think that should be best practice even if not setting anything because you might later copy that expression for setting the same elements. Also, try to never return .(i)loc[...] from a function because there is no guarantee what will happen outside the function (another .(i)loc perhaps?).
  3. pandas' slicing includes the end point for .loc but excludes it for .iloc. That is particularly nasty for trivial integer indices because df.loc[0:1] and df.iloc[0:1] select different rows even in situations when both expressions are valid and 0 and 1 refer to the same rows.
  4. Indexing with lists and tuples is not semantically equivalent (although I can't come up on the spot with an example where both are valid but yield different results).
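
The slicing difference in point 3 can be sketched as follows, assuming a hypothetical frame with the default integer index, so that labels and positions coincide:

```python
import pandas as pd

# Default RangeIndex: the labels 0, 1, 2 coincide with the positions 0, 1, 2.
df = pd.DataFrame({"value": [10, 20, 30]})

by_label = df.loc[0:1]      # label-based slice: both end points are included
by_position = df.iloc[0:1]  # position-based slice: the end is excluded, as in plain Python

print(len(by_label), len(by_position))  # prints: 2 1
```

Both expressions are valid here, and 0 and 1 refer to the same rows in each, yet the results differ by one row.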

There are probably even more subtle things to be aware of. My questions:

  • Anything missing? Anything irrelevant or too advanced?
  • As my points go a bit beyond the original scope, it might be better to put them at the end of the "Use/Select ..." sections (i.e. before "GroupBy...")?
  • How verbose should that be? More explicit information and reasoning, or rather "do this, and see this link for why"? (pandas has very informative docs on those things.) Should we include an MWE exhibiting a confusing scenario?
  • Should we have dedicated exercises about these pitfalls?

Best,
Julian