Pandas Style Guide
Most of these points were graciously taken from Ted Petrou's Minimally Sufficient Pandas article.
-
Use brackets rather than dot notation to select a column of data. Brackets avoid name collisions with DataFrame attributes and methods, and they also work for column names that contain spaces or reserved characters.
df['count']  # good
df.count     # ambiguous
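A minimal sketch of the collision described above. The frame and its column names (`count`, `unit price`) are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: one column name collides with a DataFrame method,
# the other contains a space.
df = pd.DataFrame({"count": [3, 5, 8], "unit price": [1.5, 2.0, 0.5]})

by_bracket = df["count"]   # always the column, returned as a Series
by_dot = df.count          # the DataFrame.count method, not the column
spaced = df["unit price"]  # a name with a space is only reachable with brackets
```

Here `by_dot` is a bound method, so any later arithmetic on it fails instead of operating on the column.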
-
Use loc and iloc; ix was deprecated in pandas 0.20 and removed in 1.0.
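As a rough illustration of the two indexers, assuming a small made-up frame with string labels: loc selects by label and iloc selects by integer position.

```python
import pandas as pd

# Hypothetical data, indexed by string labels.
df = pd.DataFrame(
    {"price": [10, 20, 30]},
    index=["apple", "banana", "cherry"],
)

by_label = df.loc["banana", "price"]  # label-based lookup
by_position = df.iloc[1, 0]           # position-based lookup (row 1, column 0)

# loc slices include the end label; iloc slices exclude the end position.
label_slice = df.loc["apple":"banana"]
position_slice = df.iloc[0:2]
```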
-
Use read_csv over read_table (the only difference is the default delimiter: a comma for read_csv, a tab for read_table).
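A short sketch of reading tab-separated data with read_csv by passing the delimiter explicitly. The in-memory text stands in for a real file and is purely illustrative.

```python
import io

import pandas as pd

# Hypothetical file contents, held in memory so the example is self-contained.
comma_text = "name,score\nana,1\nbo,2\n"
tab_text = "name\tscore\nana\t1\nbo\t2\n"

from_commas = pd.read_csv(io.StringIO(comma_text))
from_tabs = pd.read_csv(io.StringIO(tab_text), sep="\t")  # replaces read_table
```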
-
Use isna and notna over isnull and notnull (isnull and notnull are aliases of isna and notna).
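For example, on a small made-up Series with one missing value, both spellings return the same mask:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])  # hypothetical data with one missing value

missing = s.isna()    # True where the value is missing
present = s.notna()   # the element-wise inverse of isna
legacy = s.isnull()   # alias: same result as isna
```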
-
Use the Pandas method over any built-in Python function with the same name.
df['column_name'].sum()  # faster
sum(df['column_name'])   # slower
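Speed aside, the two calls can also give different answers when data is missing. This sketch uses a made-up column with one NaN: the Pandas method skips NaN by default, while the built-in propagates it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"column_name": [1.0, np.nan, 3.0]})  # hypothetical data

pandas_total = df["column_name"].sum()  # skips NaN by default (skipna=True)
builtin_total = sum(df["column_name"])  # NaN propagates through the additions
```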
-
Use df.groupby('grouping column').agg({'aggregating column': 'aggregating function'}) as your primary groupby syntax.
- More room for configuration
- Easily expandable
- Easily decomposable into separate variables if needed
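A minimal sketch of this syntax, with the grouping column and the aggregation mapping pulled out into separate variables. The data and the column names (`team`, `points`, `minutes`) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "points": [1, 4, 2, 6],
    "minutes": [10, 20, 30, 40],
})

# Decomposed into separate variables; adding another aggregation
# means adding one more entry to the dict.
grouping_column = "team"
aggregations = {"points": "max", "minutes": "mean"}

summary = df.groupby(grouping_column).agg(aggregations)
```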
-
Use reset_index after a groupby call to turn the group keys back into ordinary columns; this avoids a MultiIndex when grouping by more than one column.
df.groupby('column_name').agg({'other_column': 'max'}).reset_index()
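The effect is easiest to see when grouping by two columns. The frame and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "store": ["x", "y", "x", "x"],
    "sales": [5, 7, 3, 9],
})

grouped = df.groupby(["region", "store"]).agg({"sales": "max"})  # MultiIndex rows
flat = grouped.reset_index()  # group keys become ordinary columns again
```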
-
Use pivot_table over pivot, unstack, or crosstab.
- Use crosstab if you need to find relative frequency.
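A sketch of both calls on a small made-up frame. Note that the data contains a repeated (city, year) pair, which pivot_table aggregates; the `normalize="index"` argument to crosstab gives row-wise relative frequencies.

```python
import pandas as pd

# Hypothetical data; ("oslo", 2020) appears twice.
df = pd.DataFrame({
    "city": ["rome", "rome", "oslo", "oslo"],
    "year": [2020, 2021, 2020, 2020],
    "sales": [10, 20, 30, 50],
})

# pivot_table aggregates the duplicated pair.
totals = df.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")

# crosstab counts occurrences; normalize="index" turns each row into shares.
shares = pd.crosstab(df["city"], df["year"], normalize="index")
```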
-
Use melt over stack since it allows you to rename columns and avoids a MultiIndex.
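To illustrate the difference, here is a hedged sketch on an invented wide table: melt names the new columns directly and keeps a flat index, while stack returns a Series with a MultiIndex.

```python
import pandas as pd

# Hypothetical wide-format data.
wide = pd.DataFrame({
    "name": ["ana", "bo"],
    "math": [90, 70],
    "art": [80, 60],
})

# melt: new column names are chosen up front, and the index stays flat.
long = wide.melt(id_vars="name", var_name="subject", value_name="score")

# stack: same values, but as a Series indexed by a (name, subject) MultiIndex.
stacked = wide.set_index("name").stack()
```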
-
Avoid iterating over a DataFrame; Pandas is built around vectorization.
- From fastest to slowest: NumPy array > Pandas vectorized operation > apply > iterrows > Python loop.
- Sofia Heisler's article has more detail.
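The same computation written three ways, from the slower end of that list to the faster end. The frame is made up and far too small to time meaningfully; the point is only the shape of each approach.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"price": [2.0, 3.0, 4.0], "quantity": [5, 1, 2]})

# Row-by-row iteration with iterrows.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# apply along rows: no explicit loop, but still one Python call per row.
totals_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized Pandas operation: one expression over whole columns.
totals_vectorized = df["price"] * df["quantity"]
```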
-
Use NumPy arrays if your application relies on performance.
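One way to do this, sketched on invented data, is to pull the columns out with to_numpy, compute on the arrays, and assign the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"x": [3.0, 4.0], "y": [4.0, 3.0]})

# Extract the underlying arrays and compute on them directly.
x = df["x"].to_numpy()
y = df["y"].to_numpy()
distance = np.sqrt(x**2 + y**2)

# Assign the result back as a new column.
df["distance"] = distance
```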