This section highlights differences between Python and R that could result in inadvertent errors if the wrong convention is used (i.e. the code may still run but would produce a wrong result).

Topic | R | Python |
---|---|---|
General purposes | R was developed specifically for statistical computing and data analysis. | Python was developed as a general-purpose programming language. |
Boolean | TRUE or T, FALSE or F | True, False |
Array indexing | Starts at 1 | Starts at 0 |
Indentation | Has no impact on code – is purely cosmetic | Has a specific meaning in the code. Reducing the indentation level indicates the end of a block of code. |
Length of a string | nchar(x). Do not use length(x), which returns the number of elements in the vector (1 for a single string) | len(x) |
Return statements in functions | If no return statement is specified, the function returns the value of the last expression evaluated | A return statement is required for the function to return an output; otherwise it returns None |
Interpretation of “=” | Assignment behaves as if it creates an independent copy of the object. For example, if we do data_2 = data_1 and then modify data_2, data_1 is unchanged. | Assignment binds a new name to the same object; it does not copy it. For mutable objects such as lists, dictionaries and pandas data frames, if we do data_2 = data_1 and then modify data_2 in place, data_1 shows the same changes. To make an independent copy, use data_2 = data_1.copy() instead. |
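
The silent pitfalls in the table above can be reproduced in a few lines of Python. This is a minimal sketch using plain lists; the names data_3, add_one_no_return and add_one are made up for illustration and do not come from the table.

```python
# "=" binds a second name to the same list; nothing is copied.
data_1 = [1, 2, 3]
data_2 = data_1
data_2.append(4)            # in-place change through the second name
print(data_1)               # data_1 shows the change as well

# An explicit copy behaves independently of the original.
data_3 = data_1.copy()
data_3.append(5)
print(data_1, data_3)

# Indexing starts at 0, so index 0 is the first element.
first = data_1[0]

# len() gives the number of characters in a string.
n_chars = len("hello")


def add_one_no_return(x):
    """Hypothetical helper: computes x + 1 but forgets to return it."""
    x + 1


def add_one(x):
    """Hypothetical helper: returns x + 1 explicitly."""
    return x + 1


result_missing = add_one_no_return(1)   # None, not 2
result = add_one(1)
print(result_missing, result)
```

In R, the equivalent of add_one_no_return would return 2, which is why this difference is easy to miss when porting code.
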

This section highlights differences between Python and R that significantly change the way the code is structured, but which are unlikely to cause non-obvious errors (i.e. if the wrong approach is used, the code will not run).

Topic | R | Python |
---|---|---|
Code blocks | Enclosed in braces { and }. Example: example = function(x){ some code; some more code; return(something) } | Begin with a line ending in a colon. On the next line the indentation level increases by one. The block ends when the indentation returns to the level it had at the start of the block. Example: def example(x): followed by the indented lines some code, some more code and return something; the next unindented line is more code that is not part of the function definition. |
Common ways to create unlabelled sequences of objects | In R, these are called vectors. Use the “c” command to create one, e.g. c(1, 2, 3). Here c stands for combine. Elements in an R vector must all be of the same type. | In Python, this is called a list. Use square brackets with elements separated by commas, e.g. [1, 2, 3]. Elements in a Python list can be of mixed type. A tuple can also be created using round parentheses, e.g. (1, 2, 3), but a tuple cannot be changed after it is created. |
Common ways to create labelled sequences of objects | In R, this is called a list. Use the “list” command to create one, separating keys and values with an equal sign, e.g. l = list('a' = 1, 'b' = 2, 'c' = 3). Access elements with the $ symbol, e.g. l$a is 1 | In Python, this is called a dictionary. Use braces to create one, separating the key-value pairs with commas, e.g. d = {'a':1, 'b':2, 'c':3}. Access elements using square brackets, e.g. d['a'] is 1 |
Applying a function across all elements of an array | Use the lapply command | Use the list comprehension syntax, e.g. [formula for x in list if condition] |
Loop | for(i in 1:10) {...} | for i in range(1, 11): ... Note that range(10) runs from 0 to 9, whereas 1:10 in R runs from 1 to 10. |
Conditional statement | if(x > 3) {...} | if x > 3: ... |
Call a function | function(data) | function(data) or data.function(). Besides plain function calls, Python commonly uses method calls, where the function belongs to the object. For example, len(data) is a function call, while data.mean() calls the mean method of a pandas or NumPy object. |
Access a column in a data frame | 1. Using the $ operator: data_frame$column 2. Using square brackets []: data_frame[, "column"] | 1. Using square brackets []: data_frame["column"] 2. Using the dot notation: data_frame.column (only works if the column name contains no spaces or special characters and does not clash with an existing attribute or method of the data frame) |
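
The Python column of the table above can be illustrated with a short script. This is a sketch only; the variable names and the helper squares_of_even are hypothetical, and statistics.mean from the standard library stands in for the function-call style.

```python
import statistics

values = [1, 2.5, "three"]        # list: mixed types are allowed
point = (1, 2, 3)                 # tuple: cannot be changed after creation
d = {"a": 1, "b": 2, "c": 3}      # dictionary: labelled values

# Trying to modify a tuple raises TypeError.
try:
    point[0] = 9
    tuple_changed = True
except TypeError:
    tuple_changed = False


def squares_of_even(numbers):
    """Hypothetical helper: list comprehension with a condition."""
    return [n ** 2 for n in numbers if n % 2 == 0]


# range(1, 11) covers 1 to 10, matching 1:10 in R.
total = 0
for i in range(1, 11):
    if i > 3:
        total += i

even_squares = squares_of_even(range(1, 11))

# Function call versus method call.
mean_value = statistics.mean([1, 2, 3, 4])   # function applied to the data
upper = "hello".upper()                      # method that belongs to the object

print(d["a"], total, even_squares, mean_value, upper)
```

The block under def and the block under for both end where the indentation returns to the previous level, which is the rule described in the "Code blocks" row.
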

These are differences in naming or notational conventions that do not cause major changes in the structure of the code, but which might require changing the name of a keyword or function. Items in this list will cause an obvious error (e.g. the code will not run) if the wrong convention is used.

Topic | R | Python |
---|---|---|
Concatenating strings | Use paste() or paste0(), e.g. paste("Hello,", "world!"). paste() separates the strings with a space by default and the separator can be changed with sep, e.g. paste("Line 1", "Line 2", sep = "\n") puts a line break between them; paste0() adds no separator. cat() prints the concatenated strings to the console rather than returning them. | Use +, e.g. "Hello, " + "World!". Other options are format() and join(): concatenated_string = "{}{}".format(string1, string2) or concatenated_string = "_".join([string1, string2]) |
Displaying text | Use print(x) to display a single object. To display several strings or variables together, use cat() | Use print(). It accepts several strings or variables and prints them separated by spaces. |
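Exponentiation | Can use a ** b or a^b | Use a ** b. Note that for integers a ^ b runs without error in Python but is the bitwise XOR operator, not a power. |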
Modular arithmetic | Use a %% b | Use a % b |
Integer division, discarding remainder | Use a %/% b | Use a // b |
Determine type of a variable | Use typeof(x) | Use type(x) |
Change type of a variable | General format of the function is “as.type()”. Example: as.integer(x) | General format of the function is “type()”. Example: int(x) |
Boolean variables | Use all-caps, TRUE and FALSE | Capitalize only first letter, True and False |
Boolean operators | Use the symbols &, \| and !. For containment in a vector use %in% | Use the keywords and, or and not. For containment use in |
Install package | install.packages('name') | pip install name |
Importing additional functionality | These are called packages in R. Use library(package). Once a package is attached, all of its exported functions are available by name. | These are called modules (or packages, which are collections of modules) in Python. Use import package to load it and access its contents with a prefix, e.g. import pandas and then pandas.DataFrame. Use from package import name to import one specific module or function, e.g. from sklearn import metrics. To make every public name available without a prefix, use from package import *, e.g. from pandas import *, although this can overwrite names that already exist. |
Comment out | Ctrl + Shift + C | Windows: Ctrl + 3; Mac: Cmd + 3 |
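Create a function | Use function(): function_name <- function(arg1, arg2){ return(value) } | Use def: def function_name(arg1, arg2): followed by an indented return value. Note that return is a statement, not a function, so return() with empty parentheses returns an empty tuple. |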
Lambda functions | There is no lambda keyword, but anonymous functions can be written with function(), e.g. sapply(1:3, function(x) x^2) | Lambda functions are anonymous functions in Python, meaning they are functions without a name. They are used to perform a small task or calculation and are often combined with other functions such as filter(), map() or functools.reduce(). The syntax is lambda arguments: expression, e.g. f = lambda x: x**2, after which print(f(5)) displays 25 |
Conditional expression | Use ifelse(test, yes, no), which also works element by element on vectors | There is no ifelse() function. Use a conditional expression: result = x if x > y else y |
Call for help | Type in the console: ?function_name, e.g. ?mutate. Use ??keyword to search the help pages of all installed packages, which helps to find the package that contains a function. | help(function_name), e.g. help(len). Note: Python also has dir(), a function that returns a list of valid attributes and methods of an object. For example, dir(list) returns the attributes and methods available for the built-in list type |
Check available built-in functions | ls("package:base") returns a character vector of all the functions in the base package; help(base) shows a list of the functions in the base package, along with brief descriptions of each. | dir(__builtins__) |
Unequal | != | != (the <> operator from Python 2 was removed in Python 3) |
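
Several rows of the table above can be checked directly in Python. The sketch below uses made-up variable names and only built-in functions plus functools.reduce from the standard library.

```python
from functools import reduce

# String concatenation: +, format() and join().
greeting = "Hello, " + "World!"
formatted = "{}{}".format("Hello, ", "World!")
joined = "_".join(["Hello", "World"])

# print() accepts several values and separates them with spaces.
print("Greeting:", greeting, joined)

# Arithmetic operators.
power = 2 ** 3        # exponentiation
xor = 2 ^ 3           # bitwise XOR, not a power
remainder = 7 % 3     # modular arithmetic
quotient = 7 // 3     # integer division

# Type conversion and type inspection.
as_int = int("42")
kind = type(as_int)

# Boolean keywords and containment.
is_member = 2 in [1, 2, 3] and not False

# Lambda functions combined with map() and reduce().
square = lambda x: x ** 2
squares = list(map(square, [1, 2, 3]))
total = reduce(lambda acc, x: acc + x, squares)

# Conditional expression in place of R's ifelse().
x, y = 5, 3
larger = x if x > y else y

print(power, xor, remainder, quotient, kind, squares, total, larger)
```
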

These are differences in fundamental data cleaning steps when working with data frames.
In R, base R and dplyr are the two main tools for data frame manipulation; in Python, the equivalent is pandas, used alongside the core language.

Topic | R | Python |
---|---|---|
Check structure | str(df) | df.info() |
Data dimension | dim(df) | df.shape. Note: shape is an attribute, so it takes no parentheses |
Variables of data frame | colnames(df) | df.columns. Note: columns is an attribute, so it takes no parentheses |
Drop columns | Single: df$column <- NULL. Multiple, by index: df[ , -c(column_index_1, column_index_2)]. Multiple, by name: df[ , !names(df) %in% c("column_name1", "column_name2")] | Single: del df["column_name"], or use drop() with one column name. Multiple: df.drop(columns=["column_1", "column_2"], inplace=True) |
Check unique values | unique(df$column_name) | df["column_name"].unique() lists the distinct values; df["column_name"].value_counts() counts each one. Pass normalize=True to value_counts() to get the proportion of each value instead. |
Check duplicated observations | summary(duplicated(df)) | df.duplicated().sum() |
Drop duplicated values | df[!duplicated(df), ] or, using dplyr: df %>% distinct() | df.drop_duplicates(inplace=True) |
Check missing values | Single column: is.na(df$column_name). Multiple columns: sapply(df, function(x) sum(is.na(x))) | df.isnull().sum() |
Drop NA | new_df <- na.omit(df) | new_df = df.dropna() |
Fill NA | Use replace_na() from tidyr, na.fill() from the zoo library, or base R: df[is.na(df$col), "col"] <- value | Specify the value, and optionally the column: df.fillna(value) for the whole data frame or df.column_name.fillna(value) for one column. Both return a new object, so assign the result to keep it. |
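
The pandas calls in the table above can be combined into a small cleaning workflow. This is a sketch on a made-up data frame, assuming pandas is installed; the column names and values are hypothetical, and the non-inplace forms are used so that each step returns a new data frame.

```python
import pandas as pd

# Hypothetical data with one duplicated row and one missing value.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cy"],
    "score": [90.0, 85.0, 85.0, None],
    "extra": [1, 2, 2, 4],
})

n_rows, n_cols = df.shape                    # attribute, no parentheses
column_names = list(df.columns)              # attribute, no parentheses

n_duplicates = int(df.duplicated().sum())    # rows repeating an earlier row
deduplicated = df.drop_duplicates()          # returns a new data frame

missing_per_column = df.isnull().sum()       # missing values per column
filled = deduplicated.fillna({"score": 0.0}) # fill NA in one column
dropped = deduplicated.dropna()              # drop rows containing NA

trimmed = deduplicated.drop(columns=["extra"])
unique_names = list(df["name"].unique())

print(n_rows, n_cols, n_duplicates, unique_names)
```

Passing inplace=True, as in the table, modifies df directly instead of returning a new data frame; the two styles should not be mixed on the same line, because the inplace form returns None.
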