data-8 / textbook

The textbook Computational and Inferential Thinking: The Foundations of Data Science

Home Page: http://www.inferentialthinking.com

Faster code for "Simulating a Statistic" (in Chapter 10.3)

mycarta opened this issue

With reference to "Simulating a Statistic", and in particular "Step 4: Write the code to generate an array of simulated values", the code below is exceedingly slow (which, to be fair, is stated in the text):

%%timeit
medians = make_array()
for i in np.arange(5000):
    medians = np.append(medians, random_sample_median())

>>> 10.9 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I understand this would require introducing more concepts, but I still wonder whether it may be worth showing how to use np.random.choice to draw all 5000 repetitions of 1000 samples in one go, and then a list comprehension to apply np.median to each one, which is a lot faster:

%%timeit
medians=[np.median(a) for a in np.random.choice(united.column('Delay'), (5000, 1000))]
medians = np.array(medians)
 
>>> 188 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
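As a further sketch along the same lines (not something the textbook does), the list comprehension can be replaced by a single np.median call with axis=1, which computes one median per row. The `delays` array below is a synthetic, hypothetical stand-in for united.column('Delay'), and `simulate_medians` is an illustrative helper name, not part of the course code. No timing is claimed here.

```python
import numpy as np

# Hypothetical stand-in for united.column('Delay'): synthetic delays, not real data.
rng = np.random.default_rng(0)
delays = rng.exponential(scale=20.0, size=10_000)


def simulate_medians(values, repetitions=5000, sample_size=1000, rng=None):
    """Return one sample median per repetition, sampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    # A single call draws every sample at once: shape (repetitions, sample_size).
    samples = rng.choice(values, size=(repetitions, sample_size), replace=True)
    # axis=1 takes the median of each row, so no Python-level loop is needed.
    return np.median(samples, axis=1)


medians = simulate_medians(delays, rng=rng)
print(medians.shape)
```

The result is already a NumPy array with one entry per repetition, so the separate np.array(medians) conversion step is not needed in this variant.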

Thank you for the analysis and the idea. That's definitely a better solution for production code. However, I think it's too much for this course.

This is an introductory course. In this part we're trying to focus on the statistical concepts, not on the Python code or on efficient implementation. I think this is too much of a distraction from what we're trying to teach. Our primary goal isn't to teach good software engineering or programming skills; our primary goal is to teach them just enough Python to illustrate the concepts we're teaching and to allow them to learn and use those concepts. Hopefully they'll learn better algorithms in later courses. I'd much rather have their computer take an extra 10 seconds than have us spend the time it would take to explain the extra material they'd need to understand this code. Sorry if I'm repeating things you already know.

Thank you, I appreciate your feedback on my proposed approach.
And I do understand the motivation behind your comments; they make a lot of sense.
Also, while at it, I just wanted to say how much I've been enjoying the course (lectures, notebooks, and labs) and the textbook. I've been learning Python and data science on my own through hackathons, conferences, projects, and reading, but have been on the fence about taking courses for credit... until I discovered data8. Awesome!