- Semester: Fall 2022
- Instructor: Jimmy Lin
- Time & Location: Mondays 12:30-2:50, DC 2568
Graduate students in computer science aspire to "do computer science" (research), but what exactly does that mean? It involves, among a multitude of activities, reading papers, learning the "state of the art", advancing knowledge, writing papers, and (hopefully) getting them published. Graduate students learn how to do these things under the tutelage of professors, but rarely is there explicit or deliberate instruction on these myriad activities. With a focus on empirical computer science, this course covers elements that comprise the research enterprise, synthesizing both "art" (personal experiences I have accumulated over the years) and "science" (insights derived from quantitative analyses). The hope is that knowledge and actionable advice from this course will help graduate students better understand research, leading to more productive and fulfilling careers.
Material for this course will draw from "The Science of Science" by Wang and Barabási, academic papers, as well as other sources on the web.
Context is important. Most of the questions and issues we grapple with in this course have no simple answer. Nearly always, it depends on context. As such, it is important to properly scope the coverage of this course.
Wang and Barabási attempt to paint broad strokes with their work, encompassing all of science. However, some of their findings and recommendations may seem at odds with realities in computer science. Much of the "art" (e.g., advice, best practices, etc.) covered in this course is drawn from my personal experience, which will of course be colored by my own background. I am a computer scientist (actually, also a formal linguist) by training, and I work presently at the intersection of natural language processing (NLP) and information retrieval (IR), although over the years I have dabbled in other sub-disciplines of computer science as well.
For lack of a better term, I characterize the focus of this course as "empirical computer science", but it really is a shorthand for "stuff that I have worked on" and "stuff that I am familiar with". NLP and IR can be characterized as "applied machine learning", so perhaps that's a more accurate scope. The contents of this course will certainly be relevant to graduate students wishing to pursue these topics, and, I suspect, to those in related sub-disciplines of computer science such as data mining, or perhaps even computer vision (although I don't work in those fields). However, I am quite certain that portions of this course will not apply to, for example, theoretical machine learning and complexity. All findings, advice, recommendations, etc. need to be properly contextualized.
Week | Date | Type | Description |
---|---|---|---|
1 | 9/12 | - | Introduction [Slides] |
2 | 9/19 | "science" | The Science of Career [Slides] |
3 | 9/26 | "science" | The Science of Collaboration [Slides] |
4 | 10/3 | "science" | The Science of Impact [Slides] |
5 | 10/17 | - | Presentation of Visualization Projects |
6 | 10/24 | "science" | The Science of Impact (Still) [Slides] |
7 | 10/31 | "art" | Research as a Social Process [Slides] |
8 | 11/7 | "art" | Working With Your Advisor [Slides] |
9 | 11/14 | "art" | On Writing Papers [Slides] |
10 | 11/21 | "art" | Responsible Research [Slides] |
11 | 11/28 | "science" | Paper Presentations (I) |
12 | 12/5 | "science" | Paper Presentations (II) |
Weight | Component |
---|---|
15% | Debate participation |
15% | Paper presentation |
20% | Visualization project |
40% | Final project |
10% | Class participation |
In addition to weekly preparation (readings and other material), the course will have the following assignments:
- Preparation for and participation in a debate. Debates will be scattered throughout the semester, with each debate topic complementing the topic of that week.
- Presentation of a paper on meta-research (Weeks 11 and 12).
- Visualization project due in mid-October.
- Final project due at the end of the semester.
Slides: [PDF]
For more details on normative vs. positive approaches, Wikipedia provides a good starting point: Positivism and Normativity.
Readings (to be completed prior to the class session): "The Science of Science" by Wang and Barabási, Part 1: The Science of Career (pages 5-80).
Slides: [PDF]
We'll be having our debate on Topic 1: How should we evaluate excellence? Quality only or quality and quantity?
- Position A: Researchers should be evaluated solely on the quality of their publications. Quantity is irrelevant and we shouldn't even bother counting.
- Position B: Researchers should be evaluated on both the quality and quantity of their publications. High-quality publications are of course important, but quantity is also an important component of excellence.
Readings (to be completed prior to the class session): "The Science of Science" by Wang and Barabási, Part 2: The Science of Collaboration (pages 81-158).
Slides: [PDF]
We'll be having our debate on Topic 2: Should you collaborate or not?
- Position A: Early-stage researchers should actively seek out collaborations beyond their research group. Participation in multiple research projects across many different groups builds breadth.
- Position B: Early-stage researchers should not actively seek out collaborations beyond their research group. Focusing on depth is more important than breadth.
Readings (to be completed prior to the class session): "The Science of Science" by Wang and Barabási, Part 3: The Science of Impact (pages 159-219).
Slides: [PDF]
We'll be having our debate on Topic 3: How should you approach open-sourcing computational artifacts associated with your work?
- Position A: Early-stage researchers should do the minimum in open-sourcing computational artifacts that arise from their work. Doing anything more than the community norm is a waste of time and effort that could be better spent writing more papers.
- Position B: Early-stage researchers should actively promote the adoption of computational artifacts that arise from their work, for example, contributing to popular open-source libraries. Even if this requires a lot of time (e.g., refactoring code into a production-ready state), such efforts are worthwhile.
Presentation of visualization projects!
Slides: [PDF]
We'll be having our debate on Topic 4: Is social media a waste of time?
- Position A: Early-stage researchers should actively incorporate social media use as a component of their career development. This means appropriate use of sites like Twitter, Facebook, and LinkedIn to build professional reputation, engage with the community, hear about recent work by others, etc.
- Position B: Early-stage researchers should stay off social media. It's a complete waste of time.
Links to the case studies of impact that we discussed in class:
- LUCENE-2959: Implementing State of the Art Ranking for Lucene
- LUCENE-4100: Maxscore - Efficient Scoring
- LUCENE-8135: Implement Block-Max WAND
- Lucene v8.0.0 Release Notes
Slides: [PDF]
Links to content discussed in class:
Supplemental readings:
- Becerra et al. Maximizing the Conference Experience: Tips to Effectively Navigate Academic Conferences Early in Professional Careers. Behavior Analysis in Practice, 13(3):479-491, 2020.
- Leininger et al. Ten Simple Rules for Attending Your First Conference. PLoS Computational Biology, 17(7):e1009133, 2021.
Slides: [PDF]
Links to content discussed in class:
Slides: [PDF]
Papers used in the abstract analysis exercise:
- Deng et al. ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009. (45k citations)
- Krizhevsky et al. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 2012. (119k citations)
- Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. (53k citations)
- Radford et al. Improving Language Understanding by Generative Pre-Training. (4k citations)
- Peters et al. Deep Contextualized Word Representations. NAACL 2018. (11k citations)
Links to content discussed in class:
- Writing Is Thinking
- Mensh and Kording. Ten simple rules for structuring papers. PLoS Computational Biology, 13(9):e1005619, 2017.
- Baquero. Picking Publication Targets. CACM, 65(3):10-11, 2022.
- My writing pet peeves
Slides: [PDF]
Links to content discussed in class:
- Distributive Justice: entry from Stanford Encyclopedia of Philosophy.
- Dressel and Farid. The accuracy, fairness, and limits of predicting recidivism. Science Advances, 4(1), 2018.
- Friedler et al. The (Im)possibility of fairness: different value systems require different mechanisms for fair decision making. CACM, 64(4):136-143, 2021.
- Ghassemi et al. The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health, 3(11):e745-e750, 2021.
- The impact of the COVID-19 pandemic on scientific research in the life sciences
- Massive covidization of research citations and the citation elite
- The evolution of citation graphs in artificial intelligence research
- What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
- Examining Citations of Natural Language Processing Literature
- NLP Scholar: A Dataset for Examining the State of NLP Research
- Geographic Citation Gaps in NLP Research
- Quotation accuracy in educational research articles
- Hydrology research articles are becoming more topically diverse
- Systematic inequality and hierarchy in faculty hiring networks
- Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks
- Early-career setback and future career impact
- Reviewer bias in single- versus double-blind peer review
- The Global Burden of Journal Peer Review in the Biomedical Literature: Strong Imbalance in the Collective Enterprise
- Are You Open? A Content Analysis of Transparency and Openness Guidelines in HCI Journals
- To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online
- Meta-assessment of bias in science
- There is a blind spot in AI research
- Gender Diversity in Research Teams and Citation Impact in Economics and Management
- Men Set Their Own Cites High: Gender and Self-citation across Fields and over Time
- How much is too much? The difference between research influence and self-citation excess
- Relative Citation Ratio (RCR): A New Metric That Uses Citation Rates to Measure Influence at the Article Level