venuv123 / Data-findings-

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Communicating data findings of globally-selected 15/16-year-old students in Mathematics,and Science Literacy, based on the results of the PISA 2012 test


Table of Contents

Introduction of the topic and dataset

Introduction to PISA (ref.: NCES 2014-024, U.S. Department of Education)

What Is PISA?

The Program for International Student Assessment (PISA) is a system of international assessments that allows countries to compare outcomes of learning as students near the end of compulsory schooling. PISA core assessments measure the performance of 15-year-old students in mathematics, science, and reading literacy every 3 years. Coordinated by the Organization for Economic Cooperation and Development (OECD), PISA was first implemented in 2000 in 32 countries. It has since grown to 65 education systems in 2012.

What PISA Measures

PISA’s goal is to assess students’ preparation for the challenges of life as young adults. PISA assesses the application of knowledge in mathematics, science, and reading literacy to problems within a reallife context (OECD 1999). PISA does not focus explicitly on curricular outcomes and uses the term “literacy” in each subject area to indicate its broad focus on the application of knowledge and skills. For example, when assessing mathematics, PISA examines how well 15-year-old students can understand, use, and reflect on mathematics for a variety of real-life problems and settings that they may not encounter in the classroom. Scores on the PISA scales represent skill levels along a continuum of literacy skills.

Each PISA data collection cycle assesses one of the three core subject areas in depth (considered the major subject area), although all three core subjects are assessed in each cycle (the other two subjects are considered minor subject areas for that assessment year). Assessing all three subjects every 3 years allows countries to have a consistent source of achievement data in each of the three subjects, while rotating one area as the primary focus over the years. Mathematics was the major subject area in 2012, as it was in 2003, since each subject is a major subject area once every three cycles. In 2012, mathematics, science, and reading literacy were assessed primarily through a paper-and-pencil assessment, and problem solving was administered via a computer-based assessment. In addition to these core assessments, education systems could participate in optional paper-based financial literacy and computer-based mathematics and reading assessments. The United States participated in these optional assessments. Visit www.nces.ed.gov/surveys/pisa for more information on the PISA assessments, including information on how the assessments were designed and examples of PISA questions.

Introduction to the PISA 2012 dataset

PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test. Rather than examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school.

Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.


Dataset Investigation and Preliminary Wrangling

Since the dataset provided has 636 variables (according to its specified data dictionary), we will begin our exploration by wrangling the data accordingly, in order to better understand which variables might be worth delving into.

# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

sb.set()

# We are interested in exploring the formatting of all columns (variables), hence we will display all of them
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)
df = pd.read_csv('pisa2012b.csv', encoding='latin-1', low_memory = False)
df.head(3)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Unnamed: 0 CNT SUBNATIO STRATUM OECD NC SCHOOLID STIDSTD ST01Q01 ST02Q01 ST03Q01 ST03Q02 ST04Q01 ST05Q01 ST06Q01 ST07Q01 ST07Q02 ST07Q03 ST08Q01 ST09Q01 ST115Q01 ST11Q01 ST11Q02 ST11Q03 ST11Q04 ST11Q05 ST11Q06 ST13Q01 ST14Q01 ST14Q02 ST14Q03 ST14Q04 ST15Q01 ST17Q01 ST18Q01 ST18Q02 ST18Q03 ST18Q04 ST19Q01 ST20Q01 ST20Q02 ST20Q03 ST21Q01 ST25Q01 ST26Q01 ST26Q02 ST26Q03 ST26Q04 ST26Q05 ST26Q06 ST26Q07 ST26Q08 ST26Q09 ST26Q10 ST26Q11 ST26Q12 ST26Q13 ST26Q14 ST26Q15 ST26Q16 ST26Q17 ST27Q01 ST27Q02 ST27Q03 ST27Q04 ST27Q05 ST28Q01 ST29Q01 ST29Q02 ST29Q03 ST29Q04 ST29Q05 ST29Q06 ST29Q07 ST29Q08 ST35Q01 ST35Q02 ST35Q03 ST35Q04 ST35Q05 ST35Q06 ST37Q01 ST37Q02 ST37Q03 ST37Q04 ST37Q05 ST37Q06 ST37Q07 ST37Q08 ST42Q01 ST42Q02 ST42Q03 ST42Q04 ST42Q05 ST42Q06 ST42Q07 ST42Q08 ST42Q09 ST42Q10 ST43Q01 ST43Q02 ST43Q03 ST43Q04 ST43Q05 ST43Q06 ST44Q01 ST44Q03 ST44Q04 ST44Q05 ST44Q07 ST44Q08 ST46Q01 ST46Q02 ST46Q03 ST46Q04 ST46Q05 ST46Q06 ST46Q07 ST46Q08 ST46Q09 ST48Q01 ST48Q02 ST48Q03 ST48Q04 ST48Q05 ST49Q01 ST49Q02 ST49Q03 ST49Q04 ST49Q05 ST49Q06 ST49Q07 ST49Q09 ST53Q01 ST53Q02 ST53Q03 ST53Q04 ST55Q01 ST55Q02 ST55Q03 ST55Q04 ST57Q01 ST57Q02 ST57Q03 ST57Q04 ST57Q05 ST57Q06 ST61Q01 ST61Q02 ST61Q03 ST61Q04 ST61Q05 ST61Q06 ST61Q07 ST61Q08 ST61Q09 ST62Q01 ST62Q02 ST62Q03 ST62Q04 ST62Q06 ST62Q07 ST62Q08 ST62Q09 ST62Q10 ST62Q11 ST62Q12 ST62Q13 ST62Q15 ST62Q16 ST62Q17 ST62Q19 ST69Q01 ST69Q02 ST69Q03 ST70Q01 ST70Q02 ST70Q03 ST71Q01 ST72Q01 ST73Q01 ST73Q02 ST74Q01 ST74Q02 ST75Q01 ST75Q02 ST76Q01 ST76Q02 ST77Q01 ST77Q02 ST77Q04 ST77Q05 ST77Q06 ST79Q01 ST79Q02 ST79Q03 ST79Q04 ST79Q05 ST79Q06 ST79Q07 ST79Q08 ST79Q10 ST79Q11 ST79Q12 ST79Q15 ST79Q17 ST80Q01 ST80Q04 ST80Q05 ST80Q06 ST80Q07 ST80Q08 ST80Q09 ST80Q10 ST80Q11 ST81Q01 ST81Q02 ST81Q03 ST81Q04 ST81Q05 ST82Q01 ST82Q02 ST82Q03 ST83Q01 ST83Q02 ST83Q03 ST83Q04 ST84Q01 ST84Q02 ST84Q03 ST85Q01 ST85Q02 ST85Q03 ST85Q04 ST86Q01 ST86Q02 ST86Q03 ST86Q04 ST86Q05 ST87Q01 ST87Q02 ST87Q03 ST87Q04 ST87Q05 ST87Q06 ST87Q07 ST87Q08 ST87Q09 ST88Q01 ST88Q02 ST88Q03 ST88Q04 ST89Q02 ST89Q03 ST89Q04 ST89Q05 ST91Q01 ST91Q02 ST91Q03 ST91Q04 ST91Q05 ST91Q06 ST93Q01 ST93Q03 ST93Q04 ST93Q06 ST93Q07 ST94Q05 ST94Q06 ST94Q09 ST94Q10 ST94Q14 ST96Q01 ST96Q02 ST96Q03 ST96Q05 ST101Q01 ST101Q02 ST101Q03 ST101Q05 ST104Q01 ST104Q04 ST104Q05 ST104Q06 IC01Q01 IC01Q02 IC01Q03 IC01Q04 IC01Q05 IC01Q06 IC01Q07 IC01Q08 IC01Q09 IC01Q10 IC01Q11 IC02Q01 IC02Q02 IC02Q03 IC02Q04 IC02Q05 IC02Q06 IC02Q07 IC03Q01 IC04Q01 IC05Q01 IC06Q01 IC07Q01 IC08Q01 IC08Q02 IC08Q03 IC08Q04 IC08Q05 IC08Q06 IC08Q07 IC08Q08 IC08Q09 IC08Q11 IC09Q01 IC09Q02 IC09Q03 IC09Q04 IC09Q05 IC09Q06 IC09Q07 IC10Q01 IC10Q02 IC10Q03 IC10Q04 IC10Q05 IC10Q06 IC10Q07 IC10Q08 IC10Q09 IC11Q01 IC11Q02 IC11Q03 IC11Q04 IC11Q05 IC11Q06 IC11Q07 IC22Q01 IC22Q02 IC22Q04 IC22Q06 IC22Q07 IC22Q08 EC01Q01 EC02Q01 EC03Q01 EC03Q02 EC03Q03 EC03Q04 EC03Q05 EC03Q06 EC03Q07 EC03Q08 EC03Q09 EC03Q10 EC04Q01A EC04Q01B EC04Q01C EC04Q02A EC04Q02B EC04Q02C EC04Q03A EC04Q03B EC04Q03C EC04Q04A EC04Q04B EC04Q04C EC04Q05A EC04Q05B EC04Q05C EC04Q06A EC04Q06B EC04Q06C EC05Q01 EC06Q01 EC07Q01 EC07Q02 EC07Q03 EC07Q04 EC07Q05 EC08Q01 EC08Q02 EC08Q03 EC08Q04 EC09Q03 EC10Q01 EC11Q02 EC11Q03 EC12Q01 ST22Q01 ST23Q01 ST23Q02 ST23Q03 ST23Q04 ST23Q05 ST23Q06 ST23Q07 ST23Q08 ST24Q01 ST24Q02 ST24Q03 CLCUSE1 CLCUSE301 CLCUSE302 DEFFORT QUESTID BOOKID EASY AGE GRADE PROGN ANXMAT ATSCHL ATTLNACT BELONG BFMJ2 BMMJ1 CLSMAN COBN_F COBN_M COBN_S COGACT CULTDIST CULTPOS DISCLIMA ENTUSE ESCS EXAPPLM EXPUREM FAILMAT FAMCON FAMCONC FAMSTRUC FISCED HEDRES HERITCUL HISCED HISEI HOMEPOS HOMSCH HOSTCUL ICTATTNEG ICTATTPOS ICTHOME ICTRES ICTSCH IMMIG INFOCAR INFOJOB1 INFOJOB2 INSTMOT INTMAT ISCEDD ISCEDL ISCEDO LANGCOMM LANGN LANGRPPD LMINS MATBEH MATHEFF MATINTFC MATWKETH MISCED MMINS MTSUP OCOD1 OCOD2 OPENPS OUTHOURS PARED PERSEV REPEAT SCMAT SMINS STUDREL SUBNORM TCHBEHFA TCHBEHSO TCHBEHTD TEACHSUP TESTLANG TIMEINT USEMATH USESCH WEALTH ANCATSCHL ANCATTLNACT ANCBELONG ANCCLSMAN ANCCOGACT ANCINSTMOT ANCINTMAT ANCMATWKETH ANCMTSUP ANCSCMAT ANCSTUDREL ANCSUBNORM PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV1MACC PV2MACC PV3MACC PV4MACC PV5MACC PV1MACQ PV2MACQ PV3MACQ PV4MACQ PV5MACQ PV1MACS PV2MACS PV3MACS PV4MACS PV5MACS PV1MACU PV2MACU PV3MACU PV4MACU PV5MACU PV1MAPE PV2MAPE PV3MAPE PV4MAPE PV5MAPE PV1MAPF PV2MAPF PV3MAPF PV4MAPF PV5MAPF PV1MAPI PV2MAPI PV3MAPI PV4MAPI PV5MAPI PV1READ PV2READ PV3READ PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE W_FSTUWT W_FSTR1 W_FSTR2 W_FSTR3 W_FSTR4 W_FSTR5 W_FSTR6 W_FSTR7 W_FSTR8 W_FSTR9 W_FSTR10 W_FSTR11 W_FSTR12 W_FSTR13 W_FSTR14 W_FSTR15 W_FSTR16 W_FSTR17 W_FSTR18 W_FSTR19 W_FSTR20 W_FSTR21 W_FSTR22 W_FSTR23 W_FSTR24 W_FSTR25 W_FSTR26 W_FSTR27 W_FSTR28 W_FSTR29 W_FSTR30 W_FSTR31 W_FSTR32 W_FSTR33 W_FSTR34 W_FSTR35 W_FSTR36 W_FSTR37 W_FSTR38 W_FSTR39 W_FSTR40 W_FSTR41 W_FSTR42 W_FSTR43 W_FSTR44 W_FSTR45 W_FSTR46 W_FSTR47 W_FSTR48 W_FSTR49 W_FSTR50 W_FSTR51 W_FSTR52 W_FSTR53 W_FSTR54 W_FSTR55 W_FSTR56 W_FSTR57 W_FSTR58 W_FSTR59 W_FSTR60 W_FSTR61 W_FSTR62 W_FSTR63 W_FSTR64 W_FSTR65 W_FSTR66 W_FSTR67 W_FSTR68 W_FSTR69 W_FSTR70 W_FSTR71 W_FSTR72 W_FSTR73 W_FSTR74 W_FSTR75 W_FSTR76 W_FSTR77 W_FSTR78 W_FSTR79 W_FSTR80 WVARSTRR VAR_UNIT SENWGT_STU VER_STU
0 1 Albania 80000 ALB0006 Non-OECD Albania 1 1 10 1.0 2 1996 Female No 6.0 No, never No, never No, never None None 1.0 Yes Yes Yes Yes NaN NaN <ISCED level 3A> No No No No Other (e.g. home duties, retired) <ISCED level 3A> NaN NaN NaN NaN Working part-time <for pay> Country of test Country of test Country of test NaN Language of the test Yes No Yes No No No No Yes No Yes No Yes No Yes 8002 8001 8002 Two One None None None 0-10 books Agree Strongly agree Agree Agree Agree Agree Agree Strongly agree Disagree Agree Disagree Agree Agree Agree Not at all confident Not very confident Confident Confident Confident Not at all confident Confident Very confident Agree Disagree Agree Agree Agree Agree Agree Disagree Disagree Disagree Agree Disagree Disagree Agree NaN Disagree Likely Slightly likely Likely Likely Likely Very Likely Agree Agree Agree Agree Agree Agree Agree Agree Agree Courses after school Test Language Major in college Science Study harder Test Language Maximum classes Science Pursuing a career Math Often Sometimes Sometimes Sometimes Sometimes Never or rarely Never or rarely Never or rarely NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Every Lesson Every Lesson Every Lesson Every Lesson Every Lesson Never or Hardly Ever Most Lessons Never or Hardly Ever Every Lesson Most Lessons Every Lesson Every Lesson Every Lesson Never or Hardly Ever Most Lessons Every Lesson Every Lesson Every Lesson Always or almost always Sometimes Never or rarely Always or almost always Always or almost always Always or almost always Always or almost always Often Often Never or Hardly Ever Never or Hardly Ever Never or Hardly Ever Never or Hardly Ever Never or Hardly Ever Strongly disagree Strongly disagree Strongly disagree Strongly disagree Agree Agree Agree Strongly agree Strongly agree Disagree Agree Strongly disagree Disagree Agree Agree Strongly disagree Agree Agree Disagree Agree Agree Strongly disagree Strongly agree Strongly agree Strongly disagree Agree Strongly disagree Agree Agree Strongly agree Strongly disagree Strongly disagree Agree Strongly agree Strongly agree Strongly agree Strongly agree Strongly agree Strongly agree Strongly disagree Disagree Strongly disagree Very much like me Very much like me Very much like me Somewhat like me Very much like me Somewhat like me Mostly like me Mostly like me Mostly like me Somewhat like me definitely do this definitely do this definitely do this definitely do this 4.0 2.0 1.0 1.0 1.0 2.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 99 99 99 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN A Simple calculator 99 99 99 StQ Form B booklet 7 Standard set of booklets 16.17 0.0 Albania: Upper secondary education 0.32 -2.31 0.5206 -1.18 76.49 79.74 -1.3771 Albania Albania Albania 0.6994 NaN -0.48 1.85 NaN NaN NaN NaN 0.6400 NaN NaN 2.0 ISCED 3A, ISCED 4 -1.29 NaN ISCED 3A, ISCED 4 NaN -2.61 NaN NaN NaN NaN NaN -3.16 NaN Native NaN NaN NaN 0.80 0.91 A ISCED level 3 General NaN Albanian NaN NaN 0.6426 -0.77 -0.7332 0.2882 ISCED 3A, ISCED 4 NaN -0.9508 Building architects Primary school teachers 0.0521 NaN 12.0 -0.3407 Did not repeat a <grade> 0.41 NaN -1.04 -0.0455 1.3625 0.9374 0.4297 1.68 Albanian NaN NaN NaN -2.92 -1.8636 -0.6779 -0.7351 -0.7808 -0.0219 -0.1562 0.0486 -0.2199 -0.5983 -0.0807 -0.5901 -0.3346 406.8469 376.4683 344.5319 321.1637 381.9209 325.8374 324.2795 279.8800 267.4170 312.5954 409.1837 388.1524 373.3525 389.7102 415.4152 351.5423 375.6894 341.4161 386.5945 426.3203 396.7207 334.4057 328.9531 339.8582 354.6580 324.2795 345.3108 381.1419 380.3630 346.8687 319.6059 345.3108 360.8895 390.4892 322.7216 290.7852 345.3108 326.6163 407.6258 367.1210 249.5762 254.3420 406.8496 175.7053 218.5981 341.7009 408.8400 348.2283 367.8105 392.9877 8.9096 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 13.7954 4.5315 4.3313 13.7954 13.9235 4.3389 4.3313 4.5084 4.5084 13.7954 4.5315 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 13.7954 4.5315 4.3313 13.7954 13.9235 4.3389 4.3313 4.5084 4.5084 13.7954 4.5315 4.5084 4.5315 13.0829 4.5315 4.3313 4.5084 4.5084 13.7954 13.9235 4.3389 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 4.5084 4.5315 13.0829 4.5315 4.3313 4.5084 4.5084 13.7954 13.9235 4.3389 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
1 2 Albania 80000 ALB0006 Non-OECD Albania 1 2 10 1.0 2 1996 Female Yes, for more than one year 7.0 No, never No, never No, never One or two times None 1.0 Yes Yes NaN Yes NaN NaN <ISCED level 3A> Yes Yes No No Working full-time <for pay> <ISCED level 3A> No No No No Working full-time <for pay> Country of test Country of test Country of test NaN Language of the test Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes 8001 8001 8002 Three or more Three or more Three or more Two Two 201-500 books Disagree Strongly agree Disagree Disagree Agree Agree Disagree Disagree Strongly agree Strongly agree Disagree Agree Disagree Agree Confident Very confident Very confident Confident Very confident Confident Very confident Not very confident NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Strongly agree Strongly agree Strongly disagree Disagree Agree Disagree Likely Slightly likely Slightly likely Very Likely Slightly likely Likely Agree Agree Strongly agree Strongly agree Strongly agree Agree Agree Disagree Agree Courses after school Math Major in college Science Study harder Math Maximum classes Science Pursuing a career Science Sometimes Often Always or almost always Sometimes Always or almost always Never or rarely Never or rarely Often relating to known Improve understanding in my sleep Repeat examples I do not attend <out-of-school time lessons> i... 2 or more but less than 4 hours a week 2 or more but less than 4 hours a week Less than 2 hours a week NaN NaN 6.0 0.0 0.0 2.0 Rarely Rarely Frequently Sometimes Frequently Sometimes Frequently Never Frequently Know it well, understand the concept Know it well, understand the concept Heard of it once or twice Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept Never heard of it Know it well, understand the concept Know it well, understand the concept Never heard of it Know it well, understand the concept Heard of it once or twice Know it well, understand the concept Know it well, understand the concept Never heard of it Heard of it often 45.0 45.0 45.0 7.0 6.0 2.0 NaN 30.0 Frequently Sometimes Frequently Frequently Sometimes Sometimes Sometimes Sometimes NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Not at all like me Not at all like me Mostly like me Somewhat like me Very much like me Somewhat like me Not much like me Not much like me Mostly like me Not much like me probably not do this probably do this probably not do this probably do this 1.0 2.0 3.0 2.0 2.0 3.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 99 99 99 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN A Simple calculator 99 99 99 StQ Form A booklet 9 Standard set of booklets 16.17 0.0 Albania: Upper secondary education NaN NaN NaN NaN 15.35 23.47 NaN Albania Albania Albania NaN NaN 1.27 NaN NaN NaN -0.0681 0.7955 0.1524 0.6387 -0.08 2.0 ISCED 3A, ISCED 4 1.12 NaN ISCED 5A, 6 NaN 1.41 NaN NaN NaN NaN NaN 1.15 NaN Native NaN NaN NaN -0.39 0.00 A ISCED level 3 General NaN Albanian NaN 315.0 1.4702 0.34 -0.2514 0.6490 ISCED 5A, 6 270.0 NaN Tailors, dressmakers, furriers and hatters Building construction labourers -0.9492 8.0 16.0 1.3116 Did not repeat a <grade> NaN 90.0 NaN 0.6602 NaN NaN NaN NaN Albanian NaN NaN NaN 0.69 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 486.1427 464.3325 453.4273 472.9008 476.0165 325.6816 419.9330 378.6493 359.9548 384.1019 373.1968 444.0801 456.5431 401.2385 461.2167 366.9653 459.6588 426.1645 423.0488 443.3011 389.5544 438.6275 417.5962 379.4283 438.6275 440.1854 456.5431 486.9216 458.1010 444.0801 411.3647 437.8486 457.3220 454.2063 460.4378 434.7328 448.7537 494.7110 429.2803 434.7328 406.2936 349.8975 400.7334 369.7553 396.7618 548.9929 471.5964 471.5964 443.6218 454.8116 8.9096 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 13.7954 4.5315 4.3313 13.7954 13.9235 4.3389 4.3313 4.5084 4.5084 13.7954 4.5315 13.1249 13.0829 4.5315 13.0829 13.9235 13.1249 13.1249 4.3389 4.3313 13.7954 4.5315 4.3313 13.7954 13.9235 4.3389 4.3313 4.5084 4.5084 13.7954 4.5315 4.5084 4.5315 13.0829 4.5315 4.3313 4.5084 4.5084 13.7954 13.9235 4.3389 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 4.5084 4.5315 13.0829 4.5315 4.3313 4.5084 4.5084 13.7954 13.9235 4.3389 13.0829 13.9235 4.3389 4.3313 13.7954 13.9235 13.1249 13.1249 4.3389 13.0829 19 1 0.2098 22NOV13
2 3 Albania 80000 ALB0006 Non-OECD Albania 1 3 9 1.0 9 1996 Female Yes, for more than one year 6.0 No, never No, never No, never None None 1.0 Yes Yes No Yes No No <ISCED level 3B, 3C> Yes Yes Yes No Working full-time <for pay> <ISCED level 3A> Yes No Yes Yes Working full-time <for pay> Country of test Country of test Country of test NaN Language of the test Yes Yes Yes Yes No Yes Yes Yes Yes Yes No Yes No Yes 8001 8001 8001 Three or more Two Two One Two More than 500 books Agree Strongly agree Agree Agree Strongly agree Strongly agree Strongly agree Strongly agree Strongly agree Strongly agree Agree Strongly agree Strongly agree Agree Confident Very confident Very confident Confident Very confident Not very confident Very confident Confident NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Strongly agree Agree Strongly agree Strongly disagree Strongly agree Strongly disagree Likely Likely Very Likely Very Likely Very Likely Slightly likely Strongly agree Strongly agree Strongly agree Strongly agree Strongly agree Agree Strongly agree Strongly agree Strongly agree Courses after school Math Major in college Science Study harder Math Maximum classes Science Pursuing a career Science Sometimes Always or almost always Sometimes Never or rarely Always or almost always Never or rarely Never or rarely Never or rarely Most important Improve understanding learning goals more information Less than 2 hours a week 2 or more but less than 4 hours a week 4 or more but less than 6 hours a week I do not attend <out-of-school time lessons> i... NaN 6.0 6.0 7.0 2.0 3.0 Frequently Sometimes Frequently Rarely Frequently Rarely Frequently Sometimes Frequently Never heard of it Know it well, understand the concept Heard of it once or twice Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept Heard of it once or twice Know it well, understand the concept Know it well, understand the concept Heard of it once or twice Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept Know it well, understand the concept 60.0 NaN NaN 5.0 4.0 2.0 24.0 30.0 Frequently Frequently Frequently Frequently Frequently Frequently Rarely Rarely NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Not much like me Not much like me Very much like me Very much like me Somewhat like me Mostly like me Mostly like me Very much like me Mostly like me Very much like me probably not do this definitely do this definitely not do this probably do this 1.0 3.0 4.0 1.0 3.0 4.0 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 99 99 99 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN A Simple calculator 99 99 99 StQ Form A booklet 3 Standard set of booklets 15.58 -1.0 Albania: Lower secondary education NaN NaN NaN NaN 22.57 NaN NaN Albania Albania Albania NaN NaN 1.27 NaN NaN NaN 0.5359 0.7955 1.2219 0.8215 -0.89 2.0 ISCED 5A, 6 -0.69 NaN ISCED 5A, 6 NaN 0.14 NaN NaN NaN NaN NaN -0.40 NaN Native NaN NaN NaN 1.59 1.23 A ISCED level 2 General NaN Albanian NaN 300.0 0.9618 0.34 -0.2514 2.0389 ISCED 5A, 6 NaN NaN Housewife Bricklayers and related workers 0.9383 24.0 16.0 0.9918 Did not repeat a <grade> NaN NaN NaN 2.2350 NaN NaN NaN NaN Albanian NaN NaN NaN -0.23 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 533.2684 481.0796 489.6479 490.4269 533.2684 611.1622 486.5322 567.5417 541.0578 544.9525 597.1413 495.1005 576.8889 507.5635 556.6365 594.8045 473.2902 554.2997 537.1631 568.3206 471.7324 431.2276 460.8272 419.5435 456.9325 559.7523 501.3320 555.0787 467.0587 506.7845 580.7836 481.0796 555.0787 453.8168 491.2058 527.0369 444.4695 516.1318 403.9648 476.4060 401.2100 404.3872 387.7067 431.3938 401.2100 499.6643 428.7952 492.2044 512.7191 499.6643 8.4871 12.7307 12.7307 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 4.2436 12.7307 4.2436 4.2436 12.7307 12.7307 4.2436 4.2436 4.2436 4.2436 12.7307 4.2436 12.7307 12.7307 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 4.2436 12.7307 4.2436 4.2436 12.7307 12.7307 4.2436 4.2436 4.2436 4.2436 12.7307 4.2436 4.2436 4.2436 12.7307 4.2436 4.2436 4.2436 4.2436 12.7307 12.7307 4.2436 12.7307 12.7307 4.2436 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 4.2436 4.2436 12.7307 4.2436 4.2436 4.2436 4.2436 12.7307 12.7307 4.2436 12.7307 12.7307 4.2436 4.2436 12.7307 12.7307 12.7307 12.7307 4.2436 12.7307 19 1 0.1999 22NOV13

As we can see above, the data is clearly very abundant, with a large number of variables to take into consideration.

After looking throughout the Dataset Dictionary to find out what each of these columns represents, a number of leads to be explored have been considered:

  • We are interested in finding out how students from individual countries perform in Math, Reading and Science literacy.

    • For that, we will check the average world and country-wide distribution of Math, Reading and Science literacy scores, individually.
  • Considering that we can see the countries' average literacy patters in different subjects, we are also curious about from which countries do the "geniuses" stem, meaning which countries have students with exceptionally high literacy scores.

    • For that, we will check the distrbution of exceptional scores in Math, Reading and Science literacy, grouped by country.
  • Lastly, we would like to find out whether students whose parents have different cultural backgrounds will report any changes in average scores, compared with students raised in a homogenous family background.

    • For that, we will compare the distribution of mean scores in each subject across both students with homogenous family background (parents born in same country) and students with heterogenous family background (parents born in two different countries).
df = df[['CNT', 'ST03Q02', 'ST04Q01', 'AGE', 'PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 
         'PV3READ', 'PV4READ', 'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE', 'COBN_F', 'COBN_M', 'COBN_S']]
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
CNT ST03Q02 ST04Q01 AGE PV1MATH PV2MATH PV3MATH PV4MATH PV5MATH PV1READ PV2READ PV3READ PV4READ PV5READ PV1SCIE PV2SCIE PV3SCIE PV4SCIE PV5SCIE COBN_F COBN_M COBN_S
0 Albania 1996 Female 16.17 406.8469 376.4683 344.5319 321.1637 381.9209 249.5762 254.3420 406.8496 175.7053 218.5981 341.7009 408.8400 348.2283 367.8105 392.9877 Albania Albania Albania
1 Albania 1996 Female 16.17 486.1427 464.3325 453.4273 472.9008 476.0165 406.2936 349.8975 400.7334 369.7553 396.7618 548.9929 471.5964 471.5964 443.6218 454.8116 Albania Albania Albania
2 Albania 1996 Female 15.58 533.2684 481.0796 489.6479 490.4269 533.2684 401.2100 404.3872 387.7067 431.3938 401.2100 499.6643 428.7952 492.2044 512.7191 499.6643 Albania Albania Albania
3 Albania 1996 Female 15.67 412.2215 498.6836 415.3373 466.7472 454.2842 547.3630 481.4353 461.5776 425.0393 471.9036 438.6796 481.5740 448.9370 474.1141 426.5573 Albania Albania Albania
4 Albania 1996 Female 15.50 381.9209 328.1742 403.7311 418.5309 395.1628 311.7707 141.7883 293.5015 272.8495 260.1405 361.5628 275.7740 372.7527 403.5248 422.1746 Albania Albania Albania

Here, we have only kept variables that may aid in the exploration of our desired leads.

Before we begin visualizing the data, further wrangling needs to be done in order to ensure clarily of information when working with the dataset.


Further Data Wrangling

# Replace NaN age values with the mean age of students in the dataset
df.loc[np.isfinite(df['AGE']) == False, 'AGE'] = df['AGE'].mean()
# Replace NaN or 'Invalid' values for father/mother birth country to 'Missing', 
# which is already being used to represent missing information

df.loc[df['COBN_F'].isna() == True, 'COBN_F'] = 'Missing'
df.loc[df['COBN_M'].isna() == True, 'COBN_M'] = 'Missing'
df.loc[df['COBN_S'].isna() == True, 'COBN_S'] = 'Missing'

df.loc[df['COBN_F'] == 'Invalid', 'COBN_F'] = 'Missing'
df.loc[df['COBN_M'] == 'Invalid', 'COBN_M'] = 'Missing'
df.loc[df['COBN_S'] == 'Invalid', 'COBN_S'] = 'Missing'
# Check if there are any columns which still have unwrangled NA values
# If this function prints a column, it will also show the total number of NA values in the column, otherwise nothing will print
# The ideal case is that this function does not print anything, meaning there are no more NA values in our working dataset

for column in df.columns:
    if (df[column].isna().sum() > 0):
        print((column) + '  ' + str(df[column].isna().sum()))

Within the dataset, for each literacy subject, there are 5 plausible scores of performance recorded for a student. We will compute the actual score of the student by taking the average of the 5 plausible scores, as performed below:

# Compute the average of plausible scores determines the PISA score of a student in a particular subject

df['Math Score'] = (df['PV1MATH'] + df['PV2MATH'] + df['PV3MATH'] + df['PV4MATH'] + df['PV5MATH']) / 5
df['Reading Score'] = (df['PV1READ'] + df['PV2READ'] + df['PV3READ'] + df['PV4READ'] + df['PV5READ']) / 5
df['Science Score'] = (df['PV1SCIE'] + df['PV2SCIE'] + df['PV3SCIE'] + df['PV4SCIE'] + df['PV5SCIE']) / 5

Since we will work only with the mean of scores from now on, the plausible scores used to make up the mean are no longer needed, therefore they will be dropped.

# Drop any further-unnecessary columns

df.drop(columns = ['PV1MATH', 'PV2MATH', 'PV3MATH', 'PV4MATH', 'PV5MATH', 'PV1READ', 'PV2READ', 'PV3READ', 'PV4READ', 
                   'PV5READ','PV1SCIE', 'PV2SCIE', 'PV3SCIE', 'PV4SCIE', 'PV5SCIE'], inplace = True)

We now need to name our variables in a more descriptive manner than initially provided. We will decipher the meaning of the initial variable names using, once again, the data dictionary of the PISA 2012 test.

# Rename columns appropriately

df.rename({'CNT' : 'Country', 'ST03Q02' : 'Birth year', 'ST04Q01' : 'Gender', 'AGE' : 'Age', 'COBN_F' : 'Birth Country Father', 
           'COBN_M' : 'Birth Country Mother', 'COBN_S' : 'Birth Country Child'}, axis = 'columns', inplace = True)

Since we need to find out whether a student comes from a homogenous or heterogenous family background, we will perform feature engineering to create a variable which tells us this information.

df['Parents - Same Cultural Background'] = (df['Birth Country Father'] == df['Birth Country Mother'])
df.loc[df['Parents - Same Cultural Background'] == True, 'Parents - Same Cultural Background'] = 'Same'
df.loc[df['Parents - Same Cultural Background'] == False, 'Parents - Same Cultural Background'] = 'Different'
The final form of the wrangled working dataset looks as seen below:
df.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
Country Birth year Gender Age Birth Country Father Birth Country Mother Birth Country Child Math Score Reading Score Science Score Parents - Same Cultural Background
0 Albania 1996 Female 16.17 Albania Albania Albania 366.18634 261.01424 371.91348 Same
1 Albania 1996 Female 16.17 Albania Albania Albania 470.56396 384.68832 478.12382 Same
2 Albania 1996 Female 15.58 Albania Albania Albania 505.53824 405.18154 486.60946 Same
3 Albania 1996 Female 15.67 Albania Albania Albania 449.45476 477.46376 453.97240 Same
4 Albania 1996 Female 15.50 Albania Albania Albania 385.50398 256.01010 367.15778 Same
df.shape
(485490, 11)

Univariate Exploration and Analysis

In this section, we will investigate the distributions of individual variables.

Visualisation 1

First, we are interested in seeing what is the distribution of PISA scores for each subject in part, along with their type of distribution and mode values. Also, for us to determine what "exceptionally high" scores represent, we first need to understand what is common to happen within the data and between what intervals do most score values lie.

plt.figure(figsize = [17, 5])

bins_hist = np.arange(0, 1000 + 1, 100)

plt.subplot(1, 3, 1)
plt.hist(df['Math Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Math Score');
plt.ylabel('Nr. of students')
plt.title("Math Score distribution across students");

plt.subplot(1, 3, 2)
plt.hist(df['Reading Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Reading Score');
plt.title("Reading Score distribution across students");

plt.subplot(1, 3, 3)
plt.hist(df['Science Score'], bins = bins_hist, ec = 'black', alpha = 0.85);

plt.xlim(0, 1000);
plt.ylim(0, 180000 + 1);
plt.xticks(bins_hist)
plt.xlabel('Science Score');
plt.title("Science Score distribution across students");

png

From the above distributions, we find out that:

  • The literacy scores are spread out in a clear, smooth unimodal distribution of values

  • The vast majority of the students are scoring in each subject between 300 and 600 points, while a small portion of the total number achieves poorer (between 100 and 300 points) or greater (between 600 and 800 points) test performance

  • For each of these three score distributions, most students fitted within the interval of scores between 400 and 500 points, which happens to also be the middle interval of the possible score range. This shows that the PISA 2012 test has been constructed in a balanced manner with respect to student reviewing methodology and the difficulty of its questions


Visualisation 2

Next, we are interested in finding out about which countries host "exceptionally performing" students, and what their number is, with respect to each country and subject.

Since we have found out that a distribution of scores between 100 and 800 points is large enough to be noticed in our histogram of score count distributions, we understand that "exceptionally high" scores will be valued above 800 points.

# Retrieve entries of students with scores above 800 points in each subject

high_math_score = df[df['Math Score'] > 800]['Country'].value_counts()
high_reading_score = df[df['Reading Score'] > 800]['Country'].value_counts()
high_science_score = df[df['Science Score'] > 800]['Country'].value_counts()
plt.figure(figsize = [15, 5])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_score.values[0] + 6 # '+6' is done in order to show the text counts properly next to the bars


plt.subplot(1, 3, 1)
sb.barplot(y = high_math_score.index, x = high_math_score.values, color = sb.color_palette()[0])
plt.title('Students with Math score > 800');
plt.xlabel('Nr. of students')
plt.ylabel('Countries (ordered by student count)')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_math_score[label.get_text()] + 1, s = high_math_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 2)
sb.barplot(y = high_reading_score.index, x = high_reading_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Reading score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_reading_score[label.get_text()] + 1, s = high_reading_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);    


plt.subplot(1, 3, 3)
sb.barplot(y = high_science_score.index, x = high_science_score.values, color = sb.color_palette()[0])
plt.xlim(0, x_lim_max);
plt.title('Students with Science score > 800');
plt.xlabel('Nr. of students')

# Write the total number of students with exceptionally high scores in each country, right after the horizontal count bar
indexes, labels = plt.yticks()
for index, label in zip(indexes, labels):
    plt.text(y = index, x = high_science_score[label.get_text()] + 1, s = high_science_score[label.get_text()], va = 'center')
plt.xlim(0, x_lim_max);

png

Here, we have counted the total number of exceptional students for each subject, and from their count distributions, we find out that:

  • Oceania and Asian countries, in particular China and Singapore, seem to be the places where exceptional students are most likely to stem from

  • Singapore is the clear leader in excellency, since it manages to reach within the top 3 countries with most exceptional students in all three areas of study

  • Korea and Australia also have a respectable number of excellent students in both Math and Science

  • Besides Asia and Oceania, a number of European and North American countries also make the podium: Canada is present in all three of the above distributions, while Poland, Czech Republic and Switzerland appear to be educating Math-bright students


Visualisation 3

Lastly, we would like to find out whether students with heterogenous family roots provide different score readings, on average, than students raised by parents with a homogenous family background.

For that, we would first like to see the proportion of students having parents with same or different cultural background:

plt.figure(figsize=[3, 5]);
sb.countplot(x = 'Parents - Same Cultural Background', data = df, color = sb.color_palette()[0]);

y_ticks = np.arange(0, 450000 + 1, 50000)
plt.yticks(y_ticks, y_ticks);
plt.ylabel("Nr. of students");
plt.title('Family background distribution');

png

From this previous distribution, we discover that there are almost 9 times more students having parents with same cultural background than those having parents with different backgrounds.


Bivariate Exploration and Analysis

In this section, we will investigate the relationships between pairs of relevant variables in our data.

Visualisation 4

After finding out previously the global distribution of scores for each literacy category in part, we are interested to look into the relationship how the country of residence/education affects scores on each of the subjects individually.

Also, we have previously discovered countries where exceptional students stem from, such as China, Singapore, Korea or Poland.

Was this just a strong deviation from the norm of scores within these countries, or are these countries also between the top leaders when it comes to educating all of their students in these subjects?

plt.figure(figsize = [15, 35])
plt.subplots_adjust(wspace = 0.85) # adjust spacing between subplots, in order to show long country names nicely

math_score_country_order = df.groupby('Country')['Math Score'].mean().sort_values(ascending = False).index
reading_score_country_order = df.groupby('Country')['Reading Score'].mean().sort_values(ascending = False).index
science_score_country_order = df.groupby('Country')['Science Score'].mean().sort_values(ascending = False).index

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Math Score'], y = df['Country'], order = math_score_country_order, color = sb.color_palette()[9]);
plt.ylabel('Countries (ordered descendingly by score ranking)')
plt.title('Math score distributions by country');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Reading Score'], y = df['Country'], order = reading_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Reading score distributions by country');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Science Score'], y = df['Country'], order = science_score_country_order, color = sb.color_palette()[9]);
plt.ylabel(''); # Remove the redundant label
plt.title('Science score distributions by country');

png

Here we have, in decreasing order, the rankings of countries with the best-performing students, on average, for each of the three subjects. We find that:

  • Most countries seem to achieve, in different subjects, rankings which are close to each other, with 80% of countries reaching rankings within 10 places higher/lower across all the subjects. The remaining 20% of countries with large deviations will be analyzed further on, as we would like to find out in what subjects are they deviating and how large is the difference

  • From the top ranking of countries, we can clearly see that it was no coincidence that exceptional students come from Asian countries, in particular China and Singapore, as these are the countries which are 1st, 2nd or 3rd place across all subjects. Another country which impressed was Poland, with 3 exceptional students in Mathematics. Here, it can be seen that this is no coincidence either, as Poland occupies a spot in the top 10 placements in Reading and Science, and 11th in Mathematics

  • The box-and-wiskers plot allows us to see the outliers in scores for each country in part. For example, we find out that, even though a number of Singapore's students managed to achieve outstanding results in the field of Mathematics, that is not to be considered "out of ordinary" (or, with other words, as outliers) in the country's perspective, however the outstanding results received in Science can be indeed considered to be slightly out of regular bounds, since we can notice the Singapore's outliers in the plot

  • Other such findings can be performed by picking a country of interest and checking its score distribution, rankings and outlier distribution accordingly


Visualisation 5

Further on, we are interested to see if there are any countries which had multi-talented students, performing exceptionally (score > 800) in not only one discipline, but in multiple ones at the same time.

high_math_and_reading_score = df[(df['Math Score'] >= 800) & (df['Reading Score'] >= 800)]['Country'].value_counts()
high_math_and_science_score = df[(df['Math Score'] >= 800) & (df['Science Score'] >= 800)]['Country'].value_counts()
high_reading_and_science_score = df[(df['Reading Score'] >= 800) & (df['Science Score'] >= 800)]['Country'].value_counts()
plt.figure(figsize = [15, 1])
plt.subplots_adjust(wspace = 0.6) # adjust spacing between subplots, in order to show long country names nicely
x_lim_max = high_math_and_science_score.values[0] # adjust the proportions of the x-axis with respect to all 3 plots using this

plt.subplot(1, 3, 1)
sb.barplot(y = high_math_and_reading_score.index, x = high_math_and_reading_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Reading scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 2)
sb.barplot(y = high_math_and_science_score.index, x = high_math_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Math & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

plt.subplot(1, 3, 3)
sb.barplot(y = high_reading_and_science_score.index, x = high_reading_and_science_score.values, color = sb.color_palette()[0])
plt.title('Students with Reading & Science scores > 800');
plt.xticks(np.arange(0, x_lim_max + 2, 1));

png

We have once again realized plots to visualize the distribution of counted students with exceptional results in multiple fields. It can be found that:

  • Singapore leads in the world ranking of multi-talented outstanding results, being the only one country present in all three possible combination of two talents. Coupled together with previous findings of Singapore's on-average and above-average student results, we can determine that its country is world-leader in providing multidisciplinary high-quality education, across all measured disciplines.

  • China is once again present in the rankings, together with (South) Korea, determining that, as said about Singapore, such countries have a very high quality of education in the measured fields. The connection between all previous visualizations of scores determine that such outstanding results are not simple coincidence, but only slightly-higher than average results compared with these countries' mean of scores.

  • There were no students who have managed to reach scores above 800 in all three subjects simultaneously, which is why there is no further visualization on this matter.


Visualisation 6

Lastly, we are investigating whether students having parents from different cultural (country) backgrounds are performing differently than students whose parents come from the same type of background.

plt.figure(figsize = [15, 4])
plt.subplots_adjust(wspace = 1.2)

plt.subplot(1, 3, 1)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Math Score'], palette = 'Set2')
plt.title('Math scores related to family background');

plt.subplot(1, 3, 2)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Reading Score'], palette = 'Set2')
plt.title('Reading scores related to family background');

plt.subplot(1, 3, 3)
sb.boxplot(x = df['Parents - Same Cultural Background'], y = df['Science Score'], palette = 'Set2');
plt.title('Science scores related to family background');

png

It can be seen that, on average, indeed students coming from heterogenic family backgrounds report increased performance in all areas, compared to students from homogenous family backgrounds.


Multivariate Exploration and Analysis

In this section, we will investigate the relationships between the interactions of three or more variables at the same time.

Visualisation 7

Out of curiosity, we would like to know whether it is normal that most countries who had many students scoring high in one of the subjects, also had, on average, students scoring high on the other two subjects.

Therefore, we are going to study the pair-by-pair relationship between Math Scores, Reading Scores and Science Scores, and correlate them in pair scatter plots, in order to see what type and strength of correlation exists.

grid = sb.pairplot(data = df, vars=["Math Score", "Reading Score", "Science Score"]);
grid.fig.suptitle("Pair-by-pair representation of different scores' correlation", y = 1.02);

png

As it could be expected, there is a very strong and positive correlation between any pair of the three variables representing the scores of the three subjects. Therefore, the previous relationships between scores observed in the behaviour of many countries' students is justified.


Visualisation 8

Lastly in this analysis, even though we found out that positive correlation of students' scores is normal behaviour for a country, and that most countries display similar rankings in scores across all three subjects, there are still 20% of countries who deviate from this behaviour, where these countries have differences in rankings of some scores of more than 10 places.

We will look into who are those countries, and where do the differences take place in the scores of each of these countries.

country_outliers = []

for country in df['Country'].unique():
    if ((np.abs((math_score_country_order.get_loc(country) - reading_score_country_order.get_loc(country))) > 10) |\
        (np.abs((math_score_country_order.get_loc(country) - science_score_country_order.get_loc(country))) > 10) |\
        (np.abs((reading_score_country_order.get_loc(country) - science_score_country_order.get_loc(country))) > 10)):
        
        country_outliers.append(country)
        
        
country_outliers.sort() # Sort countries alphabetically

for country in country_outliers:
    print((country) + ':' + str((len('United States of America:') - len(country) + 1) * ' ') + 'Math place: ' + str(math_score_country_order.get_loc(country)) + str((5 - len(str(math_score_country_order.get_loc(country))) + 1) * ' ') + 'Reading place: ' + str(reading_score_country_order.get_loc(country)) + str((5 - len(str(reading_score_country_order.get_loc(country))) + 1) * ' ') + 'Science place: ' + str(science_score_country_order.get_loc(country)))
Austria:                   Math place: 17    Reading place: 31    Science place: 23
Florida (USA):             Math place: 44    Reading place: 33    Science place: 40
Iceland:                   Math place: 27    Reading place: 40    Science place: 42
Ireland:                   Math place: 20    Reading place: 5     Science place: 13
Israel:                    Math place: 43    Reading place: 32    Science place: 44
Kazakhstan:                Math place: 53    Reading place: 65    Science place: 54
Macao-China:               Math place: 6     Reading place: 17    Science place: 14
Massachusetts (USA):       Math place: 19    Reading place: 8     Science place: 15
Norway:                    Math place: 31    Reading place: 22    Science place: 33
Slovak Republic:           Math place: 34    Reading place: 45    Science place: 43
Slovenia:                  Math place: 37    Reading place: 46    Science place: 32
Switzerland:               Math place: 9     Reading place: 26    Science place: 26
United States of America:  Math place: 39    Reading place: 25    Science place: 30
Vietnam:                   Math place: 15    Reading place: 19    Science place: 7
df_country_outliers = df[['Country', 'Math Score', 'Reading Score', 'Science Score']][df['Country'].isin(country_outliers)]
df_country_outliers = df_country_outliers.melt('Country', var_name = 'Score Type', value_name = 'Scores')
plt.figure(figsize = [7, 10])

sb.pointplot(x = 'Scores', y = 'Country', hue = 'Score Type', data = df_country_outliers, linestyles = '', dodge = 0.4, ci = 'sd', palette = 'deep', order = country_outliers);
plt.legend(loc = 2);
plt.title('Score distribution across subjects of countries with strong deviations in results');

png

From this visualization, we can better understand the deviations in the outlier-countries' score pattern. For example, we find out that Kazakhstan is deviating due to much weaker scores in Reading than in the other two disciplines; while Switzerland is deviating through much higher scores in Mathematics than in the other two subjects. For any country of interest, such an analysis can be performed.


Conclusions and answers

Lastly, we will restate our three questions of interest, along with a summary of our conclusions:

  • How do students from individual countries perform in Math, Reading and Science literacy?

    • We have found the literacy trends for each country in part, along with countries with high deviations in scores compared to the norm. The results can be explored in Visualisation 4 and 8.
  • What are the countries from which "geniuses" stem, meaning which countries have students with exceptionally high literacy scores?

    • Generally, we have seen Asian countries as being the world leaders at educating students who perform exceptionally, not only in each different subject individually, but also at multiple subjects simultaneously. Singapore and China are examples of this.
  • Do students whose parents have different cultural backgrounds report any changes in average scores, compared with students raised in a homogenous family background?

    • We have discovered that students whose parents are from two different cultural backgrounds report, on average, slightly higher performance (scores) in all three measured subjects.

About


Languages

Language:Jupyter Notebook 100.0%