Repository for final project allocation and submission for CS60013 : Programming & Data Structures, offered in Autumn 2022 at IIT Kharagpur and taught by Prof Subhamoy Mandal.
Deadline for project submission is 10th November 2022 at 23:59 IST.
This repository might be updated with new projects and/or changes to existing projects. Please check back regularly.
Final projects are crafted by Vinay and Sai Pavan and approved by Prof Subhamoy Mandal.
- Final Projects for CS60013 : Programming and Data Structures
- The project is to be done in groups of 3 students (except the 5th group). The students are expected to work together collaboratively.
- The choice of programming language is left to the students. However, the most common languages used are Python and C/C++.
- Each group will be assigned a mentor TA who will be responsible for guiding the group throughout the project.
- Meetings with the mentor TA will be scheduled at the beginning of the project and at regular intervals.
- Each student will be evaluated based on the contribution towards the project. Make sure you are contributing equally to the project.
- Code plagiarism will not be tolerated. Any submission found to be plagiarized will be awarded a zero grade.
- Late submissions will not be accepted.
- The final project evaluation is based on the following criteria:
  - Continuous Evaluation (CE) : 40%
  - Code Quality and Documentation : 20%
  - Final Submission and Report : 40%
- Continuous Evaluation (CE) : 40%
  - The CE will be based on the following criteria:
    - Your participation in the weekly meetings with your mentor TA.
    - Your weekly progress and updates on the project.
- Code Quality and Documentation : 20%
  - This will be based on the following criteria:
    - Code Quality : 10% (based on the code quality and readability)
    - Documentation : 10% (based on the documentation of the code and the project)
- Final Submission and Report : 40%
  - This will be based on the following criteria:
    - Final Submission : 20% (based on the final submission of the project)
    - Final Report : 20% (based on the final report of the project)
- CE will be evaluated if you have attended at least 75% of the weekly meetings with your mentor TA.
- Fork this github.com/ummadiviany/pds_final_projects repository.
- Clone the forked repository to your local machine using the following command: `git clone github.com/{your_username}/pds_final_projects`
- Your projects are in the `submissions` directory. You can find the project description in the README.md file of the respective project directory.
- Work on the project and make regular commits to your local repository and push them to your forked repository.
- Your mentor TA will review your code and provide feedback.
- You have to submit the following:
  - Final Code : The final code of your project in the respective project directory.
    - Code should be highly readable and well documented.
    - Try to write efficient code and avoid unnecessary code.
  - Final Report : The final report of your project in the respective project directory. The report should be a markdown file named `report.md`. The report should contain the following:
    - Introduction : A brief introduction of the project.
    - Data : A brief description of the data used in the project.
    - Questions & Answers : The questions and their respective answers. Also include the code snippets used to answer the questions and who solved each question.
    - References : The references used in the project.
- Submission of the final project will be done via GitHub Pull Requests.
- Once you are done with the project, you can create a Pull Request to the `main` branch of the github.com/ummadiviany/pds_final_projects repository.
- We will review your pull request and provide feedback. You can make changes to your code and update the pull request. If accepted, your project will be merged into the `main` branch of the github.com/ummadiviany/pds_final_projects repository.
- That's it! Congratulations!! You have successfully submitted your final project.
The deadline for the final project submission is 10th November 2022, 23:59 IST.
Students | Project | Mentor TA |
---|---|---|
Amar Majhi, Mamta Rani, Reflex Kumar Patel | Project 4 : Medical Image Visualization and Analysis | Sai Pavan |
Bhanu Kumar Meena, Syeda Najafara Fathima, Kavin Puri | Project 1 : Medical Transcription Analysis | Vinay |
Pooja P Jain, Sathishkumar S, P.V.Kamlesh | Project 3 : ISBI 2022 Accepted Submissions Analysis | Vinay |
Ramkumar K, Chaudhari Saurabh Santosh, Samriddha Das | Project 2 : Agriculture Crop Production Analysis | Vinay |
Prabhukalyan Dash, Soumita Guria | Project 5 : Patient Health Statistical Analysis | Sai Pavan |
- The project aims to analyse the medical transcription dataset. The dataset is located at `data/medical_transcriptions/mtsamples.csv`.
- The dataset is a `csv` file. CSV stands for Comma Separated Values. It is a simple file format used to store tabular data, such as a spreadsheet or database. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
- The dataset contains the following fields :
  - `description` : Short brief of the interaction between the patient and the doctor.
  - `medical_specialty` : Medical specialty of the issue discussed in the transcription.
  - `sample_name` : Medical samples used for the diagnosis.
  - `transcription` : Full transcription of the interaction between the patient and the doctor.
  - `keywords` : Keywords of the transcription.
- The project can be divided into sub-areas as follows :
  - Data Preprocessing
    - Write functions to read the csv file. Suggestion : Use the `pandas` library.
    - This dataset needs a bit of pre-processing. The `medical_specialty` field contains multiple values. You need to split the values and create a list of values. For example, if the `medical_specialty` field contains `Orthopedics, Neurology`, then you need to split it into `['Orthopedics', 'Neurology']`.
    - The `keywords` field contains multiple values. You need to split the values and transform them into a list of values. For example, if the `keywords` field contains `'pain, headache, migraine'`, then you need to split it into `['pain', 'headache', 'migraine']`.
    - Look into the dataset and find out if there are any other fields that need to be pre-processed.
  - Data Analysis
    - In this part you can prepare a set of at least 10 questions and answer them using the dataset.
    - Some example questions to get you started:
      - What is the most common medical specialty?
      - What is the most common medical sample?
      - What is the most common keyword?
      - What is the average length of the transcription?
      - What is the average length of the description?
      - What is the average length of the keywords?
      - And so on... Get creative and come up with your own questions.
  - Data Visualization
    - In this part you can make use of the `matplotlib` and `seaborn` libraries to visualize the answers to the questions you asked in the previous part.
    - Everyone likes to see the results in the form of graphs and charts. So, make sure you visualize the answers to the questions you asked in the previous part.
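
The preprocessing and analysis steps above can be sketched roughly as follows. This is only an illustration using the Python standard library on a tiny made-up sample (the rows are not taken from `mtsamples.csv`), and the helper names `split_multi_value` and `most_common_value` are hypothetical. With pandas, the same splitting helper could be applied to a column through `Series.apply`.

```python
import csv
import io
from collections import Counter

# Made-up sample rows that only mimic the column layout described above.
SAMPLE_CSV = """medical_specialty,keywords
"Orthopedics, Neurology","pain, headache, migraine"
Neurology,"headache, nausea"
Orthopedics,
"""


def split_multi_value(value):
    """Split a comma-separated field into a list of stripped, non-empty values."""
    if not value:
        return []
    return [part.strip() for part in value.split(",") if part.strip()]


def most_common_value(rows, field):
    """Return the (value, count) pair that occurs most often in a list-valued field."""
    counts = Counter(item for row in rows for item in row[field])
    return counts.most_common(1)[0]


rows = []
for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    # Turn the two multi-value text fields into Python lists.
    row["medical_specialty"] = split_multi_value(row["medical_specialty"])
    row["keywords"] = split_multi_value(row["keywords"])
    rows.append(row)

top_keyword = most_common_value(rows, "keywords")
avg_keywords = sum(len(row["keywords"]) for row in rows) / len(rows)
print(rows[0]["medical_specialty"])
print(top_keyword, avg_keywords)
```

Keeping the splitting logic in one small function makes it easy to reuse for any other field that turns out to hold several values.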
- This project aims to analyse the crop production data from 2006 to 2011 from all the states of India. The dataset is located in the `data/crop_production/` directory.
- The data directory contains 5 csv files. Go through the data files and understand the data.
- Different data files contain different types of data. For example, `datafile_1.csv` contains the following fields:
  - `Crop` : Name of the crop
  - `State` : Name of the state
  - `Cost of Cultivation (/Hectare) A2+FL` : Cost of cultivation per hectare
  - `Cost of Cultivation (/Hectare) C2` : Cost of cultivation per hectare
  - `Cost of Production (/Quintal) C2` : Cost of production per quintal
  - `Yield (Quintal/ Hectare)` : Yield per hectare
- The `datafile_2.csv` contains the following fields:
  - `Crop` : Name of the crop
  - `Production (YYYY - YY)` : Production of the crop between two consecutive years
  - `Area (YYYY - YY)` : Area of the crop between two consecutive years
  - `Yield (YYYY - YY)` : Yield of the crop between two consecutive years
- Go through the data files and understand the data. You can use the `pandas` library to read the csv files and perform analysis on the data.
- The data files are not clean. You need to clean the data before you start analysing it.
- The project can be divided into the following parts:
  - Data Processing
    - Write the functions for reading the data files.
    - Once you have read the data files, you need to clean the data. You can use the `pandas` library to clean the data.
    - Only keep the data which is relevant to the analysis and drop the rest of the data.
  - Data Analysis
    - In this part, you need to prepare a set of questions and answer them using the data provided.
    - Answer at least 15 questions using the data provided.
    - A few example questions to get you started are as follows:
      - Which crop has the highest production in the country?
      - What are the major states where rice is grown?
      - What is the average cost of cultivation of rice in the country?
      - What are the seasons in which Sunflower is grown? (data available in `datafile_5.csv`)
      - What is the average crop duration for Paddy, Wheat and Maize?
    - You can come up with your own questions and answer them using the data provided.
  - Data Visualization
    - Visualize the data using the `matplotlib` or `seaborn` library.
    - Visualizing the data will help you understand the data better and answer the questions.
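
As a rough sketch of the cleaning and analysis steps above, the example below uses only the standard library on made-up rows. Only the column names follow the description of `datafile_1.csv`; the values, the kinds of dirtiness shown (stray spaces, inconsistent case, missing cost) and the helper names `clean_rows` and `average_cost` are assumptions for illustration, so check the real files before relying on any of it.

```python
import csv
import io
from statistics import mean

# Hypothetical rows; only the column names follow datafile_1.csv as described above.
SAMPLE_CSV = """Crop,State,Cost of Cultivation (/Hectare) C2
 RICE ,Punjab,30000
rice,Odisha, 20000
Wheat,Punjab,
"""

COST_COLUMN = "Cost of Cultivation (/Hectare) C2"


def clean_rows(raw_rows):
    """Strip whitespace, normalise crop names, and drop rows without a numeric cost."""
    cleaned = []
    for row in raw_rows:
        cost_text = (row[COST_COLUMN] or "").strip()
        try:
            cost = float(cost_text)
        except ValueError:
            continue  # skip rows whose cost is missing or not a number
        cleaned.append({
            "Crop": row["Crop"].strip().title(),
            "State": row["State"].strip(),
            COST_COLUMN: cost,
        })
    return cleaned


def average_cost(rows, crop):
    """Average the cost column over all rows of one crop."""
    return mean(row[COST_COLUMN] for row in rows if row["Crop"] == crop)


rows = clean_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rice_states = sorted(row["State"] for row in rows if row["Crop"] == "Rice")
print(rice_states, average_cost(rows, "Rice"))
```

Dropping incomplete rows is only one possible policy; depending on the question, filling or flagging missing values may be more appropriate.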
- The project aims to analyse the accepted submissions of ISBI 2022. The dataset is located in the `data/isbi2022/` directory.
- The dataset comprises multiple `json` files. JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format. It is easy for humans to read and write.
- Each json file contains the information about multiple papers (about 100 papers in each). The information about a paper is stored in the form of key-value pairs. JSON is all about key-value pairs (aka dictionaries in Python).
- Each paper contains more than 20 attributes, but the most useful attributes are listed as follows :
  - `articleTitle` : Title of the paper
  - `authors` : List of authors of the paper
  - `citationCount` : Number of citations of the paper
  - `downloadCount` : Number of downloads of the paper
  - `startPage` : Starting page of the paper
  - `endPage` : Ending page of the paper
  - `abstract` : Stripped abstract of the paper
- The project can be divided into sub-areas as follows :
  - Data Preprocessing
    - Write functions to read multiple json files and concatenate them into a single dataframe.
    - Also keep only the useful attributes mentioned above and drop the rest.
  - Data Analysis
    - In this part you can prepare a set of at least 15 questions and answer them using the dataset.
    - Some example questions to get you started:
      - In which area of ISBI 2022 were the most papers submitted?
      - Which are the top 10 downloaded papers and what are they about?
      - Which are the top 10 cited papers and what are they about?
      - What are the mean and median number of authors per paper?
      - What are the most common words in the abstracts of the papers? Form a word cloud.
      - What is the average number of pages per paper?
      - And so on... Get creative and come up with your own questions.
  - Data Visualization
    - In this part you can make use of the `matplotlib` and `seaborn` libraries to visualize the answers to the questions you asked in the previous part.
    - Everyone likes to see the results in the form of graphs and charts. So, make sure you visualize the answers to the questions you asked in the previous part.
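
The reading and filtering steps above might look roughly like the sketch below. It assumes that each json file holds a top-level list of paper dictionaries and that the page fields can be converted with `int`; both are assumptions, so inspect the real files first. The function names `load_papers` and `page_count` and the demo records are made up. The resulting list of dictionaries can be passed to `pandas.DataFrame` to obtain a single dataframe.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean, median

USEFUL_KEYS = ["articleTitle", "authors", "citationCount", "downloadCount",
               "startPage", "endPage", "abstract"]


def load_papers(directory):
    """Read every *.json file in a directory and keep only the useful attributes.

    Assumes each file holds a top-level list of paper dictionaries.
    """
    papers = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            for paper in json.load(handle):
                papers.append({key: paper.get(key) for key in USEFUL_KEYS})
    return papers


def page_count(paper):
    """Number of pages, counting both the start and the end page."""
    return int(paper["endPage"]) - int(paper["startPage"]) + 1


# Demo with two made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fake = [
        {"articleTitle": "A", "authors": ["x", "y"],
         "startPage": "1", "endPage": "4", "publisher": "n/a"},
        {"articleTitle": "B", "authors": ["z"],
         "startPage": "5", "endPage": "9"},
    ]
    Path(tmp, "part1.json").write_text(json.dumps(fake[:1]), encoding="utf-8")
    Path(tmp, "part2.json").write_text(json.dumps(fake[1:]), encoding="utf-8")
    papers = load_papers(tmp)

authors_per_paper = [len(p["authors"]) for p in papers]
print(mean(authors_per_paper), median(authors_per_paper))
print(mean(page_count(p) for p in papers))
```

Attributes missing from a record come back as `None` here, which keeps every row the same shape when the list is turned into a table.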
- The project aims to read, visualize and analyze the medical images. The dataset is located in the `data/medical_images/` directory.
- The dataset contains medical images of MRI and CT scans for different anatomical parts of the body. It also contains the segmentation masks for the images.
- The dataset has
  - Hippocampus MRI images and segmentation masks
  - Heart MRI images and segmentation masks
  - Prostate MRI images and segmentation masks
  - Abdomen CT images and segmentation masks
- These scans are used to diagnose the diseases of the body. The segmentation masks are used to identify the different parts of the body in the images.
- Scans are in NIFTI format. NIFTI is a standard format for storing medical images.
- All the scans are 3D volumes. Each 3D volume is a stack of 2D images. Each 2D image is called a slice.
- Your first task is to read the images and visualize them. You can use the `nibabel` library to read the images.
- Visualizing the images is important to understand the data. You can use the `matplotlib` library to visualize the images. Visualization can be done in multiple ways. You can visualize the images in the following ways:
  - Visualize the slices of the images and the segmentation masks.
  - Visualize the 3D volumes of the images and the segmentation masks.
- The next task is to analyze the images. You can use the `numpy` library to analyze the images. The analysis part is open ended.
- You can perform simple statistical analysis on the images. You can also perform more complex analysis like image segmentation and image classification.
- Statistical analysis may include the following:
  - Calculate the mean, median, standard deviation, minimum and maximum for the whole image and for the segmented image.
  - Now compare the statistics of the segmented image with the whole image. What do you observe?
- Complex analysis may include the following:
  - Perform image segmentation on the images. You can use the `scikit-image` library to perform image segmentation.
  - Perform image classification on the images. You can use the `scikit-learn` library to perform image classification.
  - You can also perform image registration on the images. You can use the `SimpleITK` library to perform image registration.
- Try statistical analysis first and then move on to more complex analysis. We do not expect you to perform complex analysis, but you can try it if you want to.
- Remember, the analysis part is open ended. You can come up with your own analysis ideas and implement them.
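
The statistical comparison described above can be sketched with `numpy` alone. The volume and mask below are synthetic stand-ins, not real scans, and the example assumes that slices lie along the last axis, which may not hold for every file. For a real scan, something like `nibabel.load(path).get_fdata()` would supply the array instead. The helper name `summarize` is hypothetical.

```python
import numpy as np

# Synthetic stand-ins for a scan and its mask; real volumes would come from NIFTI files.
rng = np.random.default_rng(0)
volume = rng.normal(loc=100.0, scale=10.0, size=(8, 8, 4))  # (height, width, slices)
mask = np.zeros(volume.shape, dtype=bool)
mask[2:6, 2:6, 1:3] = True   # pretend this box is the segmented structure
volume[mask] += 50.0          # make the "structure" brighter than the background


def summarize(values):
    """Return basic statistics for an array of voxel intensities."""
    return {
        "mean": float(np.mean(values)),
        "median": float(np.median(values)),
        "std": float(np.std(values)),
        "min": float(np.min(values)),
        "max": float(np.max(values)),
    }


middle_slice = volume[:, :, volume.shape[2] // 2]  # one 2D slice of the 3D stack
whole_stats = summarize(volume)
segment_stats = summarize(volume[mask])            # only voxels inside the mask
print(middle_slice.shape, whole_stats["mean"], segment_stats["mean"])
```

Indexing the volume with the boolean mask returns a flat array of just the segmented voxels, so the same `summarize` helper works for both the whole image and the segmented region.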
- Create a .csv file which contains the following information
  - Create an attribute named `patient` holding ten names of your friends or random names, in string format.
  - Add the attribute `patient Identifier` and assign the numbers 1 to 10, one per person, in integer format.
  - Add the attribute `Height` and add the respective heights in float format: {5.5,5.6,6.1,6.1,6.0,5.9,5.8,5.8,5.8,9.1}
  - Add the attribute `Temperature` and add the respective temperatures in float format: {97.2,97.3,97.8,98,98.1,98.2,97.3,98,101,102}
  - Add the attribute `disease` and randomly assign one of the following diseases to each patient identifier: {Headache, cold, fever}
  - Add the attribute `Hospital` and assign it randomly as per the patient identifier.
  - Add the attribute `Cost` and assign the following values randomly as per the patient identifier, in float format: {20.0,1000.0,800.0,910.0,950.0,980.0,990.0,890.0,880.0,930.0}
- Obtain the statistics from the dataset you created
  - Now create a class to represent the above data.
  - Create methods to calculate the statistics:
    - Mean, Median, Mode of 'Height'
    - Mean, Median, Mode of 'Cost'
    - Mean, Median, Mode of 'Temperature'
- Comment on the statistic calculations and clearly mention your observations.

Note: This section may carry high weightage, so keep the observations short and to the point.
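
One possible shape for the class and its statistics methods is sketched below, using only the standard library. The class name `PatientRecords`, its method names and the placeholder patient names are hypothetical, the values are paired in the order they are listed above rather than randomly, and the `disease` and `Hospital` columns are left out for brevity. The CSV is written to an in-memory buffer here; a real file opened with `open(...)` would work the same way.

```python
import csv
import io
import statistics

HEIGHTS = [5.5, 5.6, 6.1, 6.1, 6.0, 5.9, 5.8, 5.8, 5.8, 9.1]
TEMPERATURES = [97.2, 97.3, 97.8, 98, 98.1, 98.2, 97.3, 98, 101, 102]
COSTS = [20.0, 1000.0, 800.0, 910.0, 950.0, 980.0, 990.0, 890.0, 880.0, 930.0]


class PatientRecords:
    """Holds patient rows and computes statistics for numeric columns."""

    def __init__(self, rows):
        self.rows = rows

    @classmethod
    def from_csv(cls, handle):
        """Build the records from any open CSV file-like object."""
        return cls(list(csv.DictReader(handle)))

    def column(self, name):
        """Return one column converted to floats."""
        return [float(row[name]) for row in self.rows]

    def stats(self, name):
        """Mean, median and mode of a numeric column."""
        values = self.column(name)
        return {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # On Python 3.8+, ties are resolved by returning the first value seen.
            "mode": statistics.mode(values),
        }


# Write the CSV into memory; placeholder names stand in for the ten patients.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["patient", "patient Identifier", "Height", "Temperature", "Cost"])
for i, (h, t, c) in enumerate(zip(HEIGHTS, TEMPERATURES, COSTS), start=1):
    writer.writerow([f"Patient {i}", i, h, t, c])
buffer.seek(0)

records = PatientRecords.from_csv(buffer)
for name in ("Height", "Temperature", "Cost"):
    print(name, records.stats(name))
```

Comparing the mean with the median for each column is a reasonable starting point for the observations, since a single unusual value shifts the mean far more than the median.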
- Python Documentation
- Class Code Materials
- Introduction to Computation and Programming Using Python
- Elements of Programming Interviews in Python
- Python Libraries