Creative Young Teen Minds

Universal Event | COVID-19: Calculate the Risk

Awards & Nominations

Creative Young Teen Minds has received the following awards and nominations. Way to go!

Global Nominee

Covid 19 Risk Calculator

High-Level Project Summary

We have developed a county-wise Covid-19 risk calculator that generates risk warnings and vulnerability ratings for people based on the county they live in, using various publicly available datasets. Our project aims to solve the lack of information available to the public and authorities at the local level. It provides insight into the factors that could lead to a surge in Covid-19 cases. With access to such tools and data insights, people can make better-informed decisions to protect themselves, and authorities can take the necessary precautions to control possible outbreaks in future.

Link to Project "Demo"

https://docs.google.com/presentation/d/1n0JaMdYjXfBMNQPuhRr70zFkqtwQUV4NO5wn6-o1taY/edit?usp=sharing

Link to Final Project

https://colab.research.google.com/drive/15ySNAL5OHC3wkVEm9ZNjyd1Ew5-YLXoi?usp=sharing

Detailed Project Description

What exactly does it do?

Our model provides Covid-19 risk warnings to the public based on their geographical location, down to the county level, together with the factors most likely to cause the predicted rise in cases. It provides information about risky activities and places in their locality so that people can take the steps needed to protect themselves and their community.

Report (our project report; please open in a new tab if it does not open directly)

How does it work?

The model uses openly available data from the past week on vaccinations, Covid-19 cases and deaths, population density and mobility to discover patterns, find correlations among these data, and identify which activities are more likely to cause a surge in cases. The model is trained on county-wise data from these datasets. It uses the Random Forest algorithm and the permutation importance method, both implemented with the Python sklearn package, to estimate how strongly each factor is related to the number of cases. Random Forest trains its trees with a bagging approach, and we obtained the best results with 50 decision trees as estimators. We repeated this process 10 times in order to obtain the best possible model.
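As a rough illustration of the approach described above, the sketch below fits a 50-tree Random Forest and ranks factors by permutation importance with sklearn. It is not the project's actual notebook code: the feature names, the synthetic data, the choice of a regressor rather than a classifier, the 25% test split and `n_repeats=10` are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the real column names are not given in this write-up.
FEATURES = ["mobility", "population_density", "vaccination_rate"]


def rank_factors(X, y, n_trees=50, seed=0):
    """Fit a Random Forest and return it with the mean permutation importances."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Each of the n_trees trees is fitted on a bootstrap sample (bagging).
    model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    model.fit(X_train, y_train)
    # Shuffle one column at a time on held-out data and measure the score drop.
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=seed
    )
    return model, result.importances_mean


# Synthetic stand-in for the county table: the target depends mostly on the
# first column, weakly on the second, and not at all on the third.
rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = 100 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 1, 300)

model, importances = rank_factors(X, y)
for name, score in sorted(zip(FEATURES, importances), key=lambda p: -p[1]):
    print(f"{name}: {score:.3f}")
```

Permutation importance is computed on the held-out split here so that the ranking reflects how much the model relies on each factor for unseen counties, not how well it memorised the training rows.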

Relation between Covid-19 cases and other factors using permutation importance (please open in a new tab if it does not open directly)

Our Vulnerability Calculation Formula

For a particular county:

$$\text{Risk} = \sum_{n=1}^{N} \frac{i_n \, f_n}{m_n}$$

Where $i_n$ = feature importance percentage of feature $n$

$f_n$ = value of feature $n$ for that county

$m_n$ = maximum value of feature $n$

$N$ = total number of features
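The formula above can be sketched in a few lines of Python. The function name and the sample numbers are made up for illustration, and the guard against a zero maximum is an added assumption, not something the formula specifies.

```python
def county_risk(importances, values, maxima):
    """Sum of importance_n * value_n / max_n over all features of one county."""
    if not (len(importances) == len(values) == len(maxima)):
        raise ValueError("all three sequences must have one entry per feature")
    if any(m == 0 for m in maxima):
        raise ValueError("a feature maximum of zero cannot be used for scaling")
    return sum(i * f / m for i, f, m in zip(importances, values, maxima))


# Hypothetical county with three features:
# importance percentages 50, 30 and 20; values at 50%, 50% and 100% of the maximum.
risk = county_risk(
    importances=[50, 30, 20],
    values=[20, 5, 300],
    maxima=[40, 10, 300],
)
print(risk)  # 50*0.5 + 30*0.5 + 20*1.0 = 60.0
```

Because each value is divided by its feature's maximum, every term is scaled to the same range, so if the importance percentages sum to 100 the risk score falls between 0 and 100 for non-negative features.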

What benefits does our model have?

Our model has various benefits for people who have very little access to proper information.

Most of the available Covid-19 risk calculators use only medical data as the basis for calculating an individual's risk. Medical records are certainly important in determining how severely a person may be affected and what special measures they need to take given their medical history.

But this is not a disease from which one can stay safe just by following the measures oneself, because it depends on the actions of the community as a whole. We have seen in several areas of the world that cases rose just because of the negligence of 10 percent of the population, and the consequences were suffered by people who had been cautious.

If the people who were following the precautions, and the authorities who were taking every possible measure, had had access to the overall picture of their locality that these datasets provide when combined, that bigger picture could have played a key role in this situation.

So our project presents information from these datasets to people in an easily understandable way, enabling sound decision making to prevent such situations in future.

What do we hope to achieve?

We hope to provide accurate and precise information to people about the Covid-19 risk in their locality, so that better decisions can be taken to prevent a rise in Covid-19 cases and the destruction it causes at the local level. Greater awareness among people and combined efforts at the county level will in turn result in better and more efficient mitigation of the pandemic at the national and global levels. Some things we would like to incorporate into our current model are:

Due to limited time and resources, and because data for other countries was not available, we were able to implement our solution only for the United States, down to the county level. We would like to enhance our model by incorporating new data to make it applicable to other parts of the world.
We would also like to incorporate some reinforcement learning into the model, so that it can learn by comparing its predictions with the actual data gathered in future and improve its overall performance.
We would like to build an interactive dashboard for our model, through which users can get information about the risk in their vicinity supported by strong visualizations.
Finally, the model we have developed so far works better in the short term, issuing warnings to the public and authorities about threats in the next few weeks or months. We would also like to develop a tool that helps policy makers gain a better understanding of vulnerability in such pandemics and determine better indicators of sustainability in these situations, so that we can deal with such challenges better in future. Since this is one of our main focuses, we have described this solution in detail below under the heading 'Looking at the Big Picture'.

Looking at the Big Picture

The current model is suited to a short time horizon, as it uses data from the past week to the past month.

But this pandemic also forced us to look at other key long-term factors that we need to focus on to tackle such pandemics in future, and that will also be critical for other humanitarian goals such as sustainable development and climate change.

We would like to build another indicator that uses the following five kinds of data, categorized by country (for bigger countries we would like to go down to state, city and county level, depending on size and population) and demography:

1. Health Data

Health is an important factor, especially in pandemics like this one. We know that places with good health infrastructure, and with most of their population covered by health insurance, will be better off. It is also important to note that patients with certain conditions, such as several respiratory and heart diseases, are more vulnerable to Covid-19. So we would like to do the following with long-term health data:

Make a comparative study of medical infrastructure and its accessibility to people, by demography and household income, in different regions of all countries.
Compare the health insurance coverage of people across different countries by demography.
Study and map possible hotspots for different types of diseases, based on the data available for each disease. This will help in taking good measures in these hotspots if such a situation breaks out. For example - People in large cities with high population density and poor AQI are at greater risk of respiratory problems and would be more vulnerable to diseases like Covid-19.
Study herd immunity to particular diseases, which will help in identifying the zones of least risk.
Map regional and seasonal diseases across all countries that spread through smaller regions in a particular time period, and compare their effects with the pandemic to see whether they intensify the situation.

We hope to carry this out down to the local level, so that authorities and people can take good decisions if such a situation breaks out in future. It will also help us learn which groups of people are more likely to survive, so that special attention can be paid to vulnerable groups rather than resources simply being divided equally. This will help governments allocate the right resources to the right places according to the vulnerability of different groups of people.

2. Educational and Literacy Data

Better literacy and education rates result in greater awareness in society, and the greater the awareness, the lower the risk. For this we would like to do the following:

Find a correlation between Covid-19 cases, deaths and literacy rates.
Find a relation between educational status, vaccination rates and vaccine hesitancy.
Study awareness campaigns, especially in underdeveloped zones, and their effectiveness.
Study how well personal hygiene and sanitation measures are followed in different places, such as public places, workplaces and local businesses.

For example - A less educated shop owner in an underdeveloped area is less likely to take proper sanitation measures. Moreover, we would also like to investigate the strictness of authorities and its effect on the public.

All these trends will help in launching better awareness campaigns and devising new strategies to raise awareness among the targeted audience in a more understandable way.

3. Economic Data

This will deal with the research, collection, study and analysis of data for the following purposes, across all countries down to the county level and by demography:

Study the loss of work due to Covid-19 and the pay cuts across different communities in different places.
Make a comparative study of which industries are more sustainable and adaptive in situations like these.
Study the various measures taken by large, medium and small companies across different countries to keep their business moving.
Identify the most economically vulnerable sections of society in situations like these across different regions. For example - A person may not be economically vulnerable in general but can become so in a situation like this because of large pay cuts or loss of work.

This will help the concerned institutions and organisations determine a proper policy to tackle problems like this, provide more support to economically vulnerable sections in such situations, and develop a better approach towards a more sustainable economy.

4. Environmental and Geographic Data

Through these data we would like to find the relationship between climate change and the environmental crisis across different geographical regions and the outbreak of the pandemic. Climate change is resulting in genetic mutations in various organisms, which may increase the likelihood of such pandemics in future:

Find the relationship between the spread of Covid-19 and climate change, that is, whether the spread of such pandemics is directly related to increasing climate change.
Study the outbreak of the Covid-19 pandemic across different geographical regions.
For example - Whether Covid-19 is less likely to spread in hilly areas due to the fresh atmosphere and different climatic conditions.
This will increase our knowledge of the correlation between climate change and such pandemics, so that we can take proper action to stop such outbreaks in future.

5. Administrative Data

Through this we would like to study the efficacy, ability and willingness of administrations at the national, state and local levels, to find out which countries were better off in procuring vaccines and health infrastructure to provide proper facilities to their people, and why, so that similar measures can be adopted in other countries in future.

What tools, coding languages, software and hardware did we use?

We used the following tools and software in developing our project:

•Python

•MS Excel

•Google Sheets

•PyCharm

•Google Colab

•Sklearn

•NumPy

•Pandas

•Matplotlib

•Seaborn

•Pickle

•MS Word

•Google Docs

•Google Slides

•Google Drive

Space Agency Data

We used the mobility data from the EO Dashboard maintained by NASA, ESA and JAXA, which can be accessed from the link below. It played a crucial role in training our model.

https://eodashboard.org/?poi=GG-GG

We also intend to use the air quality data in our long-term model to find correlations between AQI, climate change and the severity of the pandemic across different regions, but we were not able to include it in our current model due to lack of time.

https://eodashboard.org/?poi=W1-N1

Hackathon Journey

How would we describe our Space Apps Challenge Experience?

Space Apps has given us a golden opportunity to learn new skills and to use our skills and talents for the betterment of society, working on a problem that has caused large-scale destruction across the whole world.

What did we learn?

Working on this project for the hackathon, and contributing a solution to mitigate the effects of the Covid-19 pandemic, gave us the opportunity to learn and grow. We gained knowledge of many new tools and technologies, as well as of the studies and research going on for Covid-19. Some of our major learning outcomes were:

We did an extensive search to gather the various datasets, which sharpened our querying skills and led us to scan datasets on multiple websites to find useful data, unlike other projects in which data is readily taken from one or two places.
While searching for the data, we also came across a lot of new information about Covid-19 and the ongoing research, which increased our knowledge of the pandemic and made us more aware of the precautions to follow.
We learned to work efficiently with big data using Python libraries such as Pandas and Pypolars, along with MS Excel. Before this we had never had an opportunity to work with such massive datasets.
We got the opportunity to polish skills we already had good knowledge of, and to explore and discover new features, functions and algorithms. For example - We learnt and used several Python libraries and built-in Excel features that we had not even heard of before.
We also gained good experience of teamwork and time management by working together for a common cause, coordinating with each other according to our skill sets and availability, and developing something useful.
And most of all, we came to know the practical application of data science and how we could use it to solve a problem that has ravaged us personally and professionally at the global level.

What inspired our team to choose this challenge?

As we all know, Covid-19 has badly affected our lives both physically and mentally, and it has become a challenge for the whole world. Every day the news headlines showed the increasing number of cases and deaths, and fear gripped the hearts and minds of people. The pandemic disrupted people's daily routines. We as students could not go to our schools, missed the golden moments we used to enjoy with our friends, and were highly strained by continuous online classes. Our elders could not go to their workplaces, and the lack of joy and excessive time spent on screens put a mental strain on everyone and affected their health. Everything appeared to be lifeless. But despite the strict control measures taken by the authorities and the precautions taken by the people, there was no improvement in the situation. There was no reduction in the number of Covid-19 patients or the number of deaths.

No household was left unaffected by Covid-19, especially during the second wave. Several of our relatives and acquaintances also suffered from Covid-19 despite taking the necessary precautions and strict care of cleanliness and hygiene, just because of a lack of proper information. It is very difficult and frustrating to remain isolated and trapped in one room, especially when you are not well and need support from your family members. This led us to the idea of the project: to build a tool to tackle the situation.

How did we develop our project?

1. Data Collection: First, we gathered various kinds of datasets from several different sources at the county level for the US. Our project is based on the following datasets: population density data, Covid-19 cases data, mobility data, vaccination data etc. We gathered these data in CSV format.

2. Data Filtration and Data Mining: We then opened, edited and filtered the gathered data in MS Excel, using functions and operations such as VLOOKUP, OFFSET, AVERAGE, SORT and GROUP BY to arrange the data in a proper order.

3. Data Integration and Data Warehousing: Then we integrated all the data into a single spreadsheet to make a final database for training our model and the algorithm. We stored the final integrated data as a spreadsheet on Google Drive.

4. Data Cleaning and Data Imputation: The final data was then opened in Google Colaboratory and cleaned using the Pandas library. Some values were missing, and these were imputed using the KNN Imputer of the sklearn library.

5. Machine Learning Modelling and Model Training: We used the Random Forest algorithm for our machine learning model and the permutation importance method to calculate feature importance. We used the train_test_split method to split our final dataset into train and test datasets, and fitted the model on the training data.

6. Model Validation and Model Deployment: We then tested the performance of the model on the test datasets using various metrics, and repeated this process 10 times to get the best possible model. We further plan to deploy it to the web in the form of an interactive dashboard or API so that it can be easily used by the public.
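The integration and imputation steps (3 and 4 above) could look roughly like the sketch below if done entirely in Pandas. We did the integration itself in a spreadsheet, so this is only an illustrative equivalent: the table contents, the column names, the use of a county FIPS code as the join key and `n_neighbors=2` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical per-source tables keyed by county FIPS code (made-up values).
fips = ["01001", "01003", "01005", "01007"]
cases = pd.DataFrame({"fips": fips, "cases_7d": [120, 340, 45, 80]})
mobility = pd.DataFrame({"fips": fips, "mobility_index": [0.8, np.nan, 0.5, 0.6]})
density = pd.DataFrame({"fips": fips, "pop_density": [91.8, 114.6, 31.0, np.nan]})

# Step 3: integrate the sources into one table, one row per county.
merged = cases.merge(mobility, on="fips", how="inner").merge(
    density, on="fips", how="inner"
)

# Step 4: fill each missing value from the most similar counties.
feature_cols = ["mobility_index", "pop_density"]
imputer = KNNImputer(n_neighbors=2)
merged[feature_cols] = imputer.fit_transform(merged[feature_cols])

print(merged)
```

An inner join keeps only counties present in every source, which matches the constraint that the model needs every feature for each county. Reading FIPS codes as strings avoids losing leading zeros.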

What challenges and setbacks did our team face?

We faced a major problem in data collection, as the datasets were not uniform. Some datasets were not available at the county level, so we could not use them in our project.
Many datasets from US government sites and other credible sources were from past years or had not been updated, so they could not be used in our project.
Several datasets were under copyright, so we could not include them in our project.
The data obtained was messy and unorganised.
Some datasets had missing values and could not be used by the algorithm as they were, so we needed to use the KNN Imputer to impute the missing values.
Data cleaning and mining took a lot of effort.
Combining and integrating the datasets was also a challenge.
It was a constant race against time.
Due to lack of time we were not able to fully develop everything we had thought about.

A word of thanks!

First of all, we would like to express our gratitude to NASA and the other organisers for giving us a golden opportunity to express and explore ourselves. Because of them we got a chance to think out of the box and create something that could be an asset to society. Secondly, we would like to thank the 'Centers for Disease Control and Prevention', 'US Census', 'Google LLC' and 'The New York Times' for providing the datasets for our model. Finally, we would like to thank each and every person who supported us directly or indirectly in our journey.

References