Awards & Nominations

Ani's Cuff has received the following awards and nominations.

Global Winner
Best Use of Science

The solution that makes the best and most valid use of science and/or the scientific method.

de Bugger: a geographic bio-info viewer and an emulator with bio-vectors that analogize across species

High-Level Project Summary

Insect species account for about half of all species in the animal kingdom. Mankind's survival hinges on insect biodiversity, which is now in decline. Current bio-occurrence platforms address this only by keeping records. de Bugger is not only a geographic bio-info viewer; it can also infer beyond time, space, and species. It is modeled with U-Net, a CNN framework, trained on a variety of Earth observation data together with bio-vectors. We use Taiwan's eDNA and occurrence data as a proxy for inferring global insect distribution. de Bugger thus benefits those who want to preserve biodiversity but lack occurrence data, and gives participants references for their next decision.

Link to Final Project

Detailed Project Description

What do we develop?






  • A geographical viewer of the biodistribution, trails, and biodiversity of insect species.
  • Feature importances for four factors affecting biodistribution: elevation, precipitation, land cover, and vegetation, derived from four Earth observation datasets.
  • Future (until 2022 Q3) biodistribution prediction for insect species.
  • A bio-vector that preserves cross-species relationships and can be applied to many tasks in this field.
  • A predictor/emulator built on a CNN + bio-vector framework, able to analogize to species and areas outside of the training data.
  • An emulator easily applied to predict the biodistribution of different areas, future times, and various species.
  • A predictor that benefits areas and people lacking insect occurrence data.


To move beyond today's methodology for rescuing insect biodiversity, which consists of platforms that merely preserve occurrence records, we model a predictor that estimates the biodistribution of insects. A reasonable approach is to use the abundant NASA Earth observation data, which encode environmental variables, such as climatological or geological factors, into remotely sensed images. With those images, modeling such a predictor becomes possible with the CNN family of models, which can map multiple input images to an image of the geographic distribution of insects.


To create a model that applies to different landforms, using all Earth observation data would be intuitive; however, it is impractical because the data are too large (over 5 TB per dataset). Thus, a practical emulator can only learn from a bounded area and the limited set of species living inside it. Since building one emulator per species would waste too much time and effort, developing a bio-vector is necessary to expand our predictor's ability to evaluate insect species that are not represented in the training data and/or live outside the training data's geographical coverage.


Given that, we decided to develop de Bugger using NASA Earth observation data and in-situ species occurrence data from Taiwan [7]. To preserve cross-species relationships such as symbiosis or competition, we borrowed the concept from the famous word2vec, which models the relationships of neighboring words in the same sentence. We analogize this to the relationships of species occurring in the same place at the same time, which we call insect2vec, or bio-vector.


With bio-vectors, there is no need to observe every species' biodistribution: we can train de Bugger on several representative species and analogize to the remaining species outside the training data. Thanks to bio-vectors, de Bugger can benefit areas and participants that lack insect occurrence data and give them references to decide their next step.
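The "analogize to unseen species" step can be sketched as a nearest-neighbor lookup in bio-vector space; the species names and vectors below are invented placeholders, not output of the real model:

```python
import numpy as np

# Hypothetical bio-vectors for a few trained species; in de Bugger these
# would come from the insect2vec model.
bio_vectors = {
    "species_A": np.array([0.9, 0.1, 0.2]),
    "species_B": np.array([0.7, 0.3, 0.1]),
    "species_C": np.array([0.1, 0.9, 0.7]),
}

def nearest_species(query_vec, vectors):
    """Return the trained species whose bio-vector is closest (cosine similarity)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(vectors, key=lambda name: cos(query_vec, vectors[name]))

# For a species absent from the training data, borrow the distribution
# predictions of its nearest neighbor in the embedding space.
unseen = np.array([0.85, 0.15, 0.15])
proxy = nearest_species(unseen, bio_vectors)
print(proxy)  # → species_A
```

The embedding distance stands in for ecological similarity, so the predictor trained on representative species can still produce a reference estimate for species it never saw.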


To sum up, global observation data are too large to process, so we use observation data and in-situ species distribution data from Taiwan as a proxy for global data writ large.



What does it do?

The characteristics of de Bugger are summarized below.

To view de Bugger: https://debugger.vercel.app/


1. Overview of de Bugger.

de Bugger shows the habitat distributions of various native insect species in Taiwan via an open-source geographical information system.


2. Present the provided insect species.

de Bugger enumerates all the Taiwanese native insect species whose observation/occurrence data are open to the public. Users can select one of these entries for more detailed information.


3. Visualize the spatio-temporal information.

After selecting an insect species, users can access its spatio-temporal information on the map, where four ecology-related factors (vegetation, rainfall, elevation, and land cover) are also visualized. In addition, the time bar helps users see which period the map displays.


4. Feature importance for predicting future biodistribution.

Four environmental features (vegetation, precipitation, elevation, and land cover) are visualized with importance scores. These scores are the feature importances learned automatically by the CNN's attention module, which help researchers understand the correlation between features and biodistribution.


5. Information Filtration.

de Bugger provides useful information, such as the trail of the selected insect species, visualized on the map. Users can decide whether to view it by toggling it on or off.


6. Customize the time span.

One of the coolest features of de Bugger is that we offer a future estimate of insects' distribution and trails. Users can customize the time span they want to view; when they change it, the occurrences and trails in each grid cell update on the map automatically.


7. Prediction of future insect distribution

One of the main differences between de Bugger and existing biological platforms is that de Bugger is powered by an AI engine. Therefore, using bio-vector and Earth data jointly, de Bugger is able to predict future biological distribution.


8. Visualize time-variant information

By extending the time bar below vertically, it becomes a time-variant information window showing the total occurrences of the selected insect species. Note that de Bugger also provides an occurrence forecast given by the proposed time-series deep learning model built on our bio-vectors.
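As a toy stand-in for that forecast (the actual service uses a deep time-series model built on bio-vectors), a simple linear-trend extrapolation over hypothetical monthly occurrence counts illustrates what the time bar displays:

```python
import numpy as np

# Invented monthly occurrence counts for one species (placeholders only).
months = np.arange(12)
counts = np.array([5, 6, 8, 9, 11, 12, 14, 15, 17, 18, 20, 21], dtype=float)

# Fit a degree-1 polynomial (trend line) and extrapolate 3 months ahead.
slope, intercept = np.polyfit(months, counts, 1)
future_months = np.arange(12, 15)
forecast = slope * future_months + intercept
print(np.round(forecast, 1))  # three forecast points continuing the trend
```

The real model replaces the trend line with a learned sequence model, but the interface is the same: past counts in, future counts out, plotted on the extended time bar.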


9. About us

Our team consists of a project manager, data scientists, a frontend engineer, a UI/UX designer, and a marketing specialist.




How does it work?

de Bugger can be broken down into three parts: bio-vector formulation, CNN training, and a user-friendly web service.


First, we collect insect spatio-temporal distribution data from the Taiwan Biodiversity Network (TBN) [7]. The core tasks of TBN are Taiwan's biodiversity survey and the structuring of biological distribution data. TBN has collected 9 million records, including distribution records of more than 50,000 taxa across the four biological kingdoms. The distribution data can be queried by common name, scientific name, etc., in a way that conforms to the classification hierarchy.


Based on the spatio-temporal distribution of insects, we can find the relationships between different species, and from these relationships a vector unique to each insect can be calculated. Bio-vector is inspired by word2vec, a well-known model in natural language processing (NLP). In word2vec, assuming that two consecutive words in a sentence are more related than other pairs, the relationship between words can be modeled by estimating their distance. By defining a latent space with a pre-set dimensionality, each word is cast to a position in that space; a word can thus be represented by a vector, also called an embedding. The relationship between two words can then be directly, if approximately, quantified by the distance between their embeddings. Since word2vec extracts relationships between words without any data labeling, derived models in many domains beyond NLP have sprung up in recent years.

 

In this field, our goal is to cast each insect species to a latent space and generate an embedding for each of them. We assume that insect co-occurrence indicates a close relationship, so the embeddings of two insects occurring at the same place and time should be as close as possible. By contrast, two insect species that never occur at the same place should map to embeddings that are far apart in the latent space. This insect2vec (bio-vector) can be applied to many analytical tasks. For example, one can estimate the habitat distribution of an insect by combining its bio-vector with geo-spatial data. Based on this model, one can also predict which insect species could find a local environment inhospitable and take action accordingly to avoid predictable ecological disasters.
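A simplified, count-based sketch of the insect2vec idea: the real system trains a word2vec-style model, but a co-occurrence matrix factorized by truncated SVD illustrates the same property that co-occurring species end up with nearby embeddings. The survey records and species names below are invented for illustration:

```python
import numpy as np

# Hypothetical survey records: each "sentence" lists species observed in the
# same grid cell and time window (invented data, not TBN records).
surveys = [
    ["ant", "aphid", "ladybug"],
    ["ant", "aphid"],
    ["ladybug", "aphid"],
    ["dragonfly", "mosquito"],
    ["dragonfly", "mosquito", "mayfly"],
]

species = sorted({s for survey in surveys for s in survey})
idx = {s: i for i, s in enumerate(species)}

# Build a co-occurrence matrix: species seen together get higher counts.
cooc = np.zeros((len(species), len(species)))
for survey in surveys:
    for a in survey:
        for b in survey:
            if a != b:
                cooc[idx[a], idx[b]] += 1

# Factorize with truncated SVD to obtain low-dimensional bio-vectors
# (a count-based stand-in for skip-gram training).
u, s, _ = np.linalg.svd(cooc)
bio_vec = u[:, :2] * s[:2]   # one 2-dimensional embedding per species

def dist(a, b):
    return float(np.linalg.norm(bio_vec[idx[a]] - bio_vec[idx[b]]))

# Co-occurring species lie closer together than non-co-occurring ones.
print(dist("ant", "aphid") < dist("ant", "mosquito"))  # → True
```

Swapping the SVD step for an actual skip-gram model keeps the interface identical: each species gets a vector, and embedding distance approximates ecological association.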



Second, a U-Net-like CNN [8] is used to integrate the rich information (precipitation, vegetation, temperature, etc.) provided by NASA Earthdata Search. We use AppEEARS [6] to download environmental information for Taiwan from NASA Earthdata Search. Each environmental variable over a given time period can be viewed as an image; to combine the spatio-temporal information, we stack these images into a multi-channel tensor. Based on our insect vectors and environmental conditions, de Bugger can predict the future distribution of different insect species.
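As a small illustration of the stacking step, the four environmental layers can be combined into the (height, width, channels) tensor a U-Net-style CNN consumes; the rasters below are random placeholders rather than real NASA data, and the 64x64 grid size is an assumption:

```python
import numpy as np

# Hypothetical rasters for one time step, each resampled to the same grid
# over Taiwan (random placeholder values, not real NASA data).
h, w = 64, 64
elevation     = np.random.rand(h, w)
precipitation = np.random.rand(h, w)
land_cover    = np.random.rand(h, w)
vegetation    = np.random.rand(h, w)

# Stack the four layers into one multi-channel image, the (H, W, C) tensor
# a CNN consumes; a batch axis is added for training.
x = np.stack([elevation, precipitation, land_cover, vegetation], axis=-1)
batch = x[np.newaxis, ...]   # shape (1, 64, 64, 4)
print(batch.shape)
```

The U-Net then maps this multi-channel input to an output image of the same spatial size, interpreted as the predicted geographic distribution of the species encoded by the accompanying bio-vector.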


We believe that predictions for different species rely on different environmental information. The distribution of one insect may be highly correlated with precipitation, while the distribution of another may relate to elevation. Knowing which feature matters to which insect is a real issue, since this can make the research much easier. To address this, an attention module is integrated into the CNN to select important features automatically during training. Specifically, the attention layer assigns a score to each input channel (the environmental layers provided by NASA), and the CNN generates predictions from the weighted inputs. During training, the weights are adjusted automatically by the gradient optimizer: the more important the feature, the larger its weight. In this way, the attention layer suppresses unrelated channels and keeps the important features for further use.
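A minimal numpy sketch of this channel-attention idea: in the real model the scores are trainable parameters inside the CNN, whereas the logits here are fixed, illustrative values:

```python
import numpy as np

def channel_attention(x, logits):
    """Scale each input channel by a softmax weight.

    x: (H, W, C) stacked environmental layers.
    logits: (C,) raw scores that the gradient optimizer would tune during
    training (fixed illustrative values here).
    """
    weights = np.exp(logits) / np.exp(logits).sum()   # softmax over channels
    return x * weights, weights                       # broadcast over H, W

# Four channels: elevation, precipitation, land cover, vegetation.
x = np.ones((8, 8, 4))
logits = np.array([2.0, 0.5, 0.1, 1.0])   # illustrative scores
weighted, w = channel_attention(x, logits)
print(np.round(w, 2))   # larger weight = more important feature
```

After training, reading off the learned weights gives exactly the per-feature importance scores shown in the de Bugger interface.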






With the help of bio-vectors and the power of deep learning, we integrate environmental information with insect co-occurrence to build a CNN engine that predicts the distribution of insects. This way, we can understand changes in insect biodiversity and the trajectories of insect migration.


Last but not least, we build a web-based service featuring a user-friendly front end, a reliable back end, and a polished UI that aims to clearly communicate our model predictions. With de Bugger, the loss of insect biodiversity can be seen at a glance, and it can also draw the public's attention to insect biodiversity.



What benefits does it have?

de Bugger is built using Earth observation data and in-situ species distribution data inside Taiwan. 


It visualizes geographic historical bioinformation, such as the distribution, trails, and diversity of insect species. Furthermore, it can infer insect distribution across different time points, environments, and species. In addition, we provide feature importances so users can see more clearly which environmental factors matter.


de Bugger’s bio-vector framework expands our predictor’s ability to evaluate insect species that are not represented in the training data and/or live outside of the training data’s geographical coverage zone.


We believe that our de Bugger tool can give countries lacking biological occurrence data the means to make scientifically reasonable inferences, and help local communities and scientists hold their ground in the battle to preserve insect biodiversity.



What do we hope to achieve?

de Bugger is a user-friendly web service that visualizes and predicts insect distribution. We hope de Bugger advances our ability to detect insect life, track and predict change over time, and draw the attention of scientists and society to combating the loss of insect biodiversity.



What tools, coding languages, hardware, software did you use to develop your project?


de Bugger adopts separate front-end and back-end development and deployment, with continuous integration via CI/CD tools to develop the web app rapidly. The de Bugger service can be divided into a view layer, a control layer, and a service layer. The service layer sits closest to the database and data sources and is responsible for data queries, caching, and DAOs. As middleware, the control layer manages and transforms the service layer's data to bring it closer to the business logic. Finally, the view layer runs on the client and displays the data returned by the control layer to the user.


Front-end


We use the Next.js framework to develop the front end of the web page. Next.js is a React framework for building server-side rendered React applications. It is open source and runs on Windows, Linux, and macOS. In short, it can create SEO-friendly websites, static applications/websites, lightweight apps, and PWAs compatible with both mobile and desktop, and it works at an enterprise level.


Like PHP, Next.js follows a server-rendered model, but applications are built with JavaScript and React. de Bugger also uses the Next.js middleware API as the control layer of the service and as a bridge between front end and back end.


Back-end


We use FastAPI as the back-end framework. FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.6+ based on standard Python type hints. Its most exciting feature is out-of-the-box support for asynchronous code using Python's async/await keywords. FastAPI is lightweight, easy to develop with, and extensible, making it well suited to machine learning projects.


The back-end database is PostgreSQL, and the model's precomputed results are pre-stored in the database, making our service faster, more stable, and more reliable. Finally, we containerize each back-end service to enable faster scaling, development, iteration, and deployment.



Deployment

de Bugger is deployed with the front and back ends separated. The back-end production environment runs on the team's own server, while the front end is deployed on Vercel.


Model

We use the TensorFlow and Keras deep learning frameworks to train our CNN.



Project Code

https://github.com/TsungTang/debugger

https://github.com/bonzoyang/DeBugger

https://github.com/iankuoli/Insect2Vec

https://github.com/s83711123456789/DeBuggerNet

Space Agency Data

How We Used Space Agency Data in This Project?

Satellite imagery around Taiwan from the Space Agency is fed into the convolutional neural network as input. We select environmental information, including elevation, precipitation, vegetation, and land cover data, which are highly correlated with insect distribution.


We use the following datasets from NASA Earthdata Search: the MOD44B Version 6 Vegetation Continuous Fields (VCF) product; the Terra and Aqua combined Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type (MCD12Q1) Version 6 product; the MOD13A1 Version 6 product; the NASA Digital Elevation Model (NASADEM_HGT) Version 1 dataset; and the TRMM (TMPA) rainfall estimate.



The MOD44B Version 6 Vegetation Continuous Fields (VCF) yearly product is a global representation of surface vegetation cover as gradations of three ground cover components: percent tree cover, percent non-tree cover, and percent non-vegetated (bare). VCF products provide a continuous, quantitative portrayal of land surface cover at 250 meter (m) pixel resolution, with a sub-pixel depiction of percent cover in reference to the three ground cover components. The sub-pixel mixture of ground cover estimates represents a revolutionary approach to the characterization of vegetative land cover that can be used to enhance inputs to environmental modeling and monitoring applications.[1]


The Terra and Aqua combined Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type (MCD12Q1) Version 6 data product provides global land cover types at yearly intervals (2001-2019), derived from six different classification schemes listed in the User Guide. The MCD12Q1 Version 6 data product is derived using supervised classifications of MODIS Terra and Aqua reflectance data. The supervised classifications then undergo additional post-processing that incorporate prior knowledge and ancillary information to further refine specific classes.[2]


The MOD13A1 Version 6 product provides Vegetation Index (VI) values at a per pixel basis at 500 meter (m) spatial resolution. There are two primary vegetation layers. The first is the Normalized Difference Vegetation Index (NDVI), which is referred to as the continuity index to the existing National Oceanic and Atmospheric Administration-Advanced Very High Resolution Radiometer (NOAA-AVHRR) derived NDVI. The second vegetation layer is the Enhanced Vegetation Index (EVI), which has improved sensitivity over high biomass regions. The algorithm for this product chooses the best available pixel value from all the acquisitions from the 16 day period. The criteria used is low clouds, low view angle, and the highest NDVI/EVI value.[3]


NASADEM data are distributed in 1° by 1° tiles and cover all land between 60° N and 56° S latitude, about 80% of Earth's total landmass. NASADEM_HGT data product layers include DEM, number of scenes (NUM), and an updated SRTM water body dataset (water mask). The NUM layer indicates the number of scenes that were processed for each pixel and the source of the data. A low-resolution browse image showing elevation is also available for each NASADEM_HGT granule.[4]

For precipitation data, we turn to the daily accumulated precipitation product generated from the research-quality 3-hourly TRMM Multi-Satellite Precipitation Analysis (TMPA). Annual precipitation data can be easily obtained using the NASA Giovanni web service, which provides data aggregated over a selected time interval.[5]

Hackathon Journey

How would you describe your Space Apps experience? 

Our NASA Hackathon journey was full of challenges, excitement, and a sense of accomplishment. During project development we encountered many challenges; fortunately, the encouragement of teammates and our enthusiasm to do our best for the Earth inspired us to face them. When development was complete, we were extremely excited about de Bugger's ability to make the Earth a better place.


What did you learn? 

We learned the importance of insects to the environment and mankind. Also, we understand that we must take practical actions to maintain biodiversity.


In addition, we discovered relationships between environmental data and species in NASA's rich datasets, and we learned how deep learning algorithms can integrate environmental information and contribute to biodiversity.



What inspired your team to choose this challenge?

When we went through all the challenges, the issue of biodiversity loss caught our eye. The loss of biodiversity is a worldwide problem. On one hand, human life and insects are closely related and easily influence each other; on the other hand, human-caused loss of insect biodiversity is happening in every country, every day.

There are currently some bio-occurrence platforms showing the distribution of insect species. However, those platforms only keep distribution records; no further information is provided. This inspired us to develop a bioinformation tool that provides more than just records.



What was your approach to developing this project?

We broke the project down into UI/UX, presentation, front end, back end, and data modeling, with each team member responsible for their own part. Through communication and collaboration, we completed this project and brought de Bugger to the world: a win-win for mankind and all living creatures on Earth.



How did your team resolve setbacks and challenges? 

Our team members are willing to support each other and are full of enthusiasm for the environment. It is the companionship of the team members and the dream of doing our own part for the Earth that supports us through setbacks and challenges.


Is there anyone you'd like to thank and why?

We would like to thank Justina Hwang and Ani for their advice on English writing that greatly improves this project.


What problems and achievements did your team have?

In this work, we first proposed a model, insect2vec, that calculates an embedding for each insect by casting it to a latent space, where the distance between embeddings reflects the co-occurrence of the corresponding insects in a real-world environment. Then, based on these bio-vectors, a spatio-temporal model was proposed to model the habitat distribution of insects. The future habitat distribution of some insects can be predicted and shown as bubble charts on our website. We encourage anyone interested in this work to build derived applications on these tools and grow the ecosystem for social and ecological good.

References

[1] DiMiceli, C., Carroll, M., Sohlberg, R., Kim, D., Kelly, M., Townshend, J. (2015). MOD44B MODIS/Terra Vegetation Continuous Fields Yearly L3 Global 250m SIN Grid V006 [Data set]. NASA EOSDIS Land Processes DAAC. Accessed 2021-09-29 from https://doi.org/10.5067/MODIS/MOD44B.006


[2] Friedl, M., Sulla-Menashe, D. (2019). MCD12Q1 MODIS/Terra+Aqua Land Cover Type Yearly L3 Global 500m SIN Grid V006 [Data set]. NASA EOSDIS Land Processes DAAC. Accessed 2021-09-29 from https://doi.org/10.5067/MODIS/MCD12Q1.006


[3] Didan, K. (2015). MOD13A1 MODIS/Terra Vegetation Indices 16-Day L3 Global 500m SIN Grid V006 [Data set]. NASA EOSDIS Land Processes DAAC. Accessed 2021-09-29 from https://doi.org/10.5067/MODIS/MOD13A1.006


[4] NASA JPL (2020). NASADEM Merged DEM Global 1 arc second V001 [Data set]. NASA EOSDIS Land Processes DAAC. Accessed 2021-09-29 from https://doi.org/10.5067/MEaSUREs/NASADEM/NASADEM_HGT.001


[5] Huffman, G.J., D.T. Bolvin, E.J. Nelkin, and R.F. Adler (2016), TRMM (TMPA) Precipitation L3 1 day 0.25 degree x 0.25 degree V7, Edited by Andrey Savtchenko, Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed: [2021-09-28], 10.5067/TRMM/TMPA/DAY/7


[6] AppEEARS Team. (2021). Application for Extracting and Exploring Analysis Ready Samples (AppEEARS). Ver. 2.66. NASA EOSDIS Land Processes Distributed Active Archive Center (LP DAAC), USGS/Earth Resources Observation and Science (EROS) Center, Sioux Falls, South Dakota, USA. Accessed September 27, 2021. https://lpdaacsvc.cr.usgs.gov/appeears


[7] Taiwan Biodiversity Network. TBN Home Page. https://www.tbn.org.tw/ . Accessed 29 September 2021. Taiwan Endemic Species Research Institute.


[8] Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional Networks for Biomedical Image Segmentation. CoRR, abs/1505.04597.

Tags

#Insect Biodiversity, #Earth Data, #Earth Science, #Machine Learning, #Species Distribution Modeling

Global Judging

This project has been submitted for consideration during the Judging process.