Team Updates

Bio-vector: encoding the habitat distribution of species


  • A species can be represented by a vector. Species in the same colony, or with highly overlapping habitat distributions, would have similar vectors.
  • Suppose colonies A and B are similar. If a rare insect x appears in colony A, colony B is also a potential site where x might show up.
  • If an endangered insect species x is successfully bred in colony A, then colony B would likely be another feasible colony for breeding that species.
  • If colony A serves as the control group in an ecological experiment, then colony B is probably a suitable treatment group.
  • In addition to taxonomy and DNA, bio-vector offers a new way to measure the similarity between species, from the perspective of the ecological environment.




We collect spatial-temporal distribution data on insects from the Taiwan Biodiversity Network (TBN) [1]. TBN's core tasks are surveying Taiwan's biodiversity and structuring biological distribution data. TBN has collected 9 million records, including distribution records of more than 50,000 taxa across the four biological kingdoms.


From the spatial-temporal distribution of insects, we can infer the relationships between different insects. Using these relationships, a vector unique to each insect can be calculated. Bio-vector is inspired by word2vec, a well-known model for natural language processing (NLP). word2vec assumes that words appearing close together in a sentence (for example, two consecutive words) are more related than words that do not, so the relationship between words can be modeled by estimating their distance. Based on this, by defining a latent space with a pre-defined dimensionality, one can map each word to a position in that space; a word can thus be represented by a vector, also called an embedding. The relationship between two words can then be approximately quantified by calculating the distance between their embeddings. Because word2vec extracts the relationships between words without any data labeling, derived models have recently sprung up in many domains beyond NLP.
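As a rough illustration of quantifying a relationship from embeddings, the sketch below compares toy vectors with cosine similarity. Cosine similarity is one common choice for embeddings and is an assumption here, not necessarily the measure used in this project; the species names and 3-dimensional vectors are hypothetical and hand-picked for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings (not learned from real data).
embeddings = {
    "species_a": [0.9, 0.1, 0.0],
    "species_b": [0.8, 0.2, 0.1],
    "species_c": [0.0, 0.1, 0.9],
}

# Vectors pointing in similar directions get a similarity close to 1.
sim_ab = cosine_similarity(embeddings["species_a"], embeddings["species_b"])
sim_ac = cosine_similarity(embeddings["species_a"], embeddings["species_c"])
print(f"a vs b: {sim_ab:.3f}, a vs c: {sim_ac:.3f}")
```

With these toy values, species_a scores much closer to species_b than to species_c, which is the kind of ordering a trained embedding is meant to expose.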

 

In this work, our goal is to map each insect species into a latent space and generate an embedding for each of them. We assume that insect co-occurrence indicates a close relationship, so the embeddings of two insects that occur at the same place and time should be as close as possible. By contrast, the embeddings of two insect species that do not occur at the same place should be far apart in the latent space. The insect2vec (bio-vector) model can be applied to many analytical tasks. For example, one can estimate the habitat distribution of an insect by combining the bio-vector with geo-spatial data. Based on this model, one can also predict which insect species could be harmful to the environment in a given place, and take action accordingly to avoid predictable ecological disasters.
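The co-occurrence assumption can be sketched as a preprocessing step: occurrence records are grouped by place and time, and each group plays the role that a sentence plays in word2vec. This is a minimal sketch under stated assumptions, not the project's actual pipeline. The record layout (species, latitude, longitude, month), the 0.1-degree grid cell, the monthly time window, and the function names are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def build_contexts(records, cell_size=0.1):
    """Group species by (grid cell, month); each group acts like a 'sentence'.

    cell_size is an assumed grid resolution in degrees.
    """
    contexts = defaultdict(set)
    for species, lat, lon, month in records:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        contexts[(cell, month)].add(species)
    return contexts


def cooccurrence_pairs(contexts):
    """Count how many contexts each unordered species pair shares."""
    counts = defaultdict(int)
    for species_set in contexts.values():
        for a, b in combinations(sorted(species_set), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical records: (species, latitude, longitude, month).
records = [
    ("species_a", 23.51, 120.82, 5),
    ("species_b", 23.53, 120.84, 5),
    ("species_c", 24.97, 121.55, 5),
    ("species_a", 23.52, 120.81, 6),
    ("species_b", 23.54, 120.86, 6),
    ("species_c", 23.55, 120.83, 9),
]

pairs = cooccurrence_pairs(build_contexts(records))
print(pairs)
```

Pairs that share many contexts would be pulled together during training, while species that never share a context would tend to end up far apart. The resulting contexts could then be passed to a skip-gram style model in place of sentences.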


[1] Taiwan Biodiversity Network (year) TBN Home Page https://www.tbn.org.tw/ . Accessed 29 September 2021. Taiwan Endemic Species Research Institute.

Tsao, Hao Chun