As a key technology for rapid, low-cost drug development, drug repositioning is attracting growing attention. In this study, we tested a text mining approach to the discovery of unknown drug-disease relations. Using a word embedding algorithm, the senses of over 1.7 million words were well represented in sufficiently short feature vectors. The feasibility of our approach was tested through various analyses, including clustering and classification. Finally, our trained classification model achieved 87.6% accuracy in predicting drug-disease relations in cancer treatment and succeeded in discovering novel drug-disease relations that were actually reported in recent studies.
To develop an effective and highly demanded drug, hundreds of millions of dollars and ten or more years of R & D and clinical trials are typically required. Structure-based drug design (SBDD) has been actively studied to reduce this cost and time through in-silico screening of candidate chemicals [
Besides biomedical experiments, computational methods have been developed for drug repositioning. Most of them adopt network-based algorithms and combine various databases, including gene expression and pathway data [
In this section, the data and algorithms are described. An overview of the processing pipeline is shown in
As a raw corpus, we used a subset of PubMed abstracts downloaded in October 2013, filtered by the keyword “cancer”. From 3,099,076 abstracts, 14,847,050 sentences were extracted.
Enju [
So that word2vec can treat the same word with different POS categories differently, POS tags were attached right after the base form of each word (e.g. “care” -> “care(V)”). For readability, nouns were kept as is. To simplify the input for word2vec, we removed all words except nouns, adjectives, adverbs, and verbs.
Biological terms typically consist of two or more words. In addition, they have many synonyms. Since word2vec basically treats a sentence as a sequence of words, it is necessary to recognize biological synonyms, aggregate them into primary terms, and convert them into single words (e.g. “yolk sac tumor” -> “endodermal sinus tumor” -> “endodermal_sinus_tumor”). In this study, primary names and synonyms of drugs, diseases, and genes were extracted from PharmGKB [
Related to the conversion above, we need to consider the fate of the original single words. First, if a synonym is aggregated into a primary word, the original word disappears and is not used for word embedding. Second, if a multi-word term is converted into a single word, all the original single words in the multi-word term disappear. Third, if two multi-word terms overlap in a sentence, it is impossible to replace both of them at once. To avoid these problems, each sentence is converted into a set of sentences containing at most one converted word per sentence. For example, if a sentence contains two terms to be converted, three sentences, including the original one, are generated. After all conversions, the 14,847,050 sentences were expanded to 45,264,480.
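The sentence expansion described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; the synonym dictionary entries are hypothetical examples of the kind extracted from PharmGKB.

```python
# Sketch of the sentence-expansion step: each synonym found in a sentence
# yields one extra sentence in which only that synonym is replaced by its
# primary term (multi-word terms joined with "_"); the original sentence
# is always kept so that no single word disappears from the corpus.

SYNONYMS = {  # hypothetical entries; real ones come from PharmGKB etc.
    "yolk sac tumor": "endodermal_sinus_tumor",
    "endodermal sinus tumor": "endodermal_sinus_tumor",
}

def expand_sentence(sentence):
    expanded = [sentence]  # keep the original sentence
    for synonym, primary in SYNONYMS.items():
        if synonym in sentence:
            expanded.append(sentence.replace(synonym, primary))
    return expanded

variants = expand_sentence("a yolk sac tumor was resected")
# one original sentence plus one converted sentence for the matching term
```

Because each generated sentence contains at most one converted term, overlapping multi-word terms no longer conflict, at the cost of expanding the corpus roughly threefold.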
In the fields of natural language processing and text mining, the computational representation of a linguistic unit (e.g. documents, paragraphs, sentences, terms, and words) is essential. The simplest one for documents is the bag-of-words model, which represents each document as a vector of the word frequencies in it. In the case of word representation, only the neighboring words in the same sentence are counted. For better analysis, stop words are removed and raw frequencies are adjusted by term weighting such as tf-idf. After that, these vectors are used to evaluate the characteristics of the units and the similarities between them (vector space model).
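A minimal sketch of the bag-of-words representation described above, using two toy documents (the vocabulary and documents are illustrative only):

```python
# Bag-of-words: each document becomes a vector of word frequencies,
# with one dimension per vocabulary word.
from collections import Counter

docs = [
    "gene expression in tumor cells",
    "drug response of tumor cells",
]
vocab = sorted({w for d in docs for w in d.split()})  # one dimension per word

def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]  # word-frequency vector

vec0 = bow_vector(docs[0])
vec1 = bow_vector(docs[1])
```

With a real corpus the vocabulary, and hence the vector dimension, grows with the number of distinct words, which leads directly to the sparsity problem discussed next.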
One of the serious problems with such a representation is the high dimensionality and sparseness of the vectors. For instance, 10 million sentences may contain one million different words, so the dimension of each vector is also one million. In addition, since word frequency follows Zipf’s law, most of the one million words occur only a few times, which makes the vectors quite sparse. Although traditional algorithms for dimension reduction or compression exist, such as Principal Component Analysis (PCA) and Latent Semantic Analysis (LSA), this problem is not fully solved.
Word embedding, a distributed representation of word sense, is a new approach to this problem. Based on a neural network algorithm, reasonably short numerical vectors (e.g. 100 dimensions) are calculated for all words in a set of sentences. Application studies have shown that the vector space constructed by word embedding represents word senses, and the distances (similarities) between them, quite well. Additionally, in this space of word senses, word analogy works well in some domains. For example, given the three words “man”, “woman”, and “king”, word analogy can predict “queen” by calculating vector(“king”) − vector(“man”) + vector(“woman”) and searching for the nearest word vector, vector(“queen”). Though word analogy might allow a wide variety of applications, the most desired one is the discovery of unknown relations.
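The analogy computation above can be sketched with toy 2-D vectors (the values are illustrative, not real embeddings):

```python
# Word analogy: find the word nearest (by cosine similarity) to
# vector("king") - vector("man") + vector("woman").
import numpy as np

vectors = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

def analogy(a, b, c):
    """Solve a : b :: c : ? by nearest cosine neighbor over the vocabulary."""
    target = vectors[c] - vectors[a] + vectors[b]
    candidates = [w for w in vectors if w not in (a, b, c)]
    cos = lambda u, v: float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(candidates, key=lambda w: cos(vectors[w], target))
```

Note that the three query words themselves are excluded from the candidates, as is standard practice in analogy evaluation.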
In this study, we used the word2vec software, a de facto standard implementation of the word embedding algorithm, with the following parameters:
vector size = 200
window size = 8
minimum count of words to be embedded = 1 (i.e. all words)
model = continuous bag of words
As a result, 1,772,186 words were embedded into word vectors (2303 for drugs, 3069 for diseases, 8703 for genes, and 1,758,111 for others).
For the evaluation of clustering results, ATC codes [
For the evaluation of difference vectors between drugs and diseases, relations between drugs and diseases occurring in the corpus were extracted from CTD [
In order to conduct a detailed analysis of cancer-related drugs and diseases, the 12,462 extracted drug-disease relations were further filtered so that the drug and disease names in each relation have an ATC code beginning with “L” (Antineoplastic and immunomodulating agents) and a MeSH tree number beginning with “C04” (Neoplasms), respectively. As a result, 1097 relations consisting of 104 anti-cancer drugs and 107 cancer-related diseases were extracted for detailed analysis.
For visual evaluation of word vector quality, we performed hierarchical clustering with cosine distance and Ward’s method [
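The clustering step can be sketched with SciPy. One caveat: SciPy's Ward linkage assumes Euclidean distance, so a common workaround for combining Ward's method with cosine distance, used here as an assumption about the procedure, is to L2-normalize the vectors first (Euclidean distance on unit vectors is monotonically related to cosine distance). The vectors below are random stand-ins for real word embeddings.

```python
# Hierarchical clustering of word vectors: L2-normalize, then Ward linkage.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 200))  # 10 toy 200-dim word vectors
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

Z = linkage(unit, method="ward")                 # hierarchical clustering
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into <= 3 clusters
```

The resulting dendrogram (or flat cluster labels) can then be compared against ATC codes to judge whether semantically related drugs cluster together.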
Support Vector Machine (SVM) was adopted for learning and predicting possible relations between drugs and diseases. As an implementation, ksvm function included in kernlab package for R software was used with default parameters.
shown that the distributions of word vectors in the three categories are clearly separated. The right panel also shows that the frequent words are clearly separated, whereas it is relatively difficult to discriminate the categories of rare words.
Though word analogy is quite attractive, it does not always work well. To evaluate the applicability of word analogy to the discovery of new relations between drugs and diseases, we checked whether most of the displacement vectors between confirmed drug-disease pairs (i.e. correct relations) are similar in length and parallel to each other. Unfortunately, as shown in
Instead of a simple application of word analogy, we constructed a classification model using SVM. For all combinations of the 104 anti-cancer drugs and 107 cancer-related diseases (i.e. 11,128 drug-disease pairs), drug vectors and disease vectors were concatenated, and binary class labels (i.e. positive or negative) were assigned according to the 1097 correct drug-disease relations extracted from CTD. Due to the imbalance between the two classes, 1097 of the 10,031 negative examples were randomly selected so that the numbers of positive and negative examples were balanced.
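The classification setup can be sketched as follows: concatenate each drug vector with each disease vector, down-sample the negatives to balance the classes, and train an SVM. scikit-learn's SVC stands in here for the kernlab ksvm used in the study, and the vectors and labels are random placeholders, not real embeddings or CTD relations.

```python
# Pair classification: concatenated drug+disease vectors, balanced classes.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
dim = 200
drug_vecs = rng.normal(size=(20, dim))     # toy drug vectors
disease_vecs = rng.normal(size=(20, dim))  # toy disease vectors

# all drug-disease combinations, each as one concatenated 400-dim vector
pairs = np.array([np.concatenate([d, s])
                  for d in drug_vecs for s in disease_vecs])
labels = rng.integers(0, 2, size=len(pairs))  # placeholder relation labels

# balance positives and negatives by random down-sampling
pos = np.flatnonzero(labels == 1)
neg = np.flatnonzero(labels == 0)
n = min(len(pos), len(neg))
idx = np.concatenate([rng.choice(pos, n, replace=False),
                      rng.choice(neg, n, replace=False)])

clf = SVC().fit(pairs[idx], labels[idx])
pred = clf.predict(pairs[:5])  # predict relations for unseen pairs
```

Varying the down-sampling ratio (e.g. 1:1 versus 1:8 negatives) changes how aggressively the model predicts positives, which is the imbalance effect exploited in the repositioning experiment below.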
The results of the performance evaluation are shown in
Finally, we tested all combinations of the 2199 drugs not used in training and the 107 cancer-related diseases (in total, 235,293 drug-disease pairs). With the classification model trained on 11,128 examples, only 64 test examples were predicted as positive, and all the drugs in these examples were anti-cancer drugs (though not among the 104 anti-cancer drugs used for training). By controlling the degree of class imbalance in the training data, it is possible to predict a pair of a non-anti-cancer drug and a cancer-related disease as positive. For example, using the classification model trained on 1097 positive and 8776 negative examples (an imbalance of 1:8), 10 rounds of training and testing on the 235,293 drug-disease pairs discovered the following candidate drugs for repositioning to cancer treatment, where the numbers indicate how many times each drug was discovered in the 10 rounds.
Accuracy by vector size and window size:

| Vector size | Window size = 2 | Window size = 3 | Window size = 4 | Window size = 8 |
| --- | --- | --- | --- | --- |
| 50 | 0.872 | 0.873 | 0.875 | 0.872 |
| 75 | 0.873 | 0.874 | 0.874 | 0.874 |
| 100 | 0.874 | 0.874 | 0.874 | 0.876 |
| 200 | 0.874 | 0.874 | 0.874 | 0.874 |
drug::urokinase (10), drug::photodynamic_therapy (10), drug::oxygen (10), drug::nonoxynol-9 (10),
drug::nitroglycerin (10), drug::nitrogen (10), drug::l-phenylalanine (10), drug::l-methionine (10),
drug::l-glutamine (10), drug::l-cysteine (10), drug::glutathione (10), drug::glucose (10),
drug::epoxide (10), drug::enzyme (10), drug::collagenase (10), drug::bisphosphonate (10),
drug::amino_acid (10), drug::amide (10), drug::clarithromycin (9), drug::vitamin (8), drug::l-proline (7),
drug::vitamin_e (6), drug::xanthophylls (4), drug::phospholipid (4), drug::palifermin (4), drug::ether (4),
drug::ethacrynic_acid (4), drug::denosumab (4), drug::egfr_inhibitor (2), drug::pyruvic_acid (1).
Besides overly general names like “drug::enzyme” and “drug::amide”, it is notable that the above list includes approved anti-cancer drugs (e.g. “drug::denosumab”), anti-cancer drugs under investigation (e.g. “drug::clarithromycin”, “drug::bisphosphonate”, and “drug::xanthophylls”), and drugs that potentially promote cancer (e.g. “drug::urokinase” and “drug::collagenase”). In particular, it should be emphasized that the repositioning of clarithromycin as an anti-cancer agent was reported in 2015 [
One of the reasons why word embedding by word2vec has become popular is its word analogy functionality [
Although word analogy was not applicable, word2vec provided a significant advantage in text mining from a large number of biomedical texts in this study. It efficiently encoded more than 1.7 million words into quite short vectors (e.g. 200 dimensions). If we used the traditional word frequency and vector space model, the vector for each word would have 1.7 million features with extremely high sparsity. Thanks to this efficient encoding, we could process the whole corpus in reasonable memory space and computation time. Furthermore, the word vectors generated by word2vec seem to reflect the semantic space of biomedical words well.
In this study, it was shown that word embedding is effective for representing the senses of all words in a large number of cancer-related PubMed abstracts. Furthermore, the concatenation of the word vectors of drugs and diseases represents their relations well and can be used to find candidate drugs for repositioning by classification. For better classification performance, various feature selection and over-sampling algorithms [
In this research, the super-computing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo. Additional computation time was provided by the super computer system in Research Organization of Information and Systems (ROIS), National Institute of Genetics (NIG). This work was supported by JSPS KAKENHI Grant Number 26330328.
Duc Luu Ngo, Naoki Yamamoto, Vu Anh Tran, Ngoc Giang Nguyen, Dau Phan, Favorisen Rosyking Lumbanraja, Mamoru Kubo, Kenji Satou (2016) Application of Word Embedding to Drug Repositioning. Journal of Biomedical Science and Engineering, 9, 7-16. doi: 10.4236/jbise.2016.91002