Isoform Inference From RNA-Seq Samples Based on Gene Structures on Chromosomes

doi:10.4236/jbm.2013.11001

Paper Menu >>

Journal Menu >>

Journal of Biosciences and Medicines, 2013, 1, 1-5 JBM

doi:10.4236/jbm.2013.11001 Published Online February 2013 (http://www.SciRP.org/journal/jbm/)

Published Online Febru ar y 2013 in SciRes. http://www.scirp.org/journal/jbm

Isoform Inference From RN A-Seq Samples Based on Gene

Structures on Chromosomes

Yan Ji, Jia Wei

R&D Information, AstraZeneca, Shanghai, China

Email: yan.ji@ as trazeneca.com, jenny.wei@astrazeneca.com

Received 2013.

ABSTRACT

The emerging RNA-Seq technology makes it possible

to infer splicing variants from millions of short

sequence reads. Here we present a method to i dentify

isoforms by their specific signatures on ch romo some s

including both exons and junctions. By applying this

method to a RNA-Seq dataset of gastric cancer, we

showed that our method is more accurate and

sensitive than other isoform inference tools such as

RSEM and Cufflinks. By constructing a network

from gene list identified by our method but misse d by

other tools, we found that some cancer-related genes

enriched in network modules have significant

implications for cancer drug discovery.

Keywords: RNA-Seq; isoform inference; cancer genes;

Cuffli nks; RSEM

1. INTRODUCTION

Next–generation sequencing (NGS) platforms have

been widely available recently [1]. A massively parallel

sequencing technology termed RNA-Seq has made it

possible to sequence cDNA derived from cellular RNA

[2]. Recently, many studies have applied RNA-Seq to

various biological and medical studies. Quantification

of alternative slicing in tissues [3] and new transcript

identification [4] have all benefited from this new tech-

nology. To fully enable RNA-Seq technology to solve

biological problems, powerful computational tools are

required. Since identifying disease driven differentially

expressed isoforms is extremely important for drug re-

search and development, we focus on inferring iso-

for ms fr o m RN A-Seq datasets.

In this work, we present an accurate and sensitive

method to identify isoforms b y their specific signatures.

Expression values of an exon or junction can be calculated

by counting the number of reads of a RNA-Seq dataset

falling withi n a n e xon o r sp annin g a j uncti o n if b ot h the

genome sequence and gene structure annotation are

available. For each isoform, there must be single exon

(junction) or a combination of exons (junctions) which

uniquely belong to the isoform. We term such exons

(junctions) as specific exons (junctions) of an isoform.

Comparing with other isoform inference tools such as

RSEM and Cufflinks, our method has two significant

advantages [5, 6]. First, for all the isoforms in the gene

structure annotations, our method can determine whether

or not they are expressed in a RNA-Seq dataset. In the

result section, we show that our method has the

capabilit y to identify isofor ms which are missed by both

RSEM and Cufflinks. Second, because some exons or

junctions of an isoform may be overlapped by other

isoforms, the estimation of the expression value of an

isoform is always affected by expression levels of other

isoforms. Therefore, instead of using expression values

of an isofo rm, using expression values of specific exons

or junctions of an isoform calculated in our method are

better choice for finding differently expressed isoforms

by comparing samples from patient and healthy donors.

2. METHODOLOGIES AND RESULTS

2.1. The Desc rip tion o f Our method for

Inf erring Iso for ms from RN A-Seq Datasets

First, RPKM expression values of both exons and junc-

tions are calculated. In this step, two files are needed

including a BAM file of the mapping result against

reference genome and an annotation file in GFF format

containing the structure information of genes on

ch r omosomes describing the relationship between exons

and is ofor ms. The RPK M exp re ssio n val ue o f an e xon i s

calculated by the read counts mapped to a chromosome

region of the exon normalized by the total read counts in

a sample and the length of the exon. The RPKM

expres sio n value of a junct ion is calculated in th e same way

except normalization by an extension length spanning

the j unct ion p osit io n. Second, a set o f e xons or j unct io ns

specific for an isoform are used to determine whether an

isoform exists in a sample or not. The rationale to

determine the expre ssion status of an iso form in a sample

Y. Ji et al. / Journal of Biosciences and Medicines 1 (2013) 1-5

is described as following. If more than sixty percent of

exons in an isoform are not expressed or with ver y s mall

expr ession va lues, the e xpr ess io n stat us o f the i sofo r m is

assigned a term “no expression”. Then, if either the set

of specific exons or the set of specific junctions of an

isoform are all expressed in a sample, the expression

status of the isoform is assigned a term “existence”.

Then, if one or more of specific exons or junctions are

not expressed and they are located in the 5’ end of the

isoform, the expression status of the isoform is assigned

a term “uncertainty”. It is because that in the process of

RNA- Seq sample preparation, a small fraction of

sequences at the 5’ end of an isoform may be degraded

or lost due to various reasons. Otherwise the expression

status of the isoform is assigned a term “absence”.

2.2. Datasets

We apply this method to a Gas tric cancer sample o btained

from gastric cancer patient. mRNA was fragmented and

plus- and minus- strand cDNA were synthesized for

illumina pair-end sequencing. A 300-bp fragment size

was selected by gel excision and the sample was

sequenced twice to avoid technical variance. There are

totally 30,121,416 and 17,510,256 read pairs for each

replica. We use TopHat with reference genome hg19 and

UCSC gene structure annotation to align the two RNA-

Seq datasets [7].

3. Results

In the UCSC gene structure annotation, there are 22920

genes. Our method is designed to determine expression

status of isoforms of all the genes having at least two

isoforms. That is, 18967 isoforms of 7209 genes can be

processed by our method. Because we only processed

genes mapped by RNA-Seq reads, our r esult s o n the t wo

RNA-Seq datasets, as shown in Figure 1, determined

expre ssio n stat us of 16151 isoforms of 5891 genes.

Fi g ure 1 . Comparisons of expression status of isoform between our method and RSEM. Level 1 is the evidence used to in-

fer isofo rm by our method: exon, junction and no expression. Level 2 is the type of isoform expression status inferred by

our method: existence, uncertainty and absence. Level 3 is the expression level of an isoform inferred from their expression

values calcu lated b y RSEM: i sofor ms marked with ‘yes’ ar e taken as express ed iso forms if their exp ression values are b ig-

ger than 0.3; isoforms marked with ‘no’ are taken to be absent in a RNA-seq dataset if their expression valu es are no bigger

than 0.3 .

Y. Ji et al. / Journal of Biosciences and Medicines 1 (2013) 1-5

3.1. Comparisons with Cufflinks and RSEM

The method based on specific exons or junctions has its

advantages over current popular isoform quantification

tools, such as RSEM and Cufflinks. In Figure 1, we

showed comparisons of expression status of isoforms

inferred by our method and the two other tools. First, t he

two isoform expression quantification tools quantified

isoforms of some genes marked as ‘no expression’ by

our method with big expression values. ANO7 and

TB XAS 1 are two examples. As sho wn i n Fi gu re 2, mo st

of exons of the two genes are not expressed. We guess

that the two tools quantify expressions of isoforms by

total read counts even if only a small fraction of exons

such as UTR exons were expressed. So it is necessary

for our method to check expression profiles of exons

before isoform quantification. Second, some isoforms

are marked as “existence” by our method, but they were

quantified by the two tools with very small expression

value s. Fo r exa mpl e, as s ho w n in Fi gu re 3, while RSEM

and cufflinks quantified the isoform NM_001135685

with 3.93E-05 and 0, there were 10 or 15 reads spanning

its specific junctions with RPKM expression values 1.6

or 2.4. For NM_000548, its unique specific exon has

expression value 15.03, while RSEM and cufflinks

quantified it as 5.05E-23 and 0. Third, some isoforms are

marked as “absence” by our method, but they were

quantified by the two tools with big expression values.

For example, as shown in Figure 4, while RSEM and

Cufflinks quantified the isoform NM_001135730 with

big expression values 5.66 and 4.66, there are no reads

spanning its specific junction.

3.2. Biological Significance of Identified

Isof orm s

Our method marked 1860 isoforms with ‘existence’

while they we re determined to be absent according to

their expressio n value s quantifi ed by RSEM . Expre ssion

values of these isoforms are smaller than 0.3 (e xpre ssio n

values of 62% of these isoforms were zero). Actually, b y

comparing expression values of isoforms and those of

their specific exons or junctions, we found that expression

values of most of these isoforms were under-estimated.

There are 468 cancer-related isoforms out of 1860 isoforms.

By using Analyze network algorithm with default settings

in GeneGo (http://www.genego.com/), biological net-

works were constructed from cancer-related iso forms. In

Table 1, some networks with annotated GO Processes

were shown. In the first network, NM_003376 is the

isoform b of VEGFA (vascula r endothelial growth factor

A) which plays a major role in vascular cell migration

and proliferation. It mediates the vascular permeability,

mitogenesis and angiogenesis. Inhibition of VEGF-A

Figure 2. Visualization of the coverage of genes mapped by reads of two RNA-Seq datasets. Isoforms of the two genes

(ANO7 and TBXAS1) are marked with ‘no expression’ by our method while they are quantified with big expression values

by RSEM and Cufflinks.

Y. Ji et al. / Journal of Biosciences and Medicines 1 (2013) 1-5

Figure 3. Visualization of the coverage of genes mapped by reads of two RNA-Seq datasets. As shown in ‘a’,

NM_001135685 has two specific junctions highlighted by arrows. As shown in ‘b’, NM_000548 has a specific exon. The

two isoforms are marked with ‘existence’ by our method while they are quantified with very small or zero expression val-

ues by RSEM and Cufflinks.

Figure 4. Visualization of the coverage of genes mapped by reads of two RNA-Seq datasets. NM_001135730 has a specific junction

highlighted by an arrow. It is marked with ‘absence’ by our method while it is quantified with big expression values by both RSEM and

Cufflinks.

promoter may be a useful approach for treating diseases

in which aberrant angiogensis is present such as

inflammatory diseases, cancer and diabetic retinopathy

among others [8]. In the second network, NM_021872 is

the smallest isoform of CDC25B (M-phase inducer

phosphatase 2) which plays a role in endogenous tyrosine

Y. Ji et al. / Journal of Biosciences and Medicines 1 (2013) 1-5

Table 1. Networks constructed from cancer related iso fo r ms

Index Key Network Objects GO Process es

ACACA, eIF4 G1, VEG F -A, MNK1, ABC C1

integrin-mediated signaling pathway, cell-mat r ix adhesi on,

prote in tr ans po r t w ith in lip id bilaye r , calcium -independent cell-ma trix adhesion,

cell mi gr ation involved in spro uting angiogenesis

CDC25B, VAV-2, al ph a-1/beta-1 integrin, ZO-2

posi tive regulati on of cell prolifera tion

Dyn a m in -2, TFIID, MALT1, PDK (PDPK1), Al-

pha-synuclein

protein phosphorylation, cell-cell signaling, synaptic transmission,

phosp horylation, signal transduction

MAP3K3, FANCA, KLF1 1 (TIEG2) , AP-1, WAS P

adenylate cyclase-modulating G-protein coupled receptor signaling pathway

phosphatase activity and G2/M phase transition of cell

cycle. Blocking CDC25B expression leads to inhibition

of cell proliferation, migration and invasion. CDC25B

may be a potential marker and therapeutic target for

hepatocellular carcinoma and pancreatic cancer [9]. In

the third network, NM_002613 is the longer isoform of

PDPK1 (3-phosphoinositide-dependent protein kinase 1)

which is involved in the regulation of cell proliferation,

cell survival and signal transduction. PDK1-expressing

MCF-7 cells showed up-regulation of genes of the Wnt

signaling pathway and down-regula tion of p utati ve t umor

suppressor genes. In the fourth network, NM_203351 is

the longest isoform of MAP3K3 (mitogen-activated

protein kinase kinase kinase 3) which is involved in

regulating interleukin 1 receptor (IL-1R), toll-like receptor

4 (TLR4), SAPK and ERK signaling pathways. MEKK3

may be a therapeutic target in controlling the apoptosis

resistance of some cancers [10].

Therefore, accurately inferring isoforms from RNA-

Seq datasets of tumor samples and identifying tumor

driven isoforms have important implications in the field

of cancer drug res earch and development.

4. CONCLUSIONS

We have developed a method for inferring isoforms from

RNA-Seq samples. Its advantages over current isoform

inference tools have been illustrated in previous sections.

Its limitations are that complete sequences and gene

structure annotation of transcriptome of targeted species

have to be available. With advances of sequencing

technologies and genome biology, our method can be

applied to more species. Furthermor e , our method can

detect isoform switching if multiple disease and no rmal

samp l e s are available. If expression status of isoforms

are different between disease and normal samples,

isoform switching can be identi fied . If expression status

of isofo r ms ar e both marked with ‘e xis tenc e’, expr essio n

of specific exons or junctions between samples can be

used to determine whether isoforms are differentially

expressed.

5. ACKNOWLEDGEMENTS

This work was supported in part by a project from

AstraZeneca.

REFERENCES

[1] Met zker, M.L. (2010) Sequencing technologies - the ne xt

generation. Nat R ev Genet, 11(1), 31-46.

[2] Mortazavi, A. Williams, B. A. McCue, K. Schaeffer, L.

Wold, B. (2008) Mapping and quantifying mammalian

transcri pt omes b y RNA-Seq . Nat M ethods, 2008. 5(7): p.

621-8.

[3] Wang, E.T., et al. (2008) Alternative isoform regulation

in human tissue transcriptomes. Nature, 456(7221),

470-6.

[4] Guttman, M., et al. (2010) Ab initio reconstruction of cell

type-specific transcriptomes in mouse reveals the con-

served multi-exonic structure of lincRNAs. Nat Biotech-

nol, 28(5): p. 503-10.

[5] Li, B. and C.N. Dewey (2011) RSEM: accu rate t rans crip t

quantification from RNA-Seq data with or without a ref-

erence genome. BMC Bioinformatics, 12: p. 323.

[6] Trapnel l, C., et al. (2010) Transcript assembly and quan-

tification by RNA-Seq reveals unannotated transcripts

and isoform switching during cell differentiation. Nat

Biotechnol, 28(5): p. 511-5.

[7] Trapnell, C., L. Pachter, and S.L. Salzberg (2009) TopHat:

discovering splice junctions with RNA-Seq. Bioinfor-

matics, 25(9): p. 1105-11.

[8] Snowden, A.W., et al. (2003) Repression of vascular

endothelial growth factor A in glioblastoma cells using

engineered zinc finger transcription factors. Cancer Res,

63(24): p. 8968-76.

[9] Ngan, E.S., et al. (2003) Overexpression of Cdc25B, an

androgen receptor coactivator, in prostate cancer. On co-

gene, 22(5): p. 734-9.

[10] Samanta, A.K., et al.(2004) Overexpression of MEKK3

confers resistance to apoptosis through activation of

NFkappaB. J Biol Chem, 27 9( 9): p. 757 6-83.