Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated improvement of RNA alignments
Ebbe S. Andersen, Allan Lind-Thomsen, Bjarne Knudsen, Susie E. Kristensen, Jakob H. Havgaard, Elfar
Torarinsson, Niels Larsen, Christian Zwieb, Peter Sestoft, Jørgen Kjems and Jan Gorodkin
RNA 2007 13: 1850-1859; originally published online Sep 5, 2007;
Access the most recent version at doi:10.1261/rna.215407

References

Email alerting
service

This article cites 37 articles, 29 of which can be accessed free at:
http://www.rnajournal.org/cgi/content/full/13/11/1850#References
Receive free email alerts when new articles cite this article - sign up in the box at the
top right corner of the article or click here

Notes

To subscribe to RNA go to:
http://www.rnajournal.org/subscriptions/

© 2007 RNA Society

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

BIOINFORMATICS

Semiautomated improvement of RNA alignments
EBBE S. ANDERSEN,1 ALLAN LIND-THOMSEN,2,7 BJARNE KNUDSEN,3 SUSIE E. KRISTENSEN,2
JAKOB H. HAVGAARD,2 ELFAR TORARINSSON,2,4 NIELS LARSEN,5 CHRISTIAN ZWIEB,6
PETER SESTOFT,4 JØRGEN KJEMS,1 and JAN GORODKIN2
1

Department of Molecular Biology, University of Aarhus, DK-8000 Århus C, Denmark
Division of Genetics and Bioinformatics, IBHV, and Center for Bioinformatics, University of Copenhagen, DK-1870 Frederiksberg C, Denmark
3
CLC bio A/S, DK-8000 Århus C, Denmark
4
Center for Bioinformatics and Department of Natural Sciences, University of Copenhagen, DK-1871 Frederiksberg C, Denmark
5
Danish Genome Institute, DK-8000 Århus C, Denmark
6
Department of Molecular Biology, The University of Texas Health Science Center at Tyler, Tyler, Texas 75708-3154, USA
2

ABSTRACT
We have developed a semiautomated RNA sequence editor (SARSE) that integrates tools for analyzing RNA alignments. The
editor highlights different properties of the alignment by color, and its integrated analysis tools prevent the introduction of
errors when doing alignment editing. SARSE readily connects to external tools to provide a flexible semiautomatic editing
environment. A new method, Pcluster, is introduced for dividing the sequences of an RNA alignment into subgroups with
secondary structure differences. Pcluster was used to evaluate 574 seed alignments obtained from the Rfam database and we
identified 71 alignments with significant prediction of inconsistent base pairs and 102 alignments with significant prediction of
novel base pairs. Four RNA families were used to illustrate how SARSE can be used to manually or automatically correct the
inconsistent base pairs detected by Pcluster: the mir-399 RNA, vertebrate telomase RNA (vert-TR), bacterial transfer-messenger
RNA (tmRNA), and the signal recognition particle (SRP) RNA. The general use of the method is illustrated by the ability to
accommodate pseudoknots and handle even large and divergent RNA families. The open architecture of the SARSE editor makes
it a flexible tool to improve all RNA alignments with relatively little human intervention. Online documentation and software
are available at http://sarse.ku.dk.
Keywords: RNA structural alignment; RNA secondary structure; SARSE

INTRODUCTION
The vast amount of available sequence data makes it
necessary to use computational methods to generate RNA
alignments and extract structural information. Both energy
minimization and comparative sequence analysis have
been used to deduce RNA secondary structure (Zuker
and Stiegler 1981; Woese et al. 1983). Subsequently,
methods using stochastic context-free grammars (SCFGs)
(Eddy and Durbin 1994; Sakakibara et al. 1994) incorporated constraints of the biological system to improve the
7
Present address: Wilhelm Johannsen Centre for Functional Genome
Research, Department of Cellular and Molecular Medicine, The Panum
Institute, University of Copenhagen, Blegdamsvej 3, DK-2200, Copenhagen N, Denmark.
Reprint requests to: Jan Gorodkin, Division of Genetics and Bioinformatics, IBHV, and Center for Bioinformatics, University of Copenhagen,
Grønnegårdsvej 3, DK-1870 Frederiksberg C, Denmark; e-mail: gorodkin@
genome.ku.dk; fax: 45 3528 3042.
Article published online ahead of print. Article and publication date are
at http://www.rnajournal.org/cgi/doi/10.1261/rna.215407.

1850

structure predictions. SCFG methods were extended to fold
multiple RNA sequence alignments where phylogenetic
information is incorporated explicitly (Knudsen and Hein
1999, 2003). An alternative method based on energy
parameters and covariance scores was also introduced
(Hofacker et al. 2002). An apparent paradox of these
methods, which predict RNA structures from phylogenetic
data, is that they require the alignment of sequences to be
known in advance (Gorodkin et al. 1997).
An RNA alignment is easily constructed when the
sequences are conserved but in many cases secondary
structure features must be considered to align the variable
regions. Methods for pairwise alignment of sequences using
secondary structure features have been developed based on
Sankoff’s algorithm (Sankoff 1985): FOLDALIGN (Havgaard
et al. 2005), PMcomp (Hofacker et al. 2004), and Dynalign
(Mathews and Turner 2002; Mathews 2005). Also, SCFG
procedures have been devised for pairwise alignment of
RNA structure, such as Stemloc (Holmes and Rubin 2002;
Holmes 2005). However, these methods are relatively

RNA (2007), 13:1850–1859. Published by Cold Spring Harbor Laboratory Press. Copyright Ó 2007 RNA Society.

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated RNA sequence editing

calculation-expensive and thus applicable to only relatively
small portions of the available data. Recently, improved
methods for constructing a multiple alignment of RNA
sequences have been introduced: RNAcast (Reeder
and Giegerich 2005), CMfinder (Yao et al. 2006), and
FoldalignM (Torarinsson et al. 2007). For a comparison of
the various approaches, see Torarinsson et al. (2007).
Manual editing of sequence alignments by an expert
gives the most reliable prediction of RNA secondary
structures (for review, see Pace and Thomas 1999). By
inspecting a multiple alignment, the expert can make
corrections based on experimental data with close consideration of structure and function. To help in this task
several sequence alignment editors are available, including
DCSE (De Rijk and De Wachter 1993), Mview (Brown
et al. 1998), SEQPUP (Gilbert 1999), BioEdit (Hall 2005),
GDE (De Oliveira et al. 2003), Jalview (Clamp et al. 2004),
Construct (Luck et al. 1999), ARB (Ludwig et al. 2004),
RALEE (Griffiths-Jones 2005), and
4SALE (Seibel et al. 2006). Unfortunately, the majority of the programs
are no longer supported or lack a simple
way to incorporate existing or new
tools, e.g., the RNAdbtools package
(Gorodkin et al. 2001) and the Vienna
package (Hofacker 2003).
Here, we present an approach to
iteratively update structural RNA
alignments based on a new editor, the
semiautomated RNA sequence editor
(SARSE), and a new algorithm, Pcluster, as an extension to the SARSE
toolbox. Pcluster subgroups an alignment based on differences in secondary
structure prediction by Pfold and was
found to reveal misalignments, pseudoknots, and helix insertions/deletions.
The Pcluster algorithm was used to
investigate 574 alignments in the Rfam
database (Griffiths-Jones et al. 2005).
Using four different RNA families, we
demonstrate how SARSE and Pcluster
complement each other to reveal and
correct alignment mistakes. SARSE,
Pcluster, editing procedures, and evaluation measures are described in the
Materials and Methods section.
RESULTS
Semiautomated RNA
sequence editor
A semiautomated RNA sequence editor
was developed in Java to allow the

editing of RNA alignments. SARSE uses the conventions
described for the tmRDB and SRPDB resources (Andersen
et al. 2006). Lower case and upper case letters indicate
single-stranded and base-paired regions, respectively, and
a ‘‘pairing mask’’ designates the secondary structure of
the alignment. Each sequence has an independent annotation of base pairs to allow structural differences within
an alignment. The editor can be used to investigate
the secondary structure assignment by selection of a base
pair with a single click, and distant pairings can be
observed simultaneously in split view (Fig. 1A). Colors
are used to highlight secondary structure or other calculated features. As an example of such coloring, a Pfold
prediction (Knudsen and Hein 2003) results in an alignment where different colors indicate base pair or single
strand, and different shades are used to represent the
prediction reliabilities (Fig. 1A, inset below). The overview
window provides a zoomable representation of the colored

FIGURE 1. Semiautomated RNA sequence editor. (A) The editor window displays a Pfold
secondary structure prediction of an alignment of let-7 miRNA sequences. The blue selection
was made by the mouse in split view and shows how selection in the left panel automatically
selects the corresponding bases in the right panel. Color code is given below the window: Colors
indicate base pair or single-stranded. Shadings indicate the reliability of prediction. (B) The
overview window can be interactively clicked to move the editor window to a specific
alignment position. (C) The history window logs the editing manipulations and can be clicked
to jump back in the iterative alignment procedure.

www.rnajournal.org

1851

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Andersen et al.

alignment (Fig. 1B) and by mouse clicking provides
interactive navigation of the editor window. The history
window allows backtracking within the editing history
(Fig. 1C).
SARSE is easily extended with new programs. A program
from the toolbox is activated from within SARSE, and
the results of the request are loaded back into the editor.
This toolbox relies on the communication with the Unix
command line, and thus SARSE is distributed only for
Linux and Mac OS X platforms. For this study, we used
RNAdbtools (Gorodkin et al. 2001), Pfold (Knudsen and
Hein 2003), Pcluster, and FoldalignM (Torarinsson et al.
2007). Further details about the use of the external tools are
at the SARSE homepage (http://sarse.ku.dk).
Clustering based on secondary structure prediction
Pcluster was developed to allow the analysis of erroneous
and structurally divergent alignments. The method uses the
structure score, S, for clustering sequences into subgroups.
S is defined as the sum of reliability values for all basepaired positions of a Pfold prediction multiplied by the
number of sequences (see Materials and Methods). The
clustering procedure proceeds by sorting the sequences of
an alignment into subgroups based on obtaining the
highest S score. To evaluate the clustering procedure, a
total structure score, Sto, is calculated for each step of the
clustering as the sum of the S scores of the subgroups (Fig.
2A). The Sto score will increase if the combined subgroups
support each other in secondary structure prediction and
decrease if they conflict in secondary structure prediction.
Thus, subgroups with significant structural differences will
be found in the decreasing part of the curve of Sto scores
(Fig. 2B, curves, left side). We devised an automatic
procedure to obtain the ‘‘best’’ subgrouping in the decreasing part of the curve that represents a tradeoff between
defining structural differences and loss of prediction (see
Materials and Methods). Pcluster allows other subgroupings to be investigated manually.
Pcluster was used in an attempt to evaluate the 574 seed
alignments of the Rfam database version 8.0 (GriffithsJones et al. 2005). The Rfam alignments were subjected to
Pcluster analysis and 290 alignments were automatically
subgrouped, indicating a loss of prediction during clustering of the alignment sequences (Fig. 2B, see examples).
Visual inspection of the subgrouped alignments revealed
possible misalignments (e.g., mir-399), pseudoknots
(e.g., in tmRNA), and hairpin insert/deletions (e.g., in
tRNA). Misalignments were observed as distinct subgroups.
Pcluster recognized the two parts of a pseudoknot as
different subgroups, since the Pfold algorithm is unable
to predict pseudoknots (Knudsen and Hein 1999). Helix
insertions were recognized as subgroups because Pcluster
favors subgrouping of sequences with an extra structure
feature.
1852

RNA, Vol. 13, No. 11

FIGURE 2. Clustering and subgrouping of Rfam seed alignments.
(A) Scheme showing the relation between the structure score, S, and
the total structure score, Sto, during the clustering of four sequences.
The S score is calculated from one Pfold prediction on one or more
sequences, and the Sto score is the sum of S scores for a given subgrouping. The first step of clustering is done by doing Pfold prediction on all combinations of two sequences, and then choosing the
combination with highest S score for further clustering. At last all
sequences end up in one group. (B) The cluster plot shows the Sto
scores versus the number of groups, G, obtained by the progressive
clustering of alignment sequences from right to left. Cluster curve
examples are shown from Rfam version 8.0 for the alignments U1,
RNaseP_nuc, and tmRNA. The maximum Sto score is indicated by
black arrowheads, and the ‘‘best’’ Sto score is indicated by open
arrowheads.

Detection of structural inconsistency
The predicted base pairs were classified as consistent,
inconsistent, or novel by comparing the structural annotation obtained by Pcluster to the Rfam annotation of
conserved secondary structure (Fig. 3A) and are easily
spotted in SARSE by coloring of the alignment (Fig. 3B).
To facilitate the inspection of Pcluster predictions, we
developed the ‘‘alistem-plot’’ that in a rectangle plots a
representation of the alignment (Fig. 1B, similar to the
overview window) and in a triangle below plots the basepairings (Fig. 3C). The plot provides an easy overview of
the consistent, inconsistent, and novel base pairs in an
alignment with different structural groups (Fig. 3D).
To identify Rfam alignments as candidates for manual
editing, we define scores for ranking prediction consistency,
Sco, inconsistency, Sin, and novelty, Sno (see definitions in
Materials and Methods). The three scores are calculated as
follows: The Sco score sums only the prediction reliability
of base pairs, when the assignment is equal to the Rfam
assignment; Sin only sums over predicted base pairs that
are incompatible with the Rfam assignment, that is, for a

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated RNA sequence editing

prediction and number of sequences.
Of all Pcluster-predicted base pairs
(weighed by prediction reliability) in
the Rfam alignments, 75% are found
to be consistent, 10% to be inconsistent,
and 15% novel.
To find the Rfam alignments that need
to be corrected in SARSE, we evaluate the
amount of score per sequence and the
average reliability of the prediction for
the Sco, Sin, and Sno scores (Fig. 4A–C).
Because the Rfam database contains
many small alignments (403 alignments
have #10 sequences), the reliability of
predictions is expected to be low since
they contain little phylogeny to support
Pfold predictions. Also, the Pcluster subgrouping of the alignments makes less
phylogenetic support available for predictions. To find the alignments that have
the most significant inconsistent and
novel predictions we calculate the average
reliability (Pav) for each category of consistent, inconsistent, and novel prediction
and partition the alignments in three
average reliability groups: high (Pav $
0.8), medium (0.8 > Pav $ 0.6), and
low (Pav < 0.6). Of the high average
reliability predictions we found that
319 alignments were consistent, 71 were
inconsistent, and 102 were novel. Thirtythree alignments contained both inconsistent and novel predictions of high
average reliability. This showed that
11% of all Rfam alignments should be
investigated for misalignment or inconFIGURE 3. Evaluating an alignment with subgroups. (A) The Rfam secondary structure sistent structure annotation and 18%
consensus (SS_cons) is shown in bracket annotation with ‘‘<’’ and ‘‘>’’ indicating a base pair. should be considered to have extended
The Pfold secondary structure prediction (Pfold) is shown in bracket annotation with ‘‘(’’ and structure annotation. The medium reli‘‘)’’ indicating a base pair. Each base pair of the Pfold secondary structure prediction is
evaluated in relation to the Rfam SS_cons as consistent (green), inconsistent (red), or novel ability predictions should also be investi(yellow). (B) The SARSE editor window showing the mir-399 alignment with Rfam structure gated, since they might contain a mixture
annotation (SS_cons) and two subgroups (SS_1 and SS_2). The evaluation shows that base of high- and low-reliability predictions.
pairs of one subgroup differ from the Rfam SS_cons (colored in red). The alignment is colored The ranking lists of consistent, inconsisas indicated by the color code shown below the editor window, where the shading indicates the
prediction reliability of a given base pair. (C) Schematics of the ‘‘alistem’’ plot. The rectangle tent, and novel predictions can be found
is a representation of the alignment, and the triangle below is used to show base pair. (D) at http://sarse.ku.dk/Rfam_sarse/.
‘‘Alistem’’ plot for the Pcluster-evaluated mir-399 alignment with alignment length shown
We investigated the high-reliability
above the plot.
inconsistent prediction ranked by the
Sin score per sequence (available at
http://sarse.ku.dk/Rfam_sarse/sin.html).
The ‘‘RNaseP_bact_a’’ alignment obtained the highest
position in the two assignments with base pairs, where the
score, which was caused by errors in the Rfam structure
base pair partner differs between the two structure assignannotation. The ‘‘S-element’’ and ‘‘K_chan_RES’’ alignments; the Sno score sums only when predicted base pairs
ments are examples of another type of high scoring alignare not coinciding with the Rfam structure annotation. The
sum of the Sco, Sin, and Sno score equals the Sto score.
ments, where the Pfold predicted structures were
Note that the scores depend both on the reliability of
significantly different from the Rfam structure annotation.
www.rnajournal.org

1853

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Andersen et al.

RNA alignments: (1) mir-399 RNA; (2) vertebrate telomerase RNA (vert-TR); (3) tmRNA; and (4) SRP RNA. Alignments 1–3 are from Rfam version 7.0 to show severe cases
of misalignment (Rfam version 8.0 has updated alignments
2 and 3). Alignment 4 is from the SRPDB resource and
was used as an example of a large and structurally divergent
RNA family. The following editing procedure is used:
(1) Pcluster is run on the full alignment; (2) regions of
inconsistency are spotted and evaluated by Pcluster; (3)
the alignment is edited manually or automatically by
FoldalignM; and (4) Pcluster is used to evaluate the improvement in structural alignment quality observed as an
increase in Sto score (compared in Table 1). The editing
projects can be downloaded at http://sarse.ku.dk/Rfam_
sarse/, and the files inspected in the project directory or in
the SARSE editor using the history window.
Mir-399

FIGURE 4. Evaluation of Pcluster prediction in relation to Rfam
structure annotation. (A) The consistent structure score, Sco, per
sequence, N, of an alignment is plotted against the average reliability
of the predicted base pairs that are consistent with the Rfam structure
annotation. (B,C) Similar plots for the inconsistent structure score,
Sin, and the novel structure score, Sno.

The majority of the subgrouped Rfam alignments were only
partially inconsistent and, as judged by the Pfold reliability
scores, likely to be real misalignments (examples are given
below). The alignments with novel prediction of high
reliability were evaluated by ranking the Sno score per
sequence (available at http://sarse.ku.dk/Rfam_sarse/sno.
html). Several alignments with missing annotation were
identified. Of notable examples were ‘‘Intron_gpI,’’ which
had an unannotated structural insert for a large subgroup of
sequences, and ‘‘SSU_rRNA_5’’, with an unannotated helix.
Taken together, the Pcluster analysis revealed numerous
inconsistencies and novel base pairs, and the ranking isolated
a significant number of errors for further investigation.
Correcting alignment inconsistencies
Four RNA families are chosen to demonstrate the usefulness of SARSE and Pcluster in generating higher-quality
1854

RNA, Vol. 13, No. 11

Pcluster analysis of the Rfam ‘‘mir-399’’ alignment gave rise
to five subgroups with 3% of the predictions being
inconsistent with the Rfam structure annotation (Table
2). The inconsistent base pairs are located at the center
of the miRNA helix for a subset of the sequences (Fig. 3D).
Pcluster analysis of the misaligned region (positions 10–30
and 125–145) resulted in three subgroups with one subgroup containing a misalignment (Fig. 5A, Rfam). An extra
column was inserted at position 17 and the inconsistent
bases were moved to overlap with base pairs of the two
other subgroups, and reevaluation by Pcluster gave no
inconsistent base pairs (Fig. 5B, Manual). FoldalignM
was used in an attempt to automatically align the inconsistency (positions 15–25 and 129–138) but reevaluation
by Pcluster indicated an alternative alignment (Fig. 5A,
FoldalignM), and thus Pcluster and FoldalignM do not
seem to agree on this editing. Pcluster analysis of the full
manually edited alignment resulted in five subgroups with
84% consistent and no inconsistent prediction (Table 2).
Pcluster continued to subgroup the alignment due to
divergent structure outside the Rfam annotation corresponding to 16% of the prediction. The alignment without
subgroups had an increase in Sto score (Table 2), demonstrating that the editing improved the quality of the
structural alignment.
Vert-TR

Vertebrate telomerase RNA has three major domains, one
of which contains a pseudoknot (Fragnet et al. 2005).
Pcluster analysis of the Rfam ‘‘Telomerase-vert’’ seed
alignment generated two subgroups with 6% of the prediction being inconsistent and 18% outside of the Rfam
structure annotation (Table 2). Investigation of a plot of
the prediction (available at http://sarse.ku.dk/Rfam_sarse/
sin.html) shows that stem 1 has inconsistent base pairs and
is not predicted for one group. Clustering of this region

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated RNA sequence editing

SRP RNA

TABLE 1. Manual and automatic correction of misalignments

To demonstrate an example of insertion
and deletion of helices, we used the SRP
Mir-399 Rfam v7
10–30, 125–145
379 (3)
316
18
45
RNA alignment from the SRPDB
Mir-399 Manual
10–31, 126–146
365 (2)
324
0
41
resource (Andersen et al. 2006). In
Mir-399 FoldalignM
10–31, 126–146
358 (3)
304
22
32
comparison, the Rfam alignments
Tel-vert Rfam v7
15–40, 295–330
1202 (8)
673
490
38
‘‘SRP_euk_arch’’ and ‘‘SRP_bact’’ conTel-vert Manual
18–42, 310–335
863 (2)
827
4
0
Tel-vert FoldalignM
18–42, 310–335
1117 (1)
1117
0
0
tained only a fraction of the divergent
tmRNA Rfam v7
15–56, 505–550
2934 (7)
2054
678
202
SRP structures known. Since the
tmRNA Manual
15–56, 505–550
3227 (3)
2958
95
175
SRPDB alignment was too large for
tmRNA FoldalignM
15–58, 507–553
3492 (2)
3200
47
245
Pcluster analysis (385 sequences and
The table summarizes structure scores for the analysis of a specified region of Rfam
983 alignment positions), we manually
alignments. The region is given as alignment positions and changes because the manual or
divided the SRP RNA alignment into 17
automatic alignment introduces gaps. For each alignment manipulation the region is
subgroups according to phylogeny and
reevaluated by Pcluster to give the new scores. The manual and automatic editing can be
inspected at http://sarse.ku.dk/.
sequence length. Pcluster analysis of
a
The number in parenthesis is the number of ‘‘best’’ groups as estimated by Pcluster. The
each subgroup individually gave rise to
scores are rounded off to whole numbers.
a total of 45 subgroups. Evaluating the
predicted base pairs in relation to the
structure annotation from the SRPDB
resource we measured, 86% of all predicted base pairs were
(positions 15–40 and 295–330) resulted in eight subgroups
consistent with the original assignment, 13% were inconwith 41% of the prediction being inconsistent (Fig. 4B,
sistent, and 2% were not assigned previously (Table 2).
Rfam; Table 1). Editing results in a decrease of inconsisVisualizing the inconsistent base pairs with SARSE, we
tency to 4% (Fig. 5B, Manual; Table 1). Using FoldalignM
observed that high-reliability inconsistencies were due to
for automatic alignment of the region (positions 18–39 and
annotation problems and alignment mistakes. After the
300–327) results in a fully consistent alignment (Fig. 5B,
corrections we were able to decrease the inconsistent
FoldalignM). Evaluation of the full alignment by Pcluster
base pairs to 7% (now available at http://genome.ku.dk/
then gave rise to one subgroup and an increase in Sto score
resources/srpdb; Table 2).
(Table 2).
We conclude that structural misalignments can be
detected
and corrected with relatively little effort using
tmRNA
Pcluster for evaluation and SARSE for editing. In addition,
The tmRNA was used as an example of the analysis of a
the manual editing done in response to the Pcluster
structural alignment with pseudoknots (Williams and
subgrouping can, in most cases, be substituted by autoBartel 1996). The large amount of sequences in the Rfam
matic editing by FoldalignM. The positive effect of editing
seed alignment was reduced (see Materials and Methods)
on the quality of the alignments can be evaluated by
and Pcluster analysis gave rise to six subgroups with 18% of
the prediction being inconsistent and 36% being novel base
pairs (Table 2). The high amount of novel base pairs was
present since only one side of the pseudoknots was
TABLE 2. Evaluating the update of full alignments
annotated in Rfam version 7.0. Interestingly, the pseudoAlignment version
Stoa
Sco
Sin
Sno
knots were observed as crossing base pairs between subMir-399 v7
804 (5)
626
25
153
groups, and several misalignments were observed both in
Mir-399 manual
798 (5)
668
0
130
helix 2 and in the region of the four pseudoknots (see the
Vert-TR v7
6328 (2)
4803
374
1151
plot at http://sarse.ku.dk/Rfam_sarse/sin.html). Helix 2 was
Vert-TR manual
6355 (1)
5147
239
969
analyzed by Pcluster (positions 15–56 and 505–550), which
tmRNA v7
8812 (7)
4000
1605
3207
showed that 23% of its prediction was misaligned in this
tmRNA manual
9261 (6)
4726
978
3557
SRPDB v167
44,252 (45)
37,948
5623
681
region (Fig. 5C, Rfam; Table 1). This was corrected to
SRPDB manual
44,208 (45)
40,601
3090
517
decrease the inconsistency to 3% (Fig. 5C, Manual; Table
1). Automatic alignment was done for the region (positions
The table summarizes structure scores calculated from Pcluster
analysis of full alignments before and after editing. As compared
40–50 and 520–550) and resulted in only 1% inconsistency
to Table 1, more changes have been made and can be inspected
as evaluated by Pcluster (Fig. 5C, FoldalignM; Table 1).
at http://sarse.ku.dk/. Editing results in a decrease in Sin score and
Pcluster analysis was performed individually for the eight
increase in Sco score.
a
The number in parenthesis is the number of ‘‘best’’ groups as
stems of the four pseudoknots, and the detected misalignestimated
by Pcluster. The scores are rounded off to whole
ments are corrected using SARSE to obtain an increased Sto
numbers.
score (Table 2).
Alignment version

Region

Stoa

Sco

Sin

Sno

www.rnajournal.org

1855

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Andersen et al.

FIGURE 5. Detection and editing of structural alignment errors. (A–
C) Region-specific ‘‘alistem’’-plots are shown for Rfam version 7.0,
the manually edited version, and the FoldalignM edited version.
Alignment positions are indicated above each ‘‘alistem’’ plot and the
base pairs are colored as in Fig. 3B. The ‘‘alistem’’-plots are direct
representations of the real alignment, which can be downloaded at
http://sarse.ku.dk and inspected in SARSE. The ‘‘alistem’’ plots for the
full alignment can be investigated in more details at http://sarse.ku.dk.

Pcluster to give an increased Sto score, decrease in Sin
score, and the appearance of fewer subgroups, overall
indicating stronger support between the sequences of the
alignment.
DISCUSSION
The SARSE program was developed to allow not only for
the basic functions required for editing of RNA structural
alignments, but also for simple integration of auxiliary tools.
Whereas the SARSE Java applet is platform independent,
the analysis programs in the toolbox depend on communication with the Unix command line. Thus, the full SARSE
package is distributed for Linux and Mac OS X computers
only. An important feature of SARSE is that it makes
alignment editing reproducible, since the editor logs the
1856

RNA, Vol. 13, No. 11

editing history, thus making it possible for other researchers
to inspect changes and updates made to an alignment.
For the current study, we developed tools for detecting
alignment inconsistencies and either manually or automatically correcting them. The user can easily customize
SARSE by adding new programs and constructing new
analysis pipelines. The editor uses the column format
(Gorodkin et al. 2001) and can load any type and number
of columns. Thus, SARSE is flexible as to the information
attached to each alignment position and is ready to
incorporate, e.g., 3D structural information for alignment
analysis. The column format can easily be converted to
other relevant formats by scripts in the RNAdbtool package
(Gorodkin et al. 2001) and new format converter scripts
can be made for, e.g., the RNAML format (Waugh et al.
2002).
Pfold was used as the basic algorithm for RNA structure
prediction in this study. It had earlier been found that the
Pfold parameters are applicable over a broad range of
evolutionary rates as tested by the example of the rapidly
evolving retrovirus HIV-1 (Knudsen et al. 2004). Thus, we
believe that Pfold is suitable for the broad analysis
presented here of all the RNA alignments of the Rfam
database. During the study it was found that a current
implementation of the Pfold prediction algorithm broke
down for some alignment (especially large alignments).
However, for this study the updated version of Pfold did
not break down. This problem has also been noted by
others (Gardner and Giegerich 2004).
Pcluster uses Pfold predictions as the means of clustering
the sequences of an alignment into structural groups. The
current study shows that this can be used as a method to
detect alignment inconsistencies. The detection of alignment inconsistencies depends on choosing a subgrouping
based on the clustering, and the ‘‘best’’ subgrouping devised in this study might not be optimal in all cases. By
Pcluster it is possible to inspect all subgroupings made
during the clustering. Another subgrouping procedure
might be based on a direct measurement of the amount
of inconsistency during clustering. The calculation speed of
the Pcluster algorithm is a problem for interactive editing
and evaluation. The amount of sequences causes the analysis to slow down since many combinations of sequences
have to be predicted by Pfold. In the current study we
limited the amount of sequences to 100 but this is still too
slow for interactive editing. By doing an initial subgrouping
based on sequence identity, the Pcluster algorithm could
work significantly faster. The secondary structure prediction algorithm used in this study was Pfold (Knudsen and
Hein 2003) but could be replaced by, e.g., RNAalifold
(Hofacker et al. 2002), which also assigns structure scores
for the individual alignment positions. The clustering
procedure could also be based on base-pair geometry
information for structural alignments with RNA motifs
(Leontis et al. 2002; Lescoute et al. 2005).

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated RNA sequence editing

The Rfam database was used as the subject of analysis
to show the broad applicability of the tools, and a general
procedure for the update of structural alignment quality
was devised. The Pcluster analysis of Rfam version 8.0
alignments showed that 10% of the predicted base pairs
were inconsistent and 12% were unannotated as compared
to the Rfam structure annotation. Taking only the most
reliable predictions of inconsistent and novel base pairs, we
found that 11% of all Rfam alignments should be investigated for misalignment or inconsistent structure annotation and 18% should be considered to have extended
structure annotation. For Rfam version 7.0 examples were
provided to show that the detected inconsistencies were
real misalignments. Since the clustering of a large alignment is based on effects from many structural elements,
specific regions were chosen as the subject of an additional
Pcluster analysis. This provides a region-specific clustering
that is optimal for editing. Several examples were shown
where manual editing and automatic editing could significantly improve the structural alignment in the region. It
was also shown that manual and automatic subgrouping
could be combined to analyze large and divergent alignments for structural misalignments. Given the advantages
gained by the subgrouping of the aligned sequences within
a larger RNA family, we suggest that each structurally
divergent subgroup should be treated as an individual seed
that is part of a larger family. We also suggest that Pcluster
analysis provides an indication of the quality of the Rfam
database and the inconsistent base-pair assignments are
expected to decrease in future versions as structural alignments are constructed according to higher-order structural
information (Leontis et al. 2002; Lescoute et al. 2005). We
thus provide a ranking of Rfam alignments on the SARSE
homepage (http://sarse.ku.dk/Rfam_sarse/sin.html). The evaluation scheme presented here will become more efficient as
more sequences arrive for each RNA family.
In conclusion, we have developed an RNA alignment editor with an open architecture that is suitable for
adding established and new tools, and is capable of
improving the quality of even the most challenging RNA
alignments.

MATERIALS AND METHODS
Data retrieval
The alignments of the Rfam database versions 7.0 and 8.0 were
retrieved as Rfam flat files containing annotated Rfam seed
alignments (http://www.sanger.ac.uk/Software/Rfam/). The secondary structure mask (SS_cons) was used for comparison with
the predictions. The Rfam seed was extracted to generate individual files, placed into separate alignment directories, and the
21 largest alignments were reduced by sequence similarity using
the Hobohm algorithm (Hobohm et al. 1992). The cutoff value
of the Hobohm algorithm was tuned to reduce the align-

ment to 70–100 sequences. The SARSE and Pcluster analyses were
performed inside each alignment directory. The curated alignment
of the signal recognition particle (SRP) RNA was retrieved from
the SRPDB resource (http://genome.ku.dk/resources/srpdb). The
full SRP alignment was subgrouped by phylogeny and sequence
length.

SARSE: RNA alignment editing and analysis
To allow for efficient editing of RNA structural alignments, a
semiautomated RNA sequence editor was implemented in Java
(version 1.5) with a user-friendly graphical front-end. Other programs can be integrated by describing executable scripts into an
XML file. SARSE allows the merging of several programs into a
pipeline. For this study, the pipeline scripts for Pfold, Pcluster,
and FoldalignM were constructed, which, apart from running the
core algorithms, create plots and output files in the working directory. A special pipe was used for folding of groups of sequences
such that Pfold was able to use the output of Pcluster. This was
widely used to facilitate the iterative procedure for generating
RNA structural alignments. The main data format of the editor
was the column format (http://genome.ku.dk/resources/colformat)
as this format easily integrates communication between the
applied programs. It is also possible to read and write fasta and
widetext formats (used in the SRPDB and tmRDB resources).
Other formats can be readily generated using the format converters of the RNAdbtools package (http://genome.ku.dk/resources/
rnadbtool). In addition we included FoldalignM to construct
multiple structural RNA alignments of unaligned sequences
(Torarinsson et al. 2007). The basic editor functionalities of
SARSE have successfully been applied to curate existing RNA
structural databases, SRPDB and tmRDB (Andersen et al. 2006).

Pcluster: Clustering RNA sequences based
on secondary structure
In a multiple RNA alignment, there may be sequences that disrupt
the Pfold secondary structure prediction for two reasons: poor
alignment or variations in structure. These problems may in part
be solved by splitting the sequences into smaller groups for which
both the structures and the alignments are consistent. Consequently, alignment errors may become apparent and can be corrected. Here, a method is described that clusters RNA sequences
into such groups according to structure without generating new
alignments.
To cluster the sequences of an alignment, we chose to evaluate
the secondary structure prediction of a subgroup, g, of n sequences by the structure score
L

SðgÞ = + Pd ði; gÞ  nðgÞ;
i=1

where i is the alignment position, and L is the length of the
alignment. Pd(i,g) is the Pfold reliability of the double-stranded
position i in group g. The S(g) score is used for clustering by the
following iterative procedure: (1) Let each sequence form its own
subgroup. (2) Calculate the S(g) score of all possible pairs of
subgroups. (3) Join the two subgroups with the highest S(g) score.
(4) Go back to step 2. This will produce a clustering of the
www.rnajournal.org

1857

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Andersen et al.

sequences into subgroups. The last clustering will have all the
sequences in the same group.
Next, we evaluated the clustering, and for each step of the
clustering procedure we calculated the total score, Sto, as the sum
of the scores of the individual groups

where the conditional parameter, Bin(Rfam,g), is 1 if either the left
or right base coincides with the Rfam assignment but the partner
does not, and zero if not. The Sco score only sums if the
assignments are consistent. The Sno score sums base pairs that
are not coinciding with the Rfam pairing mask. The evaluation
scores add up to the Sto score:

G

Sto = Sco + Sin + Sno:

Sto = + SðgÞ;
g=1

where G is the number of subgroups in the alignment. The Sto
score will increase when the combined subgroups have similar
structures, and decrease when the combined subgroups have
different structures. We were interested in finding the subgroups
that have different structure and, thus, needed to subgroup
the alignment where the Sto score decreased. We devised the
following automatic procedure to find a point between the
maximum of the total score, Sto(max), and the score for the full
alignment, Sto(1): On the plot of Sto score versus number of
subgroups, we found the slope from the point at Sto(N) to the
point that was halfway to Sto(max), where N is the number of
sequences in the alignment and max is the number of groups with
maximum score (Fig. 6). Sto(best) is found by multiplying the
slope by 1 and finding the point where it tangents the graph on
the left side (Fig. 6).

Evaluating alignments with subgroups
The structure prediction of a subgrouped alignment is evaluated
in relation to the Rfam structure annotation, and the predicted
base pairs can be consistent, inconsistent, or novel (Fig. 3A). We
define scores for evaluating consistency, Sco, inconsistency, Sin,
and novelty, Sno. They are computed in a similar way as the total
score, Sto, but with a conditional parameter that compares Rfam
with the Pcluster assignments. As an example we find the Sin
score by
G

L

Sin = + + SðgÞ  BinðRfam; gÞ;

The scores were used for ranking the Rfam alignments (available
at http://sarse.ku.dk/Rfam_sarse/).

Editing of alignment errors
Alignments were further investigated in the SARSE editor to
find evidence for an alignment error. In some cases the specific
region containing the alignment errors was reanalyzed by Pcluster
to generate subgroups that were only dependent on a particular
alignment mistake by specifying the questionable region in the
Pcluster settings. Finally, SARSE was used to align the subgroups according to their structure prediction. Alternatively, the
FoldalignM–McCaskill program was used to automatically align
the inconsistencies. To allow single stems of a larger alignment to
be aligned, it was made possible to specify a region in the following way: The region was cut out and subjected to FoldalignM–
McCaskill analysis. If a region consisted of two parts, then the
FoldalignM output was separated into two files by reference to the
input sequences. Finally, the full alignment was reassembled. The
Rfam pairing mask was added and adjusted to fit the FoldalignM
pairing mask manually. The new alignment was evaluated with
Pcluster to compare the Sto scores before and after the editing.

ACKNOWLEDGMENTS
We thank Jody Burks for useful comments on SARSE. This work
was supported by the Danish Research Councils FNU and FTP,
the Ministry of Food, Agriculture and Fisheries, the Danish Center
for Scientific Computation, and the Danish National Research
Foundation.

g=1 i=1

Received July 11, 2007; accepted August 2, 2007.

REFERENCES

FIGURE 6. Finding the ‘‘best’’ structural subgrouping. The clustering
curve is plotted as total structure score (Sto) versus number of groups
(G). The best subgrouping is found by finding the slope from the
point at Sto(N) to the point that is halfway to the maximum Sto score,
where N is the number of sequences in the alignment. The ‘‘best’’
subgroping is found by multiplying the slope by 1 and finding the
point where it tangents the graph on the left side.

1858

RNA, Vol. 13, No. 11

Andersen, E.S., Rosenblad, M.A., Larsen, N., Westergaard, J.C.,
Burks, J., Wower, I.K., Wower, J., Gorodkin, J., Samuelsson, T.,
and Zwieb, C. 2006. The tmRDB and SRPDB resources. Nucleic
Acids Res. 34: D163–D168. doi: 10.1093/nar/gkj142.
Brown, N.P., Leroy, C., and Sander, C. 1998. MView: A webcompatible database search or multiple alignment viewer. Bioinformatics 14: 380–381.
Clamp, M., Cuff, J., Searle, S.M., and Barton, G.J. 2004. The Jalview
Java alignment editor. Bioinformatics 20: 426–427.
De Oliveira, T., Miller, R., Tarin, M., and Cassol, S. 2003. An
integrated genetic data environment (GDE)-based LINUX interface for analysis of HIV-1 and other microbial sequences. Bioinformatics 19: 153–154.
De Rijk, P. and De Wachter, R. 1993. DCSE, an interactive tool for
sequence alignment and secondary structure research. Comput.
Appl. Biosci. 9: 735–740.
Eddy, S.R. and Durbin, R. 1994. RNA sequence analysis using
covariance models. Nucleic Acids Res. 22: 2079–2088. doi:
10.1093/nar/22.11.2079.

Downloaded from www.rnajournal.org on December 27, 2007 - Published by Cold Spring Harbor Laboratory Press

Semiautomated RNA sequence editing

Fragnet, L., Kut, E., and Rasschaert, D. 2005. Comparative functional
study of the viral telomerase RNA based on natural mutations.
J. Biol. Chem. 280: 23502–23515.
Gardner, P.P. and Giegerich, R. 2004. A comprehensive comparison of
comparative RNA structure prediction approaches. BMC Bioinformatics 5: 140. doi: 10.1186/1471-2105-5-140.
Gilbert, D. 1999. SeqPup version 0.9. http://iubio.bio.indiana.edu/
soft/molbio/seqpup/java/seqpup-doc.html.
Gorodkin, J., Heyer, L.J., and Stormo, G.D. 1997. Finding the most
significant common sequence and structure motifs in a set of RNA
sequences. Nucleic Acids Res. 25: 3724–3732. doi: 10.1093/nar/
25.18.3724.
Gorodkin, J., Zwieb, C., and Knudsen, B. 2001. Semi-automated
update and cleanup of structural RNA alignment databases. Bioinformatics 17: 642–645.
Griffiths-Jones, S. 2005. RALEE–RNA ALignment editor in Emacs.
Bioinformatics 21: 257–259.
Griffiths-Jones, S., Moxon, S., Marshall, M., Khanna, A., Eddy, S.R.,
and Bateman, A. 2005. Rfam: Annotating noncoding RNAs in
complete genomes. Nucleic Acids Res. 33: D121–D124. doi:
10.1093/nar/gki081.
Hall, T.A. 2005. BioEdit version 7.0.5. http://www.mbio.ncsu.edu/
BioEdit/bioedit.html.
Havgaard, J.H., Lyngso, R.B., Stormo, G.D., and Gorodkin, J. 2005.
Pairwise local structural alignment of RNA sequences with
sequence similarity less than 40%. Bioinformatics 21: 1815–1824.
Hobohm, U., Scharf, M., Schneider, R., and Sander, C. 1992. Selection
of representative protein data sets. Protein Sci. 1: 409–417.
Hofacker, I.L. 2003. Vienna RNA secondary structure server. Nucleic
Acids Res. 31: 3429–3431. doi: 10.1093/nar/gkg599.
Hofacker, I.L., Fekete, M., and Stadler, P.F. 2002. Secondary
structure prediction for aligned RNA sequences. J. Mol. Biol. 319:
1059–1066.
Hofacker, I.L., Bernhart, S.H., and Stadler, P.F. 2004. Alignment of
RNA base pairing probability matrices. Bioinformatics 20: 2222–
2227.
Holmes, I. 2005. Accelerated probabilistic inference of RNA
structure evolution. BMC Bioinformatics 6: 73. doi: 10.1186/14712105-6-73.
Holmes, I. and Rubin, G.M. 2002. Pairwise RNA structure comparison with stochastic context-free grammars. Pac. Symp. Biocomput.
7: 163–174.
Knudsen, B. and Hein, J. 1999. RNA secondary structure prediction
using stochastic context-free grammars and evolutionary history.
Bioinformatics 15: 446–454.
Knudsen, B. and Hein, J. 2003. Pfold: RNA secondary structure
prediction using stochastic context-free grammars. Nucleic Acids
Res. 31: 3423–3428. doi: 10.1093/nar/gkg614.
Knudsen, B., Andersen, E.S., Damgaard, C., Kjems, J., and
Gorodkin, J. 2004. Evolutionary rate variation and RNA secondary
structure prediction. Comput. Biol. Chem. 28: 219–226.
Leontis, N.B., Stombaugh, J., and Westhof, E. 2002. The non-WatsonCrick base pairs and their associated isostericity matrices. Nucleic
Acids Res. 30: 3497–3531. doi: 10.1093/nar/gkf481.

Lescoute, A., Leontis, N.B., Massire, C., and Westhof, E. 2005.
Recurrent structural RNA motifs, Isostericity Matrices and
sequence alignments. Nucleic Acids Res. 33: 2395–2409. doi:
10.1093/nar/gki535.
Luck, R., Graf, S., and Steger, G. 1999. ConStruct: A tool for
thermodynamic controlled prediction of conserved secondary
structure. Nucleic Acids Res. 27: 4208–4217. http://nar.oxford
journals.org/content/vol27/issue21/index.dtl
Ludwig, W., Strunk, O., Westram, R., Richter, L., Meier, H., Yadhukumar, Buchner, A., Lai, T., Steppi, S., Jobb, G., et al. 2004. ARB:
A software environment for sequence data. Nucleic Acids Res. 32:
1363–1371. doi: 10.1093/nar/gkh293.
Mathews, D.H. 2005. Predicting a set of minimal free energy RNA
secondary structures common to two sequences. Bioinformatics 21:
2246–2253.
Mathews, D.H. and Turner, D.H. 2002. Dynalign: An algorithm for
finding the secondary structure common to two RNA sequences.
J. Mol. Biol. 317: 191–203.
Pace, N.R. and Thomas, B.C. 1999. Probing RNA structure, function,
and history by comparative analysis. In RNA World, 2d ed. (eds.
R.F. Gesteland et al.), pp. 113–141. Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, NY.
Reeder, J. and Giegerich, R. 2005. Consensus shapes: An alternative to
the Sankoff algorithm for RNA consensus structure prediction.
Bioinformatics 21: 3516–3523.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I.S., Sjolander, K.,
Underwood, R.C., and Haussler, D. 1994. Stochastic context-free
grammars for tRNA modeling. Nucleic Acids Res. 22: 5112–5120.
doi: 10.1093/nar/22.23.5112.
Sankoff, D. 1985. Simultaneous solution of the RNA folding,
alignment and protosequence problems. SIAM J. Appl. Math. 45:
810–825.
Seibel, P.N., Muller, T., Dandekar, T., Schultz, J., and Wolf, M. 2006.
4SALE—A tool for synchronous RNA sequence and secondary
structure alignment and editing. BMC Bioinformatics 7: 498. doi:
10.1186/1471-2105-7-498.
Torarinsson, E., Havgaard, J.H., and Gorodkin, J. 2007. Multiple
structural alignment and clustering of RNA sequences. Bioinformatics 23: 926–932.
Waugh, A., Gendron, P., Altman, R., Brown, J.W., Case, D.,
Gautheret, D., Harvey, S.C., Leontis, N., Westbrook, J.,
Westhof, E., et al. 2002. RNAML: A standard syntax for exchanging RNA information. RNA 8: 707–717.
Williams, K.P. and Bartel, D.P. 1996. Phylogenetic analysis of tmRNA
secondary structure. RNA 2: 1306–1310.
Woese, C.R., Gutell, R., Gupta, R., and Noller, H.F. 1983. Detailed
analysis of the higher-order structure of 16S-like ribosomal
ribonucleic acids. Microbiol. Rev. 47: 621–669.
Yao, Z., Weinberg, Z., and Ruzzo, W.L. 2006. CMfinder—A covariance model based RNA motif finding algorithm. Bioinformatics 22:
445–452.
Zuker, M. and Stiegler, P. 1981. Optimal computer folding of large
RNA sequences using thermodynamics and auxiliary information.
Nucleic Acids Res. 9: 133–148. doi: 10.1093/nar/9.1.133.

www.rnajournal.org

1859