Title: | 'ggplot2'-Based Tools for Visualising DNA Sequences and Modifications |
Version: | 0.2.1 |
Description: | Uses 'ggplot2' to visualise either (a) a single DNA/RNA sequence split across multiple lines, (b) multiple DNA/RNA sequences, each occupying a whole line, or (c) base modifications such as DNA methylation called by modified bases models in Dorado or Guppy. Functions starting with visualise_<something>() are the main plotting functions, and functions starting with extract_<something>() are key helper functions for reading files and reformatting data. Source code is available at https://github.com/ejade42/ggDNAvis and a full non-expert user guide is available at https://ejade42.github.io/ggDNAvis/. |
Imports: | ggplot2, dplyr, tidyr, stringr, raster, rlang, ragg, png, magick |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.3 |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Depends: | R (≥ 3.5) |
LazyData: | true |
Language: | en-GB |
URL: | https://ejade42.github.io/ggDNAvis/, https://github.com/ejade42/ggDNAvis |
BugReports: | https://github.com/ejade42/ggDNAvis/issues |
NeedsCompilation: | no |
Packaged: | 2025-09-15 02:38:11 UTC; evelyn |
Author: | Evelyn Jade |
Maintainer: | Evelyn Jade <evelynjade42@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-09-21 13:30:07 UTC |
ggDNAvis: 'ggplot2'-Based Tools for Visualising DNA Sequences and Modifications
Description
Uses 'ggplot2' to visualise either (a) a single DNA/RNA sequence split across multiple lines, (b) multiple DNA/RNA sequences, each occupying a whole line, or (c) base modifications such as DNA methylation called by modified bases models in Dorado or Guppy. Functions starting with visualise_<something>() are the main plotting functions, and functions starting with extract_<something>() are key helper functions for reading files and reformatting data. Source code is available at https://github.com/ejade42/ggDNAvis and a full non-expert user guide is available at https://ejade42.github.io/ggDNAvis/.
Author(s)
Maintainer: Evelyn Jade evelynjade42@gmail.com (ORCID) [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/ejade42/ggDNAvis/issues
Convert MM tag to absolute index locations (read_modified_fastq()
helper)
Description
This function takes a sequence, a SAM-style vector of number of potential
target bases to skip in between each target base that was actually assessed,
and a target base type (defaults to "C"
as 5-methylcytosine is most common).
It identifies the indices/locations of all instances of the target base within the
sequence, and then goes along the vector of these indices, skipping them if requested
by skips
.
For example, the sequence "GGCGGCGGCGGC"
with target "C"
and skips c(0, 0, 1)
would identify that the indices where "C"
occurs are c(3, 6, 9, 12)
. It would then
take the first index, the second index, skip one, and take the fourth index i.e.
return c(3, 6, 12)
. If instead the skips were given as c(0, 2)
it would take the
first index, skip two, and take the fourth index i.e. return c(3, 12)
. If the skips
were given as c(1, 1)
it would skip one, take the second index, skip one, and take
the fourth index i.e. return c(6, 12)
.
The length of skips
corresponds to the number of indices/locations that will be returned
(i.e. the length of the returned locations vector).
Ideally the length of skips
plus the sum of skips
(i.e. the number returned plus the
total number skipped) is the same or less than the number of possible locations. If it is
the same, then the last possible location will be taken; if it is less then some number of
possible locations at the end will be skipped.
Important: if the length of skips
plus the sum of skips
is greater than the number
of possible locations (instances of the target base within the sequence), then the total
number of taken or skipped locations will be greater than the number of available locations.
In this case, the returned vector will contain NA after the available locations have run out.
In the example above, skips = c(0, 0, 0, 0, 0)
would return c(3, 6, 9, 12, NA)
, and
skips = c(0, 2, 0)
would return c(3, 12, NA)
.
Therefore, if the target base is totally absent from the sequence (e.g. target "A"
in
"GGCGGCGGCGGC"
), then any non-zero length of skips
will return the same length of NA
s e.g.
skips = c(0)
would return NA
, and skips = c(0, 1, 0)
would return c(NA, NA, NA)
.
If skips
has length zero, it will return numeric(0)
.
This function is reversed by convert_locations_to_MM_vector()
.
Usage
convert_MM_vector_to_locations(sequence, skips, target_base = "C")
Arguments
sequence |
|
skips |
|
target_base |
|
Value
integer vector
. All of the base indices at which methylation/modification information was processed. Will all be instances of the target base.
Examples
convert_MM_vector_to_locations(
"GGCGGCGGCGGC",
skips = c(0, 0, 0, 0),
target_base = "C"
)
convert_MM_vector_to_locations(
"GGCGGCGGCGGC",
skips = c(1, 1, 1, 1),
target_base = "G"
)
convert_MM_vector_to_locations(
"GGCGGCGGCGGC",
skips = c(0, 0, 2, 1, 0),
target_base = "G"
)
Map a single base to the corresponding number (generic ggDNAvis
helper)
Description
This function takes a single base and numerically
encodes it for visualisation via raster::raster()
.
Encoding: A = 1
, C = 2
, G = 3
, T/U = 4
.
Usage
convert_base_to_number(base)
Arguments
base |
|
Value
integer
. The corresponding number.
Examples
convert_base_to_number("A")
convert_base_to_number("c")
convert_base_to_number("g")
convert_base_to_number("T")
convert_base_to_number("u")
Split a single input sequence into a vector of "lines" for visualisation (visualise_single_sequence()
helper)
Description
Takes a single input sequence and an integer line length, and splits the input
sequence into lines of that length (with the last line potentially being shorter).
Optionally inserts empty strings ""
after each line to space them out.
Usage
convert_input_seq_to_sequence_list(
input_seq,
line_length,
spacing = 1,
spaces_first = TRUE
)
Arguments
input_seq |
|
line_length |
|
spacing |
|
spaces_first |
|
Value
character vector
. The input sequence split into multiple lines, with specified spacing in between.
Examples
convert_input_seq_to_sequence_list(
"GGCGGCGGC",
line_length = 6,
spacing = 1,
spaces_first = TRUE
)
convert_input_seq_to_sequence_list(
"GGCGGCGGC",
line_length = 3,
spacing = 2,
spaces_first = FALSE
)
convert_input_seq_to_sequence_list(
"GGCGGCGGC",
line_length = 6,
spacing = 0
)
Convert absolute index locations to MM tag (write_modified_fastq()
helper)
Description
This function takes a vector of modified base locations as absolute indices
(i.e. a 1
would mean the first base in the sequence has been assessed for
modification; a 15
would mean the 15th base has), and converts it to a vector
in the format of the SAM/BAM MM tags. The MM tag defines a particular target base (e.g.
C
for methylation), and then stores the number of skipped instances of that base
between sites where modification was assessed. In practice, this often means counting the
number of non-CpG C
s in between CpG C
s. In a GGC
repeat, this should be a bunch of 0
s
as every C
is in a CpG, but unique sequence will have many non-CpG C
s.
This function is reversed by convert_MM_vector_to_locations()
.
Usage
convert_locations_to_MM_vector(sequence, locations, target_base = "C")
Arguments
sequence |
|
locations |
|
target_base |
|
Value
integer vector
. A component of a SAM MM tag, representing the number of skipped target bases in between each assessed base.
Examples
convert_locations_to_MM_vector(
"GGCGGCGGCGGC",
locations = c(3, 6, 9, 12),
target_base = "C"
)
convert_locations_to_MM_vector(
"GGCGGCGGCGGC",
locations = c(1, 4, 7, 10),
target_base = "G"
)
convert_locations_to_MM_vector(
"GGCGGCGGCGGC",
locations = c(1, 2, 4, 5, 7, 8, 10, 11),
target_base = "G"
)
Convert string-ified modification probabilities and locations to a single vector of probabilities (visualise_methylation()
helper)
Description
Takes modification locations (indices along the read signifying bases at which
modification probability was assessed) and modification probabilities (the probability
of modification at each assessed location, as an integer from 0 to 255), as comma-separated
strings (e.g. "1,5,25"
) produced from numerical vectors via vector_to_string()
.
Outputs a numerical vector of the modification probability for each base along the read.
i.e. -2 for indices outside sequences, -1 for bases where modification was not assessed,
and probability from 0-255 for bases where modification was assessed.
Usage
convert_modification_to_number_vector(
modification_locations_str,
modification_probabilities_str,
max_length,
sequence_length
)
Arguments
modification_locations_str |
|
modification_probabilities_str |
|
max_length |
|
sequence_length |
|
Value
numeric vector
. A vector of length max_length
indicating the probability of methylation at each index along the read - 0 where methylation was not assessed, and probability from 0-255 where methylation was assessed.
Examples
convert_modification_to_number_vector(
modification_locations_str = "3,6,9,12",
modification_probabilities = "100,200,50,150",
max_length = 15,
sequence_length = 13
)
Map a sequence to a vector of numbers (generic ggDNAvis
helper)
Description
This function takes a sequence and encodes it as a vector
of numbers for visualisation via raster::raster()
.
Encoding: A = 1
, C = 2
, G = 3
, T/U = 4
.
Usage
convert_sequence_to_numbers(sequence, length = NA)
Arguments
sequence |
|
length |
|
Value
integer vector
. The numerical encoding of the input sequence, cut/padded to the desired length.
Examples
convert_sequence_to_numbers("ATCGATCG")
convert_sequence_to_numbers("ATCGATCG", length = NA)
convert_sequence_to_numbers("ATCGATCG", length = 4)
convert_sequence_to_numbers("ATCGATCG", length = 10)
Convert a vector of sequences to a dataframe for plotting sequence contents and index annotations (visualise_single_sequence()
helper)
Description
Takes the sequence list output from convert_input_seq_to_sequence_list()
and creates a dataframe
specifying x and y coordinates and the character to plot at each coordinate. This applies to both
the sequence itself (e.g. determining where on the plot to place an "A"
) and the periodicit
annotations of index number (e.g. determining where on the plot to annotate base number 15
).
Usage
convert_sequences_to_annotations(
sequences,
line_length,
interval = 15,
annotations_above = TRUE,
annotation_vertical_position = 1/3
)
Arguments
sequences |
|
line_length |
|
interval |
|
annotations_above |
|
annotation_vertical_position |
|
Value
dataframe
Dataframe of coordinates and labels (e.g. "A"
or "15
), readable by geom_text.
Examples
convert_sequences_to_annotations(
c("GGCGGC", "", "ATCG", ""),
line_length = 6,
interval = 3,
annotations_above = TRUE,
annotation_vertical_position = 1/3
)
convert_sequences_to_annotations(
c("GGCGGC", "", "ATCG", ""),
line_length = 6,
interval = 0
)
Rasterise a vector of sequences into a numerical dataframe for ggplotting (generic ggDNAvis
helper)
Description
Takes a character vector of sequences (which are allowed to be empty ""
to
act as a spacing line) and rasterises it into a dataframe that ggplot can read.
Usage
create_image_data(sequences)
Arguments
sequences |
|
Value
dataframe
. Rasterised dataframe representation of the sequences, readable by ggplot2::ggplot()
.
Examples
create_image_data(c("ATCG", "", "GGCGGC", ""))
Print a numeric vector to console (ggDNAvis
debug helper)
Description
Takes a numeric vector, and prints it to the console separated by ", "
.
This allows the output to be copy-pasted into a vector within an R script.
Used for taking vector outputs and then writing them as literals within a script.
E.g. when given input 1:5
, prints 1, 2, 3, 4, 5
, which can be directly copy-pasted
within c()
to input that vector. Printing normally via print(1:5)
instead prints
[1] 1 2 3 4 5
, which is not valid vector input so can't be copy-pasted directly.
See debug_join_vector_str()
for the equivalent for character/string vectors.
Usage
debug_join_vector_num(vector)
Arguments
vector |
|
Value
None (invisible NULL
) - uses cat()
to output directly to console.
Examples
debug_join_vector_num(1:5)
Print a character/string vector to console (ggDNAvis
debug helper)
Description
Takes a character/string vector, and prints it to the console separated by ", "
.
This allows the output to be copy-pasted into a vector within an R script.
Used for taking vector outputs and then writing them as literals within a script.
E.g. when given input strsplit("ABCD", split = "")[[1]]
, prints "A", "B", "C", "D"
,
which can be directly copy-pasted within c()
to input that vector.
Printing normally via print(strsplit("ABCD", split = "")[[1]])
instead prints
[1] "A" "B" "C" "D"
, which is not valid vector input so can't be copy-pasted directly.
See debug_join_vector_num()
for the equivalent for numeric vectors.
Usage
debug_join_vector_str(vector)
Arguments
vector |
|
Value
None (invisible NULL
) - uses cat()
to output directly to console.
Examples
debug_join_vector_str(c("A", "B", "C", "D"))
Example multiple sequences data
Description
A collection of made-up sequences in the style of long reads over a repeat region
(e.g. NOTCH2NLC), with meta-data describing the participant each read is from and
the family each participant is from. Can be used in visualise_many_sequences()
,
visualise_methylation()
, and helper functions to visualise these sequences.
Generation code is available at data-raw/example_many_sequences.R
Usage
example_many_sequences
Format
example_many_sequences
A dataframe with 23 rows and 10 columns:
- family
Participant family
- individual
Participant ID
- read
Unique read ID
- sequence
DNA sequence of the read
- sequence_length
Length (nucleotides) of the read
- quality
FASTQ quality scores for the read. Each character represents a score from 0 to 40 - see
fastq_quality_scores
.
These values are made up viapmin(pmax(round(rnorm(n, mean = 20, sd = 10)), 0), 40)
i.e. sampled from a normal distribution with mean 20 and standard deviation 10, then rounded to integers between 0 and 40 (inclusive) - seeexample_many_sequences.R
- methylation_locations
Indices along the read (starting at 1) at which methylation probability was assessed i.e. CpG sites. Stored as a single character value per read, condensed from a numeric vector via
vector_to_string()
.- methylation_probabilities
Probability of methylation (8-bit integer i.e. 0-255) for each assessed base. Stored as a single character value per read, condensed from a numeric vector via
vector_to_string()
.
These values are made up viaround(runif(n, min = 0, max = 255))
- seeexample_many_sequences.R
- hydroxymethylation_locations
Indices along the read (starting at 1) at which hydroxymethylation probability was assessed i.e. CpG sites. Stored as a single character value per read, condensed from a numeric vector via
vector_to_string()
.- hydroxymethylation_probabilities
Probability of hydroxymethylation (8-bit integer i.e. 0-255) for each assessed base. Stored as a single character value per read, condensed from a numeric vector via
vector_to_string()
.
These values are made up viaround(runif(n, min = 0, max = 255 - this_base_methylation_probability))
such that the summed methylation and hydroxymethylation probability never exceeds 255 (100%) - seeexample_many_sequences.R
Extract, sort, and add spacers between sequences in a dataframe
Description
This function takes a dataframe that contains sequences and metadata,
recursively splits it into multiple levels of groups defined by grouping_levels
,
and adds breaks between each level of group as defined by grouping_levels
.
Within each lowest-level group, reads are sorted by sort_by
, with order determined
by desc_sort
.
Default values are set up to work with the included dataset
example_many_sequences
.
The returned sequences vector is ideal input for
visualise_many_sequences()
.
Also called by extract_methylation_from_dataframe()
to produce input for visualise_methylation()
.
Usage
extract_and_sort_sequences(
sequence_dataframe,
sequence_variable = "sequence",
grouping_levels = c(family = 8, individual = 2),
sort_by = "sequence_length",
desc_sort = TRUE
)
Arguments
sequence_dataframe |
|
sequence_variable |
|
grouping_levels |
|
sort_by |
|
desc_sort |
|
Value
character vector
. The sequences ordered and grouped as specified, with blank sequences (""
) inserted as spacers as specified.
Examples
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "sequence",
grouping_levels = c("family" = 8, "individual" = 2),
sort_by = "sequence_length",
desc_sort = TRUE
)
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "sequence",
grouping_levels = c("family" = 3),
sort_by = "sequence_length",
desc_sort = FALSE
)
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "sequence",
grouping_levels = NA,
sort_by = "sequence_length",
desc_sort = TRUE
)
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "sequence",
grouping_levels = c("family" = 8, "individual" = 2),
sort_by = NA
)
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "sequence",
grouping_levels = NA,
sort_by = NA
)
extract_and_sort_sequences(
example_many_sequences,
sequence_variable = "quality",
grouping_levels = c("individual" = 3),
sort_by = "quality",
desc_sort = FALSE
)
Extract methylation information from dataframe for visualisation
Description
This function takes a dataframe that contains methylation information in the form of
locations (indices along the read signifying bases at which modification probability
was assessed) and probabilities (the probability of modification at each assessed
location, as an integer from 0 to 255).
Each observation/row in the dataframe represents one sequence (e.g. a Nanopore read).
In the locations and probabilities column, each sequence (row) has many numbers associated.
These are stored as one string per observation e.g. "3,6,9,12"
, with the column representing
a character vector of such strings (e.g. c("3,6,9,12", "1,2,3,4")
).
This function calls extract_and_sort_sequences()
on each of these three columns and returns
a list of vectors stored in $locations
, $probabilities
, and $lengths
.
These can then be used as input for visualise_methylation()
.
Default arguments are set up to work with the included example_many_sequences
data.
Usage
extract_methylation_from_dataframe(
modification_data,
locations_colname = "methylation_locations",
probabilities_colname = "methylation_probabilities",
lengths_colname = "sequence_length",
grouping_levels = c(family = 8, individual = 2),
sort_by = "sequence_length",
desc_sort = TRUE
)
Arguments
modification_data |
|
locations_colname |
|
probabilities_colname |
|
lengths_colname |
|
grouping_levels |
|
sort_by |
|
desc_sort |
|
Value
list
, containing $locations
(character vector
), $probabilities
(character vector
), and $lengths
(numeric vector
).
Examples
## See documentation for extract_and_sort_sequences()
## for more examples of changing sorting/grouping
extract_methylation_from_dataframe(
example_many_sequences,
locations_colname = "methylation_locations",
probabilities_colname = "methylation_probabilities",
lengths_colname = "sequence_length",
grouping_levels = c("family" = 8, "individual" = 2),
sort_by = "sequence_length",
desc_sort = TRUE
)
extract_methylation_from_dataframe(
example_many_sequences,
locations_colname = "hydroxymethylation_locations",
probabilities_colname = "hydroxymethylation_probabilities",
lengths_colname = "sequence_length",
grouping_levels = c("family" = 8, "individual" = 2),
sort_by = "sequence_length",
desc_sort = TRUE
)
Vector of the quality scores used by the FASTQ format
Description
A vector of the characters used to indicate quality scores from 0 to 40
in the FASTQ format. These scores are related to the error probability p
via Q = -10 \text{ log}_{10}(p)
, so a Q-score of 10 (represented by "+"
) means
the error probability is 0.1, a Q-score of 20 ("5"
) means the error probability
is 0.01, and a Q-score of 30 ("?"
) means the error probability is 0.001.
The character representations store Q-scores in one byte each by using ASCII encodings,
where the Q-score for a character is its ASCII code minus 33 (e.g. A
has an ASCII
code of 65 and represents a Q-score of 32).
This vector contains the characters in order but starting with a score of 0, meaning
the character at index n
represents a Q-score of n-1
e.g. the first
character ("!"
) represents a score of 0; the eleventh character ("+"
)
represents a score of 10.
The full set of possible score representations, in order and presented as a single
string, is !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
.
Generation code is available at data-raw/fastq_quality_scores.R
Usage
fastq_quality_scores
Format
fastq_quality_scores
A character vector of length 41
- fastq_quality_scores
The vector
c("!", '"', "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I")
Merge FASTQ data with metadata
Description
Merge a dataframe of sequence and quality data (as produced by
read_fastq()
from an unmodified FASTQ file) with a dataframe of
metadata, reverse-complementing sequences if required such that all
reads are now in the forward direction.
merge_methylation_with_metadata()
is the equivalent function for
working with FASTQs that contain DNA modification information.
FASTQ dataframe must contain columns of "read"
(unique read ID),
"sequence"
(DNA sequence), and "quality"
(FASTQ quality score).
Other columns are allowed but not required, and will be preserved unaltered
in the merged data.
Metadata dataframe must contain "read"
(unique read ID) and "direction"
(read direction, either "forward"
or "reverse"
for each read) columns,
and can contain any other columns with arbitrary information for each read.
Columns that might be useful include participant ID and family designations
so that each read can be associated with its participant and family.
Important: A key feature of this function is that it uses the direction
column from the metadata to identify which rows are reverse reads. These reverse
reads will then be reversed-complemented and have quality scores reversed
such that all reads are in the forward direction, ideal for consistent analysis or
visualisation. The output columns are "forward_sequence"
and "forward_quality"
.
Calls reverse_sequence_if_needed()
and reverse_quality_if_needed()
to implement the reversing - see documentation for these functions for more details.
Usage
merge_fastq_with_metadata(
fastq_data,
metadata,
reverse_complement_mode = "DNA"
)
Arguments
fastq_data |
|
metadata |
|
reverse_complement_mode |
|
Value
dataframe
. A merged dataframe containing all columns from the input dataframes, as well as forward versions of sequences and qualities.
Examples
## Locate files
fastq_file <- system.file("extdata",
"example_many_sequences_raw.fastq",
package = "ggDNAvis")
metadata_file <- system.file("extdata",
"example_many_sequences_metadata.csv",
package = "ggDNAvis")
## Read files
fastq_data <- read_fastq(fastq_file)
metadata <- read.csv(metadata_file)
## Merge data (including reversing if needed)
merge_fastq_with_metadata(fastq_data, metadata)
Merge methylation with metadata
Description
Merge a dataframe of methylation/modification data (as produced by
read_modified_fastq()
) with a dataframe of metadata, reversing
sequence and modification information if required such that all information
is now in the forward direction.
merge_fastq_with_metadata()
is the equivalent function for working with
unmodified FASTQs (sequence and quality only).
Methylation/modification dataframe must contain columns of "read"
(unique read ID),
"sequence"
(DNA sequence), "quality"
(FASTQ quality score), "sequence_length"
(read length), "modification_types"
(a comma-separated string of SAMtools modification
headers produced via vector_to_string()
e.g. "C+h?,C+m?"
), and,
for each modification type, a column of comma-separated strings of modification
locations (e.g. "3,6,9,12"
) and a column of comma-separated strings of
modification probabilities (e.g. "255,0,64,128"
). See read_modified_fastq()
for more information on how this dataframe is formatted and produced.
Other columns are allowed but not required, and will be preserved unaltered
in the merged data.
Metadata dataframe must contain "read"
(unique read ID) and "direction"
(read direction, either "forward"
or "reverse"
for each read) columns,
and can contain any other columns with arbitrary information for each read.
Columns that might be useful include participant ID and family designations
so that each read can be associated with its participant and family.
Important: A key feature of this function is that it uses the direction
column from the metadata to identify which rows are reverse reads. These reverse
reads will then be reversed-complemented and have modification information reversed
such that all reads are in the forward direction, ideal for consistent analysis or
visualisation. The output columns are "forward_sequence"
, "forward_quality"
,
"forward_<modification_type>_locations"
, and "forward_<modification_type>_probabilities"
.
Calls reverse_sequence_if_needed()
, reverse_quality_if_needed()
,
reverse_locations_if_needed()
, and reverse_probabilities_if_needed()
to implement the reversing - see documentation for these functions for more details.
If wanting to write reversed sequences to FASTQ via write_modified_fastq()
, locations
must be symmetric (e.g. CpG) and offset must be set to 1. Asymmetric locations are impossible
to write to modified FASTQ once reversed because then e.g. cytosine methylation will be assessed
at guanines, which SAMtools can't account for. Symmetrically reversing CpGs via
reversed_location_offset = 1
is the only way to fix this.
Usage
merge_methylation_with_metadata(
methylation_data,
metadata,
reversed_location_offset = 0,
reverse_complement_mode = "DNA"
)
Arguments
methylation_data |
|
metadata |
|
reversed_location_offset |
|
reverse_complement_mode |
|
Value
dataframe
. A merged dataframe containing all columns from the input dataframes, as well as forward versions of sequences, qualities, modification locations, and modification probabilities (with separate locations and probabilities columns created for each modification type in the modification data).
Examples
## Locate files
modified_fastq_file <- system.file("extdata",
"example_many_sequences_raw_modified.fastq",
package = "ggDNAvis")
metadata_file <- system.file("extdata",
"example_many_sequences_metadata.csv",
package = "ggDNAvis")
## Read files
methylation_data <- read_modified_fastq(modified_fastq_file)
metadata <- read.csv(metadata_file)
## Merge data (including reversing if needed)
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 0)
## Merge data with offset = 1
merge_methylation_with_metadata(methylation_data, metadata, reversed_location_offset = 1)
Read sequence and quality information from FASTQ
Description
This function simply reads a FASTQ file into a dataframe containing
columns for read ID, sequence, and quality scores.
Optionally also contains a column of sequence lengths.
See fastq_quality_scores
for an explanation of quality.
Resulting dataframe can be written back to FASTQ via write_fastq()
.
To read/write a modified FASTQ containing modification information
(SAM/BAM MM and ML tags) in the header lines, use
read_modified_fastq()
and write_modified_fastq()
.
Usage
read_fastq(filename = file.choose(), calculate_length = TRUE)
Arguments
filename |
|
calculate_length |
|
Value
dataframe
. A dataframe with read
, sequence
, quality
, and optionally sequence_length
columns.
Examples
## Locate file
fastq_file <- system.file("extdata",
"example_many_sequences_raw.fastq",
package = "ggDNAvis")
## View file
for (i in 1:16) {
cat(readLines(fastq_file)[i], "\n")
}
## Read file to dataframe
read_fastq(fastq_file, calculate_length = FALSE)
read_fastq(fastq_file, calculate_length = TRUE)
Read modification information from modified FASTQ
Description
This function reads a modified FASTQ file (e.g. created by samtools fastq -T MM,ML
from a BAM basecalled with a modification-capable model in Dorado or Guppy) to a dataframe.
By default, the dataframe contains columns for unique read id (read
), sequence (sequence
),
sequence length (sequence_length
), quality (quality
), comma-separated (via vector_to_string()
)
modification types present in each read (modification_types
), and for each modification type,
a column of comma-separated modification locations (<type>_locations
) and
a column of comma-separated modification probabilities (<type>_probabilities
).
Modification locations are the indices along the read at which modification was assessed
e.g. a 3 indicates that the third base in the read was assessed for modifications of the given type.
Modification probabilities are the probability that the given modification is present, given as
an integer from 0-255 where integer N
represents the probability space from \frac{N}{256}
to \frac{N+1}{256}
.
To extract the numbers from these columns as numeric vectors to analyse, use string_to_vector()
e.g.
list_of_locations <- lapply(test_01$`C+h?_locations`, string_to_vector)
. Be aware that the SAM
modification types often contain special characters, meaning the colname may need to be enclosed in
backticks as in this example. Alternatively, use extract_methylation_from_dataframe()
to
create a list of locations, probabilities, and lengths ready for visualisation in
visualise_methylation()
. This works with any modification type extracted in this function,
just provide the relevant colname when calling extract_methylation_from_dataframe()
.
Optionally (by specifying debug = TRUE
), the dataframe will also contain columns of
the raw MM and ML tags (<MM/ML>_raw
) and of the same tags with the initial label
trimmed out (<MM/ML>_tags
). This is not recommended in most situations but may help
with debugging unexpected issues as it contains the raw data exactly from the FASTQ.
Dataframes produced by this function can be written back to modified FASTQ via write_modified_fastq()
.
Usage
read_modified_fastq(filename = file.choose(), debug = FALSE)
Arguments
filename |
|
debug |
|
Value
dataframe
. Dataframe of modification information, as described above.
Sequences can be visualised with visualise_many_sequences()
and modification information can be visualised with visualise_methylation()
(despite the name, any type of information can be visualised as long as it has locations and probabilities columns).
Can be written back to FASTQ via write_modified_fastq()
.
Examples
## Locate file
modified_fastq_file <- system.file("extdata",
"example_many_sequences_raw_modified.fastq",
package = "ggDNAvis")
## View file
for (i in 1:16) {
cat(readLines(modified_fastq_file)[i], "\n")
}
## Read file to dataframe
read_modified_fastq(modified_fastq_file, debug = FALSE)
read_modified_fastq(modified_fastq_file, debug = TRUE)
Reverse complement a DNA/RNA sequence (generic ggDNAvis
helper)
Description
This function takes a string/character representing a DNA/RNA sequence and returns
the reverse complement. Either DNA (A/C/G/T
) or RNA (A/C/G/U
) input is accepted.
By default, output is DNA (so A
is reverse-complemented to T
), but it can be set
to output RNA (so A
is reverse-complemented to U
).
Usage
reverse_complement(sequence, output_mode = "DNA")
Arguments
sequence |
|
output_mode |
|
Value
character
. The reverse-complement of the input sequence.
Examples
reverse_complement("ATGCTAG")
reverse_complement("UUAUUAGC", output_mode = "RNA")
reverse_complement("AcGtU", output_mode = "DNA")
reverse_complement("aCgTU", output_mode = "RNA")
Reverse modification locations if needed (merge_methylation_with_metadata()
helper)
Description
This function takes a vector of condensed modification locations/indices (e.g.
c("3,6,9,12", "1,4,7,10")
), a vector of directions (which must all be either
"forward"
or "reverse"
, not case-sensitive), and a vector of sequence lengths
(integers).
Returns a vector of condensed locations where reads that were originally forward
are unchanged, and reads that were originally reverse are flipped to now be forward.
Optionally, a numerical offset can be set. If this is left at 0
(the default value),
then a CpG assessed for methylation would be reverse-complemented to a CG with the
modification information ascribed to the G (as the G is at the location where the actual
modified C was on the other strand). However, setting the offset to 1
would shift all
of the modification indices by 1 such that the modification is now ascribed to the C of the
reverse-strand CG. This is beneficial for visualising the modifications as it ensures consistency
between originally-forward and originally-reverse strands by making the modification score associated
with each CpG site always be located at the C, but may be misleading for quantitative analysis.
Setting the offset to anything other than 0
or 1
should work but may be biologically misleading,
so produces a warning.
Called by merge_methylation_with_metadata()
to create a forward dataset, alongside
reverse_sequence_if_needed()
, reverse_quality_if_needed()
, and reverse_probabilities_if_needed()
.
Example:
Forward sequence, with indices of Cs in CpGs numbered:
C C C A G G C G G C G G C G A C C G A 7 10 13 17
(length = 19, locations = "7,10,13,17"
, CpGs = 7-8, 10-11, 13-14, 17-18)
Reverse sequence, with indices of C in CpGs numbered:
T C G G T C G C C G C C G C C T G G G 2 6 9 12
(length = 19, locations = "2,6,9,12"
, CpGs = 2-3, 6-7, 9-10, 12-13)
As CG reverse-complements to itself, each CpG site has a 1:1 correspondence with
a CpG site in the reverse strand. Many methylation calling models assess C-methylation
at the C of each CpG. To map the locations from C to C, we take 19 - <index>
such that
"7,10,13,17"
becomes "12,9,6,2"
and "2,6,9,12"
becomes "17,13,10,7"
.
The symmetry of CpGs means mapping from C to C is also symmetric.
This is achieved by setting offset = 1
, as mapping C to C involves shifting position by 1.
Conversely, to map the locations from C to G (i.e. preserving the actual location of each
modification, which is required if assessed locations are non-symmetric/don't reverse-complement
to themselves like CpGs do), we take 20 - <index>
such that
"7,10,13,17"
becomes "13,10,7,3"
i.e. the indices of the Gs in CpGs in the reverse
sequence. Likewise "2,6,9,12"
becomes "18,14,11,8"
i.e. the indices of the Gs in CpGs in
the forward sequence.
This is achieved by setting offset = 0
, as mapping C to G preserves the actual original position
at which each modification was assessed, but changes the base to its complement.
In general, new locations are calculated as (<length> + 1 - <offset>) - <index>
.
Of course, output locations are reversed before returning so that they all
return in ascending order, as is standard for all location vectors/strings.
If wanting to write reversed sequences to FASTQ via write_modified_fastq()
, locations
must be symmetric (e.g. CpG) and offset must be set to 1. Asymmetric locations are impossible
to write to modified FASTQ once reversed because then e.g. cytosine methylation will be assessed
at guanines, which SAMtools can't account for. Symmetrically reversing CpGs via offset = 1
is
the only way to fix this.
Usage
reverse_locations_if_needed(
locations_vector,
direction_vector,
length_vector,
offset = 0
)
Arguments
locations_vector |
|
direction_vector |
|
length_vector |
|
offset |
|
Value
character vector
. A vector of all forward versions of the input locations vector.
Examples
reverse_locations_if_needed(
locations_vector = c("7,10,13,17", "2,6,9,12"),
direction_vector = c("forward", "reverse"),
length_vector = c(19, 19),
offset = 0
)
reverse_locations_if_needed(
locations_vector = c("7,10,13,17", "2,6,9,12"),
direction_vector = c("forward", "reverse"),
length_vector = c(19, 19),
offset = 1
)
Reverse modification probabilities if needed (merge_methylation_with_metadata()
helper)
Description
This function takes a vector of condensed modification probabilities
(e.g. c("128,0,63,255", "3,78,1"
) and a vector of directions (which
must all be either "forward"
or "reverse"
, not case-sensitive),
and returns a vector of condensed modification probabilities where those
that were originally forward are unchanged, and those that were originally
reverse are flipped to now be forward.
Called by merge_methylation_with_metadata()
to create a forward dataset, alongside
reverse_sequence_if_needed()
, reverse_quality_if_needed()
, and reverse_locations_if_needed()
.
Usage
reverse_probabilities_if_needed(probabilities_vector, direction_vector)
Arguments
probabilities_vector |
|
direction_vector |
|
Value
character vector
. A vector of all forward versions of the input probabilities vector.
Examples
reverse_probabilities_if_needed(
probabilities_vector = c("100,200,50", "100,200,50"),
direction_vector = c("forward", "reverse")
)
Reverse qualities if needed (merge_methylation_with_metadata()
helper)
Description
This function takes a vector of FASTQ qualities and a vector of directions
(which must all be either "forward"
or "reverse"
, not case-sensitive)
and returns a vector of forward qualities.
Qualities of reads that were forward to begin with are unchanged,
while qualities of reads that were reverse are now flipped
to give the corresponding forward quality scores.
Called by merge_methylation_with_metadata()
to create a forward dataset,
alongside reverse_sequence_if_needed()
, reverse_locations_if_needed()
,
and reverse_probabilities_if_needed()
.
Usage
reverse_quality_if_needed(quality_vector, direction_vector)
Arguments
quality_vector |
|
direction_vector |
|
Value
character vector
. A vector of all forward versions of the input quality vector.
Examples
reverse_quality_if_needed(
quality_vector = c("#^$&$*", "#^$&$*"),
direction_vector = c("reverse", "forward")
)
Reverse sequences if needed (merge_methylation_with_metadata()
helper)
Description
This function takes a vector of DNA/RNA sequences and a vector of directions
(which must all be either "forward"
or "reverse"
, not case-sensitive)
and returns a vector of forward DNA/RNA sequences.
Sequences in the vector that were forward to begin with are unchanged,
while sequences that were reverse are reverse-complemented via reverse_complement()
to produce the forward sequence.
Called by merge_methylation_with_metadata()
to create a forward dataset, alongside
reverse_quality_if_needed()
, reverse_locations_if_needed()
and reverse_probabilities_if_needed()
.
Usage
reverse_sequence_if_needed(
sequence_vector,
direction_vector,
output_mode = "DNA"
)
Arguments
sequence_vector |
|
direction_vector |
|
output_mode |
|
Value
character vector
. A vector of all forward versions of the input sequence vector.
Examples
reverse_sequence_if_needed(
sequence_vector = c("TAAGGC", "TAAGGC"),
direction_vector = c("reverse", "forward")
)
reverse_sequence_if_needed(
sequence_vector = c("UAAGGC", "UAAGGC"),
direction_vector = c("reverse", "forward"),
output_mode = "RNA"
)
Colour palettes for sequence visualisations
Description
A collection of colour palettes for use with visualise_single_sequence()
and visualise_many_sequences()
.
Each is a character vector of 4 colours,
corresponding to A, C, G, and T/U in that order.
To use inside the visualisation
functions, set sequence_colours = sequence_colour_palettes$<palette_name>
Generation code is available at data-raw/sequence_colour_palettes.R
Usage
sequence_colour_palettes
Format
sequence_colour_palettes
A list of 5 length-4 character vectors
- ggplot_style
The shades of red, green, blue, and purple that
ggplot2::ggplot()
uses by default for a 4-way discrete colour scheme.
Values:c("#F8766D", "#7CAE00", "#00BFC4", "#C77CFF")
- bright_pale
Bright yellow, green, blue, and red in lighter pastel-like tones.
Values:c("#FFDD00", "#40C000", "#00A0FF", "#FF4E4E")
- bright_pale2
Bright yellow, green, blue, and red in lighter pastel-like tones. The green (for C) is slightly ligther than bright_pale.
Values:c("#FFDD00", "#30EC00", "#00A0FF", "#FF4E4E")
- bright_deep
Bright orange, green, blue, and red in darker, richer tones.
Values:c("#FFAA00", "#00BC00", "#0000DC", "#FF1E1E")
- sanger
Green, blue, black, and red similar to a traditional Sanger sequencing readout.
Values:c("#00B200", "#0000FF", "#000000", "#FF0000")
Split a ","
-joined string back to a vector (generic ggDNAvis
helper)
Description
Takes a string (character) produced by vector_to_string()
and recreates the vector.
Note that if a vector of multiple strings is input (e.g. c("1,2,3", "9,8,7"
)) the output
will be a single concatenated vector (e.g. c(1, 2, 3, 9, 8, 7)
).
If the desired output is a list of vectors, try lapply()
e.g.
lapply(c("1,2,3", "9,8,7"), string_to_vector)
returns list(c(1, 2, 3), c(9, 8, 7))
.
Usage
string_to_vector(string, type = "numeric", sep = ",")
Arguments
string |
|
type |
|
sep |
|
Value
<type> vector
. The resulting vector (e.g. c(1, 2, 3)
).
Examples
## String to numeric vector (default)
string_to_vector("1,2,3,4")
string_to_vector("1,2,3,4", type = "numeric")
string_to_vector("1;2;3;4", sep = ";")
## String to character vector
string_to_vector("A,B,C,D", type = "character")
## String to logical vector
string_to_vector("TRUE FALSE TRUE", type = "logical", sep = " ")
## By default, vector inputs are concatenated
string_to_vector(c("1,2,3", "4,5,6"))
## To create a list of vector outputs, use lapply()
lapply(c("1,2,3", "4,5,6"), string_to_vector)
Join a vector into a comma-separated string (generic ggDNAvis
helper)
Description
Takes a vector and condenses it into a single string by joining items with ","
.
Reversed by string_to_vector()
.
Usage
vector_to_string(vector, sep = ",")
Arguments
vector |
|
sep |
|
Value
character
. The same vector but as a comma-separated string (e.g. "1,2,3"
).
Examples
vector_to_string(c(1, 2, 3, 4))
vector_to_string(c("These", "are", "some", "words"))
vector_to_string(3:5, sep = ";")
Visualise many DNA/RNA sequences
Description
This function takes a vector of DNA/RNA sequences (each sequence can be
any length and they can be different lengths), and plots each sequence
as base-coloured squares along a single line. Setting filename
allows direct
export of a png image with the correct dimensions to make every base a perfect
square. Empty strings (""
) within the vector can be utilised as blank spacing
lines. Colours and pixels per square when exported are configurable.
Usage
visualise_many_sequences(
sequences_vector,
sequence_colours = sequence_colour_palettes$ggplot_style,
background_colour = "white",
margin = 0.5,
sequence_text_colour = "black",
sequence_text_size = 16,
outline_colour = "black",
outline_linewidth = 3,
outline_join = "mitre",
return = TRUE,
filename = NA,
render_device = ragg::agg_png,
pixels_per_base = 100
)
Arguments
sequences_vector |
|
sequence_colours |
|
background_colour |
|
margin |
|
sequence_text_colour |
|
sequence_text_size |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
return |
|
filename |
|
render_device |
|
pixels_per_base |
|
Value
A ggplot object containing the full visualisation, or invisible(NULL)
if return = FALSE
. It is often more useful to use filename = "myfilename.png"
, because then the visualisation is exported at the correct aspect ratio.
Examples
## Create sequences vector
sequences <- extract_and_sort_sequences(example_many_sequences)
## Visualise example_many_sequences with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_many_sequences(sequences)
## Export with all defaults rather than returning
visualise_many_sequences(
sequences,
filename = "example_vms_01.png",
return = FALSE
)
## View exported image
image <- png::readPNG("example_vms_01.png")
unlink("example_vms_01.png")
grid::grid.newpage()
grid::grid.raster(image)
## Export while customising appearance
visualise_many_sequences(
sequences,
filename = "example_vms_02.png",
return = FALSE,
sequence_colours = sequence_colour_palettes$bright_pale,
sequence_text_colour = "white",
background_colour = "lightgrey",
outline_linewidth = 0,
margin = 0
)
## View exported image
image <- png::readPNG("example_vms_02.png")
unlink("example_vms_02.png")
grid::grid.newpage()
grid::grid.raster(image)
Visualise methylation probabilities for many DNA sequences
Description
This function takes vectors of modifications locations, modification probabilities,
and sequence lengths (e.g. created by extract_methylation_from_dataframe()
) and
visualises the probability of methylation (or other modification) across each read.
Assumes that the three main input vectors are of equal length n and represent n sequences
(e.g. Nanopore reads), where locations
are the indices along each read at which modification
was assessed, probabilities
are the probability of modification at each assessed site, and
lengths
are the lengths of each sequence.
For each sequence, renders non-assessed (e.g. non-CpG) bases as other_bases_colour
, renders
background (including after the end of the sequence) as background_colour
, and renders assessed
bases on a linear scale from low_colour
to high_colour
.
Clamping means that the endpoints of the colour gradient can be set some distance into the probability
space e.g. with Nanopore > SAM probability values from 0-255, the default is to render 0 as fully blue
(#0000FF
), 255 as fully red (#FF0000
), and values in between linearly interpolated. However, clamping with
low_clamp = 100
and high_clamp = 200
would set all probabilities up to 100 as fully blue,
all probabilities 200 and above as fully red, and linearly interpolate only over the 100-200
range.
A separate scalebar plot showing the colours corresponding to each probability, with any/no clamping values,
can be produced via visualise_methylation_colour_scale()
.
Usage
visualise_methylation(
modification_locations,
modification_probabilities,
sequence_lengths,
low_colour = "blue",
high_colour = "red",
low_clamp = 0,
high_clamp = 255,
background_colour = "white",
other_bases_colour = "grey",
outline_colour = "black",
outline_linewidth = 3,
outline_join = "mitre",
modified_bases_outline_colour = NA,
modified_bases_outline_linewidth = NA,
modified_bases_outline_join = NA,
other_bases_outline_colour = NA,
other_bases_outline_linewidth = NA,
other_bases_outline_join = NA,
margin = 0.5,
return = TRUE,
filename = NA,
render_device = ragg::agg_png,
pixels_per_base = 20
)
Arguments
modification_locations |
|
modification_probabilities |
|
sequence_lengths |
|
low_colour |
|
high_colour |
|
low_clamp |
|
high_clamp |
|
background_colour |
|
other_bases_colour |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
modified_bases_outline_colour |
|
modified_bases_outline_linewidth |
|
modified_bases_outline_join |
|
other_bases_outline_colour |
|
other_bases_outline_linewidth |
|
other_bases_outline_join |
|
margin |
|
return |
|
filename |
|
render_device |
|
pixels_per_base |
|
Value
A ggplot object containing the full visualisation, or invisible(NULL)
if return = FALSE
. It is often more useful to use filename = "myfilename.png"
, because then the visualisation is exported at the correct aspect ratio.
Examples
## Extract info from dataframe
methylation_info <- extract_methylation_from_dataframe(example_many_sequences)
## Visualise example_many_sequences with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_methylation(
methylation_info$locations,
methylation_info$probabilities,
methylation_info$lengths
)
## Export with all defaults rather than returning
visualise_methylation(
methylation_info$locations,
methylation_info$probabilities,
methylation_info$lengths,
filename = "example_vm_01.png",
return = FALSE
)
## View exported image
image <- png::readPNG("example_vm_01.png")
unlink("example_vm_01.png")
grid::grid.newpage()
grid::grid.raster(image)
## Export with customisation
visualise_methylation(
methylation_info$locations,
methylation_info$probabilities,
methylation_info$lengths,
filename = "example_vm_02.png",
return = FALSE,
low_colour = "white",
high_colour = "black",
low_clamp = 0.3*255,
high_clamp = 0.7*255,
other_bases_colour = "lightblue1",
other_bases_outline_linewidth = 1,
other_bases_outline_colour = "grey",
modified_bases_outline_linewidth = 3,
modified_bases_outline_colour = "black",
margin = 0.3
)
## View exported image
image <- png::readPNG("example_vm_02.png")
unlink("example_vm_02.png")
grid::grid.newpage()
grid::grid.raster(image)
Visualise methylation colour scalebar
Description
This function creates a scalebar showing the colouring scheme based on methylation
probability that is used in visualise_methylation()
. Showing this is particularly
important when the colour range is clamped via low_clamp
and high_clamp
(e.g.
setting that all values below 100 are fully blue (#0000FF
), all values above 200 are
fully red (#FF0000
), and colour interpolation occurs only in the range 100-200, rather
than across the whole range 0-255). If clamping is off (default), then 0 is fully blue,
255 is fully read, and all values are linearly interpolated. NB: colours are configurable
but default to blue = low modification probability and red = high modification probability.
Usage
visualise_methylation_colour_scale(
low_colour = "blue",
high_colour = "red",
low_clamp = 0,
high_clamp = 255,
full_range = c(0, 255),
precision = 10^3,
background_colour = "white",
x_axis_title = NULL,
do_x_ticks = TRUE,
do_side_scale = FALSE,
side_scale_title = NULL,
outline_colour = "black",
outline_linewidth = 1
)
Arguments
low_colour |
|
high_colour |
|
low_clamp |
|
high_clamp |
|
full_range |
|
precision |
|
background_colour |
|
x_axis_title |
|
do_x_ticks |
|
do_side_scale |
|
side_scale_title |
|
outline_colour |
|
outline_linewidth |
|
Value
ggplot of the scalebar.
Unlike the other visualise_<>
functions in this package, does not directly export a png. This is because there are no squares that need to be rendered at a precise aspect ratio in this function. It can just be saved normally with ggplot2::ggsave()
with any sensible combination of height and width.
Examples
## Defaults match defaults of visualise_methylation()
visualise_methylation_colour_scale()
## Use clamping and change colours
visualise_methylation_colour_scale(
low_colour = "white",
high_colour = "black",
low_clamp = 0.3*255,
high_clamp = 0.7*255,
full_range = c(0, 255),
background_colour = "lightblue1",
x_axis_title = "Methylation probability"
)
## Lower precision = colour banding
visualise_methylation_colour_scale(
precision = 10,
do_x_ticks = FALSE
)
Visualise a single DNA/RNA sequence
Description
This function takes a DNA/RNA sequence and returns a ggplot visualising it, with the option to directly export a png image with appropriate dimensions. Colours, line wrapping, index annotation interval, and pixels per square when exported are configurable.
Usage
visualise_single_sequence(
sequence,
sequence_colours = sequence_colour_palettes$ggplot_style,
background_colour = "white",
line_wrapping = 75,
spacing = 1,
margin = 0.5,
sequence_text_colour = "black",
sequence_text_size = 16,
index_annotation_colour = "darkred",
index_annotation_size = 12.5,
index_annotation_interval = 15,
index_annotations_above = TRUE,
index_annotation_vertical_position = 1/3,
outline_colour = "black",
outline_linewidth = 3,
outline_join = "mitre",
return = TRUE,
filename = NA,
render_device = ragg::agg_png,
pixels_per_base = 100
)
Arguments
sequence |
|
sequence_colours |
|
background_colour |
|
line_wrapping |
|
spacing |
|
margin |
|
sequence_text_colour |
|
sequence_text_size |
|
index_annotation_colour |
|
index_annotation_size |
|
index_annotation_interval |
|
index_annotations_above |
|
index_annotation_vertical_position |
|
outline_colour |
|
outline_linewidth |
|
outline_join |
|
return |
|
filename |
|
render_device |
|
pixels_per_base |
|
Value
A ggplot object containing the full visualisation, or invisible(NULL)
if return = FALSE
. It is often more useful to use filename = "myfilename.png"
, because then the visualisation is exported at the correct aspect ratio.
Examples
## Create sequence to visualise
sequence <- paste(c(rep("GGC", 72), rep("GGAGGAGGCGGC", 15)), collapse = "")
## Visualise with all defaults
## This looks ugly because it isn't at the right scale/aspect ratio
visualise_single_sequence(sequence)
## Export with all defaults rather than returning
visualise_single_sequence(
sequence,
filename = "example_vss_01.png",
return = FALSE
)
## View exported image
image <- png::readPNG("example_vss_01.png")
unlink("example_vss_01.png")
grid::grid.newpage()
grid::grid.raster(image)
## Export while customising appearance
visualise_single_sequence(
sequence,
filename = "example_vss_02.png",
return = FALSE,
sequence_colours = sequence_colour_palettes$bright_pale,
sequence_text_colour = "white",
background_colour = "lightgrey",
line_wrapping = 60,
spacing = 2,
outline_linewidth = 0,
index_annotations_above = FALSE,
margin = 0
)
## View exported image
image <- png::readPNG("example_vss_02.png")
unlink("example_vss_02.png")
grid::grid.newpage()
grid::grid.raster(image)
Write sequence and quality information to FASTQ
Description
This function simply writes a FASTQ file from a dataframe containing
columns for read ID, sequence, and quality scores.
See fastq_quality_scores
for an explanation of quality.
Said dataframe can be produced from FASTQ via read_fastq()
.
To read/write a modified FASTQ containing modification information
(SAM/BAM MM and ML tags) in the header lines, use
read_modified_fastq()
and write_modified_fastq()
.
Usage
write_fastq(
dataframe,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = "quality",
return = FALSE
)
Arguments
dataframe |
|
filename |
|
read_id_colname |
|
sequence_colname |
|
quality_colname |
|
return |
|
Value
character vector
. The resulting FASTQ file as a character vector of its constituent lines (or invisible(NULL)
if return
is FALSE
). This is probably mostly useful for debugging, as setting filename
within this function directly writes to FASTQ via writeLines()
. Therefore, defaults to returning invisible(NULL)
.
Examples
## Write to FASTQ (using filename = NA, return = FALSE
## to view as char vector rather than writing to file)
write_fastq(
example_many_sequences,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = "quality",
return = TRUE
)
## quality_colname = NA fills in quality with "B"
write_fastq(
example_many_sequences,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = NA,
return = TRUE
)
Write modification information stored in dataframe back to modified FASTQ
Description
This function takes a dataframe containing DNA modification information
(e.g. produced by read_modified_fastq()
) and writes it back to modified
FASTQ, equivalent to what would be produced via samtools fastq -T MM,ML
.
Arguments give the names of columns within the dataframe from which to read.
If multiple types of modification have been assessed (e.g. both methylation
and hydroxymethylation), then multiple colnames must be provided for locations
and probabilites, and multiple prefixes (e.g. "C+h?"
) must be provided.
IMPORTANT: These three vectors must all be the same length, and the modification
types must be in a consistent order (e.g. if writing hydroxymethylation and methylation
in that order, must do H then M in all three vectors and never vice versa).
If quality isn't known (e.g. there was a FASTA step at some point in the pipeline),
the quality
argument can be set to NA
to fill in quality scores with "B"
. This
is the same behaviour as SAMtools v1.21 when converting FASTA to SAM/BAM then FASTQ.
I don't really know why SAMtools decided the default quality should be "B" but there
was probably a reason so I have stuck with that.
Default arguments are set up to work with the included example_many_sequences
data.
Usage
write_modified_fastq(
dataframe,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = "quality",
locations_colnames = c("hydroxymethylation_locations", "methylation_locations"),
probabilities_colnames = c("hydroxymethylation_probabilities",
"methylation_probabilities"),
modification_prefixes = c("C+h?", "C+m?"),
include_blank_tags = TRUE,
return = FALSE
)
Arguments
dataframe |
|
filename |
|
read_id_colname |
|
sequence_colname |
|
quality_colname |
|
locations_colnames |
|
probabilities_colnames |
|
modification_prefixes |
|
include_blank_tags |
|
return |
|
Value
character vector
. The resulting modified FASTQ file as a character vector of its constituent lines (or invisible(NULL)
if return
is FALSE
). This is probably mostly useful for debugging, as setting filename
within this function directly writes to FASTQ via writeLines()
. Therefore, defaults to returning invisible(NULL)
.
Examples
## Write to FASTQ (using filename = NA, return = FALSE
## to view as char vector rather than writing to file)
write_modified_fastq(
example_many_sequences,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = "quality",
locations_colnames = c("hydroxymethylation_locations",
"methylation_locations"),
probabilities_colnames = c("hydroxymethylation_probabilities",
"methylation_probabilities"),
modification_prefixes = c("C+h?", "C+m?"),
return = TRUE
)
## Write methylation only, and fill in qualities with "B"
write_modified_fastq(
example_many_sequences,
filename = NA,
read_id_colname = "read",
sequence_colname = "sequence",
quality_colname = NA,
locations_colnames = c("methylation_locations"),
probabilities_colnames = c("methylation_probabilities"),
modification_prefixes = c("C+m?"),
return = TRUE
)