% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vs_cluster_unoise.R
\name{vs_cluster_unoise}
\alias{vs_cluster_unoise}
\alias{cluster_unoise}
\alias{unoise}
\alias{denoise}
\title{Denoising FASTA sequences}
\usage{
vs_cluster_unoise(
  fasta_input,
  otutabout = NULL,
  minsize = 8,
  unoise_alpha = 2,
  relabel = NULL,
  relabel_sha1 = FALSE,
  log_file = NULL,
  threads = 1,
  vsearch_options = NULL,
  tmpdir = NULL
)
}
\arguments{
\item{fasta_input}{(Required). A FASTA file path or a FASTA object containing
reads to denoise. See \emph{Details}.}

\item{otutabout}{(Optional). A character string specifying the name of the
output file in an OTU table format. If \code{NULL} (default), the output is
returned as a tibble in R. See \emph{Details}.}

\item{minsize}{(Optional). Minimum abundance of cluster centroids.
Defaults to \code{8}.}

\item{unoise_alpha}{(Optional). Alpha value for the UNOISE algorithm.
Defaults to \code{2}.}

\item{relabel}{(Optional). Relabel sequences using the given prefix and a
ticker to construct new headers. Defaults to \code{NULL}.}

\item{relabel_sha1}{(Optional). If \code{TRUE}, relabel sequences using the
SHA1 message digest algorithm. Defaults to \code{FALSE}.}

\item{log_file}{(Optional). Name of the log file to capture messages from
\code{VSEARCH}. If \code{NULL} (default), no log file is created.}

\item{threads}{(Optional). Number of computational threads to be used by
\code{VSEARCH}. Defaults to \code{1}.}

\item{vsearch_options}{(Optional). Additional arguments to pass to
\code{VSEARCH}. Defaults to \code{NULL}. See \emph{Details}.}

\item{tmpdir}{(Optional). Path to the directory where temporary files should
be written when tables are used as input or output. Defaults to
\code{NULL}, which resolves to the session-specific temporary directory
(\code{tempdir()}).}
}
\value{
A read count table with one row for each cluster and one column for
each sample. If \code{otutabout} is a character string, it is taken to be a
file name and the table is written to that file. If \code{otutabout} is
\code{NULL} (default), the table is returned as a tibble.

The first two columns of this tibble list the \code{Header} and
\code{Sequence} of the centroid sequence of each cluster.

The clustering statistics are included as an attribute named
\code{"statistics"} with the following columns:
\itemize{
  \item \code{num_nucleotides}: Total number of nucleotides used as input for
  clustering.
  \item \code{min_length_input_seq}: Length of the shortest sequence used as
  input for clustering.
  \item \code{max_length_input_seq}: Length of the longest sequence used as
  input for clustering.
  \item \code{avg_length_input_seq}: Average length of the sequences used as
  input for clustering.
  \item \code{num_clusters}: Number of clusters generated.
  \item \code{min_size_cluster}: Size of the smallest cluster.
  \item \code{max_size_cluster}: Size of the largest cluster.
  \item \code{avg_size_cluster}: Average size of the clusters.
  \item \code{num_singletons}: Number of singletons after clustering.
  \item \code{input}: Name of the input file/object for the clustering.
}
}
\description{
\code{vs_cluster_unoise} performs denoising of FASTA sequences
from a given file or object using \code{VSEARCH}'s \code{cluster_unoise}
method.
}
\details{
Sequences are denoised according to the UNOISE version 3 algorithm by Robert
Edgar, but without the de novo chimera removal step. In this algorithm,
clustering of sequences depends both on their similarity and their
abundances. The abundance ratio (skew) is the abundance of a new
sequence divided by the abundance of the centroid sequence. For the two
sequences to be clustered together, this skew must not exceed beta, where
\eqn{\beta = 2^{-(1 + \alpha d)}}{beta = 2^(-(1 + alpha * d))},
\eqn{\alpha}{alpha} is the value of \code{unoise_alpha}, and \eqn{d} is the
sequence distance. The sequence distance is the number of mismatches in the
alignment, ignoring gaps. The abundance a new sequence may have and still be
included in the cluster therefore decreases exponentially with its distance
from the centroid.
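
As a rough illustration of this threshold, the sketch below evaluates beta in
base R for the default alpha of 2. The helper \code{unoise_beta} and the
centroid abundance of 800 are hypothetical and chosen only for this example;
they are not part of the package.
\preformatted{
# Hypothetical helper, not part of the package: the skew threshold beta
# for d mismatches and a given alpha.
unoise_beta <- function(d, alpha = 2) 2^(-(1 + alpha * d))

d <- 1:3
beta <- unoise_beta(d)          # 0.125, 0.03125, 0.0078125

# Largest abundance a sequence may have and still join a centroid
# with abundance 800 (an arbitrary illustrative value).
max_abundance <- 800 * beta     # 100, 25, 6.25
}
Under these assumptions, a sequence with one mismatch to the centroid joins
the cluster only if its abundance is at most 100, and one with two mismatches
only if its abundance is at most 25.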

The argument \code{minsize} will affect the total number of clusters,
specifying the minimum copy number required for any centroid. A larger value
means (in general) fewer clusters.

\code{fasta_input} can either be a file path to a FASTA file or a FASTA
object. FASTA objects are tibbles that contain the columns \code{Header} and
\code{Sequence}, see \code{\link[microseq]{readFasta}}.

The \code{Header} column \strong{must} contain the size (copy number) for
each read. The size information must have the format ";size=X",
where X is the count for the given sequence. This is obtained by running all
reads through \code{\link{vs_fastx_uniques}} with \code{sizeout = TRUE}.

You may use reads for a single sample or all reads from all samples as input.
In the latter case the \code{Header} must also contain sample information
in the format ";sample=xxx", where "xxx" is a unique sample identifier text.
Again, this is obtained by using \code{\link{vs_fastx_uniques}} on the reads
for each sample prior to this step. Use the \code{sample = "xxx"} argument,
where "xxx" is replaced with some unique text for each sample.
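
Before denoising, it may help to confirm that the headers carry these
annotations. The base R sketch below is one way to do so. The example headers
are made up for illustration; in practice they would be taken from the
\code{Header} column of the FASTA object.
\preformatted{
# Illustrative headers; the read names and counts are made up.
headers <- c("read1;sample=A;size=12",
             "read2;sample=B;size=3",
             "read3;size=5")

# Does each header contain size and sample information?
has_size   <- grepl(";size=[0-9]+", headers)    # TRUE TRUE TRUE
has_sample <- grepl(";sample=[^;]+", headers)   # TRUE TRUE FALSE

# Extract the copy numbers as integers
m <- regmatches(headers, regexpr("size=[0-9]+", headers))
sizes <- as.integer(sub("size=", "", m))        # 12 3 5
}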

If \code{log_file} is \code{NULL} and \code{otutabout} is specified,
clustering statistics from \code{VSEARCH} will not be captured.

\code{vsearch_options} allows users to pass additional command-line arguments
to \code{VSEARCH} that are not directly supported by this function. Refer to
the \code{VSEARCH} manual for more details.
}
\examples{
\dontrun{
# A small fasta file
fasta_input <- file.path(path.package("Rsearch"), "extdata", "small.fasta")

# Denoise sequences and return the read count table
denoised.tbl <- vs_cluster_unoise(fasta_input = fasta_input)
head(denoised.tbl)

# Extract clustering statistics
statistics <- attr(denoised.tbl, "statistics")

# Denoise sequences and write the read count table to a file
vs_cluster_unoise(fasta_input = fasta_input,
                  otutabout = "otutable.tsv")
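
# A sketch of how minsize affects the result, using only the arguments
# documented above. The values are illustrative, not recommendations.
strict.tbl <- vs_cluster_unoise(fasta_input = fasta_input,
                                minsize = 16,
                                unoise_alpha = 2,
                                threads = 2)

# A larger minsize should, in general, give fewer clusters (rows)
nrow(denoised.tbl)
nrow(strict.tbl)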
}

}
\references{
\url{https://github.com/torognes/vsearch}
}
