NAME
bup-split - save individual files to bup backup sets
SYNOPSIS
bup split [-t] [-c] [-n name] COMMON_OPTIONS
bup split -b COMMON_OPTIONS
bup split --copy COMMON_OPTIONS
bup split --noop [-t|-b] COMMON_OPTIONS

COMMON_OPTIONS
    [-r host:path] [-v] [-q] [-d seconds-since-epoch] [--bench]
    [--max-pack-size=bytes] [-#] [--bwlimit=bytes]
    [--max-pack-objects=n] [--fanout=count]
    [--keep-boundaries] [--git-ids | filenames...]
DESCRIPTION
bup split concatenates the contents of the given files (or if no
filenames are given, reads from stdin), splits the content into chunks
of around 8k using a rolling checksum algorithm, and saves the chunks
into a bup repository. Chunks which have previously been stored are not
stored again (i.e. they are 'deduplicated').

Because of the way the rolling checksum works, chunks tend to be very
stable across changes to a given file, including adding, deleting, and
changing bytes.
For example, if you use bup split
to back up an XML dump
of a database, and the XML file changes slightly from one run to the
next, nearly all the data will still be deduplicated and the size of
each backup after the first will typically be quite small.
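For instance, a sketch of two consecutive runs (dump.xml and the
dbdump branch name are placeholders):

$ bup split -n dbdump dump.xml   # first run stores all chunks
$ bup split -n dbdump dump.xml   # rerun after small edits; only changed chunks are stored

Each run adds a commit to the dbdump branch, so earlier versions of the
dump remain retrievable.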
Another technique is to pipe the output of the tar(1) or cpio(1)
programs to bup split. When individual files in the tarball change
slightly or are added or removed, bup still processes the remainder of
the tarball efficiently. (Note that bup save is usually a more
efficient way to accomplish this, however.)

To get the data back, use bup-join(1).
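For instance, a minimal local round trip (myfile.bin and mydata are
placeholder names):

$ bup split -n mydata myfile.bin   # store and name the dataset
$ bup join mydata > restored.bin   # reassemble the chunks
$ cmp myfile.bin restored.bin      # verify the contents match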
MODES
These options select the primary behavior of the command, with -n
being the most likely choice.
-n, --name=name
    after creating the dataset, create a git branch named name so that
    it can be accessed using that name. If name already exists, the
    new dataset will be considered a descendant of the old name.
    (Thus, you can continually create new datasets with the same name,
    and later view the history of that dataset to see how it has
    changed over time.) The original data will also be available as a
    top-level file named "data" in the VFS, accessible via bup fuse,
    bup ftp, etc.

-t, --tree
    output the git tree id of the resulting dataset.

-c, --commit
    output the git commit id of the resulting dataset.

-b, --blobs
    output a series of git blob ids that correspond to the chunks in
    the dataset. Incompatible with -n, -t, and -c.

--noop
    read the data and split it into blocks based on the "bupsplit"
    rolling checksum algorithm, but don't store anything in the repo.
    Can be combined with -b or -t to compute (but not store) the git
    blobs or tree ids for the dataset. This is mostly useful for
    benchmarking and validating the bupsplit algorithm (see the
    example following this list). Incompatible with -n and -c.

--copy
    like --noop, but also write the data to stdout. This can be useful
    for benchmarking the speed of read+bupsplit+write for large
    amounts of data. Incompatible with -n, -t, -c, and -b.
OPTIONS
-r, --remote=host:path
    save the backup set to the given remote server. If path is
    omitted, uses the default path on the remote server (you still
    need to include the ':'). The connection to the remote server is
    made with SSH. If you'd like to specify which port, user or
    private key to use for the SSH connection, we recommend you use
    the ~/.ssh/config file. Even though the destination is remote, a
    local bup repository is still required.

-d, --date=seconds-since-epoch
    specify the date inscribed in the commit (seconds since
    1970-01-01).

-q, --quiet
    disable progress messages.

-v, --verbose
    increase verbosity (can be used more than once).

--git-ids
    stdin is a list of git object ids instead of raw data. bup split
    will read the contents of each named git object (if it exists in
    the bup repository) and split it. This might be useful for
    converting a git repository with large binary files to use
    bup-style hashsplitting instead. This option is probably most
    useful when combined with --keep-boundaries.

--keep-boundaries
    if multiple filenames are given on the command line, they are
    normally concatenated together as if the content all came from a
    single file. That is, the set of blobs/trees produced is identical
    to what it would have been if there had been a single input file.
    However, if you use --keep-boundaries, each file is split
    separately. You still only get a single tree or commit or series
    of blobs, but each blob comes from only one of the files; the end
    of one of the input files always ends a blob.

--bench
    print benchmark timings to stderr.

--max-pack-size=bytes
    never create git packfiles larger than the given number of bytes.
    Default is 1 billion bytes. Usually there is no reason to change
    this.

--max-pack-objects=numobjs
    never create git packfiles with more than the given number of
    objects. Default is 200 thousand objects. Usually there is no
    reason to change this.

--fanout=numobjs
    when splitting very large files, try to keep the number of
    elements in trees to an average of numobjs.

--bwlimit=bytes/sec
    don't transmit more than bytes/sec bytes per second to the server.
    This is good for making your backups not suck up all your network
    bandwidth. Use a suffix like k, M, or G to specify multiples of
    1024, 1024*1024, and 1024*1024*1024, respectively.

-#, --compress=#
    set the compression level to # (a value from 0-9, where 9 is the
    highest and 0 is no compression). The default is 1 (fast, loose
    compression).
EXAMPLES
$ tar -cf - /etc | bup split -r myserver: -n mybackup-tar
tar: Removing leading `/' from member names
Indexing objects: 100% (196/196), done.
$ bup join -r myserver: mybackup-tar | tar -tf - | wc -l
1961
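The options above compose with such pipelines; for example, a
hypothetical variant of the same backup that limits upload bandwidth
and raises the compression level:

$ tar -cf - /etc | bup split -r myserver: -n mybackup-tar --bwlimit=1M -9

Or, splitting several local files while keeping per-file blob
boundaries (the log filenames are placeholders):

$ bup split -n logs --keep-boundaries log.1 log.2 log.3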
SEE ALSO
bup-join(1), bup-index(1), bup-save(1), bup-on(1), ssh_config(5)
BUP
Part of the bup(1) suite.