Sometimes we need to remove an email from our archives. It might be GDPR or CCPA related. Or
it might meet our archival removal requirements. Whatever the need, the instructions below will
help you remove an email.
Currently we need to edit on Minotaur, mbox-vm and either mailarchive-vm or mailprivate-vm,
so root access is needed on all those boxes.
A request should also be sent to ponee.io to remove the mail from their end.
Before we get into the act of what to edit and where, an example of how mod_mbox stores emails
into a file or files will help with the actual editing. One mbox file is stored per month and can
contain 1, 10 or 500 emails, so we need to determine the start and end of an individual email.
mod_mbox emails always begin with a line starting with the word 'From ' - it is important to note
that this is 'From' (always with a capital F) followed by a space. This form of 'From ' line contains important details
about the email itself, including at least the archived email number, the archive address and the date it was
'sent' by the sender. This marks the very beginning of an email entry, and this is where you start when
you want to remove an email. See example 1 below.
Further down, among the headers of the email, will be another 'From: ' - this one has a colon ':' immediately after it, and then a space.
This form of 'From: ' contains the name and email address of the sender. This is NOT the beginning of the email entry in
a mod_mbox file and for purposes of email removal, should be ignored. It is only useful in identifying the
email you want to remove. See example 2 below.
Example 1: From dev-return-12345-archiver=mbox-vm.apache.org@ Sat Aug 1 14:19:30 2020
This example shows that this email is in an mbox file belonging to 'firstname.lastname@example.org', and from the date we can
determine that the mbox file is named '202008'.
Example 2: From: Gavin McDonald <email@example.com>
This example shows the sender's name and email address.
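As a rough sketch of the distinction above, the two shell helpers below list only the 'From ' envelope lines of an mbox file and derive the monthly file name from one such line. The function names list_envelopes and envelope_month are made up for illustration, and the date layout is assumed to match example 1 (month name in the fourth field, year as the last field).

```shell
# Hypothetical helpers, not part of mod_mbox or any existing script.

# List each envelope line ('From' followed by a space, at the start of a
# line) with its line number. 'From: ' header lines do not match, because
# the pattern requires a space directly after 'From'.
list_envelopes() {
    grep -n '^From ' "$1"
}

# Derive the monthly file name (YYYYMM) from one envelope line.
# Assumes the layout shown in example 1: "From ADDRESS Sat Aug 1 14:19:30 2020".
envelope_month() {
    printf '%s\n' "$1" | awk '{
        # Position of the month name in this string gives the month number.
        m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $4) + 2) / 3
        printf "%s%02d\n", $NF, m
    }'
}
```

For instance, `list_envelopes 202008.mbox` gives one line per archived email, which is a quick way to see where each entry starts.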
Thankfully, finding the end of an email is easier than it first seems. Emails can end with multipart identifiers if the email was sent as multipart / HTML.
Emails can also end with just the end of the body, or with the quoted text of a previous email. But none of this matters; all you
need to do is find the start of the next email (its 'From ' line) and look up a couple of lines.
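The same idea can be sketched in shell: given part of the 'From ' envelope line, report the first and last line numbers of that email. The helper name message_range is hypothetical, and the sketch assumes every line starting with 'From ' really is an envelope line.

```shell
# Hypothetical helper: print "START END" for the first email whose
# envelope line contains TOKEN. END is the line just before the next
# envelope line, or the last line of the file for the final email.
# Usage: message_range TOKEN MBOXFILE
message_range() {
    awk -v token="$1" '
        /^From / {
            # A second envelope line closes the range we were tracking.
            if (start) { print start, NR - 1; start = 0; exit }
            if (index($0, token)) start = NR
        }
        # Reached end of file while still inside the matching email.
        END { if (start) print start, NR }
    ' "$2"
}
```

For example, `message_range dev-return-12345-archiver 202008.mbox` prints the two line numbers to jump between in your editor.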
Okay, down to business.
If editing the file for the current month, please take care not to lose any emails that arrive whilst you are editing.
The safest way to do this is to temporarily disable mail delivery, but that is not always possible.
An alternative is to reduce the file update window as much as possible.
For a simpler way to create an updated file, see 'Using the mboxsplit.pl script' below.
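One possible way to shrink the update window is to edit a copy and, at the last moment, append anything that was delivered since the copy was taken. The sketch below is only an illustration of that idea: the name swap_in_edited is made up, it assumes delivery only ever appends to the file, it does not preserve ownership or permissions, and a small window between the final append and the rename still remains.

```shell
# Hypothetical helper, assuming mail delivery only appends to LIVE.
# Usage: swap_in_edited LIVE SNAPSHOT EDITED
#   LIVE     - the mbox file that delivery keeps appending to
#   SNAPSHOT - untouched copy of LIVE taken before editing started
#   EDITED   - edited copy of SNAPSHOT (unwanted email removed)
swap_in_edited() {
    live=$1 snapshot=$2 edited=$3
    # Arithmetic expansion strips any padding that wc may add.
    snap_size=$(($(wc -c < "$snapshot")))
    # Refuse to continue unless LIVE still starts with the snapshot, byte for byte.
    if ! head -c "$snap_size" "$live" | cmp -s - "$snapshot"; then
        echo "live file no longer starts with the snapshot; aborting" >&2
        return 1
    fi
    # Append whatever arrived after the snapshot was taken, then swap.
    tail -c +"$((snap_size + 1))" "$live" >> "$edited"
    mv "$edited" "$live"
}
```

Check the owner and mode of the resulting file afterwards, since the rename replaces the original file rather than rewriting it.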
On mbox-vm, archives are located at /x1/archives, /x1/private and /x1/restricted. These hold the emails archived by mbox-vm, which started operating in Aug 2017.
Earlier mails were archived by minotaur; copies of these files are held under:
/x1/minotaur/apmail/private-arch/, /x1/minotaur/apmail/public-arch/, /x1/minotaur/apmail/restricted-arch
Note that most lists were archived by both minotaur and mbox-vm until minotaur was decommissioned.
Month end was handled differently by minotaur and mbox-vm, so emails received near the end of a month may appear in different months in the mbox-vm and minotaur archives.
Locate the archive file(s) you want to edit, for example '/x1/archives/httpd.apache.org/dev/202008.mbox'
Back it up so you have a copy to retreat to in case something goes wrong, then open it in Vim or your favourite editor, find the email via a quick search, then carefully delete it,
starting with 'From ' and ending just before the next 'From '. Save the file.
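For those who prefer not to edit by hand, the backup-then-delete step might be scripted roughly as follows. This is a sketch, not an existing tool: remove_message is a hypothetical name, the email is identified by a fragment of its 'From ' envelope line, and any line starting with 'From ' is treated as the start of the next email.

```shell
# Hypothetical helper: back up MBOX to MBOX.bak, then rewrite MBOX
# without the first email whose envelope line contains TOKEN.
# Usage: remove_message TOKEN MBOX
remove_message() {
    token=$1 mbox=$2
    cp -p "$mbox" "$mbox.bak" || return 1
    awk -v token="$token" '
        # At every envelope line decide whether the email that starts
        # here is the one to drop; only the first match is dropped.
        /^From / { skip = (!done && index($0, token)) ? 1 : 0; if (skip) done = 1 }
        !skip
    ' "$mbox.bak" > "$mbox.tmp" &&
        # Write back in place so the file keeps its owner and permissions.
        cat "$mbox.tmp" > "$mbox" &&
        rm -f "$mbox.tmp"
}
```

A call such as `remove_message dev-return-12345-archiver 202008.mbox` leaves the original in 202008.mbox.bak for comparison.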
Warning: in earlier mails archived by minotaur, a 'From ' line which appeared in the body of an email was not correctly escaped, so can be matched by mistake.
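A quick heuristic check for such unescaped lines could look like the sketch below. It flags any 'From ' line that does not have the shape of the envelope line in example 1, or that is not preceded by a blank line. Both rules are assumptions about how the archives are laid out, the name suspect_from_lines is hypothetical, and the output is only a list of lines to inspect by eye.

```shell
# Hypothetical helper: print "LINENO: text" for 'From ' lines that are
# probably body text rather than real envelope lines.
# Usage: suspect_from_lines MBOXFILE
suspect_from_lines() {
    awk '
        /^From / {
            # Assumed envelope shape: From ADDRESS Day Mon DD HH:MM:SS YYYY
            envelope = ($0 ~ /^From [^ ]+ +[A-Z][a-z][a-z] [A-Z][a-z][a-z] +[0-9]+ [0-9:]+ [0-9][0-9][0-9][0-9]$/)
            # A real envelope line should follow a blank line (or be line 1).
            if (!envelope || (NR > 1 && prev != "")) print NR ": " $0
        }
        { prev = $0 }
    ' "$1"
}
```

An empty result does not prove the file is clean, so still check the lines around the email you remove.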
On mailarchive-vm, archives are located at /x1/mail-archives.apache.org/mod_mbox/
This is where things differ from mbox-vm. We still find and edit the file, but this time we
have a bit of extra processing to do.
On mailarchive-vm, due to the integration with mod_mbox on Apache2, we have extra files. Example:
And the process is:-
We are done. The email has disappeared from https://mail-archives.apache.org.
On mailprivate-vm, archives are located at /x1/mail-private.apache.org/mod_mbox/
The process for mailprivate-vm is more or less the same as for mailarchive-vm. The main difference is that there
are different scripts to run at '/x1/mail-private' for updating the site index.
'/etc/apache2/bin/mod-mbox-util' is in the same place.
All the above assumes you are editing the 'current' month's file - i.e., the file has not been zipped up.
Older months will need to be extracted for editing and then rezipped.
mailarchive-vm has a script '/home/modmbox/scripts/xtract-gz.sh' which may help.
mailprivate-vm has '/home/modmbox/scripts/create-gzip-from-mbox.sh' which may also help.
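If those scripts are not to hand, the extract, edit and rezip cycle can be approximated with plain gzip. The following is a sketch only: edit_gzipped is a hypothetical name, the '.gz' suffix and '.work' file name are assumptions, and it does not perform any re-indexing that the host may need afterwards.

```shell
# Hypothetical helper: decompress ARCHIVE.gz to a work file, run COMMAND
# with the work file as its last argument, then recompress.
# The original is kept as ARCHIVE.gz.bak.
# Usage: edit_gzipped ARCHIVE.gz COMMAND [ARGS...]
edit_gzipped() {
    gz=$1; shift
    work=${gz%.gz}.work
    cp -p "$gz" "$gz.bak" || return 1
    gzip -dc "$gz" > "$work" || return 1
    # Run the editor (or any other command) on the uncompressed copy.
    "$@" "$work" || return 1
    gzip -c "$work" > "$gz" && rm -f "$work"
}
```

For example, `edit_gzipped 201501.gz vim` opens the uncompressed month in Vim and rezips it when the editor exits (the file name here is illustrative).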
There is a script that can be used to split a mailbox. It can handle gzipped files as input. The script is at:
There are several ways to use it:
$ mboxsplit.pl 202008 # split into one file per email, named Xaaa, Xaab etc
$ mboxsplit.pl 202008 WORK/X # split into one file per email, named WORK/Xaaa, WORK/Xaab etc
Find the X-files that contain the email(s) to be removed and delete them.
Create the updated mbox file by concatenation:
$ cat WORK/X??? > 202008.new
If you know the details of the 'From ' header line, you can split the file into 3 as follows:
$ mboxsplit.pl -X dev-return-12345-archiver 202008 # split into 3 files
Xaaa - initial emails
Xaab - the matching email
Xaac - the rest of the emails
N.B. as a check that the split worked OK, one can compare the files:
$ cmp <(cat X???) 202008
Double-check that the file Xaab contains the unwanted mail, and concatenate the others:
$ cat Xaa[ac] > 202008.new
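As an extra sanity check after concatenation, one could confirm that the new file holds exactly one email fewer than the original by counting 'From ' envelope lines. The helper names count_messages and one_fewer are hypothetical, and the count is only as reliable as the 'From ' escaping in the file.

```shell
# Hypothetical helpers for a quick before/after comparison.

# Count emails by counting envelope lines.
count_messages() {
    grep -c '^From ' "$1"
}

# Succeed only if NEW holds exactly one email fewer than OLD.
# Usage: one_fewer OLD NEW
one_fewer() {
    [ "$(count_messages "$2")" -eq "$(( $(count_messages "$1") - 1 ))" ]
}
```

For example: `one_fewer 202008 202008.new && echo "count looks right"`.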
After you have double-checked that you actually deleted the correct email and that you did not delete anything else, and that everything is as it should be, you can delete the backup files you created at the start of the deletion process for each archive.
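Before deleting a backup, a diff against it is one way to confirm that the edited file differs only by removed lines and to see which email went. The sketch below assumes plain diff output; only_deletions and removed_envelopes are hypothetical names.

```shell
# Hypothetical helpers comparing a backup with the edited file.

# Succeed if NEW differs from BACKUP only by deleted lines.
# Normal diff output marks lines present only in NEW with "> ".
# Usage: only_deletions BACKUP NEW
only_deletions() {
    ! diff "$1" "$2" | grep -q '^> '
}

# Print the envelope lines of the emails that were removed.
# Usage: removed_envelopes BACKUP NEW
removed_envelopes() {
    diff "$1" "$2" | grep '^< From ' | sed 's/^< //'
}
```

If only_deletions fails, or removed_envelopes lists more than the email you meant to remove, keep the backup and investigate.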
There is a script on mailarchive-vm - details on this wiki page - but it did not work for me, and looks like it does not do the follow up work needed with re-indexing (and the needed workaround to the .msgsum), so I stopped pursuit of that script.