Introduction

Sometimes we need to remove an email from our archives, for example because of a GDPR or CCPA
request, or because it meets our archival removal requirements. Whatever the reason, the
instructions below will help you remove it.

Access needed

Currently we need to edit on Minotaur, mbox-vm and either mailarchive-vm or mailprivate-vm,
so root access is needed on all those boxes.

A request should also be sent to ponee.io to remove the mail from their end.

Locating an Email in an archive file

Before looking at what to edit and where, it helps to understand how mod_mbox stores emails
in a file or files. One mbox file is stored per month and can contain 1, 10 or 500 emails,
so we need to determine where an individual email starts and ends.

Find the very beginning of an email

mod_mbox emails always begin with a line starting with the word 'From ' - note that this is
'From' (always with a capital F) followed by a space. This 'From ' line contains important details
about the email itself, including at least the archived email number, the archive address and the date it was
'sent' by the sender. It marks the very beginning of an email entry, and this is where you start when
you want to remove an email. See example 1 below.

Further down, in the headers of the email, is another 'From: ' - this one has a colon ':' immediately after it, and then a space.
This 'From: ' line contains the name and email address of the sender. It is NOT the beginning of the email entry in
a mod_mbox file and, for the purposes of email removal, should be ignored. It is only useful for identifying the
email you want to remove. See example 2 below.

Example 1: From dev-return-12345-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:19:30 2020

This example shows that this email is in an mbox file belonging to 'dev@project.apache.org', and from the date we can
determine that the mbox file name would be '202008'.
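
As a rough illustration of how the file name follows from the date, the sketch below derives the YYYYMM name from a 'From ' line shaped like Example 1. It assumes the line always ends with weekday, month, day, time and year in that order; the address is illustrative only.

```shell
# A 'From ' line in the shape of Example 1 (the address is illustrative).
from_line='From dev-return-12345-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:19:30 2020'

# After default whitespace splitting: $4 is the month name, $7 is the year.
mbox_name=$(printf '%s\n' "$from_line" | awk '{
    # Position of the month in the string, converted to a number from 1 to 12.
    m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $4) + 2) / 3
    printf "%s%02d\n", $7, m
}')

echo "$mbox_name"
```

Remember that month end was not handled identically by every archiver, so for mails sent close to midnight on the last day of a month it is worth checking the neighbouring file too.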

Example 2: From: Gavin McDonald <gmcdonald@apache.org>

This example shows the sender's name and email address.
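
To see the difference between the two forms in practice, the sketch below builds a small sample mbox file in a temporary directory and lists both kinds of line with grep. The sample content and addresses are made up for illustration; on a real host you would point grep at the archive file instead.

```shell
tmp=$(mktemp -d)
cat > "$tmp/202008" <<'EOF'
From dev-return-12344-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:00:00 2020
From: Alice Example <alice@example.org>

Hello.

From dev-return-12345-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:19:30 2020
From: Bob Example <bob@example.org>

Remove me.
EOF

# 'From ' (capital F, then a space): marks where each email starts.
grep -n '^From ' "$tmp/202008"

# 'From: ' (with a colon): the sender header, useful for identification only.
grep -n '^From: ' "$tmp/202008"

envelope_count=$(grep -c '^From ' "$tmp/202008")
header_count=$(grep -c '^From: ' "$tmp/202008")
```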

Find the very end of an email

Thankfully, this is easier than it first seems. Emails can end with multipart identifiers if the email was sent as multipart / HTML.
Emails can also end with just the end of the body, or with the quoted text of a previous email. But none of this matters; all you
need to do is find the start of the next email (its 'From ' line): the current email ends on the line just before it.
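
If you would rather not count lines by eye, the start and end can be computed. The sketch below is one possible way, assuming you know the 'dev-return-NNNNN-archiver' part of the 'From ' line; the sample file and the id are hypothetical.

```shell
tmp=$(mktemp -d)
cat > "$tmp/202008" <<'EOF'
From dev-return-12344-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:00:00 2020
From: Alice Example <alice@example.org>

Hello.

From dev-return-12345-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:19:30 2020
From: Bob Example <bob@example.org>

Remove me.

From dev-return-12346-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 15:00:00 2020
From: Carol Example <carol@example.org>

Bye.
EOF

# Print "start-end" line numbers for the email whose 'From ' line begins with the id.
range=$(awk -v id='dev-return-12345-archiver' '
    /^From / {
        # The next "From " line after the match: the email ended one line earlier.
        if (start) { print start "-" (NR - 1); done = 1; exit }
        if (index($2, id) == 1) start = NR
    }
    # The matching email was the last one in the file.
    END { if (start && !done) print start "-" NR }
' "$tmp/202008")

echo "$range"
```

The range can then be used to jump to the right place in your editor before deleting.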

Editing an archive file - aka removing an email from the archives

Okay, down to business.

If editing the file for the current month, please take care not to lose any emails that arrive whilst you are editing.

The safest way to do this is to temporarily disable mail delivery, but that is not always possible.

An alternative is to reduce the file update window as much as possible.

For example:

  • copy the current file, e.g. cp 202008 202008.save
  • copy the saved file, e.g. cp 202008.save 202008.work
  • edit the work file to remove the mail(s)
  • compare the current file with the saved copy. If they are identical, it's OK to replace the file, i.e.:
    cmp 202008 202008.save && mv 202008.work 202008
  • If the compare fails, start again.
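
Put together, the steps above might look like the sketch below. It runs against a throwaway file in a temporary directory, and the sed command is only a stand-in for the manual edit; the file contents are made up.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'From a@example.org  Sat Aug  1 14:00:00 2020\nkeep\n\nFrom b@example.org  Sat Aug  1 14:19:30 2020\nremove\n' > 202008

cp 202008 202008.save          # snapshot of the live file
cp 202008.save 202008.work     # working copy to edit

# Stand-in for the manual edit: drop lines 4-5 (the second email).
sed '4,5d' 202008.work > 202008.edited && mv 202008.edited 202008.work

# Replace the live file only if nothing was delivered since the snapshot.
if cmp -s 202008 202008.save; then
    mv 202008.work 202008
    status=replaced
else
    status=retry               # new mail arrived: start again from the first cp
fi
echo "$status"
```

There is still a small window between the cmp and the mv, which is why disabling delivery remains the safest option when it is possible.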

For a simpler way to create an updated file, see 'Using the mboxsplit.pl script' below.

mbox-vm

Archives are located at /x1/archives, /x1/private, /x1/restricted. These hold the emails archived by the host, which started operating in Aug 2017.

Earlier mails were archived by minotaur; copies of these files are held under:

/x1/minotaur/apmail/private-arch/, /x1/minotaur/apmail/public-arch/, /x1/minotaur/apmail/restricted-arch

Note that most lists were archived by both minotaur and mbox-vm until minotaur was decommissioned.

Month end was handled differently by minotaur and mbox-vm, so emails received near the end of a month may appear in different months in the mbox-vm and minotaur archives.

Locate the archive file(s) you want to edit, for example '/x1/archives/httpd.apache.org/dev/202008.mbox'
Back it up so you have a copy to fall back on in case something goes wrong, then open it in Vim or your favourite editor, find the email via a quick search, then carefully delete it,
starting with 'From ' and ending just before the next 'From '. Save the file.
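
As an alternative to deleting by hand in an editor, the same removal can be sketched with awk. This is not an existing tool on the hosts, just an illustration: it assumes you know the 'dev-return-NNNNN-archiver' id, works on a sample file, and writes the result to a new file rather than touching the original.

```shell
tmp=$(mktemp -d)
mbox="$tmp/202008"
cat > "$mbox" <<'EOF'
From dev-return-12344-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:00:00 2020
From: Alice Example <alice@example.org>

Hello.

From dev-return-12345-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 14:19:30 2020
From: Bob Example <bob@example.org>

Remove me.

From dev-return-12346-archiver=mbox-vm.apache.org@project.apache.org  Sat Aug  1 15:00:00 2020
From: Carol Example <carol@example.org>

Bye.
EOF

cp -p "$mbox" "$mbox.bak"      # backup to fall back on

# Copy every line except those of the email whose 'From ' line starts with the id.
awk -v id='dev-return-12345-archiver' '
    /^From / { skip = (index($2, id) == 1) }   # decide at each email boundary
    !skip                                      # print lines of emails we keep
' "$mbox.bak" > "$mbox.new"

before=$(grep -c '^From ' "$mbox.bak")
after=$(grep -c '^From ' "$mbox.new")
echo "emails before: $before, after: $after"
```

Exactly one fewer 'From ' line in the new file is a quick sanity check before it replaces the original.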

Warning: in earlier mails archived by minotaur, a 'From ' line which appeared in the body of an email was not correctly escaped, so it can be mistaken for the start of an email.
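
One way to spot such lines before editing is sketched below. It is only a heuristic: it assumes a genuine 'From ' line ends with a time and a four-digit year, as in Example 1, and flags any other line that starts with 'From '. The sample file is made up.

```shell
tmp=$(mktemp -d)
cat > "$tmp/201601" <<'EOF'
From dev-return-100-archiver@project.apache.org  Fri Jan  1 10:00:00 2016
From: Alice Example <alice@example.org>

From what I can tell, this works.

From dev-return-101-archiver@project.apache.org  Fri Jan  1 11:00:00 2016
From: Bob Example <bob@example.org>

Thanks.
EOF

# List 'From ' lines that do NOT end in "HH:MM:SS YYYY"; these are probably body text.
suspects=$(grep -n '^From ' "$tmp/201601" | grep -Ev ' [0-9]{2}:[0-9]{2}:[0-9]{2} [0-9]{4}$')
echo "$suspects"
```

Anything it reports should be checked by eye; an empty result does not prove that the file is clean.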

mailarchive-vm

Archives are located at /x1/mail-archives.apache.org/mod_mbox/

This is where things differ from mbox-vm. We still find and edit the file, but this time we 
have a bit of extra processing to do.

On mailarchive-vm, due to the integration with mod_mbox on Apache2, we have extra files. Example:

  • 202008 is the mbox file, exactly the same as on mino and mbox-vm.
  • 202008.mbox is a symlink to the above file. The extension is required for Apache2 and for indexing.
  • 202008.mbox.msgsum is the index file, generated by the mod-mbox-util script.

And the process is:

  • Edit the 202008 file, in the same manner as for mino and mbox-vm.
  • Delete the 202008.mbox.msgsum file (due to a bug, see note 2 below).
  • Run '/etc/apache2/bin/mod-mbox-util -vc .' to update the cache and reindex. This re-creates the 202008.mbox.msgsum file.
  • Run '/x1/mod_mbox/scripts/update-index' - this corrects the message count display. It takes around 10 minutes to run, so you could instead leave it for the next cron run.
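
The post-edit steps could be collected into a small wrapper like the sketch below. It is a dry run by default and only prints what it would do. The 'month' value is a placeholder, and running from the directory that holds the edited file is an assumption based on the '.' argument.

```shell
# Dry run unless RUN=1 is set. On mailarchive-vm, run from the list's directory.
RUN=${RUN:-0}
month=202008                   # placeholder: the month file that was edited

run() {
    if [ "$RUN" = 1 ]; then "$@"; else echo "would run: $*"; fi
}

plan=$(
    run rm -f "$month.mbox.msgsum"            # workaround for the bug in note 2
    run /etc/apache2/bin/mod-mbox-util -vc .  # re-creates the .msgsum index
    run /x1/mod_mbox/scripts/update-index     # fixes message counts (cron also runs this)
)
echo "$plan"
```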

We are done. The email has disappeared from https://mail-archives.apache.org.

mailprivate-vm

Archives are located at /x1/mail-private.apache.org/mod_mbox/

The process for mailprivate-vm is more or less the same as for mailarchive-vm. The main difference is that there
are different scripts to run at '/x1/mail-private' for updating the site index.
'/etc/apache2/bin/mod-mbox-util' is in the same place.

Editing mbox files older than current month

All the above assumes you are editing the 'current' month's file - i.e., the file has not been zipped up.

Older months will need to be extracted for editing and then rezipped.

mailarchive-vm has a script '/home/modmbox/scripts/xtract-gz.sh' which may help.

mailprivate-vm also has '/home/modmbox/scripts/create-gzip-from-mbox.sh' which will also help.
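
If the helper scripts are not suitable, the extract, edit and rezip cycle can be sketched with plain gzip as below. The file name '201908.gz' is hypothetical (check how zipped months are actually named on the host), the sed command stands in for the manual edit, and this does not cover anything extra the scripts above may do.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'From a@example.org  Thu Aug  1 14:00:00 2019\nkeep\n\nFrom b@example.org  Thu Aug  1 14:19:30 2019\nremove\n' > 201908
gzip 201908                            # simulate an already-zipped month: 201908.gz

cp -p 201908.gz 201908.gz.bak          # backup before touching anything
gzip -dc 201908.gz > 201908.work       # extract to a working copy

# Stand-in for the manual edit: drop the second email (lines 4-5).
sed '4,5d' 201908.work > 201908.edited

gzip -c 201908.edited > 201908.gz.new  # rezip the edited copy
gzip -t 201908.gz.new && mv 201908.gz.new 201908.gz   # replace only if the archive is valid

remaining=$(gzip -dc 201908.gz | grep -c '^From ')
echo "emails remaining: $remaining"
```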

Using the mboxsplit.pl script

There is a script that can be used to split a mailbox. It can handle gzipped files as input. The script is at:

https://svn.apache.org/repos/infra/infrastructure/trunk/tools/mbox/mboxsplit.pl

There are several ways to use it:

$ mboxsplit.pl 202008 # split into one file per email, named Xaaa, Xaab etc

$ mboxsplit.pl 202008 WORK/X # split into one file per email, named WORK/Xaaa, WORK/Xaab etc

Find the X-files that contain the email(s) to be removed and delete them.

Create the updated mbox file by concatenation:

$ cat WORK/X??? > 202008.new

If you know the details of the 'From ' header line, you can split the file into 3 as follows:

$ mboxsplit.pl -X dev-return-12345-archiver 202008 # split into 3 files

Xaaa - initial emails

Xaab - the matching email

Xaac - the rest of the emails

N.B. as a check that the split worked OK, one can compare the files:

$ cmp <(cat X???) 202008

Double-check that the file Xaab contains the unwanted mail, and concatenate the others:

$ cat Xaa[ac] > 202008.new

Once all is as you want it

Once you have double-checked that you deleted the correct email and nothing else, you can delete the backup files you created at the start of the deletion process for each archive.
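
One possible way to do that double-check is to diff the backup against the edited file: everything reported should be a deletion, and exactly one 'From ' line should be among the deleted lines. The sketch below uses made-up sample files in a temporary directory.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'From a@example.org  Sat Aug  1 14:00:00 2020\nkeep\n\nFrom b@example.org  Sat Aug  1 14:19:30 2020\nremove\n\nFrom c@example.org  Sat Aug  1 15:00:00 2020\nkeep too\n' > 202008.save
sed '4,6d' 202008.save > 202008        # stand-in for the edit that removed one email

# Lines only in the backup, i.e. what was deleted.
removed=$(diff 202008.save 202008 | grep '^< ' | sed 's/^< //')
# Lines only in the edited file; there should be none.
added=$(diff 202008.save 202008 | grep -c '^> ')
removed_froms=$(printf '%s\n' "$removed" | grep -c '^From ')

echo "emails removed: $removed_froms, lines added: $added"
```

Reading through the removed lines is still worthwhile, to confirm that it really was the email in question.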

Notes

  1. Edit mailarchive-vm/mailprivate-vm manually as soon as you can after editing mbox-vm.
  2. A bug in mod_mbox exists where the .msgsum file needs to be deleted before running mod-mbox-util - not doing so 
    results in the email entry remaining in the display of the archives even though the email no longer exists. See also: https://bz.apache.org/bugzilla/show_bug.cgi?id=62430 
  3. Other archive services mirror our own archives. The deletion should be noted to the requester for follow-up by them.
    Markmail has a form we can fill out ( https://markmail.org/docs/removal-policy.xqy ) - we can choose 'Original Content Removed' as the reason.

For the future

There is a script on mailarchive-vm - details on this wiki page - but it did not work for me, and it does not appear to do the follow-up work needed for re-indexing (including the .msgsum workaround), so I stopped pursuing that script.