TUTORIAL FOR EMIL VERSION 2.0.4

Written by Martin Wendel, UDAC. Martin.Wendel@udac.uu.se

Note: Emil converts Internet messages that conform to rfc822. The term "message" below therefore refers to an rfc822 conforming message. Emil was developed for Swedish conditions, but will most likely be useful outside Sweden as well.

WHY USE MESSAGE CONVERSION?

The major motive for using message conversion is the need to convert the character set of a message text. The reason is that character sets in Internet messages (at least outside the English-speaking countries) are a mess. When the characters in US-ASCII do not fully cover the national alphabet, the character set really matters when you receive a message text.

The second motive is encoding. The problem is similar to that of character sets. You normally do not support more than one or a few encoding standards on your local mail agent. Therefore it is a big gain in usability if you can convert messages to use the encodings your mail agent supports.

The third motive is message format. The most commonly used formats for Internet messages in Sweden are plain old-style rfc822, Sun Mailtool, and MIME. None of these formats is suited to interpreting any of the others.

THE EMIL WAY OF MESSAGE CONVERSION.

To understand message conversion it is important to understand the components of a message and their types. Basically, a message consists of a header and a body. The format of the header is thoroughly defined in rfc822 and rfc1522. It consists of lines of text, each divided into two parts, the field and the value, separated by a colon ':'. Apart from a few details the header is very straightforward. The body, on the other hand, is somewhat trickier to understand.
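
The field/value structure of the header can be illustrated with a short Python sketch. It is not Emil's own parser; the function name parse_header and the sample addresses are made up for illustration, and it assumes that a line starting with a space or tab continues the previous field.

```python
def parse_header(raw):
    """Split a header block into (field, value) pairs."""
    fields = []
    for line in raw.splitlines():
        if not line.strip():
            break  # an empty line ends the header
        if line[0] in " \t" and fields:
            # Continuation line: fold it into the previous value.
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a field line, skipped in this sketch
        fields.append((name.strip(), value.strip()))
    return fields


sample = (
    "From: sender@example.org\n"
    "Subject: A folded\n"
    "  subject line\n"
    "Content-Type: text/plain; charset=ISO-8859-1\n"
    "\n"
    "Body text\n"
)
for name, value in parse_header(sample):
    print(name, "=", value)
```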

The body of a message consists of text and/or encoded data, where the data may itself be text. Emil recognizes the encodings BinHex, Base64, Quoted-Printable and UUencode. Text is any part of the message that is not recognized as an encoded enclosure. Text is a raw format, whereas encoded data follows the syntax of its encoding. This means that incorrect or incomplete encodings are not recognized and are thus treated as raw text (as a safety measure).

Emil works in two passes. After loading the message, it first interprets the header and body of the message; only then does it apply any conversions. In the first pass the format is recognized by interpreting the header of the message, and the data of the message is then organized into a hierarchical message structure.

UUencoded and BinHexed enclosures are recognized by their preamble, and a syntax check is made on them to make reasonably sure they are valid. When Emil receives a MIME or Mailtool message, encoded enclosures are defined in the header. Emil recognizes these definitions and trusts them to be valid; no further processing is done initially. The remaining body parts, together with any erroneous BinHex and UUencode parts, are treated as text. The text is checked for 7bit or 8bit encoding.
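
As a rough illustration of recognition by preamble followed by a syntax check, the Python sketch below looks for a UUencode "begin" line and validates every line up to "end". The exact checks Emil performs are not described here, so the rules in valid_uu_line (character range and declared line length) are an assumption, as are the function names. A block that fails the check is reported as not found, which corresponds to treating it as text.

```python
import binascii
import re

# Preamble of a UUencoded enclosure: "begin <mode> <file name>".
BEGIN_RE = re.compile(r"^begin [0-7]{3,4} \S+")


def valid_uu_line(line):
    """Strict check of one data line (assumed rules, not Emil's)."""
    if not line:
        return False
    if any(not 32 <= ord(ch) <= 96 for ch in line):
        return False
    nbytes = (ord(line[0]) - 32) & 0x3F   # first character holds the byte count
    needed = 4 * ((nbytes + 2) // 3)      # 4 characters per 3 bytes
    return len(line) - 1 >= needed


def find_uuencode(lines):
    """Return (start, end) indexes of the first valid enclosure, or None."""
    for start, line in enumerate(lines):
        if not BEGIN_RE.match(line):
            continue
        for end in range(start + 1, len(lines)):
            if lines[end].strip() == "end":
                return (start, end)
            if not valid_uu_line(lines[end]):
                break  # syntax check failed: leave the block as text
    return None


payload = binascii.b2a_uu(b"Hello, Emil!").decode("ascii").rstrip("\n")
good = ["Some text", "begin 644 hello.txt", payload, "`", "end", "More text"]
bad = ["begin 644 broken.txt", "this is not uuencoded data", "end"]
print(find_uuencode(good))
print(find_uuencode(bad))
```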

Emil has a hierarchical representation of the message, and each part is typed by its encoding. It may be that the trusted encodings of a MIME or Mailtool message are erroneous as well. However, the definitions in the header are strong evidence that they are correct. In the case of an error the message part is left untouched, but is not treated as text.

Conversion is driven by the specification of the target format. Emil has a clear view of the incoming message and can work in a straightforward way, traversing the hierarchical structure. Each object in the structure contains headers, data, and pointers to other objects. When applying conversion, the data of each object is first converted to what is specified in the target format; then the headers of each object are rewritten based on the resulting type and encoding of the object's data.
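
The object structure and the order of the two conversion steps can be sketched in Python as follows. The class name Part, its attribute names and the callbacks are hypothetical and do not reflect Emil's internal data types; the sketch only shows that all data is converted before any header is touched.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One object in the message structure (names are illustrative)."""
    headers: dict
    data: bytes = b""
    encoding: str = "7bit"
    children: list = field(default_factory=list)


def walk(part):
    """Yield the part itself, then all of its descendants, depth first."""
    yield part
    for child in part.children:
        yield from walk(child)


def convert(root, convert_data, convert_headers):
    # Step 1: convert the data of every object.
    for part in walk(root):
        convert_data(part)
    # Step 2: rewrite headers, based on the resulting type and encoding.
    for part in walk(root):
        convert_headers(part)


root = Part({"Content-Type": "multipart/mixed"}, children=[
    Part({"Content-Type": "text/plain"}, b"text", "quoted-printable"),
    Part({"Content-Type": "image/gif"}, b"gif", "base64"),
])
log = []
convert(root,
        lambda p: log.append(("data", p.headers["Content-Type"])),
        lambda p: log.append(("headers", p.headers["Content-Type"])))
for step in log:
    print(step)
```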

For example: The target format specifies Sun Mailtool, UUencoded attachments and ISO-646-SE text. A MIME message containing a Quoted-Printable encoded text/plain using ISO-8859-1 and a Base64 encoded Image/GIF will be converted as follows (this is a fairly long and detailed description):

The entire message is loaded into memory.
The header of the message is examined and loaded into a structure.
This yields format MIME and type Multipart/Mixed. The boundary is saved.
The start of the body of the message is marked up at the end of the header.
Find the end boundary, the end of the message is marked just before the end boundary. The root object in the message structure is completed.
Start off a child object.
We've got boundaries, go find the first boundary.
Examine the header of the first body part.
This yields type=Text/plain, charset=ISO-8859-1 and encoding=Quoted-Printable.
Find the second boundary, terminating the first body part and initiating the second. The first child object is completed.
Start off a sister object to the previous child object.
Examine the header of the second body part.
This yields type=Image/GIF, encoding=Base64.
There are no more boundaries (the end boundary was detected when the root object was completed, and the end of data as seen by the second bodypart is marked just before the end boundary), terminating the second bodypart.
End of data is reached, message parsing is completed. We've got a hierarchical message structure describing the incoming message.
Apply conversion.
Parse the message structure, converting the data.
First object (root object) is multipart, no data to be converted.
Second object (first child) is a text with charset=ISO-8859-1 and encoding quoted-printable.
Target format does not want quoted-printable and does not want charset=ISO-8859-1. Thus, decode quoted-printable.
This yields an 8bit text with charset=ISO-8859-1.
Target format does not want charset=ISO-8859-1. Convert charset to ISO-646-SE.
This yields a 7bit text with charset=ISO-646-SE.
Target format wants charset=ISO-646-SE, conversion on this data is completed.
Third object (second child) is a GIF encoded in Base64.
Target format does not want Base64, decode Base64.
This yields a GIF with encoding=binary.
Target format does not want binary, encode UUencode.
This yields a GIF with encoding=UUencode.
Target format wants UUencode, conversion on this data is completed.
All data is converted according to the target format. Start converting headers.
Remove any MIME-specific headers in the root header. This includes MIME-Version and Content-Type.
The root object is a multipart, add header Content-Type X-Sun-Encoding. Also add the Mailtool boundary to the boundary string.
The first bodypart is a text. Add the Mailtool headers. Also add header Content-Lines: number of lines.
The second bodypart is a GIF. Add the Mailtool headers. Also add header Content-Lines: number of lines.
The message conversion is completed. Output the message.
Print the root header.
Print the Mailtool boundary.
Print the header of the first child.
Print the body of the first child.
Print the Mailtool boundary.
Print the header of the second child.
Print the body of the second child.
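
The data conversions in this example (decode Quoted-Printable, convert the charset, decode Base64, encode UUencode) can be reproduced with Python's standard library. This is only a sketch of the same steps, not Emil's code. The ISO-646-SE table below covers just the six Swedish letters and replaces any other 8bit character with '?', which is a simplification; the function names and the file name picture.gif are made up.

```python
import base64
import binascii
import quopri

# ISO-8859-1 to ISO-646-SE for the Swedish letters only:
# A-umlaut, O-umlaut, A-ring (upper case), then the same in lower case.
SE_MAP = str.maketrans("\xc4\xd6\xc5\xe4\xf6\xe5", "[\\]{|}")


def qp_to_iso646_se(qp_bytes):
    """Decode Quoted-Printable, then convert ISO-8859-1 to 7bit ISO-646-SE."""
    text = quopri.decodestring(qp_bytes).decode("iso-8859-1")
    return text.translate(SE_MAP).encode("ascii", errors="replace")


def base64_to_uuencode(b64_bytes, name, mode="644"):
    """Decode Base64 to binary, then encode the binary data as UUencode."""
    raw = base64.b64decode(b64_bytes)
    lines = ["begin %s %s" % (mode, name)]
    for i in range(0, len(raw), 45):  # b2a_uu takes at most 45 bytes per line
        chunk = binascii.b2a_uu(raw[i:i + 45])
        lines.append(chunk.decode("ascii").rstrip("\n"))
    lines.append("`")   # zero-length line closing the data
    lines.append("end")
    return "\n".join(lines) + "\n"


print(qp_to_iso646_se(b"R=E4ksm=F6rg=E5s"))
gif = b"GIF89a" + bytes(range(100))
print(base64_to_uuencode(base64.b64encode(gif), "picture.gif"))
```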

CONTROLLING CONVERSION.

There are two ways of controlling conversion in Emil: command line arguments or the configuration file. When using the configuration file you need to specify the mail addresses of the sender and recipient and the name of the recipient host. These are supplied by sendmail if you are using Emil with sendmail.

Command line arguments.

Address specification arguments

-r <address> The address of the recipient.

-s <address> The address of the sender.

-x <hostname> The name of the recipient host.

The address specifications are used when you want to redirect output to a program such as /bin/mail or sendmail, in which case they are passed as arguments to the execv of that program. They are also used to look up sender and recipient information such as format and encoding. Emil does not rewrite addresses, so be careful with the address format you use.

Format, Encodings and Charsets

-B <encoding> The binary encoding style.

-C <charset> The target charset.

-F <format> The target format.

-H <encoding> The header encoding style.

-S <charset> The sender's charset.

-T <encoding> The text encoding style.

It is possible to explicitly specify the format, encoding or charset of the target message on the command line. If this is done, the lookup for this information in the configuration file is disabled. Therefore, make sure to give a complete specification of the format and encoding features using these arguments; if one of them is omitted, that feature is set to its default value.
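
One way to avoid an accidentally incomplete specification is to build the argument list in a small wrapper that refuses to omit any of the five target flags. The Python sketch below is such a wrapper. The function name emil_args, the executable name "emil" and the example values are assumptions; only the flag letters come from the list above.

```python
def emil_args(fmt, charset, text_enc, bin_enc, header_enc):
    """Build a complete -F/-C/-T/-B/-H specification for Emil."""
    spec = {
        "-F": fmt,         # target format
        "-C": charset,     # target charset
        "-T": text_enc,    # text encoding style
        "-B": bin_enc,     # binary encoding style
        "-H": header_enc,  # header encoding style
    }
    missing = [flag for flag, value in spec.items() if not value]
    if missing:
        raise ValueError("incomplete specification, missing: " + " ".join(missing))
    args = ["emil"]  # assumed name of the executable
    for flag, value in spec.items():
        args += [flag, value]
    return args


print(emil_args("MIME", "ISO-8859-1", "quoted-printable", "base64", "quoted-printable"))
```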

Logging and Debug

-d Debug output on stderr.

-f <facility> The log facility of syslog logging.

-h <level> The log level of header logging.

-l <level> The log level of syslog logging.

Logging and debug output are controlled using the above command line arguments.

Configuration files

-c <path> Path to charsets file.

-e <path> Path to configuration file.

It is possible to explicitly specify the path of the configuration or charsets files. If not specified, the default paths will be used.

I/O Specifications

-i <path> Path to input file.

-m <mailer> The program to which to pipe output.

-n Use SMTP client as output.

-o <path> Path to output file.

-p Add pseudo route to recipient address.

-u Wants UNIX style from line.

Input and output are controlled by command line arguments. This also applies to the SMTP client and to the redirection of output to arbitrary programs.

Miscellaneous

-g Return group match.

-v Print out version.

It is possible to get information from Emil using the above flags. These arguments disable all other activities and should only be used manually. Emil will print the requested information and then exit.

Legal Values

Legal values for the above referenced arguments are:

<address> Any Internet mail address.
<hostname> Any Internet host name.
<encoding> Choose between:
- (7) 7bit - 7bit encoding.
- (8) 8bit - 8bit encoding.
- (bi) binhex - BinHex encoding.
- (u) uuencode - UUencode encoding.
- (ba) base64 - Base64 encoding.
- (q) quoted-printable - Quoted-Printable.
<charset> Any charset name as defined in rfc1345.
<facility> Choose between:
- m - LOG_MAIL
- d - LOG_DAEMON
- 0 - LOG_LOCAL0
- 1 - LOG_LOCAL1
- 2 - LOG_LOCAL2
- 3 - LOG_LOCAL3
- 4 - LOG_LOCAL4
- 5 - LOG_LOCAL5
- 6 - LOG_LOCAL6
- 7 - LOG_LOCAL7
<level> Choose between:
- 1 - LOG_ERR
- 2 - LOG_NOTICE
- 3 - LOG_INFO
- 4 - LOG_DEBUG
<path> The path of a file.
<mailer> The name of a mailer as specified in the configuration file.

The configuration file emil.cf

Emil uses a configuration file for multiple purposes. One purpose is to retrieve the conversion profile for the specified message addresses. Another is to retrieve information about file type marking, among other things.

The configuration file consists of four parts or functions:

group - Specifies a conversion profile.
mailer - Defines a mailer (program) with arguments.
match - Specifies file type information.
member - Binds addresses to a conversion profile.

group

The group command is used to define a conversion profile, a named method for conversion. The conversion profile specifies what format, charset and encodings are to be used. Once Emil has selected a conversion profile, the information in it is used when carrying out the conversion.

The syntax of the group command is the group keyword followed by a name, a colon and a comma separated list of options, as follows:

group Group_name : [charset=CHARSET] [format=FORMAT] [bin=ENCODING] [textenc=ENCODING] [henc=ENCODING] ;

Group_name - The name of the group.
CHARSET - The charset to use for text, according to RFC1345.
FORMAT - The message format, pick one of MIME, MAILTOOL, RFC822 or TRANSPARENT.
bin - The encoding to use for binary attachments. Choose one of Base64, BinHex or UUencode.
textenc - Specifies the encoding to be used for text. Pick one of BAse64, BInhex, Uuencode, Quoted-printable, 8bit or 7bit. 7bit is the default. When 7bit is chosen, the text is first converted to ISO-8859-1 and then the 8bit characters are replaced by the appropriate characters as specified by some national variants of ISO-646. This special conversion is designed to work in Sweden but probably also works for other European languages.
henc - Specifies the encoding to be used for header lines. Pick one of Base64, Quoted-printable, 8bit or 7bit. 7bit is the default. If BAse64 or Quoted-Printable is selected, headers are converted according to RFC1522 (MIME-II). If 7bit is selected, headers are first converted to ISO-8859-1 and then 8bit characters are converted to the US-ASCII characters of closest resemblance.
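
To give an idea of what the two kinds of header conversion produce, the Python sketch below encodes a header value as an RFC1522 encoded word using the standard email.header module, and separately reduces it to US-ASCII look-alikes. The second function uses Unicode decomposition as a stand-in technique; Emil's actual replacement table is not given here, so its results may differ. Both function names are made up.

```python
import unicodedata
from email.header import Header, decode_header


def encode_header_value(text, charset="iso-8859-1"):
    """Encode a header value as an RFC1522 style encoded word."""
    return Header(text, charset).encode()


def to_ascii_lookalike(text):
    """Stand-in for 'closest resemblance': drop accents via decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


subject = "R\xe4ksm\xf6rg\xe5s"  # Swedish word with three 8bit characters
encoded = encode_header_value(subject)
print(encoded)
print(to_ascii_lookalike(subject))
```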

mailer

The mailer command is used to define a mailer, or program, as an execv vector; the program is forked off and receives the output of the conversion. The mailer name is referred to with the "-m" command line flag. The execv specified by the mailer definition is then executed in a child process, and the output of the conversion is sent to the stdin of that child.
The syntax of the mailer command is as follows:

mailer Mailer_name : PATH, PROGRAM, ARGUMENTS, ... ;

Mailer_name - Name of the mailer.
PATH - Path of the program.
PROGRAM - Name of the program.
ARGUMENTS - A list of the command line arguments to use when invoking the program. To suit mail programs, you can specify variables representing the sender and recipient. These are specified by $s and $r respectively.
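
The substitution of $s and $r can be pictured with the Python sketch below. It assumes that each variable appears as a whole argument and that PROGRAM becomes the first element of the argument vector; neither detail is confirmed above. The function name, the sendmail path, the -f flag and the addresses in the example are illustrative only.

```python
def expand_mailer(path, program, arguments, sender, recipient):
    """Return (path, argv) with $s and $r replaced by the given addresses."""
    values = {"$s": sender, "$r": recipient}
    argv = [program] + [values.get(arg, arg) for arg in arguments]
    return path, argv


path, argv = expand_mailer(
    "/usr/lib/sendmail", "sendmail", ["-f", "$s", "$r"],
    sender="bo@example.org", recipient="anna@example.se",
)
print(path, argv)
```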

match

The match command is used to define file type information such as the Content-Type of MIME messages, file name extensions for UUencoded files and Type&Creator for Applefiles. The kind of type information is specified by the CONTEXT.
The syntax of the match command is as follows:

match CONTEXT Match_String OUT ;

CONTEXT - One of MIME, BINHEX or UUENCODE.
Match_String - A quoted string that corresponds to the CONTEXT of the field. For MIME, Match_String defines the content-type to be used. For UUENCODE, Match_String defines the file name extension to be used. For BINHEX, Match_String defines the merged two 4byte strings Type and Auth info.
OUT - The generic type of the field.

member

The member command is used to bind the address information, given as command line arguments to Emil, to a conversion profile. Address information is given as the triple of Recipient, Sender, and Recipient_host. Several triples may be bound to one conversion profile, in which case the triples are specified as a comma separated list.
The syntax of the member command is as follows:

member Group_name : Recipient Sender Recipient_host, ... ;
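
The relationship between member and group can be modelled as a lookup from an address triple to a profile, as in the Python sketch below. It assumes exact string comparison of the triple, which may not be how Emil matches addresses. The group names, addresses and hosts are made up, and only the 7bit defaults for textenc and henc are taken from the group description.

```python
# Defaults stated in the group description; other options have none here.
DEFAULTS = {"textenc": "7bit", "henc": "7bit"}

groups = {
    "sweden7": {"format": "RFC822", "charset": "ISO-646-SE"},
    "mime8": {"format": "MIME", "charset": "ISO-8859-1", "bin": "Base64",
              "textenc": "Quoted-printable", "henc": "Quoted-printable"},
}

# Each group name is bound to a list of (recipient, sender, host) triples.
members = {
    "sweden7": [("anna@old.example.se", "bo@example.org", "old.example.se")],
    "mime8": [("carl@new.example.se", "bo@example.org", "new.example.se")],
}


def select_profile(recipient, sender, host):
    """Return (group name, profile) for a triple, or None if nothing matches."""
    for name, triples in members.items():
        if (recipient, sender, host) in triples:
            return name, {**DEFAULTS, **groups[name]}
    return None


print(select_profile("anna@old.example.se", "bo@example.org", "old.example.se"))
```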

Emil Details

Emil has no mechanisms for handling spooling and message delivery errors. Instead it relies on what is supplied by other programs, for example sendmail, when it comes to handling these matters. Normally Emil is set up so that an originating process calls Emil and sends a message to the stdin of Emil. After processing the message, Emil sends the output either to another process, called by Emil, or to a file or an SMTP stream. If sent to another process, and provided that the conversion is otherwise successful, the return code of that process will be returned by Emil to the originating process. If sent to a file or an SMTP stream, the return code sent to the originating process will be that of the conversion and output process made by Emil. In either case, the originating process will not get a return code from Emil until all further processing and delivery, under control by Emil, is completed. This is to make sure that the originating process will never receive a "delivery successful" return code unless indicated successful by the subsequent process or by Emil.
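
The way a return code travels back to the originating process can be illustrated in Python with the subprocess module. This is a sketch of the principle only, not of Emil's implementation: the converted message is written to the stdin of a child, the caller waits until the child has finished, and the child's exit status is passed on. The function name deliver is made up, and a small Python one-liner stands in for the mailer.

```python
import subprocess
import sys


def deliver(converted, argv):
    """Pipe the converted message to a child and return its exit status."""
    # run() does not return until the child has exited.
    result = subprocess.run(argv, input=converted)
    return result.returncode


# Stand-in for a mailer: reads the message from stdin, then exits with 3.
child = [sys.executable, "-c",
         "import sys; sys.stdin.buffer.read(); sys.exit(3)"]
status = deliver(b"Subject: test\n\nbody\n", child)
print("exit status passed back to the caller:", status)
```
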
Emil is designed to be transparent when it comes to delivery failures. A side effect of this is that a spooled copy of a message (for example by sendmail) will not be unlinked until either control is successfully handed over to another process (for example sendmail) or the message is safely delivered to a file or SMTP stream. This is how Emil can keep the working copy of a message in volatile memory without risking loss of messages. There is a negative side effect as well: if you manage to construct a loop of sendmail-emil-sendmail-emil-sendmail, your spool area will fill up almost instantly, no matter how small a message you sent.

Using Emil

There are several ways of using Emil. You may choose to use Emil as an independent program or you may choose to integrate Emil into your message delivery system. Using the capabilities specified below there is no restriction on how you can use Emil.
Emil can take its input in two ways: You can pipe a message to the stdin of Emil or you can place the message in a file and tell Emil to use that file for input.
Emil can produce its output in three ways: To a file, to the stdout or to an SMTP stream (Emil includes an SMTP client and can send the message to a remote host using SMTP).
Emil can also fork off another program and send the output to the stdin of that process.
If you intend to use Emil as an independent program, feel free to experiment with how you can use it. If, on the other hand, you want to integrate Emil into your message delivery system, there are a few things that need considering. There are three main approaches: input filtering, output filtering and intermediate filtering.

Input filtering

This means applying Emil before the MTA receives the message and making Emil call the MTA after conversion is completed. The one big disadvantage of this method is that there is no way to apply Emil to network input, because Emil has no network daemon facilities. It is, however, possible to use Emil as an input filter to sendmail for locally sent messages. This can be done by installing Emil on the path of the sendmail program and moving sendmail to another place (sendmail must of course still be up and running as a daemon). Emil can then be made to call sendmail and send the converted messages to the stdin of sendmail. This is the general, and ugly, method of using Emil as an input filter. Another way is to restrict usage to specific programs that need to have their messages converted. This, of course, means changing the programs to call Emil instead of sendmail. This can be fairly useful, for instance in a news-to-mail program that gateways a news group to a mailing list. (The opposite is also possible, using Emil in a mail-to-news program.)

Output filtering

This means letting the MTA apply Emil to its output, and is perhaps the most useful method of using Emil. The most powerful way to implement output filtering in sendmail is to change the mailer definitions in sendmail.cf. Emil includes an SMTP client and can be made to fork off other programs and redirect its output to them. It is therefore possible to replace sendmail's TCP mailer and the external mailers, such as the local mailer and the uucp mailer, with mailers calling Emil. This is a total solution excluding only those internal mailers that are not SMTP based. It is also possible to implement conversion in a more restricted way using output filtering.

Intermediate filtering

The only way to implement intermediate filtering using sendmail is to combine output and input filtering in a loop-back fashion, letting Emil receive the message from sendmail and redirect it back to sendmail again. This is the method suggested in previous versions of Emil 2. The main advantage is that this method is more general than any of the others. The main disadvantages are that it is difficult to make the changes to sendmail.cf correctly and that all messages will actually be processed by sendmail twice, spooled twice, et cetera.