Discussion:
Poorly-wrapped text when viewed in external editor
Victoria Stuart (gmail)
2017-07-07 02:06:21 UTC
A quick question: my messages flow nicely (the width of preview pane) in Claws, but some of them -- viewed in an external text editor (Geany; gedit) -- are wrapped, poorly, with the wrapped lines terminating with an equals sign. E.g.

OK - CLAWS:

The neurodegeneration that occurs in Parkinson's disease is a result of stress on the endoplasmic reticulum in the cell rather than failure of the mitochondria as previously thought, according to a study in fruit flies. It was found that the death of neurons associated with the disease was prevented when chemicals that block the effects of endoplasmic reticulum stress were used. Some inherited forms of early-onset Parkinson's disease have typically been blamed on poorly functioning mitochondria, the powerhouses of cells. Without reliable sources of energy, neurons wither and die. This may not be the complete picture of what is happening within cells affected by Parkinson's. Researchers from the MRC Toxicology Unit at the University of Leicester used a common fruit fly to investigate this further; fruit flies were used because they provide a good genetic model for humans.

NOT OK - GEANY [SAME IN GEDIT; cat (TERMINAL)]:

The neurodegeneration that occurs in Parkinson's disease is a result of str=
ess on the endoplasmic reticulum in the cell rather than failure of the mit=
ochondria as previously thought, according to a study in fruit flies. It wa=
s found that the death of neurons associated with the disease was prevented=
when chemicals that block the effects of endoplasmic reticulum stress were=
used. Some inherited forms of early-onset Parkinson's disease have typical=
ly been blamed on poorly functioning mitochondria, the powerhouses of cells=
. Without reliable sources of energy, neurons wither and die. This may not =
be the complete picture of what is happening within cells affected by Parki=
nson's. Researchers from the MRC Toxicology Unit at the University of Leice=
ster used a common fruit fly to investigate this further; fruit flies were =
used because they provide a good genetic model for humans.

Other messages look OK, both in Claws and in the terminal and text editors.

Any ideas on what's causing this, and how I can obtain the properly-formatted text outside of Claws?

Specifically, is there some Claws setting that I can tweak? I don't want to have to reformat these messages (thousands of them) with command-line tools (awk; sed; ...) if I don't have to.

Thanks!

==============================================================================
Andrej Kacian
2017-07-07 07:50:48 UTC
On Thu, 6 Jul 2017 19:06:21 -0700
Post by Victoria Stuart (gmail)
A quick question: my messages flow nicely (the width of preview pane) in Claws, but some of them -- viewed in an external text editor (Geany; gedit) -- are wrapped, poorly, with the wrapped lines terminating with an equals sign. E.g.
That is probably the "flowed" format the messages are using. This
article explains how it works nicely:
https://cpbotha.net/2016/09/27/thunderbird-support-of-rfc-3676-formatflowed-is-half-broken/

Regards,
--
Andrej
Jim Pachowski
2017-07-07 15:39:09 UTC
Post by Andrej Kacian
Post by Victoria Stuart (gmail)
some of them -- viewed in an external text editor
(Geany; gedit) -- are wrapped, poorly, with the wrapped
lines terminating with an equals sign. E.g.
That is probably the "flowed" format the messages are using. This
article explains how it works nicely:
https://cpbotha.net/2016/09/27/thunderbird-support-of-rfc-3676-formatflowed-is-half-broken/
Actually it sounds like the external editor is picking up quoted printable
encoding while the email display is taking that encoding into account.
Victoria Stuart (gmail)
2017-07-07 15:46:28 UTC
@Jim: I think that is correct: when I "cat" those messages in a terminal, I get the same thing (affected messages only; wrapped with an equals sign at the end).

Any idea on how I can turn this off (if possible)? Everything in Claws looks great; but I want to use those (archived) messages for some NLP work.

I can remove them with a small bash script (sed; ...) -- will post that solution later; want to test it more, first -- but looking for an easy fix. ;-)

@Andrej: Thanks also for your reply; appreciated! :-)
Post by Jim Pachowski
Actually it sounds like the external editor is picking up quoted printable
encoding while the email display is taking that encoding into account.
Victoria Stuart (gmail)
2017-07-07 18:19:20 UTC
OK: here is my solution:

find ./1 -type f -regex '.*/[0-9]*' -exec sed -i -e ':a;N;$!ba;s/=\n//g' {} \;

executed in a bash script (or simply on the command line).
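
Note that the sed one-liner above only joins the soft line breaks ("=" at the end of a line); any =XX escapes (for example =C3=A9 for a UTF-8 "é", or =3D for a literal "=") stay in the text. A minimal sketch of a fuller decoder in plain bash follows, assuming the input is only the quoted-printable body (no headers) and that bash's builtin printf is available. The function name decode_qp is hypothetical, not part of Claws Mail or any package.

```bash
#!/bin/bash
# Sketch: decode a quoted-printable body read from stdin.
# decode_qp is an illustrative helper name (an assumption, not an existing tool).
decode_qp() {
    local line eol
    while IFS= read -r line || [ -n "$line" ]; do
        line=${line%$'\r'}            # tolerate CRLF line endings
        eol='\n'
        if [[ $line == *= ]]; then    # a trailing "=" is a soft line break
            line=${line%=}
            eol=''
        fi
        line=${line//\\/\\\\}         # protect literal backslashes from printf %b
        # rewrite each =XX escape as \xXX so that printf %b emits the raw byte
        line=$(printf '%s' "$line" | sed 's/=\([0-9A-Fa-f][0-9A-Fa-f]\)/\\x\1/g')
        printf '%b' "$line$eol"
    done
}
```

Usage would be along the lines of `decode_qp < body.txt > body.decoded.txt`, run on copies of the messages rather than on the Claws Mail folders themselves. It calls sed once per line, so it is slow on large archives; it is meant to show the mechanism, not to replace a proper MIME decoder.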

Details here (GitHub gist):

https://gist.github.com/victoriastuart/2e1094ecacaf6e25b3347c2dcd597c66

Claws Mail messages are saved under numeric-only filenames, so my plan is to copy the Claws Mail directories I want to work with to another location and run that script over the copied folders. That preserves the integrity of my Claws Mail directories while allowing me to preprocess ("clean") the affected messages prior to further work with them.

Best, V. :-)
Victoria Stuart (VictoriasJourney.com)
2017-07-08 03:17:21 UTC
Yes, in my case there is ...

------------------------------------------------------------------------------
HEADER (part) FROM AFFECTED MESSAGE (quoted-printable; '='-terminated lines):

...
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="--==_mimepart_595ed001a35d6_bad43fbaf62cb98881253";
charset=utf-8
Content-Transfer-Encoding: 7bit
X-Auto-Response-Suppress: All
Auto-Submitted: auto-generated
X-Mailer: Zendesk Mailer
...
Content-Type: text/plain;
charset=utf-8
Content-Transfer-Encoding: quoted-printable

------------------------------------------------------------------------------
HEADER (part) FROM UNAFFECTED MESSAGE ('normally'-wrapped lines):

MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
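
Going by the two header excerpts above, the affected messages can be told apart by a "Content-Transfer-Encoding: quoted-printable" line. A minimal sketch for listing them, assuming GNU or BSD grep with recursive search; the function name list_qp_files is hypothetical:

```bash
#!/bin/bash
# Sketch: print every file under a directory that declares quoted-printable
# in a Content-Transfer-Encoding line (message header or MIME part header).
# list_qp_files is an illustrative name, not an existing tool.
list_qp_files() {
    grep -rli '^Content-Transfer-Encoding:[[:space:]]*quoted-printable' -- "$1"
}
```

For example, `list_qp_files ~/Mail/inbox` (path is only an example). The match is line-based, so a message that merely quotes such a header line in its body would also be listed.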
Victoria Stuart (VictoriasJourney.com)
2017-12-01 04:04:35 UTC
Hello everyone! I returned to this issue yesterday as I now want to use my CM messages for NLP-related work. Looking at the raw (saved) messages, I noticed (linux: file -i <file>) that they were "message/rfc822; charset=us-ascii" even though I work in UTF-8 and CM displays them correctly (e.g. Greek letters in scientific text). Noting that the headers contain

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I did a quick search and found two Linux utilities that I explored, that decode MIME-encoded text:

munpack (part of the mpack package)
uudeview

that enable you to process (unpack / decode) those messages.

I found the latter to be more useful, as it allowed me to easily extract the text of each CM file as a .txt file, which I then used in the processing. I wrote a bash script, below, that recursively grabs CM files from a specified starting folder and returns the textual portions with an .out extension (optionally into a single output folder).

Since CM stores messages under numeric filenames (which are not unique across directories), you can either output to the source directory or rename the files. I wanted to tag the file names with the parent directory name anyway, so I just dump everything into a common output folder, with the files named with the parent folder, base name, and an .out extension.

Cheers, Victoria :-)

# ============================================================================

#!/bin/bash

export LANG=C.UTF-8

# /mnt/Vancouver/Programming/scripts/claws/claws-decode.sh

# ----------------------------------------------------------------------------
# ENABLE SPACES IN FILENAMES:

## https://stackoverflow.com/questions/4638874/how-to-loop-through-a-directory-recursively-to-delete-files-with-certain-extensi

## To allow spaces in filenames,
## at the top of the script include: IFS=$'\n'; set -f
## at the end of the script include: unset IFS; set +f

IFS=$'\n'; set -f

# ----------------------------------------------------------------------------
# SET PATHS:

PWD=$(pwd)

OUT=$PWD/out
# https://stackoverflow.com/questions/793858/how-to-mkdir-only-if-a-dir-does-not-already-exist
mkdir -p $OUT

IN="/mnt/Vancouver/Programming/scripts/claws/claws_mail/"

# https://superuser.com/questions/716001/how-can-i-get-files-with-numeric-names-using-ls-command
# FILES=$(find $IN -type f -regex ".*/[0-9 ]*") ## recursive; numeric filenames only

FILES=$(find $IN -type f -regex ".*/[0-9 ]*") ## recursive; numeric filenames only (may include spaces)

# echo '$FILES:' ## single-quoted, prints: $FILES:
# echo "$FILES" ## double-quoted, prints path/, filename (one per line)

# ----------------------------------------------------------------------------
# MAIN LOOP:

for f in $FILES
do
    bp=$(basename "$(dirname "$f")")
    bn=$(basename "$f")

    # Output to $OUT dir:
    # https://askubuntu.com/questions/538913/how-can-i-copy-files-with-duplicate-filenames-into-one-directory-and-retain-both
    # uudeview -q -i -t $f; /usr/bin/cp -f --backup=existing -S .orig 0001.txt $OUT/$bp.$bn.out; rm -f 0001.txt
    uudeview -q -i -t "$f"; /usr/bin/cp -f --backup=simple -S .orig 0001.txt "$OUT/$bp.$bn.out"; rm -f 0001.txt

    # Output to source dir:
    # uudeview -q -i -t $f; /usr/bin/cp -f 0001.txt $f.out; rm -f 0001.txt
done

# ----------------------------------------------------------------------------

unset IFS; set +f

# ----------------------------------------------------------------------------
# CLEAN UP:

## (This must appear after the "unset IFS; set +f" line, above.)

# ----------------------------------------
# Remove multiple extensions via 'brace expansion':

# https://stackoverflow.com/questions/10516384/how-to-delete-multiple-files-at-once-in-bash-on-linux
# rm -f *.{jpg, pdf, png, gif}
# Either of these work, but do NOT include spaces anywhere inside the braces:

#rm -fR $PWD/*.{jpg,pdf,png,gif}
rm -fR $PWD/*{.jpg,.pdf,.png,.gif}

# ----------------------------------------
# Rename extensions of renamed, duplicate files:

rename out.orig orig.out $OUT/*.orig

# ============================================================================
Victoria Stuart (VictoriasJourney.com)
2017-12-06 18:42:49 UTC
I see these messages are archived by month (http://lists.claws-mail.org/pipermail/users/); my update above is in response to my thread in July 2017: http://lists.claws-mail.org/pipermail/users/2017-July/019762.html
Jeremy Nicoll
2017-12-06 19:07:35 UTC
On Wed, 6 Dec 2017, at 18:42, Victoria Stuart (VictoriasJourney.com)
Post by Victoria Stuart (VictoriasJourney.com)
I see these messages are archived by month
(http://lists.claws-mail.org/pipermail/users/); my update above is in
response to my thread in July 2017:
http://lists.claws-mail.org/pipermail/users/2017-July/019762.html
I think you're seeing a 'problem' that isn't actually there. The "badly
formatted" data isn't meant to be read directly by a human being;
instead it's packaged up (in this case as Quoted-Printable) so that it
can be sent properly across an email network. It is not meant to be
formatted.

Any mail client should show you the unpacked data, formatted
within its capabilities.

Possibly the question that should really be asked is why plain text
has been packed as Quoted-Printable. It's likely to happen if that
text contains any of the following: accented letters, hard spaces,
sexed opening & closing quotes, ligatures etc - all characters NOT
supported by basic email systems.
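
One rough way to see how much of a saved message is actually encoded characters, as opposed to plain soft line breaks, is to count the =XX escapes in it. A minimal sketch, assuming a grep that supports -o (GNU and BSD grep do); the name count_qp_escapes is hypothetical:

```bash
#!/bin/bash
# Sketch: count =XX escape sequences (uppercase hex, as quoted-printable
# encoders normally write them) in a file.
# count_qp_escapes is an illustrative name, not an existing tool.
count_qp_escapes() {
    grep -o '=[0-9A-F][0-9A-F]' -- "$1" | wc -l | tr -d ' '
}
```

A result of 0 suggests the message body is plain ASCII that was only re-wrapped. The count covers the whole file, so header text that happens to contain "=" followed by two hex digits (a MIME boundary, for instance) would be counted as well.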

It's a bit like someone saving a webpage, then opening the html in
a text editor and wondering why it doesn't look like it does in a
browser.
--
Jeremy Nicoll - my opinions are my own.
Olaf Hering
2017-12-07 08:17:06 UTC
Post by Jeremy Nicoll
Possibly the question that should really be asked is why plain text
has been packed as Quoted-Printable. It's likely to happen if that
text contains any of the following: accented letters, hard spaces,
sexed opening & closing quotes, ligatures etc - all characters NOT
supported by basic email systems.
It is 2017. Sending 8bit should have been the standard for a few years now.
If you find a system that corrupts 8bit data, please let us know.

Olaf

rikona
2017-07-07 21:40:18 UTC
On Fri, 7 Jul 2017 08:46:28 -0700
Post by Victoria Stuart (gmail)
@Jim: I think that is correct: when I "cat" those messages in a
terminal, I get the same thing (affected messages only; wrapped with
an equals sign at the end).
I also 'archive' direct copies of CM files and thought I'd better take a
look. I see the same thing, but only for SOME, but not all, non-text
parts of emails [HTML, etc]. The text part does not seem to have =
signs at the end of lines. However, this seriously breaks my archives.

It looks as though CM inserts the = signs in the file - is this true?
Post by Victoria Stuart (gmail)
Any idea on how I can turn this off (if possible)?
That would be necessary for my archiving - I have several hundred dirs,
each with many files. Likely too hard to fix with your script. I can
accept long lines if that is the alternative.
Post by Victoria Stuart (gmail)
Everything in Claws looks great; but I want to use those (archived) messages for
some NLP work.
For me also - the = sign breaks the archives.
Post by Victoria Stuart (gmail)
I can remove them with a small bash script (sed; ...) -- will post
that solution later; want to test it more, first -- but looking for
an easy fix. ;-)
I saw that, nice, but with many hundreds of dirs, this is not
practical for me.

Any way to keep CM from doing this, if CM is putting in the = signs?

CM can also archive in a few formats, but these are not readable by my
comp-wide search pgm, which I use many times a day, so those formats
can't be used. And, even if one of those format is used, would the =
sign still be there if the format is unpacked?
Post by Victoria Stuart (gmail)
@Andrej: Thanks also for your reply; appreciated! :-)
Post by Jim Pachowski
Actually it sounds like the external editor is picking up quoted
printable
encoding while the email display is taking that encoding into
account.
_______________________________________________
Users mailing list
http://lists.claws-mail.org/cgi-bin/mailman/listinfo/users
George Avrunin
2017-07-07 23:55:20 UTC
Post by rikona
It looks as though CM inserts the = signs in the file - is this true?
I think Jim Pachowski's comment earlier in the thread is the most likely
explanation: the = characters are from quoted-printable encoding. This
uses the = as an escape to encode 8-bit data using printable ASCII
characters, and also makes all the lines no more than 76 characters long
(after encoding, with a trailing = marking each soft line break), as in
the example in Victoria Stuart's message. (See
RFC 2045, which replaced RFC 1521 on this.)
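
The 76-character limit on encoded lines can be checked on a saved message. A minimal sketch, assuming a simple single-part message where everything after the first blank line is the body; the name max_body_line_length is hypothetical:

```bash
#!/bin/bash
# Sketch: print the length in bytes of the longest line after the header block.
# max_body_line_length is an illustrative name, not an existing tool.
max_body_line_length() {
    LC_ALL=C awk '
        in_body { sub(/\r$/, ""); if (length($0) > max) max = length($0) }
        /^\r?$/ { in_body = 1 }   # the first blank line ends the header block
        END     { print max + 0 }
    ' "$1"
}
```

For a quoted-printable body the result should not exceed 76. In a multipart message the headers of the individual parts come after the first blank line and are counted too, so a larger value there does not by itself mean the encoding is broken.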

Can you look at the actual message source for one of these messages and see
if there's a header saying "Content-Transfer-Encoding: quoted-printable"?

George
rikona
2017-07-08 03:13:52 UTC
On Fri, 7 Jul 2017 19:55:20 -0400
Post by George Avrunin
Post by rikona
It looks as though CM inserts the = signs in the file - is this true?
I think Jim Pachowski's comment earlier in the thread is the most
likely explanation: the = characters are from quoted-printable
encoding. This uses the = as an escape to encode 8-bit data using
printable ASCII characters, and also makes all the lines no more than
76 characters long
Yes - all the ones I looked at are like that.
Post by George Avrunin
Can you look at the actual message source for one of these messages
and see if there's a header saying "Content-Transfer-Encoding:
quoted-printable"?
Yes - all the parts with = signs do have that header. The HTML that does not
have = has a different header.

SO, how can I view the non-CM file archive with the HTML as what one
would expect for HTML? What tool displays that section as it 'should'
look? Several I tried don't work, and some crash!