Text normalizer, anyone?

When I write text documents (using LaTeX or DocBook), I like to wrap lines: it makes them easier to edit (fewer things move around on the screen) and gives easy-to-read diffs.

However, I always hesitate before rewrapping paragraphs (using vim’s gqap): this means adding noise to my git history. So I only do it from time to time, making “rewrapping-only commits”. But that sucks, since in the meantime I sometimes make a lot of changes, and my lines grow long again. Of course, I could rewrap my paragraphs before each commit, but simply adding a word to a paragraph might cause every following line of that paragraph to be rewrapped.

So what I need is some kind of “text normalizer” that will:

  • split lines at The Right Place: after ‘.’, ‘,’, ‘:’, ‘;’, etc., so that rewrapping won’t propagate changes too far.
  • understand the basics of LaTeX, so it won’t rewrap
    \begin{tabular}{|l|l|}\hline
    x & y \\
    1 & 2 \\\hline
    \end{tabular}

    or

    \begin{figure}
    \centerline{\includegraphics{fig}}
    \caption{Cool stuff}
    \label{coolstuff}
    \end{figure}

    (vim does rewrap those examples.)

  • be editor-agnostic, so other committers can use it as well.
  • support other document formats (DocBook XML); that would be nice too.
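
To make the first two requirements concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: the function name `normalize`, the `VERBATIM_ENVS` list, and the choice of break characters are all made up for this example, and it knows nothing about `%` comments, abbreviations such as "e.g.", or DocBook.

```python
import re

# Environments copied through untouched (an assumed, incomplete list).
VERBATIM_ENVS = {"tabular", "figure", "verbatim"}

BEGIN = re.compile(r"\\begin\{(\w+\*?)\}")
# Break on whitespace that follows sentence or clause punctuation.
BREAK = re.compile(r"(?<=[.,:;!?])\s+")


def normalize(text):
    out = []          # normalized output lines
    paragraph = []    # lines of the prose paragraph being collected
    protected = None  # name of the protected environment we are inside

    def flush():
        # Join the collected paragraph, then split it at punctuation only.
        if paragraph:
            joined = " ".join(" ".join(paragraph).split())
            out.extend(BREAK.split(joined))
            paragraph.clear()

    for line in text.splitlines():
        if protected:
            out.append(line)
            if "\\end{%s}" % protected in line:
                protected = None
            continue
        m = BEGIN.search(line)
        if m and m.group(1) in VERBATIM_ENVS:
            flush()
            out.append(line)
            if "\\end{%s}" % m.group(1) not in line:
                protected = m.group(1)
            continue
        if line.strip():
            paragraph.append(line)
        else:
            flush()
            out.append("")  # keep the blank line between paragraphs
    flush()
    return "\n".join(out) + "\n"
```

Because line breaks depend only on punctuation and never on line length, adding a word to a sentence changes exactly one output line, and running the function twice gives the same result as running it once.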

I’ve looked at plasTeX: I could use it to parse a LaTeX document, and export it as LaTeX. But then it would be a LaTeX-only solution. Does anyone have a better solution?

9 thoughts on “Text normalizer, anyone?”

  1. I have roughly the same problem: colleagues mostly use TeXShop and just rely on the default soft wrapping, so we end up with long lines or random line breaks.

    I personally try to start a new line for each sentence, but then the text is not as nicely wrapped in the editor. Maybe just replace period-space-space by period-linebreak before commits, and do the reverse at updates? Is it possible to tell svn to use period-space-space as an additional end-of-line marker?

  2. Why not just turn on vim’s wrap mode? This only affects the display of the file.

  3. Matthew: the problem with soft wrap is that it produces really long lines, which makes conflicts more likely and differences hard to review (e.g. fixing a typo in a paragraph when the whole paragraph is one hard line).

  4. > However, I always hesitate before rewrapping paragraphs (using vim’s
    > gqap): this mean that I will add noise to my git history.

    I don’t think that wrapping lines adds noise to git’s history if you
    re-wrap lines while changing a file.

    In other words, suppose you edit file a.tex and in the process also
    make some whitespace changes, then commit using
    git commit a.tex -m "Some changes to a.tex"
    Now you want to see the changes you made but ignore whitespace
    changes. So you run
    git diff -b
    The -b switch asks git to ignore whitespace changes.

    This approach is only problematic if you want to generate patches
    which are to be applied outside git.

  5. Kapil: git diff -b doesn’t solve my problem.

    If I have a line:
    a c d e f g h i j k l m
    I edit the line to add a “bbbbbbbbbb”, but that causes the line to go past the 80-char limit. So, after rewrapping:
    a bbbbbbbbbb c d e
    f g h i j k l m
    I haven’t checked, but git diff -b won’t help here (if git diff -b behaves like diff -b).
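
The point can be checked with a small Python sketch using the standard `difflib` module. It does not call git or diff; squeezing runs of blanks on each line is only an approximation of what `-b` does, and the helper name `diff_tokens` is made up for this example.

```python
import difflib

old = "a c d e f g h i j k l m\n"
new = "a bbbbbbbbbb c d e\nf g h i j k l m\n"


def diff_tokens(old_tokens, new_tokens):
    """Return (removed, added) tokens between two token sequences."""
    sm = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            removed.extend(old_tokens[i1:i2])
            added.extend(new_tokens[j1:j2])
    return removed, added


def squeezed_lines(text):
    # Collapse runs of blanks inside each line; line breaks still count.
    return [" ".join(line.split()) for line in text.splitlines()]


# Line-based comparison: the old line is gone and two new lines appear.
print(diff_tokens(squeezed_lines(old), squeezed_lines(new)))

# Word-based comparison (the wdiff idea): only the added word shows up.
print(diff_tokens(old.split(), new.split()))
```

The line-based comparison reports the whole paragraph as changed, while the word-based one reports a single inserted word.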

  6. That was my mistake. I thought this option made “diff” behave like “wdiff”.

    I have used “wdiff” in the past when my collaborator sent me a
    para-reformatted TeX file.

    Maybe one should create something like “git-wdiff”.

  7. While I think you are slightly overoptimizing, with git you _can_ add a filter to convert between checkin and checkout. However, perhaps git diff --color-words can help your reviewing eyes! (That’s one of the few git features made only for actual plain text processing!)
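
As a rough sketch of how such a filter could look, here is the period-space-space idea from comment 1 written as a pair of Python functions. The script name `sentence_filter.py` and the filter name `sentences` are hypothetical; the wiring in the comments uses git's filter driver mechanism (`filter.<name>.clean` / `filter.<name>.smudge` plus a `.gitattributes` entry). It makes no attempt to skip LaTeX environments.

```python
import re
import sys

# Assumed wiring (names are hypothetical):
#   git config filter.sentences.clean  "python3 sentence_filter.py clean"
#   git config filter.sentences.smudge "python3 sentence_filter.py smudge"
#   .gitattributes:  *.tex filter=sentences


def clean(text):
    """Working tree -> repository: one sentence per line."""
    return text.replace(".  ", ".\n")


def smudge(text):
    """Repository -> working tree: rejoin sentences with period-space-space."""
    # Join only when the next line has content, so blank lines
    # between paragraphs and the final newline survive.
    return re.sub(r"\.\n(?=\S)", ".  ", text)


if __name__ == "__main__" and len(sys.argv) > 1:
    convert = clean if sys.argv[1] == "clean" else smudge
    sys.stdout.write(convert(sys.stdin.read()))
```

For text already stored one sentence per line, `clean(smudge(text))` gives back the stored text, so checking a file out and committing it unchanged produces no diff.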
