Text normalizer, anyone?

June 9th, 2008 by lucas

When I write text documents (using LaTeX or DocBook), I like to wrap lines, as it makes them easier to edit (fewer things moving on the screen) and gives easy-to-read diffs.

However, I always hesitate before rewrapping paragraphs (using vim’s gqap): this means that I will add noise to my git history. So I only do that from time to time, making “rewrapping-only commits”. But that sucks, since in the meantime, I sometimes make a lot of changes, and my lines grow long again. Of course, I could rewrap my paragraphs before each commit, but if I simply add a word to a paragraph, it might cause all the following lines of that paragraph to be rewrapped.

So what I would need is some kind of “text normalizer” that will:

  • split lines at The Right Place: after ‘.’, ‘,’, ‘:’, ‘;’, etc., so that rewrapping won’t propagate changes too far away.
  • understand the basics of LaTeX, so it won’t rewrap
    \begin{tabular}{|l|l|}\hline
    x & y \\
    1 & 2 \\\hline
    \end{tabular}

    or

    \begin{figure}
    \centerline{\includegraphics{fig}}
    \caption{Cool stuff}
    \label{coolstuff}
    \end{figure}

    (vim does rewrap those examples.)

  • be editor-agnostic, so other committers could use it as well.
  • support other document formats (DocBook XML), which would be nice too.
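
The first two requirements can be sketched in a few lines of Python. This is only a rough illustration, not an existing tool: the function name `normalize`, the list of protected environments, and the choice to break after any of `. , : ; ! ?` followed by whitespace are all assumptions. It also breaks after abbreviations such as "e.g." and does not handle nested environments of the same name.

```python
import re

# Environments copied through untouched (an assumed, incomplete list).
PROTECTED_ENVS = {"tabular", "figure", "verbatim"}

BEGIN_RE = re.compile(r"\\begin\{(\w+)\}")
END_RE = re.compile(r"\\end\{(\w+)\}")
# Break at whitespace that follows clause or sentence punctuation.
BREAK_RE = re.compile(r"(?<=[.,:;!?])\s+")


def normalize(text):
    """Put one clause per line; leave protected LaTeX environments alone."""
    out = []
    paragraph = []    # lines of the prose paragraph collected so far
    protected = None  # name of the protected environment we are inside

    def flush():
        # Join the paragraph into one string, then split at punctuation.
        if paragraph:
            joined = " ".join(" ".join(paragraph).split())
            out.extend(BREAK_RE.split(joined))
            paragraph.clear()

    for line in text.splitlines():
        if protected:
            out.append(line)
            m = END_RE.search(line)
            if m and m.group(1) == protected:
                protected = None
            continue
        m = BEGIN_RE.search(line)
        if m and m.group(1) in PROTECTED_ENVS:
            flush()
            out.append(line)
            # Stay in protected mode unless the environment closes on this line.
            closed = any(e.group(1) == m.group(1) for e in END_RE.finditer(line))
            if not closed:
                protected = m.group(1)
            continue
        if not line.strip():
            flush()
            out.append("")  # keep paragraph separators
            continue
        paragraph.append(line)
    flush()
    return "\n".join(out)
```

Since line breaks depend only on punctuation and not on line length, adding a word to a clause changes exactly one line of the output, and running the function twice gives the same result as running it once. Being a standalone script, it is also editor-agnostic.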

I’ve looked at plasTeX: I could use it to parse a LaTeX document, and export it as LaTeX. But then it would be a LaTeX-only solution. Does anyone have a better solution?

9 Responses to “Text normalizer, anyone?”

  1. Damien Pollet wrote on 06/9/08 at 6:21 pm :

    I have roughly the same problem: colleagues mostly use TeXShop and just use the default soft wrapping, ending up with long lines or random line breaks.

    I personally try to start a new line for each sentence, but then the text is not as nicely wrapped in the editor. Maybe just replace period-space-space by period-linebreak before commits, and the reverse at updates? Is it possible to tell svn to use period-space-space as an additional end-of-line marker?

  2. Chris Conway wrote on 06/9/08 at 6:44 pm :

    I start every sentence on a new line and use a fill-sentence macro from Luca de Alfaro. (For AUCTeX, replace fill-region-as-paragraph with LaTeX-fill-region-as-paragraph.)

  3. Matthew W. S. Bell wrote on 06/10/08 at 2:47 am :

    Why not just turn on vim’s wrap mode? This only affects the display of the file.

  4. Damien Pollet wrote on 06/10/08 at 3:08 am :

    Matthew: the problem with soft wrap is that it makes really long lines, so it makes conflicts more probable and reviewing differences difficult (e.g. fixing a typo in a paragraph when the whole paragraph is one hard line).

  5. Kapil Hari Paranjape wrote on 06/10/08 at 12:46 pm :

    > However, I always hesitate before rewrapping paragraphs (using vim’s
    > gqap): this mean that I will add noise to my git history.

    I don’t think that wrapping lines adds noise to git’s history if you
    re-wrap lines while changing a file.

    In other words, suppose you edit file a.tex and in the process also
    make some whitespace changes. Then you commit using
    git commit a.tex -m "Some changes to a.tex"
    Now you want to see the changes you made but ignore whitespace
    changes. So you run
    git diff -b
    The -b switch asks that whitespace changes be ignored.

    This approach is only problematic if you want to generate patches
    which are to be applied outside git.

  6. Lucas wrote on 06/10/08 at 3:45 pm :

    Kapil: git diff -b doesn’t solve my problem.

    If I have a line:
    a c d e f g h i j k l m
    I edit the line to add a “bbbbbbbbbb”, but that causes the line to go past the 80-char limit. So, when rewrapping:
    a bbbbbbbbbb c d e
    f g h i j k l m
    I haven’t checked, but git diff -b won’t help here (if git diff -b behaves like diff -b).
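
The difference can be illustrated with a small Python sketch using the standard difflib module. It only approximates the two tools: `squeeze` mimics the way diff -b ignores changes in the amount of whitespace within a line, and comparing word lists stands in for a wdiff-style comparison. The helper names `squeeze` and `changed` are made up for this example.

```python
import difflib

old = "a c d e f g h i j k l m\n"
new = "a bbbbbbbbbb c d e\nf g h i j k l m\n"


def squeeze(line):
    # Collapse runs of whitespace, roughly what diff -b ignores.
    return " ".join(line.split())


def changed(a_seq, b_seq):
    """Return (removed, added) items between two sequences."""
    sm = difflib.SequenceMatcher(None, a_seq, b_seq, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            removed.extend(a_seq[i1:i2])
            added.extend(b_seq[j1:j2])
    return removed, added


# Line-based comparison: a moved line break still changes every line.
line_diff = changed([squeeze(l) for l in old.splitlines()],
                    [squeeze(l) for l in new.splitlines()])

# Word-based comparison: line breaks are just whitespace between words.
word_diff = changed(old.split(), new.split())

print(line_diff)
print(word_diff)
```

The line-based comparison reports the old line as removed and both new lines as added, while the word-based comparison reports only the inserted word.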

  7. Kapil Hari Paranjape wrote on 06/10/08 at 5:22 pm :

    That was my mistake. I thought this option made “diff” behave like “wdiff”.

    I have used “wdiff” in the past when my collaborator sent me a
    para-reformatted TeX file.

    Maybe one should create something like “git-wdiff”.

  8. ulrik wrote on 06/11/08 at 11:37 am :

    While I think you are slightly overoptimizing, using git you _can_ add a filter to convert between checkin and checkout. However, perhaps git diff --color-words can help your reviewing eyes! (That’s one of the few git features made only for actual plain text processing!)

  9. Damien Pollet wrote on 06/11/08 at 11:45 am :

    ulrik: heh, now I really wish I could push people to use git at work :)