Random: regex

sed shouldn't need an introduction. It is used so commonly that it almost always follows a shell pipe; it's as if shell pipes were invented for sed. For those who live outside Geekdom, sed is a command line utility that modifies data streams using regular expressions. It therefore deserves nothing less than to be called the stream editor, or sed. Data streams, you ask?! Please close this window!

I'm a seasoned command line and regex user. Hardly a day passes without me using regex utilities. I tend to be in autopilot mode around them. The closest phenomenon I can liken it to is somnambulism. Yes, I'm aware that it's a disorder; so can be my obsession.

One fine day (though not so in hindsight), I needed to join all the lines in a file into a single one, separated by spaces. The choice was obvious: use sed. And so I did. I proudly keyed in the following command line, relaxed and eyes closed, expecting a neat single line joining all the ones from the input file:

sed 's/\n/ /' input.txt

To my utter surprise, sed spat the input file out without any modification at all! I thought it was some kind of joke. I froze for a moment, befuddled, then tried the same command again and again, bordering on Einsteinian insanity.

Clearly this must be a joke. Come on. This is so basic. This is sed for kindergarten. I'm better than that! All I'm asking sed to do is replace newlines (represented, as anyone would agree, by \n) with a single space character. And it doesn't seem to do so. What's happening? What universe am I in?!

What went wrong

I felt stupid even Googling this problem, something I thought I could pull off before I even knew it. To my shame, the reason wasn't far away.

The reason the above command didn't work is not the regular expression. The real reason is how sed actually works: sed slurps in one line at a time from the input stream (my file, in this case) into the pattern buffer (the sed documentation calls it the pattern space), executes the given commands (in this case, substituting a newline with a space character), and prints the result. What sed uses to separate lines from each other is in fact the newline character (well, not a surprise, is it?). This means that by the time sed begins to execute its commands, the newline has already been removed from the buffer. In other words, what's presented to the commands is a line without the newline! So there's no newline in the buffer to replace in the first place. sed then, quite rightly, prints the buffer contents unmodified, putting the newline back after it. As a result, I get the same output as my input!
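A quick way to see this for yourself is sed's l command, which prints the pattern buffer in an unambiguous form and marks its end with a $. The sketch below uses a made-up two-line input fed through printf rather than a real file; the variable names are mine, purely for illustration.

```shell
# The naive substitution finds nothing to replace: output equals input.
unchanged=$(printf 'one\ntwo\n' | sed 's/\n/ /')
printf '%s\n' "$unchanged"
# one
# two

# 'l' shows what the commands actually see: each line, minus its newline.
per_line=$(printf 'one\ntwo\n' | sed -n 'l')
printf '%s\n' "$per_line"
# one$
# two$

# Only after N appends the next line does a newline appear in the buffer.
joined=$(printf 'one\ntwo\n' | sed -n 'N;l')
printf '%s\n' "$joined"
# one\ntwo$
```

The last command is a preview of the fix: the newline becomes visible to s only once more than one line sits in the buffer.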

Apparently this is so common that it found its way into the sed FAQ. Quite a depressing realization!

Right ways of doing it

It's now rather clear what's been happening. Since sed lets the commands operate on the pattern buffer alone, there appears to be no way for the commands to find or process a newline character therein. Time to give up? No. Here's where sed's scripting capabilities come in, however crude they appear to be. The FAQ page itself suggests alternatives, and I'm listing the inflated versions below.

Make use of sed buffers

One is this: keep appending all the lines read by sed (each without its newline, of course) to the pattern buffer in a loop, and do one grand substitution at the end. The N command, along with some branching commands, helps us here. That'll look like:

sed '
                  # Pattern buffer already has the first line from input

  :A              # Define a label named A to jump to

  N               # Append the next line from the input to the pattern buffer
                  # Note that this will add a newline in between the existing
                  # content and the next line read

  $!bA            # If not end of stream, jump to A. I.e. keep reading lines

  s/\n/ /g        # End of stream. Perform one grand substitution.
                  # Note the g flag so that all newlines in the buffer
                  # are replaced
' input.txt

The one-liner will appear a bit cryptic, like:

sed ':A;N;$!bA;s/\n/ /g' < input.txt
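Here is a small run of that one-liner on a made-up three-line input, so the result is visible. Separating commands with semicolons after a label is, as far as I know, something GNU sed tolerates but not every sed does, so the second form passes each command through its own -e, which I'd assume is the safer spelling elsewhere. The variable names are only for illustration.

```shell
# Semicolon-separated form, as above (assumes GNU sed).
result=$(printf 'one\ntwo\nthree\n' | sed ':A;N;$!bA;s/\n/ /g')
printf '%s\n' "$result"
# one two three

# Same script with one -e per command, so the label ends at the end of
# its own argument instead of at a semicolon.
portable=$(printf 'one\ntwo\nthree\n' | sed -e ':A' -e 'N' -e '$!bA' -e 's/\n/ /g')
printf '%s\n' "$portable"
# one two three
```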

Another solution uses sed's hold buffer instead to hold the lines temporarily. The hold buffer's contents can be exchanged with those of the pattern buffer for commands to operate on. This is functionally equivalent to the above, but obviates the need for writing a loop (although you don't write a loop, sed will do it implicitly).

sed '
                  # Pattern buffer already has the line from input

  H               # Append a newline and then the pattern buffer to hold buffer

  ${x;s/\n/ /g;q} # If end of stream: exchange the contents of both buffers,
                  # perform one grand substitution, and quit.
                  # Note the g flag so that all newlines in the buffer
                  # are replaced, and the q command to print and quit
                  # immediately. Otherwise the following 'd' would empty
                  # the pattern buffer before it gets printed.

  d               # Delete the pattern buffer without printing it. This
                  # forces the script to run again from the beginning
                  # with the next line (implicit loop)

' input.txt

In one-liner form, it'll look like:

sed 'H;${x;s/\n/ /g;q};d' input.txt

One minor itch with this is that there'll be a stray space character at the beginning of the output line. That's because the first H command puts a newline in front of the first line, which the substitution then turns into a space. No big deal: the stray space can be killed like this, if strictly required.

sed 'H;${x;s/\n/ /g;s/^ //;q};d' input.txt
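To make the stray space visible, the sketch below wraps the output in square brackets, using a made-up three-line input. It also shows a variation that is not from the FAQ excerpt above: copying the first line with h (which overwrites the hold buffer) and appending only the later ones with H, so the leading newline never gets created. Treat it as one possible spelling, not the canonical one.

```shell
# Original hold-buffer version: note the leading space inside the brackets.
with_space=$(printf 'one\ntwo\nthree\n' | sed 'H;${x;s/\n/ /g;q};d')
printf '[%s]\n' "$with_space"
# [ one two three]

# Variation: 'h' on line 1, 'H' on every other line, so no leading newline.
# '$!d' discards every line but the last; on the last line the buffers are
# swapped, the newlines replaced, and the result printed automatically.
no_space=$(printf 'one\ntwo\nthree\n' | sed '1h;1!H;$!d;x;s/\n/ /g')
printf '[%s]\n' "$no_space"
# [one two three]
```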

This one-liner does look longer than the previous one, however. It is nevertheless an alternative.

Use alternate tools

There's a school of thought that one shouldn't use (or rather, misuse) the power of sed for puny tasks like this, especially when there are utilities built exactly for simple substitution. tr, short for translate, is one such utility. It maps each character in one set to the corresponding character in another set, which is essentially a substitution. To cut to the chase, the following does in fewer characters what all of the above do:

tr '\n' ' ' < input.txt

Minor nit: a true follower would have noticed something unusual in the output of tr. It doesn't end its output with a newline (sed and most other utilities do), because the final newline gets translated into a space like every other one. So your shell prompt will end up on the same line as the output, after a trailing space. That's no big deal if you use this in scripts or capture the output in a variable. Let's move on.

Apart from simplicity, another incentive for using tr is that some sed implementations might have a limit on the pattern buffer length, which means very long input lines might get truncated. tr is immune to that trouble as it can work with small chunks of the input stream, whereas sed commands expect an entire line to be present in the pattern buffer.
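The trailing-character nit is easier to see with a marker printed right after the output. The sketch below does that with a made-up three-line input and a | as the marker. It also shows paste, which is my addition and not something discussed above: with -s it joins the lines of its input using the delimiter given to -d, and it ends its output with a newline and no trailing space.

```shell
# tr turns the final newline into a space too, so the marker lands on the
# same line, right after a trailing space.
tr_out=$(printf 'one\ntwo\nthree\n' | tr '\n' ' '; printf '|')
printf '%s\n' "$tr_out"
# one two three |

# paste -s joins the lines with a space and keeps the final newline,
# so the marker lands on the next line.
paste_out=$(printf 'one\ntwo\nthree\n' | paste -sd ' ' -; printf '|')
printf '%s\n' "$paste_out"
# one two three
# |
```

If the missing newline after tr bothers you at an interactive prompt, following the command with a plain echo is the cheapest fix.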

Takeaway

Preaching time: personally, it was a reminder for me to refresh things once in a while. Another is that one should be wary of looking for newlines in line-oriented regex matching; doing so would be like looking for the wrapper inside the candy. Tools that do support matching across lines tend to make it explicit: Perl, for example, has an m flag for multi-line regexes.

And finally: sed one-liners that achieve anything useful tend to appear quite cryptic. It's difficult to resist the temptation to show off one's sed prowess, but it's always helpful to write multi-line sed scripts with comments, like above. If nothing else, the future you will thank you.
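One way to act on that advice is to keep the script in its own file and run it with sed -f. The sketch below writes the join-lines loop to a temporary file (the file is created with mktemp and the input is made up, both just for illustration) and keeps every comment on its own line, which I believe is the more portable placement since some seds read a label up to the end of the line.

```shell
# Write a commented sed script to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
# Join all input lines into one, separated by spaces.
:A
# Append the next input line (preceded by a newline) to the pattern buffer.
N
# Unless this is the last line, jump back to A and keep reading.
$!bA
# End of stream: replace every embedded newline with a space.
s/\n/ /g
EOF

# Run the script file against a small sample input.
file_result=$(printf 'one\ntwo\nthree\n' | sed -f "$script")
printf '%s\n' "$file_result"
# one two three

rm -f "$script"
```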

Happy sedding.

Dec 29, 2013

A sed surprise
