How To: Perl. One-Liners to Process Text Files

Greetings NullByte Community.

You may have heard that Perl is great for text processing. Some people say that the name is an acronym for "Practical Extraction and Report Language". Others suggest that the name stands for "Pathologically Eclectic Rubbish Lister". The truth is that Perl is not an acronym.

Anyway, in this post, we will see how to eclectically extract and report using Perl. In other words, we will see different ways to process text files so you can choose the one that suits you best.

You have likely already read the great tutorials from occupytheweb. In those tutorials you wrote your Perl script in a file. That's the way to go for any script bigger than one line. However, you can do a lot in just one line of Perl code.

THE -e FLAG

The perl interpreter accepts a bunch of interesting flags. The -e flag instructs the interpreter to execute the string that follows as Perl code. Without it, the interpreter treats that argument as a file name and tries to execute the contents of that file.

So the infamous Hello World can be written like this:

perl -e 'print "Hello World!\n";'

This is equivalent to opening a text editor, typing the shebang line and then the print command, and running the resulting file.

PROCESSING TEXT FILES

The simplest text processing tool we can think of is the cat command. As you know, the cat command just dumps the content of a file to the standard output (man cat). No? OK, sure. It actually conCATenates files... that's what it really does.

So, a possible implementation of the cat command in Perl may look like this (you can write this same thing in many different ways... that's the Perl motto: there's more than one way to do it).

perl -e 'while (<>) {print;}' file.txt

OK, that's probably way too cryptic. What about this:

perl -e 'while ($line = <>) { print $line;}' file.txt

This looks a lot simpler, but it still needs some explanation. The first thing you may have noticed is the so-called Diamond Operator (the <> thing inside the while condition). This operator is used in Perl to read from files. If you do not provide a filehandle (something to put in between the angle brackets), it behaves as follows:

  • If you provide one or more files on the command line, the diamond operator will read those files one after the other, line by line. Pretty handy, eh?
  • If no file is provided on the command line, it will fall back to the standard input, and then you can type your text in the console or get the output from another process using a pipe.

Note that if you are using the standard input (second option) and typing (or maybe copying and pasting) your data, you will have to send the EOF (End Of File) mark manually. Just press CTRL+D when you are done (CTRL+C would also work... but that really means something else).

DEFAULT VARIABLES

Let's get back to our former version. The cryptic one:

perl -e 'while (<>) {print;}' file.txt

What is happening here is that Perl is making use of the default variable $_. When a line read such as <> is the only thing in a while condition and you do not specify a variable to store the line, it gets stored in the $_ variable.

Analogously, when a Perl function such as print expects a parameter and it is not provided, it will, generally, use the content of $_. With this in mind, we can rewrite our script like this:

perl -e 'while ($_ = <>) {print $_;}' file.txt

GETTING SERIOUS. GREP

Now we have learned enough to go to the next stage and write something more useful. Let's go for the grep command. Again, let's first take a look at the one-liner and then go into the details.

perl -e 'while (<>) { print if /regular-expression/;}' textfile

Let's rewrite it once again in a more readable form:

perl -e 'while ($_ = <>) { if ($_ =~ /regular-expression/) {print $_;}}' textfile

Yep, Perl gives you quite a bit of syntactic sugar when you write your scripts. A discussion of regular expressions is out of scope here, and it has actually been covered in another tutorial. So, let's go on.

Note: In this case, the =~ operator is used to match $_ against the regular expression. Actually it does more than that, but we will not talk about that now.

THE -n FLAG AND READING

This is a good point to take a break and talk again about the command-line flags. Let me introduce you to the -n flag. This one is pretty cool. Basically, it implements the while part of our previous one-liner. Using this flag, our grep version becomes:

perl -ne 'print if /regular-expression/;' file.txt

There are quite a few other interesting flags to discover. Just type:

man perlrun

And start reading.

If the man page does not appear, you probably have to install the Perl documentation. On Debian-based systems: sudo apt-get install perl-doc

The man pages are great (type man perl to get the table of contents). They contain useful examples and you will learn a lot of things, not just Perl. If you try them and feel it is still too much for you, then you had better go for the Llama Book (Learning Perl). This book is great for learning the basics; I taught myself with it. Do not go for the Camel Book (Programming Perl) directly. It is another must-read if you want to play with Perl, but it is not the best choice for a beginner.

Enough about literature. Let's finish our introduction to Perl one-liners.

THE SPLIT FUNCTION

You can do quite a lot of stuff using just regular expressions (man perlretut). However, you are usually processing data with some kind of uniform format: many fields delimited by some separator, such as a space or a colon. Regular expressions can easily become a bit messy in those situations, and that is where the split function is your friend.

The split function splits up a string and places the result in an array. Let's suppose that we want to check if there is more than one user with id 0 (root) in the system. We can parse the /etc/passwd file and check the user id for each entry. Something like this:

perl -ne '@f = split (/:/); print "$f[0]\n" if $f[2] == 0;' /etc/passwd

Split takes a regular expression as its first parameter, which specifies the field delimiter to use. If we do not provide a separator, it splits on whitespace. Actually, the function accepts two more parameters but... you can figure that out yourself and have some fun.

Just two comments to better understand this last one-liner. In Perl, any variable starting with the symbol @ is an array. In this case, each entry in the array is a scalar variable (a string or a number; Perl will figure out how to deal with it), so you have to use the $ character to access its contents. In other words, $f[2] is the third element of array @f (we start counting at zero, unless the long-deprecated $[ variable is set on an old Perl).

FINAL WORDS

Even though this tutorial was pretty basic, I hope it sparks your curiosity to look deeper into this amazing language. It is really powerful, and if you decide to spend some time getting used to it, you will love it forever.

If Perl is not for you, then you can go for AWK (actually you should go for it anyway). AWK is actually an acronym, composed of the initials of its creators' last names. The A is from Mr. Aho, co-author of the Dragon Book (another must-read), the K is from Mr. Kernighan (AKA God...) and the W is from Mr. Weinberger.

If you are curious, our last script written in AWK looks like this:

awk -F ":" '$3==0 {print $1}' /etc/passwd

2 Comments

Nice! I'll need this in the second half of the year.
