Perl split function - process every word in a file

Problem: You’re developing a Perl program, and you need to process every “word” in a text file.

Solution: How you do this depends on what you mean by “every word,” but I’m going to go with a very simple definition: a “word” is any run of characters surrounded by whitespace. With that definition, I can use the Perl split function to break each line into words.

Here’s the source code for a Perl program that reads its input from STDIN (standard input); uses the Perl split function to split each input line into a list of words; loops through those words using a Perl for loop; and finally prints each word from within the for loop:

#!/usr/bin/perl
#
# purpose: this is a perl program that demonstrates
#          how to read file contents from STDIN,
#          use the perl split function to split each line in
#          the file into a list of words, and then print each word.
#
# usage:   perl this-program.pl < input-file

use strict;
use warnings;

# read each line from STDIN (or from any files named on the command line)
while (<>)
{
  # with no arguments, split breaks $_ on runs of whitespace
  # and ignores leading whitespace
  for my $word (split)
  {
    # do whatever you need to here. in my case
    # i'm just printing each "word" on a new line.
    print "$word\n";
  }
}
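
To make the whitespace splitting concrete, here is a small sketch comparing a bare split (which is shorthand for split ' ', $_) with an explicit /\s+/ pattern. The sample line and the variable names are my own, chosen only for illustration:

```perl
use strict;
use warnings;

# Sample line (made up for this sketch) with leading spaces,
# repeated spaces, a tab, and a trailing newline.
my $line = "  Four score   and\tseven years\n";

# split ' ' is what a bare split does to $_: it skips leading
# whitespace and splits on runs of whitespace.
my @awk_style = split ' ', $line;

# An explicit /\s+/ pattern keeps an empty leading field when
# the line starts with whitespace.
my @regex_style = split /\s+/, $line;

print scalar(@awk_style), " fields: @awk_style\n";
print scalar(@regex_style), " fields, the first one is empty\n";
```

For word-by-word processing, the bare split form is usually the more convenient of the two, because an indented line does not produce an empty first “word.”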

As mentioned above, this Perl program reads from STDIN (standard input), so the script should be run like this:

perl this-program.pl < input-file

Or, if you make the file executable on a Linux or Unix system using chmod, you can run this Perl script like this (the leading ./ is needed unless the script’s directory is on your PATH):

./this-program.pl < input-file

Sample output (from our Perl split and stdin example)

When I run this Perl script against a text file that contains the contents of the Gettysburg Address, and then use the Unix head command to show the first 30 lines of output, I get these results:

prompt> perl process-every-word-file.pl < gettysburg-address | head -30
Four
score
and
seven
years
ago
our
fathers
brought
forth
on
this
continent
a
new
nation,
conceived
in
Liberty,
and
dedicated
to
the
proposition
that
all
men
are
created
equal.

As you can see, when you split a line on whitespace with the Perl split function, some “words” end up with punctuation attached, like the commas in “nation,” and “Liberty,” or the period in “equal.” You can strip those characters out with regular expression patterns, but for now I’m out of time, and I’m going to have to leave that as an exercise for the reader.
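
As a starting point for that exercise, here is a minimal sketch of one possible approach. The strip_punctuation helper is hypothetical (it is not part of the script above), and it assumes plain ASCII text where “punctuation” means any non-word character at the start or end of a token:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): remove
# leading and trailing non-word characters from one token.
# Characters inside the token, like the apostrophe in "don't",
# are left alone.
sub strip_punctuation {
    my ($word) = @_;
    $word =~ s/^\W+|\W+$//g;
    return $word;
}

# Sample line taken from the output above.
my $line = "new nation, conceived in Liberty, and dedicated";

for my $word (split ' ', $line) {
    my $clean = strip_punctuation($word);
    next if $clean eq '';    # the token was punctuation only
    print "$clean\n";
}
```

In the real script you would call the helper inside the existing for loop instead of using a fixed sample line. Note that \W treats the underscore as a word character, so you may need a different pattern depending on your definition of a “word.”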