Speed Thrills : CGI Please ... and Fast!
CGI Programming Made (Relatively) Easy Using Libraries
Server-Side Includes and its Extensions
Random and Recursive Crypting using Salt on Unix and Win32
Creating a mailing list using Perl
Reading and Writing to Files on the Server
CGI - Server Side Processing of Form Data
Published on: Sunday 7th June 1998 By: Jason Nugent
After a brief hiatus into the world of style sheets, I have returned to the planet of CGI. In the last CGI article, we looked at, among other things, how the browser encodes data so that it can be sent across the internet to your server. We also looked briefly at the simplest of Perl scripts in order to get familiar with the language. In this article, I will detail the steps required to extract the information that is sent to your CGI program. The article is fairly Perl intensive, but hopefully the examples will make it easy enough to follow.
Just to recap, GET and POST are the two methods by which a browser can send form data to the server. GET encodes the data and then appends it to the end of the URL. Typically, these URLs look like this:
http://www.server.com/cgi-bin/script.pl?item1=Jason&item2=Nugent
Here, each name is joined to its value with an equals sign, and the name/value pairs are separated from one another with ampersands. Also, remember that any special characters present in this URL which might not normally be allowed (like spaces, slashes, and tildes) are hexadecimally encoded.
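To make the encoding concrete, here is a minimal sketch of how a value could be encoded by hand. The subroutine name url_encode is hypothetical (it is not a Perl built-in), and the set of characters it leaves alone is a simplifying assumption; real browsers follow their own rules.

```perl
use strict;
use warnings;

# url_encode is a hypothetical helper, written only for illustration.
sub url_encode {
    my ($text) = @_;
    # Replace every character that is not a letter, digit, underscore,
    # dot, hyphen or space with a percent sign and its two-digit hex code.
    $text =~ s/([^A-Za-z0-9_.\- ])/sprintf("%%%02X", ord($1))/eg;
    # Form data encodes spaces as plus signs.
    $text =~ tr/ /+/;
    return $text;
}

my $encoded = url_encode('~jason/my file');
print "$encoded\n";    # prints %7Ejason%2Fmy+file
```

The tilde becomes %7E, the slash becomes %2F, and the space becomes a plus sign, which is exactly what the decoding code later in this article has to undo.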
With POST, the form's name/value pairs are sent to the server AFTER the request headers have been sent. The content of your form does not get appended to the URL as it does with the GET method; instead, it is sent to the script via standard input (STDIN). Knowing which method was used is the first key step in acquiring the data sent from your form to your CGI program.
A number (quite a few, actually) of environmental variables are set by the web server when a CGI program is run, and Perl makes them available to your script. The values of these variables depend greatly on the environment in which the CGI program is called, and you can use them to greatly enhance your Perl script. First, however, let's take a look at the data structures used in Perl.
The three that are most commonly used are scalars, arrays of scalars, and associative arrays of scalars, or hashes.
Scalars can store a single piece of information and take on the form
$variable_name
where variable_name is the name of the variable. Variable names are case sensitive, so $my_variable and $MY_variable are two different variables. It is usually good practice to keep all your own variable names lowercase. By convention, all-uppercase names are used for Perl's special variables and filehandles (such as %ENV and STDIN, which we will meet shortly), so if you consistently use lowercase your names will stay clearly separate from them.
There are no restrictions as to what you can store in a variable.
$var = "This is my variable";
$var = "This string has numbers in it! 2452";
$var = 5;    # note that you don't need to put numbers in quotes
The second type of variable is an array. Arrays are very similar to arrays in JavaScript, in the sense that you can store many different pieces of information in them and reference a specific one using a numeric index. Unlike scalars, array variable names begin with an "@" symbol. The following line creates an array which holds two pieces of information:
@array = ('This is some text', 'This is some more text');
Arrays begin counting from zero, so to get access to the first cell of this array, the following notation would be used:
print $array[0]; # prints 'This is some text'
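Notice that a cell in an array is referenced using a $, not a @. This is because each cell of an array is in fact a regular scalar variable. Writing @array[0] is not what you want here: Perl treats it as a one-element array slice, and will warn you about it when warnings are switched on.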
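As a small sketch of the notation (the variable names $first, $final and $count are my own, chosen just for this example), here is the same array with a few common ways of getting at its contents:

```perl
use strict;
use warnings;

my @array = ('This is some text', 'This is some more text');

my $first = $array[0];         # a single cell is a scalar, so it takes a $
my $final = $array[-1];        # a negative index counts back from the end
my $count = scalar(@array);    # an array in scalar context gives its length

print "First cell: $first\n";
print "Last cell: $final\n";
print "Number of cells: $count\n";
```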
The last type of data structure is an associative array, or hash. I like to use the word 'hash' because it is easier to type. Seriously. Hashes are a lot like arrays of scalars in the sense that they also contain scalar variables, but they are not indexed with numbers. Hashes are referenced using key strings. A hash is referenced not with a $ or a @, but with a %. So, the following is a valid hash name:
%hash_name;
Note that scalars, arrays of scalars, and hashes all exist in separate namespaces, which means that you can use the same variable name for all three. The following is valid:
$variable; @variable; %variable;
Although, for obvious reasons, it is not recommended. It is better to keep things separate and avoid the confusion.
Back to hashes. To reference a specific cell in a hash, you need to know the key which indexes it. For example, you might have a hash named %Jason, which has two keys, 'firstname' and 'lastname':
%Jason;    # the name of the hash
%Jason = (firstname => 'Jason', lastname => 'Nugent');
# or also
%Jason = ('firstname', 'Jason', 'lastname', 'Nugent');
print $Jason{'firstname'};    # prints 'Jason'
print $Jason{'lastname'};     # prints 'Nugent'
Note that, like scalar arrays, individual cells of the hash are referenced using a $, since each cell is, in fact, a scalar variable.
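Here is a short sketch of a few other things you can do with a hash. The hash name %person and the key 'middlename' are made up for the example; exists() and keys() are standard Perl functions.

```perl
use strict;
use warnings;

my %person = (firstname => 'Jason', lastname => 'Nugent');

# Hash cells can be dropped straight into a double-quoted string.
my $full_name = "$person{'firstname'} $person{'lastname'}";

# exists() tells you whether a key is present without creating it.
my $has_middle = exists $person{'middlename'} ? 'yes' : 'no';

# keys() in scalar context gives the number of keys in the hash.
my $key_count = scalar(keys %person);

print "Full name: $full_name\n";
print "Has a middle name: $has_middle\n";
print "Number of keys: $key_count\n";
```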
The whole point of that introduction to variables was so everyone can understand the environmental variable hash. This hash is called %ENV and contains elements that are set (usually) when a CGI program is invoked from a browser. One of the most important ones when dealing with information submitted from a form is the QUERY_STRING environmental variable. This variable contains the information submitted when a GET request method is used, and is the part of the URL which appears after the question mark. So, a GET request for:
http://www.server.com/cgi-bin/script.pl?item1=Jason&item2=Nugent
will set QUERY_STRING to
item1=Jason&item2=Nugent
To reference the QUERY_STRING environmental variable in Perl, use:
$ENV{'QUERY_STRING'};
Generally, the first thing that you would do is to copy the contents of QUERY_STRING to another variable which has an easier name to work with.
$form_info = $ENV{'QUERY_STRING'};
and then you can work with $form_info instead.
Since GET and POST submit their information to the CGI program in different ways, you have to figure out which one was used before you can attempt to do anything with it. If it was GET, then you can get your information from QUERY_STRING. If it was POST, QUERY_STRING will not hold your form data, because a different mechanism is used to submit information. To determine whether it was GET or POST, we need to look at another environmental variable - REQUEST_METHOD. This variable contains either GET or POST, depending on what was used to submit the information.
$method = $ENV{'REQUEST_METHOD'};
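If you want to experiment with these variables from the command line, a rough sketch like the following may help. Outside a web server the CGI variables are normally unset, so the sample values assigned here are an assumption made purely for the demonstration.

```perl
use strict;
use warnings;

# Sample values standing in for what the web server would normally set.
$ENV{'REQUEST_METHOD'} = 'GET';
$ENV{'QUERY_STRING'}   = 'item1=Jason&item2=Nugent';

my $method    = $ENV{'REQUEST_METHOD'};
my $form_info = $ENV{'QUERY_STRING'};

print "Method: $method\n";
print "Query string: $form_info\n";
```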
If the request method is POST, QUERY_STRING will not hold your form data. Instead, your form input will come in via STDIN and must be read from there. Fortunately, Perl makes this easy.
To read the required amount of text from STDIN we make use of Perl's read() function. In order to know how much information to read, we have to use one more environmental variable, called CONTENT_LENGTH. This variable contains the number of bytes of information sent by the browser after the request headers. The following code will examine the environmental variable and then read in information from STDIN until it has all of what was sent.
$length = $ENV{'CONTENT_LENGTH'};    # set it to something easier
read(STDIN, $data, $length);
The read() function will read the required amount of information from STDIN, and store it in the variable $data. To be more particular, read() reads information from a "filehandle", which, in this case, is STDIN. Typically, filehandles are used to manipulate files that have been opened, and are also used to redirect output around your Perl program. In this example, STDIN is a filehandle which is used as a pipe to information coming in to your script from the server.
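To see read() at work without a web server, here is a sketch that reads from an in-memory filehandle instead of STDIN. The variable names ($body, $fake_stdin, $bytes_read) are my own, and using the length of the sample string in place of CONTENT_LENGTH is an assumption for the demonstration.

```perl
use strict;
use warnings;

# Sample POST body; in a real CGI run this would arrive on STDIN.
my $body = 'item1=Jason&item2=Nugent';

# Open an in-memory filehandle over the string so the sketch runs
# without a web server.
open(my $fake_stdin, '<', \$body) or die "Cannot open in-memory handle: $!";

my $length = length($body);    # stands in for $ENV{'CONTENT_LENGTH'}
my $data   = '';

# read() returns the number of bytes it actually managed to read.
my $bytes_read = read($fake_stdin, $data, $length);
close($fake_stdin);

print "Read $bytes_read bytes: $data\n";
```

Checking the return value of read() against the length you asked for is a cheap way to notice a request body that was cut short.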
How do I Write a Script that does Both?
This is a great question, and most certainly a valid one. If your script can handle both request methods, you are free to change the type used in your form without having to modify your script. To do this requires one more Perl technique - a control structure. In particular, we are going to look at the if statement.
The if statement in Perl is very similar to the one used in JavaScript. Generally, it is made up of three sections:
if ( some expression is true ) {
    then perform code here;
}
So, for example, we might have something like this:
$x = 4;
if ($x < 5) {    # this will evaluate to true, since $x is 4
    print "x is less than 5";
}
In our case, we must use an if statement to determine whether GET or POST was used as our request method. So, we have
$method = $ENV{'REQUEST_METHOD'};
if ($method eq "GET") {
    # perform GET method code here
} else {
    # perform POST method code here
}
This section of code will check to see what method is used and then decide how to go about getting our information from the server. If we wrap everything up, we now have this:
$method = $ENV{'REQUEST_METHOD'};
if ($method eq "GET") {
    $string = $ENV{'QUERY_STRING'};
} else {
    read (STDIN, $string, $ENV{'CONTENT_LENGTH'});
}
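One possible way to package this logic is as a subroutine, sketched below. The name get_raw_form_data is hypothetical, and the fallbacks for missing variables (treating an unset method as GET and an unset length as zero) are my own assumptions rather than anything the CGI mechanism requires.

```perl
use strict;
use warnings;

# get_raw_form_data is a hypothetical name chosen for this sketch.
sub get_raw_form_data {
    my $method = $ENV{'REQUEST_METHOD'} || 'GET';
    my $string = '';
    if ($method eq 'GET') {
        $string = $ENV{'QUERY_STRING'};
        $string = '' unless defined $string;
    } else {
        # Only try to read when the server reported a body length.
        my $length = $ENV{'CONTENT_LENGTH'} || 0;
        read(STDIN, $string, $length) if $length > 0;
    }
    return $string;
}

# Sample values so the sketch can be run from the command line.
$ENV{'REQUEST_METHOD'} = 'GET';
$ENV{'QUERY_STRING'}   = 'item1=Jason&item2=Nugent';

my $raw = get_raw_form_data();
print "$raw\n";
```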
Alright, now we have the information that the server passed along from the form. As you may recall, we now have to decode it. First, let's take a look at a sample QUERY_STRING.
item1=Jason&item2=Nugent&item3=Jason+Nugent
This string has three items, and each one has a value associated with it. The first logical step would be to separate them into separate name/value pairs. Perl can accomplish this step quite easily using the split command. Consider the following code:
@name_value_pairs = split (/&/, $ENV{'QUERY_STRING'});
This line will take our query string variable and split it on every occurrence of an ampersand, and store each fragment in the array @name_value_pairs. What we will have after performing this on the above string is:
item1=Jason
item2=Nugent
item3=Jason+Nugent
each stored separately in @name_value_pairs. Life becomes a bit more complicated at this point. How do we separate each name/value pair, and still keep track of them all? The answer is, of course, by using a hash. If you create a hash and use the names as the key strings, you can easily reference each item submitted by the form. So, we have to loop through the array and work on each pair, one at a time. For this, we use the foreach loop.
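As a quick sketch of the splitting step on its own, the following takes the sample query string (held in a variable I have called $query for the example) and prints each fragment on its own line:

```perl
use strict;
use warnings;

my $query = 'item1=Jason&item2=Nugent&item3=Jason+Nugent';

# Break the string apart at every ampersand.
my @name_value_pairs = split(/&/, $query);

# Visit each fragment in turn.
foreach my $pair (@name_value_pairs) {
    print "$pair\n";
}
```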
In the foreach loop, $pair represents each consecutive value in the @name_value_pairs array and changes to reflect the new value each time the loop is cycled through.
The =~ operator is Perl's binding operator. It tells a match, a substitution (s///), or a transliteration (tr///) to work on the variable to its left; for a substitution or transliteration, the changed text is stored straight back into that variable. In this case, every + sign found in $value is replaced with a space, and the new text is put back into the $value variable.
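A small sketch may make the behaviour of =~ clearer. The variable names $has_plus and $replaced are my own; the point is only to show that a plain match leaves the string alone while tr/// changes it in place.

```perl
use strict;
use warnings;

my $value = 'Jason+Nugent';

# A plain match only tests the string; $value is left unchanged.
my $has_plus = ($value =~ /\+/) ? 'yes' : 'no';

# tr/// changes $value in place and returns how many characters it replaced.
my $replaced = ($value =~ tr/+/ /);

print "Contained a plus: $has_plus\n";
print "Characters replaced: $replaced\n";
print "New value: $value\n";
```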
Consider this code:
%form_results;    # this is the hash which will store the form results

foreach $pair (@name_value_pairs) {

    # split each pair on the equal sign, and store each
    # fragment in $key and $value
    ($key, $value) = split (/=/, $pair);

    # a quick transliteration to convert pluses back to spaces
    # in your $value. Jason+Nugent becomes Jason Nugent, which is good.
    $value =~ tr/+/ /;

    # the next line will take a bit of explaining.
    # See outside the code for it
    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

    # The next line adds a new "cell" to the %form_results hash, indexed
    # by $key, with a value of $value. As an example, consider a
    # name/value pair called colour=blue. The line below would create a
    # new entry in the hash indexed by the string "colour", with a value
    # of "blue".
    $form_results{$key} = $value;    # store the value in the
                                     # %form_results hash, indexed
                                     # according to the key
}
That foreach loop cycles through each name/value pair, splits it apart on the equal sign, converts pluses to spaces, and then the substitution line DECODES any hexadecimally encoded characters remaining in the string. It's a fun little regular expression, so let's take a look:
s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg
First off, the s/ / / indicates that you are performing a substitution on a string. The regular expression between the first two slashes contains the pattern that you want to match, and the section between the last two slashes contains the code that you want to replace the matched text with.
The /eg modifiers at the end of the substitution operator influence how the operation is performed. In this case, the "e" represents the fact that we are replacing the matched text with Perl code that needs to be "evaluated" before the substitution can take place. The /g makes the operation "global". More on these two below.
Let's look at the match regex first:
%([\dA-Fa-f][\dA-Fa-f])
This regular expression begins its match with a percent sign, which marks the start of a hexadecimally encoded character. It is then followed by:
([\dA-Fa-f][\dA-Fa-f])
So let's look at that. The interior of this section contains [\dA-Fa-f][\dA-Fa-f], or two identical [\dA-Fa-f] classes. A class will match a SINGLE instance of one of the characters which appears inside it. In this case, [\dA-Fa-f] will match a single digit (referenced by the \d, and equivalent to 0-9), OR a single capital letter from A to F (A-F), OR a single lowercase letter from a to f (a-f). Each one of the [\dA-Fa-f] classes can match one of these, so this regular expression will match things like
%7E or %AA or %Af or %b6 etc.
Note that this section of the regular expression is enclosed in parentheses (). This is important later on, since if a match is found, the text matched by the parenthesized section is stored in a variable called $1, because it is the first parenthesized section. If a second set of parentheses were used, the text it matched would be stored in $2, and so on.
So, let's look at the replacement part of the substitution.
pack ("C", hex ($1))
Note that this is NOT a regular expression. The replacement part of the substitution operator cannot contain a regular expression. It can, however, contain executable Perl code, which is why there is an "e" after the operator. Note the nested hex() function. It takes the matched text (something like 7E, for example, stored in $1), interprets it as a hexadecimal number, and returns its numeric value (126 for 7E). This number is then passed as an argument to the pack() function, which also takes a "C" argument in this case. The C tells pack to return an "unsigned char value", which will be the original unencoded text character entered by the user in the form. The final "g" modifier on the substitution operator means perform the operation globally, so all the encoded characters get changed back. If this was left off, the substitution would stop after the first match. Neat, eh?
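To watch the pieces work one at a time, here is a short sketch. The sample string '%7Ejason%2Fdocs' and the variable names are invented for the example; hex() and pack() are the same built-in functions used in the substitution.

```perl
use strict;
use warnings;

# Step by step: hex() turns the two hex digits into a number,
# and pack("C", ...) turns that number into a single character.
my $number = hex('7E');             # 126
my $char   = pack('C', $number);    # the tilde character

# With /g, every encoded character is decoded.
my $decoded = '%7Ejason%2Fdocs';
$decoded =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack("C", hex($1))/eg;

# Without /g, only the first encoded character is decoded.
my $partial = '%7Ejason%2Fdocs';
$partial =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack("C", hex($1))/e;

print "$number $char\n";    # prints 126 ~
print "$decoded\n";         # prints ~jason/docs
print "$partial\n";         # prints ~jason%2Fdocs
```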
So now, after all this, we have a hash called %form_results which contains all the information submitted to the CGI program. You can now do whatever you want with these values - print them back to the user, store them in a file on the server, or put them in a cookie. For now, we are just going to print them back out to the browser.
Remember, the first line printed out from our CGI script must tell the browser what type of file is to follow. By setting the MIME type to "text/html", the browser knows what to do with the information it is receiving from the server.
So, all we would have to do is:
print "Content-type: text/html\n\n";    # need this! Must be the first
                                        # printed line
foreach $key (sort keys(%form_results)) {
    print "$key has value $form_results{$key}\n";
}
A final word of explanation here - since the foreach loop cycles through a regular array, it can't handle a hash. It can, however, handle a list, which is why the keys() function is used. keys() returns a list of all the keys of a hash, which then gets passed to the sort() function. sort() puts them in some semblance of alphabetical order, and then the foreach loop easily handles them.
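As a sketch of what keys() and sort() hand to the loop, the example below fills a hash with sample values (deliberately out of order, and chosen only for illustration) and captures the sorted key list in an array I have called @sorted_keys:

```perl
use strict;
use warnings;

# Sample data, deliberately listed out of order.
my %form_results = (
    item2 => 'Nugent',
    item1 => 'Jason',
    item3 => 'Jason Nugent',
);

# keys() returns the keys in no particular order; sort() orders them.
my @sorted_keys = sort keys(%form_results);

foreach my $key (@sorted_keys) {
    print "$key has value $form_results{$key}\n";
}
```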
#!/usr/local/bin/perl -w

############################
##  Form Parser script    ##
##  Jason Nugent, 1998    ##
############################

my $method = $ENV{'REQUEST_METHOD'};

if ($method eq "GET") {
    $text = $ENV{'QUERY_STRING'};
} else {    # default to POST
    read(STDIN, $text, $ENV{'CONTENT_LENGTH'});
}

my @value_pairs = split (/&/,$text);
my %form_results = ();

foreach $pair (@value_pairs) {
    ($key, $value) = split (/=/,$pair);
    $value =~ tr/+/ /;
    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;
    $form_results{$key} = $value;    # store the value in the results hash, indexed by key
}

# next line sets the MIME type for the output
print "Content-type: text/html\n\n";

# loop through the results and print each key/value
foreach $key (sort keys(%form_results)) {
    print "$key has value $form_results{$key}<BR>\n";
}
You can try out the working example for yourself.
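If you would like to exercise the parsing logic without a web server, one possible approach is to move it into a subroutine and feed it a string directly. This is only a sketch: the name parse_query_string is hypothetical, and decoding the key as well as the value, skipping empty fragments, and passing a limit of 2 to split are my own choices rather than part of the script above.

```perl
use strict;
use warnings;

# parse_query_string is a hypothetical name used only in this sketch.
sub parse_query_string {
    my ($text) = @_;
    my %results;
    foreach my $pair (split(/&/, $text)) {
        next unless length $pair;    # skip empty fragments such as "a=1&&b=2"
        # A limit of 2 keeps any further equals signs inside the value.
        my ($key, $value) = split(/=/, $pair, 2);
        $value = '' unless defined $value;
        # Decode both the key and the value in place.
        foreach ($key, $value) {
            tr/+/ /;
            s/%([\dA-Fa-f][\dA-Fa-f])/pack("C", hex($1))/eg;
        }
        $results{$key} = $value;
    }
    return %results;
}

my %form = parse_query_string('item1=Jason&item3=Jason+Nugent&home=%7Ejason');
foreach my $key (sort keys %form) {
    print "$key has value $form{$key}\n";
}
```

Because the subroutine takes its input as an argument, the same code can be fed either QUERY_STRING or the text read from STDIN.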
I think the next article will be a combination of two things - Server Side Includes (since everyone seems to want to use them), and also a much needed discussion on CGI security and how you can do your part. I figure that a bit of security talk will help people think clearer about the dangers of writing a poor CGI script. Till then!