Addressing Form Field Validation with Regular Expressions and JavaScript 1.2

Published on: Sunday 16th November 1997 By: Jason Nugent

Introduction
The dangers of CGI
Using JavaScript 1.0 Validation
Using JavaScript 1.2 and Regular Expressions
Working Example
Source Code
Further Information

Introduction

As most of us are aware, a form on a website is an effective way to gather information from a visitor. Information can be requested, mailing lists can be subscribed to, and comments and feedback can be submitted. For those of you who have not yet implemented a form on your website and are wondering about the HTML syntax involved, I will first go over those details. If you are already familiar with forms, you may skip this section and move on to the next.

A form on a website is embodied inside the <FORM> tag, so we will look at this one first and in some detail. Like most HTML tags, the <FORM> tag takes a number of attributes related to the form itself. It possesses the following syntax:

<FORM attribute1=".." attribute2=".."> ... </FORM>

where the attributes can be one (or more) of the following:

ACTION=".." - this attribute specifies the location of the CGI script or email address that the information will be sent to once the form is submitted. If you want your form to do anything useful, this is a required field.
METHOD=".." - this attribute specifies how the form information is sent. Possible values are GET and POST. This attribute is not required, and defaults to GET if left out. With GET, the form information is appended to the URL of the CGI script, while with POST, the form information is sent as an encoded string in the body of the HTTP request.
TARGET=".." - this attribute specifies the frame or window in which the returned results will be loaded - a useful feature if your CGI script displays a thank-you message and you don't want it to replace the page containing your form.
NAME=".." - for JavaScript, this is probably the most important attribute for a form. Since JavaScript stores items that appear on your website in arrays, giving your form a name makes it easier to refer to (for example, document.form_name rather than document.forms[0]).

Inside the <form> tag, you place all the elements that you want to use in the form itself. The first element is an <input> tag, which takes attributes that define how it appears on the page. It is an open-ended tag, which means that it does not have a corresponding </input> tag to close it. An <input> tag can be of type text, checkbox, image, password, radio, submit, or reset. There also exists a <textarea> tag, which creates a large area for entering multi-line information. The <select> tag is used to create a drop-down list of items.

Defining each of these items is not the purpose of this article, but if you would like to know more, please visit the World Wide Web Consortium at http://www.w3.org for more information. For now, suffice it to say that qualifying each component of your form with a name=".." attribute is required if you wish to work with either a CGI script or a JavaScript function.

The simple form used in this example contains two text fields, one for a name, and the other for an email address. One button has been added, by which the form is submitted to the server.

<form name="form_name" onSubmit="return isReady(this)" action="">
<table cellpadding=0 cellspacing=5 border=0><tr>
    <td align="left">Your Name:</td><td align="left"><input type="text" name="username"></td>
</tr><tr>
    <td align="left">Your Email Address:</td><td align="left"><input type="text" name="address"></td>
</tr><tr>
    <td align="left" colspan="2"><input type="submit" value="Submit"></td>
</tr></table>
</form>

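Note that the form object itself is passed to the isReady() function upon submission through the this keyword. In JavaScript, this refers to the current object, which inside the onSubmit handler is the form. The form has also been given a name, so that it can be referred to from elsewhere in a script.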

The dangers of CGI

As we have seen, coupling interactivity via forms and programs or scripts on a server through the Common Gateway Interface (CGI) is an effective way to obtain information from individuals visiting your website. However, there are risks associated with running a CGI script from the web. Poorly written scripts that accept malformed information from an unknowing or malicious user could be made to do things that could bring your server to its knees.

For example, imagine operating a website that contains a field that allows a user to enter the name of a directory on the server. Certainly not the smartest idea, but they are out there. If someone were to put the following in as the directory they wanted listed, bad things could happen:

      web_directory ; /bin/rm *

Quite possibly, the command to list the directory would be carried out normally, and then the second command (/bin/rm *) would be carried out and erase the files in the current directory.

There are several ways to prevent this sort of thing from happening, and some are better than others, depending on the situation. First and foremost, the script itself could be written to verify that the form submitted to it does not contain any malicious code. Upon detecting such an attempt, the script could refuse to process the entry and store the submitter's IP address in a file for future reference. Or, more simply, the script could simply display an alternate page telling the visitor that their input was not accepted.
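As a rough illustration of the kind of check such a script might perform, the sketch below accepts a directory name only if every character comes from a short list of harmless ones. It is written in JavaScript to match the rest of this article; a real CGI script would be written in whatever language the server uses, and the function name isSafeDirectoryName and the list of allowed characters are assumptions made for this example.

```javascript
// Hypothetical helper: accept a directory name only if every character
// is a letter, a digit, an underscore, or a hyphen.
function isSafeDirectoryName(name) {
    var allowed = "abcdefghijklmnopqrstuvwxyz" +
                  "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
                  "0123456789_-";
    if (!name) return false;                  // reject empty input
    for (var i = 0; i < name.length; i++) {
        // indexOf() returns -1 when the character is not in the list
        if (allowed.indexOf(name.charAt(i)) == -1)
            return false;
    }
    return true;
}

var ok  = isSafeDirectoryName("web_directory");              // true
var bad = isSafeDirectoryName("web_directory ; /bin/rm *");  // false
```

Listing what is allowed, rather than what is forbidden, means that a character nobody thought of is rejected by default.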

While this is a very good method to use when validating form field input, it does have its disadvantages. One of the biggest is the overhead involved with parsing input on the server. A busy server that parses all of its requests could be slowed considerably, resulting in a website that appears sluggish. Here is where JavaScript comes to the rescue!

By passing the contents of the form to a JavaScript function before submission, the contents can be validated before being sent to the server, which reduces server overhead.

Beware, however, that a poorly written script can still accept requests that do not come from the form. It is possible that a malicious user from a completely different domain could run your script directly and feed it bad information. Fortunately, there are several ways around this. One of the easiest is to make your CGI script examine the HTTP_REFERER and REMOTE_HOST environment variables that the server sets for every request. These variables contain the URL of the referring document and the host name of the machine making the request, respectively, and could be checked to ensure that the request was submitted from your own form by a user on an allowed domain. If the request is not allowed, the foreign host name could be logged in a file and refused access to the script. Keep in mind that the referring URL is supplied by the browser and can be forged, so this check discourages casual misuse rather than replacing validation inside the script.

It is also important to ensure that some form of error checking still takes place, even if the request is a legitimate one. A visitor using a browser that does not support JavaScript could still conceivably submit malformed code.

Using JavaScript 1.0 Validation

JavaScript 1.0 offered a way to check and see if a field contained certain characters using the indexOf() method. If a character was found, the position of the character was returned as a number. For example:

var a = "This is my field's contents";
var b = a.indexOf("my");  // b now contains 8.

As you can see, b now contains the position (starting from 0) that the pattern "my" was located at. If the pattern searched for was not found, the indexOf() method returns -1.
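The not-found case is what makes indexOf() useful for validation. The short sketch below shows both outcomes and wraps the comparison in a helper; the name containsAt is invented for this example.

```javascript
var a = "This is my field's contents";

var first   = a.indexOf("is");    // 2, the "is" inside "This"
var missing = a.indexOf("your");  // -1, since "your" does not occur

// A typical test: does the string contain an "at" sign anywhere?
function containsAt(string) {
    return string.indexOf("@") != -1;
}

var yes = containsAt("someone@somewhere.com");  // true
var no  = containsAt("someone");                // false
```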

But what if you wanted to check for several characters all at once? What if you wanted to make sure that an email address only contained numbers, letters, an "at sign", and a period? By using indexOf(), you would be required to test for every character you didn't want to find, one at a time. If an illegal character is found, an alert box could be displayed asking the user to re-enter their information.

The following functions use JavaScript 1.0 functionality to examine either a text field containing regular text or a text field containing an email address. By passing the form to the isReady() function using the onSubmit event handler, the information is validated before being sent to the server. If the function returns true (i.e. everything checks out), the form is submitted to the location given in its ACTION attribute; if it returns false, the submission is cancelled.

Note that these functions can be used independently of a form. These methods can be used anywhere, as long as an appropriate string value is passed as an argument.

<script language="JavaScript"><!--
// Returns true if the string is not empty and contains none of the
// characters considered illegal in an email address.
function isEmail(string) {
   if (!string) return false;
   var iChars = "*|,\":<>[]{}`\';()&$#%";

   for (var i = 0; i < string.length; i++) {
      if (iChars.indexOf(string.charAt(i)) != -1)
         return false;
   }
   return true;
}

// The same check for ordinary text; here the "at" sign is illegal as well.
function isProper(string) {
   if (!string) return false;
   var iChars = "*|,\":<>[]{}`\';()@&$#%";

   for (var i = 0; i < string.length; i++) {
      if (iChars.indexOf(string.charAt(i)) != -1)
         return false;
   }
   return true;
}

// Called from the form's onSubmit handler. Returning false cancels
// the submission and puts the cursor back in the offending field.
function isReady(form) {
    if (isEmail(form.address.value) == false) {
        alert("Please enter a valid email address.");
        form.address.focus();
        return false;
    }
    if (isProper(form.username.value) == false) {
        alert("Please enter a valid username.");
        form.username.focus();
        return false;
    }
    return true;
}
//--></script>

Although this method works fine if you want to ensure that certain characters are not present in the field, it falls short when trying to ensure that certain patterns ARE present. What if you only wanted to allow email addresses from a certain domain, while not allowing others? What if only word-word@word-word.word email addresses were allowed? These things would be incredibly difficult, if not impossible, to do with indexOf() and JavaScript 1.0.
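To make the shortcoming concrete, here is a quick trial of the JavaScript 1.0 isEmail() function. The function body is repeated from the listing above so that the sketch runs on its own; the sample inputs are made up.

```javascript
// The JavaScript 1.0 isEmail() from above, repeated for a standalone trial.
function isEmail(string) {
    if (!string) return false;
    var iChars = "*|,\":<>[]{}`';()&$#%";
    for (var i = 0; i < string.length; i++) {
        if (iChars.indexOf(string.charAt(i)) != -1)
            return false;
    }
    return true;
}

var r1 = isEmail("someone@somewhere.com");  // true
var r2 = isEmail("some(one)@somewhere");    // false: parentheses are illegal
var r3 = isEmail("hello");                  // true, although there is no "@" at all
var r4 = isEmail("@@@...");                 // true: nothing here is on the illegal list
```

The last two results show the gap: the function can say what must be absent, but not what must be present.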

Using JavaScript 1.2 and Regular Expressions

JavaScript 1.2 shows the way through the power of regular expressions. These expressions, which offer the same functionality as regular expressions taken from Perl, a very popular scripting language, add the ability to parse form field input in ways that were simply not possible before. The examples below, which only work in Netscape Navigator 4.0x and Internet Explorer 4, illuminate the power associated with these new additions.

First off, what is a regular expression? Put simply, a regular expression is a pattern, written as a string of ordinary and special characters, that programmers can use to describe exactly which strings of text should match.

Before we get into using regular expressions to parse text, it is important that you understand a bit about how regular expressions work and what special characters do what. There is just too much to get into here, but here are a few that come up often:

       . matches any single character (except a newline).
       ? matches zero or one of the preceding character.
       + matches one or more of the preceding character.
       * matches zero or more of the preceding character.
       ^ matches the absolute beginning of the string.
       $ matches the absolute end of the string.
      \w matches a "word" character (alphanumerics and the "_" character).
     \w+ matches a whole word (one or more "word" characters).
      \W matches any character that is not a "word" character.
      \s matches a whitespace character (space, tab, newline).
     x|y matches one or the other of x or y.
   [0-9] matches ONE digit, ranging from 0 to 9.
[A-Za-z] matches any ONE letter, uppercase or lowercase.

Parentheses can be used to group characters together.

 (this)+ matches at least one occurrence of "this".

If you wish to search for one of the special characters, you must first escape it with a backslash (\).

     \. matches a period.
     \? matches a question mark.
     \[ matches a left square bracket.
     \| matches a "pipe" character.
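The special characters are easiest to absorb by trying them. The sketch below uses the test() method of a regular expression, which simply returns true or false; the sample strings are made up, and each pattern is anchored with ^ and $ so that the whole string has to match.

```javascript
var r = [
    /^b.t$/.test("bat"),           // true:  . stands for any one character
    /^colou?r$/.test("color"),     // true:  ? makes the "u" optional
    /^colou?r$/.test("colour"),    // true
    /^ab+c$/.test("ac"),           // false: + needs at least one "b"
    /^ab*c$/.test("ac"),           // true:  * allows zero "b"s
    /^(cat|dog)$/.test("dog"),     // true:  | chooses between alternatives
    /^[0-9]$/.test("7"),           // true:  exactly one digit
    /^[0-9]$/.test("42"),          // false: two digits where one is allowed
    /^(this)+$/.test("thisthis"),  // true:  the group repeats as a unit
    /^a\.b$/.test("a.b"),          // true:  \. is a literal period
    /^a\.b$/.test("axb")           // false: an escaped period matches only "."
];
```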

In addition to these, modifiers can be added after the regular expression to control how it searches through the string. Some of the more useful ones include these:

/somematch/g - global (matches all instances).
/somematch/i - ignore case.
/somematch/gi - you can combine them, too.
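A small sketch of how the modifiers change the result, using the replace() method that is described later in this article. The sample string is made up for illustration.

```javascript
var s = "cat, Cat, cat";

var plain  = s.replace(/cat/, "dog");    // "dog, Cat, cat": first match only
var global = s.replace(/cat/g, "dog");   // "dog, Cat, dog": every lowercase "cat"
var both   = s.replace(/cat/gi, "dog");  // "dog, dog, dog": every match, any case
```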

JavaScript 1.2 contains a number of new constructors and methods that allow a programmer to parse a string of text using regular expressions. The first thing you must do before you can begin parsing a string is to determine exactly what your regular expression will be. There are two ways to do this. The first is to specify it by hand using normal syntax, and the second is to use the new RegExp() constructor. The following two statements are equivalent:

pattern = /:+/;             // matches one or more colons
pattern = new RegExp(":+"); // same thing.

There is one very important thing to notice here. With the first method, it is important to remember to delimit your expression using slashes. A slash specifies the beginning or the end of a regular expression. With the second method, the pattern is passed as an ordinary quoted string, without the slashes. You may also place the regular expression directly into a method call without first storing it in a variable or using the RegExp() constructor, which is what I do in the examples below.
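One difference between the two forms is easy to trip over: inside a quoted string the backslash is itself an escape character, so it has to be doubled when a pattern is handed to the RegExp() constructor. The sketch below illustrates this with a pattern for one or more periods; the variable names are only for this example. Modifiers, if any, go in a second string argument.

```javascript
var literal     = /\.+/;               // one or more periods
var constructed = new RegExp("\\.+");  // same pattern; note the doubled backslash

var a = literal.test("...");      // true
var b = constructed.test("...");  // true
var c = constructed.test("abc");  // false

// With a single backslash the string is just ".+", so the pattern
// means "one or more of any character" and matches ordinary text.
var mistaken = new RegExp("\.+");
var d = mistaken.test("abc");     // true

// Modifiers are passed separately when using the constructor.
var anyCase = new RegExp("javascript", "i");
var e = anyCase.test("JavaScript is great");  // true
```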

The replace() method allows a programmer to replace a found match with another string. It takes two arguments, one being the regular expression you want searched for, and the other being the replacement text you want substituted. For example:

var t = "javascript is great";
var s = t.replace(/javascript/, "JavaScript"); // fixes the capitalization.

The variable s now contains "JavaScript is great". The next method is the search() method. This method searches the source string and returns the location of the first match if the pattern is found, otherwise -1. It works like JavaScript 1.0's indexOf() method, except that it looks for a pattern rather than a fixed string. Example:

var s = "Let's use Regular Expressions";
var found = s.search(/use/); // found now contains 6.
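Where search() goes beyond indexOf() is in looking for a pattern instead of a fixed piece of text. A minimal sketch, with a made-up sample string:

```javascript
var s = "Order 66 shipped on day 12";

var firstDigit = s.search(/[0-9]/);    // 6, the position of the first digit
var anyCase    = s.search(/SHIPPED/i); // 9, found despite the different case
var notThere   = s.search(/z/);        // -1, there is no "z" in the string
```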

The search() method is the one that will enable us to parse a field's contents to make sure that people aren't submitting information that could damage our server. Before we do that, however, let's take a look at the next method provided for regular expressions, the split() method.

The split() method is actually present in older versions of JavaScript but has been updated for JavaScript 1.2 to accommodate regular expressions. It searches through a string, "breaks apart" the string wherever the pattern matches, and stores each part in an array. The example below uses a pattern that looks for a colon and stores each part in the array a.

var s = "Jason:Nugent:this:is:great:don't:you:think";
var a = s.split(/:/);

In this case, a becomes the array containing ["Jason", "Nugent", "this", "is", "great", "don't", "you", "think"]. In common CGI applications, this same technique is used to separate a comma delimited text file that perhaps serves as a database containing user information.
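Splitting on a pattern pays off when the separator is not always the same. The sketch below, using a made-up string, splits on a comma, semicolon, or colon and swallows any spaces around the separator at the same time.

```javascript
var s = "Jason, Nugent;this:is , great";

// Zero or more spaces, one separator character, zero or more spaces.
var a = s.split(/ *[,;:] */);   // ["Jason", "Nugent", "this", "is", "great"]
```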

The match() method searches a string in a different way. When the g modifier is used, it returns an array consisting of all the parts of the string that match the regular expression. If no matches are found, it returns null.

var s = "Thank you, there, for thinking about me.";
var a = s.match(/th\w+/gi); // matches "th" followed by word characters, globally, ignoring case.

a is an array that now contains ["Thank", "there", "thinking"].
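Two details are worth trying out: what happens without the g modifier, and how to cope with the null result. A short sketch along the lines of the example above:

```javascript
var s = "Thank you, there, for thinking about me.";

var all   = s.match(/th\w+/gi);  // ["Thank", "there", "thinking"]
var first = s.match(/th\w+/i);   // without g: only the first match, "Thank"
var none  = s.match(/xyz/g);     // null, because nothing matches

// Check for null before treating the result as an array.
var count     = all ? all.length : 0;    // 3
var noneCount = none ? none.length : 0;  // 0
```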

Now, finally, we get to do some useful things with regular expressions. The following functions check the two fields of our form, so that the user can be alerted if the username is not entirely made up of letters, numbers, or spaces, or if the email address contains more than just alphanumerics, an "at" sign, periods, or hyphens.

Since regular expressions are available only from JavaScript 1.2 onward, we must make sure that older browsers never see them. Browsers that do not support JavaScript 1.2 ignore any script block marked with language="JavaScript1.2", so we can simply use that qualifier on the block containing our new functions. Older browsers will skip over this code.

<SCRIPT language="JavaScript1.2"><!--
function isEmail(string) {
    if (string.search(/^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/) != -1)
        return true;
    else
        return false;
}

function isProper(string) {
    if (string.search(/^\w+( \w+)?$/) != -1)
        return true;
    else
        return false;
}
//--></SCRIPT>

Ok. Let's stop and examine the regular expressions used in the functions above. First, let's look at the isProper() function since it is simpler. The Regular Expression used is /^\w+( \w+)?$/.

The first / is the leftmost delimiter for the regular expression. No surprises there.
The ^ (caret) symbol represents the absolute beginning of the string. This is important, since if it were left out the match would succeed if the search() method found the pattern ANYWHERE in the string. Malicious users could then include illegal characters before a valid name and get away with it.
The \w+ which is next indicates that we want to match one or more alphanumeric characters, including the underscore. The \w represents the character, and the + symbol means at least one. No magic there.
The following part of the regular expression, ( \w+)?, is special, since it has to be treated together. Let's break it down a bit, however. First, notice the whole picture. What we are doing is using a ? symbol, which means match one or none of whatever precedes it, in this case the whole parenthesized group. So, what happens is the regular expression looks for one or none of a group made up of a single space followed by at least one legal word character. This represents the optional last name of the user. Please take a moment to understand that.
The last $ symbol represents the end of the string. This makes sure that no characters can appear after our matched string in the regular expression, thus eliminating the possibility of someone sending bad data after a valid username.
The final / is the rightmost delimiter for the regular expression. Again, no surprises there.
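The breakdown above can be checked with a handful of trial inputs. The function is the isProper() from this article, shortened to a single return statement so that the sketch runs on its own; the sample names are made up.

```javascript
function isProper(string) {
    return string.search(/^\w+( \w+)?$/) != -1;
}

var r = [
    isProper("Jason"),           // true:  a single word
    isProper("Jason Nugent"),    // true:  the optional second word is present
    isProper("Jason_99"),        // true:  \w includes digits and "_"
    isProper("Jason  Nugent"),   // false: two spaces between the words
    isProper("Jason A Nugent"),  // false: only one optional group is allowed
    isProper("Jason; rm"),       // false: ";" is not a word character
    isProper("")                 // false: \w+ needs at least one character
];

// Without the ^ and $ anchors the bad input slips through, because
// "Jason" alone satisfies the pattern somewhere in the string.
var unanchored = "Jason; rm".search(/\w+( \w+)?/) != -1;  // true
```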

Ok. Shall we move on to the isEmail() function? The Regular Expression is /^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/.

The email regular expression begins with a /, again representing the leftmost delimiter.
Once again, we have a ^ symbol, representing the absolute beginning of the string for the same reason as before.
The following \w+ matches one or more alphanumeric characters.
The next chunk is where this gets interesting. I will try to break it down into manageable pieces, so please bear with me. The part we will look at is ((-\w+)|(\.\w+))*

First, note that the whole thing is surrounded by ()* which means that we want to match zero or more of them. Inside the parentheses, we have (-\w+)|(\.\w+) which means to match EITHER -\w+ OR \.\w+ so let's take a look at each of them in turn. The first one indicates that we should have a match if we find a hyphen followed immediately by a set of alphanumeric characters. The second part matches if we find a period followed immediately by a set of alphanumeric characters. Remember that a period by itself is a special character so we must escape it by placing a backslash in front of it. In essence, what this inside bit does is allow someone to submit an email address that has a hyphenated or dot-separated user name before the "at" sign.
After this match comes an @ sign. It is escaped with a backslash to ensure that it isn't taken for special meaning.
Immediately following the "at" sign is [A-Za-z0-9]+ which matches a set of alphanumeric characters (excluding any _ characters, which we would have got if we had just used \w).
After this, we have another interesting bit ((\.|-)[A-Za-z0-9]+)*. Let's go through it.
Again, note that we are matching zero or more occurrences using the * sign. Since parentheses are used, the entire group is taken into consideration. Let's look inside at the (\.|-)[A-Za-z0-9]+ pattern. Inside the parentheses, we have \.|- which implies that we will match either a period or a hyphen. Since this pattern is followed by a [A-Za-z0-9]+, the match only works if the period or hyphen is followed by a set of alphanumeric characters. This effectively represents an email address that contains a (possible) set of .word or -word sections. Because the * is used, the pattern works if they are present and also if they aren't.
The last \.[A-Za-z0-9]+ pattern matches a period followed by a set of alphanumerics. Because it is the last part of the regular expression, it represents the final part of the email address, which is the top level domain. Because [A-Za-z0-9]+ does not match non-alphanumerics, this pattern will not match email addresses that do not contain some sort of "real-looking" domain.
The final $ symbol ensures that the pattern is matched against the end of the string, for the same reasons as in the previous example.
The final / is the rightmost delimiter for the regular expression.

This pattern allows for email addresses like the following. With this particular regular expression, the bare minimum that a person could enter as an email address is x@x.x, where x is any alphanumeric character:

    someone@somewhere.com
    someone.somebody@somewhere.com
    someone.somebody@somewhere.where.com
    some-one@somewhere.com
    some-one.somewhere@wherever.com
    some-one.somewhere@where-ever.com
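It is just as instructive to see what the pattern turns away. The sketch below runs the isEmail() function from this article (shortened to a single return statement) over a few addresses; the rejected samples and the checkAll helper are made up for this example.

```javascript
function isEmail(string) {
    return string.search(/^\w+((-\w+)|(\.\w+))*\@[A-Za-z0-9]+((\.|-)[A-Za-z0-9]+)*\.[A-Za-z0-9]+$/) != -1;
}

var accepted = [
    "someone@somewhere.com",
    "some-one.somewhere@where-ever.com",
    "x@x.x"
];
var rejected = [
    "someone@somewhere",          // no period after the "at" sign
    "someone..x@somewhere.com",   // two periods in a row
    "-someone@somewhere.com",     // begins with a hyphen
    "someone@some_where.com",     // underscore after the "at" sign
    "someone@somewhere.com; rm",  // extra text after the address
    "some+one@somewhere.com"      // "+" is not permitted by this pattern
];

// Hypothetical helper: collect the verdict for every address in a list.
function checkAll(list) {
    var results = [];
    for (var i = 0; i < list.length; i++)
        results[i] = isEmail(list[i]);
    return results;
}

var acceptedResults = checkAll(accepted);  // every entry is true
var rejectedResults = checkAll(rejected);  // every entry is false
```

Note the last rejected entry: addresses containing a plus sign are in real use, so a pattern this strict may need loosening depending on who your visitors are.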

Working Example

Why not try the example out? It works in Netscape Navigator 2, 3 and 4, as well as Internet Explorer 3 and 4.

Source Code

You can view the source code of the working example.

Further Information

If you are interested in learning more about JavaScript 1.2, feel free to examine these sources of information on the web:

What's new in JavaScript 1.2: http://developer.netscape.com/library/documentation/communicator/jsguide/js1_2.htm

JavaScript 1.2 Reference: http://developer.netscape.com/library/documentation/communicator/jsref/index.htm

For a good introduction to regular expressions, please check out: ftp://ftp.ou.edu/mirrors/CPAN/doc/manual/html/pod/perlre.html

In addition, you might want to check out Tom Christiansen's page on Regular Expressions in Perl 5, which can be found at: http://www.perl.com/CPAN-local/doc/FMTEYEWTK/regexps.html The FMTEYEWTK stands for "Far More Than Everything You Ever Wanted To Know".

Related items

Chapter 6: Beginning JavaScript

Controlling Data Entry Using Form Fields

Form Image Button Fields

Creating 'Encoded' Name & Value Pairs

Disabling form elements

Passing data from one form to another

Dynamic Dropdown Menus

Form Tricks

Dropdown Menus #3

Check Boxes and Radio Buttons
