Note: This is an historic document. We are no longer maintaining the
content, but it may have value for research purposes. Pages linked to
from the document may no longer be available.
Understanding Malicious Content Mitigation for Web Developers
CERT Advisory CA-2000-02
describes a problem with malicious tags embedded in client HTTP
requests, discusses the impact of malicious scripts, and offers ways
to prevent the insertion of malicious tags.
This tech tip, written for web developers, describes more
specifically the steps you can take to prevent attackers
from using untrusted content to exploit your web site.
Problem Summary
Web pages contain both text and HTML markup that is generated by
the server and interpreted by the client browser. Servers that
generate static pages have full control over how the client will
interpret the pages sent by the server. However, servers that generate
dynamic pages do not have complete control over how their output is
interpreted by the client. The heart of the issue is that if untrusted
content can be introduced into a dynamic page, neither the server nor
the client has enough information to recognize that this has happened
and take protective actions.
In HTML, to distinguish text from markup, some characters are
treated specially. The grammar of HTML determines the significance of
"special" characters -- different characters are special at different
points in the document. For example, the less-than sign "<"
typically indicates the beginning of an HTML tag. Tags can either
affect the formatting of the page or introduce a program that the
browser executes (e.g., the <SCRIPT> tag introduces code from a
variety of scripting languages).
Many web servers generate web pages dynamically. For example, a
search engine may perform a database search and then construct a web
page that contains the result of the search. Any server that creates
web pages by inserting dynamic data into a template should check to
make sure that the data to be inserted does not contain any special
characters (e.g., "<"). If the inserted data contains special
characters, the user's web browser will mistake them for HTML
markup. Because HTML markup can introduce programs, the browser could
interpret some data values as HTML tags or script rather than
displaying them as text.
The risk of a web server failing to check for special characters
in dynamically generated web pages is that in some cases an attacker
can choose the data that the web server inserts into the generated
page. Then the attacker can trick the user's browser into running a
program of the attacker's choice. This program will execute in the
browser's security context for communicating with the legitimate
web server, not the browser's security context for communicating
with the attacker. Thus, the program will execute in an inappropriate
security context with inappropriate privileges.
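The problem can be seen in a minimal sketch. The function name `renderSearchPage` and the query strings are hypothetical, chosen only for illustration. The point is that the server copies request data into the page verbatim, so the browser cannot tell template markup from markup supplied by an attacker.

```javascript
// Hypothetical search-results template that inserts the query unchanged.
function renderSearchPage(query) {
  return "<HTML><BODY><P>Results for: " + query + "</P></BODY></HTML>";
}

// A benign query is displayed as text.
const benign = renderSearchPage("cert advisories");

// A query chosen by an attacker becomes part of the markup: the browser
// sees a SCRIPT element that appears to come from the legitimate server.
const hostile = renderSearchPage("<SCRIPT>alert('injected')</SCRIPT>");

console.log(benign);
console.log(hostile);
```

Nothing in the second page marks the SCRIPT element as foreign, which is why the check has to happen on the server before the data is inserted.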
Mitigation Summary
Any data inserted into an output stream originating from a server
is presented as originating from that server, even if it does not
include malicious tags. Web developers must evaluate whether their
sites will send untrusted data as part of an output stream.
Untrusted input can come from, but is not limited to,
- URL parameters
- Form elements
- Cookies
- Database queries
A combination of steps must be taken to mitigate this
vulnerability. These steps include
- Explicitly setting the character set encoding for each page
generated by the web server
- Identifying special characters
- Encoding dynamic output elements
- Filtering specific characters in dynamic elements
- Examining cookies
The following sections discuss details of each of these
steps.
Explicitly Setting the Character Encoding
Many web pages leave the character encoding ("charset" parameter in
HTTP) undefined. In earlier versions of HTML and HTTP, the character
encoding was supposed to default to ISO-8859-1 if it wasn't defined.
In fact, many browsers had a different default, so it was not possible
to rely on the default being ISO-8859-1. HTML version 4 legitimizes
this - if the character encoding isn't specified, any character
encoding can be used.
If the web server doesn't specify which character encoding is in use,
it can't tell which characters are special. Web pages with unspecified
character encoding work most of the time because most character sets
assign the same characters to byte values below 128. But which of the
values above 128 are special? Some 16-bit character-encoding schemes
have additional multi-byte representations for special characters such
as "<". Some browsers recognize this alternative encoding and act
on it. This is "correct" behavior, but it makes attacks using
malicious scripts much harder to prevent. The server simply doesn't
know which byte sequences represent the special characters.
For example, UTF-7 provides alternative encoding for "<" and
">", and several popular browsers recognize these as the start and
end of a tag. This is not a bug in those browsers. If the character
encoding really is UTF-7, then this is correct behavior. The problem
is that it is possible to get into a situation in which the browser
and the server disagree on the encoding. Web servers should set the
character set, then make sure that the data they insert is free from
byte sequences that are special in the specified encoding. For
example:
<HTML>
<HEAD>
<META http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">
<TITLE>HTML SAMPLE</TITLE>
</HEAD>
<BODY>
<P>This is a sample HTML page
</BODY>
</HTML>
The META tag in the HEAD section of this sample HTML forces the
page to use the ISO-8859-1 character set encoding.
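Why the declared encoding matters can be sketched in a few lines. The decoder below is a deliberately minimal, illustrative UTF-7 decoder, not a complete implementation of that encoding. The name `decodeUtf7` is hypothetical, and the sketch assumes a Node.js environment for `Buffer`. It shows that data containing no "<" character at all can still yield a tag once a browser interprets the page as UTF-7.

```javascript
// Minimal UTF-7 decoder for illustration only: handles "+<base64>-" runs
// and the "+-" escape for a literal plus sign. Assumes Node.js (Buffer).
function decodeUtf7(text) {
  return text.replace(/\+([A-Za-z0-9+\/]*)-/g, function (match, b64) {
    if (b64 === "") return "+";               // "+-" encodes a literal "+"
    const bytes = Buffer.from(b64, "base64"); // UTF-16BE code units
    let out = "";
    for (let i = 0; i + 1 < bytes.length; i += 2) {
      out += String.fromCharCode(bytes.readUInt16BE(i));
    }
    return out;
  });
}

// No "<" or ">" appears in the data the server inserted...
const inserted = "+ADw-SCRIPT+AD4-alert(1)+ADw-/SCRIPT+AD4-";
const looksSafe = inserted.indexOf("<") === -1 && inserted.indexOf(">") === -1;

// ...but a browser that assumes UTF-7 sees a complete SCRIPT element.
const asSeenByBrowser = decodeUtf7(inserted);

console.log(looksSafe);       // true
console.log(asSeenByBrowser); // <SCRIPT>alert(1)</SCRIPT>
```

Declaring the character set explicitly removes the ambiguity: the server then knows which byte sequences it has to look for.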
Identifying the Special Characters
The next two steps, encoding and filtering, first require an
understanding of "special characters". The HTML specification
determines which characters are "special", because they have an effect
on how the page is displayed. However, many web browsers try to
correct common errors in HTML. As a result, they sometimes treat
characters as special when, according to the specification, they
aren't. In addition, the set of special characters depends on the
context:
- In the content of a block-level element (in the middle of a
paragraph of text)
- "<" is special because it introduces a tag.
- "&" is special because it introduces a character
entity.
- ">" is special because some browsers treat it as
special, on the assumption that the author of the page really
meant to put in an opening "<", but omitted it in error.
- Attribute values
- In attribute values enclosed with double quotes, the
double quotes are special because they mark the end of the
attribute value.
- In attribute values enclosed with single quotes, the single
quotes are special because they mark the end of the attribute
value.
- Attribute values without any quotes make the white-space
characters such as space and tab special.
- "&" is special when used in conjunction with
some attributes because it introduces a character entity.
- In URLs, for example, a search engine might provide a link within
the results page that the user can click to re-run the search. This
can be implemented by encoding the search query inside the URL. When
this is done, it introduces additional special characters:
- Space, tab, and new line are special because they mark the
end of the URL.
- "&" is special because it introduces a character
entity or separates CGI parameters.
- Non-ASCII characters (that is, everything above 128 in the
ISO-8859-1 encoding) aren't allowed in URLs, so they are all
special here.
- The "%" must be filtered from input anywhere parameters
encoded with HTTP escape sequences are decoded by server-side
code. The percent must be filtered if input such as
"%68%65%6C%6C%6F" becomes "hello" when it appears on the web
page in question.
- Within the body of a <SCRIPT> </SCRIPT>
- The semicolon, parentheses, curly braces, and new line
should be filtered in situations where text could be inserted
directly into a preexisting script tag.
- Server-side scripts
- Server-side scripts that convert any exclamation
characters (!) in input to double-quote characters (") on
output might require additional filtering.
- Other possibilities
- No current exploits rely on the ampersand. This character
may be useful in future exploits. Conservative web page
authors should filter this character out if possible.
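The URL-related points in the list above can be illustrated with JavaScript's built-in `decodeURIComponent` and `encodeURIComponent`. The sample strings are illustrative assumptions. The sketch shows that input containing no "<" can still produce one after decoding, and that encoding a query before placing it in a link neutralizes the URL special characters.

```javascript
// Escape sequences turn back into ordinary characters when decoded.
const decoded = decodeURIComponent("%68%65%6C%6C%6F");
console.log(decoded); // hello

// The same mechanism can smuggle markup past a check made before decoding.
const smuggled = decodeURIComponent("%3CSCRIPT%3E");
console.log(smuggled); // <SCRIPT>

// When building a link that re-runs a search, encode the query so that
// spaces, "&", and non-ASCII characters cannot end or alter the URL.
const query = "tom & jerry";
const link = "/search?q=" + encodeURIComponent(query);
console.log(link); // /search?q=tom%20%26%20jerry
```

Any filtering therefore needs to run on the decoded value, after every decoding step the server applies.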
It is important to note that individual situations may warrant
including additional characters in the list of special characters. Web
developers must examine their applications and determine which
characters can affect their web applications.
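One way to keep track of context-dependent special characters is a lookup table keyed by context. The following is a sketch only: the context labels and the function name `findSpecialCharacters` are hypothetical, and the character sets simply restate the list above; a real application may need to extend them.

```javascript
// Special characters per context, taken from the list above.
// The context names are hypothetical labels for this sketch.
const SPECIAL_BY_CONTEXT = {
  elementContent: ["<", "&", ">"],
  doubleQuotedAttribute: ['"', "&"],
  singleQuotedAttribute: ["'", "&"],
  unquotedAttribute: [" ", "\t", "&"],
  scriptBody: [";", "(", ")", "{", "}", "\n"]
};

// Returns the special characters (for the given context) that occur in text.
function findSpecialCharacters(text, context) {
  const special = SPECIAL_BY_CONTEXT[context];
  if (!special) throw new Error("Unknown context: " + context);
  return special.filter(function (ch) { return text.indexOf(ch) !== -1; });
}

// The same data is harmless in one context and dangerous in another.
console.log(findSpecialCharacters('say "hi"', "elementContent"));        // none found
console.log(findSpecialCharacters('say "hi"', "doubleQuotedAttribute")); // the double quote
```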
Encoding Dynamic Output Elements
Each character in the ISO-8859-1 specification can be encoded using
its numeric entity value. A complete description of the
ISO-8859-1 character set can be found in the appendix of
this document.
The following example uses the copyright mark in an HTML
document:
<p>&#169; 2000 Some Co., Inc.
The copyright character is number 169 in ISO-8859-1, and the &# syntax
allows the author to insert encoded characters that will be interpreted
by the browser.
In addition, many of the ISO-8859-1 characters have an entity
name. The copyright mark can also be written using this method:
<p>&copy; 2000 Some Co., Inc.
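Numeric encoding is mechanical enough to sketch in code. The function name `toNumericEntities` is hypothetical, and the sketch assumes ISO-8859-1 text, where each character corresponds to a single code value.

```javascript
// Encodes every character of the input as a decimal numeric entity.
// Intended for ISO-8859-1 text, where each character is one code unit.
function toNumericEntities(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += "&#" + text.charCodeAt(i) + ";";
  }
  return out;
}

console.log(toNumericEntities("\u00A9")); // &#169;
console.log(toNumericEntities("<"));      // &#60;
```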
Encoding untrusted data has benefits over filtering untrusted data,
including the preservation of visual appearance in the browser. This
is important when special characters are considered acceptable.
Unfortunately, encoding all untrusted data can be resource
intensive. Web developers must select a balance between encoding and
the other option of data filtering.
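A minimal sketch of encoding untrusted data on output follows. The name `encodeForHtml` and the choice of five characters are assumptions for illustration; the set covers the characters identified as special in element content and quoted attribute values, and an application may need more.

```javascript
// Replacement text for each special character. Entity names are used
// where the appendix lists one; the apostrophe has only a numeric form.
const ENTITY_FOR = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;"
};

// Returns the text with special characters replaced by entities, so the
// browser displays them instead of interpreting them as markup.
function encodeForHtml(untrusted) {
  return String(untrusted).replace(/[&<>"']/g, function (ch) {
    return ENTITY_FOR[ch];
  });
}

const comment = "<SCRIPT>alert('x')</SCRIPT> & more";
console.log("<P>" + encodeForHtml(comment) + "</P>");
// <P>&lt;SCRIPT&gt;alert(&#39;x&#39;)&lt;/SCRIPT&gt; &amp; more</P>
```

Unlike filtering, this keeps the user's text visually intact: the reader still sees the angle brackets, but the browser never treats them as a tag.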
Filtering Dynamic Content
Unfortunately, it is unclear whether there are any other characters
or character combinations that can be used to expose other
vulnerabilities. The recommended method is to select the set of
characters that is known to be safe rather than excluding the set of
characters that might be bad. For example, a form element that is
expecting a person's age can be limited to the set of digits 0 through
9. There is no reason for this age element to accept any letters or
other special characters. Using this positive approach of selecting
the characters that are acceptable will help to reduce the ability to
exploit other yet unknown vulnerabilities.
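The age example might look like the following sketch. The function names are hypothetical, and the three-digit limit in `isValidAge` is an assumption added for illustration.

```javascript
// Keeps only the characters known to be safe for an age field: digits 0-9.
function filterAge(input) {
  return String(input).replace(/[^0-9]/g, "");
}

// A stricter variant rejects the value instead of repairing it.
function isValidAge(input) {
  return /^[0-9]{1,3}$/.test(String(input));
}

console.log(filterAge("42"));          // 42
console.log(filterAge("42<SCRIPT>"));  // 42
console.log(isValidAge("42<SCRIPT>")); // false
```

Because the filter names what is allowed rather than what is forbidden, a special character discovered in the future is excluded without any change to the code.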
The filtering process can be done as part of the data input
process, the data output process, or both. Filtering the data during
the output process, just before it is rendered as part of the dynamic
page, is recommended. Done correctly, this approach ensures that all
dynamic content is filtered. Filtering on the input side is less
effective because dynamic content can be entered into a web site's
database(s) via methods other than HTTP. In this case, the web server
may never see the data as part of the input process. Unless the
filtering is implemented in all places where dynamic data is entered,
the data elements may still remain tainted.
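Output-side filtering can be sketched as a template function that filters every value at the moment it is written into the page. The names `filterForOutput` and `renderTemplate`, the "{name}" placeholder syntax, and the allowed character set are all assumptions made for this illustration.

```javascript
// Allow-list filter: letters, digits, and spaces only (an assumption;
// choose the set that suits each field).
function filterForOutput(value) {
  return String(value).replace(/[^A-Za-z0-9 ]/g, "");
}

// Fills "{name}" placeholders in a template, filtering every value at the
// moment it is written into the page.
function renderTemplate(template, values) {
  return template.replace(/\{(\w+)\}/g, function (match, key) {
    return Object.prototype.hasOwnProperty.call(values, key)
      ? filterForOutput(values[key])
      : "";
  });
}

// The record might have reached the database by a route other than HTTP.
const recordFromDatabase = { name: "Pat<SCRIPT>alert(1)</SCRIPT>" };
console.log(renderTemplate("<P>Welcome, {name}</P>", recordFromDatabase));
// <P>Welcome, PatSCRIPTalert1SCRIPT</P>
```

Because the filter sits in the only code path that writes dynamic data, it applies no matter how the data entered the system.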
Examine Cookies
One method to exploit this vulnerability involves inserting
malicious content into a cookie. Web developers should carefully
examine cookies that they accept and use the filtering techniques
described above to verify that they are not storing malicious
content.
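As a sketch of this check, the function below parses a Cookie request header and keeps only the cookies whose names and values consist entirely of allowed characters. The name `acceptCookies` and the allowed sets are assumptions; a site that stores other characters in cookies would need to adjust them.

```javascript
// Parses a Cookie request header into name/value pairs and keeps only
// values made up entirely of allow-listed characters.
function acceptCookies(cookieHeader) {
  const accepted = {};
  String(cookieHeader).split(";").forEach(function (pair) {
    const eq = pair.indexOf("=");
    if (eq === -1) return; // not a name=value pair
    const name = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (/^[A-Za-z0-9_]+$/.test(name) && /^[A-Za-z0-9_-]*$/.test(value)) {
      accepted[name] = value;
    }
  });
  return accepted;
}

const header = "session=abc123; theme=<SCRIPT>alert(1)</SCRIPT>";
console.log(acceptCookies(header)); // only the session cookie is kept
```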
Sample Filtering Code
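C++ Example
#include <windows.h>  // provides the BYTE and DWORD types used below
// Lookup table indexed by byte value: 0xFF marks a character to remove.
BYTE IsBadChar[] = {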
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0xFF,0xFF,0x00,0x00,0xFF,0xFF,0xFF,0xFF,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0xFF,0xFF,0x00,0xFF,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
0x00,0x00,0x00
};
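// Removes every byte flagged in IsBadChar from pString, in place.
// cChLen is the number of bytes to examine; scanning also stops at a NUL.
// Returns the length of the filtered string.
DWORD FilterBuffer(BYTE * pString, DWORD cChLen){
    BYTE * pBad = pString;    // read position
    BYTE * pGood = pString;   // write position
    DWORD i = 0;
    if (!pString) return 0;
    for (i = 0; i < cChLen && pBad[i]; i++){
        if (!IsBadChar[pBad[i]]) *pGood++ = pBad[i];
    }
    // If anything was removed, terminate the shortened string.
    if (pGood < pBad + i) *pGood = 0;
    return (DWORD)(pGood - pString);
}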
JavaScript Example
// Negative approach: strips a fixed list of characters from the input.
function RemoveBad(InStr){
    InStr = InStr.replace(/\</g,"");
    InStr = InStr.replace(/\>/g,"");
    InStr = InStr.replace(/\"/g,"");
    InStr = InStr.replace(/\'/g,"");
    InStr = InStr.replace(/\%/g,"");
    InStr = InStr.replace(/\;/g,"");
    InStr = InStr.replace(/\(/g,"");
    InStr = InStr.replace(/\)/g,"");
    InStr = InStr.replace(/\&/g,"");
    InStr = InStr.replace(/\+/g,"");
    return InStr;
}
Perl Example
# The first function takes the negative approach:
# it removes characters from a list of known-bad characters.
sub FilterNeg {
    my ($fd) = @_;
    $fd =~ s/[<>"'%;)(&+]//g;
    return $fd;
}
# The second function takes the positive approach:
# it keeps only characters from a list of known-good characters.
sub FilterPos {
    my ($fd) = @_;
    $fd =~ tr/A-Za-z0-9 //cd;
    return $fd;
}
my $Data = "This is a test string<script>";
$Data = FilterNeg($Data);
print "$Data\n";    # prints "This is a test stringscript"
$Data = "This is a test string<script>";
$Data = FilterPos($Data);
print "$Data\n";    # prints "This is a test stringscript"
ISO 8859-1 (Latin-1) Character Set
Number | Name | Description | Appearance |
&#00;-&#08; | - | Unused | - |
&#09; | - | Horizontal tab | space |
&#10; | - | Line feed | space |
&#11;-&#31; | - | Unused | - |
&#32; | - | Space | space |
&#33; | - | Exclamation mark | ! |
&#34; | &quot; | Quotation mark | " |
&#35; | - | Number sign | # |
&#36; | - | Dollar sign | $ |
&#37; | - | Percent sign | % |
&#38; | &amp; | Ampersand | & |
&#39; | - | Apostrophe | ' |
&#40; | - | Left parenthesis | ( |
&#41; | - | Right parenthesis | ) |
&#42; | - | Asterisk | * |
&#43; | - | Plus sign | + |
&#44; | - | Comma | , |
&#45; | - | Hyphen | - |
&#46; | - | Period (full stop) | . |
&#47; | - | Solidus (slash) | / |
&#48;-&#57; | - | Digits (0-9) | 0-9 |
&#58; | - | Colon | : |
&#59; | - | Semi-colon | ; |
&#60; | &lt; | Less than | < |
&#61; | - | Equals sign | = |
&#62; | &gt; | Greater than | > |
&#63; | - | Question mark | ? |
&#64; | - | Commercial at | @ |
&#65;-&#90; | - | Uppercase A-Z | A-Z |
&#91; | - | Left square bracket | [ |
&#92; | - | Reverse solidus (backslash) | \ |
&#93; | - | Right square bracket | ] |
&#94; | - | Caret | ^ |
&#95; | - | Horizontal bar (underscore) | _ |
&#96; | - | Grave accent | ` |
&#97;-&#122; | - | Lowercase a-z | a-z |
&#123; | - | Left curly brace | { |
&#124; | - | Vertical bar | | |
&#125; | - | Right curly brace | } |
&#126; | - | Tilde | ~ |
&#127;-&#159; | - | Unused | - |
&#160; | &nbsp; | Non-breaking space | space |
&#161; | &iexcl; | Inverted exclamation | ¡ |
&#162; | &cent; | Cent sign | ¢ |
&#163; | &pound; | Pound sterling sign | £ |
&#164; | &curren; | General currency sign | ¤ |
&#165; | &yen; | Yen sign | ¥ |
&#166; | &brvbar; | Broken vertical bar | ¦ |
&#167; | &sect; | Section sign | § |
&#168; | &uml; | Umlaut (dieresis) | ¨ |
&#169; | &copy; | Copyright | © |
&#170; | &ordf; | Feminine ordinal | ª |
&#171; | &laquo; | Left angle quote, guillemot left | « |
&#172; | &not; | Not sign | ¬ |
&#173; | &shy; | Soft hyphen | ­ |
&#174; | &reg; | Registered trademark | ® |
&#175; | &macr; | Macron accent | ¯ |
&#176; | &deg; | Degree sign | ° |
&#177; | &plusmn; | Plus or minus | ± |
&#178; | &sup2; | Superscript two | ² |
&#179; | &sup3; | Superscript three | ³ |
&#180; | &acute; | Acute accent | ´ |
&#181; | &micro; | Micro sign | µ |
&#182; | &para; | Paragraph sign | ¶ |
&#183; | &middot; | Middle dot | · |
&#184; | &cedil; | Cedilla | ¸ |
&#185; | &sup1; | Superscript one | ¹ |
&#186; | &ordm; | Masculine ordinal | º |
&#187; | &raquo; | Right angle quote, guillemot right | » |
&#188; | &frac14; | Fraction (one quarter) | ¼ |
&#189; | &frac12; | Fraction (one half) | ½ |
&#190; | &frac34; | Fraction (three quarters) | ¾ |
&#191; | &iquest; | Inverted question mark | ¿ |
&#192; | &Agrave; | Capital A, grave accent | À |
&#193; | &Aacute; | Capital A, acute accent | Á |
&#194; | &Acirc; | Capital A, circumflex accent | Â |
&#195; | &Atilde; | Capital A, tilde | Ã |
&#196; | &Auml; | Capital A, umlaut (dieresis) | Ä |
&#197; | &Aring; | Capital A, ring | Å |
&#198; | &AElig; | Capital AE diphthong (ligature) | Æ |
&#199; | &Ccedil; | Capital C, cedilla | Ç |
&#200; | &Egrave; | Capital E, grave accent | È |
&#201; | &Eacute; | Capital E, acute accent | É |
&#202; | &Ecirc; | Capital E, circumflex accent | Ê |
&#203; | &Euml; | Capital E, umlaut (dieresis) | Ë |
&#204; | &Igrave; | Capital I, grave accent | Ì |
&#205; | &Iacute; | Capital I, acute accent | Í |
&#206; | &Icirc; | Capital I, circumflex accent | Î |
&#207; | &Iuml; | Capital I, umlaut (dieresis) | Ï |
&#208; | &ETH; | Capital Eth, Icelandic | Ð |
&#209; | &Ntilde; | Capital N, tilde | Ñ |
&#210; | &Ograve; | Capital O, grave accent | Ò |
&#211; | &Oacute; | Capital O, acute accent | Ó |
&#212; | &Ocirc; | Capital O, circumflex accent | Ô |
&#213; | &Otilde; | Capital O, tilde | Õ |
&#214; | &Ouml; | Capital O, umlaut (dieresis) | Ö |
&#215; | &times; | Multiply sign | × |
&#216; | &Oslash; | Capital O, slash | Ø |
&#217; | &Ugrave; | Capital U, grave accent | Ù |
&#218; | &Uacute; | Capital U, acute accent | Ú |
&#219; | &Ucirc; | Capital U, circumflex accent | Û |
&#220; | &Uuml; | Capital U, umlaut (dieresis) | Ü |
&#221; | &Yacute; | Capital Y, acute accent | Ý |
&#222; | &THORN; | Capital Thorn, Icelandic | Þ |
&#223; | &szlig; | Small sharp s, German (sz ligature) | ß |
&#224; | &agrave; | Small a, grave accent | à |
&#225; | &aacute; | Small a, acute accent | á |
&#226; | &acirc; | Small a, circumflex accent | â |
&#227; | &atilde; | Small a, tilde | ã |
&#228; | &auml; | Small a, umlaut (dieresis) | ä |
&#229; | &aring; | Small a, ring | å |
&#230; | &aelig; | Small ae diphthong (ligature) | æ |
&#231; | &ccedil; | Small c, cedilla | ç |
&#232; | &egrave; | Small e, grave accent | è |
&#233; | &eacute; | Small e, acute accent | é |
&#234; | &ecirc; | Small e, circumflex accent | ê |
&#235; | &euml; | Small e, umlaut (dieresis) | ë |
&#236; | &igrave; | Small i, grave accent | ì |
&#237; | &iacute; | Small i, acute accent | í |
&#238; | &icirc; | Small i, circumflex accent | î |
&#239; | &iuml; | Small i, umlaut (dieresis) | ï |
&#240; | &eth; | Small eth, Icelandic | ð |
&#241; | &ntilde; | Small n, tilde | ñ |
&#242; | &ograve; | Small o, grave accent | ò |
&#243; | &oacute; | Small o, acute accent | ó |
&#244; | &ocirc; | Small o, circumflex accent | ô |
&#245; | &otilde; | Small o, tilde | õ |
&#246; | &ouml; | Small o, umlaut (dieresis) | ö |
&#247; | &divide; | Division sign | ÷ |
&#248; | &oslash; | Small o, slash | ø |
&#249; | &ugrave; | Small u, grave accent | ù |
&#250; | &uacute; | Small u, acute accent | ú |
&#251; | &ucirc; | Small u, circumflex accent | û |
&#252; | &uuml; | Small u, umlaut (dieresis) | ü |
&#253; | &yacute; | Small y, acute accent | ý |
&#254; | &thorn; | Small thorn, Icelandic | þ |
&#255; | &yuml; | Small y, umlaut (dieresis) | ÿ |
This document is available from:
http://www.cert.org/tech_tips/malicious_code_mitigation.html
CERT/CC Contact Information
Email: cert@cert.org
Phone: +1 412-268-7090 (24-hour hotline)
Fax: +1 412-268-6989
Postal address:
CERT Coordination Center
Software Engineering Institute
Carnegie Mellon University
Pittsburgh PA 15213-3890
U.S.A.
CERT/CC personnel answer the hotline 08:00-17:00 EST(GMT-5) / EDT(GMT-4)
Monday through Friday; they are on call for emergencies during other
hours, on U.S. holidays, and on weekends.
Using encryption
We strongly urge you to encrypt sensitive information sent by
email. Our public PGP key is available from
If you prefer to use DES, please call the CERT hotline for more
information.
Getting security information
CERT publications and other security information are available from
our web site
* "CERT" and "CERT Coordination Center" are registered in the U.S. Patent and Trademark Office.
NO WARRANTY
Any material furnished by Carnegie Mellon University and the
Software Engineering Institute is furnished on an "as is"
basis. Carnegie Mellon University makes no warranties of any kind,
either expressed or implied as to any matter including, but not
limited to, warranty of fitness for a particular purpose or
merchantability, exclusivity or results obtained from use of the
material. Carnegie Mellon University does not make any warranty of any
kind with respect to freedom from patent, trademark, or copyright
infringement.