This class is a document filter capable of removing specified elements from the processing stream. There are two options for processing document elements:
- specifying those elements which should be accepted and, optionally, which attributes of that element should be kept; and
- specifying those elements whose tags and content should be completely removed from the event stream.
The first option allows the application to specify which elements appearing in the event stream should be accepted and, therefore, passed on to the next stage in the pipeline. All elements not in the list of acceptable elements have their start and end tags stripped from the event stream unless those elements appear in the list of elements to be removed.
The second option allows the application to specify which elements should be completely removed from the event stream. When an element that is to be removed appears, the element's start and end tags, as well as all of its content, are removed from the event stream.
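The difference between accepting, stripping, and removing can be pictured with a small, self-contained model. The sketch below is not this filter's implementation. The RemoverModel class, its string-based event representation, and the rule that an element named in both lists is treated as accepted are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy model of the accept / strip / remove decision.
// Events are plain strings: "<name>" is a start tag, "</name>" is an end tag,
// and anything else is character content (assumed not to start with '<').
class RemoverModel {
    private final Set<String> accepted = new HashSet<>();
    private final Set<String> removed = new HashSet<>();

    void acceptElement(String name) { accepted.add(name.toLowerCase(Locale.ROOT)); }
    void removeElement(String name) { removed.add(name.toLowerCase(Locale.ROOT)); }

    List<String> filter(List<String> events) {
        List<String> out = new ArrayList<>();
        int removalDepth = 0; // greater than 0 while inside a removed element
        for (String event : events) {
            boolean isEnd = event.startsWith("</");
            boolean isStart = !isEnd && event.startsWith("<");
            if (!isStart && !isEnd) {
                // Character content survives unless an enclosing element is removed.
                if (removalDepth == 0) out.add(event);
                continue;
            }
            if (removalDepth > 0) {
                // Inside a removed element: drop every tag, only track nesting.
                removalDepth += isStart ? 1 : -1;
                continue;
            }
            String name = event.substring(isEnd ? 2 : 1, event.length() - 1)
                               .toLowerCase(Locale.ROOT);
            if (accepted.contains(name)) {
                out.add(event);        // accepted: tag passes through
            } else if (isStart && removed.contains(name)) {
                removalDepth = 1;      // removed: drop tag and everything inside
            }
            // Otherwise the element is stripped: the tag is dropped, content continues.
        }
        return out;
    }
}
```

In this model, with "b" accepted and "script" removed, the events `<p>`, `Hello `, `<b>`, `world`, `</b>`, `</p>`, `<script>`, `alert(1)`, `</script>` reduce to `Hello `, `<b>`, `world`, `</b>`.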
A common use of this filter is to allow only rich-text and linking elements, along with character content, to pass through the filter; all other elements are stripped. The following code shows how to configure the filter to perform this task:
ElementRemover remover = new ElementRemover();
remover.acceptElement("b", null);
remover.acceptElement("i", null);
remover.acceptElement("u", null);
remover.acceptElement("a", new String[] { "href" });
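The second argument to acceptElement lists the attributes to keep on an accepted element. As a rough illustration of that idea, the hypothetical helper below filters an attribute map against such a list. It assumes that a null list keeps no attributes and that names are compared case-insensitively; neither detail is stated in this description, and the class and method names are invented for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper modelling the attribute list passed to acceptElement.
class AttributeFilterSketch {
    // Returns a copy of attrs containing only the names listed in allowed,
    // compared case-insensitively. A null list keeps nothing; that reading
    // of null is an assumption of this sketch.
    static Map<String, String> keep(Map<String, String> attrs, String[] allowed) {
        Map<String, String> kept = new LinkedHashMap<>();
        if (allowed == null) {
            return kept;
        }
        for (Map.Entry<String, String> entry : attrs.entrySet()) {
            for (String name : allowed) {
                if (name.equalsIgnoreCase(entry.getKey())) {
                    kept.put(entry.getKey(), entry.getValue());
                    break;
                }
            }
        }
        return kept;
    }
}
```

Under these assumptions, an `a` element carrying both `href` and `onclick` attributes would come out with `href` only, while `b`, `i`, and `u` would come out with no attributes.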
However, the configuration above still allows the text content of other elements to pass through, which may not be desirable. To further "clean" the input, the removeElement method can be used. The following code completely removes any <SCRIPT> elements, both tags and content, from the stream:
remover.removeElement("script");
Note: All text and accepted element children of a stripped element are retained. To completely remove an element's content, use the removeElement method.
Note: Care should be taken when using this filter because the output may not be a well-balanced tree. Specifically, if the application removes the <HTML> element (with or without retaining its children), the resulting document event stream will no longer be well-formed.
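To make the caution above concrete, the following sketch checks whether a filtered event list still has a single root element. It is a simplified stand-in for a real well-formedness check: the BalanceCheckSketch class and the string-based event representation ("<name>" for a start tag, "</name>" for an end tag, anything else for text) are assumptions of this example, and tag names are not matched against each other.

```java
import java.util.List;

// Hypothetical check used only to illustrate the note above.
class BalanceCheckSketch {
    // True if the events form exactly one top-level element, every start tag
    // is closed, and no character content appears outside that element.
    // Tag names are not compared, so mismatched pairs go undetected.
    static boolean hasSingleRoot(List<String> events) {
        int depth = 0;
        int roots = 0;
        for (String event : events) {
            if (event.startsWith("</")) {
                if (--depth < 0) {
                    return false; // end tag without a matching start tag
                }
            } else if (event.startsWith("<")) {
                if (depth++ == 0) {
                    roots++;      // a new top-level element begins
                }
            } else if (depth == 0) {
                return false;     // text outside any element
            }
        }
        return depth == 0 && roots == 1;
    }
}
```

For instance, once the `<html>` tags are stripped and their children retained, a stream such as `Hello `, `<b>`, `world`, `</b>` has text at the top level and fails this check.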
@author Andy Clark
@version $Id: ElementRemover.java,v 1.5 2005/02/14 03:56:54 andyc Exp $