In this release, we include the filtering engine of the YFilter system developed at the University of California at Berkeley. This filtering engine provides fast, on-the-fly matching of XML messages to simultaneous queries that represent different data interests. These queries consist of path expressions written in a subset of XPath 1.0. A detailed description of the language subset that this release supports is provided in Section 5 of this manual. For each XML message, the engine returns a result for every matched query. A result can be either a boolean value (i.e., True), or a sequence of elements that matched the query. This engine is implemented in the Java programming language, using J2SE1.4.x.
As systems built on YFilter are likely to be deployed in distributed wide-area environments, the number of users of such systems can easily grow into hundreds of thousands. YFilter was designed to handle such large numbers of queries. In particular, it provides high efficiency and scalability for filtering XML messages by exploiting commonality among all the queries.
YFilter was also designed to handle streaming XML data. Continuously arriving XML messages are processed on-the-fly in order of arrival. Lightweight event-based XML parsing is used to drive the execution of YFilter so that queries can be evaluated as a message is being parsed. YFilter does not use indexing or disk-oriented solutions for handling the XML data.
The current release only contains the filtering engine of the YFilter system. More advanced features such as further transforming the matching XML messages based on query-specific requirements are not included in this release.
Since YFilter is a Java-based system, having a Java Runtime Environment is a pre-requisite. The filtering engine of YFilter also uses a number of libraries that are included in this package. This engine has been developed and tested on Linux (RedHat Fedora) and Windows (Cygwin).
The directory structure of the package is as described below.
COPYRIGHT lists the license under which this package has been released.
INSTALL is a quick and dirty installation guide.
bin/ contains the executable scripts in the package.
build/ contains the JAR file and the classes.
tests/ contain the correctness and performance test suites.
sample/ contains sample DTD and query files for the GUI.
include/ contains the required libraries.
src/ is the source tree containing the Java source files.
html/ contains this user manual and API documentation.
doc/ contains additional technical reference material.
YFILTER_HOMEenvironment variable to point to the YFilter installation directory.
PATHto access all the executables.
At this point, you are ready to use YFilter.
Makefiles are also provided for re-compiling the classes. Note that in this case, you need to modify your
CLASSPATHin order to use the newly compiled classes instead of the original
% yfilter (DOC_FILE | DOC_DIR) QUERY_FILE
.xmlare collected from the directory) and a query file. YFilter reads and parses queries from the query file, runs these queries against the document(s) and prints out the results. For example, the following command will process all the queries in the input query file with the specified XML document.
In the next example, all the queries in the input query file will be processed with each XML document in the specified XML directory.
% yfilter sample/xmldocs/outFile1.xml sample/queries3.txt
% yfilter sample/xmldocs/ sample/queries3.txt
This can be used by the casual user as a gentle introduction to YFilter. Section 4 walks the user through this mode.
EXfilterclass provides a test driver for bulkloading a set of queries and benchmarking the performance of the filtering engine. In particular, it offers a number of parameters that can be tweaked for benchmarking purposes. These parameters and their values are described in their man page.
This class outputs the performance measurements to the
file, containing the execution time (msec) for each document as well as
the average execution time over the entire set of documents (excluding
those used for warming up the JVM). By default, the query results are
output to the
result.txt file. The cost of writing
results to this file is not included in the measured execution time.
output option of
EXfilter is set to
query results are written to different files, one per document, in
order of document execution. When the
profile option of
turned on, the profiling result is output to the
Examples of uses of the parameters and some description of their impact on the measured performance of the engine can be found in Section 7 of this manual.
The main filter class is
Example implementations using this class can be found in the
function of the
class for command-line execution, and that of the
class for command-line benchmarking.
FilterInstanceAPI to add queries, bind documents, get results and delete queries.
FilterInstanceprovides a Java Thread wrapper over the
EXfilterBasicclass. A sample client class called
SimpleAppis included in the release. Its member functions demonstrate the use of the API.
% java edu.berkeley.cs.db.yfilter.filterinstance.SimpleApp
which brings up the YFilter User Interface.
System->Select DTDto select a DTD file used for generating queries. One example DTD, NITF (News Industry Text Format), is available in the
Optionally, select the queries to be bulk-loaded. Sample queries are also included in the
samplesub-directory. This step is optional, as the system will dynamically generate queries based on the specified DTD to simulate online query submissions.
Once YFilter processes the DTD file and bulkloads queries from the query file, the demo is ready to be run.
One Cycle. This will run all the queries in the filter through one document, processing it completely. Once the cycle is complete, the processed document and the results of the processing can be viewed. Clicking on
View XMLwill bring up the processed document in a new window.
View all queriesto get a list of all the queries in the filter, against which the above document was processed. This opens up in a new window.
View matched queriesto view the matched queries, which will also bring up a new window. In the
Matched Querieswindow, click on
Nextto view a list of matched queries. The elements that matched a particular query can be viewed by entering the identifier of the query in the
View matched elementstext box at the bottom of the window.
Runbutton in the control panel and open the monitoring panels using the
Monitormenu. Clicking on each menu item will bring up a window that displays the relevant parameters.
The syntax of the path expressions that YFilter supports is the following.
::= (( "/" | "//" ) StepExpr
)+ ( "/" AdditionalSelect
every path query is absolute */
 StepExpr ::= ( QName | "*") ( "[" Predicate "]" )* /* no Boolean "and"/"or" */
 Predicate ::= SimplePredicate | PathPredicate
 SimplePredicate ::= "@" QName "=" ( Literal | PositiveNumber ) /* attribute of an element */
| PositiveNumber /* position of an element */
| "position" "(" ")" Comparator PositiveNumber /* position of an element */
| "text" "(" ")" "=" ( Literal | PositiveNumber ) /* text data of an element */
 PathPredicate ::= ( ".//" | "./" )? SimpleStepExpr (( "/" | "//" ) SimpleStepExpr)*
/* nested paths are relative */
::= ( QName
| "*") ( "[" SimplePredicate
/* only one level of path nesting */
 AdditionalSelect ::= "@" QName | "text" "(" ")" /* select an attribute or the text data of the addressed element */
 Comparator ::= "=" | "!=" | ">" | "<" | ">=" | "<="
 QName ::= NCNAME /* no namespace */
 NCNAME ::= ( Letter | '_' ) ( NCNameChar )*
 NCNameChar ::= Letter | Digit | '.' | '-' | '_' /* no combiningChar or extender */
 Literal ::= "" [^"]* ""
 PositiveNumber ::= [1-9][0-9]*
 Digit ::= [0-9]
 Letter ::= [a-zA-Z]
This syntax allows YFilter to handle path expressions that
are commonly used for filtering XML documents. Examples of such
/nitf/head/docdata[@doc-idref = "iptc.32a"]/series
/nitf/head/tobject[@tobject.type = "news"]/*
/nitf//person[text( ) = "John"]
/nitf[head/tobject[@tobject.type = "news"]]//person[text( ) = "John"]
/nitf[.//person[text( ) = "John"]]/body//headline
The architecture of the YFilter filtering engine is shown in the following figure. The primary inputs to the engine are the queries that represent interest specifications and the XML messages. The output consists of a boolean value or a sequence of elements for each query that has been satisfied by the incoming message.
Architecture of the YFilter filtering engine
The major components of this architecture include:
The main filter package is
This package orchestrates the various components listed above to
provide the filtering service.
In this release, we include tests suites for both correctness tests and performance tests of the filtering engine.
Correctness tests are provided to program developers who modify the YFilter source code for their targeted applications. To ensure that the filtering engine performs correctly, those developers are strongly encouraged to re-run the correctness tests following the steps listed below.
correct.outdirectory from either
correct_unix.out.tgz(if you are a UNIX user) or
correct_win.out.zip(if you are a Windows user). Place
correct.outunder the current directory, i.e.,
diff.outsub-directory created under
$YFILTER_HOME/tests. For each of the five sub-directories of
diff.out, check if all the
".diff"files have a size of 0. If all the files in the five sub-directories are empty, the filtering engine has passed the correctness tests.
Test suites for benchmarking are also included in
The scripts for running the benchmarking tests are available in the
subdirectory. The names of these scripts start with "
These scripts show examples of using different parameters of
(see the man page for a description of them),
and the measured results demonstrate the impact of these parameters on the
performance of the engine.
The test provided by
focuses on distinct path expressions that do not contain any
$YFILTER_HOME/tests/benchmark_simple_paths/subdirectory (please unzip the query file first).
The test provided by
focuses on random path expressions without any predicates. This test is run similarly as
the previous one, except that it reads queries from
queries-500000.txt in the
$YFILTER_HOME/tests/benchmark_simple_paths/ subdirectory (please also unzip the query file first), and
varies the number of the processed queries from 1,000 to 500,000.
The test in
considers the same set of queries as
But it sets the engine in such a mode that the engine only performs operations for matching
path expressions, but not for collecting the results on which queries
are matched. It does so by first setting the
NONE, and then
nfa_opt parameter to 0 (note that the order
of setting these parameters matters here). The results of this test,
written to the same sub-directory as above, contain only the cost of
the core operations for matching path expressions, which can be used to
truly compare different path matching engines.
The user can create his own test suites using the sub-packages included in this release. Creating a new test takes the following three steps.
class scans an input DTD and generates useful statistics for query
generation. The man page of this
class describes its parameters. The following command shows an example
% java edu.berkeley.cs.db.yfilterplus.dtdscanner.DTDStat nitf-2-5.dtd nitf-2-5.stat
PathGenerator class reads the statistics
DTDStat, and generates a set of XPath
queries according to workload parameters. Refer to the man page for a description of
these parameters. For example, the following command generates 10
queries, each of which contains two value-based predicates and one
nested path expression.
% java edu.berkeley.cs.db.yfilterplus.querygenerator.PathGenerator
nitf-2-5.stat queries1.txt 10 6 0.2 0.2 --num_predicates=2 --num_nestedpaths=1
In the next sample, the command generates 10 distinct queries, each of which contains value-based predicates and nested paths according to a probabilistic model. This model is good for generating mixed query workloads.
% java edu.berkeley.cs.db.yfilterplus.querygenerator.PathGenerator
nitf-2-5.stat queries2.txt 10 6 0.2 0.2 --prob_predicate=0.03 --prob_nestedpath=0.2 --distinct=TRUE
class to run the test (again, see the man
page for a description of the parameters). Alternatively, compose a
script to run the test, following the examples shown in the previous
[position( ) < 5]) that occurs after some other predicates in the same location step. An example is
/nitf//person[text() = "John"]. YFilter currently does not handle such positional predicates properly. It is important to note, however, that YFilter does handle positional predicates that appear as the first or the only predicate in a location step, e.g.,
/nitf//person[text( ) = "John"].
YFilter also has the following known limitations:
We expect to fix these problems in our future releases. The users concerned with these issues can fix them by improving the code, or contact us for the details of the solutions.