% sample article for KAIS
% last modified by alison woollatt, 18 jan 00
% minor changes by xindong wu, 2 march 01
\documentclass{kais}
%
\usepackage{wrapfig,psfig,epsf}
\usepackage[dcucite]{harvard}
\newtheorem{defi}{Definition}[section]
\newtheorem{property}{Property}[section]
\newtheorem{algorithm}{Algorithm}[section]
\newcommand\changenum{%
\renewcommand\labelenumi{\theenumi}%
\renewcommand{\theenumi}{(\arabic{enumi})}%
}
\newcommand\changepnum{%
\renewcommand\labelenumi{\theenumi}%
\renewcommand{\theenumi}{($P_\arabic{enumi}$)}%
}
%\bibliographystyle{agsm}
\received{xxx}
\revised{xxx}
\accepted{xxx}
\pubyear{2000}
\pagerange{\pageref{firstpage}--\pageref{lastpage}}
\volume{xxx}
\begin{document}
\label{firstpage}
\title{Fast Association Discovery in Derivative Transaction
Collections}
\author[L. Shen et al.]{Li Shen$^1$, Hong Shen$^2$, Ling Cheng$^1$ and
Paul Pritchard$^2$\\ $^1$Department of Computer Science, Dartmouth
College, Hanover NH, USA; $^2$School of Computing\\ and Information
Technology, Griffith University, Nathan, Queensland, Australia}
\maketitle
\begin{abstract}
Association discovery from a transaction collection is an important
data-mining task. We study a new problem in this area whose solution
can provide users with valuable association rules in some relevant
collections: {\em association discovery in derivative transaction
collections\/}. In this problem, we are given association rules in
two transaction collections $D_{1}$ and $D_{2}$, and aim to find new
association rules in derivative transaction collections $D_{1}
\backslash D_{2}$, $D_{1} \cap D_{2}$, $D_{2} \backslash D_{1}$ and
$D_{1} \cup D_{2}$. Direct application of existing algorithms can
solve this problem, but in an expensive way. We propose an efficient
solution through making full use of already discovered information,
taking advantage of the relationships existing among relevant
collections, and avoiding unnecessary but expensive support-counting
operations by scanning databases. Experiments on well-known synthetic
data show that our solution consistently outperforms the naive
solution by factors from 2 to 3 in most cases. We also propose an
efficient parallelization of our approach, as parallel algorithms are
often important in the area of data mining.
\end{abstract}
\begin{keywords}
Association rule; Data mining; Itemset; Transaction collection
\end{keywords}
\section{Introduction}
\label{sec:intro}
Discovery of association rules is an important problem in the area of
data mining. This problem was introduced in \citeasnoun{agr:mabso}, and
can be formalized as follows. Let ${\cal I}=\{i_{1}, i_{2}, \ldots,
i_{m}\}$ be a set of literals, called {\em items\/}. Let ${\cal D}$
be a collection of transactions, where each transaction $t$ has a
unique identifier and contains a set of items such that $t \subseteq
{\cal I}$. A set of items is called an {\em itemset\/}, and an
itemset with $k$ items is called a {\em $k$-itemset\/}. The {\em
support\/} of an itemset $x$ in ${\cal D}$, denoted as $\sigma(x/{\cal
D})$, is the ratio of the number of transactions (in ${\cal D}$)
containing $x$ to the total number of transactions in ${\cal D}$. An
{\em association rule\/} is an expression $x \Rightarrow y$, where $x,
y \subseteq {\cal I}$ and $x \cap y = \emptyset$. The {\em
confidence\/} of $x \Rightarrow y$ is the ratio of $\sigma(x \cup
y/{\cal D})$ to $\sigma(x/{\cal D})$. We use {\sf minsup\/} and {\sf
minconf\/} to denote the user-specified minimum support and confidence
respectively. An itemset $x$ is {\em frequent\/} if $\sigma(x/{\cal
D}) \geq$ {\sf minsup\/}. An association rule $x \Rightarrow y$ is
{\em strong\/} if $x \cup y$ is frequent and $\frac{\sigma(x \cup
y/{\cal D})}{\sigma(x/{\cal D})} \geq$ {\sf minconf\/}. The problem
of {\em mining association rules\/} is to find all strong association
rules. For simplicity, all {\em mining tasks\/} mentioned later refer
to {\em mining tasks for association rules\/}.
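These definitions translate directly into code. Below is a minimal Python
sketch of support and confidence over a toy collection (the helper names
are ours, not from any reference implementation):

```python
def support(itemset, transactions):
    """sigma(x/D): fraction of transactions in D containing every item of x."""
    x = frozenset(itemset)
    return sum(x <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Confidence of the rule x => y, i.e. sigma(x u y / D) / sigma(x / D)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Toy collection D of four transactions.
D = [frozenset(t) for t in ({1, 2, 3}, {1, 2}, {2, 3}, {1, 4})]
support({1, 2}, D)       # 0.5: {1,2} occurs in 2 of the 4 transactions
confidence({1}, {2}, D)  # 2/3: sigma({1,2})/sigma({1}) = (2/4)/(3/4)
```

An itemset $x$ is then frequent exactly when its support reaches
{\sf minsup\/}, and a rule is strong under the two conditions above.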

According to the above formalization, any interesting association rule
is always related to a unique transaction collection and hence makes
sense only in this collection. Thus different collections are
associated with different mining tasks. In practice, given a large
transaction database, besides finding association rules related to the
whole database, users are often interested in rules related to some
part of the database. For example, given a shopping transaction
database, users may be interested in finding purchase patterns only
from all transactions made in the past two months. Hence, there may
exist many valuable mining tasks related to different sub-collections
of all transactions in a large database.

Let $D_{1}$ and $D_{2}$ be two transaction collections over the same
database. We say that $D_{1} \backslash D_{2}$, $D_{1} \cap D_{2}$,
$D_{2} \backslash D_{1}$ and $D_{1} \cup D_{2}$ are {\em derivative
transaction collections (DTCs in short) of $D_{1}$ and $D_{2}$\/}.
For simplicity, we denote $D_{1} \backslash D_{2}$ by $D_{1-2}$,
$D_{1} \cap D_{2}$ by $D_{1 \cap 2}$, $D_{2} \backslash D_{1}$ by
$D_{2-1}$ and $D_{1} \cup D_{2}$ by $D_{1 \cup 2}$. We observe that,
if both mining tasks related to $D_{1}$ and $D_{2}$ are interesting,
the mining tasks related to $D_{1-2}$, $D_{1 \cap 2}$, $D_{2-1}$ and
$D_{1 \cup 2}$ are often interesting as well. For example, after obtaining
all association patterns related to all transactions made in
January--February and February--March respectively, users may be
interested in finding association rules related to all transactions
made in January, February, March, or January--March. With this
observation, we introduce a new problem called finding association
rules in DTCs: {\em based on the results of two completed mining tasks
in two transaction collections (over the same database) $D_{1}$ and
$D_{2}$, find association rules related to DTCs of $D_{1}$ and
$D_{2}$\/}.

In addition to providing users with more valuable association rules in
relevant collections, our problem can also be regarded as a more
general case of maintenance of association rules. Actually, a
traditional maintenance problem involving insertion, deletion, and
modification can be efficiently reduced to our problem. Let $D$ be
the original database, $D_{in}$ be the collection of new transactions,
and $D_{de}$ be the collection of deleted transactions. Here, if a
transaction $d$ is modified to $d'$, $d$ is regarded as a deleted
transaction ($\in D_{de}$) and $d'$ as a new transaction ($\in
D_{in}$). The maintenance problem is to generate associations in $(D
\backslash D_{de})\cup D_{in}$. To efficiently achieve this goal, we
can first do mining on $D_{in}$ and $D_{de}$ respectively, which is
affordable because we usually have $|D| \gg |D_{in}|$ and $|D|
\gg |D_{de}|$. After that, we let $D_{1}=D$ and $D_{2}=D_{de}$, and
so the solution of our problem can provide associations in $D
\backslash D_{de}$. Finally, we let $D_{1}=D \backslash D_{de}$ and
$D_{2}=D_{in}$, and so the solution of our problem can provide
associations in $(D \backslash D_{de})\cup D_{in}$.

The study of our problem can also be applied in the following
situation introduced by incremental maintenance. Assume that a sales
database is not a static one. From time to time, new records are
appended to record new purchase activities. Association rule
maintenance is done monthly. We use $D_{i,j}$ to denote the collection
of all sales transactions made by the end of the $j$th month of the
$i$th year. It may well happen that we have already obtained all
associations in $D_{1998,12}$ and $D_{1999,12}$ respectively. Now if
we let $D_{1}=D_{1998,12}$ and $D_{2}=D_{1999,12}$, the solution of
our problem can provide all associations in all transactions made in
1999 (i.e., $D_{1999,12} \backslash D_{1998,12}$), which is usually a
result of considerable interest.

With the above observations, we are interested in developing efficient
algorithms for our problem. There have been many studies on fast
association discovery in a fixed transaction collection
\cite{par:aehba,sav:aeafm,agr:fdoar,agr:pmoar,che:dmaof,zak:npaff,lin:psana}.
Thus, a naive approach for our problem is to directly execute any
existing algorithm in each of the DTCs, respectively. However, this
approach involves a great deal of redundant computation and is
inefficient, because it ignores the relationships existing among DTCs
as well as much of the already discovered information. We develop efficient
algorithms for our problem in this paper. We present some preliminary
concepts and properties in Section~\ref{sec:precon}; describe our
approach in Section~\ref{sec:algorithm}; study its performance
experimentally in Section~\ref{sec:perstu}; propose a parallelization
of the approach in Section~\ref{sec:parallel}; and conclude the paper
in Section~\ref{sec:con}.
\section{Preliminary Concepts and Properties}
\label{sec:precon}
In practice, a task-relevant transaction collection is often retrieved
by a database query. Usually users are able to provide the
corresponding database queries for DTCs. Hence we can assume that DTCs
are retrieved by user-described queries rather than by applying set
operations ($\cup$, $\cap$ and $\backslash$) to the original collections,
so that obtaining the DTCs incurs no extra cost in our study.

Since it is easy to generate all strong association rules from all
frequent itemsets (see \citenb{agr:fdoar} for a fast algorithm), we
focus our study on {\em how to generate all frequent itemsets in DTCs
efficiently\/}. Note that both ${\cal I}$ and {\sf minsup\/} are {\em
fixed\/} in our study. Now we introduce some useful concepts and
relevant properties to help our discussion. Given an itemset $x
\subseteq {\cal I}$, we use ${\cal P}(x)$ to denote the set of all
subsets of $x$. A set of itemsets $L$ is called {\em subset-closed\/}
if any subset of any itemset in $L$ must also be in $L$.
\begin{defi} \label{def:negaborder}
Assume that $L$ is a subset-closed set of itemsets. The {\sf negative
border\/} of $L$, denoted by ${\cal NB}(L)$, is defined as: ${\cal
NB}(L)=\{x \mid x \subseteq {\cal I}$, $x \not \in L$, ${\cal
P}(x)\backslash\{x\} \subseteq L\}$, i.e., the set of all the minimal
itemsets $x \subseteq {\cal I}$ not in $L$ (also refer to
\cite{man:lsabt,tho:eaiua}).
\end{defi}
For example, let ${\cal I}=\{1,2,3,4,5\}$ and $L$=$\{\emptyset,
\{1\},\{2\},\{3\},\{5\},$ $\{1,2\},\{2,3\},$ $\{2,5\},\{3,5\},$
$\{2,3,5\}\}$. Note that $\{1,5\}$ should be in ${\cal NB}(L)$ because
it is not in $L$, but all its subsets are. The whole negative border
is ${\cal NB}(L)$=$\{\{4\},$$\{1,3\},$$\{1,5\}\}$. The intuition
behind the concept is that given a subset-closed set of itemsets that
are frequent, the negative border contains the {\em closest\/}
itemsets that could be frequent, too.
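To make the definition concrete, here is a small Python sketch that
computes ${\cal NB}(L)$ for a subset-closed $L$; it reproduces the example
above (the function name is ours):

```python
from itertools import combinations

def negative_border(L, items):
    """NB(L): the minimal itemsets over `items` that are not in L.

    Assumes L is subset-closed, so checking the immediate (one-item-smaller)
    subsets of a candidate suffices to know all proper subsets are in L."""
    L = set(L)
    border = set()
    max_size = max(len(x) for x in L)
    for k in range(max_size + 2):          # border elements have size <= max+1
        for combo in combinations(sorted(items), k):
            x = frozenset(combo)
            if x not in L and all(x - {i} in L for i in x):
                border.add(x)
    return border

I = {1, 2, 3, 4, 5}
L = [frozenset(s) for s in
     [(), (1,), (2,), (3,), (5,), (1, 2), (2, 3), (2, 5), (3, 5), (2, 3, 5)]]
negative_border(L, I)   # {{4}, {1,3}, {1,5}}, as in the example
```

Because $L$ is subset-closed, checking only the immediate subsets of each
candidate is enough; this is the same observation that makes level-wise
candidate pruning correct later on.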
\begin{property} \label{pro:negaborder}
Let $L_{1}$ and $L_{2}$ be two subset-closed sets of itemsets.
\begin{enumerate}
\changenum
\item If $L_{1} \subseteq L_{2}$,
then $L_{1} \cup {\cal NB}(L_{1}) \subseteq L_{2} \cup {\cal NB}(L_{2})$.
%
\item $(L_{1} \cup L_{2}) \cup {\cal NB}(L_{1} \cup L_{2})
= (L_{1} \cup {\cal NB}(L_{1})) \cup (L_{2} \cup {\cal NB}(L_{2}))$.
\end{enumerate}
\end{property}
\begin{proof}
(1--2) Straightforward by Definition~\ref{def:negaborder}.
\end{proof}
We use ${\cal L}(D)$ to denote {\em the set of all frequent itemsets
in transaction collection~$D$}. To find ${\cal L}(D)$, we need to
first obtain supports for a set of candidates (potentially frequent
itemsets), say $C$, and then select those candidates with support
exceeding {\sf minsup\/} from $C$ to form ${\cal L}(D)$. To guarantee
the completeness of ${\cal L}(D)$, all elements of ${\cal NB}({\cal
L}(D))$ usually need to be contained in $C$ \cite{man:lsabt}.
Thus, we use ${\cal C}(D)$ to denote ${\cal L}(D) \cup {\cal NB}({\cal
L}(D))$, and say that $x$ is a {\em necessary candidate\/} for the
mining task in $D$ if $x \in {\cal C}(D)$.
\begin{property} \label{pro:nececan}
Let $D_{1}$ and $D_{2}$ be two transaction collections.
\begin{enumerate}
\changenum
\item If $D_{1} \cap D_{2} = \emptyset$, then ${\cal C}(D_{1} \cup
D_{2}) \subseteq {\cal C}(D_{1}) \cup {\cal C}(D_{2})$.
%
\item ${\cal C}(D_{1}) \subseteq {\cal C}(D_{1-2}) \cup {\cal C}(D_{1
\cap 2})$.
%
\item ${\cal C}(D_{2}) \subseteq {\cal C}(D_{1 \cap 2})
\cup {\cal C}(D_{2-1})$.
\end{enumerate}
\end{property}
\begin{proof}
(1) Let $D=D_{1} \cup D_{2}$ and $a \in {\cal L}(D)$, which implies
$\sigma(a/D) \geq $ {\sf minsup\/}. By relevant definitions, we have
$\sigma(a/D_{1})*|D_{1}|+\sigma(a/D_{2})*|D_{2}| =\sigma(a/D)*|D|
=\sigma(a/D)*(|D_{1}|+|D_{2}|) \geq $ {\sf
minsup\/}$*(|D_{1}|+|D_{2}|)$. Thus, we have either $\sigma(a/D_{1})
\geq $ {\sf minsup\/} or $\sigma(a/D_{2}) \geq $ {\sf minsup\/}, and
so $a \in {\cal L}(D_{1}) \cup {\cal L}(D_{2})$. Hence, ${\cal L}(D)
\subseteq {\cal L}(D_{1}) \cup {\cal L}(D_{2})$. Therefore, by
Property~\ref{pro:negaborder}(1)(2), we have
\begin{eqnarray*}
{\cal C}(D) &=& {\cal L}(D) \cup {\cal NB}({\cal L}(D))
\subseteq ({\cal L}(D_{1}) \cup {\cal L}(D_{2}))
\cup {\cal NB}({\cal L}(D_{1}) \cup {\cal L}(D_{2}))\\
%
&=& ({\cal L}(D_{1}) \cup {\cal NB}({\cal L}(D_{1})))
\cup ({\cal L}(D_{2}) \cup {\cal NB}({\cal L}(D_{2})))
= {\cal C}(D_{1}) \cup {\cal C}(D_{2}).
\end{eqnarray*}
(2--3) Straightforward by (1); also refer to Fig.~\ref{fig:dtcs}.
\end{proof}
\begin{figure}%fig 1
\footnotesize
\begin{picture}(120,55)(-75,0)
\put(0,80){
\put(0,-50){\framebox(60,20){$D_{1-2}$}}
\put(60,-50){\framebox(60,20){$D_{1 \cap 2}$}}
\put(120,-50){\framebox(60,20){$D_{2-1}$}}
\put(-3,-60){
\begin{picture}(120,10)
\setlength{\unitlength}{1pt}
\put(20,10){\oval(40,10)[bl]}
\put(20,0){\oval(40,10)[tr]}
\put(80,0){\oval(80,10)[tl]}
\put(80,10){\oval(80,10)[br]}
%\put(0,0){\framebox(120,10)}
\end{picture}
}
\put(38,-68){$D_{1}$}
\put(57,-60){
\begin{picture}(120,10)
\setlength{\unitlength}{1pt}
\put(40,10){\oval(80,10)[bl]}
\put(40,0){\oval(80,10)[tr]}
\put(100,0){\oval(40,10)[tl]}
\put(100,10){\oval(40,10)[br]}
\end{picture}
}
\put(135,-68){$D_{2}$}
\put(-3,-78){
\begin{picture}(120,10)
\setlength{\unitlength}{1.5pt}
\put(30,10){\oval(60,10)[bl]}
\put(30,0){\oval(60,10)[tr]}
\put(90,0){\oval(60,10)[tl]}
\put(90,10){\oval(60,10)[br]}
%\put(0,0){\framebox(120,10)}
\end{picture}
}
\put(78,-85){$D_{1 \cup 2}$}
}
\end{picture}
\caption{The relationships among $D_{1}$, $D_{2}$ and their DTCs}
\label{fig:dtcs}
\end{figure}
\begin{property} \label{pro:calsupcan}
Let $D_{1}$ and $D_{2}$ be two transaction collections.
\begin{enumerate}
\changenum
\item $\sigma(a/D_{1})*|D_{1}|=
\sigma(a/D_{1-2}) * |D_{1-2}|
+\sigma(a/D_{1 \cap 2})*|D_{1 \cap 2}|$.
\item $\sigma(a/D_{2})*|D_{2}|=
\sigma(a/D_{1 \cap 2})*|D_{1 \cap 2}|
+\sigma(a/D_{2-1}) * |D_{2-1}|$.
\item $\sigma(a/D_{1 \cup 2})*|D_{1 \cup 2}|=
\sigma(a/D_{1})*|D_{1}|
+\sigma(a/D_{2-1}) * |D_{2-1}|$.
\end{enumerate}
\end{property}
\begin{proof}
(1--3) Straightforward by the definition of {\em support\/};
also refer to Fig.~\ref{fig:dtcs}.
\end{proof}
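These counting identities are easy to verify numerically. The following
sketch checks identities (1) and (3) on a toy partition ((2) is symmetric;
the data are illustrative only):

```python
def support(itemset, transactions):
    """sigma(x/D): fraction of transactions in D containing itemset x."""
    x = frozenset(itemset)
    return sum(x <= t for t in transactions) / len(transactions)

# A toy partition: D1 = D_{1-2} u D_{1^2} and D2 = D_{1^2} u D_{2-1}.
d_1m2 = [frozenset(t) for t in ({1, 2}, {1, 3})]         # D_{1-2}
d_cap = [frozenset(t) for t in ({1, 2, 3},)]             # D_{1 cap 2}
d_2m1 = [frozenset(t) for t in ({2, 3}, {1, 2, 3})]      # D_{2-1}
d1, d2 = d_1m2 + d_cap, d_cap + d_2m1
d_cup = d_1m2 + d_cap + d_2m1                            # D_{1 cup 2}

a = {1, 2}
# (1): sigma(a/D1)*|D1| = sigma(a/D_{1-2})*|D_{1-2}| + sigma(a/D_{1^2})*|D_{1^2}|
lhs1 = support(a, d1) * len(d1)
rhs1 = support(a, d_1m2) * len(d_1m2) + support(a, d_cap) * len(d_cap)
# (3): sigma(a/D_{1u2})*|D_{1u2}| = sigma(a/D1)*|D1| + sigma(a/D_{2-1})*|D_{2-1}|
lhs3 = support(a, d_cup) * len(d_cup)
rhs3 = support(a, d1) * len(d1) + support(a, d_2m1) * len(d_2m1)
assert abs(lhs1 - rhs1) < 1e-9 and abs(lhs3 - rhs3) < 1e-9
```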
Based on the definition of ${\cal C}(D)$, it is easy to see that the
task of finding ${\cal L}(D)$ can be reduced to the task of obtaining
supports for a set of candidates $C$, where $C$ may be any superset of
${\cal C}(D)$. In our problem, since two mining tasks in $D_{1}$ and
$D_{2}$ have been completed, it is reasonable to assume that all
$\sigma(x/D_{1})$ and $\sigma(y/D_{2})$ are available for $x \in {\cal
C}(D_{1})$ and $y \in {\cal C}(D_{2})$. Our main task is to generate
all $\sigma(a/D_{1-2})$, $\sigma(b/D_{1 \cap 2})$, $\sigma(c/D_{2-1})$
and $\sigma(d/D_{1 \cup 2})$ for $a \in C_{1-2}$, $b \in C_{1 \cap
2}$, $c \in C_{2-1}$ and $d \in C_{1 \cup 2}$, where $C_{1-2}, C_{1
\cap 2}, C_{2-1}$ and $C_{1 \cup 2}$ can be any supersets of ${\cal
C}(D_{1-2}), {\cal C}(D_{1 \cap 2}), {\cal C}(D_{2-1})$ and ${\cal
C}(D_{1 \cup 2})$ respectively.

To complete the main task efficiently, we should make full use of the
already available $\sigma(x/D_{1})$ and $\sigma(y/D_{2})$ for $x \in
{\cal C}(D_{1})$ and $y \in {\cal C}(D_{2})$. From
Property~\ref{pro:nececan}(2), we know that any element $x$ of ${\cal
C}(D_{1})$ must also be a necessary candidate in either $D_{1-2}$ or
$D_{1 \cap 2}$, so its support has to be obtained in at least one of
them. Actually, by Property~\ref{pro:calsupcan}(1), once we have the
support of $x$ in either of $D_{1-2}$ and $D_{1 \cap 2}$, we can
obtain its support in the other collection immediately by
calculating an expression, e.g., $\sigma(x/D_{1 \cap
2})=\frac{\sigma(x/D_{1})*|D_{1}|- \sigma(x/D_{1-2}) *
|D_{1-2}|}{|D_{1 \cap 2}|}$. Compared with support-counting by a scan
of database, the cost for calculating an expression is negligible.
So, at the main cost of support-counting by scanning only one of
$D_{1-2}$ and $D_{1 \cap 2}$, we can obtain both $\sigma(x/D_{1-2})$
and $\sigma(x/D_{1 \cap 2})$ for all $x \in {\cal C}(D_{1})$. Thus,
it is not only reasonable but also cost-effective to regard all the
elements of ${\cal C}(D_{1})$ as candidates in both $D_{1-2}$ and
$D_{1 \cap 2}$. Likewise, we will take all the elements of ${\cal
C}(D_{2})$ as candidates in both $D_{1 \cap 2}$ and $D_{2-1}$.
Furthermore, after we have obtained $\sigma(c/D_{2-1})$ for all $c \in
{\cal C}(D_{2})$, it is easy to calculate $\sigma(d/D_{1 \cup 2})$ for
all $d \in {\cal C}(D_{1}) \cap {\cal C}(D_{2})$ by
Property~\ref{pro:calsupcan}(3). These observations motivate us to
complete the main task by using the following two phases.
\begin{quote}
{\bf Phase 1:\/} Calculating $\sigma(a/D_{1-2})$, $\sigma(b/D_{1 \cap
2})$, $\sigma(c/D_{2-1})$ and $\sigma(d/D_{1 \cup 2})$ for all $a \in
{\cal C}(D_{1})$, $b \in {\cal C}(D_{1}) \cup {\cal C}(D_{2})$, $c \in
{\cal C}(D_{2})$ and $d \in {\cal C}(D_{1}) \cap {\cal C}(D_{2})$.

{\bf Phase 2:\/} Calculating $\sigma(a/D_{1-2})$, $\sigma(b/D_{1 \cap
2})$, $\sigma(c/D_{2-1})$ and $\sigma(d/D_{1 \cup 2})$ for all $a \in
{\cal C}(D_{1-2}) \backslash {\cal C}(D_{1})$, $b \in {\cal C}(D_{1
\cap 2}) \backslash ({\cal C}(D_{1}) \cup {\cal C}(D_{2}))$, $c \in
{\cal C}(D_{2-1}) \backslash {\cal C}(D_{2})$, and $d \in {\cal
C}(D_{1 \cup 2}) \backslash ({\cal C}(D_{1}) \cap {\cal C}(D_{2}))$.
\end{quote}
In Phase 1, we can use Property~\ref{pro:calsupcan} and the available
$\sigma(x/D_{1})$ and $\sigma(y/D_{2})$ for $x \in {\cal C}(D_{1})$
and $y \in {\cal C}(D_{2})$ to save unnecessary computation that the
naive approach contains. In Phase 2, we calculate supports only for
necessary candidates whose supports are unknown so that the cost of
support-counting can be minimized.
\section{Algorithm}
\label{sec:algorithm}
We describe the approaches for solving Phase 1 and Phase 2 mentioned
above in Sections~\ref{subsec:phase1} and~\ref{subsec:phase2}
respectively; and then present our method for finding frequent
itemsets in DTCs efficiently in Section~\ref{subsec:algorithm}.
\subsection{Support Calculation for some Useful Candidates}
\label{subsec:phase1}
Here we present an efficient algorithm for Phase 1 mentioned before,
which aims to complete the following tasks of support calculation:
\begin{quote}
{\bf Task 1:\/} Calculating $\sigma(a/D_{1-2})$
for all $a \in {\cal C}(D_{1})$.

{\bf Task 2:\/} Calculating $\sigma(b/D_{1 \cap 2})$
for all $b \in {\cal C}(D_{1}) \cup {\cal C}(D_{2})$.

{\bf Task 3:\/} Calculating $\sigma(c/D_{2-1})$
for all $c \in {\cal C}(D_{2})$.

{\bf Task 4:\/} Calculating $\sigma(d/D_{1 \cup 2})$
for all $d \in {\cal C}(D_{1}) \cap {\cal C}(D_{2})$.
\end{quote}
We first consider how to complete Tasks 1--3 because the result of
Task 4 can be obtained immediately after Tasks 1--3 are completed.
Suppose that we need to calculate $\sigma(x/D)$ for all $x \in C$. We
can use the following support-counting method: each candidate $x \in
C$ is compared with each transaction $t \in D$ to determine if $t$
contains $x$. To get all desired supports, the total number of
operations of comparing two itemsets is $|C|*|D|$. A more efficient
method of support-counting can be used, through storing the candidate
set $C$ in a {\em hash tree\/} (see \citenb{agr:fdoar} for
details). However, its expected time cost is still
$O(|C|*|D|)$. Hence, for simplicity, we assume that obtaining
$\sigma(x/D)$ by counting each transaction in $D$ for all $x \in C$
requires $|C|*|D|$ operations of comparing two itemsets.
As mentioned before, $\sigma(x/D_{1})$ and $\sigma(y/D_{2})$ are
available for all $x \in {\cal C}(D_{1})$ and $y \in {\cal C}(D_{2})$.
Thus, once we have $\sigma(x/D_{1 \cap 2})$ for all $x \in {\cal
C}(D_{1})$, we can use Property~\ref{pro:calsupcan}(1) to get
$\sigma(x/D_{1-2})$ for all $x \in {\cal C}(D_{1})$ (Task 1). This
approach needs only $|{\cal C}(D_{1})|$ operations of calculating an
expression, and is thus much faster than the method of support-counting by a
scan of $D_{1-2}$, whose cost is in the order of $O(|{\cal
C}(D_{1})|*|D_{1-2}|)$. Likewise, we can use
Property~\ref{pro:calsupcan} to accelerate the calculations of
$\sigma(x/D_{1 \cap 2})$ for all $x \in {\cal C}(D_{1}) \cup {\cal
C}(D_{2})$ (Task 2) and $\sigma(x/D_{2-1})$ for all $x \in {\cal
C}(D_{2})$ (Task 3) when some preconditions are met.
Thus, we can propose several different methods to complete Tasks~1--3.
First, we introduce the following three procedures, where $C$ is a
candidate set and $D$ is a transaction collection.
\begin{quote}
{\bf CntSup($C, D$)} \mbox{}\quad {// scan $D$ to count $\sigma(x/D)$
for all $x \in C$\/} \\ \mbox{}\quad {\bf for} each transaction $t \in
D$ {\bf do} \\ \mbox{}\qquad{\bf for} each candidate $x \in C$ such
that $x \subseteq t$ {\bf do} $x.sup$++ \\ \mbox{}\quad {\bf for} each
candidate $x \in C$ {\bf do} $\sigma(x/D):=\frac{x.sup}{|D|}$

{\bf CalSupA($C, D, D_{a}, D_{b}$)\/}:
$\sigma(x/D)=\frac{\sigma(x/D_{a})*|D_{a}|-\sigma(x/D_{b}) *
|D_{b}|}{|D|}$ for all $x \in C$.

{\bf CalSupB($C, D, D_{a}, D_{b}$)\/}:
$\sigma(x/D)=\frac{\sigma(x/D_{a})*|D_{a}|+\sigma(x/D_{b}) *
|D_{b}|}{|D|}$ for all $x \in C$.
\end{quote}
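In Python, the three procedures can be sketched as follows
(dictionary-based supports; the names and signatures are ours):

```python
def cnt_sup(C, D):
    """CntSup: one scan of D to count sigma(x/D) for every candidate x in C."""
    counts = {x: 0 for x in C}
    for t in D:
        for x in counts:
            if x <= t:                     # candidate x is contained in t
                counts[x] += 1
    return {x: n / len(D) for x, n in counts.items()}

def cal_sup_a(C, sup_a, n_a, sup_b, n_b, n_d):
    """CalSupA: sigma(x/D) = (sigma(x/Da)*|Da| - sigma(x/Db)*|Db|) / |D|."""
    return {x: (sup_a[x] * n_a - sup_b[x] * n_b) / n_d for x in C}

def cal_sup_b(C, sup_a, n_a, sup_b, n_b, n_d):
    """CalSupB: sigma(x/D) = (sigma(x/Da)*|Da| + sigma(x/Db)*|Db|) / |D|."""
    return {x: (sup_a[x] * n_a + sup_b[x] * n_b) / n_d for x in C}
```

For instance, Method 1 would call \texttt{cnt\_sup} once on $D_{1 \cap 2}$
and then derive the supports in $D_{1-2}$ and $D_{2-1}$ with two calls to
\texttt{cal\_sup\_a}, at negligible extra cost.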
Now we present five different methods for completing Tasks~1--3 as
follows:
\begin{quote}
{\bf Method 1:\/}
CntSup(${\cal C}(D_{1}) \cup {\cal C}(D_{2})$, $D_{1 \cap 2}$); \\
CalSupA(${\cal C}(D_{1})$, $D_{1-2}$, $D_{1}$, $D_{1 \cap 2}$);
CalSupA(${\cal C}(D_{2})$, $D_{2-1}$, $D_{2}$, $D_{1 \cap 2}$).

{\bf Method 2:\/}
CntSup(${\cal C}(D_{1})$, $D_{1-2}$);
CntSup(${\cal C}(D_{2}) \backslash {\cal C}(D_{1})$, $D_{1 \cap 2}$); \\
CalSupA(${\cal C}(D_{1})$, $D_{1 \cap 2}$, $D_{1}$, $D_{1-2}$);
CalSupA(${\cal C}(D_{2})$, $D_{2-1}$, $D_{2}$, $D_{1 \cap 2}$).

{\bf Method 3:\/}
CntSup(${\cal C}(D_{2})$, $D_{2-1}$);
CntSup(${\cal C}(D_{1}) \backslash {\cal C}(D_{2})$, $D_{1 \cap 2}$); \\
CalSupA(${\cal C}(D_{2})$, $D_{1 \cap 2}$, $D_{2}$, $D_{2-1}$);
CalSupA(${\cal C}(D_{1})$, $D_{1-2}$, $D_{1}$, $D_{1 \cap 2}$).

{\bf Method 4:\/}
CntSup(${\cal C}(D_{1})$, $D_{1-2}$);
CntSup(${\cal C}(D_{2}) \backslash {\cal C}(D_{1})$, $D_{2-1}$); \\
CalSupA(${\cal C}(D_{1})$, $D_{1 \cap 2}$, $D_{1}$, $D_{1-2}$);
CalSupA(${\cal C}(D_{2}) \backslash {\cal C}(D_{1})$, $D_{1 \cap 2}$,
$D_{2}$, $D_{2-1}$); \\
CalSupA(${\cal C}(D_{1}) \cap {\cal C}(D_{2})$, $D_{2-1}$, $D_{2}$,
$D_{1 \cap 2}$).

{\bf Method 5:\/}
CntSup(${\cal C}(D_{2})$, $D_{2-1}$);
CntSup(${\cal C}(D_{1}) \backslash {\cal C}(D_{2})$, $D_{1-2}$); \\
CalSupA(${\cal C}(D_{2})$, $D_{1 \cap 2}$, $D_{2}$, $D_{2-1}$);
CalSupA(${\cal C}(D_{1}) \backslash {\cal C}(D_{2})$, $D_{1 \cap 2}$,
$D_{1}$, $D_{1-2}$); \\
CalSupA(${\cal C}(D_{1}) \cap {\cal C}(D_{2})$, $D_{1-2}$, $D_{1}$,
$D_{1 \cap 2}$).
\end{quote}
Clearly, Methods 1--5 all contain $|{\cal C}(D_{1})|+|{\cal
C}(D_{2})|$ operations of calculating an expression. Moreover, let
$O_{i}$ denote the number of operations of comparing two itemsets
in Method $i$. These costs are easily derived as follows:
\begin{eqnarray*}
O_{1}&=&|{\cal C}(D_{1}) \cup {\cal C}(D_{2})| * |D_{1 \cap 2}|\\
O_{2}&=&|{\cal C}(D_{1})|*|D_{1-2}| +
|{\cal C}(D_{2}) \backslash {\cal C}(D_{1})|*|D_{1 \cap 2}|\\
O_{3}&=&|{\cal C}(D_{2})|*|D_{2-1}| +
|{\cal C}(D_{1}) \backslash {\cal C}(D_{2})|*|D_{1 \cap 2}|\\
O_{4}&=&|{\cal C}(D_{1})|*|D_{1-2}| +
|{\cal C}(D_{2}) \backslash {\cal C}(D_{1})|*|D_{2-1}|\\
O_{5}&=&|{\cal C}(D_{2})|*|D_{2-1}| +
|{\cal C}(D_{1}) \backslash {\cal C}(D_{2})|*|D_{1-2}|
\end{eqnarray*}
Since ${\cal C}(D_{1}), {\cal C}(D_{2}), D_{1-2}, D_{1 \cap 2}$ and
$D_{2-1}$ are all available, the computation costs of Methods 1--5 can
be obtained before we start Tasks 1--3. So, we can determine the most
efficient method in advance and use it to complete Tasks 1--3. Thus,
we present our algorithm for efficiently completing Tasks 1--4 as
follows.
\begin{algorithm} \label{alg:phase1}
Support calculation for some useful candidates.
\begin{quote}
{\bf Input:}
(1) $\sigma(x/D_{1})$ for all $x \in {\cal C}(D_{1})$,
(2) $\sigma(y/D_{2})$ for all $y \in {\cal C}(D_{2})$.

{\bf Output:}
The results of Tasks 1--4 mentioned before.
\end{quote}
\end{algorithm}
\begin{quote}
{\bf Step 1:} Find $i$, $1 \leq i \leq 5$, such that
$O_{i}=\min\{O_{1}, \ldots, O_{5}\}$.

{\bf Step 2:} Use Method $i$ defined above to complete Tasks 1--3.

{\bf Step 3:} Call CalSupB(${\cal C}(D_{1}) \cap {\cal C}(D_{2})$,
$D_{1 \cup 2}$, $D_{1}$, $D_{2-1}$) to complete Task 4.
\end{quote}
It is easy to see the correctness of Algorithm~\ref{alg:phase1}. The
cost of Algorithm~\ref{alg:phase1} is analyzed as follows: (1)~main
cost: $\min\{O_{1}, \ldots, O_{5}\}$ operations of comparing two itemsets;
(2)~minor cost: $|{\cal C}(D_{1})|+|{\cal C}(D_{2})|+|{\cal C}(D_{1})
\cap {\cal C}(D_{2})|$ operations of calculating an expression; and
(3)~I/O cost: one scan of relevant transactions.
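Step 1 of Algorithm~\ref{alg:phase1} amounts to five arithmetic cost
evaluations followed by an argmin. A Python sketch (the set sizes passed
in are hypothetical inputs):

```python
def best_method(c1, c2, c_both, n_1m2, n_cap, n_2m1):
    """Pick the cheapest of Methods 1-5.

    c1 = |C(D1)|, c2 = |C(D2)|, c_both = |C(D1) ^ C(D2)|;
    n_1m2, n_cap, n_2m1 = |D_{1-2}|, |D_{1^2}|, |D_{2-1}|."""
    c1_only = c1 - c_both                  # |C(D1) \ C(D2)|
    c2_only = c2 - c_both                  # |C(D2) \ C(D1)|
    c_union = c1 + c2 - c_both             # |C(D1) u C(D2)|
    costs = {1: c_union * n_cap,
             2: c1 * n_1m2 + c2_only * n_cap,
             3: c2 * n_2m1 + c1_only * n_cap,
             4: c1 * n_1m2 + c2_only * n_2m1,
             5: c2 * n_2m1 + c1_only * n_1m2}
    return min(costs.items(), key=lambda kv: kv[1])   # (method index, cost)

# With a large candidate overlap and a small D_{1^2}, Method 1 wins:
best_method(100, 100, 80, 1000, 10, 1000)   # (1, 1200)
```

Ties are broken toward the lower-numbered method, which is harmless since
tied methods cost the same.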
\subsection{Support Calculation for other Necessary Candidates}
\label{subsec:phase2}
In this section, we study the solution of Phase 2 of our main
task. The key problem is how to implement the following procedure
called {\em OtherSup($D$, $C$)\/}:
\begin{quote}
{\em OtherSup($D$, $C$)\/} aims to generate $\sigma(x/D)$ for all $x
\in {\cal C}(D) \backslash C$, where $D$ is a transaction collection
and $C$ is a set of itemsets with known supports.
\end{quote}
Phase 2 can be completed by calling
{\em OtherSup($D_{1-2}$, ${\cal C}(D_{1})$)\/},
{\em OtherSup($D_{1 \cap 2}$, ${\cal C}(D_{1}) \cup {\cal C}(D_{2}))$\/},
{\em OtherSup($D_{2-1}$, ${\cal C}(D_{2})$)\/} and
{\em OtherSup($D_{1 \cup 2}$, ${\cal C}(D_{1}) \cap {\cal C}(D_{2}))$\/}.
Since {\em Apriori\/} \cite{agr:fdoar} is one of the best
algorithms for mining tasks in a fixed collection, we use a modified
{\em Apriori\/} approach to implement {\em OtherSup($D$, $C$)\/} as
follows. Our algorithm needs multiple scans over the data. In the
$k$th scan, it starts with $L_{k-1}$, the set of all frequent
$(k-1)$-itemsets, and uses $L_{k-1}$ to generate $C_{k}$, which is a
superset of all frequent $k$-itemsets. The algorithm finds the support
count for all itemsets in $C_{k} \backslash C$ during the scan over
the data, because the supports of all elements of $C$ are already
known. At the end of the scan, it selects all itemsets with support
exceeding {\sf minsup\/} from $C_{k}$ to generate $L_{k}$ for the next
pass. This process continues until no new frequent itemsets can be found.
\begin{algorithm} \label{alg:phase2}
The implementation of OtherSup($D$, $C$).
\end{algorithm}
\begin{quote}
{\bf Procedure\/} OtherSup($D$, $C$) \\
\mbox{}\quad $C_{1}$:=the set of all $1$-itemsets \\
\mbox{}\quad {\bf for} ($k:=1$; $C_{k} \not= \emptyset$; $k$++) {\bf do} \\
\mbox{}\qquad $C_{known}:=C_{k} \cap C$ and $C_{k}:=C_{k}
\backslash C_{known}$ \\
\mbox{}\qquad CntSup($C_{k}$, $D$)
\quad {\em// scan $D$ to count $\sigma(x/D)$ for all $x \in C_{k}$\/} \\
\mbox{}\qquad $L_{k}:= \{x \in C_{k} \cup C_{known} \mid
\sigma(x/D) \geq$ {\sf minsup}\} \\
\mbox{}\qquad $C_{k+1}:=$getcanditemset$(L_{k})$

{\bf Function} getcanditemset($L_{k}$) \\
\mbox{}\quad $C_{t}:=\{x \mid x=l_{1} \cup l_{2}, x$ is a $(k+1)$-itemset,
$l_{1}, l_{2} \in L_{k}\}$ \\
\mbox{}\quad {\bf for} $x \in C_{t}$ {\bf do}
{\bf if} some $k$-subset of $x$ is not in $L_{k}$ {\bf then}
delete $x$ from $C_{t}$ \\
\mbox{}\quad {\bf return} $C_{t}$
\end{quote}
It is easy to see that Algorithm~\ref{alg:phase2} employs the idea
described above. The correctness of Algorithm~\ref{alg:phase2} is
based on the fact that {\em Apriori\/} precisely takes ${\cal C}(D)$
as its candidate set and counts supports only for all necessary
candidates (see \citenb{agr:fdoar} for details). In our modified
version, we count supports exactly for all necessary candidates not
in $C$, i.e., all itemsets in ${\cal C}(D) \backslash C$. Similar to
{\em Apriori\/}, Algorithm~\ref{alg:phase2} requires multiple passes
over the database, which tends to incur high I/O cost. To reduce this
I/O cost, we can also use the {\em scan reduction\/} technique (refer to
\citenb{che:dmaof} for details).
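A compact Python sketch of OtherSup, assuming the known supports are
passed as a dict from itemset (frozenset) to support (the names are ours,
and scan reduction is omitted):

```python
from itertools import combinations

def other_sup(D, C_known, minsup):
    """Apriori-style level-wise search that scans D only for candidates
    whose supports are not already known (i.e., not in C_known)."""
    sup = dict(C_known)
    Ck = [frozenset([i]) for i in sorted({i for t in D for i in t})]
    frequent, k = set(), 1
    while Ck:
        unknown = [x for x in Ck if x not in sup]
        counts = {x: 0 for x in unknown}
        for t in D:                        # one scan of D per level
            for x in unknown:
                if x <= t:
                    counts[x] += 1
        sup.update({x: n / len(D) for x, n in counts.items()})
        Lk = {x for x in Ck if sup[x] >= minsup}
        frequent |= Lk
        # getcanditemset: join Lk with itself, then prune by the k-subsets
        Ck = [x for x in {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
              if all(frozenset(s) in Lk for s in combinations(x, k))]
        k += 1
    return frequent, sup
```

Passing the supports of ${\cal C}(D_{1})$ as \texttt{C\_known} corresponds
to the call {\em OtherSup($D_{1-2}$, ${\cal C}(D_{1})$)\/} in the text.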
\subsection{Algorithm Description}
\label{subsec:algorithm}
Now we can present our algorithm for finding all frequent itemsets in
DTCs.
\begin{algorithm} \label{alg:main}
Finding all frequent itemsets in DTCs.
\begin{quote}
{\bf Input:}
(1) $\sigma(x/D_{1})$ for all $x \in {\cal C}(D_{1})$,
(2) $\sigma(y/D_{2})$ for all $y \in {\cal C}(D_{2})$.

{\bf Output:} ${\cal L}(D_{1-2})$, ${\cal L}(D_{1 \cap 2})$,
${\cal L}(D_{2-1})$ and ${\cal L}(D_{1 \cup 2})$.
\end{quote}
\end{algorithm}
\begin{quote}
{\bf Phase 1:} Use Algorithm~\ref{alg:phase1} to generate
\end{quote}
\begin{enumerate}
\changenum
\item $\sigma(a/D_{1-2})$ for all $a \in {\cal C}(D_{1})$,
\item $\sigma(b/D_{1 \cap 2})$ for all $b \in {\cal C}(D_{1})
\cup {\cal C}(D_{2})$,
\item $\sigma(c/D_{2-1})$ for all $c \in {\cal C}(D_{2})$, and
\item $\sigma(d/D_{1 \cup 2})$ for all $d \in {\cal C}(D_{1}) \cap
{\cal C}(D_{2})$.
\end{enumerate}
\begin{quote}
{\bf Phase 2:} Call {\em OtherSup($D_{1-2}$, ${\cal C}(D_{1})$)\/},
{\em OtherSup($D_{1 \cap 2}$, ${\cal C}(D_{1}) \cup {\cal
C}(D_{2}))$\/},\hfill\break {\em OtherSup($D_{2-1}$, ${\cal
C}(D_{2})$)\/} and {\em OtherSup($D_{1 \cup 2}$, ${\cal C}(D_{1}) \cap
{\cal C}(D_{2}))$\/} to generate
\end{quote}
\begin{enumerate}
\changenum
\item $\sigma(a/D_{1-2})$ for all $a \in {\cal C}(D_{1-2})
\backslash {\cal C}(D_{1})$,
\item $\sigma(b/D_{1 \cap 2})$ for all $b \in
{\cal C}(D_{1 \cap 2}) \backslash ({\cal C}(D_{1}) \cup {\cal C}(D_{2}))$,
\item $\sigma(c/D_{2-1})$ for all $c \in {\cal C}(D_{2-1})
\backslash {\cal C}(D_{2})$, and
\item $\sigma(d/D_{1 \cup 2})$ for all $d \in
{\cal C}(D_{1 \cup 2}) \backslash ({\cal C}(D_{1}) \cap {\cal C}(D_{2}))$.
\end{enumerate}
\begin{quote}
{\bf Phase 3:} Respectively select itemsets with support exceeding
{\sf minsup\/}
\end{quote}
\begin{enumerate}
\changenum
\item from ${\cal C}(D_{1-2}) \cup {\cal C}(D_{1})$
to form ${\cal L}(D_{1-2})$,
\item from ${\cal C}(D_{1 \cap 2}) \cup ({\cal C}(D_{1}) \cup
{\cal C}(D_{2}))$ to form ${\cal L}(D_{1 \cap 2})$,
\item from ${\cal C}(D_{2-1}) \cup {\cal C}(D_{2})$
to form ${\cal L}(D_{2-1})$, and
\item from ${\cal C}(D_{1 \cup 2}) \cup ({\cal C}(D_{1}) \cap
{\cal C}(D_{2}))$ to form ${\cal L}(D_{1 \cup 2})$.
\end{enumerate}
It is easy to see the correctness of Algorithm~\ref{alg:main}. In
this algorithm, by Property~\ref{pro:calsupcan} we minimize the number
of candidates whose supports are counted by scanning databases. On the
other hand, we also avoid much of the redundant support-counting
computation contained in the naive solution.
\section{Performance Study}
\label{sec:perstu}
For comparison, we use {\em Apriori\/} to implement the naive solution
({\em Naive\/} for short) mentioned in Section~\ref{sec:intro}, as
{\em Apriori\/} is one of the best algorithms for finding frequent
itemsets in a fixed collection. To assess the performances of
Algorithm~\ref{alg:main} and {\em Naive\/}, we performed several
experiments on a SUN ULTRA-1 workstation with 64 MB main memory
running Sun OS 5.5. To keep the comparison fair, we implemented all
the algorithms using the same basic data structures.
To get a group of synthetic datasets $D_{1}$, $D_{2}$, $D_{1-2}$,
$D_{1 \cap 2}$, $D_{2-1}$ and $D_{1 \cup 2}$, we first use a tool
described in \citeasnoun{agr:fdoar} to generate $D_{1 \cup 2}$. Then,
by using given parameters $a$, $b$ and $c$, we randomly partition
$D_{1 \cup 2}$ into three parts $D_{1-2}$, $D_{1 \cap 2}$ and $D_{2-1}$
such that $|D_{1-2}|:|D_{1 \cap 2}|:|D_{2-1}|=a:b:c$. Finally, we
generate $D_{1}=D_{1-2} \cup D_{1 \cap 2}$ and $D_{2}=D_{1 \cap 2}
\cup D_{2-1}$. We use the notation $Rabc.Tx.Iy.Dz$ to denote such a
group of datasets in which $x$ is the average transaction size, $y$ is
the average size of a maximal potentially frequent itemset, $z$ is the
number of transactions in $D_{1 \cup 2}$, and $a:b:c=|D_{1-2}|:|D_{1
\cap 2}|:|D_{2-1}|$.
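The dataset construction above can be sketched as follows (our own sketch; any transaction list may stand in for the output of the generator of \citeasnoun{agr:fdoar}):

```python
import random

def make_group(d_union, a, b, c, seed=0):
    """Randomly partition D_{1 u 2} into D_{1-2}, D_{1 cap 2}, D_{2-1}
    with |D_{1-2}|:|D_{1 cap 2}|:|D_{2-1}| = a:b:c, then form D_1, D_2."""
    txns = list(d_union)
    random.Random(seed).shuffle(txns)
    n, total = len(txns), a + b + c
    cut1 = n * a // total
    cut2 = n * (a + b) // total
    d_1m2, d_cap, d_2m1 = txns[:cut1], txns[cut1:cut2], txns[cut2:]
    d1 = d_1m2 + d_cap   # D_1 = D_{1-2} union D_{1 cap 2}
    d2 = d_cap + d_2m1   # D_2 = D_{1 cap 2} union D_{2-1}
    return d_1m2, d_cap, d_2m1, d1, d2
```

For example, {\tt make\_group} with $a{:}b{:}c = 1{:}1{:}1$ on 600 transactions yields three parts of 200 transactions each.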
Figure~\ref{fig:time} shows the experimental results for nine groups
of synthetic datasets to compare the performances of
Algorithm~\ref{alg:main} and {\em Naive\/}. We observe that
Algorithm~\ref{alg:main} outperforms {\em Naive\/}, by factors ranging
from 2 to 3 in most cases.
\begin{figure}%fig 2
\centerline{\underline{Figure 2 to be inserted here.}}
% Figure files are not available here.
% \centerline{\psfig{figure=figs/R111.T10.I4.D200K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R181.T10.I4.D200K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I4.D200K.eps,height=2.8cm,width=4cm}}
% \centerline{\psfig{figure=figs/R111.T10.I6.D300K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R512.T10.I6.D300K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I6.D300K.eps,height=2.8cm,width=4cm}}
% \centerline{\psfig{figure=figs/R111.T10.I8.D500K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R226.T10.I8.D500K.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I8.D500K.eps,height=2.8cm,width=4cm}}
\caption{Performance comparison of Algorithm~\ref{alg:main} and {\em Naive}}
\label{fig:time}
\end{figure}
\begin{figure}%fig 3
\centerline{\underline{Figure 3 to be inserted here.}}
% \centerline{\psfig{figure=figs/R111.T10.I4.D200K.can.eps,
% height=2.8cm,width=4cm}
% \psfig{figure=figs/R181.T10.I4.D200K.can.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I4.D200K.can.eps,height=2.8cm,width=4cm}}
% \centerline{\psfig{figure=figs/R111.T10.I6.D300K.can.eps,
% height=2.8cm,width=4cm}
% \psfig{figure=figs/R512.T10.I6.D300K.can.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I6.D300K.can.eps,height=2.8cm,width=4cm}}
% \centerline{\psfig{figure=figs/R111.T10.I8.D500K.can.eps,
% height=2.8cm,width=4cm}
% \psfig{figure=figs/R226.T10.I8.D500K.can.eps,height=2.8cm,width=4cm}
% \psfig{figure=figs/R316.T10.I8.D500K.can.eps,height=2.8cm,width=4cm}}
\caption{The number of candidates (excluding all $1$-itemsets and $2$-itemsets)
whose supports are obtained by counting all relevant transactions in
Algorithm~\ref{alg:main} and {\em Naive}}
\label{fig:candi}
\end{figure}
In theory, the main cost of both algorithms is support-counting by
scanning databases. In practice, the sets ${\cal C}(D_{1-2})$, ${\cal
C}(D_{1 \cap 2})$, ${\cal C}(D_{2-1})$ and ${\cal C}(D_{1 \cup 2})$
usually share many elements, because an itemset frequent in one
collection tends to be frequent in other collections over the same
database. Thus, in Algorithm~\ref{alg:main}, many necessary candidates
obtain their supports through the much faster derivation based on
Property~\ref{pro:calsupcan}, rather than through the expensive
counting of every relevant transaction. Consequently, the number of
candidates whose supports are counted by scanning databases in
Algorithm~\ref{alg:main} is often much smaller than in {\em Naive\/},
and Algorithm~\ref{alg:main} outperforms {\em Naive\/}.
To confirm this observation, we also measured the number of candidates
whose supports are obtained by counting each relevant transaction.
Figure~\ref{fig:candi} shows the test results. Note that we used a
one-dimensional array and a two-dimensional array to obtain the
frequent $1$-itemsets and $2$-itemsets (refer to \citenb{lin:psana}),
so that this process runs very fast and its cost is negligible. Thus,
to keep the comparison fair, the candidates shown in
Fig.~\ref{fig:candi} do not include $1$-itemsets and $2$-itemsets.
Clearly, Fig.~\ref{fig:candi} shows that Algorithm~\ref{alg:main} can
reduce the number of necessary candidates whose supports are counted
by scanning databases, and hence often runs much faster than {\em
Naive\/}.
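The array technique can be sketched as follows (our own sketch, not the exact procedure of \citenb{lin:psana}): item supports are accumulated in a one-dimensional array and pair supports in a two-dimensional array, so both levels are counted in a single scan with direct indexing.

```python
from itertools import combinations

def frequent_1_and_2(transactions, n_items, minsup):
    """Count every 1- and 2-itemset with direct array indexing in a
    single scan, then keep those whose support reaches minsup."""
    cnt1 = [0] * n_items
    cnt2 = [[0] * n_items for _ in range(n_items)]  # upper triangle used
    for t in transactions:
        items = sorted(set(t))
        for i in items:
            cnt1[i] += 1
        for i, j in combinations(items, 2):
            cnt2[i][j] += 1
    n = len(transactions)
    l1 = {frozenset([i]) for i in range(n_items) if cnt1[i] / n >= minsup}
    l2 = {frozenset([i, j])
          for i in range(n_items) for j in range(i + 1, n_items)
          if cnt2[i][j] / n >= minsup}
    return l1, l2
```

Direct indexing avoids any hash or subset test at these two levels, which is why their cost is negligible in our experiments.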
\section{Parallelization}
\label{sec:parallel}
Most data-mining problems, including ours, are computationally
intensive, so parallel algorithms for them are of practical
interest. Here, we study how to parallelize Algorithm~\ref{alg:main}
on a shared-memory architecture with $N$ processors $P_{1} \ldots
P_{N}$, assuming $N \geq 4$. First, we introduce the following
parallel procedures, in
which $C_{i}$ and $D_{i}$ are determined by the processor id such that
(1) $C_{i}$ contains $1/N$th of itemsets in $C$, (2)
$C=\bigcup_{i=1}^{N} C_{i}$, (3) $D_{i}$ contains $1/N$th of
transactions in $D$, and (4) $D=\bigcup_{i=1}^{N} D_{i}$.
\begin{quote}
{\bf ParCntSup($C, D$): the first implementation\/} \\
\mbox{}\quad {\bf for\/} $i:=1$ to $N$ at processor $P_{i}$ {\bf do\/}
{\bf in parallel\/} \\
\mbox{}\qquad CntSup($C_{i}$, $D$)
\quad {\em // scan $D$ to count $\sigma(x/D)$ for all $x \in C_{i}$\/}

{\bf ParCntSup($C, D$): the second implementation\/} \\
\mbox{}\quad {\bf for\/} $i:=1$ to $N$ at processor $P_{i}$ {\bf do\/}
{\bf in parallel\/} \\
\mbox{}\qquad {\bf for} each transaction $t \in D_{i}$ {\bf do} \\
\mbox{}\qquad\quad {\bf for} each candidate $x \in C$ such that $x \subseteq t$
{\bf do} $x.sup$++ \\
\mbox{}\quad {\bf for\/} $i:=1$ to $N$ at processor $P_{i}$ {\bf do\/}
{\bf in parallel\/} \\
\mbox{}\qquad {\bf for} each candidate $x \in C_{i}$ {\bf do}
$\sigma(x/D):=\frac{x.sup}{|D|}$

{\bf ParCalSupA($C, D, D_{a}, D_{b}$)\/}\\
\mbox{}\quad {\bf for\/} $i:=1$ to $N$ at processor $P_{i}$ {\bf do\/}
{\bf in parallel\/} \\
\mbox{}\qquad $\sigma(x/D):=\frac{\sigma(x/D_{a})*|D_{a}|-\sigma(x/D_{b})
* |D_{b}|}{|D|}$ for all $x \in C_{i}$

{\bf ParCalSupB($C, D, D_{a}, D_{b}$)\/}\\
\mbox{}\quad {\bf for\/} $i:=1$ to $N$ at processor $P_{i}$ {\bf do\/}
{\bf in parallel\/} \\
\mbox{}\qquad $\sigma(x/D):=\frac{\sigma(x/D_{a})*|D_{a}|+\sigma(x/D_{b})
* |D_{b}|}{|D|}$ for all $x \in C_{i}$
\end{quote}
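A thread-based sketch of the second implementation of {\em ParCntSup\/} follows (our own rendering; per-worker counters are merged at the end, which stands in for the shared $x.sup$ updates and avoids write conflicts):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def par_cnt_sup(candidates, transactions, n_workers=4):
    """Second ParCntSup implementation, sketched with threads: worker i
    counts all candidates over its 1/N-th of D; per-worker counts are
    then merged and turned into supports sigma(x/D)."""
    cands = [frozenset(x) for x in candidates]
    chunks = [transactions[i::n_workers] for i in range(n_workers)]

    def count(chunk):
        local = Counter()
        for t in chunk:
            ts = set(t)
            for x in cands:
                if x <= ts:   # candidate x is contained in transaction t
                    local[x] += 1
        return local

    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        for local in ex.map(count, chunks):
            total.update(local)
    n = len(transactions)
    return {x: total[x] / n for x in cands}
```

Partitioning the transactions (rather than the candidates, as in the first implementation) balances the load when candidate subset tests have very uneven costs.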
Now we can use the following six steps to solve our problem in parallel.
\begin{quote}
{\bf Step 1:\/} This step completes the same task as Step 1 in
Algorithm~\ref{alg:phase1}. Since the cost of this step is negligible,
we simply use $P_{1}$ to complete it.

{\bf Step 2:\/} This step is a natural parallelization of Step 2 in
Algorithm~\ref{alg:phase1}: we use parallel procedures {\em
ParCntSup\/}\footnote{ Either of the implementations of {\em
ParCntSup\/} defined above can be chosen. } and {\em ParCalSupA\/} to
replace the corresponding sequential procedures {\em CntSup\/} and
{\em CalSupA\/} contained in Step 2 of Algorithm~\ref{alg:phase1}
respectively. For example, if Method 1 is chosen in this step, we
complete it by calling ParCntSup(${\cal C}(D_{1}) \cup {\cal
C}(D_{2})$, $D_{1 \cap 2}$), ParCalSupA(${\cal C}(D_{1})$, $D_{1-2}$,
$D_{1}$, $D_{1 \cap 2}$) and ParCalSupA(${\cal C}(D_{2})$, $D_{2-1}$,
$D_{2}$, $D_{1 \cap 2}$).

{\bf Step 3:\/} This step undertakes the same task as Step 3 in
Algorithm~\ref{alg:phase1}. To complete this step, we call
ParCalSupB(${\cal C}(D_{1}) \cap {\cal C}(D_{2})$, $D_{1 \cup 2}$,
$D_{1}$, $D_{2-1}$).
\end{quote}
Clearly, the above Steps 1--3 are a natural parallelization of
Algorithm~\ref{alg:phase1}, which completes Phase 1 of
Algorithm~\ref{alg:main}. However, OtherSup($D$, $C$), called in Phase
2 of Algorithm~\ref{alg:main}, has a sequential bottleneck: it runs
several iterations, and no iteration can start until the previous one
finishes. A straightforward parallelization of Phase 2 therefore
cannot be efficient. The key is to generate all candidates at once,
rather than level by level as in the levelwise method. Thus, we
introduce the following function, in which {\em getcanditemset\/} has
been defined in Algorithm~\ref{alg:phase2}.
\begin{quote}
{\bf Function\/} AllCan($D$, $C$)\\
\mbox{}\quad $C_{infrq}:=\{c \in C \mid \sigma(c/D)<$ {\sf minsup}\}\\
\mbox{}\quad $C_{1}$:=\{all frequent $1$-itemsets\}
and $C_{2}$:=\{all frequent $2$-itemsets\}\footnote{
As mentioned in Section~\ref{sec:perstu}, the array technique makes
this process run very fast.} \\
\mbox{}\quad {\bf for} ($k:=2$; $C_{k} \not= \emptyset$; $k$++) {\bf do}\\
\mbox{}\qquad $C_{k}:=C_{k} \backslash C_{infrq}$
and then $C_{k+1}:=$getcanditemset$(C_{k})$; \\
\mbox{}\quad {\bf return\/} $(\bigcup_{k} C_{k}) \backslash C$;
\end{quote}
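AllCan can be sketched as follows (our own Python rendering; here {\em getcanditemset\/} is the standard Apriori-style join-and-prune, which may differ in detail from the one in Algorithm~\ref{alg:phase2}):

```python
from itertools import combinations

def getcanditemset(freq_k):
    """Apriori-style generation: join two k-itemsets sharing k-1 items;
    keep the (k+1)-set only if all its k-subsets are in freq_k."""
    freq = set(freq_k)
    out = set()
    for a in freq:
        for b in freq:
            u = a | b
            if len(u) == len(a) + 1 and \
               all(frozenset(s) in freq for s in combinations(u, len(a))):
                out.add(u)
    return out

def all_can(freq1, freq2, c, sup, minsup):
    """AllCan(D, C) sketch: generate every candidate level in one pass,
    pruning with the itemsets of C known to be infrequent in D, and
    return the union of all levels minus C itself."""
    c = set(c)
    c_infrq = {x for x in c if sup[x] < minsup}
    union, ck = set(freq1), set(freq2)
    while ck:
        ck -= c_infrq          # discard candidates known to be infrequent
        union |= ck
        ck = getcanditemset(ck)
    return union - c
```

Since each generated level has strictly larger itemsets, the loop necessarily terminates; the pruning with $C_{infrq}$ is what keeps the returned set small.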
Suppose AllCan($D$, $C$) returns $C'$. It is easy to see that $C'$ is
a superset of all frequent itemsets in $D$ other than those contained
in $C$. Furthermore, $C'$ tends not to be very large in practice,
because (1) the input $C$ often contains quite a few elements, (2)
$C'$ excludes every element of $C$, and (3) the infrequent itemsets in
$C$ are used inside AllCan($D$, $C$) to prune unnecessary candidates
and keep $C'$ as small as possible. Thus, the following Steps 4 and 5
complete work similar to Phase 2 of Algorithm~\ref{alg:main}.
\begin{quote}
{\bf Step 4:\/} Processors $P_{1} \sim P_{4}$ execute the following tasks in
parallel.
\end{quote}
\begin{enumerate}[($P_2$)]
\changepnum
\item $C_{1-2}$:=AllCan($D_{1-2}$, ${\cal C}(D_{1})$).
\item $C_{1 \cap 2}$:=AllCan($D_{1 \cap 2}$,
${\cal C}(D_{1}) \cup {\cal C}(D_{2})$).
\item $C_{2-1}$:=AllCan($D_{2-1}$, ${\cal C}(D_{2})$).
\item $C_{1 \cup 2}$:=AllCan($D_{1 \cup 2}$,
${\cal C}(D_{1}) \cap {\cal C}(D_{2})$).
\end{enumerate}
\begin{quote}%ajw
{\bf Step 5:\/} Call ParCntSup($C_{1-2}$, $D_{1-2}$),
ParCntSup($C_{1 \cap 2}$, $D_{1 \cap 2}$),
ParCntSup($C_{2-1}$, $D_{2-1}$) and
ParCntSup($C_{1 \cup 2}$, $D_{1 \cup 2}$).
\end{quote}
Finally, we use the following Step 6 to obtain the desired result.
\begin{quote}
{\bf Step 6:\/} Processors $P_{1} \sim P_{4}$ execute the following tasks in
parallel.
\end{quote}
\begin{enumerate}[($P_2$)]
\changepnum
\item select frequent itemsets from
$C_{1-2} \cup {\cal C}(D_{1})$ to form ${\cal L}(D_{1-2})$.
\item select frequent itemsets from
$C_{1 \cap 2} \cup ({\cal C}(D_{1}) \cup {\cal C}(D_{2}))$
to form ${\cal L}(D_{1 \cap 2})$.
\item select frequent itemsets from
$C_{2-1} \cup {\cal C}(D_{2})$ to form ${\cal L}(D_{2-1})$.
\item select frequent itemsets from
$C_{1 \cup 2} \cup ({\cal C}(D_{1}) \cap {\cal C}(D_{2}))$
to form ${\cal L}(D_{1 \cup 2})$.
\end{enumerate}
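The task parallelism of Steps 4 and 6 can be sketched with a thread pool (our own sketch; {\tt select\_frequent} plays the role of one Step-6 task, and threads stand in for processors $P_{1} \sim P_{4}$):

```python
from concurrent.futures import ThreadPoolExecutor

def select_frequent(new_cands, old_cands, sup, minsup):
    """One Step-6 task: keep every itemset whose support reaches minsup
    among the new candidates and the previously known ones."""
    return {x for x in set(new_cands) | set(old_cands)
            if sup.get(x, 0.0) >= minsup}

def step6(task_args, minsup):
    """Dispatch the four independent selections to four workers;
    task_args holds one (new_cands, old_cands, sup) triple per task."""
    with ThreadPoolExecutor(max_workers=4) as ex:
        futs = [ex.submit(select_frequent, nc, oc, s, minsup)
                for nc, oc, s in task_args]
        return [f.result() for f in futs]
```

The four tasks share no writable state, so no synchronization beyond the final join is needed.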
It is easy to see that the above solution is correct. In general, it
is a natural parallel extension of Algorithm~\ref{alg:main}, because
the major part of Algorithm~\ref{alg:main} is suitable for
parallelization. As for the only sequential bottleneck in
Algorithm~\ref{alg:main}, our parallel solution removes it by
generating some extra candidates in Step 4.
\section{Conclusion}
\label{sec:con}
We have addressed a new problem whose solution can provide users with
valuable association rules in relevant transaction collections: {\em
finding association rules in DTCs (derivative transaction
collections)\/}. In this problem, we are given association rules in
two transaction collections $D_{1}$ and $D_{2}$, and required to find
new association rules in DTCs $D_{1-2}$, $D_{1 \cap 2}$, $D_{2-1}$ and
$D_{1 \cup 2}$. Direct application of existing algorithms can solve
this problem, but in a cost-ineffective way. We have presented an
efficient solution to this problem. The main ideas of our approach
are (1) making full use of already discovered information, (2) taking
advantage of the relationships among the relevant collections, and
(3) avoiding unnecessary and expensive support-counting by scanning
databases. We have studied the performance of our solution
experimentally and found that it consistently outperforms the naive
solution, by factors from 2 to 3 in most cases. Finally, we have
presented a simple parallelization of our approach, since parallel
algorithms are often necessary in data mining.
We would like to point out that the savings of our approach rely on
the precomputed mining results for $D_{1}$ and $D_{2}$, i.e., the
results of some other necessary mining tasks. What happens if $D_{1}$
and $D_{2}$ have not been mined in the first place? Certainly, we can
first do the precomputation and then employ our algorithms, but this
is not efficient. With the observation that ${\cal L}(D_{1
\cup 2}) \subseteq {\cal L}(D_{1 - 2}) \cup {\cal L}(D_{1 \cap 2})
\cup {\cal L}(D_{2-1})$ (refer to \citenb{cheng:emoar}), an
alternative and faster approach is to first mine $D_{1-2}$,
$D_{1 \cap 2}$ and $D_{2-1}$ separately using {\em Apriori\/},
then count supports in $D_{1 \cup 2}$ for each itemset in ${\cal
L}(D_{1 - 2}) \cup {\cal L}(D_{1 \cap 2}) \cup {\cal L}(D_{2-1})$, and
finally generate ${\cal L}(D_{1 \cup 2})$. It remains an interesting
future topic to study the above problem in detail and see if there are
other efficient solutions. Another interesting topic for future
research is to propose and study new mining tasks related to multiple
data collections.
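The containment ${\cal L}(D_{1 \cup 2}) \subseteq {\cal L}(D_{1-2}) \cup {\cal L}(D_{1 \cap 2}) \cup {\cal L}(D_{2-1})$ holds by a pigeonhole argument: if an itemset reached {\sf minsup\/} in none of the three parts, its total count would fall below {\sf minsup\/}$\cdot|D_{1 \cup 2}|$. A brute-force check on random data (our own toy miner, used only for verification):

```python
import random
from itertools import combinations

def frequent(transactions, minsup, max_len=3):
    """Brute-force frequent-itemset miner, for checking only."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    sets = [set(t) for t in transactions]
    return {frozenset(c)
            for k in range(1, max_len + 1)
            for c in combinations(items, k)
            if sum(set(c) <= s for s in sets) / n >= minsup}

rng = random.Random(1)
parts = [[rng.sample(range(6), rng.randint(1, 4)) for _ in range(30)]
         for _ in range(3)]          # D_{1-2}, D_{1 cap 2}, D_{2-1}
union = parts[0] + parts[1] + parts[2]
minsup = 0.4
l_union = frequent(union, minsup)
l_parts = set().union(*(frequent(p, minsup) for p in parts))
assert l_union <= l_parts            # the containment always holds
```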
\begin{acknowledgements}
We thank the anonymous reviewers for their very useful comments and
suggestions. Part of this work was done while Li Shen and Ling Cheng
were doing research at Griffith University. The work was supported by
the Australian Research Council (ARC) Large Grant A849602031.
\end{acknowledgements}
\begin{thebibliography}{99}
\harvarditem{Agrawal et al}{1993}{agr:mabso}
Agrawal R, Imielinski T, Swami A (1993)
Mining association rules between sets of items in large databases.
In Buneman P, Jajodia S (eds).
Proceedings of the ACM SIGMOD international conference on management of data,
Washington DC, May 1993, pp~207--216
\harvarditem{Agrawal et al}{1996}{agr:fdoar}
Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996)
Fast discovery of association rules.
In Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds).
Advances in knowledge discovery and data mining,
Ch~12. AAAI/MIT Press, Cambridge, MA, pp~307--328
\harvarditem{Agrawal and Shafer}{1996}{agr:pmoar}
Agrawal R, Shafer JC (1996)
Parallel mining of association rules.
IEEE Transactions on Knowledge and Data Engineering
8(6):962--969
\harvarditem{Chen et al}{1996}{che:dmaof} Chen MS, Han J, Yu PS (1996)
Data mining: an overview from a database perspective.
IEEE Transactions on Knowledge and Data Engineering
8(6):866--883
\harvarditem{Cheung et al}{1996}{cheng:emoar}
Cheung DW, Ng V, Fu AW, Fu Y (1996)
Efficient mining of association rules in distributed databases.
IEEE Transactions on Knowledge and Data Engineering
8(6):911--922
\harvarditem{Lin and Kedem}{1998}{lin:psana} Lin DI, Kedem ZM (1998)
Pincer-search: a new algorithm for discovering the maximum frequent
set.
In Schek H, Saltor F, Ramos I, Alonso G (eds).
Proceedings of advances in database technology
(EDBT '98), 6th international conference on extending database
technology, Valencia, Spain, March 1998.
Lecture Notes in Computer Science 1377, Springer, Berlin, pp~105--119
\harvarditem{Mannila and Toivonen}{1997}{man:lsabt}
Mannila H, Toivonen H (1997)
Levelwise search and borders of theories in knowledge discovery.
Data Mining and Knowledge Discovery
1(3):241--258
\harvarditem{Park et al}{1995}{par:aehba} Park JS, Chen MS, Yu PS (1995)
An effective hash-based algorithm for mining association rules.
In Carey MJ, Schneider DA (eds).
Proceedings of 1995 ACM-SIGMOD international conference on management of data,
San Jose, CA, May 1995, pp~175--186
\harvarditem{Savasere et al}{1995}{sav:aeafm}
Savasere A, Omiecinski E, Navathe S (1995)
An efficient algorithm for mining association rules in large databases.
In Dayal U, Gray PMD, Nishio S (eds).
Proceedings of 21st international conference on very large data bases,
Zurich, Switzerland, September 1995, pp~432--444
\harvarditem{Thomas et al}{1997}{tho:eaiua}
Thomas S, Bodagala S, Alsabti K, Ranka S (1997)
An efficient algorithm for the incremental updation of
association rules in large databases.
In Heckerman D, Mannila H, Pregibon D, Uthurusamy R (eds).
Proceedings of the 3rd international conference on knowledge discovery and
data mining, AAAI Press, Newport, CA, August 1997, pp~263--266
\harvarditem{Zaki et al}{1997}{zak:npaff} Zaki MJ, Parthasarathy S,
Ogihara M, Li W (1997) New parallel algorithms for fast discovery of
association rules, Data Mining and Knowledge Discovery: Special Issue
on Scalable High-Performance Computing for KDD 1(4): 343--373
\end{thebibliography}
\section*{Author Biographies}
\leavevmode
\vbox{%
\begin{wrapfigure}{l}{80pt}
{\vspace*{15pt}\fbox{insert photo}\vspace*{100pt}}%
\end{wrapfigure}
\noindent\small
{\bf Li Shen} received a B.E. degree from Xi'an Jiao Tong University,
Xi'an, China, in 1993 and an M.E. degree from Shanghai Jiao Tong
University, Shanghai, China, in 1996. From 1997 to 1998, he did
research work in the Data Mining Group at the School of Computing and
Information Technology, Griffith University, Brisbane, Australia. He
is currently a Ph.D. student at the Department of Computer Science,
Dartmouth College, USA. His research interests include data mining,
multimedia, database and software engineering.\vadjust{\vspace{40pt}}}
\vbox{%
\begin{wrapfigure}{l}{80pt}
{\vspace*{15pt}\fbox{insert photo}\vspace*{100pt}}%
\end{wrapfigure}
\noindent\small {\bf Hong Shen} is currently Associate Professor
(Reader) in the School of Computing and Information Technology and
Research Director of Parallel Computing Unit at Griffith
University. He has held visiting posts at several leading universities
in the USA, Europe and Asia. With main research interests in algorithms,
parallel and distributed computing, interconnection networks, parallel
databases and data mining, multimedia systems and networking, he has
authored over 120 technical papers. Dr Shen has served as an editor
of the {\it Journal of Parallel and Distributed Computing Practice},
associate editor of the {\it International Journal of Parallel and
Distributed Systems and Networks}, and on the editorial board of the
{\it Journal of Parallel Algorithms and Applications}, {\it
International Journal of Computer Mathematics} and the {\it Journal of
Supercomputing}. He has been involved in organization of various
international conferences.}
\vspace{15pt}
\vbox{%
\begin{wrapfigure}{l}{80pt}
{\vspace*{15pt}\fbox{insert photo}\vspace*{100pt}}%
\end{wrapfigure}
\noindent\small
{\bf Ling Cheng} received a B.E. degree from Xi'an Jiao Tong
University, Xi'an, China, in 1993 and an M.E. degree from Shanghai
Jiao Tong University, Shanghai, China, in 1996. From 1996 to 1997, she
worked in the IBM China Research Lab, Beijing, China. In 1998, she
was a visiting scholar at the School of Computing and Information
Technology, Griffith University, Brisbane, Australia. She is
currently a graduate student at the Department of Computer Science,
Dartmouth College, USA. Her research interests include data mining,
multimedia, database and e-commerce.}
\vspace{15pt}
\vbox{%
\begin{wrapfigure}{l}{80pt}
{\vspace*{15pt}\fbox{insert photo}\vspace*{100pt}}%
\end{wrapfigure}
\noindent\small
{\bf Paul Pritchard} is currently Professor in the School of Computing
and Information Technology and Executive Director of the Parallel
Computing Unit at Griffith University. He is also joint Manager of
the Queensland Parallel Supercomputing Facility (QPSF), and served as
Head of the School (1991--1993) and Dean of the Faculty of Information
and Communication Technology (1997--1999). With main research interests in
algorithms, programming languages, high-performance computing,
Internet computing and data mining, Professor Pritchard has published
numerous papers in several areas of computer science. He is best known
for his contributions to number-theoretic algorithms, and his wheel
sieve is the fastest known algorithm for enumerating the prime
numbers. He is foundation editor of the {\it Journal of Universal
Computer Science}, and has chaired several conferences.}
\correspond{Li Shen, Department of Computer Science, Dartmouth
College, Hanover, NH 03755, USA. Email: li@cs.dartmouth.edu}
\label{lastpage}
\end{document}