TY - GEN
T1 - PreJoin
T2 - 2012 8th International Conference on Informatics and Systems, INFOS 2012
AU - Gouda, Karam
AU - Rashad, Metwally
PY - 2012
Y1 - 2012
N2 - A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.
AB - A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.
UR - https://www.scopus.com/pages/publications/84864832745
M3 - Conference contribution
AN - SCOPUS:84864832745
SN - 9789774035067
T3 - 2012 8th International Conference on Informatics and Systems, INFOS 2012
SP - DE37-DE43
BT - 2012 8th International Conference on Informatics and Systems, INFOS 2012
Y2 - 14 May 2012 through 16 May 2012
ER -