PreJoin: An efficient Trie-based string similarity join algorithm

Karam Gouda; Metwally Rashad

PreJoin: An efficient Trie-based string similarity join algorithm

Karam Gouda
, Metwally Rashad

Benha University

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Scopus citations

Abstract

A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.

Original language	English
Title of host publication	2012 8th International Conference on Informatics and Systems, INFOS 2012
Pages	DE37-DE43
State	Published - 2012
Externally published	Yes
Event	2012 8th International Conference on Informatics and Systems, INFOS 2012 - Cairo, Egypt Duration: 14 May 2012 → 16 May 2012

Publication series

Name	2012 8th International Conference on Informatics and Systems, INFOS 2012

Conference

Conference	2012 8th International Conference on Informatics and Systems, INFOS 2012
Country/Territory	Egypt
City	Cairo
Period	14/05/12 → 16/05/12

Cite this

@inproceedings{4545305296bb4ffdb0fda7ef99055c3c,

title = "PreJoin: An efficient Trie-based string similarity join algorithm",

abstract = "A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.",

author = "Karam Gouda and Metwally Rashad",

year = "2012",

language = "English",

isbn = "9789774035067",

series = "2012 8th International Conference on Informatics and Systems, INFOS 2012",

pages = "DE37--DE43",

booktitle = "2012 8th International Conference on Informatics and Systems, INFOS 2012",

note = "2012 8th International Conference on Informatics and Systems, INFOS 2012 ; Conference date: 14-05-2012 Through 16-05-2012",

}

PreJoin: An efficient Trie-based string similarity join algorithm. / Gouda, Karam; Rashad, Metwally.
2012 8th International Conference on Informatics and Systems, INFOS 2012. 2012. p. DE37-DE43 6236512 (2012 8th International Conference on Informatics and Systems, INFOS 2012).

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - PreJoin

T2 - 2012 8th International Conference on Informatics and Systems, INFOS 2012

AU - Gouda, Karam

AU - Rashad, Metwally

PY - 2012

Y1 - 2012

N2 - A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.

AB - A string similarity join finds all similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Recently, a Trie-based similarity Join framework is proposed [3]. Existing Trie-based Join algorithms have shown that Trie Indexing is more suitable for Similarity Join on short strings. The main problem with current approaches is that they generate and maintain lots of candidate prefixes called active nodes which need to be further removed. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new Trie-based Join algorithm called PreJoin, which improves over current Trie-based Join methods. It efficiently finds all similar string pairs using a new active-node set generation method, and a dynamic preorder traversal of the Trie index. Experiments show that PreJoin is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art Trie-based Join approaches by a factor five.

UR - https://www.scopus.com/pages/publications/84864832745

M3 - Conference contribution

AN - SCOPUS:84864832745

SN - 9789774035067

T3 - 2012 8th International Conference on Informatics and Systems, INFOS 2012

SP - DE37-DE43

BT - 2012 8th International Conference on Informatics and Systems, INFOS 2012

Y2 - 14 May 2012 through 16 May 2012

ER -

PreJoin: An efficient Trie-based string similarity join algorithm

Abstract

Publication series

Conference

Other files and links

Fingerprint

Cite this