DBA Hub

📋Steps in this guide1/4

FUZZY_MATCH and PHONIC_ENCODE Data Quality Operators in Oracle Database 23ai/26ai

Oracle database 23ai/26ai introduced the FUZZY_MATCH and PHONIC_ENCODE data quality operators to perform fuzzy string matching.

oracle 23configurationintermediate
by OracleDba
13 views
1

Setup

The examples in this article require the following test table. Notice the and values have varying degrees of similarity including exact matches, near matches and completely different values.

Code/Command (click line numbers to comment):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
drop table if exists match_tab;

create table match_tab (
  id    number,
  col1  varchar2(15),
  col2  varchar2(15),
  constraint match_tab_pk primary  key (id)
);

insert into match_tab values
  (1, 'Peter Parker', 'Pete Parker'),
  (2, 'Peter Parker', 'peter parker'),
  (3, 'Clark Kent', 'Claire Kent'),
  (4, 'Wonder Woman', 'Ponder Woman'),
  (5, 'Superman', 'Superman'),
  (6, 'The Hulk', 'Iron Man');
commit;
2

FUZZY_MATCH

The operator is language-neutral. It determines the similarity between two strings and supports several algorithms listed here . By default the output is a percentage similarity, but the keyword can be added to return the raw value. If you have used the package, the scaled verses unscaled values represent the following calls. By default the output is scaled by the length of the longer input string. The keyword alters this to scale to the shorter input string. The keyword can be used with the algorithm. The tolerance is a percentage of characters in a word that can be different, while still considering it to be the same word.

Code/Command (click line numbers to comment):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
set linesize 100
column col1 format a12
column col2 format a12

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2) as jaro_winkler,
       fuzzy_match(bigram, col1, col2) as bigram,
       fuzzy_match(trigram, col1, col2) as trigram,
       fuzzy_match(whole_word_match, col1, col2) as wwm,
       fuzzy_match(longest_common_substring, col1, col2) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker           92           92         90         70         50         58
Peter Parker peter parker          84           88         72         60          0         41
Clark Kent   Claire Kent           82           90         60         44         50         45
Wonder Woman Ponder Woman          92           94        100         90         50         91
Superman     Superman             100          100        100        100        100        100
The Hulk     Iron Man               0           41          0          0          0         12

6 rows selected.

SQL>

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2, unscaled) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2, unscaled) as jaro_winkler,
       fuzzy_match(bigram, col1, col2, unscaled) as bigram,
       fuzzy_match(trigram, col1, col2, unscaled) as trigram,
       fuzzy_match(whole_word_match, col1, col2, unscaled) as wwm,
       fuzzy_match(longest_common_substring, col1, col2, unscaled) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker            1          .92         10          7          1          7
Peter Parker peter parker           2          .88          8          6          0          5
Clark Kent   Claire Kent            2           .9          6          4          1          5
Wonder Woman Ponder Woman           1          .94         11          9          1         11
Superman     Superman               0            1          7          6          1          8
The Hulk     Iron Man               8          .41          0          0          0          1

6 rows selected.

SQL>

select col1,
       col2,
       fuzzy_match(levenshtein, col1, col2, relate_to_shorter) as levenshtein,
       fuzzy_match(jaro_winkler, col1, col2, relate_to_shorter) as jaro_winkler,
       fuzzy_match(bigram, col1, col2, relate_to_shorter) as bigram,
       fuzzy_match(trigram, col1, col2, relate_to_shorter) as trigram,
       fuzzy_match(whole_word_match, col1, col2, relate_to_shorter) as wwm,
       fuzzy_match(longest_common_substring, col1, col2, relate_to_shorter) as lcs
from   match_tab;

COL1         COL2         LEVENSHTEIN JARO_WINKLER     BIGRAM    TRIGRAM        WWM        LCS
------------ ------------ ----------- ------------ ---------- ---------- ---------- ----------
Peter Parker Pete Parker           91           92        100         77         50         63
Peter Parker peter parker          84           88         72         60          0         41
Clark Kent   Claire Kent           80           90         66         50         50         50
Wonder Woman Ponder Woman          92           94        100         90         50         91
Superman     Superman             100          100        100        100        100        100
The Hulk     Iron Man               0           41          0          0          0         12

6 rows selected.

SQL>

select col1,
       col2,
       fuzzy_match(whole_word_match, col1, col2) as wwm,
       fuzzy_match(whole_word_match, col1, col2, edit_tolerance 20) as wwm20,
       fuzzy_match(whole_word_match, col1, col2, edit_tolerance 82) as wwm82
from   match_tab;

COL1         COL2                WWM      WWM20      WWM82
------------ ------------ ---------- ---------- ----------
Peter Parker Pete Parker          50        100         50
Peter Parker peter parker          0        100         50
Clark Kent   Claire Kent          50        100         50
Wonder Woman Ponder Woman         50        100        100
Superman     Superman            100        100        100
The Hulk     Iron Man              0          0          0

6 rows selected.

SQL>
3

PHONIC_ENCODE

The operator converts text into language-specific codes based on the pronunciation of the text. It implements the Double Metaphone algorithm and an alternative algorithm. When using , if there is no alternative code the primary code is returned. The maximum length of the code is controlled by an optional third parameter, which accepts integer values from 1 to 12.

Code/Command (click line numbers to comment):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
set linesize 100
column col1 format a12
column col2 format a12
column col1_dm format a8
column col2_dm format a8
column col1_dma format a8
column col2_dma format a8

select col1,
       col2,
       phonic_encode(double_metaphone, col1) as col1_dm,
       phonic_encode(double_metaphone, col2) as col2_dm,
       phonic_encode(double_metaphone_alt, col1) as col1_dma,
       phonic_encode(double_metaphone_alt, col2) as col2_dma
from   match_tab;

COL1         COL2         COL1_DM  COL2_DM  COL1_DMA COL2_DMA
------------ ------------ -------- -------- -------- --------
Peter Parker Pete Parker  PTRP     PTPR     PTRP     PTPR
Peter Parker peter parker PTRP     PTRP     PTRP     PTRP
Clark Kent   Claire Kent  KLRK     KLRK     KLRK     KLRK
Wonder Woman Ponder Woman ANTR     PNTR     FNTR     PNTR
Superman     Superman     SPRM     SPRM     SPRM     SPRM
The Hulk     Iron Man     0LK      ARNM     TLK      ARNM

6 rows selected.

SQL>

column col1_dm1 format a9
column col2_dm1 format a9
column col1_dm6 format a9
column col2_dm6 format a9
column col1_dm12 format a9
column col2_dm12 format a9

select col1,
       col2,
       phonic_encode(double_metaphone, col1, 1) as col1_dm1,
       phonic_encode(double_metaphone, col2, 1) as col2_dm1,
       phonic_encode(double_metaphone, col1, 6) as col1_dm6,
       phonic_encode(double_metaphone, col2, 6) as col2_dm6,
       phonic_encode(double_metaphone, col1, 12) as col1_dm12,
       phonic_encode(double_metaphone, col2, 12) as col2_dm12
from   match_tab;

COL1         COL2         COL1_DM1  COL2_DM1  COL1_DM6  COL2_DM6  COL1_DM12 COL2_DM12
------------ ------------ --------- --------- --------- --------- --------- ---------
Peter Parker Pete Parker  P         P         PTRPRK    PTPRKR    PTRPRKR   PTPRKR
Peter Parker peter parker P         P         PTRPRK    PTRPRK    PTRPRKR   PTRPRKR
Clark Kent   Claire Kent  K         K         KLRKKN    KLRKNT    KLRKKNT   KLRKNT
Wonder Woman Ponder Woman A         P         ANTRMN    PNTRMN    ANTRMN    PNTRMN
Superman     Superman     S         S         SPRMN     SPRMN     SPRMN     SPRMN
The Hulk     Iron Man     0         A         0LK       ARNMN     0LK       ARNMN

6 rows selected.

SQL>
4

PL/SQL Support

In this release there is no direct support for the or operators in PL/SQL, so a direct assignment is not possible. We can use a to make the assignment. For more information see: - Data Quality Operators - Fuzzy String Matching - UTL_MATCH : String Matching by Testing Levels of Similarity/Difference Hope this helps. Regards Tim...

Code/Command (click line numbers to comment):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
declare
  l_output  number;
begin
  l_output := fuzzy_match(levenshtein, 'Peter Parker', 'peter parker');
end;
/
              *
ERROR at line 4:
ORA-06550: line 4, column 15:
PLS-00201: identifier 'FUZZY_MATCH' must be declared
ORA-06550: line 4, column 3:
PL/SQL: Statement ignored

SQL>


declare
  l_output  varchar2(10);
begin
  l_output := phonic_encode(double_metaphone, 'Peter Parker');
end;
/
              *
ERROR at line 4:
ORA-06550: line 4, column 15:
PLS-00201: identifier 'PHONIC_ENCODE' must be declared
ORA-06550: line 4, column 3:
PL/SQL: Statement ignored

SQL>

declare
  l_output  number;
begin
  select fuzzy_match(levenshtein, 'Peter Parker', 'peter parker')
  into   l_output;
end;
/

PL/SQL procedure successfully completed.

SQL>


declare
  l_output  varchar2(10);
begin
  select phonic_encode(double_metaphone, 'Peter Parker')
  into   l_output;
end;
/
PL/SQL procedure successfully completed.

SQL>

Comments (0)

Please to add comments

No comments yet. Be the first to comment!