sql - Matching partial words in two different columns -


i working on trying weed out customer our database. i've noticed trend people fill out first name same name partial how fill out company name. example like:

business_name               first_name -------------               ---------- locksmith taylorsville      locksmith  locksmith roy               locksmi  locksmith clinton           locks  locksmith farmington        locksmith 

these people not want being pulled in query. bad eggs. i'm trying put query statement (presumably) isolates has first name contains @ least partial match business name, i'm stumped , use help.

you can employ similarity based approach
try code @ bottom of answer
produces result below

business_name           partial_business_name   first_name  similarity    locksmith taylorsville  locksmith               locksmith   1.0   locksmith farmington    locksmith               locksmith   1.0   locksmith roy           locksmith               locksmi     0.7777777777777778    locksmith clinton       locksmith               locks       0.5555555555555556    

so, able control filter out based on similarity value

** code **

select business_name, partial_business_name, first_name, similarity  js( // input table (   select business_name, regexp_extract(business_name, r'^(\w+)') partial_business_name, first_name first_name      (select 'locksmith taylorsville' business_name, 'locksmith' first_name),     (select 'locksmith roy' business_name, 'locksmi' first_name),     (select 'locksmith clinton' business_name, 'locks' first_name),     (select 'locksmith farmington' business_name, 'locksmith' first_name), ) , // input columns business_name, partial_business_name, first_name, // output schema "[{name: 'business_name', type:'string'},   {name: 'partial_business_name', type:'string'},   {name: 'first_name', type:'string'},   {name: 'similarity', type:'float'}] ", // function "function(r, emit) {    var _extend = function(dst) {     var sources = array.prototype.slice.call(arguments, 1);     (var i=0; i<sources.length; ++i) {       var src = sources[i];       (var p in src) {         if (src.hasownproperty(p)) dst[p] = src[p];       }     }     return dst;   };    var levenshtein = {     /**      * calculate levenshtein distance of 2 strings.      *      * @param str1 string first string.      * @param str2 string second string.      * @return integer levenshtein distance (0 , above).      */     get: function(str1, str2) {       // base cases       if (str1 === str2) return 0;       if (str1.length === 0) return str2.length;       if (str2.length === 0) return str1.length;        // 2 rows       var prevrow  = new array(str2.length + 1),           curcol, nextcol, i, j, tmp;        // initialise previous row       (i=0; i<prevrow.length; ++i) {         prevrow[i] = i;       }        // calculate current row distance previous row       (i=0; i<str1.length; ++i) {         nextcol = + 1;          (j=0; j<str2.length; ++j) {           curcol = nextcol;            // substution           nextcol = prevrow[j] + ( (str1.charat(i) === str2.charat(j)) ? 0 : 1 );           // insertion           tmp = curcol + 1;           if (nextcol > tmp) {             nextcol = tmp;           }           // deletion           tmp = prevrow[j + 1] + 1;           if (nextcol > tmp) {             nextcol = tmp;           }            // copy current col value previous (in preparation next iteration)           prevrow[j] = curcol;         }          // copy last col value previous (in preparation next iteration)         prevrow[j] = nextcol;       }        return nextcol;     }    };    var the_partial_business_name;    try {     the_partial_business_name = decodeuri(r.partial_business_name).tolowercase();   } catch (ex) {     the_partial_business_name = r.partial_business_name.tolowercase();   }    try {     the_first_name = decodeuri(r.first_name).tolowercase();   } catch (ex) {     the_first_name = r.first_name.tolowercase();   }    emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,         similarity: 1 - levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});    }" ) order similarity desc 

was used in how perform trigram operations in google bigquery? , based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js @thomaspark levenshtein's distance used measure similarity


Comments