i working on trying weed out customer our database. i've noticed trend people fill out first name same name partial how fill out company name. example like:
business_name first_name ------------- ---------- locksmith taylorsville locksmith locksmith roy locksmi locksmith clinton locks locksmith farmington locksmith
these people not want being pulled in query. bad eggs. i'm trying put query statement (presumably) isolates has first name contains @ least partial match business name, i'm stumped , use help.
you can employ similarity based approach
try code @ bottom of answer
produces result below
business_name partial_business_name first_name similarity locksmith taylorsville locksmith locksmith 1.0 locksmith farmington locksmith locksmith 1.0 locksmith roy locksmith locksmi 0.7777777777777778 locksmith clinton locksmith locks 0.5555555555555556
so, able control filter out based on similarity value
** code **
select business_name, partial_business_name, first_name, similarity js( // input table ( select business_name, regexp_extract(business_name, r'^(\w+)') partial_business_name, first_name first_name (select 'locksmith taylorsville' business_name, 'locksmith' first_name), (select 'locksmith roy' business_name, 'locksmi' first_name), (select 'locksmith clinton' business_name, 'locks' first_name), (select 'locksmith farmington' business_name, 'locksmith' first_name), ) , // input columns business_name, partial_business_name, first_name, // output schema "[{name: 'business_name', type:'string'}, {name: 'partial_business_name', type:'string'}, {name: 'first_name', type:'string'}, {name: 'similarity', type:'float'}] ", // function "function(r, emit) { var _extend = function(dst) { var sources = array.prototype.slice.call(arguments, 1); (var i=0; i<sources.length; ++i) { var src = sources[i]; (var p in src) { if (src.hasownproperty(p)) dst[p] = src[p]; } } return dst; }; var levenshtein = { /** * calculate levenshtein distance of 2 strings. * * @param str1 string first string. * @param str2 string second string. * @return integer levenshtein distance (0 , above). */ get: function(str1, str2) { // base cases if (str1 === str2) return 0; if (str1.length === 0) return str2.length; if (str2.length === 0) return str1.length; // 2 rows var prevrow = new array(str2.length + 1), curcol, nextcol, i, j, tmp; // initialise previous row (i=0; i<prevrow.length; ++i) { prevrow[i] = i; } // calculate current row distance previous row (i=0; i<str1.length; ++i) { nextcol = + 1; (j=0; j<str2.length; ++j) { curcol = nextcol; // substution nextcol = prevrow[j] + ( (str1.charat(i) === str2.charat(j)) ? 0 : 1 ); // insertion tmp = curcol + 1; if (nextcol > tmp) { nextcol = tmp; } // deletion tmp = prevrow[j + 1] + 1; if (nextcol > tmp) { nextcol = tmp; } // copy current col value previous (in preparation next iteration) prevrow[j] = curcol; } // copy last col value previous (in preparation next iteration) prevrow[j] = nextcol; } return nextcol; } }; var the_partial_business_name; try { the_partial_business_name = decodeuri(r.partial_business_name).tolowercase(); } catch (ex) { the_partial_business_name = r.partial_business_name.tolowercase(); } try { the_first_name = decodeuri(r.first_name).tolowercase(); } catch (ex) { the_first_name = r.first_name.tolowercase(); } emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name, similarity: 1 - levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length}); }" ) order similarity desc
was used in how perform trigram operations in google bigquery? , based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js @thomaspark levenshtein's distance used measure similarity
Comments
Post a Comment