i have 2 data files. 1 having 1600 rows , other 1 having 2 million rows(tab delimited files). need vlookup between these 2 files. please see below example expected output , kindly let me know if it's possible. i've tried using awk, couldn't expected result.
file 1(small file)
bc1 10 100
bc2 20 200
bc3 30 300
file 2(large file)
bc1 xyz
bc2 abc
bc3 def
expected output:
bc1 10 100 xyz
bc2 20 200 abc
bc3 30 300 def
i tried join command. taking forever complete. please me find solution. thanks
this awk script scan line line each file, , try match number in bc column. once matched, print columns. if 1 of files not contain 1 of numbers, skipped in both files , search next one. loop until 1 of files ends. script accepts number of columns per file , number of files, long first column bc , number. this awk script assumes files ordered minor major number in bc column (like in example). otherwise not work.
to execute script, run command:
awk -f vlookup.awk smallfile bigfile
the vlookup.awk file have content:
begin {files=1;lines=0;maxlines=0;filelines[1]=0; #number of columns sod, prn, reference file col_bc=1; #initialize variables bc_now=0; new_bc=0; end_of_process=0; aux=""; text_result=""; } { if(filename!=argv[1])exit; no_bc=0; new_bc=0; #save number of columns nfields[1]=nf; #copy reference file data for(j=0;j<=nf;j++) { file[1,j]=$j; } #read lines file for(i=2;i<argc;i++) { ret=getline < argv[i]; if(ret==0) exit; #end of file reached #copy columns file variable for(j=0;j<=nf;j++) { file[i,j]=$j; } #save number of columns nfields[i]=nf; } #check files in same number for(i=1;i<argc;i++) { bc[i]=file[i,col_bc]; bc[i]=sub("bc","",file[i,col_bc]); if(bc[i]>bc_now) {bc_now=bc[i];new_bc=1;} } #one or more files have new number if (new_bc==1) { for(i=1;i<argc;i++) { while(bc_now!=file[i,col_bc]) { #read next line file if(i==1) ret=getline; #file 1 reference file else ret=getline < argv[i]; if(ret==0) exit; #end of file reached #copy columns file variable for(j=0;j<=nf;j++) { file[i,j]=$j; } #save number of columns nfields[i]=nf; #check if in current file data has gone next number if(file[i,col_bc]>bc_now) { no_bc=1; break; } #no more data lines compare, end of comparison if(filename!=argv[1]) { exit; } } #if number not in file, process realign must restarted next number available (exit loop) if (no_bc==1) {break;} } #if number not in file, process realign must restarted next number available (continue while loop) if (no_bc==1) {next;} } #number aligned for(i=1;i<argc;i++) { for(j=2;j<=nfields[i];j++) { #join colums in text_result variable aux=sprintf("%s %s",text_result,file[i,j]); text_result=sprintf("%s",aux); } } printf("bc%d%s\n",bc_now,text_result) #reset text variables aux=""; text_result=""; }
Comments
Post a Comment