Searching for words in Shift_JIS text

Overview

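Search for a Japanese word in a line of Shift_JIS data. This lets you search without first converting the Japanese text to UTF-8 or EUC-JP. The search word has to be escaped first, because the second byte of some Shift_JIS characters has the same value as a character that is reserved in regular expressions, such as [ or ^.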

Flow

  1. Unpack the search word into %XX hex codes
  2. Escape the characters that are reserved in regular expressions
  3. Pack the escaped search word back into bytes
  4. Search for the word with a pattern match

Sample code

 # Search word
 my $search_word = 'パターン';
 my $search_word_org = $search_word;
 
 # String to be searched
 my $string = '検索される文字列とエスケープ処理された検索文字列をパターンマッチする処理する';
 
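 # Unpack: replace every non-word byte with its %XX hex code
 $search_word =~ s/([\W])/sprintf("%%%02X", ord($1))/eg;
 
 # Escape: put %5c (back-slash) before each reserved character
 $search_word =~ s/%5[BCDE]/%5c$&/gi;    # [ \ ] ^
 $search_word =~ s/%2[489B]/%5c$&/gi;    # $ ( ) +
 $search_word =~ s/%3F/%5c$&/gi;         # ?
 $search_word =~ s/%7[BCD]/%5c$&/gi;     # { | }
 # . and * are already %2E and %2A here, so match their hex codes
 $search_word =~ s/%2[AE]/%5c$&/gi;      # * .
 
 # Pack: turn every %XX back into the byte it stands for
 $search_word =~ s/%([A-Fa-f0-9][A-Fa-f0-9])/pack("C", hex($1))/eg;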
 
 my $hit;
 ("$string" =~ /$search_word/) && ($hit = 1);
 
 if ($hit) {
   print "$search_word_org was found.";
 } else {
   print "$search_word_org was not found.";
 }

Description of the code

 my $search_word = 'パターン';
 my $search_word_org = $search_word;

Put the search word into a variable. Keep a copy of the original search word in $search_word_org so that it can be used when displaying the result.

 my $string = '検索される文字列とエスケープ処理された検索文字列をパターンマッチする処理する';

This is the sentence to be searched.

 $search_word =~ s/([\W])/sprintf("%%%02X", ord($1))/eg;

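Unpack the search word: every non-word byte is replaced with its %XX hex code, while word characters such as p are left as they are. At this point ``パターン'' looks as follows.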

 %83p%83%5E%81%5B%83%93
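
The unpack step can be tried on its own with the following minimal sketch. It assumes the search word is given as raw Shift_JIS bytes; the bytes are written as \x escapes here (an assumption made for this example only) so that the result does not depend on the encoding in which the script is saved. The variable $encoded is a name introduced for illustration.

```perl
use strict;
use warnings;

# Shift_JIS bytes of the search word, written as escapes so that the
# example does not depend on the encoding of this source file.
my $search_word = "\x83\x70\x83\x5E\x81\x5B\x83\x93";

# Replace every non-word byte with %XX; word bytes such as 'p' (0x70) stay.
(my $encoded = $search_word) =~ s/(\W)/sprintf("%%%02X", ord($1))/eg;

print "$encoded\n";   # %83p%83%5E%81%5B%83%93
```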

 $search_word =~ s/%5[BCDE]/%5c$&/gi;
 $search_word =~ s/%2[489B]/%5c$&/gi;
 $search_word =~ s/%3F/%5c$&/gi;
 $search_word =~ s/%7[BCD]/%5c$&/gi;
 $search_word =~ s/%2[AE]/%5c$&/gi;

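Escape the characters that are reserved in regular expressions. This step inserts a back-slash (%5c) right before each reserved character. Because the word is still in its unpacked form, the reserved characters are matched by their hex codes. Some examples are shown below.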

 \=%5C, (=%28, )=%29, [=%5B, ]=%5D, |=%7C
 ?=%3F, +=%2B, ^=%5E, $=%24, {=%7B, }=%7D
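
The codes in this list can be reproduced with a short sketch like the one below, which applies the same sprintf format as the unpack step to each reserved character. The names @meta and %code are introduced for this example only. The list also includes . and *, which the unpack step turns into hex codes in the same way.

```perl
use strict;
use warnings;

# Characters reserved in regular expressions.
# Inside single quotes, '\\' is a single back-slash.
my @meta = split //, '\\()[]|?+^${}.*';

# Map each character to its %XX code, the same way the unpack step does.
my %code = map { $_ => sprintf("%%%02X", ord($_)) } @meta;

print "$_=$code{$_}\n" for @meta;
```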

After escaping, ``パターン'' looks as follows.

 %83p%83%5c%5E%81%5c%5B%83%93

 $search_word =~ s/%([A-Fa-f0-9][A-Fa-f0-9])/pack("C", hex($1))/eg;

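Pack the escaped word back into bytes. Every %XX code becomes the single byte it stands for, including the %5c codes that were inserted as back-slashes.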
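
To see what the packed pattern contains, the following sketch starts from the escaped string shown above and prints the resulting bytes in hex. The variable names $pattern and $hex are introduced for this example only.

```perl
use strict;
use warnings;

# Escaped, still percent-encoded search word from the previous step.
my $pattern = '%83p%83%5c%5E%81%5c%5B%83%93';

# Turn every %XX back into the byte it stands for.
$pattern =~ s/%([A-Fa-f0-9][A-Fa-f0-9])/pack("C", hex($1))/eg;

# Print the bytes in hex so that the inserted back-slashes (5c) are visible.
my $hex = join ' ', map { sprintf('%02x', ord($_)) } split //, $pattern;
print "$hex\n";   # 83 70 83 5c 5e 81 5c 5b 83 93
```

The two 5c bytes sit right before 5e (^) and 5b ([), so the regular expression engine treats those bytes as literal characters.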

 my $hit;
 ("$string" =~ /$search_word/) && ($hit = 1);

When the search word matches the sentence, $hit is set to 1.

 if ($hit) {
   print "$search_word_org was found.";
 } else {
   print "$search_word_org was not found.";
 }

Display the result based on the value of $hit. $search_word_org is used here.
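
As a point of comparison, Perl's built-in \Q...\E quoting (the quotemeta function) puts a back-slash before every non-word byte, so it may be usable in place of the unpack, escape and pack steps. The sketch below is an alternative and not the method of the sample code. The helper contains_literal is hypothetical, and the Shift_JIS bytes are written as \x escapes so that the example does not depend on the encoding of the file.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the sample code above): returns 1 if
# the byte string $word occurs literally in $text, otherwise 0.
sub contains_literal {
    my ($text, $word) = @_;
    # \Q...\E applies quotemeta, so bytes such as 0x5B ([) and 0x5E (^)
    # inside Shift_JIS characters lose their special meaning.
    return $text =~ /\Q$word\E/ ? 1 : 0;
}

my $word = "\x83\x70\x83\x5E\x81\x5B\x83\x93";         # search word as Shift_JIS bytes
my $text = "\x8C\x9F\x8D\xF5" . $word . "\x82\xF0";    # example text containing the word

print contains_literal($text, $word) ? "found\n" : "not found\n";
```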