Regular Expressions for Japanese in Lua

Developer https://www.devze.com 2023-02-05 13:43 Source: Internet

I want to process Japanese vocabulary in Lua (LuaTeX, to be specific). The vocabulary is stored in a text file that is read in. While reading the file, the words on each line should be matched by a regular expression (lines are written like: | がくせい | student |):

function readFile(fn)
   local file = assert(io.open(fn, "r"))
   local contents = file:read("*a")
   file:close()
   return contents
end

function processTest(contents)
   for line in contents:gmatch("%a+") do
      print(line)
   end
end

a = readFile("vocabulary.org")
processTest(a)

The problem now is that only the English words are printed:

student

I have to mention that I'm new to Lua and LuaTeX, so if there is a better approach to it I would be happy to know.

Anyway, is there any possibility to get the Japanese words?

You cannot use %a for this. It matches only a single byte, and which bytes count as letters depends on the locale (usually only bytes that encode a letter in ASCII or Latin-1).
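To make the byte-level behaviour concrete, here is a minimal sketch. It assumes a standalone Lua interpreter running in the default "C" locale; the variable name `ga` is only illustrative. The character が (U+304C) is written with decimal byte escapes so the example does not depend on how the source file is encoded.

```lua
-- が (U+304C) is three bytes in UTF-8: 227, 129, 140.
local ga = "\227\129\140"

-- The length operator counts bytes, not characters.
print(#ga)              --> 3

-- Each byte can be inspected individually.
print(ga:byte(1, -1))   --> 227  129  140

-- In the "C" locale none of these bytes is classified as a letter,
-- so %a finds nothing in the string.
print(ga:match("%a"))   --> nil
```

This is why the loop in the question skips the Japanese words entirely: every byte of a Hiragana character falls outside what %a accepts.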

To match UTF-8 encoded letters you would need to break them down into ranges of bytes, as in the example here.

For example, some patterns for UTF-8-encoded Hiragana might include:

(\227\129[\129-\191])
(\227\130[\128-\159])

A full list of patterns to match all Unicode letters (which would need to include hundreds of subranges) would be unwieldy.
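As a rough sketch of how the Hiragana byte ranges could be applied to a line from the question, the code below walks the text one UTF-8 character at a time and collects runs of Hiragana. The function names `isHiragana` and `hiraganaWords` are hypothetical helpers, not part of Lua or LuaTeX. The second range here ends at \159 (U+309F), the last code point of the Hiragana block. Lua patterns have no alternation operator, so the two ranges are tested separately.

```lua
-- Hiragana in UTF-8 (three bytes per character):
--   \227\129[\129-\191]  covers U+3041..U+307F
--   \227\130[\128-\159]  covers U+3080..U+309F
local function isHiragana(ch)
   return ch:match("^\227\129[\129-\191]$") ~= nil
       or ch:match("^\227\130[\128-\159]$") ~= nil
end

-- Return a list of the maximal runs of Hiragana found in `text`.
local function hiraganaWords(text)
   local words, current = {}, {}
   -- One UTF-8 character: a non-continuation byte followed by
   -- zero or more continuation bytes (\128-\191).
   for ch in text:gmatch("[^\128-\191][\128-\191]*") do
      if isHiragana(ch) then
         current[#current + 1] = ch
      elseif #current > 0 then
         -- A non-Hiragana character ends the current run.
         words[#words + 1] = table.concat(current)
         current = {}
      end
   end
   if #current > 0 then
      words[#words + 1] = table.concat(current)
   end
   return words
end

-- Usage with a line in the question's format
-- (assumes this source file is saved as UTF-8):
for _, w in ipairs(hiraganaWords("| がくせい | student |")) do
   print(w)   --> がくせい
end
```

Katakana and Kanji would need further byte ranges added in the same way. This sketch only covers the Hiragana case.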

I'm not a Lua guru, but I think you are probably out of luck. Lua doesn't consume Unicode files "natively," as it were. It just treats what it reads as a series of bytes and doesn't do any interpretation on it. In particular, your gmatch() call isn't likely to do what you want.

There was a big discussion about i18n on the mailing list recently here. This discussion here may also help.