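Regular Expression 2024 Case Study

You only regret how little you have read when the time comes to use it

Preface

A post left over from my 2024 backlog.

Regular Expression

They come in handy now and then, but as the saying goes, I only look them up when I need them and forget them as soon as I am done.

It is Dolphin's old habit acting up: I always forget regular expressions, discover how useful they are when I need them, and go read up on them all over again.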

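Online tools

Regular expressions are often abbreviated to regex or regexp. Like SQL, they share a broad common framework, but the details differ slightly from one programming language to another.

Although IEEE has laid down a standard, not every programming language or organization follows it.

ISO/IEC/IEEE 9945:2009

https://www.iso.org/standard/50516.html

Having to change code, compile, and run for every test is far too much trouble, especially in large systems.

The round trips cost a lot of time, so today I am introducing two tools.

https://regex-generator.olafneumann.org/

Its colour-coded interface looks a lot like LEGO and is easy to read.

https://regexr.com/

What I like most about this site is that it explains your regexp piece by piece, in order.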

Quick reference / cheat sheet

(?: ... ) indicates that the subpattern is a non-capturing subpattern.

Example: /^(?:\w+\s)*(\w+)$/

That means whatever is matched by (?:\w+\s), even though it is enclosed in (), will not appear in the list of captured groups; only (\w+) will.
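A minimal Python sketch of the difference, using the standard re module. The sample sentence is made up for illustration; the second pattern is the same expression with an ordinary capturing group, added here only for comparison.

```python
import re

text = "hello big world"  # made-up sample input

# Non-capturing group: only (\w+) shows up in the captured groups.
non_capturing = re.match(r"^(?:\w+\s)*(\w+)$", text)
print(non_capturing.groups())

# Ordinary capturing group: the repeated group is reported as well
# (Python keeps the last repetition of a repeated group).
capturing = re.match(r"^(\w+\s)*(\w+)$", text)
print(capturing.groups())
```

Both patterns match the same text; the only difference is what ends up in the list of groups.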

\s+

one-or-more of any whitespace character

\s?

zero-or-one of any whitespace character

\s*

zero-or-more of any whitespace character
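A small Python sketch of the three whitespace quantifiers, assuming the standard re module; the sample strings are invented for illustration.

```python
import re

# \s+ : collapse every run of one or more whitespace characters into one space
collapsed = re.sub(r"\s+", " ", "a \t\n b")
print(collapsed)

# \s? : at most one whitespace character is allowed between a and b
print(bool(re.fullmatch(r"a\s?b", "ab")))    # no whitespace
print(bool(re.fullmatch(r"a\s?b", "a  b")))  # two spaces is one too many

# \s* : any amount of whitespace, including none
print(bool(re.fullmatch(r"a\s*b", "ab")))
print(bool(re.fullmatch(r"a\s*b", "a   b")))
```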

[0-9]{1,6}

one to six digits

[] Square brackets ( [ ] ) are characters that have a special meaning of their own, so you can call them "metacharacters". They define a character set, which is the set of characters you want to match at that position.

- A hyphen between two characters inside square brackets denotes a range.

[a-z] any lowercase letter

[A-Z] any uppercase letter

[a-zA-Z] any letter

[a-zA-Z0-9] alphanumeric (letters + digits)

{} The curly brackets are used to match exactly n instances of the preceding character or pattern. For example, "/x{2}/" matches "xx".

{min,} >= min

{min,max} from min to max occurrences are allowed (many engines require that there be no space after the comma)
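To make the character set and the curly brackets concrete, here is a short Python sketch. The helper name is_one_to_six_digits is made up for this example.

```python
import re

ONE_TO_SIX_DIGITS = re.compile(r"[0-9]{1,6}")

def is_one_to_six_digits(text):
    """Return True when the whole text is 1 to 6 digits (hypothetical helper)."""
    return ONE_TO_SIX_DIGITS.fullmatch(text) is not None

for sample in ["7", "123456", "1234567", "12a", ""]:
    print(repr(sample), is_one_to_six_digits(sample))

# {2} means exactly two, {2,} means two or more
print(bool(re.fullmatch(r"x{2}", "xx")), bool(re.fullmatch(r"x{2}", "xxx")))
print(bool(re.fullmatch(r"x{2,}", "xxxx")))
```

fullmatch is used so that the whole string has to fit the pattern, not just a part of it.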

A quick review

Occurrence Indicators (or Repetition Operators):

+: one or more (1+), e.g., [0-9]+ matches one or more digits such as '123', '000'.

*: zero or more (0+), e.g., [0-9]* matches zero or more digits. It accepts all those in [0-9]+ plus the empty string.

?: zero or one (optional), e.g., [+-]? matches an optional "+", "-", or an empty string.

{m,n}: m to n (both inclusive)

{m}: exactly m times

{m,}: m or more (m+)
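The occurrence indicators above can be combined. As a rough Python sketch, [+-]?[0-9]+ describes an integer with an optional sign; the sample values are made up.

```python
import re

SIGNED_INTEGER = re.compile(r"[+-]?[0-9]+")  # optional sign, then one or more digits
DIGITS_OR_EMPTY = re.compile(r"[0-9]*")      # zero or more digits

def is_signed_integer(text):
    return SIGNED_INTEGER.fullmatch(text) is not None

for sample in ["123", "000", "-42", "+7", "+-1", ""]:
    print(repr(sample), is_signed_integer(sample))

# [0-9]* accepts everything [0-9]+ accepts, plus the empty string
print(DIGITS_OR_EMPTY.fullmatch("") is not None)
```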

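Start and end

^ starts with ...

$ ends with ...

With [0-9], e.g. Abc4 matches.

With ^[0-9], e.g. Abc4 does not match; only 4Abc does.

So in general ^ and $ are very common. When text has to fit a specific length and format, ^, $, and {} are used together.

Another common approach skips ^ and $: the text field checks the length first, and only text of the right length is handed to the regex.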
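Both approaches can be sketched in Python. The 4-digit PIN format and the helper names below are assumptions invented for this example, not something from a real system.

```python
import re

# Without ^ the digit may appear anywhere; with ^ it must come first.
print(bool(re.search(r"[0-9]", "Abc4")))
print(bool(re.search(r"^[0-9]", "Abc4")))
print(bool(re.search(r"^[0-9]", "4Abc")))

# Approach 1: ^, $ and {} together fix both the format and the length.
PIN_ANCHORED = re.compile(r"^[0-9]{4}$")

def is_pin_anchored(text):
    return PIN_ANCHORED.search(text) is not None

# Approach 2: check the length first, then run an unanchored regex.
def is_pin_length_first(text):
    return len(text) == 4 and re.search(r"[0-9]{4}", text) is not None

for sample in ["1234", "12345", "12a4"]:
    print(sample, is_pin_anchored(sample), is_pin_length_first(sample))
```

For these samples both helpers agree; the second one simply moves the length rule out of the regexp.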

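Shorthand syntax

\w which matches all word characters (a-z, A-Z, 0-9, and _)

\D which matches all non-digit characters.

\d which matches all digit characters.

\W which matches all non-word characters

\s which matches all whitespace characters, including space, tab, and return

So [a-zA-Z0-9_] can be shortened to \w (note the underscore: \w is not the same as [a-zA-Z0-9]).

These are called syntactic sugar.

Many programming languages have sugar too, and you may be using it without knowing.

Sugar does not change what the code can do, but it is more convenient for programmers: it saves typing and, even more, it saves reading and comprehension time.

Syntactic sugar makes code more concise and more readable.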
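A quick Python check that the shorthand and the long form behave the same. One assumption to flag: in Python 3, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is passed, so the flag is used here to make the comparison exact. The sample text is made up.

```python
import re

sample = "user_01 paid $25.50"  # made-up sample text

# With re.ASCII, \w is exactly [a-zA-Z0-9_] and \d is exactly [0-9].
words_short = re.findall(r"\w+", sample, flags=re.ASCII)
words_long = re.findall(r"[a-zA-Z0-9_]+", sample)
digits_short = re.findall(r"\d+", sample, flags=re.ASCII)
digits_long = re.findall(r"[0-9]+", sample)

print(words_short, words_long)
print(digits_short, digits_long)
```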

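\ Backslash

Forces the character that follows to be matched literally.

e.g.

RegExr: \E[0-9]

Or, in some languages, E[0-9] is accepted directly. Escaping an ordinary letter such as E is unnecessary, and some engines reject \E or give it a different meaning.

The backslash is mainly needed for the shorthand syntax, or for characters that carry a special operator meaning, such as the ? and * mentioned above.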

Regex uses backslash (\) for two purposes:

- for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word).

- to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ for \ in regex to avoid ambiguity.

- Regex also recognizes \n for newline, \t for tab, etc.

If the text you want to match contains a character that is also part of the regex syntax

e.g.

+D9

+D1

RegExr: \+D[0-9]
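The same escape in Python, as a minimal sketch. re.escape is a standard-library function that adds the backslashes for you when the literal part comes from a variable.

```python
import re

# The backslash makes + a literal plus sign instead of "one or more".
PLUS_D_DIGIT = re.compile(r"\+D[0-9]")
print(bool(PLUS_D_DIGIT.fullmatch("+D9")), bool(PLUS_D_DIGIT.fullmatch("+D1")))
print(bool(PLUS_D_DIGIT.fullmatch("DD9")))

# re.escape produces the escaped form of a literal string.
escaped_prefix = re.escape("+D")
print(escaped_prefix)
built = re.compile(escaped_prefix + r"[0-9]")
print(bool(built.fullmatch("+D9")))
```

Without the backslash, Python refuses to compile +D[0-9] at all, because a leading + has nothing to repeat.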

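Today's exercise

In principle, once an accounting payroll period has been posted, the books can no longer be touched, but reality always has exceptions.

For example, an HR system may only accept monetary transactions (salary payments) dated today or in the future.

In practice, though, data sometimes has to be entered after the cut-off. Perhaps an employee took unpaid leave and emailed the boss, who approved it but never passed it on to HR.

So last month's pay was too high, and this month's pay has to be reduced as an adjustment.

Or take outreach social workers: they have to report their working dates and hours to their supervisor, because they are paid by the hour.

Rosters and shift arrangements always run into unexpected events or last-minute changes, and the person responsible for updating the system may not be able to do it in time.

Also, many private companies do not pay at the end of the month or on the 1st of the next month; payday may be the 25th.

Salaries have gone out on the 25th, but the schedule for the 26th to the 30th changes. Whether the hours went down or up, the pay adjustment can only be made next month.

Whether past schedule entries may be entered depends on the permissions of the user account.

For example:

userid: GAAAAAA002 may enter changes for the past 14 days

userid: GAAAAAA003 may enter changes for the past 21 days

userid: GAAAAAA004 may enter changes for the past 30 days

After all that,

[GAAAAAA002, 14], [GAAAAAA003, 21], [GAAAAAA004, 30]

is just a configuration string, one that users can change and system administrators can change.

The worry is wrong input: human mistakes and syntax errors.

To simplify the problem, first enter a single pattern into the tool.

Its strength is that it suggests how each part of the regexp should be broken down and written, as follows.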

Before anything is selected, the regexp is \[GAAAAAAAAA,14\]

For the part highlighted in yellow (the text):

Selecting "multiple characters" gives \[[A-Za-z]+,14\]

Selecting "alphanumeric characters" gives \[[A-Za-z0-9]+,14\]

Next, handle the blue part, the number 14:

"exact number (14)" gives \[[A-Za-z0-9]+,14\]

"number" gives \[[A-Za-z0-9]+,[0-9]+\]

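"Floating point number (with optional exponent)" gives \[[A-Za-z0-9]+,([+-]?(?=\.\d|\d)(?:\d+)?(?:\.?\d*))(?:[Ee]([+-]?\d+))?\]

...

The complete regexp appears in section 4 of the page, and at the very bottom you can share it with others.

Section 5 generates source code for different programming languages.

The other tool is also good. Paste the regexp from above into the second tool

and you can start debugging.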

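Paste the regexp into the Expression field, e.g.

\[[A-Za-z0-9]{1,10},[0-9]{1,999}\]

Note that {1,999} allows 1 to 999 digits; it does not cap the value at 999.

In the Text tab, enter the text you want to test and see how much of it your regexp matches.

Switch to the Test tab and click Add Test.

To make things a little fancier: some people may use spaces to separate the values, which also looks nicer.

You can type them in the Text tab,

or keep clicking Add Test until all of the following test cases have been entered: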

[GAAAAAA002,14]

[GAAAAAA002, 14]

[GAAAAAA002, 14],[GAAAAAA002, 14]

[GAAAAAA002, 14], [GAAAAAA002, 14]

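Dolphin added 0 to N spaces at different positions; all of these are things a user might plausibly type.

Then comes trial and error: keep experimenting until the regexp matches 0 to N spaces.

\[[A-Za-z0-9]{1,10},\s*[0-9]{1,999}\]

Good, every test now passes.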
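The same four test cases can be replayed outside the tool. This is only a sketch in Python using the regexp from above, and it counts how many bracket groups are found in each case.

```python
import re

PATTERN = re.compile(r"\[[A-Za-z0-9]{1,10},\s*[0-9]{1,999}\]")
PATTERN_NO_SPACES = re.compile(r"\[[A-Za-z0-9]{1,10},[0-9]{1,999}\]")  # earlier attempt

test_cases = [
    "[GAAAAAA002,14]",
    "[GAAAAAA002, 14]",
    "[GAAAAAA002, 14],[GAAAAAA002, 14]",
    "[GAAAAAA002, 14], [GAAAAAA002, 14]",
]

match_counts = [len(PATTERN.findall(case)) for case in test_cases]
old_counts = [len(PATTERN_NO_SPACES.findall(case)) for case in test_cases]

for case, count, old in zip(test_cases, match_counts, old_counts):
    print(f"{case!r}: {count} group(s) with \\s*, {old} without")
```

The earlier pattern without \s* only finds the group in the first case, which is why the extra \s* was needed.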

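You may well ask: the problem was simplified at the start, so the commas and spaces between the [] groups are still not handled.

In C#, Dolphin would take the easy road over the hard one: split, then trim, then run the regex on each piece.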
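The split, trim, then regex idea could look roughly like this. It is sketched in Python rather than C# to stay consistent with the other examples in this post; parse_config is a made-up name and the return shape (a dict of userid to days) is an assumption.

```python
import re

ITEM = re.compile(r"\[([A-Za-z0-9]{1,10}),\s*([0-9]{1,999})\]")

def parse_config(text):
    """Split on ']', trim each piece, then validate it with the regex (hypothetical helper)."""
    rules = {}
    for piece in text.split("]"):
        # Trim spaces and the comma that separated this group from the previous one.
        # Note: lstrip(",") is lenient and would also swallow repeated commas.
        piece = piece.strip().lstrip(",").strip()
        if not piece:
            continue
        match = ITEM.fullmatch(piece + "]")
        if match is None:
            raise ValueError(f"bad config entry: {piece!r}")
        rules[match.group(1)] = int(match.group(2))
    return rules

rules = parse_config("[GAAAAAA002, 14], [GAAAAAA003, 21], [GAAAAAA004, 30]")
print(rules)
```

Each piece is checked on its own, so an error message can point at the exact entry that was typed wrongly.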

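I have no idea why the genius who came before me insisted on checking this with a regexp, never mind whether that is even appropriate.

But the task was mine now, so there was nothing for it but to push on.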
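If the check really has to be one regexp, one possible shape is "one group, then any number of comma-separated groups". This is only a sketch under that assumption, not the expression that was finally used; the names ITEM, CONFIG, and is_valid_config are invented here.

```python
import re

# One bracket group, as tested earlier
ITEM = r"\[[A-Za-z0-9]{1,10},\s*[0-9]{1,999}\]"

# A whole configuration string: a group, then zero or more ", group" repetitions,
# with optional whitespace around the separating comma and at both ends.
CONFIG = re.compile(rf"\s*{ITEM}(?:\s*,\s*{ITEM})*\s*")

def is_valid_config(text):
    return CONFIG.fullmatch(text) is not None

samples = [
    "[GAAAAAA002, 14], [GAAAAAA003, 21], [GAAAAAA004, 30]",
    "[GAAAAAA002, 14],[GAAAAAA002, 14]",
    "[GAAAAAA002, 14] [GAAAAAA002, 14]",  # comma between groups is missing
    "[GAAAAAA002 14]",                    # comma inside the group is missing
]
for sample in samples:
    print(is_valid_config(sample), repr(sample))
```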

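After endless attempts, it suddenly occurred to me to ask ChatGPT.

It turns out there was no need to teach myself at all; everyone can just ask AI. Fine.

Please forget everything you learned today.

Looks like the dolphin that fell into the doghouse is about to lose its job.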

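Other uses

The most common uses are identity document numbers, phone numbers, postal codes, and ID card numbers.

Hong Kong identity card number (HKID)

A123456 (3)

Many websites with better UX use two text boxes:

one for A123456 and another for 3, so the () never has to be typed.

But let's keep it simple: handle it with a single text box, ignore the () and spaces, and write a suitable regexp.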

e.g

A123456(3)

[A-Z]{1}[0-9]{6,6}[1-9]?

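Then Dolphin tells you that there can be 1 or 2 letters at the front.

Many people do not know this. Quite a few Hong Kong websites and systems had this problem in their early days and would not accept two letters.

The value inside () can be 1 to 9, it can be A, and it can be 0.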

e.g

A123456(3)

WX123456(8)

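[A-Z]{1,2}[0-9]{6}[0-9A]?

(Inside square brackets | is a literal character rather than "or", so the last set is written [0-9A].)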
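A rough Python sketch of the single-textbox check. It strips the parentheses and spaces first, as the exercise allows, and writes the last set as [0-9A] because | inside square brackets is a literal character. It only checks the format, not whether the check digit is arithmetically correct, and looks_like_hkid is a made-up name.

```python
import re

HKID_FORMAT = re.compile(r"[A-Z]{1,2}[0-9]{6}[0-9A]?")

def looks_like_hkid(text):
    """Format check only: 1-2 letters, 6 digits, optional check character."""
    # Ignore parentheses and whitespace before matching
    compact = re.sub(r"[()\s]", "", text)
    return HKID_FORMAT.fullmatch(compact) is not None

for sample in ["A123456(3)", "A123456 (3)", "WX123456(8)", "A123456(A)",
               "ABC123456(3)", "A12345(3)x"]:
    print(repr(sample), looks_like_hkid(sample))
```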

Final answer

There is bound to be a perfectionist out there, and this one is a regexp perfectionist.

Dolphin is a little scared.

Reference

Regex Generator - Creating regex is easy again!
https://regex-generator.olafneumann.org/?sampleText=%5BGAAAAAAAAA%2C%2014%5D&flags=i

RegExr: Learn, Build, & Test RegEx
https://regexr.com/

香港財務專業協會 - News
https://www.hkpla.org/index.php?r=announcement/detail&id=15

Regular Expression (Regex) Tutorial
https://www3.ntu.edu.sg/home/ehchua/programming/howto/Regexe.html#zz-1.4

Nijyuu AmineEX


07/12/2025
