[Python] 정규표현식(regular expression)

2024-08-16 8 분 소요

정규표현식이란

파이썬의 정규표현식은 문자열에서 패턴을 찾고, 패턴을 기준으로 문자열을 분리하거나, 특정 문자열을 대체하는 등 다양한 문자열 처리 작업을 수행하는 데 사용되는 표현식입니다. 파이썬에서는 regular expression을 줄인 re 모듈을 사용해서 이를 구현합니다.

기본 사용법

파이썬에서 정규표현식을 사용하기 위해서는 다음과 같이 모듈을 import해야 합니다.

import re   # python 기본 내장 함수이기 때문에 추가 설치 필요 X

메타 문자

메타 문자는 특별한 의미를 가지며, 특정 패턴을 표현하는 데 사용됩니다.

’ . ‘

개행 문자(‘\n’)를 제외한 모든 단일 문자와 매칭

import re

pattern = r'a.'

test_str1 = "ab"   # 매칭됨 ('a' 다음에 'b')
test_str2 = "a\n"  # 매칭되지 않음 ('a' 다음에 개행 문자)

match1 = re.match(pattern, test_str1)
match2 = re.match(pattern, test_str2)

print(match1.group() if match1 else "No match")  # output: ab
print(match2.group() if match2 else "No match")  # output: No match

’ ^ ‘

문자열의 시작과 매칭

import re

pattern = r'^Hello'
test_str = "Hello, world!"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: Hello

’ $ ‘

문자열의 끝과 매칭

import re

pattern = r'world!$'
test_str = "Hello, world!"
match = re.search(pattern, test_str)

print(match.group() if match else "No match")  # output: world!

’ * ‘

0회 이상 반복되는 패턴과 매칭

import re

pattern = r'ab*'
test_str = "a"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: a (b가 0회 나타난 경우에도 매칭)

’ + ‘

1회 이상 반복되는 패턴과 매칭

import re

pattern = r'ab+'
test_str = "abb"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: abb (b가 1회 이상 나타나야 매칭)

’ ? ‘

0회 또는 1회 나타나는 패턴과 매칭

import re

pattern = r'ab?'
test_str = "a"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: a (b가 0회 나타난 경우에도 매칭)

’ {n} ‘

정확히 n번 반복되는 패턴과 매칭

import re

pattern = r'ab{2}'
test_str = "abb"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: abb (b가 정확히 2번 나타난 경우에 매칭)

’ {n,} ‘

n번 이상 반복되는 패턴과 매칭

import re

pattern = r'ab{2,}'
test_str = "abbb"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: abbb (b가 2번 이상 나타난 경우에 매칭)

’ {n,m} ‘

n번 이상 m번 이하 반복되는 패턴과 매칭

import re

pattern = r'ab{2,3}'
test_str = "abbb"
match = re.match(pattern, test_str)

print(match.group() if match else "No match")  # output: abbb (b가 2번 이상 3번 이하 나타난 경우에 매칭)

’ [] ‘

대괄호 안에 있는 문자 중 하나와 매칭

import re

pattern = r'[aeiou]'
test_str = "python"
match = re.search(pattern, test_str)

print(match.group() if match else "No match")  # output: o (첫 번째 모음에 매칭)

’`|`’

OR 연산자로 사용되며, A 또는 B와 매칭

import re

pattern = r'cat|dog'
test_str = "I have a dog."
match = re.search(pattern, test_str)

print(match.group() if match else "No match")  # output: dog (cat 또는 dog 중 dog에 매칭)

’ () ‘

그룹화를 의미하며, 해당 부분을 하나의 그룹으로 캡처

import re

pattern = r'(Hello), (world)!'
test_str = "Hello, world!"
match = re.search(pattern, test_str)

if match:
    print(match.group(0))  # output: Hello, world! (전체 매칭)
    print(match.group(1))  # output: Hello (첫 번째 그룹)
    print(match.group(2))  # output: world (두 번째 그룹)

특수 시퀀스

특정 문자 클래스와 매칭하기 위해 특수 시퀀스를 사용할 수 있습니다.

‘\d’

모든 숫자와 매칭 (0-9)

import re

pattern = r'\d'
text = "There are 3 cats and 4 dogs."
matches = re.findall(pattern, text)

print(matches)  # output: ['3', '4']

‘\D’

숫자가 아닌 모든 문자와 매칭

import re

pattern = r'\D'
text = "123 Main St."
matches = re.findall(pattern, text)

print(matches)  # output: [' ', 'M', 'a', 'i', 'n', ' ', 'S', 't', '.']

‘\w’

알파벳 대소문자, 숫자, 밑줄과 매칭 (word character)

import re

pattern = r'\w'
text = "Hello, World!"
matches = re.findall(pattern, text)

print(matches)  # output: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']

‘\W’

단어 문자가 아닌 모든 문자와 매칭

import re

pattern = r'\W'
text = "Hello, World!"
matches = re.findall(pattern, text)

print(matches)  # output: [',', ' ', '!']

‘\s’

모든 공백 문자와 매칭 (스페이스, 탭, 개행(‘\n’) 등)

import re

pattern = r'\s'
text = "Hello, World!\nWelcome to Python."
matches = re.findall(pattern, text)

print(matches)  # output: [' ', '\n', ' ']

‘\S’

공백 문자가 아닌 모든 문자와 매칭

import re

pattern = r'\S'
text = "Hello, World!"
matches = re.findall(pattern, text)

print(matches)  # output: ['H', 'e', 'l', 'l', 'o', ',', 'W', 'o', 'r', 'l', 'd', '!']

‘\b’

단어의 경계와 매칭

import re

pattern = r'\bworld\b'
text = "Hello, world! Welcome to the new world."
matches = re.findall(pattern, text)

print(matches)  # output: ['world', 'world'] (정확히 'world'라는 단어에 매칭)

‘\B’

단어의 경계가 아닌 부분과 매칭

import re

pattern = r'\Bworld\B'
text = "Otherworldly things happen in the underworld."
matches = re.findall(pattern, text)

print(matches)  # output: ['world'] (단어 경계가 아닌 'world'에 매칭)

‘?P`<name>`’

정규표현식에서 이름 있는 그룹(named group)을 정의하는 데 사용되는 구문입니다. 이 구문은 파이썬을 비롯한 일부 정규표현식 엔진에서 지원하며, 특정한 부분 문자열을 캡처한 뒤 그 값을 나중에 이름을 통해 참조할 수 있도록 합니다.

pattern = r'(?P<word>\w+)'
text = "Hello World!"s
match = re.search(pattern, text)

if match:
    print(match.group('word'))  # output: Hello

자주 사용되는 함수들

re.match()

문자열의 시작 부분에서 특정 패턴과 매칭되는지 확인

import re

pattern = r'\d+'
s = '123abc'
match = re.match(pattern, s)

if match:
    print(match.group())  # output: 123

re.search()

문자열에서 패턴이 처음으로 매칭되는 부분을 찾습니다.

import re

pattern = r'\d+'
s = 'abc123def'
match = re.search(pattern, s)

if match:
    print(match.group())  # output: 123

re.findall()

문자열에서 매칭되는 모든 패턴을 리스트로 반환합니다.

import re

pattern = r'\d+'
s = 'abc123def456ghi789'
matches = re.findall(pattern, s)

print(matches)  # output: ['123', '456', '789']

re.finditer()

문자열에서 매칭되는 모든 패턴을 반복자(iterator)로 반환합니다. 각 매칭 결과는 ‘MatchObject’로 제공됩니다.

import re

pattern = r'\d+'
s = 'abc123def456ghi789'
matches = re.finditer(pattern, s)

for match in matches:
    print(match.group())  # output: 123, 456, 789

re.split()

패턴을 기준으로 문자열을 분리하여 리스트로 반환합니다.

import re

pattern = r'\d+'
s = 'abc123def456ghi789'
result = re.split(pattern, s)

print(result)  # output: ['abc', 'def', 'ghi', '']

re.sub()

매칭되는 패턴을 다른 문자열로 대체합니다.

import re

pattern = r'\d+'
s = 'abc123def456ghi789'
result = re.sub(pattern, '#', s)

print(result)  # output: abc#def#ghi#

정규표현식에서의 플래그

정규표현식을 사용할 때 특정 동작을 제어하기 위해 플래그를 사용할 수 있습니다. 플래그는 정규표현식의 행동을 변경하는 역활을 합니다.

re.IGNORECASE 또는 re.I

대소문자를 구분하지 않고 매칭합니다.

import re

pattern = r'abc'
s = 'ABC'
match = re.match(pattern, s, re.IGNORECASE)

if match:
    print(match.group())  # output: ABC

re.MULTILINE 또는 re.M

여러 줄에서 문자열의 시작(^)과 끝($)을 매칭합니다.

import re

pattern = r'^Hello'
text = """Hello, world!
Goodbye, world!
Hello again!"""

# re.MULTILINE 플래그를 사용하여 각 줄의 시작을 매칭
matches = re.findall(pattern, text, re.MULTILINE)

print(matches)  # output: ['Hello', 'Hello'] (첫 번째와 세 번째 줄에서 매칭)

re.DOTALL 또는 re.S

.이 개행 문자도 포함하여 모든 문자와 매칭되도록 합니다.

import re

pattern = r'Hello.*world'
text = """Hello
world"""

# re.DOTALL 플래그를 사용하여 개행 문자도 포함하여 매칭
match = re.search(pattern, text, re.DOTALL)

print(match.group() if match else "No match")  # output: Hello\nworld

re.VERBOSE 또는 re.X

정규표현식을 보기 좋게 작성할 수 있게 공백과 주석을 허용합니다.

import re

pattern = r"""
    Hello      # 'Hello'를 찾습니다
    \s+        # 하나 이상의 공백
    world      # 'world'를 찾습니다
"""

text = "Hello   world"

# re.VERBOSE 플래그를 사용하여 보기 좋은 정규표현식을 작성
match = re.search(pattern, text, re.VERBOSE)

print(match.group() if match else "No match")  # output: Hello   world

MatchObject의 메서드

정규표현식 매칭 결과로 반환된 ‘MatchObject’는 다음과 같은 메서드를 제공합니다.

group() : 매칭된 전체 문자열 또는 특징 그룹을 반환합니다.
start() : 매칭된 부분의 시작 인덱스를 반환합니다.
end() : 매칭된 부분의 끝 인덱스를 반환합니다.
span() : 매칭된 부분의 (시작 인덱스, 끝 인덱스) 튜플을 반환합니다.

import re

pattern = r'(\d+)-(\d+)'
s = '123-456'
match = re.search(pattern, s)

if match:
    print(match.group(0))  # output: 123-456 (전체 매칭)
    print(match.group(1))  # output: 123 (첫 번째 그룹)
    print(match.group(2))  # output: 456 (두 번째 그룹)
    print(match.start())   # output: 0 (매칭 시작 인덱스)
    print(match.end())     # output: 7 (매칭 끝 인덱스)
    print(match.span())    # output: (0, 7) (매칭 범위)

정규표현식에서의 그룹화와 캡처

정규표현식에서 소괄호 ‘()’를 사용하여 패턴을 그룹화할 수 있으며, 이렇게 하면 해당 부분을 캡쳐하여 나중에 참조할 수 있습니다.

import re

pattern = r'(\d+)-(\d+)'
s = '123-456'
match = re.search(pattern, s)

if match:
    print(match.group(0))  # output: 123-456 (전체 매칭)
    print(match.group(1))  # output: 123 (첫 번째 그룹)
    print(match.group(2))  # output: 456 (두 번째 그룹)

비캡처 그룹 (Non-capturing Group)

특정 그룹을 캡처하지 않으려면 ‘(?:…)’ 구문을 사용합니다.

import re

pattern = r'(?:\d+)-(\d+)'
s = '123-456'
match = re.search(pattern, s)

if match:
    print(match.group(1))  # output: 456 (첫 번째 캡처된 그룹)

Lookahead와 Lookbehind

정규표현식에서 특정 패턴 앞이나 뒤에 있는 것을 확인하지만, 실제 매칭에는 포함되지 않습니다.

Lookahead

긍정적 전방 탐색 (Positive Lookahead): 패턴이 뒤에 있는지 확인합니다. 패턴이 매칭되면 해당 위치는 포함되지만, 실제 매칭 결과에는 포함되지 않습니다.
- 구문: X(?=Y)
- 의미: X가 뒤에 Y가 있는 경우에만 매칭.
부정적 전방 탐색 (Negative Lookahead): 패턴이 뒤에 없는지 확인합니다. 패턴이 매칭되면 해당 위치는 포함되지만, 실제 매칭 결과에는 포함되지 않습니다.
- 구문: X(?!Y)
- 의미: X가 뒤에 Y가 없는 경우에만 매칭.

import re

# 긍정적 전방 탐색: 'cat' 뒤에 's'가 있는 경우
pattern = r'cat(?=s)'
text = "cats and dogs"
matches = re.findall(pattern, text)

print(matches)  # output: ['cat'] (뒤에 's'가 있는 'cat'에 매칭)

# 부정적 전방 탐색: 'cat' 뒤에 's'가 없는 경우
pattern = r'cat(?!s)'
text = "cat and cats"
matches = re.findall(pattern, text)

print(matches)  # output: ['cat'] (뒤에 's'가 없는 'cat'에 매칭)

Lookbehind

긍정적 후방 탐색 (Positive Lookbehind): 패턴이 앞에 있는지 확인합니다. 패턴이 매칭되면 해당 위치는 포함되지만, 실제 매칭 결과에는 포함되지 않습니다.
- 구문: (?<=X)Y
- 의미: Y 앞에 X가 있는 경우에만 매칭.
부정적 후방 탐색 (Negative Lookbehind): 패턴이 앞에 없는지 확인합니다. 패턴이 매칭되면 해당 위치는 포함되지만, 실제 매칭 결과에는 포함되지 않습니다.
- 구문: (?<!X)Y
- 의미: Y 앞에 X가 없는 경우에만 매칭.

import re

# 긍정적 후방 탐색: 'ing' 앞에 'r'가 있는 경우
pattern = r'(?<=r)ing'
text = "running and jumping"
matches = re.findall(pattern, text)

print(matches)  # 출력: ['ing', 'ing'] (각 'ing' 앞에 'r'가 있는 경우에 매칭)

# 부정적 후방 탐색: 'ing' 앞에 'r'가 없는 경우
pattern = r'(?<!r)ing'
text = "going and ringing"
matches = re.findall(pattern, text)

print(matches)  # 출력: ['ing'] (각 'ing' 앞에 'r'가 없는 경우에 매칭)

Twitter Facebook LinkedIn