_posts/2022-12-15-python-re.markdown
62 additions & 24 deletions
@@ -5,46 +5,84 @@ date: 2022-12-15 20:06:11 +0530
5
5
categories: programming
6
6
---
7
7
8
-
In python, we import `re` module to work with regular expressions. Using it, we `search` for a pattern which is a `r'raw string'` in a text. It can return `None` if no match is found, otherwise it can return a `match` object.
8
+
In Python, we import the `re` module to work with regular expressions. With it, we `search` for a pattern, written as a `r'raw string'`, in a text: `re.search(pattern, text)`. It returns `None` if no match is found, and a `match` object otherwise.
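A minimal sketch of that call, with a made-up sample string; the variable names are only for illustration:

```python
import re

text = "order 66 shipped on day 12"

# search() scans the text and returns the first match, or None
match = re.search(r"\d+", text)
first_number = match.group() if match else None   # "66"

# There is no 'z' in the text, so search() returns None
missing = re.search(r"z+", text)
```

Checking the result against `None` before calling `group()` avoids an `AttributeError` when nothing matches.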
9
9
10
-
If a match is found, then `match.group()` will contain the matching portion of the original string. In case we group the regular expressions with a `(` and `)`, then `group(0)` will contain the whole match, while `group(1)`, `group(2)` etc. will contain specific matches. `group(1,2,3)` will return a tuple of them. Another way to get the tuple of the whole match is to simply call `groups()`. `group()` is same as `group(0)`
10
+
Regexes can be compiled with `re.compile(pattern)` - the compilation work shifts to application start time, rather than application use time. It returns a compiled pattern object (`re.Pattern` in Python 3) - from which we can read `regex.pattern` and call `regex.search(text)` to see if the given text matches the pattern. The compiled `search` also takes a position - to start searching from that position.
11
+
12
+
Most of the methods can be called directly from the `re` module or on the object returned by `re.compile()`.
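A small sketch of compiling once and reusing the object; the pattern and text are made up for illustration:

```python
import re

word = re.compile(r"[a-z]+")       # compiled once, reused below

text = "alpha beta gamma"
source = word.pattern              # the original pattern string

first = word.search(text)          # scanning starts at index 0
later = word.search(text, 6)       # scanning starts at index 6

# The module-level function finds the same first match
same = re.search(r"[a-z]+", text)
```

Here `first.group()` is `"alpha"` and `later.group()` is `"beta"`, because index 6 is where `beta` begins.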
11
13
12
-
A `groupdict()`would give a dictionary instead of a tuple, however, one has to specify the key name in the group matching expression and it gets messy ther, though it has a handy name to refer to.
14
+
`findall()` finds all the matches - it returns a list of matches as strings. When the pattern has one capturing group, the list holds just that group; with several groups, each element in the list is a tuple and can be indexed. So with groups, `findall()` returns only the captured groups, not the whole match. Instead of passing a string, one can even pass `f.read()` to match against file contents. `finditer()` returns an iterator of match objects, not the strings themselves. With a `match` object returned by `finditer()` one can slice the original string like `text[s:e]` where `s` is `match.start()` and `e` is `match.end()`.
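To make the difference concrete, here is a sketch on a made-up string showing what `findall()` returns with zero, one and two groups, and how `finditer()` exposes positions:

```python
import re

text = "x=1, y=22, z=333"

no_groups = re.findall(r"\d+", text)           # ['1', '22', '333']
one_group = re.findall(r"(\w)=\d+", text)      # ['x', 'y', 'z']
two_groups = re.findall(r"(\w)=(\d+)", text)   # [('x', '1'), ('y', '22'), ('z', '333')]

# finditer() yields match objects, so start() and end() are available
spans = [(m.start(), m.end()) for m in re.finditer(r"\d+", text)]
sliced = [text[s:e] for s, e in spans]         # same strings as no_groups
```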
13
15
14
-
`findall()` finds all the matches - it returns a list of matches as strings. In case groupings are used, then each element in the list is a tuple and can be indexed. Instead of passing a string, one can even pass `f.read()` to pass file contents to match. BTW, `findall()` just returns the captured groups, not the whole expression. `finditer()` returns an iterator for match objects, not strings themselves.
15
16
16
17
Whatever is matched can be substituted as well with `sub`. By default the whole match is replaced; however, specific groups can be referred to in the replacement with `\1`, `\2` etc. (or omitted) to get a different result.
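A short sketch of `sub` with and without group references; the date string is just an example:

```python
import re

date = "2022-12-15"

# Replace every match with a fixed string
masked = re.sub(r"\d", "#", date)                            # "####-##-##"

# Reorder the captured groups with \1, \2, \3 in the replacement
swapped = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3/\2/\1", date)    # "15/12/2022"

# Leave groups out of the replacement to drop them
year_only = re.sub(r"(\d+)-(\d+)-(\d+)", r"\1", date)        # "2022"
```

Writing the replacement as a raw string keeps `\1` from being read as an ordinary string escape.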
17
18
19
+
### Repetitions
20
+
21
+
|`*`| zero or more |
22
+
|`+`| one or more |
23
+
|`?`| zero or one |
24
+
|`{m}`| exactly `m` matches |
25
+
|`{m,n}`| between `m` and `n` matches (no space after the comma) |
26
+
|`{m,}`|`m` or more matches |
27
+
28
+
* Adding a `?` at the end of the repetition pattern will make it non-greedy
29
+
* `ab??` - `a` followed by zero or one `b` - non-greedy version
30
+
* In the non-greedy version, a pattern that may match zero or more times is matched as few times as the rest of the pattern allows, starting from zero
31
+
* In the non-greedy version, `{m,n}?` will match only `m` times even though more is available
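The greedy and non-greedy behaviours can be compared side by side; this is a sketch on made-up inputs:

```python
import re

html = "<b>bold</b>"

greedy = re.search(r"<.*>", html).group()     # "<b>bold</b>", as much as possible
lazy = re.search(r"<.*?>", html).group()      # "<b>", as little as possible

optional_greedy = re.search(r"ab?", "abb").group()    # "ab"
optional_lazy = re.search(r"ab??", "abb").group()     # "a"

range_greedy = re.search(r"a{2,4}", "aaaaa").group()  # "aaaa"
range_lazy = re.search(r"a{2,4}?", "aaaaa").group()   # "aa"
```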
32
+
33
+
### Character sets
34
+
35
+
A character set can be followed by a repetition pattern.
36
+
37
+
|`[ab]`|`a` or `b`|
38
+
|`[^ab]`| neither `a` nor `b`|
39
+
|`[a-z]`| range of lower case letters; likewise `[A-Z]` for upper case and `[0-9]` for digits |
40
+
|`-`| within `[]` matches a range if specified in between two ordered characters, specifies itself when at the end |
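A quick sketch of these character sets in use, on made-up words:

```python
import re

either = re.findall(r"[ab]", "cabbage")             # ['a', 'b', 'b', 'a']
neither = re.findall(r"[^ab]", "cabbage")           # ['c', 'g', 'e']
lower_runs = re.findall(r"[a-z]+", "Hello World")   # ['ello', 'orld']

# A '-' at the end of the set is a literal hyphen, not a range
hyphenated = re.findall(r"[a-z-]+", "well-known fact")   # ['well-known', 'fact']
```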
41
+
42
+
### Escape codes
43
+
44
+
|`\d`| a digit |
45
+
|`\D`| a non-digit |
46
+
|`\s`| a white-space |
47
+
|`\S`| a non-white-space |
48
+
|`\w`| alphanumeric or underscore |
49
+
|`\W`| anything other than alphanumeric or underscore |
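The escape codes can be tried on one sample string; the text below is made up for illustration:

```python
import re

text = "room_7 costs $40"

digits = re.findall(r"\d", text)        # ['7', '4', '0']
spaces = re.findall(r"\s", text)        # [' ', ' ']
words = re.findall(r"\w+", text)        # ['room_7', 'costs', '40']
non_words = re.findall(r"\W", text)     # [' ', ' ', '$']
```

Note that `room_7` stays in one piece because `\w` also accepts the underscore.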
50
+
51
+
### Anchoring
52
+
53
+
|`^`, `\A`| Beginning of input (if not used in character set); with `re.MULTILINE`, `^` also matches at the beginning of each line |
54
+
|`$`, `\Z`| End of input; `$` also matches just before a trailing newline, and at the end of each line with `re.MULTILINE` |
55
+
|`\b`| matches empty string at word boundary, that is, beginning or end of a word |
56
+
|`\B`| matches empty string at non-word boundary |
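A sketch of how the anchors behave on a made-up two-line string, recording match positions so the differences are visible:

```python
import re

text = "cat concat\ncatalog"

# ^ follows re.MULTILINE, \A never does
caret_hits = re.findall(r"^cat", text, re.MULTILINE)     # ['cat', 'cat']
start_hits = re.findall(r"\Acat", text, re.MULTILINE)    # ['cat']

# \b needs a word boundary before 'cat', \B needs the opposite
at_boundary = [m.start() for m in re.finditer(r"\bcat", text)]   # [0, 11]
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]   # [7]

# $ tolerates one trailing newline, \Z does not
dollar = re.search(r"cat$", "cat\n")      # a match object
strict = re.search(r"cat\Z", "cat\n")     # None
```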
57
+
58
+
### Constraining the search
59
+
60
+
|`re.match`| match the pattern at the beginning of input |
61
+
|`re.fullmatch`| match the whole input with the pattern |
62
+
|`pattern.search(text, pos)`| search from `pos` in the input (`pattern` is a compiled regex; `re.search` itself takes no position) |
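The three ways of constraining a search can be sketched as follows. The start position is passed to the `search` method of a compiled pattern, since the module-level `re.search` has no such argument; the sample strings are made up:

```python
import re

# match() only succeeds at the beginning of the input
no_match = re.match(r"\d+", "abc 123")              # None
at_start = re.match(r"\d+", "123 abc").group()      # "123"

# fullmatch() needs the pattern to cover the whole input
partial = re.fullmatch(r"\d+", "123 abc")           # None
whole = re.fullmatch(r"\d+", "123").group()         # "123"

# A compiled pattern can start searching from a given position
digits = re.compile(r"\d+")
from_pos = digits.search("12 ab 34", 2).group()     # "34"
```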
63
+
64
+
### Grouping
65
+
66
+
If a match is found, then `match.group()` will contain the matching portion of the original string. In case we group the regular expressions with a `(` and `)`, then `group(0)` will contain the whole match, while `group(1)`, `group(2)` etc. will contain specific matches. `group(1,2,3)` will return a tuple of them. Another way to get a tuple of all the captured groups is to simply call `groups()`. `group()` is the same as `group(0)`.
67
+
68
+
`groupdict()` gives a dictionary instead of a tuple; however, one has to name each group in the pattern with `(?P<name>...)`, which gets messy, though it gives a handy name to refer to.
69
+
70
+
`(?: )` is a non-capturing group and will not count as a group - useful when part of the pattern has to be grouped but its match ignored.
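A sketch pulling the grouping calls together; the date and name strings are made up for illustration:

```python
import re

m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2022-12-15")

whole = m.group()            # "2022-12-15", same as m.group(0)
year = m.group(1)            # "2022"
year_day = m.group(1, 3)     # ("2022", "15")
all_groups = m.groups()      # ("2022", "12", "15")

# Named groups feed groupdict()
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2022-12-15")
as_dict = named.groupdict()  # {"year": "2022", "month": "12"}

# (?: ) groups the alternation without adding a captured group
title = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace")
captured = title.groups()    # ("Lovelace",)
```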
71
+
18
72
A note on how to match:
19
73
20
-
* ordinary characters (alpha-numeric) just match themselves only
74
+
### Misc
75
+
21
76
* `.` matches any single character except newline - however, when used in `[]`, it means a literal dot
22
-
*`\w` matches a single alpha-numeric, underscore only
23
-
*`\W` matches any non word character - whatever mentioned above
24
-
*`\b` matches boundary between word and a non-word (not the non-word character itself)
25
-
*`\s` matches a single space character
26
-
*`\S` matches a single non-space character
27
77
* `\t`, `\n`, `\r` - tab, newline, return
28
-
*`\d` matches a single digit
29
-
*`^` and `$` match beginning and end respectively
30
78
* `\` removes the special meaning of the character after it, or gives an ordinary character a special one (as in `\d`). An unknown escape of an ASCII letter makes the engine raise an error
31
-
*`+` matches one or more occurences of a pattern before it
32
-
*`*` matches zero or more occurences of a pattern before it
33
-
*`?` matches zero or one occurences of a pattern before it
34
-
*`+` and `*` are greedy, they match as much as possible
35
-
*`[]` matches any specific set of characters mentioned inside it.
36
-
*`-` within `[]` matches a range if specified in between two ordered characters, specifies itself when at the end
37
-
*`^` within `[]` inverts the match
38
-
*`(?: )` will not count as a group - in case if some match has to be ignored.
39
-
* when `?` follows a `*` or `+`, it becomes non-greedy.
40
-
*`{n,}` at the end of a matching sequence will match `n` or more occurrences of it.
41
79
* any variable can be included in the regular expression by embedding `re.escape(var)` in between
42
-
*`(?=)` and `(?!)` are positive and negative look-ahead matches - from what I understood, it just peeps, but does not move past it for matching..
80
+
* `(?=)` and `(?!)` are positive and negative look-ahead matches - they just peep at what follows, but do not move past it for matching.
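Two of the points above, `re.escape` and look-ahead, are easier to see in a sketch. The variable `var` and the sample strings are hypothetical:

```python
import re

# re.escape() makes the '+' in the variable a literal character
var = "1+1"
escaped = re.search(re.escape(var) + r"=(\d+)", "1+1=2")
answer = escaped.group(1)                       # "2"
unescaped = re.search(var + r"=(\d+)", "1+1=2") # None: '1+' means one or more '1'

# Positive look-ahead: 'foo' only if 'bar' follows, but 'bar' is not consumed
peek = re.search(r"foo(?=bar)", "foobar")
peek_text, peek_end = peek.group(), peek.end()  # "foo", 3

# Negative look-ahead: whole numbers not followed by '%'
plain_numbers = re.findall(r"\b\d+\b(?!%)", "10% of 250")   # ['250']
```

Because the look-ahead consumes nothing, the match ends at index 3 and the next search would still see `bar`.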
43
81
44
82
Some flags that can be passed are:
45
83
46
84
* `re.IGNORECASE` - to ignore the case
47
85
* `re.DOTALL` - let the `.` match the newline also
48
86
* `re.MULTILINE` - `^` and `$` will match each line instead of the real beginning and end
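A sketch of the three flags on a made-up two-line string, each compared with the default behaviour:

```python
import re

text = "Hello\nworld"

ignore_case = re.findall(r"hello", text, re.IGNORECASE)       # ['Hello']

dot_default = re.search(r"Hello.world", text)                 # None, '.' skips newline
dot_all = re.search(r"Hello.world", text, re.DOTALL).group()  # "Hello\nworld"

line_default = re.findall(r"^\w+$", text)                     # []
line_multi = re.findall(r"^\w+$", text, re.MULTILINE)         # ['Hello', 'world']

# Flags can be combined with |
combined = re.findall(r"^[a-z]+$", text, re.IGNORECASE | re.MULTILINE)
```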
49
87
50
-
https://www.debuggex.com/ - can be used to debug regular expressions in python
88
+
[debuggex](https://www.debuggex.com/) - can be used to debug regular expressions in Python