Modify Post

nbellari · nbellari · commit bdbd61759d0c · 2023-10-10T12:30:18.000+01:00
diff --git a/_posts/2022-12-15-python-re.markdown b/_posts/2022-12-15-python-re.markdown
@@ -5,46 +5,84 @@ date: 2022-12-15 20:06:11 +0530
 categories: programming
 ---
 
-In python, we import `re` module to work with regular expressions. Using it, we `search` for a pattern which is a `r'raw string'` in a text. It can return `None` if no match is found, otherwise it can return a `match` object.
+In python, we import `re` module to work with regular expressions. Using it, we `search` for a pattern which is a `r'raw string'` in a text like `re.search(pattern, text)`. It can return `None` if no match is found, otherwise it can return a `match` object.
 
-If a match is found, then `match.group()` will contain the matching portion of the original string. In case we group the regular expressions with a `(` and `)`, then `group(0)` will contain the whole match, while `group(1)`, `group(2)` etc. will contain specific matches. `group(1,2,3)` will return a tuple of them. Another way to get the tuple of the whole match is to simply call `groups()`. `group()` is same as `group(0)`
+regexes can be compiled with `re.compile(pattern)` - compilation work shifts to application start time, rather than application use time. It returns a `RegexObject` - on which we can get the `regex.pattern` and we can `regex.search(text)` to see if the given text matches the pattern. `search` also takes a position - to start searching from that position.
+
+Most of the methods can be called directly from `re` module or from the `re.compile()` object.
 
-A `groupdict()` would give a dictionary instead of a tuple, however, one has to specify the key name in the group matching expression and it gets messy ther, though it has a handy name to refer to.
+`findall()` finds all the matches - it returns a list of matches as strings. In case groupings are used, then each element in the list is a tuple and can be indexed. Instead of passing a string, one can even pass `f.read()` to pass file contents to match. BTW, `findall()` just returns the captured groups, not the whole expression. `finditer()` returns an iterator for match objects, not strings themselves. With `match` object returned by `finditer()` one can index the original string like `text[s:e]` where `s` is `match.start()` and `e` is `match.end()`
 
-`findall()` finds all the matches - it returns a list of matches as strings. In case groupings are used, then each element in the list is a tuple and can be indexed. Instead of passing a string, one can even pass `f.read()` to pass file contents to match. BTW, `findall()` just returns the captured groups, not the whole expression. `finditer()` returns an iterator for match objects, not strings themselves.
 
 Whatever is matched, can be substituted as well with `sub`. Whatever is matched can be replaced by default, however, only specific groups can be referred to/omitted to get a different result with `\1`, `\2` etc.
 
+### Repetitions
+
+| `*` | zero or more |
+| `+` | one or more |
+| `?` | zero or one |
+| `{m}` | exactly `m` matches |
+| `{m, n}` | `m` - `n` matches |
+| `{m, }` | `m` or more matches |
+
+* Adding a `?` at the end of the repetition pattern will make it non-greedy
+* `ab??` - `a` followed by zero or one `b` - non-greedy version
+* In non-greedy version if a match has to be made zero or more times, then it will be matched zero times
+* In non-greedy version `{m,n}` will match only `m` even though more is available
+
+### Character sets
+
+Can be followed by repetition pattern.
+
+| `[ab]` | `a` or `b` |
+| `[^ab]` | neither `a` nor `b` |
+| `[a-z]` | range of lower case, upper case, digits |
+| `-` | within `[]` matches a range if specified in between two ordered characters, specifies itself when at the end |
+
+### Escape codes
+
+| `\d` | a digit |
+| `\D` | a non-digit |
+| `\s` | a white-space |
+| `\S` | a non-white-space |
+| `\w` | alpha numeric | 
+| `\W` | non alpha numeric |
+
+### Anchoring
+
+| `^`, `\A` | Beginning of line (if not used in character set) |
+| `$`, `\Z` | End of line |
+| `\b` | matches empty string at word boundary, that is, beginning or end |
+| `\B` | matches empty string at non-word boundary |
+
+### Constraining the search
+
+| `re.match` | match the pattern at the beginning of input |
+| `re.fullmatch` | match the whole input with the pattern |
+| `re.search(text, pos)` | match from `pos` in the input | 
+
+### Grouping
+
+If a match is found, then `match.group()` will contain the matching portion of the original string. In case we group the regular expressions with a `(` and `)`, then `group(0)` will contain the whole match, while `group(1)`, `group(2)` etc. will contain specific matches. `group(1,2,3)` will return a tuple of them. Another way to get the tuple of the whole match is to simply call `groups()`. `group()` is same as `group(0)`
+
+A `groupdict()` would give a dictionary instead of a tuple, however, one has to specify the key name in the group matching expression and it gets messy there, though it has a handy name to refer to.
+
+`(?: )` will not count as a group - in case if some match has to be ignored.
+
 A note on how to match:
 
-* ordinary characters (alpha-numeric) just match themselves only
+### Misc
+
 * `.` matches any single character except newline - however, when used in `[]`, it means a literal dot
-* `\w` matches a single alpha-numeric, underscore only
-* `\W` matches any non word character - whatever mentioned above
-* `\b` matches boundary between word and a non-word (not the non-word character itself)
-* `\s` matches a single space character
-* `\S` matches a single non-space character
 * `\t`, `\n`, `\r` - tab, newline, return
-* `\d` matches a single digit
-* `^` and `$` match beginning and end respectively
 * `\` inhibits the specialness or the identity of the character. If it is given spuriously, the engine will throw an error
-* `+` matches one or more occurences of a pattern before it
-* `*` matches zero or more occurences of a pattern before it
-* `?` matches zero or one occurences of a pattern before it
-* `+` and `*` are greedy, they match as much as possible
-* `[]` matches any specific set of characters mentioned inside it.
-* `-` within `[]` matches a range if specified in between two ordered characters, specifies itself when at the end
-* `^` within `[]` inverts the match
-* `(?: )` will not count as a group - in case if some match has to be ignored.
-* when `?` follows a `*` or `+`, it becomes non-greedy.
-* `{n,}` at the end of a matching sequence will match `n` or more occurrences of it.
 * any variable can be included in the regular expression by embedding `re.escape(var)` in between
-* `(?=)` and `(?!)` are positive and negative look-ahead matches - from what I understood, it just peeps, but does not move past it for matching..
+* `(?=)` and `(?!)` are positive and negative look-ahead matches - from what I understood, it just peeps, but does not move past it for matching.
 
 Some flags that can be passed are:
 
 * `re.IGNORECASE` - to ignore the case
 * `re.DOTALL` - let the `.` match the newline also
 * `re.MULTILINE` - `^` and `$` will match each line instead of the real beginning and end
 
-https://www.debuggex.com/ - can be used to debug regular expressions in python
+(debuggex)[https://www.debuggex.com/] - can be used to debug regular expressions in python
diff --git a/_posts/2023-09-30-python-standard-library-reference.markdown b/_posts/2023-09-30-python-standard-library-reference.markdown
@@ -155,4 +155,5 @@ Different action arguments:
 | `:>` | right aligned |
 | `:^` | center aligned |
 | `:<num>{b|f|o|x|X}` | print the number in binary, float, octal, lower hex, upper hex respectively |
-
+|`{!r}` | user `repr` method to display |
+| `{!s}` | use `str` method to display |