Feature
-------

AFR CLI enhancements

Summary
-------

Presently, AFR reporting via the CLI has many problems in how logs and
heal information are represented, which keeps users from using the
data effectively. This feature corrects these problems and provides a
coherent mechanism to present heal status, heal information, and the
associated logs.

Owners
------

Venkatesh Somayajulu
Raghavan

Current status
--------------

There are many open bugs related to this area which indicate the
current status and why these requirements are needed:

1) 924062 - gluster volume heal info shows only gfids in some cases and
sometimes names. This is very confusing for the end user.

2) 852294 - gluster volume heal info hangs/crashes when there is a
large number of entries to be healed.

3) 883698 - When the self heal daemon is turned off, heal info does not
show any output. But healing can still happen because of lookups from
the IO path, so the list of entries to be healed still needs to be
shown.

4) 921025 - Directories are not reported when the list of split brain
entries needs to be displayed.

5) 981185 - When the self heal daemon process is offline, volume heal
info gives the error "staging failure".

6) 952084 - We need a command to resolve files in split brain state.

7) 986309 - We need to report source information for files which got
healed during a self heal session.

8) 986317 - Sometimes the list of files to be healed also includes
files to which IO is being done, since the entries for these files
could be in the xattrop directory. This can be confusing for the user.

There is a master bug, 926044, that sums up most of the above problems.
It gives the QA perspective on the current representation produced by
the present reporting infrastructure.

Detailed Description
--------------------

1) One common thread among all the above complaints is that the
information presented to the user spreads **FUD**, for the following
reasons:

(a) Split brain itself is a scary scenario, especially with VMs.
(b) The data that we present to users cannot be relied upon to get at
    the list of affected files. *For example:* we need to provide
    mechanisms by which the user can automate the resolution of split
    brain.
(c) The logs that are generated are scarier still, since we see some
    error lines repeated hundreds of times. Our mailing lists are
    filled with such emails from end users.

Any data is useless unless it is associated with an event. For self
heal, the event that leads to self heal is the loss of connectivity to
a brick from a client. So all healing info, and especially split brain,
should be associated with such events.

The following is hence the proposed mechanism:

(a) Every loss of a brick from the client's perspective is logged and
    made available via some ID. The information records the time from
    when the brick went down to when it came back up. It should also
    report the number of IO transactions (modifications) that happened
    during this event.
(b) The list of these events is made available via some CLI command.
    The actual command needs to be detailed as part of this feature.
(c) All volume info commands regarding the list of files to be healed,
    files healed, and split brain files should be associated with
    these events.
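
The event record and the association in point (c) could be sketched as
follows. This is only an illustration of the proposed data model; the
names (`BrickEvent`, `heal_info`, the brick paths) are hypothetical and
not part of any existing GlusterFS API:

```python
from dataclasses import dataclass, field

@dataclass
class BrickEvent:
    """One loss-of-brick event as seen from a client (hypothetical record)."""
    event_id: int
    brick: str            # e.g. "server1:/bricks/b1" (illustrative path)
    down_time: float      # epoch seconds when the brick went down
    up_time: float        # epoch seconds when it came back up
    modifies: int = 0     # IO transactions (modifications) during the outage
    pending_heals: list = field(default_factory=list)  # files tagged to this event

events = [
    BrickEvent(1, "server1:/bricks/b1", 1000.0, 1060.0, 42, ["/a.img", "/b.img"]),
    BrickEvent(2, "server2:/bricks/b1", 2000.0, 2005.0, 3, ["/c.txt"]),
]

def heal_info(events):
    """Point (c): report files to be healed grouped by the causing event."""
    return {e.event_id: e.pending_heals for e in events}
```

The key design point is that a heal entry is never reported bare: it is
always reachable through the event that produced it.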

2) Provide a mechanism to show statistics at the volume and replica
group level. It should show the number of files to be healed and the
number of split brain files at both the volume and replica group
level.
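
The two-level counters could be derived as in this sketch. The per-file
records and all names here are hypothetical, chosen only to show the
aggregation:

```python
from collections import Counter

# Hypothetical per-file heal records: (replica_group, needs_heal, split_brain)
records = [
    ("replica-0", True, False),
    ("replica-0", True, True),
    ("replica-1", True, False),
]

def heal_counts(records):
    """Count files to be healed and split brain files per replica group,
    plus a rollup for the volume as a whole."""
    to_heal, split = Counter(), Counter()
    for group, needs_heal, split_brain in records:
        if needs_heal:
            to_heal[group] += 1
            to_heal["volume"] += 1
        if split_brain:
            split[group] += 1
            split["volume"] += 1
    return to_heal, split
```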

3) Provide a mechanism to show, per volume, the list of files to be
healed, files healed, and files in split brain, with the following
information:

(a) File name
(b) Brick location
(c) Event association (brick going down)
(d) Source
(e) Sink

4) Self heal crawl statistics - introduce new CLI commands for showing
more information on the self heal crawl, per volume.

(a) Display why a self heal crawl ran (timeouts, brick coming up)
(b) Start time and end time
(c) Number of files it attempted to heal
(d) Location of the self heal daemon
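
Items (a)-(d) amount to one record per crawl. A minimal sketch of that
record, with entirely hypothetical field and function names:

```python
from dataclasses import dataclass

@dataclass
class CrawlStats:
    """Per-crawl statistics for the self heal daemon (hypothetical layout)."""
    reason: str        # why the crawl ran: "timeout" or "brick-up"
    start_time: float  # epoch seconds
    end_time: float
    attempted: int     # number of files the crawl attempted to heal
    shd_node: str      # node on which the self heal daemon ran

def summarize(c):
    """One line of CLI output per crawl (illustrative format)."""
    return (f"{c.reason} crawl on {c.shd_node}: "
            f"{c.attempted} files in {c.end_time - c.start_time:.0f}s")
```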

5) Scale the logging infrastructure to handle the huge file lists that
need to be displayed as part of the logging.

(a) Right now the system crashes or hangs when the number of files is
    high.
(b) It causes arbitrary CLI timeouts. The latencies involved in the
    logging have to be studied (profiled) and mechanisms to circumvent
    them have to be introduced.
(c) All files are displayed in the output. Have a better way of
    representing them.

Options are:

(a) Maybe write to a glusterd log file, or have a separate directory
    for AFR heal logs.
(b) Have a status kind of command. This would display the current
    status of the log building, and maybe represent the output in
    batches when there is a huge list.
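
The batched representation in option (b) is essentially paging. A
generic sketch, not an existing gluster facility (the batch size is an
arbitrary choice here):

```python
def batches(entries, size=128):
    """Yield the heal-entry list in fixed-size batches so the CLI can
    page through it instead of printing everything at once."""
    for i in range(0, len(entries), size):
        yield entries[i:i + size]

# A status-style command would print one batch per invocation and
# remember the offset, rather than dumping the full list and risking
# a CLI timeout.
```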

6) We should provide a mechanism whereby the user can heal split brain
by some pre-established policies:

(a) Let the system figure out the latest files (assuming all nodes are
    in time sync) and choose the copies that have the latest time.
(b) Choose one particular brick as the source for split brain and heal
    all split brains from this brick.
(c) Just remove the split brain information from the changelog. We
    leave the exercise of repairing split brain to the user, who would
    rewrite to the split brained files. (Right now the user is forced
    to remove xattrs manually for this step.)
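
Policy (a) reduces to a comparison like the sketch below. The brick
names are illustrative, and as noted above, the policy only makes sense
if node clocks are in sync:

```python
def pick_source(copies):
    """Policy (a): choose the split brain source as the copy with the
    latest modification time. `copies` maps brick -> mtime (epoch secs)."""
    return max(copies, key=copies.get)

copies = {"server1:/bricks/b1": 1700.0, "server2:/bricks/b1": 1850.0}
```

Policy (b) skips the comparison entirely and takes a user-named brick
as the source for every split brained file.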

Benefits to GlusterFS
---------------------

Makes the end user more aware of healing status and provides statistics.

Scope
-----

6.1. Nature of proposed change

Modifications to AFR, CLI, and glusterd code.

6.2. Implications on manageability

New CLI commands to be added. Existing commands to be improved.

6.3. Implications on presentation layer

N/A

6.4. Implications on persistence layer

N/A

6.5. Implications on 'GlusterFS' backend

N/A

6.6. Modification to GlusterFS metadata

N/A

6.7. Implications on 'glusterd'

Changes for healing-specific commands will be introduced.

How To Test
-----------

See the Documentation section.

User Experience
---------------

*Changes in CLI, effect on User experience...*

Documentation
-------------

<http://review.gluster.org/#/c/7792/1/doc/features/afr-statistics.md>

Status
------

Patches:

<http://review.gluster.org/6044> <http://review.gluster.org/4790>

Status:

Merged