
Commit 72f6d5c

guillep2k authored and lafriks committed
Restrict repository indexing by glob match (#7767)
* Restrict repository indexing by file extension
* Use REPO_EXTENSIONS_LIST_INCLUDE instead of REPO_EXTENSIONS_LIST_EXCLUDE and have a more flexible extension pattern
* Corrected to pass lint gosimple
* Add wildcard support to REPO_INDEXER_EXTENSIONS
* This reverts commit 72a650c.
* Add wildcard support to REPO_INDEXER_EXTENSIONS (no make vendor)
* Simplify isIndexable() for better clarity
* Add gobwas/glob to vendors
* manually set appengine new release
* Implement better REPO_INDEXER_INCLUDE and REPO_INDEXER_EXCLUDE
* Add unit and integration tests
* Update app.ini.sample and reword config-cheat-sheet
* Add doc page and correct app.ini.sample
* Some polish on the doc
* Simplify code as suggested by @lafriks
1 parent 3fd0eec commit 72f6d5c

38 files changed: 920 additions, 17 deletions

custom/conf/app.ini.sample

+5
Original file line number | Diff line number | Diff line change
@@ -302,6 +302,11 @@ REPO_INDEXER_ENABLED = false
302302
REPO_INDEXER_PATH = indexers/repos.bleve
303303
UPDATE_BUFFER_LEN = 20
304304
MAX_FILE_SIZE = 1048576
305+
; A comma separated list of glob patterns (see https://github.com/gobwas/glob) to include
306+
; in the index; default is empty
307+
REPO_INDEXER_INCLUDE =
308+
; A comma separated list of glob patterns to exclude from the index; default is empty
309+
REPO_INDEXER_EXCLUDE =
305310

306311
[admin]
307312
; Disallow regular (non-admin) users from creating organizations.

docs/content/doc/advanced/config-cheat-sheet.en-us.md

+2
@@ -181,6 +181,8 @@ Values containing `#` or `;` must be quoted using `` ` `` or `"""`.
181181

182182
- `REPO_INDEXER_ENABLED`: **false**: Enables code search (uses a lot of disk space, about 6 times more than the repository size).
183183
- `REPO_INDEXER_PATH`: **indexers/repos.bleve**: Index file used for code search.
184+
- `REPO_INDEXER_INCLUDE`: **empty**: A comma separated list of glob patterns (see https://github.com/gobwas/glob) to **include** in the index. Use `**.txt` to match any file with a `.txt` extension. An empty list means include all files.
185+
- `REPO_INDEXER_EXCLUDE`: **empty**: A comma separated list of glob patterns (see https://github.com/gobwas/glob) to **exclude** from the index. Files that match this list will not be indexed, even if they also match a pattern in `REPO_INDEXER_INCLUDE`.
184186
- `UPDATE_BUFFER_LEN`: **20**: Buffer length of index request.
185187
- `MAX_FILE_SIZE`: **1048576**: Maximum size in bytes of files to be indexed.
186188

@@ -0,0 +1,58 @@
1+
---
2+
date: "2019-09-06T01:35:00-03:00"
3+
title: "Repository indexer"
4+
slug: "repo-indexer"
5+
weight: 45
6+
toc: true
7+
draft: false
8+
menu:
9+
sidebar:
10+
parent: "advanced"
11+
name: "Repository indexer"
12+
weight: 45
13+
identifier: "repo-indexer"
14+
---
15+
16+
# Repository indexer
17+
18+
## Setting up the repository indexer
19+
20+
Gitea can search through the files of your repositories once the repository indexer is enabled in your [`app.ini`](https://docs.gitea.io/en-us/config-cheat-sheet/):
21+
22+
```
23+
[indexer]
24+
; ...
25+
REPO_INDEXER_ENABLED = true
26+
REPO_INDEXER_PATH = indexers/repos.bleve
27+
UPDATE_BUFFER_LEN = 20
28+
MAX_FILE_SIZE = 1048576
29+
REPO_INDEXER_INCLUDE =
30+
REPO_INDEXER_EXCLUDE = resources/bin/**
31+
```
32+
33+
Please bear in mind that indexing the contents can consume a lot of system resources, especially when the index is created for the first time or globally updated (e.g. after upgrading Gitea).
34+
35+
### Choosing the files for indexing by size
36+
37+
The `MAX_FILE_SIZE` option makes the indexer skip all files larger than the specified value, in bytes.
38+
39+
### Choosing the files for indexing by path
40+
41+
Gitea applies glob pattern matching from the [`gobwas/glob` library](https://github.com/gobwas/glob) to choose which files will be included in the index.
42+
43+
Limiting the list of files prevents the index from becoming polluted with derived or irrelevant files (e.g. lss, sym, or map files), so search results are more relevant. It can also help reduce the index size.
44+
45+
`REPO_INDEXER_INCLUDE` (default: empty) is a comma separated list of glob patterns to **include** in the index. An empty list means "_include all files_".
46+
`REPO_INDEXER_EXCLUDE` (default: empty) is a comma separated list of glob patterns to **exclude** from the index. Files that match this list will not be indexed. `REPO_INDEXER_EXCLUDE` takes precedence over `REPO_INDEXER_INCLUDE`.
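
The precedence rule described above can be sketched in a few lines of Go. This is only an illustration under stated assumptions, not the code added by this commit: the `matcher` type, the `suffix` and `prefix` helpers, and the `indexable` function are hypothetical names, and the toy matchers stand in for compiled glob patterns.

```go
package main

import (
	"fmt"
	"strings"
)

// matcher is a hypothetical stand-in for a compiled glob pattern.
type matcher func(path string) bool

// suffix and prefix build toy matchers for the demo only.
func suffix(s string) matcher {
	return func(path string) bool { return strings.HasSuffix(path, s) }
}

func prefix(s string) matcher {
	return func(path string) bool { return strings.HasPrefix(path, s) }
}

// matchesAny reports whether at least one matcher accepts the path.
func matchesAny(path string, ms []matcher) bool {
	for _, m := range ms {
		if m(path) {
			return true
		}
	}
	return false
}

// indexable applies the documented rules: an empty include list admits
// every file, and a match in the exclude list always wins.
func indexable(path string, include, exclude []matcher) bool {
	path = strings.ToLower(path) // paths are compared in lower case
	if len(include) > 0 && !matchesAny(path, include) {
		return false
	}
	return !matchesAny(path, exclude)
}

func main() {
	include := []matcher{suffix(".txt")}
	exclude := []matcher{prefix("resources/bin/")}
	for _, p := range []string{"README.txt", "resources/bin/notes.txt", "main.go"} {
		fmt.Printf("%-26s %v\n", p, indexable(p, include, exclude))
	}
}
```

Checking the include list first and the exclude list last keeps the "exclude takes precedence" rule in a single place.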
47+
48+
Pattern matching works as follows:
49+
50+
* To match all files with a `.txt` extension no matter what directory, use `**.txt`.
51+
* To match all files with a `.txt` extension _only at the root level of the repository_, use `*.txt`.
52+
* To match all files inside `resources/bin` and below, use `resources/bin/**`.
53+
* To match all files _immediately inside_ `resources/bin`, use `resources/bin/*`.
54+
* To match all files named `Makefile`, use `**Makefile`.
55+
* Matching a directory has no effect; the pattern `resources/bin` will not include/exclude files inside that directory; `resources/bin/**` will.
56+
* All files and patterns are normalized to lower case, so `**Makefile`, `**makefile` and `**MAKEFILE` are equivalent.
57+
58+
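
The pattern rules listed above can be explored with a small, self-contained Go sketch. It does not use `gobwas/glob`; instead it assumes a simplified translation of the two wildcards into a regular expression (`**` crosses `/`, `*` does not), which is enough to reproduce the listed cases. The names `compileGlob` and `matches` are hypothetical, and the real library supports more syntax than this stand-in.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileGlob translates a small glob subset into an anchored regexp.
// "**" matches any characters including "/"; "*" stops at "/".
// Simplified stand-in for illustration, not the gobwas/glob implementation.
func compileGlob(pattern string) *regexp.Regexp {
	var b strings.Builder
	b.WriteString("(?s)^")
	r := []rune(strings.ToLower(pattern)) // patterns are lower-cased
	for i := 0; i < len(r); i++ {
		switch {
		case r[i] == '*' && i+1 < len(r) && r[i+1] == '*':
			b.WriteString(".*")
			i++ // skip the second asterisk
		case r[i] == '*':
			b.WriteString("[^/]*")
		default:
			b.WriteString(regexp.QuoteMeta(string(r[i])))
		}
	}
	b.WriteString("$")
	return regexp.MustCompile(b.String())
}

// matches lower-cases the path before matching, as the doc describes.
func matches(pattern, path string) bool {
	return compileGlob(pattern).MatchString(strings.ToLower(path))
}

func main() {
	cases := []struct{ pattern, path string }{
		{"**.txt", "docs/notes.txt"},
		{"*.txt", "docs/notes.txt"},
		{"*.txt", "notes.txt"},
		{"resources/bin/**", "resources/bin/a/b.o"},
		{"resources/bin/*", "resources/bin/a/b.o"},
		{"resources/bin", "resources/bin/b.o"},
		{"**MAKEFILE", "src/Makefile"},
	}
	for _, c := range cases {
		fmt.Printf("%-18s %-22s %v\n", c.pattern, c.path, matches(c.pattern, c.path))
	}
}
```

Running it prints one line per pattern and path pair, which makes it easy to try a pattern against sample paths before putting it in `app.ini`.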

integrations/api_repo_test.go

+3-3
@@ -70,9 +70,9 @@ func TestAPISearchRepo(t *testing.T) {
7070
expectedResults
7171
}{
7272
{name: "RepositoriesMax50", requestURL: "/api/v1/repos/search?limit=50&private=false", expectedResults: expectedResults{
73-
nil: {count: 21},
74-
user: {count: 21},
75-
user2: {count: 21}},
73+
nil: {count: 22},
74+
user: {count: 22},
75+
user2: {count: 22}},
7676
},
7777
{name: "RepositoriesMax10", requestURL: "/api/v1/repos/search?limit=10&private=false", expectedResults: expectedResults{
7878
nil: {count: 10},
@@ -0,0 +1 @@
1+
ref: refs/heads/master
@@ -0,0 +1,4 @@
1+
[core]
2+
repositoryformatversion = 0
3+
filemode = true
4+
bare = true
@@ -0,0 +1 @@
1+
Unnamed repository; edit this file 'description' to name the repository.
@@ -0,0 +1,15 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to check the commit log message taken by
4+
# applypatch from an e-mail message.
5+
#
6+
# The hook should exit with non-zero status after issuing an
7+
# appropriate message if it wants to stop the commit. The hook is
8+
# allowed to edit the commit message file.
9+
#
10+
# To enable this hook, rename this file to "applypatch-msg".
11+
12+
. git-sh-setup
13+
commitmsg="$(git rev-parse --git-path hooks/commit-msg)"
14+
test -x "$commitmsg" && exec "$commitmsg" ${1+"$@"}
15+
:
@@ -0,0 +1,24 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to check the commit log message.
4+
# Called by "git commit" with one argument, the name of the file
5+
# that has the commit message. The hook should exit with non-zero
6+
# status after issuing an appropriate message if it wants to stop the
7+
# commit. The hook is allowed to edit the commit message file.
8+
#
9+
# To enable this hook, rename this file to "commit-msg".
10+
11+
# Uncomment the below to add a Signed-off-by line to the message.
12+
# Doing this in a hook is a bad idea in general, but the prepare-commit-msg
13+
# hook is more suited to it.
14+
#
15+
# SOB=$(git var GIT_AUTHOR_IDENT | sed -n 's/^\(.*>\).*$/Signed-off-by: \1/p')
16+
# grep -qs "^$SOB" "$1" || echo "$SOB" >> "$1"
17+
18+
# This example catches duplicate Signed-off-by lines.
19+
20+
test "" = "$(grep '^Signed-off-by: ' "$1" |
21+
sort | uniq -c | sed -e '/^[ ]*1[ ]/d')" || {
22+
echo >&2 Duplicate Signed-off-by lines.
23+
exit 1
24+
}
@@ -0,0 +1,114 @@
1+
#!/usr/bin/perl
2+
3+
use strict;
4+
use warnings;
5+
use IPC::Open2;
6+
7+
# An example hook script to integrate Watchman
8+
# (https://facebook.github.io/watchman/) with git to speed up detecting
9+
# new and modified files.
10+
#
11+
# The hook is passed a version (currently 1) and a time in nanoseconds
12+
# formatted as a string and outputs to stdout all files that have been
13+
# modified since the given time. Paths must be relative to the root of
14+
# the working tree and separated by a single NUL.
15+
#
16+
# To enable this hook, rename this file to "query-watchman" and set
17+
# 'git config core.fsmonitor .git/hooks/query-watchman'
18+
#
19+
my ($version, $time) = @ARGV;
20+
21+
# Check the hook interface version
22+
23+
if ($version == 1) {
24+
# convert nanoseconds to seconds
25+
$time = int $time / 1000000000;
26+
} else {
27+
die "Unsupported query-fsmonitor hook version '$version'.\n" .
28+
"Falling back to scanning...\n";
29+
}
30+
31+
my $git_work_tree;
32+
if ($^O =~ 'msys' || $^O =~ 'cygwin') {
33+
$git_work_tree = Win32::GetCwd();
34+
$git_work_tree =~ tr/\\/\//;
35+
} else {
36+
require Cwd;
37+
$git_work_tree = Cwd::cwd();
38+
}
39+
40+
my $retry = 1;
41+
42+
launch_watchman();
43+
44+
sub launch_watchman {
45+
46+
my $pid = open2(\*CHLD_OUT, \*CHLD_IN, 'watchman -j --no-pretty')
47+
or die "open2() failed: $!\n" .
48+
"Falling back to scanning...\n";
49+
50+
# In the query expression below we're asking for names of files that
51+
# changed since $time but were not transient (ie created after
52+
# $time but no longer exist).
53+
#
54+
# To accomplish this, we're using the "since" generator to use the
55+
# recency index to select candidate nodes and "fields" to limit the
56+
# output to file names only. Then we're using the "expression" term to
57+
# further constrain the results.
58+
#
59+
# The category of transient files that we want to ignore will have a
60+
# creation clock (cclock) newer than $time_t value and will also not
61+
# currently exist.
62+
63+
my $query = <<" END";
64+
["query", "$git_work_tree", {
65+
"since": $time,
66+
"fields": ["name"],
67+
"expression": ["not", ["allof", ["since", $time, "cclock"], ["not", "exists"]]]
68+
}]
69+
END
70+
71+
print CHLD_IN $query;
72+
close CHLD_IN;
73+
my $response = do {local $/; <CHLD_OUT>};
74+
75+
die "Watchman: command returned no output.\n" .
76+
"Falling back to scanning...\n" if $response eq "";
77+
die "Watchman: command returned invalid output: $response\n" .
78+
"Falling back to scanning...\n" unless $response =~ /^\{/;
79+
80+
my $json_pkg;
81+
eval {
82+
require JSON::XS;
83+
$json_pkg = "JSON::XS";
84+
1;
85+
} or do {
86+
require JSON::PP;
87+
$json_pkg = "JSON::PP";
88+
};
89+
90+
my $o = $json_pkg->new->utf8->decode($response);
91+
92+
if ($retry > 0 and $o->{error} and $o->{error} =~ m/unable to resolve root .* directory (.*) is not watched/) {
93+
print STDERR "Adding '$git_work_tree' to watchman's watch list.\n";
94+
$retry--;
95+
qx/watchman watch "$git_work_tree"/;
96+
die "Failed to make watchman watch '$git_work_tree'.\n" .
97+
"Falling back to scanning...\n" if $? != 0;
98+
99+
# Watchman will always return all files on the first query so
100+
# return the fast "everything is dirty" flag to git and do the
101+
# Watchman query just to get it over with now so we won't pay
102+
# the cost in git to look up each individual file.
103+
print "/\0";
104+
eval { launch_watchman() };
105+
exit 0;
106+
}
107+
108+
die "Watchman: $o->{error}.\n" .
109+
"Falling back to scanning...\n" if $o->{error};
110+
111+
binmode STDOUT, ":utf8";
112+
local $, = "\0";
113+
print @{$o->{files}};
114+
}
@@ -0,0 +1,8 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to prepare a packed repository for use over
4+
# dumb transports.
5+
#
6+
# To enable this hook, rename this file to "post-update".
7+
8+
exec git update-server-info
@@ -0,0 +1,14 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to verify what is about to be committed
4+
# by applypatch from an e-mail message.
5+
#
6+
# The hook should exit with non-zero status after issuing an
7+
# appropriate message if it wants to stop the commit.
8+
#
9+
# To enable this hook, rename this file to "pre-applypatch".
10+
11+
. git-sh-setup
12+
precommit="$(git rev-parse --git-path hooks/pre-commit)"
13+
test -x "$precommit" && exec "$precommit" ${1+"$@"}
14+
:
@@ -0,0 +1,49 @@
1+
#!/bin/sh
2+
#
3+
# An example hook script to verify what is about to be committed.
4+
# Called by "git commit" with no arguments. The hook should
5+
# exit with non-zero status after issuing an appropriate message if
6+
# it wants to stop the commit.
7+
#
8+
# To enable this hook, rename this file to "pre-commit".
9+
10+
if git rev-parse --verify HEAD >/dev/null 2>&1
11+
then
12+
against=HEAD
13+
else
14+
# Initial commit: diff against an empty tree object
15+
against=$(git hash-object -t tree /dev/null)
16+
fi
17+
18+
# If you want to allow non-ASCII filenames set this variable to true.
19+
allownonascii=$(git config --bool hooks.allownonascii)
20+
21+
# Redirect output to stderr.
22+
exec 1>&2
23+
24+
# Cross platform projects tend to avoid non-ASCII filenames; prevent
25+
# them from being added to the repository. We exploit the fact that the
26+
# printable range starts at the space character and ends with tilde.
27+
if [ "$allownonascii" != "true" ] &&
28+
# Note that the use of brackets around a tr range is ok here, (it's
29+
# even required, for portability to Solaris 10's /usr/bin/tr), since
30+
# the square bracket bytes happen to fall in the designated range.
31+
test $(git diff --cached --name-only --diff-filter=A -z $against |
32+
LC_ALL=C tr -d '[ -~]\0' | wc -c) != 0
33+
then
34+
cat <<\EOF
35+
Error: Attempt to add a non-ASCII file name.
36+
37+
This can cause problems if you want to work with people on other platforms.
38+
39+
To be portable it is advisable to rename the file.
40+
41+
If you know what you are doing you can disable this check using:
42+
43+
git config hooks.allownonascii true
44+
EOF
45+
exit 1
46+
fi
47+
48+
# If there are whitespace errors, print the offending file names and fail.
49+
exec git diff-index --check --cached $against --
@@ -0,0 +1,53 @@
1+
#!/bin/sh
2+
3+
# An example hook script to verify what is about to be pushed. Called by "git
4+
# push" after it has checked the remote status, but before anything has been
5+
# pushed. If this script exits with a non-zero status nothing will be pushed.
6+
#
7+
# This hook is called with the following parameters:
8+
#
9+
# $1 -- Name of the remote to which the push is being done
10+
# $2 -- URL to which the push is being done
11+
#
12+
# If pushing without using a named remote those arguments will be equal.
13+
#
14+
# Information about the commits which are being pushed is supplied as lines to
15+
# the standard input in the form:
16+
#
17+
# <local ref> <local sha1> <remote ref> <remote sha1>
18+
#
19+
# This sample shows how to prevent push of commits where the log message starts
20+
# with "WIP" (work in progress).
21+
22+
remote="$1"
23+
url="$2"
24+
25+
z40=0000000000000000000000000000000000000000
26+
27+
while read local_ref local_sha remote_ref remote_sha
28+
do
29+
if [ "$local_sha" = $z40 ]
30+
then
31+
# Handle delete
32+
:
33+
else
34+
if [ "$remote_sha" = $z40 ]
35+
then
36+
# New branch, examine all commits
37+
range="$local_sha"
38+
else
39+
# Update to existing branch, examine new commits
40+
range="$remote_sha..$local_sha"
41+
fi
42+
43+
# Check for WIP commit
44+
commit=`git rev-list -n 1 --grep '^WIP' "$range"`
45+
if [ -n "$commit" ]
46+
then
47+
echo >&2 "Found WIP commit in $local_ref, not pushing"
48+
exit 1
49+
fi
50+
fi
51+
done
52+
53+
exit 0
