SP-2169: GitHub scraping at scale #36

Merged: 182 commits, May 27, 2025

Commits
741f217
rewrite of GitHub scraper; note dependency on one function in old Git…
ameisner May 12, 2025
395b889
linting
ameisner May 13, 2025
c4affcf
linting
ameisner May 13, 2025
6b1cc1b
linting
ameisner May 13, 2025
23d82d5
linting
ameisner May 13, 2025
e7d5ff2
linting
ameisner May 13, 2025
353be36
linting
ameisner May 13, 2025
eaf274e
linting
ameisner May 13, 2025
5e65da2
linting
ameisner May 13, 2025
d2fd001
linting
ameisner May 13, 2025
931350d
linting
ameisner May 13, 2025
0a6c38c
linting
ameisner May 13, 2025
6c40f63
linting
ameisner May 13, 2025
5518e98
linting
ameisner May 13, 2025
6ba7c41
linting
ameisner May 13, 2025
1826eea
linting
ameisner May 13, 2025
e231cde
linting
ameisner May 13, 2025
75acfd1
add minimal docstrings to each function
ameisner May 13, 2025
1d5aa1f
add file header including docstring
ameisner May 13, 2025
6b6d890
type annotation
ameisner May 13, 2025
b76f2ef
type annotations
ameisner May 13, 2025
3d6f59c
linting
ameisner May 13, 2025
fc6c1d8
linting
ameisner May 13, 2025
f45fe1f
fix import
ameisner May 13, 2025
9b6bef3
linting
ameisner May 13, 2025
4fe0196
linting
ameisner May 13, 2025
43e5d58
linting
ameisner May 13, 2025
f754c44
linting
ameisner May 13, 2025
1d73c85
linting
ameisner May 13, 2025
25123eb
linting
ameisner May 13, 2025
1dbbcf0
fix bug
ameisner May 13, 2025
77fc235
linting
ameisner May 13, 2025
2b06874
linting
ameisner May 13, 2025
eba5c26
type annotation
ameisner May 13, 2025
cbd3d55
Path instead of os
ameisner May 13, 2025
ace7e24
Path instead of os
ameisner May 13, 2025
9056890
linting
ameisner May 13, 2025
474563d
remove unused import
ameisner May 13, 2025
e0b5ecf
linting
ameisner May 13, 2025
6e544dd
replace a print statement
ameisner May 13, 2025
926aaac
replace a print statement
ameisner May 13, 2025
9418c83
replace a print statement
ameisner May 13, 2025
7a100b7
replace a print statement
ameisner May 13, 2025
7524419
fix bug
ameisner May 13, 2025
194154e
linting
ameisner May 13, 2025
7a0e156
linting
ameisner May 13, 2025
3ed2e59
linting
ameisner May 13, 2025
713d710
linting
ameisner May 13, 2025
d6b4a63
Path.open argument
ameisner May 13, 2025
8d6c16c
trying to open file with Path
ameisner May 13, 2025
7f44cd1
trying to open a file with Path
ameisner May 13, 2025
70d4de1
fix a logging info printout
ameisner May 14, 2025
d0005c7
fix a logging printout
ameisner May 14, 2025
afec59c
move utility function
ameisner May 14, 2025
efbb8e1
requests needed for utility function
ameisner May 14, 2025
860d976
move requests import for linter
ameisner May 14, 2025
a85d675
GitHub API token
ameisner May 15, 2025
f645806
linting
ameisner May 15, 2025
2fb19e4
rename functions
ameisner May 15, 2025
218ba80
consistent verbiage
ameisner May 15, 2025
13ebdf2
delete old/slow version of GitHub scraper
May 15, 2025
0f22bae
rename new GitHub scraper to match old GitHub scraper's name
May 15, 2025
5463049
fill in a docstring
ameisner May 15, 2025
3a679d3
linting
ameisner May 15, 2025
8439ead
fill in a docstring
ameisner May 15, 2025
6b64a3a
fill in a docstring
ameisner May 15, 2025
670fa34
fill in a docstring
ameisner May 15, 2025
3003cc8
fill in a docstring
ameisner May 15, 2025
51bb56a
remove duplicated org
ameisner May 17, 2025
39acf6e
load org list from YAML
May 17, 2025
253f3cd
linting
ameisner May 17, 2025
02f6899
linting
ameisner May 17, 2025
b91ec89
linting
ameisner May 17, 2025
88b3e62
fill in a docstring
ameisner May 17, 2025
786e0a8
start integrating Connor's code
ameisner May 18, 2025
e69a1e7
clean-up
ameisner May 18, 2025
8ec7c9d
port in Connor's version of scrape_repo
ameisner May 18, 2025
bd8a367
linting
ameisner May 18, 2025
0aeab2c
linting
ameisner May 18, 2025
d5b484a
further clean-up
ameisner May 18, 2025
bc8b65f
linting
ameisner May 18, 2025
ba8d39a
try opening a pickle file differently
ameisner May 18, 2025
676cf61
open pickle file with Path.open
ameisner May 18, 2025
038faee
type hint per CI failure
ameisner May 18, 2025
2837be5
update output folder name and scrape_repo docstring
ameisner May 18, 2025
6b182a4
remove duplicate definition of output_dir
ameisner May 18, 2025
5e2f2ba
change top-level output dir name
ameisner May 18, 2025
ecc0a2f
propagate max_mb from scrape_org to scrape_repo
ameisner May 18, 2025
74f7236
add load_yaml_spec utility function
ameisner May 18, 2025
9063b57
linting
ameisner May 18, 2025
4bb5bf5
linting
ameisner May 18, 2025
5efb23e
use new YAML spec loading utility function
ameisner May 18, 2025
8c774b5
add exploratory util function for retrieving multi-org repo list
ameisner May 18, 2025
3481222
linting
ameisner May 18, 2025
4e5dd48
linting
ameisner May 18, 2025
b01661e
linting
ameisner May 18, 2025
c0ce6ef
fill in a docstring
ameisner May 18, 2025
72d8ab8
check that directory to be deleted is subdir of cwd
ameisner May 18, 2025
c6dbf92
fix bug
ameisner May 18, 2025
39c4ea8
explicitly reject .fits files
ameisner May 18, 2025
b46e486
linting
ameisner May 18, 2025
e3ca961
reject repos with "data" in their names
ameisner May 18, 2025
de8506e
linting
ameisner May 18, 2025
1762587
ignore repos with "dustmaps" in their names
ameisner May 18, 2025
9821f07
linting
ameisner May 18, 2025
dc8427b
ignore repos with "gen2" in their names
ameisner May 18, 2025
7d88066
ignore files with "gen2" in their names
ameisner May 19, 2025
ee59f4c
ignore files in subdirectories with names containing "data"
ameisner May 19, 2025
16e423f
ignore repos not updated since the start of calendar 2021
ameisner May 19, 2025
6f3942d
skip .eps files
ameisner May 19, 2025
2b413b7
ignore images/ and figures/ directories
ameisner May 19, 2025
0c2c23c
skip .tar files
ameisner May 19, 2025
1aed66a
add more metadata to LangChain docs
ameisner May 19, 2025
7bb9b44
linting
ameisner May 19, 2025
52e0deb
linting
ameisner May 19, 2025
f58c394
linting
ameisner May 19, 2025
018aaf2
ignore .zip files
ameisner May 19, 2025
236c5b9
add utility function to get Git last modified timestamp of file
ameisner May 19, 2025
19d278f
remove a comment
ameisner May 19, 2025
162bb1f
linting
ameisner May 19, 2025
3cb897a
linting
ameisner May 19, 2025
46c0576
ignore .out files; LSSTDESC has some very large .out files
ameisner May 19, 2025
714c452
skip logs/ subdirs
ameisner May 19, 2025
7400c5c
add utility to check for possible data dump files
ameisner May 19, 2025
3e44d5c
linting
ameisner May 19, 2025
2a847a9
linting
ameisner May 19, 2025
1d50492
add return type annotation
ameisner May 19, 2025
29341ca
attempt to remove data dump e.g., json, txt, dat files
ameisner May 19, 2025
82d9b71
linting
ameisner May 19, 2025
c8d24ca
fill in a docstring
ameisner May 19, 2025
162cfee
special handling of .ipynb files to avoid scraping large outputs/plots
ameisner May 19, 2025
52ed6fa
linting
ameisner May 19, 2025
4de1bd8
linting
ameisner May 19, 2025
cda1c19
skip .SIMLIB files (see e.g., sn_lc2cosmo_tutorials in LSSTDESC)
ameisner May 19, 2025
10b064b
parcel out LangChain loader selection to its own function
ameisner May 19, 2025
3501955
linting
ameisner May 19, 2025
4445a29
use HTML loader that avoids reading in large text-encoded images
ameisner May 19, 2025
20d862f
fix syntax error
ameisner May 19, 2025
731a9d1
linting
ameisner May 19, 2025
1b6353a
linting
ameisner May 19, 2025
a7a50c1
fix return type annotation
ameisner May 19, 2025
f3f36c6
linting
ameisner May 19, 2025
d87de1d
fill in a docstring
ameisner May 19, 2025
fcd314c
ignore .pd files
ameisner May 19, 2025
1c32d4c
skip .pkl files
ameisner May 19, 2025
4dff57c
ignore .pickle files
ameisner May 19, 2025
d136667
skip .dax files
ameisner May 19, 2025
c0cbcdc
skip .svg files
ameisner May 19, 2025
d15c666
add .log to list of extensions checked for being data dumps
ameisner May 19, 2025
1ece998
add .sql to list of extensions checked for being data dumps
ameisner May 19, 2025
08767f8
skip .lvproj files
ameisner May 19, 2025
81d04d6
add .yaml to list of extensions checked for being data dumps
ameisner May 19, 2025
823efc3
linting
ameisner May 19, 2025
c73d227
add .cfg extension to list checked for being a data dump
ameisner May 19, 2025
8fcea22
skip .lvbitx files
ameisner May 19, 2025
66d442c
linting
ameisner May 19, 2025
4fe3dac
skip .trim files
ameisner May 19, 2025
5f38b24
remove archived repos
ameisner May 19, 2025
180f422
skip repos with "legacy" in their names
ameisner May 19, 2025
32c1000
limit git history in clone command (Connor's idea/code)
ameisner May 19, 2025
2b52f62
linting
ameisner May 19, 2025
27901d5
add main
ameisner May 19, 2025
05edd51
add .tbl to list of extensions to check for data dumps
ameisner May 20, 2025
0fef4d7
linting
ameisner May 20, 2025
3a3dd12
skip .tsbuildinfo files
ameisner May 20, 2025
60a10ab
add mechanism to ignore specified repos within org
ameisner May 20, 2025
d1edee6
fix bug
ameisner May 20, 2025
ee7707b
linting
ameisner May 20, 2025
54f985f
linting
ameisner May 20, 2025
b9336d4
linting
ameisner May 20, 2025
9e950cf
fix operand issue
ameisner May 20, 2025
ee8c71c
get list of repos to ignore from YAML spec
ameisner May 20, 2025
6b7caae
linting
ameisner May 20, 2025
5058117
add new parameter to docstring
ameisner May 20, 2025
030fe22
updated YAML spec listing particular repos to ignore
ameisner May 20, 2025
1a9e59f
update YAML dictionary key for repos to ignore
ameisner May 20, 2025
0cc8fb1
attempt to implement suggestion from Connor
ameisner May 23, 2025
faa0a82
linting
ameisner May 23, 2025
6bf4084
linting
ameisner May 23, 2025
8d74ed0
use different version of deleting a dir suggested by Connor
ameisner May 23, 2025
602be6d
delete a logging statement
ameisner May 23, 2025
2b07f66
ignore repos with "test" in their names
ameisner May 23, 2025
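
The filtering rules accumulated across the commits above (skipped extensions, repo-name substrings, excluded subdirectories, the 2021 cutoff, archived repos) can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code from this PR: the names `keep_repo` and `keep_file`, the constants, and the case-insensitive matching are all hypothetical, and the size-based "data dump" checks mentioned in the commits are left out.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Extensions named in the "skip"/"ignore" commits (lowercased; case handling is an assumption).
SKIP_EXTENSIONS = {
    ".fits", ".eps", ".tar", ".zip", ".out", ".simlib", ".pd", ".pkl",
    ".pickle", ".dax", ".svg", ".lvproj", ".lvbitx", ".trim", ".tsbuildinfo",
}
# Substrings that cause a whole repo to be skipped, per the commit messages.
SKIP_REPO_SUBSTRINGS = ("data", "dustmaps", "gen2", "legacy", "test")
# Subdirectory names that are skipped outright.
SKIP_DIR_NAMES = {"images", "figures", "logs"}
# Repos not updated since the start of calendar 2021 are ignored.
MIN_UPDATED = datetime(2021, 1, 1, tzinfo=timezone.utc)


def keep_repo(name: str, updated_at: datetime, archived: bool = False) -> bool:
    """Return True if a repo passes the name, age, and archive filters (hypothetical helper)."""
    if archived or updated_at < MIN_UPDATED:
        return False
    lowered = name.lower()
    return not any(sub in lowered for sub in SKIP_REPO_SUBSTRINGS)


def keep_file(rel_path: str) -> bool:
    """Return True if a file, given relative to the repo root, passes the filters (hypothetical helper)."""
    path = PurePosixPath(rel_path)
    if path.suffix.lower() in SKIP_EXTENSIONS:
        return False
    if "gen2" in path.name.lower():
        return False
    # Check every directory component above the file itself.
    for part in path.parts[:-1]:
        lowered = part.lower()
        if lowered in SKIP_DIR_NAMES or "data" in lowered:
            return False
    return True
```

Keeping the rules in module-level constants means that a new exclusion is a one-line change, which fits the one-rule-per-commit pattern in the history above.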
10 changes: 7 additions & 3 deletions data/github_sources.yaml
@@ -4,6 +4,9 @@ url: "https://github.com"
# LSST organizations
organization:
  - name: "lsst"
    ignore_repos:
      - "community-operators"
      - "versiondb"
  - name: "lsst-it"
  - name: "lsst-dm"
  - name: "lsst-dmsst"
@@ -17,13 +20,14 @@ organization:
  - name: "rubin-dp0"
  - name: "lsst-sims"
  - name: "lsst-epo"
  - name: "lsst-camera-dh"
    repos:
      - repo: "eo-pipe"

  # Science collaborations
  - name: "LSSTDESC"
    ignore_repos:
      - "PC5AtmosphericExtinction"
  - name: "LSST-strong-lensing"
  - name: "LSST-TVSSC"
    ignore_repos:
      - "LSST-TVSSC.github.io"
  - name: "LSST-SSSC"
  - name: "LSSTScienceCollaborations"
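
As a rough illustration of how the per-organization `ignore_repos` lists in this spec might be consumed, the sketch below builds a lookup from organization name to the set of repos to skip. It assumes the YAML has already been parsed into a plain Python dictionary (for example by a YAML loader such as the PR's `load_yaml_spec`, whose signature is not shown here). The function name `repos_to_ignore` is hypothetical, and the example dictionary copies only a few entries from the diff above.

```python
def repos_to_ignore(spec: dict) -> dict[str, set[str]]:
    """Map each organization name to its set of ignored repo names (hypothetical helper)."""
    ignored: dict[str, set[str]] = {}
    for org in spec.get("organization", []):
        # Orgs without an ignore_repos key (or with an empty one) get an empty set.
        ignored[org["name"]] = set(org.get("ignore_repos") or [])
    return ignored


# A parsed spec mirroring part of data/github_sources.yaml as shown in the diff.
example_spec = {
    "url": "https://github.com",
    "organization": [
        {"name": "lsst", "ignore_repos": ["community-operators", "versiondb"]},
        {"name": "lsst-it"},
        {"name": "LSSTDESC", "ignore_repos": ["PC5AtmosphericExtinction"]},
    ],
}

ignore_map = repos_to_ignore(example_spec)
```

With this mapping, a scraper loop can test `repo_name in ignore_map[org_name]` before cloning, and organizations with no exclusions need no special handling.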