Overview
GitHub Collector is a command-line tool for collecting GitHub repository data: issues, pull requests, releases, and collaborators. It outputs structured JSON files suitable for analysis in Python, R, or any data processing tool. It is designed for researchers who need to mine software repositories without dealing with GitHub API pagination, rate limits, or JSON parsing.
Quick Start
No Java installation required. JBang handles everything.
1. Install JBang
- SDKMAN (Linux/macOS)
- Homebrew (macOS)
- Windows
- Direct Install
2. Set your GitHub token
Create a personal access token with the repo scope.
3. Collect data
Output is written to issues/raw/open/spring-projects/spring-ai/:
Common Use Cases
Collect all closed issues
Collect merged PRs
Collect issues from a date range
Collect releases
Collect repository collaborators
Verify and deduplicate batch files
If you collected data in multiple runs, duplicates may exist across batches:
Output Format
Each batch file contains a metadata header and an array of items:
Loading in Python
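As a rough illustration of reading the batch files described above, the following sketch loads every JSON file in a directory and merges the item arrays, skipping items it has already seen. It is not taken from GitHub Collector itself: load_items is a hypothetical helper, and the key names "items" and "number" are assumptions about the batch layout, so check one of your own batch files and adjust them as needed.

```python
import json
from pathlib import Path


def load_items(batch_dir, items_key="items", id_key="number"):
    """Load every *.json batch file in batch_dir and return the merged items.

    items_key and id_key are ASSUMED names, not confirmed by the tool's docs.
    Items that repeat an already-seen id_key value are skipped.
    """
    seen = set()
    items = []
    # Sort the paths so the result is stable from run to run.
    for path in sorted(Path(batch_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            batch = json.load(fh)
        for item in batch.get(items_key, []):
            key = item.get(id_key)
            # Items without an identifier cannot be compared, so keep them all.
            if key is not None and key in seen:
                continue
            seen.add(key)
            items.append(item)
    return items
```

For example, load_items("issues/raw/open/spring-projects/spring-ai") would return a single list of dictionaries, which can then be passed to pandas.json_normalize if a DataFrame is preferred.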
CLI Reference
Rate Limits
The GitHub API allows 5,000 requests per hour for authenticated users. The collector handles rate limiting automatically: it pauses when the limit is reached and resumes when it resets. For large repositories (10,000+ issues), consider:
- Using date ranges to split collection across multiple runs
- Running overnight when you won’t need the token for other work
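The date-range suggestion above can be planned with a few lines of Python. This is a minimal sketch, not part of GitHub Collector: split_date_range is a hypothetical helper that only computes the windows, and the 30-day window size is an arbitrary choice. How each window is passed to the collector depends on its date-range options, which are not shown here.

```python
from datetime import date, timedelta


def split_date_range(start, end, days_per_window):
    """Return consecutive (window_start, window_end) pairs covering start..end.

    Both bounds are inclusive and the windows do not overlap, so an item
    created on any given day falls into exactly one window.
    """
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    if end < start:
        raise ValueError("end must not be before start")
    windows = []
    window_start = start
    while window_start <= end:
        # The last window is clipped so it never runs past the end date.
        window_end = min(window_start + timedelta(days=days_per_window - 1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    # Print one line per planned run for the first quarter of 2024.
    for window_start, window_end in split_date_range(date(2024, 1, 1), date(2024, 3, 31), 30):
        print(window_start.isoformat(), window_end.isoformat())
```

Smaller windows mean more runs, but each run is shorter and easier to repeat if it is interrupted.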