GitHub
Overview
GitHub Collector is a command-line tool for collecting GitHub repository data — issues, pull requests, releases, and collaborators. It outputs structured JSON files suitable for analysis in Python, R, or any data processing tool.
It is designed for researchers who need to mine software repositories without handling GitHub API pagination, rate limits, or JSON parsing themselves.
Quick Start
No Java installation required. JBang handles everything.
1. Install JBang
SDKMAN (Linux/macOS)
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install jbang
Homebrew (macOS)
brew install jbangdev/tap/jbang
Windows
iex "& { $(iwr -useb https://ps.jbang.dev) } app setup"
Direct Install
curl -Ls https://sh.jbang.dev | bash -s - app setup
2. Set your GitHub token
Create a personal access token with repo scope.
export GITHUB_TOKEN=ghp_your_token_here
3. Collect data
jbang collect@spring-ai-community/github-collector --repo spring-projects/spring-ai --type issues
Output appears in issues/raw/open/spring-projects/spring-ai/:
batch_001_issues.json
batch_002_issues.json
...
Common Use Cases
Collect all closed issues
jbang collect@spring-ai-community/github-collector \
--repo owner/repo \
--type issues \
--state closed
Collect merged PRs
jbang collect@spring-ai-community/github-collector \
--repo owner/repo \
--type prs \
--pr-state merged
Collect issues from a date range
jbang collect@spring-ai-community/github-collector \
--repo owner/repo \
--type issues \
--state all \
--created-after 2024-01-01 \
--created-before 2025-01-01
Collect releases
jbang collect@spring-ai-community/github-collector \
--repo owner/repo \
--type releases
Collect repository collaborators
jbang collect@spring-ai-community/github-collector \
--repo owner/repo \
--type collaborators
Verify and deduplicate batch files
If you collected data in multiple runs, duplicates may exist across batches:
# Check for problems
jbang collect@spring-ai-community/github-collector \
--type issues \
--verify \
--verify-dir path/to/batch/files
# Fix duplicates automatically
jbang collect@spring-ai-community/github-collector \
--type issues \
--deduplicate \
--verify-dir path/to/batch/files
Each batch file contains a metadata header and an array of items:
{
  "metadata": {
    "batch_index": 1,
    "item_count": 100,
    "collection_type": "issues",
    "repository": "owner/repo",
    "state": "open",
    "timestamp": "2026-02-04T15:30:45"
  },
  "issues": [
    {
      "number": 123,
      "title": "Bug in feature X",
      "body": "Description...",
      "state": "OPEN",
      "created_at": "2025-06-15T10:30:00",
      "updated_at": "2025-06-16T14:22:00",
      "closed_at": null,
      "url": "https://github.com/owner/repo/issues/123",
      "author": { "login": "username" },
      "labels": [
        { "name": "bug", "color": "d73a4a" }
      ],
      "assignees": [],
      "comments_count": 5,
      "label_events": [
        {
          "event": "labeled",
          "label": "bug",
          "actor": "maintainer",
          "created_at": "2025-06-15T11:00:00"
        }
      ]
    }
  ]
}
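As a quick sanity check on the structure above, the sketch below compares the `item_count` declared in the metadata with the number of items actually present. It assumes the item array is stored under the key named by `collection_type`, as in the issues example; `check_batch` is a hypothetical helper, not part of the collector.

```python
import json


def check_batch(path):
    """Return a list of human-readable problems found in one batch file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    metadata = data.get("metadata")
    if not isinstance(metadata, dict):
        return ["missing metadata object"]

    problems = []
    # Assumption: the item array lives under the collection_type key,
    # e.g. "issues" in the example above.
    key = metadata.get("collection_type")
    items = data.get(key)
    if not isinstance(items, list):
        problems.append(f"no item array under key {key!r}")
    else:
        declared = metadata.get("item_count")
        if declared != len(items):
            problems.append(
                f"item_count is {declared} but the file holds {len(items)} items"
            )
    return problems
```

An empty list means the file is internally consistent. This only inspects one file and does not replace the collector's own `--verify` mode.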
Loading in Python
import json
from pathlib import Path

def load_batches(directory):
    """Load all batch files from a directory."""
    items = []
    for batch_file in sorted(Path(directory).glob("batch_*.json")):
        with open(batch_file) as f:
            data = json.load(f)
        # Get the collection type (issues, prs, releases, etc.)
        for key in data:
            if key != "metadata" and isinstance(data[key], list):
                items.extend(data[key])
    return items

issues = load_batches("issues/raw/closed/spring-projects/spring-ai")
print(f"Loaded {len(issues)} issues")
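Once the items are in memory, simple summaries follow directly from the fields shown in the batch format. The sketch below counts label usage and computes time to close. It assumes timestamps have the form shown in the example (`2025-06-15T10:30:00`, no timezone suffix); both function names are illustrative.

```python
from collections import Counter
from datetime import datetime


def label_counts(issues):
    """Count how often each label name appears across a list of issues."""
    counts = Counter()
    for issue in issues:
        for label in issue.get("labels") or []:
            counts[label["name"]] += 1
    return counts


def days_to_close(issue):
    """Return days between created_at and closed_at, or None if not closed."""
    if not issue.get("closed_at"):
        return None
    # Assumption: timestamps look like "2025-06-15T10:30:00".
    created = datetime.fromisoformat(issue["created_at"])
    closed = datetime.fromisoformat(issue["closed_at"])
    return (closed - created).total_seconds() / 86400
```

For example, `label_counts(issues).most_common(10)` lists the ten most frequent labels, and filtering out `None` from `days_to_close` results leaves only closed issues.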
CLI Reference
Usage: jbang collect@spring-ai-community/github-collector [OPTIONS]

REQUIRED:
  -r, --repo REPO          Repository in format owner/repo

COLLECTION TYPE:
  -t, --type TYPE          issues, prs, releases, collaborators (default: issues)
  -s, --state STATE        Issue state: open, closed, all (default: open)
  --pr-state STATE         PR state: open, closed, merged, all (default: open)

DATE FILTERING:
  --created-after DATE     Items created on or after DATE (YYYY-MM-DD)
  --created-before DATE    Items created before DATE (YYYY-MM-DD)

OUTPUT:
  -b, --batch-size SIZE    Items per batch file (default: 100)
  --single-file            Output all items to one file instead of batches
  -o, --output FILE        Output file path (with --single-file)

VERIFICATION:
  --verify                 Check batch files for duplicates and issues
  --deduplicate            Remove duplicates from batch files
  --verify-dir DIR         Directory containing batch files to verify

OTHER:
  -d, --dry-run            Show what would be collected without doing it
  -v, --verbose            Enable verbose logging
  -h, --help               Show help message
Rate Limits
The GitHub API allows 5,000 requests per hour for authenticated users. The collector handles rate limiting automatically: it pauses when the limit is reached and resumes once the limit resets.
For large repositories (10,000+ issues), consider:
- Using date ranges to split collection across multiple runs
- Running overnight when you won’t need the token for other work
Troubleshooting
“GITHUB_TOKEN environment variable is required”
Set your token:
export GITHUB_TOKEN=ghp_your_token_here
“API rate limit exceeded”
The collector waits automatically. For immediate needs, check your rate limit status (requires authentication).
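The rate limit status can be read from GitHub's `/rate_limit` REST endpoint. The sketch below uses only the Python standard library and the same `GITHUB_TOKEN`. Which bucket the collector draws from is not stated here, so `summarize` takes the resource name as a parameter (`core` for the REST API, `graphql` for the GraphQL API). The function names are illustrative.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_rate_limit_request(token):
    """Build an authenticated request for GitHub's rate limit endpoint."""
    return urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def summarize(payload, resource="core"):
    """Return (remaining, limit, reset time in UTC) for one resource bucket."""
    bucket = payload["resources"][resource]
    reset = datetime.fromtimestamp(bucket["reset"], tz=timezone.utc)
    return bucket["remaining"], bucket["limit"], reset


def fetch_rate_limit(token, resource="core"):
    """Call the endpoint and summarize the chosen bucket (needs network access)."""
    with urllib.request.urlopen(build_rate_limit_request(token)) as response:
        return summarize(json.load(response), resource)
```

Calling `fetch_rate_limit(os.environ["GITHUB_TOKEN"])` (after `import os`) returns the remaining request count, the hourly limit, and the time at which the limit resets.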
JBang not found after install
Restart your terminal or run:
source ~/.bashrc # or ~/.zshrc
Duplicates in collected data
Run verification and deduplication:
jbang collect@spring-ai-community/github-collector \
--type issues \
--deduplicate \
--verify-dir path/to/your/data
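To see which items are duplicated before any files are rewritten, a read-only check along these lines can help. It is an independent sketch, not the collector's own deduplication logic. It assumes every item carries a `number` field, as issues do in the batch format; other collection types may need a different `id_field`.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_duplicates(directory, id_field="number"):
    """Map each repeated id to the names of the batch files containing it."""
    seen = defaultdict(list)
    for batch_file in sorted(Path(directory).glob("batch_*.json")):
        with open(batch_file, encoding="utf-8") as f:
            data = json.load(f)
        for key, value in data.items():
            # Skip the metadata header; only item arrays are inspected.
            if key == "metadata" or not isinstance(value, list):
                continue
            for item in value:
                # Assumption: each item has the id_field (e.g. "number").
                seen[item[id_field]].append(batch_file.name)
    return {item_id: files for item_id, files in seen.items() if len(files) > 1}
```

An empty result means no id appears more than once across the batch files in that directory.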