1. Requirements Gathering (Interview)
Before analysis, ask the user 5-10 questions covering:
- Research objective - Exploratory vs. confirmatory analysis
- Grouping variables - Categorical variables to stratify by
- Continuous variables - Metrics to calculate descriptives for
- Text fields requiring extraction - Columns with embedded entities
- Temporal variable - Date/time column and desired granularity
- Classification schemes - Any custom tier/category definitions
- Output preferences - CSV tables, MD summary, visualizations
2. Data Preparation
Create derived variables as needed:
```python
# Tier classification (customize thresholds)
def classify_tier(value, tiers):
    for tier_name, (min_val, max_val) in tiers.items():
        if min_val <= value <= max_val:
            return tier_name
    return 'Other'

# Example tier structure
TIERS = {
    'Small': (0, 1000),
    'Medium': (1001, 10000),
    'Large': (10001, float('inf'))
}

# Temporal grouping (assumes 'date_col' is already a datetime column)
df['month'] = df['date_col'].dt.to_period('M').astype(str)
```
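Hypothetical usage, assuming the metric lives in a numeric column named `value_col`:

```python
# 'value_col' is a placeholder; swap in the metric being tiered
df['tier'] = df['value_col'].apply(lambda v: classify_tier(v, TIERS))
```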
3. Analysis Structure
Generate tables in this order:
| # | File Pattern | Contents |
|---|--------------|----------|
| 01 | sample_overview.csv | N, date range, unique counts |
| 02-07 | {groupvar}_distribution.csv | Frequency for each grouping variable |
| 08 | continuous_overall.csv | Mean, SD, Median, Min, Max |
| 08a-f | continuous_by_{groupvar}.csv | Descriptives stratified by group |
| 09-10 | categorical_distribution.csv | Key categorical variables |
| 11-15 | entity_{fieldname}.csv | Extracted entity frequencies |
| 16 | temporal_trends.csv | Metrics over time |
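A minimal sketch of this numbering convention for the distribution files (02-07), assuming `df` is loaded; the grouping variable names are placeholders:

```python
group_vars = ['channel', 'category']  # placeholders for the user's grouping variables

for i, gv in enumerate(group_vars, start=2):
    dist = df[gv].value_counts().rename_axis(gv).reset_index(name='n')
    dist['pct'] = (dist['n'] / dist['n'].sum() * 100).round(1)
    dist.to_csv(f'TABLE/{i:02d}_{gv}_distribution.csv', index=False)
```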
4. Descriptive Statistics Function
```python
import pandas as pd

def descriptive_stats(series, name='Variable'):
    """Return a dict of summary statistics for one numeric series."""
    return {
        'Variable': name,
        'N': series.count(),
        'Mean': series.mean(),
        'SD': series.std(),
        'Min': series.min(),
        'Q1': series.quantile(0.25),
        'Median': series.median(),
        'Q3': series.quantile(0.75),
        'Max': series.max()
    }

def grouped_descriptives(df, var, group_var, group_col_name):
    """Compute descriptive_stats for `var` within each level of `group_var`."""
    results = []
    for group in df[group_var].dropna().unique():
        group_data = df[df[group_var] == group][var].dropna()
        if len(group_data) > 0:
            stats = descriptive_stats(group_data, var)
            stats[group_col_name] = group
            results.append(stats)
    return pd.DataFrame(results)
```
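Hypothetical usage, with `views` as the continuous variable and `channel` as a grouping variable (swap in the actual columns):

```python
overall = pd.DataFrame([descriptive_stats(df['views'], 'views')])
overall.to_csv('TABLE/08_continuous_overall.csv', index=False)

by_channel = grouped_descriptives(df, 'views', 'channel', 'Channel')
by_channel.to_csv('TABLE/08a_continuous_by_channel.csv', index=False)
```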
5. Entity Extraction
For text fields with embedded entities (timestamps, names, etc.):
```python
import re
import pandas as pd

def extract_entities(text):
    """Extract entities from bracketed text like '[00:01:23] entity_name'."""
    if pd.isna(text) or text == '':
        return []
    entities = []
    # One entity per '[timestamp] name' segment; ';' and brackets end a segment
    pattern = r'\[[\d:]+\]\s*([^;\[\]]+)'
    matches = re.findall(pattern, str(text))
    for match in matches:
        entity = match.strip().lower()  # normalize case and whitespace
        if entity and len(entity) > 1:  # drop empty/one-character noise
            entities.append(entity)
    return entities

def entity_frequency(df, col):
    """Frequency table of all entities extracted from one text column."""
    all_entities = []
    for text in df[col].dropna():
        all_entities.extend(extract_entities(text))
    return pd.Series(all_entities).value_counts()
```
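Hypothetical usage, assuming a text column named `mentions`:

```python
freq = entity_frequency(df, 'mentions')
freq.rename_axis('entity').reset_index(name='n').to_csv(
    'TABLE/11_entity_mentions.csv', index=False)
```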
6. Output Directory Structure
```
TABLE/
├── 01_sample_overview.csv
├── 02_groupvar1_distribution.csv
├── ...
├── 08_continuous_overall.csv
├── 08a_continuous_by_groupvar1.csv
├── ...
└── DESCRIPTIVE_SUMMARY.md
```
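A short setup sketch using `pathlib`; `overview` is a placeholder for the first table's DataFrame:

```python
from pathlib import Path

out_dir = Path('TABLE')
out_dir.mkdir(exist_ok=True)
overview.to_csv(out_dir / '01_sample_overview.csv', index=False)  # placeholder frame
```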
7. MD Summary Generator
Create a comprehensive markdown summary including:
- Sample Overview - Dataset dimensions and date range
- Distribution Tables - Top values for each grouping variable
- Continuous Descriptives - Overall + by each grouping variable
- Entity Summaries - Unique counts and top entities
- Temporal Trends - Key metrics over time
- Output Files Reference - Links to all CSV tables
The summary should use properly formatted markdown tables, for example:
```markdown
| Variable | N   | Mean     | SD        | Median |
|----------|-----|----------|-----------|--------|
| views    | 380 | 27192.59 | 133894.14 | 657.00 |
```
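A minimal sketch of the generator, assuming pandas' `DataFrame.to_markdown()` is available (it requires the `tabulate` package); `sections` is a hypothetical list of (heading, DataFrame) pairs:

```python
def write_summary(sections, path='TABLE/DESCRIPTIVE_SUMMARY.md'):
    """sections: list of (heading, DataFrame) pairs in report order."""
    lines = ['# Descriptive Summary', '']
    for title, frame in sections:
        lines += [f'## {title}', '', frame.round(2).to_markdown(index=False), '']
    with open(path, 'w') as f:
        f.write('\n'.join(lines))
```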
8. Key Design Principles
- Descriptive only - No inferential statistics unless requested
- Flexible grouping - Support any number of grouping variables
- Top-N limits - Show top 5-10 for large category sets (see the rollup sketch after this list)
- Clean entity extraction - Normalize case, deduplicate
- Dual output - CSV for validation, MD for interpretation
- Video/channel counts - When applicable, report both unit types
- Milestone annotations - Add context to temporal distributions
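A sketch of the Top-N rollup named above: it takes a `value_counts()` Series and collapses the tail into an 'Other' row:

```python
def top_n_with_other(counts, n=10):
    top = counts.head(n)
    tail_total = counts.iloc[n:].sum()
    if tail_total > 0:
        top = pd.concat([top, pd.Series({'Other': tail_total})])
    return top
```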
9. Verification Checklist
- [ ] All CSV files generated with > 0 rows
- [ ] No empty/null columns
- [ ] Sum of frequencies matches total N
- [ ] Grouped descriptives align with overall
- [ ] Entity extraction capturing expected patterns
- [ ] MD summary coherent and complete
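A sketch that automates several of these checks, assuming the tables live under TABLE/ and distribution files carry an `n` column (totals may legitimately differ from N when rows with missing group values were dropped):

```python
from pathlib import Path
import pandas as pd

total_n = len(df)  # assumes the source DataFrame is still in scope
for path in sorted(Path('TABLE').glob('*.csv')):
    table = pd.read_csv(path)
    assert len(table) > 0, f'{path.name}: no rows'
    assert not table.isna().all().any(), f'{path.name}: all-null column'
    if path.name.endswith('_distribution.csv') and 'n' in table.columns:
        if table['n'].sum() != total_n:
            print(f'{path.name}: frequencies sum to {table["n"].sum()}, not {total_n}')
```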