# Python Package Statistics

I am writing this to give some descriptive statistics about Python packages on PyPI. I crawled 34968 individual package pages on the Python Package Index and could extract data from 34923 of them. PyPI reports 35813 packages, but some of them are different versions of the same package. I preferred working with the latest versions.

Why did I do this? Curiosity and spare time...

### Package Size

Most packages provide a download section on their PyPI page. I took the data from the first file listed in the download section. Here are the biggest packages on PyPI:

- b2gpopulate (36MB)
- ajenti (35MB)
- FinPy (29MB)
- django-dojo (28MB)
- QSTK (27MB)

The total size of the packages on PyPI amounted to 4.2 GB. The average package size is 161 KB and the standard deviation is 1 MB.

I normalized values using `1MB = 1024KB` and `1KB = 1024 B`.

### Most Required Packages

I assigned a requiredness score to each package. It works like this: if package a needs package b, and package b
needs package c, then c gets a score of 2 and b gets a score of 1, because c has 2 packages requiring it directly or
indirectly. I used the *Requires* section of the package pages to calculate this. Some packages don't provide this information, and
some have mistakes in this section, such as spelling errors. However, the results seem representative of the real requiredness values.

- numpy (1500)
- django (860)
- scipy (555)
- python (425)
- matplotlib (370)
- simplejson (310)
- requests (310)
- pil (275)
- pyyaml (265)
- lxml (255)

Some packages appear to have specified *python* as a required package.
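The scoring rule described above can be sketched as a walk over the reversed dependency graph. This is only an illustration under assumptions: the function name `requiredness_scores` and the input shape (a dictionary mapping each package to the packages it directly requires) are hypothetical, not the code used to produce the numbers in the list.

```python
from collections import defaultdict


def requiredness_scores(requires):
    """Count, for every package, how many packages need it directly or indirectly.

    `requires` is a hypothetical input: {package: iterable of direct requirements}.
    """
    # Reverse the graph: package -> packages that directly require it.
    dependents = defaultdict(set)
    for pkg, deps in requires.items():
        for dep in deps:
            dependents[dep].add(pkg)

    scores = {}
    for pkg in set(requires) | set(dependents):
        seen = set()
        stack = [pkg]
        # Depth-first walk upwards through everything that depends on pkg.
        while stack:
            current = stack.pop()
            for parent in dependents.get(current, ()):
                # Skip already counted packages and pkg itself (guards against cycles).
                if parent not in seen and parent != pkg:
                    seen.add(parent)
                    stack.append(parent)
        scores[pkg] = len(seen)
    return scores


# The a -> b -> c example from the text.
example = {"a": ["b"], "b": ["c"], "c": []}
print(requiredness_scores(example))
```

For the a, b, c example, c scores 2, b scores 1 and a scores 0. Packages that only appear as a requirement (like *python*) still receive a score, because they show up in the reversed graph.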

### Downloads

Every package provides daily, weekly and monthly download data on its page. I collected these numbers in the early morning of 18 October 2013. Here are the most downloaded packages and some statistics.

#### Daily

- distribute (67 417)
- setuptools (52 184)
- requests (50 486)
- ssl (50 201)
- certifi (49 738)

total: 1 880 974, average: 54, standard deviation: 900.

#### Weekly

- distribute (445 006)
- setuptools (414 556)
- ssl (379 201)
- certifi (378 729)
- wincertstore (375 158)

total: 14 097 027, average: 403.6, standard deviation: 6 425.

#### Monthly

- distribute (1 743 309)
- setuptools (1 607 116)
- certifi (1 546 749)
- ssl (1 537 270)
- wincertstore (1 532 424)

total: 53 281 293, average: 1 525.7, standard deviation: 25 058.7.
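The total, average and standard deviation figures above can be reproduced from a list of download counts roughly like this. It is a minimal sketch: the helper name `describe` is hypothetical, and it assumes the population standard deviation was meant, which the text does not state.

```python
import statistics


def describe(counts):
    """Return total, average and standard deviation for a list of download counts."""
    counts = list(counts)
    return {
        "total": sum(counts),
        "average": statistics.mean(counts),
        # Assumption: population standard deviation; use statistics.stdev
        # instead for the sample standard deviation.
        "std_dev": statistics.pstdev(counts),
    }


# Made-up numbers, only to show the shape of the result.
print(describe([2, 4, 4, 4, 5, 5, 7, 9]))
```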

### Number of Lines of Python Code

I selected a random sample of 30 packages using `random.sample()`. I downloaded their source files and counted the number of non-empty lines in Python
files (excluding setup.py). The minimum and maximum line counts were 2 and 47 453 respectively. The number of lines averaged 2212.6 per package,
with a standard deviation of 8729.7. The 95% confidence interval for the population mean had an upper limit of 5336.4, and its lower limit was a negative
number. Therefore, we can estimate that there are `2212.6 * 35813 = 79 239 844` lines of Python code on the package index. As an upper limit,
we can use `5336.4 * 35813 = 191 112 493` lines of code with 95% confidence.
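The interval and the two estimates can be checked with a few lines of arithmetic. This sketch assumes a normal approximation with a critical value of about 1.96, which lands close to the reported upper limit; the text does not say which method was actually used, and a t-based interval for 30 samples would be somewhat wider. The function name is hypothetical.

```python
import math


def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Normal-approximation confidence interval for a population mean."""
    margin = z * std_dev / math.sqrt(n)
    return mean - margin, mean + margin


# Figures reported in the text.
sample_mean, sample_std, sample_size = 2212.6, 8729.7, 30
package_count = 35813

lower, upper = mean_confidence_interval(sample_mean, sample_std, sample_size)
print("interval for the mean:", round(lower, 1), round(upper, 1))

# Scale the per-package figures up to the whole index.
print("point estimate:", round(sample_mean * package_count))
print("upper estimate:", round(5336.4 * package_count))
```

The lower bound comes out negative because the standard deviation is large relative to the mean, which is why only the upper limit is useful here.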

### Data

Package metadata (a 5 MB rar file) can be accessed through Google Drive. Extracting the rar file gives a pickled Python dictionary (around 23 MB). To load the data:

    from pickle import load

    # Open in binary mode; pickle files are binary data.
    with open("all_in_one", "rb") as d:
        package_metas = load(d)

The keys of `package_metas` are lowercase package names. The values are dictionaries of package metadata. Most packages provide at least
the `packagename`, `daily`, `weekly`, `monthly`, `Author` and `Home Page` fields; however, some might not provide this info. This is
because when I encountered unknown packages while determining requiredness_score, I added them to this dictionary.

Most packages also provide `filename`, `filetype`, `pyversion`, `uploaded` and `size` data.

Other keys of the per-package dictionaries depend on the individual package.
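Because some entries lack fields, code that reads the dictionary should not assume every key is present. Here is a small sketch of ranking packages by a download field. The helper `top_by_field` and the tiny `sample` dictionary are hypothetical stand-ins for the real `package_metas`, and the sketch assumes the download counts are stored as integers, which may not match the actual pickled data.

```python
def top_by_field(package_metas, field, n=5):
    """Return the n (name, value) pairs with the largest value for `field`."""
    # Entries added only during requiredness scoring may lack the field, so skip them.
    rows = [
        (name, meta[field])
        for name, meta in package_metas.items()
        if field in meta
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


# Hypothetical stand-in for the unpickled dictionary.
sample = {
    "distribute": {"packagename": "distribute", "monthly": 1743309},
    "setuptools": {"packagename": "setuptools", "monthly": 1607116},
    "python": {},  # an entry without download data
}

for name, downloads in top_by_field(sample, "monthly", n=2):
    print(name, downloads)
```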