Compute APDEX from Apache-style logs.

Overview
========

Parses Apache-style logs and generates several statistics intended for a
website developer audience:

- APDEX (Application Performance inDEX, see http://www.apdex.org) ratio
  (plotted)

  Because you want to know how satisfied your users are.

- hit count (plotted)

  Because achieving 100% APDEX is easy when there is nobody around.

- HTTP status codes, with optional detailed output of the most frequent URLs
  per error status code, along with their most frequent referers

  Because you forgot to update a link to that conditionally-used browser
  compatibility javascript you renamed.

- hottest pages (pages which use the most rendering time)

  Because you want to know where to invest time to get the highest user
  experience improvement.

- ERP5 sites: per-module statistics, with module and document views separated

  Because module and document types are not born equal in usage patterns.
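
For reference, the APDEX ratio listed above follows the Apdex formula:
requests served within a threshold T count as satisfied, requests served
within 4T count for half, and slower requests count for nothing. A minimal
Python sketch, assuming durations in seconds; `apdex` is a hypothetical helper
written for illustration, not apachedex's own code::

  def apdex(durations, threshold=1.0):
      """Return the Apdex score (0.0 to 1.0) for request durations in seconds."""
      if not durations:
          raise ValueError("need at least one request")
      # Satisfied: served within the threshold.
      satisfied = sum(1 for d in durations if d <= threshold)
      # Tolerating: slower than the threshold, but within four times it.
      tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
      # Frustrated requests (slower than 4T) only count in the denominator.
      return (satisfied + tolerating / 2.0) / len(durations)

For example, two fast hits, one tolerable hit and one very slow hit with a
1 second threshold give (2 + 0.5) / 4.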

Some parsing performance figures:

On a 2.3 GHz Core i5, apachedex achieves 97000 lines/s
(pypy-c-jit-62994-bd32583a3f11-linux64) and 43000 lines/s (CPython 2.7).
Those were measured on a 3000000-hit logfile, with 3 --skip-base, 1
--erp5-base, 3 --base and --default set. --\*base values were similar in
simplicity to the ones provided in examples below.

What APacheDEX is not
=====================

APacheDEX does not produce website audience statistics as AWStats, Google
Analytics, etc. do.

APacheDEX does not monitor website availability & resource usage as Zabbix,
Cacti, Ganglia, Nagios, etc. do.

Requirements
============

Dependencies
------------

As such, apachedex has no strict dependencies outside of a standard Python 2.7
installation.
But generated output needs a few javascript files which come from other
projects:

- jquery.js

- jquery.flot.js

- jquery.flot.time.js (official flot plugin)

- jquery.flot.axislabels.js (third-party flot plugin)

If you installed apachedex (using an egg or with a distribution's package) you
should have them already.
If you are running from the repository, you need to fetch them first::

  python setup.py deps

Also, apachedex can make use of backports.lzma
(http://pypi.python.org/pypi/backports.lzma/) if it's installed to support xz
file compression.

Input
-----

All default "combined" log format fields are supported (more can easily be
added), plus %D.

Mandatory fields are (in any order) `%t`, `%r` (for request's URL), `%>s`,
`%{Referer}i`, `%D`. Just tell apachedex the value from your apache log
configuration (see `--logformat` argument documentation).
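
To illustrate the mandatory fields, the sketch below parses one line logged
with the "combined" format followed by `%D`. The exact LogFormat, the regular
expression and the `parse_line` name are assumptions made for this example;
apachedex itself takes the format from `--logformat`. Escaped quotes inside
fields are not handled here::

  import re

  # Assumed LogFormat:
  #   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i" %D
  LINE_RE = re.compile(
      r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
      r'\[(?P<time>[^\]]+)\] '
      r'"(?P<request>[^"]*)" '
      r'(?P<status>\d{3}) (?P<size>\S+) '
      r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
      r'(?P<duration_us>\d+)$'
  )

  def parse_line(line):
      """Return the mandatory fields of one log line, or None if malformed."""
      match = LINE_RE.match(line.strip())
      if match is None:
          return None
      # %r looks like "GET /some/url HTTP/1.1": keep only the URL part.
      _method, _, rest = match.group("request").partition(" ")
      url = rest.rsplit(" ", 1)[0]
      return {
          "time": match.group("time"),
          "url": url,
          "status": int(match.group("status")),
          "referer": match.group("referer"),
          # %D is expressed in microseconds; convert to seconds.
          "duration": int(match.group("duration_us")) / 1e6,
      }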

Input files may be provided uncompressed or compressed in:

- bzip2

- gzip

- xz (if module backports.lzma is installed)

Input filename "-" is understood as stdin.
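
A rough sketch of how such inputs could be opened, assuming the compression is
guessed from the file extension (apachedex may detect it differently) and
using Python 3, where `lzma` is part of the standard library. `open_log` is a
hypothetical helper, not part of apachedex::

  import bz2
  import gzip
  import sys

  def open_log(filename):
      """Open a log for text reading, picking a decompressor by extension."""
      if filename == "-":
          # "-" means: read the log from standard input.
          return sys.stdin
      if filename.endswith(".gz"):
          return gzip.open(filename, "rt")
      if filename.endswith(".bz2"):
          return bz2.open(filename, "rt")
      if filename.endswith(".xz"):
          # Imported lazily, as xz support is optional.
          import lzma
          return lzma.open(filename, "rt")
      return open(filename)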

Output
------

The output is HTML + CSS + JS, so you need a web browser to read it.

Output filename "-" is understood as stdout.

Usage
=====

A few usage examples. See embedded help (`-h`/`--help`) for further options.

Most basic usage::

  apachedex --default website access.log

Generate stand-alone output (suitable for inclusion in a mail, for example)::

  apachedex --default website --js-embed access.log --out attachment.html

A log file with requests for 2 websites for which individual stats are
desired, and hits outside those base urls are ignored::

  apachedex --base "/site1(/|$|\?)" "/site2(/|$|\?)"

A log file with a site section to ignore. Order does not matter::

  apachedex --skip-base "/ignored(/|$|\?)" --default website

A mix of both examples above. Order matters!::

  apachedex --skip-base "/site1/ignored(/|$|\?)" \
  --base "/site1(/|$|\?)" "/site2(/|$|\?)"
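
Why order matters can be pictured as a first-match-wins lookup. The sketch
below assumes that rules are tried in command-line order and that the first
matching regex decides; this behaviour is an assumption, and `classify` and
`RULES` are hypothetical names, not apachedex's API::

  import re

  def classify(url, rules):
      """Return the site name for url, or None if skipped or unmatched.

      rules is an ordered list of (pattern, name) pairs; name is None for
      a skipped section. The first matching pattern wins.
      """
      for pattern, name in rules:
          if re.match(pattern, url):
              return name
      return None

  RULES = [
      (r"/site1/ignored(/|$|\?)", None),  # plays the role of --skip-base
      (r"/site1(/|$|\?)", "site1"),       # plays the role of --base
      (r"/site2(/|$|\?)", "site2"),
  ]

With these rules `/site1/ignored/x` is skipped, whereas listing the `/site1`
rule first would count it under site1.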

Matching non-ASCII URLs works by using URL-encoded strings::

  apachedex --base "/%E6%96%87%E5%AD%97%E5%8C%96%E3%81%91(/|$|\\?)" access.log

Naming websites so that the report looks less intimidating, by interleaving
"+"-prefixed titles with regexes (title must be just before regex)::

  apachedex --default "Public website" --base "+Back office" \
  "/backoffice(/|$|\\?)" "+User access" "/secure(/|$|\\?)" access.log

Saving the result of an analysis for faster reuse::

  apachedex --default foo --format json --out save_state.json --period day \
  access.log

Although not required, it is strongly advised to provide the `--period`
argument, as mixing states saved with different periods (fixed or
auto-detected from data) gives hard-to-read results and can cause problems if
loaded data gets converted to a larger period.

Continuing a saved analysis, updating collected data::

  apachedex --default foo --format json --state-file save_state.json \
  --out save_state.json --period day access.2.log

Generating HTML output from two state files, aggregating their content
without parsing more logs::

  apachedex --default foo --state-file save_state.json save_state.2.json \
  --period day --out index.html

Configuration files
===================

Providing a filename prefixed by "@" puts the content of that file in place of
that argument, recursively. Each file is loaded relative to the directory
containing the referencing file, or to the current working directory for files
named on the command line.

- foo/dev.cfg::

    --error-detail
    @site.cfg
    --stats

- foo/site.cfg::

    --default Front-office
    # This is a comment
    --prefix "+Back office" "/back(/|$|\?)" # This is another comment
    --skip-prefix "/baz/ignored(/|$|\?)" --prefix +Something "/baz(/|$|\?)"

- command line::

    apachedex --skip-base "/ignored(/|$|\?)" @foo/dev.cfg --out index.html \
    access.log

This is equivalent to::

  apachedex --skip-base "/ignored(/|$|\?)" --error-detail \
  --default Front-office --prefix "+Back office" "/back(/|$|\?)" \
  --skip-prefix "/baz/ignored(/|$|\?)" --prefix +Something "/baz(/|$|\?)" \
  --stats --out index.html access.log
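
The expansion described in this section can be sketched as follows. This is an
illustration under stated assumptions, not apachedex's implementation:
`expand_args` is a hypothetical helper, and `shlex` is used to approximate
quoting and `#` comments::

  import os
  import shlex

  def expand_args(args, base_dir="."):
      """Replace each "@file" argument with that file's arguments."""
      expanded = []
      for arg in args:
          if not arg.startswith("@"):
              expanded.append(arg)
              continue
          path = os.path.join(base_dir, arg[1:])
          with open(path) as cfg:
              # comments=True drops "# ..." text; quotes group words.
              tokens = shlex.split(cfg.read(), comments=True)
          # Nested "@file" references are resolved relative to the
          # directory of the file which references them.
          expanded.extend(expand_args(tokens, os.path.dirname(path)))
      return expanded

Called with the command-line arguments and the current working directory, it
returns the flat argument list a parser would then receive.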

Portability note: the use of paths containing directory elements inside
configuration files is discouraged, as it's not portable. This may change
later (e.g. by deciding that import paths are URLs and applying their rules).

Performance
===========

For better performance...

- pipe decompressed files to apachedex instead of having apachedex decompress
  files itself::

    bzcat access.log.bz2 | apachedex [...] -

- when letting apachedex decide statistic granularity with multiple log files,
  provide the earliest and latest log files first (in either order) so
  apachedex can adapt its data structure to the analysed time range before
  there is too much data::

    apachedex [...] access.log.1.gz access.log.99.gz access.log.2.gz \
    access.log.3.gz [...] access.log.98.gz

- parse log files in parallel processes, saving analysis output and aggregating
  them in the end::

    for LOG in access*.log; do
      apachedex "$@" --format json --out "$LOG.json" "$LOG" &
    done
    wait
    apachedex "$@" --out access.html --state-file access.*.json

  If you have bash and an xargs implementation supporting `-P`, you may want
  to use `parallel_parse.sh`, available in the source distribution or from the
  repository.

Notes
=====

When there are no hits for more than a graph period, placeholders are
generated with 0 hits (which is the reality) and 100% apdex (which is
arbitrary). Those placeholders only affect graphs, and affect neither
averages nor table content.
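
The placeholder idea can be sketched as below. This is only an illustration:
it assumes periods are numbered with consecutive integers and fills every
empty period, which may differ from what apachedex actually emits;
`fill_gaps` is a hypothetical name::

  def fill_gaps(points):
      """Return (period, hits, apdex_percent) tuples without holes.

      points maps an integer period index to a (hits, apdex_percent) pair.
      Periods with no data get a placeholder: 0 hits and 100% apdex.
      """
      if not points:
          return []
      filled = []
      for period in range(min(points), max(points) + 1):
          hits, apdex_percent = points.get(period, (0, 100.0))
          filled.append((period, hits, apdex_percent))
      return filled

Averages would still be computed from `points` only, so the placeholders do
not skew them.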

Loading saved states generated with different sets of parameters is not
prevented, but can produce nonsense/unreadable results. Or it can save the day
if you do want to mix different parameters (e.g. you have some logs generated
with %T, others with %D).

It is unclear how the saved state format will evolve. Be prepared to have
to regenerate saved states when you upgrade APacheDEX.