Add spec for website activity stats effort

A common request is better insight into how our hosted sites are accessed.
This allows people to find 404s and fix them, invest in popular pages to
ensure they are accurate and up to date, and measure whether changes are
effective over time.

"Traditional" tools in this space often expose far more user information
than we are comfortable with. Thankfully there exists a GPL tool, goaccess,
that allows us to remove sensitive data from its reports. This will allow us
to publish the data without concern over what is in the reports.

Change-Id: I3e6673def7edcb2f31f9be88e1831f716f6e8c9d
parent cab1a48a1a
commit f8c1dd508e
@ -43,6 +43,7 @@ permits.
   specs/translation_check_site
   specs/wiki_modernization
   specs/retire-static
   specs/website-stats

Help Wanted
===========
178
specs/website-stats.rst
Normal file
@ -0,0 +1,178 @@
::

  Copyright 2020 OpenStack Foundation

  This work is licensed under a Creative Commons Attribution 3.0
  Unported License.
  http://creativecommons.org/licenses/by/3.0/legalcode

======================
Website Activity Stats
======================

https://storyboard.openstack.org/TODO

Basic website activity stats, such as which pages are hit most often, which
pages return 404s, and the total number of visitors, aid in properly running
a site. With this info you can correct broken links or redirect users to
appropriate locations. Popular pages can be given more attention as they
are read most often. Visitor numbers help you learn whether changes that are
being made are effective.

Unfortunately, we have not published any of this useful data for a long
time.

Problem Description
===================

One of the major reasons we have not published this data historically is
that many tools that work with this data overshare. We are particularly
concerned about publishing information that might be attributed to specific
users. The ideal here is to publish the bare minimum of information that
allows web admins to properly manage sites without leaking personal
information.

In particular we don't want to leak IP addresses or subnets, as IPs are
considered PII and, without significant traffic, subnets typically identify
specific users. We also want to avoid publishing referrer information, as
this can be used to infer who users are as well. This can happen if users
follow links from internal company wikis, bug trackers, or code hosting
systems.

Out of an abundance of caution we will avoid publishing operating system,
web browser, and Google search terms as well. This data is likely safe to
share, particularly if we avoid making it cross-referenceable with other
fields. For this reason we may add these stats in the future.

Proposed Change
===============

We can use goaccess, a GPL tool, to produce conservative website stats
reports from Apache access logs. The key here is that newer goaccess
versions (those packaged in Ubuntu Bionic and later) allow you to remove
data from the generated report files. This allows us to tell goaccess to
produce reports only with the data we feel is safe for public consumption.

We would run periodic Zuul jobs that connect to static.opendev.org,
uncompress Apache log files as necessary, then feed them through goaccess.
The resulting report.html output file could then be written into AFS as well
as hosted directly from the Zuul logs system. This would give us reports
that update roughly daily, covering the period of time for which logs are
available.

To make this possible we will use Zuul's per-project SSH keys. This will
allow the jobs to add static.opendev.org to the running Ansible inventory,
then run Ansible to perform the above steps.

If publishing into AFS we would write the reports to a known location for
each site::

  https://example.website.org/goaccess.html

To do this we need a configuration file that excludes the panels we do not
want::

  log-format COMBINED

  ignore-panel VISITORS
  ignore-panel REQUESTS
  ignore-panel REQUESTS_STATIC
  ignore-panel NOT_FOUND
  ignore-panel HOSTS
  ignore-panel OS
  ignore-panel BROWSERS
  ignore-panel VISIT_TIMES
  ignore-panel VIRTUAL_HOSTS
  ignore-panel REFERRERS
  ignore-panel REFERRING_SITES
  ignore-panel KEYPHRASES
  ignore-panel STATUS_CODES
  ignore-panel REMOTE_USER
  ignore-panel GEO_LOCATION

  enable-panel VISITORS
  enable-panel REQUESTS
  enable-panel REQUESTS_STATIC
  enable-panel NOT_FOUND
  enable-panel STATUS_CODES
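
A configuration like the one above could be checked before it is deployed.
The following is a minimal sketch, not part of the proposed tooling: it only
inspects the ``ignore-panel`` and ``enable-panel`` lines as text and makes no
claim about how goaccess itself resolves them. The ``SENSITIVE_PANELS`` set
and the function names are assumptions chosen for illustration, based on the
data this spec says it does not want to publish.

```python
# Assumption: panels that expose data this spec wants to keep private
# (IP addresses, referrers, operating system, browser, search terms).
SENSITIVE_PANELS = {
    "HOSTS", "OS", "BROWSERS", "REFERRERS", "REFERRING_SITES", "KEYPHRASES",
}


def parse_panels(config_text):
    """Return (ignored, enabled) sets of panel names found in the config."""
    ignored, enabled = set(), set()
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "ignore-panel":
            ignored.add(value)
        elif key == "enable-panel":
            enabled.add(value)
    return ignored, enabled


def lint_config(config_text):
    """Return a list of human-readable problems; empty means no findings."""
    ignored, enabled = parse_panels(config_text)
    problems = []
    for panel in sorted(SENSITIVE_PANELS & enabled):
        problems.append(f"{panel} is enabled")
    for panel in sorted(SENSITIVE_PANELS - ignored):
        problems.append(f"{panel} is not ignored")
    return problems
```

A job could fail early when ``lint_config`` returns a non-empty list, so that
a config change that re-enables a sensitive panel is caught in review rather
than in a published report.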

Then we can run (roughly) this command in the Zuul jobs::

  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf
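
The "uncompress as necessary, then feed through goaccess" step could look
roughly like the sketch below. It is an illustration under stated
assumptions, not the job's actual implementation: the helper names, the
``merged.log`` file name, and the ``workdir`` parameter are hypothetical, and
only the ``-o`` and ``-p`` flags from the command above are used.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def merge_logs(log_paths, merged_path):
    """Concatenate plain and gzip-compressed logs into one plain file."""
    with open(merged_path, "wb") as out:
        for path in log_paths:
            # Rotated logs are assumed to carry a .gz suffix when compressed.
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return merged_path


def goaccess_command(merged_path, report_path, config_path):
    """Build the goaccess argument list using the spec's example flags."""
    return ["goaccess", str(merged_path), "-o", str(report_path),
            "-p", str(config_path)]


def build_report(log_dir, site, report_path, config_path, workdir):
    """Merge every access log for one site and run goaccess over the result."""
    # Lexical order, not chronological; entries carry their own timestamps.
    logs = sorted(Path(log_dir).glob(f"{site}_access.log*"))
    merged = merge_logs(logs, Path(workdir) / "merged.log")
    subprocess.run(goaccess_command(merged, report_path, config_path),
                   check=True)
```

Writing a merged plain-text file keeps the goaccess invocation identical to
the example command, at the cost of temporary disk space on the node running
the job.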

Alternatives
------------

We could use a tracker that runs in the browser, such as goatcounter. One
downside to this approach is that we would need to run custom 404 pages in
order to collect data on 404s. This is more complicated than the web server
logs approach. One upside to this approach is that we could track referrers
to 404s, enabling us to more easily fix our own broken links.

If we wanted to collect a rich set of data, such trackers would provide much
more info, but because we've decided that we do not want to collect that
information the server logs should be sufficient.

Implementation
==============

Assignee(s)
-----------

Primary assignee:
  TBD

Gerrit Topic
------------

Use Gerrit topic "website-stats" for all patches related to this spec.

.. code-block:: bash

    git-review -t website-stats

Work Items
----------

* Write Zuul jobs to produce and publish the goaccess reports.
* Document goaccess tooling for web admins.

Repositories
------------

None

Servers
-------

static.opendev.org would be updated to implement this for the sites it
hosts.

DNS Entries
-----------

None

Documentation
-------------

We will need to document where the stats can be retrieved once available.
We should also document the choices we made around which data is collected.

Security
--------

We could potentially leak sensitive client information unintentionally.
The example config file above tries to avoid that by explicitly disabling
all available goaccess panels, then enabling the few we know are safe.

Testing
-------

We can run the new job against test data to ensure it works as expected
without disclosing unwanted info.
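
One way such a test could check for unwanted disclosure is to scan the
generated report for anything that looks like an IPv4 address. This is a
rough sketch under assumptions: the function name is hypothetical, IPv6 is
not covered, and four-part version strings such as ``1.2.3.4`` would be
reported as false positives.

```python
import ipaddress
import re

# Candidate dotted quads; each match is validated with ipaddress below.
_IPV4_CANDIDATE = re.compile(
    r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])"
)


def find_ipv4_addresses(text):
    """Return a sorted list of valid IPv4 literals found in text."""
    found = set()
    for match in _IPV4_CANDIDATE.finditer(text):
        try:
            found.add(str(ipaddress.IPv4Address(match.group(0))))
        except ipaddress.AddressValueError:
            continue  # e.g. 999.1.1.1 is not a real address
    return sorted(found)
```

A test job could feed it the report produced from known test logs and fail
if the returned list is not empty.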

Dependencies
============

None