diff --git a/doc/source/index.rst b/doc/source/index.rst
index 6ca43c2..16a651b 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -43,6 +43,7 @@ permits.
    specs/translation_check_site
    specs/wiki_modernization
    specs/retire-static
+   specs/website-stats
 
 Help Wanted
 ===========
diff --git a/specs/website-stats.rst b/specs/website-stats.rst
new file mode 100644
index 0000000..d239188
--- /dev/null
+++ b/specs/website-stats.rst
@@ -0,0 +1,178 @@
+::
+
+  Copyright 2020 OpenStack Foundation
+
+  This work is licensed under a Creative Commons Attribution 3.0
+  Unported License.
+  http://creativecommons.org/licenses/by/3.0/legalcode
+
+======================
+Website Activity Stats
+======================
+
+https://storyboard.openstack.org/TODO
+
+Basic website activity stats, such as which pages are hit most often, which
+pages return 404s, and the total visitor count, aid in properly running a
+site. With this info you can correct broken links or redirect users to
+appropriate locations. Popular pages can be given more attention as they
+are read most often. Visitor numbers help you learn whether changes that are
+being made are effective.
+
+Unfortunately, for a long time we have not published any of
+this useful data.
+
+Problem Description
+===================
+
+One of the major reasons we have not published this data historically is
+that many tools that work with this data overshare. We are particularly
+concerned about publishing information that might be attributed to specific
+users. Ideally we would publish the bare minimum of information
+that allows web admins to properly manage sites without leaking personal
+information.
+
+In particular we don't want to leak IP addresses or subnets, as IPs are
+considered PII and, without significant traffic, subnets typically identify
+specific users. We also want to avoid publishing referrer information, as
+this can also be used to infer who users are.
+This can happen if users follow links from internal company wikis, bug
+trackers or code hosting systems.
+
+Out of an abundance of caution we will avoid publishing operating system,
+web browser, and Google search terms as well. This data is likely safe to
+share, particularly if we avoid making it cross-referenceable with other
+fields. For this reason we may add these stats in the future.
+
+Proposed Change
+===============
+
+We can use goaccess, an open source tool, to produce conservative website stats
+reports from Apache access logs. The key here is that newer goaccess
+releases (as packaged since Ubuntu Bionic) let you omit data from the
+final report files. This allows us to tell goaccess to produce reports
+containing only the data we feel is safe for public consumption.
+
+We would run periodic Zuul jobs that connect to static.opendev.org,
+uncompress Apache log files as necessary, then feed them through goaccess.
+The resulting report.html output file could then be written into AFS as well
+as hosted directly from the Zuul logs system. This would give us reports
+that update roughly daily, covering the period of time for which logs are
+available.
+
+To make this possible we will use Zuul's per-project SSH keys. This will
+allow the jobs to add static.opendev.org to the running Ansible inventory
+then run Ansible to perform the above steps.
+
+When publishing into AFS, reports would go to a known location per site::
+
+  https://example.website.org/goaccess.html
+
+To do this we need a configuration file that excludes the panels we do not
+want::
+
+  log-format COMBINED
+
+  ignore-panel VISITORS
+  ignore-panel REQUESTS
+  ignore-panel REQUESTS_STATIC
+  ignore-panel NOT_FOUND
+  ignore-panel HOSTS
+  ignore-panel OS
+  ignore-panel BROWSERS
+  ignore-panel VISIT_TIMES
+  ignore-panel VIRTUAL_HOSTS
+  ignore-panel REFERRERS
+  ignore-panel REFERRING_SITES
+  ignore-panel KEYPHRASES
+  ignore-panel STATUS_CODES
+  ignore-panel REMOTE_USER
+  ignore-panel GEO_LOCATION
+
+  enable-panel VISITORS
+  enable-panel REQUESTS
+  enable-panel REQUESTS_STATIC
+  enable-panel NOT_FOUND
+  enable-panel STATUS_CODES
+
+Then we can run (roughly) this command in the Zuul jobs::
+
+  goaccess /var/log/apache2/example.site.org_access.log* -o example-site-report.html -p ./goaccess.conf
+
+Alternatives
+------------
+
+We could use an in-browser tracker like goatcounter. One downside
+to this approach is that we would need to run custom 404 pages in order
+to collect data on 404s. This is more complicated than the web server logs
+approach. One upside to this approach is that we could track referrers to
+404s, enabling us to more easily fix our own broken links.
+
+If we were collecting a rich set of data, such trackers would provide much
+more info, but because we've decided that we do not want to collect that
+information, the server logs should be sufficient.
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee:
+  TBD
+
+Gerrit Topic
+------------
+
+Use Gerrit topic "website-stats" for all patches related to this spec.
+
+.. code-block:: bash
+
+    git-review -t website-stats
+
+Work Items
+----------
+
+* Write Zuul jobs to produce and publish the goaccess reports.
+* Document goaccess tooling for web admins.
+
+Repositories
+------------
+
+None
+
+Servers
+-------
+
+static.opendev.org would be updated to implement this for the sites it
+hosts.
+
+DNS Entries
+-----------
+
+None
+
+Documentation
+-------------
+
+We will need to document where the stats can be retrieved once available.
+We should also document the choices we made around which data is collected.
+
+Security
+--------
+
+We could potentially leak sensitive client information unintentionally.
+The example config file above tries to avoid that by explicitly disabling
+all available goaccess panels, then enabling only the few we believe are
+safe.
+
+Testing
+-------
+
+We can run the new job against test data to ensure it works as expected
+without disclosing unwanted info.
+
+Dependencies
+============
+
+None
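
The Testing section proposes checking that the job does not disclose unwanted info. One possible pre-flight check, separate from the patch itself, is sketched here in Python: it reads a goaccess config and flags any enable-panel directive outside the five panels the spec re-enables. The helper names and the ALLOWED_PANELS constant are hypothetical, not part of goaccess or of this change. The check only inspects config directives and does not confirm how goaccess itself resolves a panel that is both ignored and enabled.

```python
"""Sketch: flag goaccess panels enabled beyond the spec's allow-list."""

# Panels the spec's example config re-enables. This set is an assumption
# and would need to be kept in sync with the real goaccess.conf.
ALLOWED_PANELS = frozenset(
    {"VISITORS", "REQUESTS", "REQUESTS_STATIC", "NOT_FOUND", "STATUS_CODES"}
)


def enabled_panels(conf_text):
    """Return the set of panel names listed in enable-panel directives."""
    panels = set()
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        parts = line.split()
        if len(parts) == 2 and parts[0] == "enable-panel":
            panels.add(parts[1].upper())
    return panels


def unexpected_panels(conf_text, allowed=ALLOWED_PANELS):
    """Return a sorted list of enabled panels that are not on the allow-list."""
    return sorted(enabled_panels(conf_text) - allowed)


# A shortened stand-in for the config in the spec, for illustration only.
EXAMPLE_CONF = """\
log-format COMBINED
ignore-panel HOSTS
ignore-panel REFERRERS
enable-panel VISITORS
enable-panel NOT_FOUND
"""

if __name__ == "__main__":
    print("unexpected panels:", unexpected_panels(EXAMPLE_CONF))
```

A job could run a check like this before invoking goaccess and fail if the returned list is not empty, so that a config edit enabling HOSTS or REFERRERS is caught before any report is published.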