add neutron, glance, and n-net logs as required files when
appropriate. This will help ensure that we don't miss a pattern
because we searched before the log was in the system.
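For illustration, the list might grow along these lines (the exact
file names here are assumptions, not the real set in the tree):

    # extra logs required before we consider a job's logs indexed
    REQUIRED_FILES = [
        'console.html',
        'logs/screen-n-net.txt',   # nova-network
        'logs/screen-g-api.txt',   # glance
        'logs/screen-q-svc.txt',   # neutron
    ]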
Change-Id: Ia8f2cdedfc9964f1d9589fda253174e972fcc770
Instead of just listing which bugs were seen in an entire gerrit event
(multiple jenkins/zuul jobs), list which bugs were seen in which job.
If one of the jobs has an unrecognized error, don't display the comment
about running recheck; just list which bugs were seen on which jobs (and
which jobs have an unrecognized error).
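Roughly, the rendering this enables looks like (a sketch, names
hypothetical):

    def format_job_lines(bugs_by_job):
        # bugs_by_job: dict of job name -> set of bug numbers;
        # an empty set means the job failed with an unrecognized error
        lines = []
        for job, bugs in bugs_by_job.items():
            if bugs:
                lines.append('- %s: bug(s) %s' % (
                    job, ', '.join(str(b) for b in sorted(bugs))))
            else:
                lines.append('- %s: unrecognized error' % job)
        return lines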
Change-Id: I55b2eb8f0efe43ab22540294150d4bc9f5885510
We are starting to track a decent amount of data per zuul/jenkins job,
so track data in an object instead of assorted variables and
dictionaries. For example, bugs are now tracked per job rather than per
gerrit event. Now we can support reporting which bug caused which
specific job to fail. This also does some assorted object related
cleanups. This consists of internal changes only; a future patch will
make the gerrit and irc comments take advantage of this.
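A minimal sketch of the shape of such an object (not the exact class
in the tree):

    class FailJob(object):
        # everything we know about a single failed zuul/jenkins job
        def __init__(self, name, url):
            self.name = name
            self.url = url
            self.bugs = set()   # bugs now live on the job, not the event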
Change-Id: I2116cd0e10b45617a8d572b27f1672f695fa91d0
main in elasticRecheck was originally used for testing before the bot
was ready, but now that we have the bot working and it supports noirc
and no-gerrit-comment modes (tox -erun), there is no need to include a
main() here.
Change-Id: I6e1d790b78d2f2eafacd8efcaf132cf4479fe8ca
Always log the gerrit comment, and when running in nocomment just don't
send it to gerrit. This helps make testing changes to the gerrit comment
easier.
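In sketch form (helper names are hypothetical):

    msg = generate_comment(event)   # hypothetical helper
    log.info(msg)                   # always log the full comment
    if not nocomment:
        gerrit.review(event, msg)   # only send when commenting is on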
Change-Id: Ie26b86ed374d284154389b4bd5a86b9d2f365800
In preparation for providing a web page that will just show hits on the
gate queue, add a '-q queue' option to elastic-recheck-graph.
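The option might be wired up roughly like this:

    # hypothetical argparse wiring for the new flag
    parser.add_argument('-q', '--queue', default=None,
                        help='limit results to a single queue '
                             '(e.g. gate)')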
Change-Id: I9217a2ceedf86ffe04851084df78238384fccd51
Now that we are running this on all jobs (not just tempest) we are
getting significantly more IRC messages. Add the failed job name to logs
to provide more context about which job is failing. For unclassified
failures also include the queue (as an unclassified unit test failure in
the check queue is much less important than one in the gate).
Change-Id: I485bf06721fa5afd102b99b26e38f12449deec7b
When adding support for short build_uuid's in
I6356a971ca250ddf5f01a9734f13d0b080a62c89, event.bugs was converted to a
set, since we can now run classify multiple times on a single event and
don't want duplicate bugs. That patch didn't update the gerrit
comment-leaving capabilities to understand event.bugs as a set (instead
of a list).
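With a set, a sorted join keeps the comment output stable; a minimal
sketch:

    bug_urls = ['https://bugs.launchpad.net/bugs/%s' % b
                for b in sorted(event.bugs)]   # event.bugs is a set
    message = ' and '.join(bug_urls)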
Change-Id: I9032e23e0e53426a57bebf42f4c4d4167624280e
In addition to searching by change and patch, search by the short build_uuid.
This prevents us accidentally classifying multiple builds when we classify
a failure on gerrit. This can happen in the gate queue if there is a
gate reset, or if there are multiple 'recheck bug x' on a single patch
revision in the check queue.
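The effect is to narrow the search from a whole patch revision down to
one build, along these lines (the exact field name is an assumption):

    query = ('change:"%s" AND patchset:"%s" AND build_short_uuid:"%s"'
             % (change, patchset, short_uuid))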
Change-Id: I6356a971ca250ddf5f01a9734f13d0b080a62c89
instead of passing around complex data structures, create an
event object for our purposes that means we can pass around the
payload relevant to us. This simplifies some things, and will make
adding build_uuid support cleaner.
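A rough sketch of the idea (attribute names assumed, payload layout per
the gerrit event stream json):

    class Event(object):
        # thin wrapper exposing just the fields we care about
        def __init__(self, payload):
            self.payload = payload
            self.change = payload['change']['number']
            self.rev = payload['patchSet']['number']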
Change-Id: I8172b25ae3c60e38d63cf7f4d8a0f6c854bae766
we have been timing out on logs a lot, and not noticing. Redo this
logic to be exception based so we can tell the IRC channel when we
timeout on logs, to get to the bottom of reliability issues with
indexing logstash data.
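Sketch of the shape of the new logic (exception and helper names are
hypothetical):

    class ResultTimedOut(Exception):
        pass

    try:
        results = wait_for_logs(event)
    except ResultTimedOut:
        # surface the timeout instead of silently moving on
        ircbot.send(channel, 'timed out waiting for logs: %s' % event)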
Change-Id: Ia63d801235c6959eb7b97c334291a6d2f06411b6
this makes the er bot work with a saner set of default logs, and also
tells us how often we end up timing out.
It also makes the logs actually include timestamps.
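For the timestamps, something like the stdlib logging setup:

    import logging
    logging.basicConfig(
        format='%(asctime)s %(levelname)s %(name)s: %(message)s',
        level=logging.INFO)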
Change-Id: I29877c4158a84bd46b0a437a12c14450a049b49d
we only want to run on things we consider the "integrated" gate,
however, that's kind of a nebulous definition. Today a reasonable
heuristic is whether we are running the tempest full job, so use that.
This check could be enhanced in the future.
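In code the heuristic is roughly (the exact job name matching is an
assumption):

    def is_integrated_gate(job_names):
        # if tempest full ran, treat this as the integrated gate
        return any('tempest' in name and 'full' in name
                   for name in job_names)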
Change-Id: Iad36d330f8f6db3bbaa0c54a0c8e70b0e01a17b6
this changes the interface to move the readiness check out of
the classifier and into the stream object. This massively
simplifies the logic connecting these pieces, as classifier is
now just a thin wrapper to elastic search.
This also adds unit testing for the stream processing through the
creation of a fake_gerrit mock class. That lets us run gerrit
event interactions in a sane way.
It also drops all the unit testing for the classifier, which is now
largely useless because all it tests is that we can execute a for loop.
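The fake can be as small as (a sketch, not the exact mock class):

    class FakeGerrit(object):
        # feeds canned events to the Stream instead of a live server
        def __init__(self, events):
            self._events = list(events)

        def getEvent(self):
            return self._events.pop(0)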
Change-Id: I1971c121276412e31f01eb5680b9c41fc7e442d3
one of the big issues today with er is the amount of coupling between
the bot and the classifier around knowing when jobs are ready. The
impact of this is that we often incorrectly determine when jobs are
ready, because the small set of files we test for isn't right for
various jobs.
This is the beginning of decoupling that. By parsing the job names
that have failed in the jenkins failure message we can move all
the readiness checking into the Stream.
This commit adds the parsing and the unit tests, though it doesn't
actually change behavior to use it yet (next patch).
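The parsing can be sketched with a regex over the comment body (the
exact comment format shown is an assumption):

    import re

    # matches lines like:
    #   - gate-tempest-devstack-vm-full http://... : FAILURE
    FAILED_JOB = re.compile(
        r'^- (?P<job>\S+) (?P<url>\S+) : FAILURE', re.MULTILINE)

    def failed_jobs(comment):
        return [m.group('job') for m in FAILED_JOB.finditer(comment)]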
Change-Id: I54ffa3495a36c2d61b1824794a672c8f5552df54
Add a resolved_at attribute in the query yaml files
that can be used to mark when a bug has been
fixed or does not occur any more. This can help us
re-enable bugs quickly when we see them again.
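For example, a query file might carry the marker like this
(illustrative content):

    query: >
      message:"some error signature"
    resolved_at: 2013-11-20  # fixed; keep the query for quick re-enable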
Change-Id: I7af7ce9417eec5ff9ecc2487a920ff9d1286a714
Job names are about to change in infra/config. Be a little more
robust (but still, this is fragile).
Change-Id: I882de80dbb02aad68ef7b41095f36db2c7ebec49
In the land of random cleanups, let more of the whitespace rules
back in. Also explicitly exclude E125 because of the overreach,
and leave E123 excluded because it creates some kind of odd
artifacts in the current code (possibly clean it up later).
tox.ini is adjusted with comments noting that what we are ignoring is
ignored for a reason.
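Roughly, the resulting tox.ini section (illustrative):

    [flake8]
    # E123 excluded: creates odd artifacts in the current code
    # E125 excluded: overreaches
    ignore = E123,E125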
Change-Id: I5636cb646d7898df71b715aa0e32a68ce279ee80
extract out methods for readability, so the code has logical flow
and the details about each conditional can be encapsulated in its
own method.
Change-Id: I5b62842346e0e3774d8e0586ff6b2c6969602a07
elastic_recheck started off life ignoring the 80 column boundary.
We should stop that, as it's bad form. Also, I do multi column
emacs and it blows my column widths.
So fix all the E501 issues and start enforcing the rules in tox.
Change-Id: Ib0a1d48d085d9b21fbc1bab75e93e9cc40d36988
this handles the piece of work we've been talking about for a while
in moving the queries.yaml file into a directory with a bunch of
files. These remain yaml so that they can be tagged with additional
metadata. This would support the concept of soft deleting, as well as
other useful metadata for gauging the evolution of the bugs we track
over time.
This should see some real review as it's extensive enough of a
change that the existing tests might not be sufficient. However it
should be enough to move this forward quite a bit.
This also makes forward-looking statements about doing soft deletes
with a resolved_at keyword; that implementation will come later.
Change-Id: I86317fcf6f1886ab5b6c0ee154b29e71865c52b7
I was confused by the code review message, as I thought a recheck
was automatically kicked off. Make it clearer that I need to do
this manually.
Change-Id: I21497c6ae54c44b746375e6473b8501c99776451
as part of trying to simplify the core elasticRecheck, refactor
the query creation into a separate set of query_builder routines.
This takes away some of the duplication between the queries, and
documents the intended use of each of them.
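In the spirit of (function names and query strings illustrative):

    def result_ready(change, patchset):
        # query used to decide if a job's logs are indexed yet
        return 'change:"%s" AND patchset:"%s"' % (change, patchset)

    def single_patch(query, change, patchset):
        # scope a bug query down to one patch revision
        return '%s AND change:"%s" AND patchset:"%s"' % (
            query, change, patchset)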
add elasticRecheck fake pyelasticsearch testing
build basic fixtures for unit testing that let us fake out the
interaction to pyelasticsearch. This uses the json samples added
for previous testing as the return results should an inbound
query match one of the queries we know about.
If the query is unknown to us, return an empty result set. Unit tests
for both cases are included, going all the way from the top-level
Classifier class.
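The fake can be sketched as (a sketch, not the exact fixture):

    class FakePyElasticSearch(object):
        # return a canned json sample for known queries,
        # an empty result set for everything else
        def __init__(self, samples):
            self._samples = samples  # dict: query string -> loaded json

        def search(self, query, **kwargs):
            return self._samples.get(
                query, {'hits': {'hits': [], 'total': 0}, 'took': 1})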
Change-Id: I0d23b649274b31e8f281aaac588c4c6113a11a47
in an attempt for long term simplification of the source tree, this
is the beginning of a ResultSet and Hit object type. The ResultSet
is constructed from the ElasticSearch returned json structure, and
it builds hits internally.
ResultSet is an iterator, and indexable, so that you can easily loop
through them. Both ResultSet and Hit objects have dynamic attributes
to make accessing the deep data structures easier (and without having
to make everything explicit), and also to handle the multiline collapse
correctly.
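A minimal sketch of the shape (not the full implementation):

    class Hit(object):
        def __init__(self, hit):
            self._hit = hit

        def __getattr__(self, attr):
            # dynamic attributes reach into the nested _source data
            return self._hit['_source'][attr]

    class ResultSet(object):
        # iterable and indexable wrapper over the raw ES json
        def __init__(self, results):
            self._results = results
            self.hits = [Hit(h) for h in results['hits']['hits']]

        def __getitem__(self, key):
            return self.hits[key]

        def __len__(self):
            return len(self.hits)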
A basic set of tests is included, as well as sample json dumps for all
the current bugs in the system for additional unit testing. Fortunately
this includes bugs which have hits, and those that don't.
In order to use ResultSet we need to pass everything through
our own SearchEngine object, so we get results back as expected.
We also need to teach ResultSet about facets, as those get used
when attempting to find specific files.
Lastly, we need __len__ implementation for ResultSet to support
the wait loop correctly.
ResultSet lets us simplify a bit of the code in elasticRecheck,
port it over.
There is a short term fix in the test_classifier test to get us
working here until real stub data can be applied.
Change-Id: I7b0d47a8802dcf6e6c052f137b5f9494b1b99501
* elastic_recheck/elasticRecheck.py: Update templated queries to use non
'@' prefixed fields and flatten the old '@fields' field. This is
possible because a query for foo_field will find foo_field and
@fields.foo_field. Also, handle the case where @fields may not be
present in the query results.
* queries.yaml: Update queries using the same rules as in
elasticRecheck.py (an example of the rewrite is shown below).
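An illustrative before/after:

    before: @fields.build_status:"FAILURE" AND @fields.build_name:"foo"
    after:  build_status:"FAILURE" AND build_name:"foo"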
Change-Id: I48672912d05c7ad557e948cfef0402c7c89582f6
* elastic_recheck/elasticRecheck.py: There was a comma missing in the
REQUIRED_FILES list that caused the cinder volume log file and syslog
log file names to be appended together. Add the comma to fix the list.
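The failure mode is Python's implicit string literal concatenation; a
minimal illustration (paths assumed):

    REQUIRED_FILES = [
        'logs/screen-c-vol.txt'  # missing comma silently produces
        'logs/syslog.txt',       # 'logs/screen-c-vol.txtlogs/syslog.txt'
    ]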
Change-Id: I6aaf745f996e725c529ccd9f8b7444d8b9a5648f
First syslog based query, using it to get to the swift proxy-server logs.
Add log/syslog.txt to required files list as well.
Change-Id: I6f3090efe4945efcd67b53b89c1b64bc1db3afa7
previously when we had multiple bugs we did looped string appends,
but that meant we had a trailing "and", which was ugly. We can
do better by transforming bugs to bug_urls, then using join.
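The pattern, in short:

    bug_urls = ['https://bugs.launchpad.net/bugs/%s' % b for b in bugs]
    message = ' and '.join(bug_urls)   # no trailing "and"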
Change-Id: Iaf28dbe9909c60b1e2206a79faaf5190f792252d
* elastic_recheck/elasticRecheck.py: When a single bug is found be sure
to pass that single bug to the string formatter rather than an undefined
variable. This fixes a bug that caused elastic-recheck's Stream to die
previously.
Change-Id: Ie62abde1b571fa2b42b95519fc5c23e0199f732d
Move test code into tests.
Remove the last_failures test, as it's replaced by other tests now.
Remove dead code.
Change-Id: I3514f62e003b1140fbe597cc91aea3089c268ac7
this adds a tool that runs through the query list and checks whether
the queries also match successful runs in logstash. This helps us see
which queries need to be looked at for narrowing.
make elastic-recheck-success the entry point when installed
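Roughly, in pbr's setup.cfg (the module path here is a guess):

    [entry_points]
    console_scripts =
        elastic-recheck-success = elastic_recheck.cmd.check_success:main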
Change-Id: I3eaa822af35146935b22100ffb1e3a4f18dc8d0e
Now that ElasticSearch isn't backed way up, using a while True is
dangerous, because if something breaks for an individual tempest
failure the entire system will hang.
Even if something breaks in ElasticSearch we want elastic-recheck to
recover without needing to be restarted.
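Sketch of the bounded loop (constants illustrative):

    import time

    MAX_ATTEMPTS = 30
    SLEEP_TIME = 40

    for attempt in range(MAX_ATTEMPTS):   # was: while True
        results = es.search(query)
        if len(results):
            break
        time.sleep(SLEEP_TIME)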
Update test_classifier; unfortunately it uses logstash.o.o, which
removes results every two weeks, so the test needs updating to work.
Change-Id: I119bb3d1ef814aabd393e65af97f851a54895985
This commit adds support for a test failure having more than one
bug match. Since there is normally more than one tempest run for each
commit, there is the potential for multiple failures.
Change-Id: Ibd0a5e3c7ec64732b41186400da2af6cd4658fdd
And some other fixups around starting the daemon (see the sketch after
this list):
* read config file before forking
* add '-d' option to avoid forking
* default pidfile to /var/run/elastic-recheck/elastic-recheck.pid
* add pidfile option to config file
* switch to python-daemon library (which is the version of the
lib that the code was expecting anyway)
* use expanduser in the query file path (to match the rest of the
paths)
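A sketch of the startup with python-daemon (the module path for the pid
lockfile varies across python-daemon versions):

    import daemon
    import daemon.pidfile

    pid = daemon.pidfile.TimeoutPIDLockFile(
        '/var/run/elastic-recheck/elastic-recheck.pid', 10)
    with daemon.DaemonContext(pidfile=pid):
        main()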
Change-Id: I674778ef189cd216a80f74bd449cdc3b12b57a7d
It is easier for a human to read, and by virtue of not requiring
escaped quotes, easier to copy/paste into a logstash field.
When copy/pasting, the newlines won't show up in the input field.
The '>' syntax in YAML indicates folding, which causes the newline
and indentation to be turned into a single space.
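For example (query content illustrative):

    query: >
      message:"some error" AND
      filename:"console.html"

    # loads as the single line:
    #   message:"some error" AND filename:"console.html"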
Change-Id: Ibd172fd4859c055096609f31ef09222147c34cf3