We just finished migrating all of our monitoring from InfluxDB to Prometheus and I thought I'd write up our reasons for the change. Please note that these are my own personal observations and relate to a specific project, these issue may not apply to you and you should evaluate each product for your own uses.
Update: To clarify, the versions of InfluxDB and Prometheus that I am talking about are InfluxDB 1.1.1 and Prometheus 1.5.2.
Push vs Pull
- InfluxDB is a push based system, i.e. your running application needs to actively push data into the monitoring system.
- Prometheus is a pull based system, the Prometheus server fetches the metrics values from the running application periodically.
With centralized control of how polling is done with Prometheus I can switch from polling every minute to every 10 seconds just by adjusting the configuration of the Prometheus server. With InfluxDB I would have to redeploy every application with a change to how often they should push metrics. In addition the Prometheus pull method allows Prometheus to create and offer a synthetic "UP" metric that monitors whether an application is up and running. For short lived applications Prometheus has a push gateway.
- InfluxDB has a monolithic database for both metric values and indices.
- Prometheus uses LevelDB for indices, but each metric is stored in its own file.
Both use key/value datastores, but how they use them is very different and it affects the performance of the products. InfluxDB was slower and took up substantially more disk space than Prometheus for the same exact set of metics. Just starting up InfluxDB and sending a small number of metrics to it caused the datastore to grow to 1GB, and then grow rapidly from there to 100's of GB for our full set of metrics, while Prometheus has yet to crack 10GB with all of our metrics. And let's not even go into the number of times InfluxDB lost all of our data, either from a crash or from a failed attempt to upgrade the version of InfluxDB that we were running.
Update: I was also reminded there's another datastore related issue with startup time, while Prometheus starts in a matter of seconds, InfluxDB would regularly take 5 minutes to restart while it either validated or rebuilt its indices and would not collect data during the entire process.
Probably closely related to the datastore efficiency, InfluxDB was coming close to maxing out the server it was running on, while Prometheus running on an identical instance peaks at maybe 0.2 load.
- InfluxDB uses a variant of SQL.
- Uses a substantially simpler and more direct querying model.
What would you rather type?
SELECT * FROM "cpu_load_short" WHERE "value" > 0.9
cpu_load_short > 0.9
- Configuration is done through a mix of config files and SQL commands sent to the server.
- Text files.
Prometheus config is simply YAML files, and the entire config is done via files. With InfluxDB you have to worry that some of the config, for example, creating the named database that metrics are to be stored in, actually gets done. Additionally Prometheus just picks more reasonable defaults, for example, it defaults to only storing data for 15 days, while InfluxDB defaults to storing all data forever, and if you don't want to store all data forever you need to construct an SQL command to send to the server to control how data is retained.