A Lightweight Logging System

Background

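Tired of how many resources the Elastic Stack consumes?

Tired of how complicated the Elastic Stack is to configure?

Tired of how hard the Elastic Stack is to maintain?

Here is the lightweight alternative I put together for myself:

Loki + Vector + Grafana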

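A quick word on why I built this. Both my NAS and my router can forward their logs to a syslog server. The NAS reports odd errors fairly often, and if one of them turns out to be a disk problem, that is a lot scarier.

So I want all of the logs collected in one place for analysis and alerting. The collector cannot live on the NAS itself, though: if the NAS virtual machine crashes or reboots, some logs are lost. Instead, I run the collector on a friend's machine and ship the logs there remotely.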

The Three Components

Loki

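Loki is a log aggregation system from Grafana Labs, inspired by Prometheus. It is not a columnar database: it indexes only the labels attached to each log stream and stores the log lines themselves as compressed chunks.

Compared with Elasticsearch it is far lighter, especially in memory usage.

Its storage can also be pointed at S3 through configuration, which gives the logs an off-machine copy as well.

(ClickHouse is an alternative to Loki. For a lightweight setup like this one, either choice works.)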

1. install

The simplest way to start it:

./loki -config.file=/opt/loki/loki-local-config.yaml

Or set it up as a service managed by systemd:

[Unit]
Description=Like Prometheus, but for logs.
After=syslog.target network.target remote-fs.target nss-lookup.target
 
[Service]
Type=simple
WorkingDirectory=/opt/loki
User=root
ExecStart=/bin/sh -c "./loki -config.file=/opt/loki/loki-local-config.yaml"
PrivateTmp=true
Restart=on-failure

[Install]
WantedBy=multi-user.target
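
Once the service is started, it helps to confirm that Loki has finished booting before pointing anything at it. The sketch below polls Loki's `/ready` HTTP endpoint. The helper names `loki_is_ready` and `wait_until` are made up for this post, and the address assumes the default `http_listen_port` of 3100 on the same machine.

```python
import time
import urllib.error
import urllib.request


def loki_is_ready(base_url: str) -> bool:
    """Return True if GET <base_url>/ready answers HTTP 200."""
    url = base_url.rstrip("/") + "/ready"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer while still starting.
        return False


def wait_until(check, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or timeout_s has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until(lambda: loki_is_ready("http://localhost:3100"))` would then block for up to 30 seconds and report whether Loki came up.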

2. config

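Example configuration (it keeps everything under /tmp/loki, which is fine for a trial run, but /tmp is typically cleared on reboot, so point these paths at a persistent directory for real use):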

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/analytics/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
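
To check that this configuration accepts logs before Vector is involved, one option is to push a single line straight to Loki's HTTP push endpoint (`/loki/api/v1/push`). This is only a sketch: the function names and the `smoke-test` label value are invented for illustration, and the base URL assumes the port 3100 from the config above.

```python
import json
import time
import urllib.request
from typing import Optional


def build_push_payload(line: str, labels: dict, ts_ns: Optional[int] = None) -> dict:
    """Build the JSON body for Loki's push endpoint: one stream, one entry."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    # Each entry is [<unix time in nanoseconds, as a string>, <log line>].
    return {"streams": [{"stream": dict(labels), "values": [[str(ts_ns), line]]}]}


def push(base_url: str, payload: dict) -> int:
    """POST the payload to Loki and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status  # a 2xx status (typically 204) means the line was accepted
```

For example, `push("http://localhost:3100", build_push_payload("hello loki", {"source": "smoke-test"}))` should make the line show up under `{source="smoke-test"}`.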

Vector

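Vector is a vendor-neutral log collector that supports a wide range of sources and an equally wide range of sinks.

It is far lighter than Logstash, most noticeably at startup (in my experience Logstash needs about a minute and a half just to start).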

1. install

You can install it through your package manager, or install it manually.

Start command:

./vector --config /opt/vector/config/syslog-vector.toml

Or set it up as a service:

[Unit]
Description=Vector
Documentation=https://vector.dev
After=network-online.target
Requires=network-online.target

[Service]
Type=simple
WorkingDirectory=/opt/vector/bin
User=root
ExecStart=/bin/sh -c "./vector --config /opt/vector/config/syslog-vector.toml"
ExecReload=/bin/kill -HUP $MAINPID
PrivateTmp=true
Restart=on-failure
# Since systemd 229, should be in [Unit] but in order to support systemd <229,
# it is also supported to have it here.
StartLimitInterval=10
StartLimitBurst=5

[Install]
WantedBy=multi-user.target

2. config

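Note that the configuration file is in TOML format.

The configuration depends on your use case. My example receives syslog over UDP port 514, encodes each event as JSON, and sends it to Loki.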

# Receive syslog messages on UDP port 514
[sources.remote_udp_syslog_1]
type = "syslog"
address = "0.0.0.0:514"
mode = "udp"

# Parse the syslog messages
# See the Vector Remap Language reference for more info: https://vrl.dev
[transforms.parse_logs]
type = "remap"
inputs = ["remote_udp_syslog_1"]
metric_tag_values = "full"
source = '''
. |= parse_syslog!(.message)
'''

# Print to the console (for testing); optional
[sinks.print]
type = "console"
inputs = ["parse_logs"]
encoding.codec = "json"

# Send to Loki
[sinks.loki_sink_1]
type = "loki"
inputs = [ "parse_logs" ]
endpoint = "http://localhost:3100"
# Encode as JSON
encoding.codec = "json"
encoding.metric_tag_values = "full"
    [sinks.loki_sink_1.labels]
        source = 'syslog'
    [sinks.loki_sink_1.healthcheck]
        enabled = false
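
A quick way to exercise this pipeline end to end is to send one hand-made syslog line to the UDP listener and watch it appear on the console sink. The sketch below builds a BSD-style (RFC 3164) message. The hostname `nas`, the tag `smoke-test`, and both function names are placeholders invented for this example. The priority value is `facility * 8 + severity`.

```python
import socket
from datetime import datetime

_MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_rfc3164(message, hostname="nas", app="smoke-test",
                   facility=1, severity=6, now=None):
    """Format a BSD-style syslog line: <PRI>Mmm dd hh:mm:ss host tag: message."""
    now = now or datetime.now()
    pri = facility * 8 + severity  # facility 1 (user) + severity 6 (info) = 14
    # RFC 3164 pads single-digit days with a space, not a zero.
    timestamp = f"{_MONTHS[now.month - 1]} {now.day:2d} {now:%H:%M:%S}"
    return f"<{pri}>{timestamp} {hostname} {app}: {message}"


def send_udp(line, host="127.0.0.1", port=514):
    """Send one syslog line as a single UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

Calling `send_udp(format_rfc3164("disk check ok"))` on the collector machine sends one test event. UDP gives no delivery confirmation, so the console sink or a Loki query is the only way to tell whether it arrived.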

Grafana

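Grafana is a long-established visualization and monitoring platform, one of the most widely used in the industry, and it is not heavy. Another reason I picked it is that it comes from the same company as Loki.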

1. install

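How and where you install the dashboard matters little: as long as it can reach the data source (Loki), the minimum requirement is met.

I host it on the NAS, so I install it with Docker (Compose).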

services:
  grafana:
    image: docker.m.daocloud.io/grafana/grafana-enterprise
    container_name: grafana
    restart: unless-stopped
    # if you are running as root then set it to 0
    # else find the right id with the id -u command
    user: '0'
    ports:
      - '3300:3000'
    # adding the mount volume point which we create earlier
    volumes:
      - '$PWD/data:/var/lib/grafana'
    # To reach another server on the LAN, map it to a hostname here; Grafana can then connect to it as `host.log.server`
    extra_hosts:
      - 'host.log.server:100.100.1.2'

2. config

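There is nothing special to watch out for in the configuration, but here are a few useful Loki queries.

For dashboards that break the data down along several dimensions, open the panel editor, add an Extract fields transformation under Transform data, select labels as the source, and set the format to JSON.

Count the number of log lines in each time interval: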
```
sum(count_over_time({source="syslog"} [$__auto]))
```

Build a pie chart by log level:

sum by(detected_level) (count_over_time({source="syslog"} | json | detected_level != `` | __error__=`` [$__auto]))

Fetch the logs sent by the router (parsed as JSON):
```
{source="syslog"} | json | host = `router`
```
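
The same log queries can be run outside Grafana through Loki's HTTP API, which is handy for scripts and ad hoc checks. Here is a minimal sketch against the `/loki/api/v1/query_range` endpoint. The function names are invented for this post, and the address `http://host.log.server:3100` simply combines the hostname mapped in the Compose file with Loki's HTTP port, so adjust it to your setup.

```python
import json
import urllib.parse
import urllib.request


def build_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a query_range URL; start and end are unix times in nanoseconds."""
    params = urllib.parse.urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return base_url.rstrip("/") + "/loki/api/v1/query_range?" + params


def fetch_lines(url):
    """Run a log query and return the matching log lines as a flat list."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    # For log queries each result is a stream whose "values" are [timestamp, line] pairs.
    return [line for stream in body["data"]["result"] for _ts, line in stream["values"]]
```

For instance, passing the router query above together with `time.time_ns() - 3600 * 10**9` and `time.time_ns()` as the range would return up to 100 router lines from the last hour.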

Wayne's blog