Commit f95e657e authored by Lutra Conseil's avatar Lutra Conseil Committed by Rafael Monnerat

slapos.collect: Include collect basic documentation/information

parent 482189e0
Collecting Data
================
The "slapos node collect" command collects data from a computer taking a
few snapshot on different scopes and storing it (currently on sqllite3).
Scopes of Snapshots are:
- User Processes: Collects data from all user's process related to SlapOS (ie.: slapuser*)
- System Information: Collects data from the System Usage and Computer Hardware.
So on every slapos node collect calls (perfomed by cron on every minute), the
slapos stores the all snapshots for future analizes.
User's Processes Snapshot
==========================
Collect command search for all process launched by all users related to the
slapos [1]. After this, for each process it uses psutil (or similars tools) to
collect all available information for every process pid [2].
Once Collected, every Process information is stored on sqllite3 [3], in other
words, we have 1 line per pid for a giving time. It's used pid number and
process creation date for create a UID for the process, and it is omitted the
command name in order to annonymalize the data (so the risk of information
leak is reduced).
The measuring of process only consider CPU, memory and io operations (rw and
cycles), we are studying how to measure network (without be intrusive).
System Information Snapshot
============================
Those snapshots has 2 different goals, first is collect current load from existing
computer (cpu, memory, disk, network...) and the second goal is collect the
available resources the computer has installed [4].
We use 3 types of snapshots for determinate the load and the available resources
(all mostly use psutils to collect data):
- System Snapshot [5]: It collects general computer usage like CPU, Memory
and Network IO usage.
- Computer Snapshot [6]: It collects for now number of CPU cores and available
memory, however we wish to collect more details.
- Disk Snapshot [7]: It collects the informations related to the a disk
(1 snapshot per disk), which contains total, usage and
io informations.
"Real-time" Partial dump (Dygraph)
===================================
On every run, we dump data from the current day on csv [8] (2 axes), in order to
plot easily with dygraph, so there will be few files available like this:
- system_cpu_percent.csv
- system_disk_memory_free__dev_sda1.csv
- system_disk_memory_free__dev_sdb1.csv
- system_disk_memory_used__dev_sda1.csv
- system_disk_memory_used__dev_sdb1.csv
- system_loadavg.csv
- system_memory_free.csv
- system_memory_used.csv
- system_net_in_bytes.csv
- system_net_in_dropped.csv
- system_net_in_errors.csv
- system_net_out_bytes.csv
- system_net_out_dropped.csv
- system_net_out_errors.csv
All contains only information from computer usage, for global usage (for now). It
is perfectly acceptable keep a realtime copy in csv of the most recently data.
Logrotate
=========
Slapos collects contains its on log rotating policy [9] and gargabe collection [10].
- We dump in folders YYYY-MM-DD, all data which are not from the current day.
- Every table generates 1 csv with the date from the dumped day.
- All dumped data is marked as reported on sqllite (column reported)
- All data which are older them 3 days and it is already reported is removed.
- All folders which contains dumped data is compressed in a tar.gz file.
Data Structure
===============
The header of the CSVs are not included on the dumped file (it is probably a
mistake), but it corresponds to (same as columns on the sqllite) which can be
easily described like bellow [11]:
- user
partition (text)
pid (real)
process (text)
cpu_percent (real)
cpu_time (real)
cpu_num_threads (real)
memory_percent (real)
memory_rss (real)
io_rw_counter (real)
io_cycles_counter (real)
date (text)
time (text)
reported (integer)
- computer
cpu_num_core (real)
cpu_frequency (real
cpu_type (text)
memory_size (real)
memory_type (text)
partition_list (text)
date (text)
time (text)
reported (integer)
- system
loadavg (real)
cpu_percent (real)
memory_used (real)
memory_free (real)
net_in_bytes (real)
net_in_errors (real)
net_in_dropped (real)
net_out_bytes (real)
net_out_errors (real)
net_out_dropped (real)
date (text)
time (text)
reported (integer)
- disk
partition (text)
used (text)
free (text)
mountpoint (text)
date (text)
time (text)
reported (integer)
Probably a more formal way to collect data data can be introduced.
Download Collected Data
========================
Data is normally available on the server file system, we use a simple software
"slapmonitor" which can be deployed on any machine which allow us download via
HTTP the data.
Slapmonitor can be also used to determinate de availability of the machine (it
returns "OK" if accessed on his "/" address), and it servers the data on a url
like:
- https://<address>/ -> just return "OK"
- https://<address>/<secret hash>/server-log/ -> you can see all files
The slapmonitoring can be easily extented to include more sensors (like
temperature, benchmarks...) which normally requires more speficic software
configurations.
Planned Non core extensions and benchmarking
=============================================
It is planned to include 4 simple benchmarks measure machines performance
degradation overtime:
- CPU benchmark with Pystone
- SQL Benchmark on SQLlite (for now)
- Network Uplink Benchmark
- Network Download Benchmark
This part is not included or coded, but we intent to measure performance
degradation in future, to stop to allocate if the machine is working but
cannot mantain a minimal Service Quality (even if it is not looks like
overloaded).
Servers Availability
=====================
All servers contacts the slapos master on regular bases (several times a minute),
it is possible to determinate the general availability of a server by looking at
apache log using this script:
- http://git.erp5.org/gitweb/cloud-quote.git/blob/HEAD:/py/my.py
It produces a json like this:
- http://git.erp5.org/gitweb/cloud-quote.git/blob/HEAD:/data/stats.json
However, this is a bit draft and rudimentar to determinate problems on the
machine, as the machine completly "death" is rare, normally most of failures are
pure network problems or human/environmental problem (normally not depends of
the machine load).
[1] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/entity.py?js=1#l58
[2] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/snapshot.py?js=1#l37
[3] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/db.py?js=1#l130
[4] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/entity.py?js=1#l77
[5] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/snapshot.py?js=1#l62
[6] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/snapshot.py?js=1#l95
[7] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/snapshot.py?js=1#l81
[8] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/reporter.py?js=1#l75
[9] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/reporter.py?js=1
[10] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/db.py?js=1#l192
[11] http://git.erp5.org/gitweb/slapos.core.git/blob/HEAD:/slapos/collect/db.py?js=1#l39
\ No newline at end of file
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment