Traces - issues, capture, storage, manipulation

Olivier Aubert - www.olivieraubert.net

INFO5 course - 15/10/2025

Summary

Context: from physical to digital traces
Traces for Learning Analytics
Time-Series Databases

Context

Trace = sign of the past, inscription of a past event or process

Marginalia

Annotation - trace of a scholarly reading activity

Page of the Codex Oxoniensis Clarkianus 39 (Clarke Plato). Dialogue Gorgias. Public Domain

Diaries

CC BY 2.0

Weavings

Output from the performance "Le Déparleur" (Patrick Bernier and Olive Martin). Personal photograph - Olivier Aubert, CC BY-SA 4.0

Digital traces

Social network traces Source

Source

{
  "data": [
    {
      "id": "X999_Y999",
      "from": { "name": "Tom Brady", "id": "X12" },
      "message": "Looking forward to 2010!",
      "actions": [
        { "name": "Comment", "link": "http://www.facebook.com/X999/posts/Y999" },
        { "name": "Like", "link": "http://www.facebook.com/X999/posts/Y999" }
      ],
      "type": "status",
      "created_time": "2010-08-02T21:27:44+0000",
      "updated_time": "2010-08-02T21:27:44+0000"
    },
    {
      "id": "X998_Y998",
      "from": { "name": "Peyton Manning", "id": "X18" },
      "message": "Where's my contract?",
      "actions": [
        { "name": "Comment", "link": "https://www.facebook.com/X998/posts/Y998" },
        { "name": "Like", "link": "https://www.facebook.com/X998/posts/Y998" }
      ],
      "type": "status",
      "created_time": "2010-08-02T21:27:44+0000",
      "updated_time": "2010-08-02T21:27:44+0000"
    }
  ]
}

Digital traces - aggregation

Aggregated social network traces Source

Digital traces - consuming habits

Netflix part of Digital Advertising Alliance

In 2016:

video size (estimate): 25 PB
datahub (activity traces) size: 60 PB (for 89e6 accounts) [2023: 260e6 accounts]

(in 2024: "Managing up to 10 million writes per second")

Digital traces - health

Coronavirus Country Comparator - Source

Digital traces - learning environments

Learning Analytics dashboard Source

Digital traces - other subjects

Access logs (offices, servers), sensor logs (health, linky, IOT…)

Variety of digital traces

Many types of traces. Here we will focus on:

activity traces for learning analytics (xAPI)
sensor data (time series)

but there are also

e-mails
chat logs
revision control information
website analytics
server logs
…

Common features of traces

time dimension
trace of a past activity
- there is a subject
some data is collected

Digital traces

Many things automatically trackable at a low cost
Possible to add higher level/synthetic events/interpretations
Beware of the semantic gap between what can be observed and what can be interpreted
Beware of the data bias, interpretation bias and the illusion of completeness (full representation)

General issues

Issues/challenges in

privacy/ethics
capture
storage
visualisation
manipulation/analysis
interpretation

A word about ethics

"Une grande responsabilité est la suite inséparable d'un grand pouvoir." ("Great responsibility is the inseparable consequence of great power.") (Plan de travail, de surveillance et de correspondance, Comité de Salut Public, 1793)
Programmer’s responsibility/ethics
What do you do if faced with the task of implementing illegal/immoral software/processing?

User point of view

Wooclap survey time!

Ethics - Food for thought

Good to think about it before the issue arises…
Some reading ideas:
- (english) A Gift of Fire: Social, Legal, and Ethical Issues for Computing Technology
- (french) Manuel d’épistémologie pour l’ingénieur.e (available at the university library)
- (french) Quelle éthique pour l'ingénieur
- (french - video) L’éthique de l’ingénierie existe-t-elle ?
- A Framework for Ethical Decision Making

Some examples

Awful AI is a curated list tracking current scary usages of AI - hoping to raise awareness to its misuses in society
Mobile tracking
How it can go wrong
- US: These Companies Know When You're Pregnant—And They're Not Keeping It Secret
- US: DOGE / AI and the Future of American Politics (Bruce Schneier)

Some user-side solutions

Phone
- ExodusPrivacy
- TrackerControl
Computer
- cookie/ad/js blocking
- but browser fingerprinting and server-side tracking (IP…) remain possible
Real life
- video camera analysis jamming

Regulation - GDPR / RGPD

General Data Protection Regulation (GDPR) / Règlement Général sur la Protection des Données (RGPD)

Hypertextualized version
Applies from May 25th, 2018
Applies to any EU company, or any company processing data from EU citizens
Fine up to 20 M€ or 4% of the annual worldwide turnover
Applies to non-anonymized, personal data, with more constraints for sensitive data

GDPR principles 1/2

Explicit consent required (with some exceptions) and transparent information in any case.
Research exceptions
Right of Access, Right of Erasure, Right of Objection
Data portability
Data Protection Officer as a reference/contact person

GDPR principles 2/2

Responsibility and Accountability
Privacy by Design and by default, data minimization
Pseudonymisation encouraged
Data breach notification within 72 hours at most
Limited storage duration
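
Pseudonymisation can be illustrated by replacing a direct identifier with a keyed hash before a trace is stored. The sketch below is only an illustration, not a compliance recipe: the key `SECRET_KEY` and the helper `pseudonymise` are hypothetical names, and the key is assumed to be kept apart from the trace data.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be stored separately from the traces.
SECRET_KEY = b"change-me"

def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym for an identifier, using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()

alias = pseudonymise("aubert-o")
print(alias)
```

The same identifier always maps to the same pseudonym, so per-learner analyses remain possible. Whoever holds the key can recompute the mapping, which is why this is pseudonymisation and not anonymisation.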

Some GDPR application examples

Amazon €746 million (2021) - targeted ad system not based on free consent
WhatsApp €225 million (2021) - lack of transparent information
Facebook €17 million (2022) - data breach notification
Instagram €405 million (2022) - violation of children's privacy
Facebook €1200 million (2023) - transfer of personal data to the US

(see GDPR enforcement tracker)

Learning analytics

Definition

Learning analytics is the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs. Source

Interdisciplinary domain (data science, pedagogy, social sciences)

Learning Analytics Lifecycle

Khalil and Ebner, 2015

Application domains

analytics on interaction with the learning platforms
analytics around social interactions
analytics around learning content
analytics in different spaces (digital/f2f)

Uses

Real-time or asynchronous feedback
Reflexivity
Usage analysis
Reporting
Alerts
Recommendation
Document (re)design
…

Data analytics levels

Descriptive Analytics: What happened?
- Metrics, summaries, visualisations…
Diagnostic Analytics: Why did it happen?
- Outliers, patterns, PCA, correlations…
Predictive Analytics: What might happen if specific conditions occur?
- Probabilities, scenarios, predictive modelling…
Prescriptive Analytics: Which actions are best?
- Simulations, recommendation engines…

Source

Requirements for L.A.

Issue: traces are generated on a variety of platforms

we need a common model/protocol
- able to accommodate a variety of scenarios
we need to gather traces in a common place
- trace repositories (personal/shared) called LRS (Learning Record Stores)
depending on the nature of the trace data, there may be other constraints (frequency, volume)

And again, ethics

As mentioned before, ethics must always remain a primary consideration.

European Union Ethical guidelines on the use of artificial intelligence and data in teaching and learning for educators (2022)

Models/Protocols

ExperienceAPI
- xAPI (ex-TinCanAPI)
- SCORM successor
- much inspired by ActivityStreams W3C recommendation
Caliper (IMS Global)
- another approach (more metrics-oriented)

Experience API evolution

SCORM (2000) shortcomings
- need to be connected
- content must be imported and registered into the LMS before tracking
- LRS embedded in LMS
Call for research for evolution in 2010
TinCanAPI project, inspired by ActivityStreams
v 1.0 released in 2013
v 2.0 released in 2023 (IEEE standard)

Experience API data model

ExperienceAPI website - Reference spec

Activity events are recorded as Statements

Statement = (timestamp, actor, verb, object) [+ context] [+ result] [+ stored timestamp] [+ authority]
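
As a minimal sketch of the tuple above, a statement can be assembled as a plain dictionary before being serialized to JSON. The helper name `make_statement` is hypothetical; the `stored` and `authority` parts are left out because they are normally set by the LRS on reception.

```python
import json
import uuid
from datetime import datetime, timezone

def make_statement(actor, verb_id, verb_label, object_id, context=None):
    """Assemble an xAPI-style statement as a dict (hypothetical helper)."""
    statement = {
        "id": str(uuid.uuid4()),                              # statement identifier
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the event happened
        "actor": actor,
        "verb": {"id": verb_id, "display": {"en-US": verb_label}},
        "object": {"id": object_id},
    }
    if context is not None:
        statement["context"] = context
    return statement

statement = make_statement(
    {"name": "Olivier Aubert", "mbox": "mailto:contact@olivieraubert.net"},
    "http://adlnet.gov/expapi/verbs/experienced",
    "experienced",
    "https://olivieraubert.net/cours/gcn_stockage_traces",
)
print(json.dumps(statement, indent=2))
```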

xAPI Statement example

{ "timestamp": "2025-10-15T14:00:02.598441+01:00",
  "actor": { "name": "Olivier Aubert",
             "mbox": "mailto:contact@olivieraubert.net" },
  "verb": { "id": "https://activitystrea.ms/schema/1.0/present",
            "display": { "en-US": "presented" } },
  "object": { "id": "https://olivieraubert.net/cours/gcn_stockage_traces",
              "definition": {
                  "name": "Traces - issues, capture, storage, manipulation",
                  "type": "http://adlnet.gov/expapi/activities/lesson"
              }
            },
  "context": { "language": "fr",
               "extensions": {
                   "https://www.polytech.univ-nantes.fr/xapi/polytechRoom": "D004"
               }
             },
  "stored": "2025-10-15T14:00:03.954814+01:00"
}

Actor representation

Identified by exactly ONE of mbox, mbox_sha1sum, openid, account (homePage, name)

{
 "name": "Olivier Aubert",
 "mbox": "mailto:contact@olivieraubert.net"
}

{
 "name": "Olivier Aubert",
 "account": {
   "homePage": "https://orcid.org/",
   "name": "0000-0001-8204-1567"
 }
}

{
 "name": "Olivier Aubert",
 "account": {
   "homePage": "https://madoc.univ-nantes.fr/",
   "name": "aubert-o"
 }
}
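
A small sketch of how the identification rule could be checked on the client side, and of how an `mbox_sha1sum` can be derived from an `mbox` (the hex-encoded SHA-1 of the full `mailto:` IRI). The helper names `identifiers` and `mbox_sha1sum` are hypothetical and not part of any xAPI library.

```python
import hashlib

# The four identifying properties listed above.
IFI_KEYS = ("mbox", "mbox_sha1sum", "openid", "account")

def identifiers(agent: dict) -> list:
    """List the identifying properties present in an agent dict."""
    return [key for key in IFI_KEYS if key in agent]

def mbox_sha1sum(mbox: str) -> str:
    """Hex SHA-1 of the full mailto: IRI, usable in place of a clear mbox."""
    return hashlib.sha1(mbox.encode("utf-8")).hexdigest()

agent = {"name": "Olivier Aubert", "mbox": "mailto:contact@olivieraubert.net"}
hashed_agent = {"name": agent["name"], "mbox_sha1sum": mbox_sha1sum(agent["mbox"])}

print(identifiers(agent), identifiers(hashed_agent))
```

An agent carrying two of these properties at once (for instance both `mbox` and `account`) would return a list of length 2 and should be rejected before sending.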

Actor/Group representation

Agent + specify objectType and member list.

{
   "mbox": "mailto:info@tincanapi.com",
   "name": "Info at TinCanAPI.com",
   "objectType": "Group",
   "member": [
       {
           "mbox_sha1sum": "48010dcee68e9f9f4af7ff57569550e8b506a88d"
       },
       {
           "mbox_sha1sum": "9023723cde2d3efc5810dcee68e9f9f4af7ff575"
       },
       …
   ]
}

Verb representation

URI + display string

{
 "id": "http://adlnet.gov/expapi/verbs/experienced",
 "display": {
   "en-US": "experienced"
 }
}

From xAPI vocabulary & profile index (common vocabularies), many common ones come from ActivityStreams W3C recommendation.

In particular, CMI5 defines, among other things, a standard profile for learning content.

CMI5 - xAPI profile

xAPI is more generic than SCORM
CMI5 is an xAPI profile that normalizes vocabularies as well as other aspects (content launch, score reporting)
See Comparison of SCORM, xAPI and cmi5

Object representation

Normally an activity, but can also be a person, group or even another statement.

{ "id": "https://olivieraubert.net/cours/gcn_stockage_traces",
  "definition": {
    "name": "Traces - issues, capture, storage, manipulation",
    "type": "http://adlnet.gov/expapi/activities/lesson"
  }
}

{ "objectType": "Agent",
  "mbox":"mailto:test@example.com"
}

Context representation

Additional information about the activity context

"context": { "language": "fr",
             "extensions": {
                 "http://www.polytech.univ-nantes.fr/xapi/polytechRoom": "D004"
             }
           }

Result representation

Representation of a measured outcome.

score (Object): score of the Agent in relation to success/quality of the experience.
success (Boolean): Indicates whether or not the attempt on the Activity was successful.
completion (Boolean): Indicates whether or not the Activity was completed.
response (String): A response appropriately formatted for the given Activity.
duration (String): Period of time over which the Statement occurred.

Extensions

object, context and result can feature an “extensions” map
Custom vocabulary, key-value pairs where keys are URIs

"context": { "language": "fr",
             "extensions": {
                 "http://www.polytech.univ-nantes.fr/xapi/polytechRoom": "D004"
             }
           }

xAPI protocols

4 REST APIs (last 3: Document APIs)

Statement API: main API for statements
State API: scratch space in which arbitrary information can be stored in the context of an activity, agent, and registration (per user/per activity).
Agent Profile API: additional data against an agent profile (group, settings…)
Activity Profile API: additional data against an activity not specific to a user (collaboration activities, social interaction)

Example REST call (1/2)

Call (source)

POST https://v2.learninglocker.net/v1/data/xAPI/activities/state

URL Parameters

activityId: http://www.example.com/activities/1
stateId: http://www.example.com/states/1
agent: {"objectType": "Agent", "name": "John Smith",
        "account": {"name": "123",
                    "homePage": "http://www.example.com/users/"}}

Example REST call (2/2)

Headers

Authorization: Basic YOUR_BASIC_AUTH
X-Experience-API-Version: 1.0.0
Content-Type: application/json

Body

{
   "favourite": "It's a Wonderful Life",
   "cheesiest": "Mars Attacks"
}
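
The call above can be sketched with the Python standard library. This only builds the request object and does not send it; `build_state_request` is a hypothetical helper, and the key/secret values are placeholders standing in for YOUR_BASIC_AUTH.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://v2.learninglocker.net/v1/data/xAPI/activities/state"

def build_state_request(key: str, secret: str) -> Request:
    """Build (without sending) the State API request described above."""
    params = {
        "activityId": "http://www.example.com/activities/1",
        "stateId": "http://www.example.com/states/1",
        # The agent is passed as a JSON string in the query parameters.
        "agent": json.dumps({
            "objectType": "Agent",
            "name": "John Smith",
            "account": {"name": "123", "homePage": "http://www.example.com/users/"},
        }),
    }
    body = json.dumps({"favourite": "It's a Wonderful Life", "cheesiest": "Mars Attacks"})
    token = base64.b64encode(f"{key}:{secret}".encode("utf-8")).decode("ascii")
    return Request(
        f"{ENDPOINT}?{urlencode(params)}",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "X-Experience-API-Version": "1.0.0",
            "Content-Type": "application/json",
        },
    )

request = build_state_request("my-key", "my-secret")  # placeholder credentials
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen would actually send it.
```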

Trace repositories

Learning Record Store
- personal or shared
- integrated into or external to the LMS
Some open-source solutions: TraxLRS, SQLRS

Example dashboard

From Veracity

TraxLRS

Open-Source LRS
PHP - MySQL/PostgreSQL based
Offers trace storage
Requires additional components (like Kibana) or paid version for visualisation/analysis

Example application

eFiL project

Source: eFiL

Time-series databases

Time Series Definition

A Time Series is

a collection of observations or data points obtained by repeated measure over time
measurements often happen in equal intervals
measurement is well defined (who measures what)
time is an illusion, any sequence can be seen as a time series

Use cases 1/2

Computer system monitoring of various measures (processor, network load, disk usage…) for
- Continuous monitoring
- Prediction for future possible events (storage limit, etc)
- Post-mortem analysis of the causes of events (mostly failures)

Use cases 2/2

Finance
- Observing trends of stock prices
- Handling bank accounts history
IOT / Industry / Health / Aeronautics / Telecom / Agriculture / Defence / Smart City / Insurance…
- Storing/analysing sensor measures
- Continuous measures and evaluation
- Warning when measurements deviate from the norm
- Comparisons/trends
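
"Warning when measurements deviate from the norm" can be illustrated with a very simple rule: flag the points lying more than a given number of standard deviations away from the mean. This is only one possible definition of "the norm"; the function name `deviating_points`, the threshold and the sample values are assumptions made for the example.

```python
from statistics import mean, pstdev

def deviating_points(values, threshold=2.0):
    """Return (index, value) pairs further than `threshold` std deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:  # constant series: nothing deviates
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

temperatures = [20.1, 20.3, 19.9, 20.0, 20.2, 35.0, 20.1, 19.8]  # made-up readings
alerts = deviating_points(temperatures)
print(alerts)
```

On a live stream the mean and deviation would rather be computed over a sliding window than over the whole series.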

Example: server monitoring

Example: electrical consumption

Source: Analyze your electrical consumption using Warp10

Example: geospatial time series

Source: Traffic data for Smart Cities

Definition of Time Series Data

Time series data can be defined as:

a sequence of numbers representing the measurements of a variable at time intervals.
identifiable by a source name or id and a metric name or id.
possibly associated with tags
consisting of {timestamp , value} tuples
- value: float most of the time, but can be any datatype
raw data is immutable, unique and sortable
possible extension: Geo-TimeSeries

Conventional database approach

Time Series can be stored in conventional databases (relational) - see "Point Based data model" - slide 41 of Temporal Databases lecture

series-id	timestamp	value
s01	00:50:37	2.56
s02	00:53:53	3.12
s01	00:56:52	4.42
s02	01:00:16	3.23
s01	01:03:32	5.20
s01	01:06:24	6.20
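
The point-based table above can be reproduced in a relational engine. A minimal sketch using SQLite in memory; the table name `points` and the column names are assumptions made for the example.

```python
import sqlite3

rows = [
    ("s01", "00:50:37", 2.56),
    ("s02", "00:53:53", 3.12),
    ("s01", "00:56:52", 4.42),
    ("s02", "01:00:16", 3.23),
    ("s01", "01:03:32", 5.20),
    ("s01", "01:06:24", 6.20),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE points ("
    " series_id TEXT, ts TEXT, value REAL,"
    " PRIMARY KEY (series_id, ts))"  # one value per series and timestamp
)
conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)

# Number of points and maximum value for each series
summary = conn.execute(
    "SELECT series_id, COUNT(*), MAX(value) FROM points"
    " GROUP BY series_id ORDER BY series_id"
).fetchall()

# Time-range query on a single series
recent = conn.execute(
    "SELECT ts, value FROM points"
    " WHERE series_id = ? AND ts >= ? ORDER BY ts",
    ("s01", "01:00:00"),
).fetchall()

print(summary)
print(recent)
```

Every insert updates the primary-key index, which is one of the costs that grows with write frequency.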

Database issues 1/2

scalability issues
- volume (# of sensors, # of measures)
- frequency (1Hz -> 86400 measures/day, 1 MHz sensor (acoustics) -> 1e6 measures/second)
- common workload in time series : millions of entries per second
  - OVH example : 1.5M datapoints/s, 24/7, peaks at ~10M datapoints/s
- frequent write means frequent index updates/memory cache hitting - storage characteristics matter, see interesting discussion on Timescale blog.
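
The orders of magnitude above can be checked with a few lines of arithmetic. The 16 bytes per point (8-byte timestamp + 8-byte float, no index, no compression) is an assumption made for the estimate, not a figure from any particular database.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def points_per_day(frequency_hz, sensors=1):
    """Number of measures produced per day at a given sampling frequency."""
    return frequency_hz * SECONDS_PER_DAY * sensors

def raw_bytes_per_day(frequency_hz, sensors=1, bytes_per_point=16):
    """Raw volume per day, assuming a fixed size per point (assumption: 16 bytes)."""
    return points_per_day(frequency_hz, sensors) * bytes_per_point

print(points_per_day(1))                      # one 1 Hz sensor
print(points_per_day(1_000_000))              # one 1 MHz sensor
print(raw_bytes_per_day(1_000_000) / 1e12)    # same sensor, in TB per day
```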

Database issues 2/2

query/analysis/transformation
- expressivity issues
- query performance
- query characteristics (large batches, downsampling…)
- uncommon analyses

⇒ need for specialized time series databases

Other database issues

Warm vs cold data

Storage/archival of cold data/timeseries.

how to efficiently store it?
how to efficiently query it?

Example from "Time Series et santé: datalogger médical 100hz, retour d'expérience après un an" (time series and health: a 100 Hz medical datalogger, lessons learned after one year):

2-year capture of intracranial pressure (100Hz) -> 10e9 datapoints
use of HFile (a compressed format for Warp10) for archiving the data

Which database for which usage?

Global categories of databases:

relational
document stores (complex structures with different fields)
column stores (key-value)
object-oriented
graph (relationships)
triple stores (semantic database)
Time Series
…

Time-Series Database

A TSDB system is

a container for a collection of multiple time series
a software system optimized for storing and querying arrays of numbers/values indexed by time, datetime or datetime range
specialized for handling/processing time series data, taking into account their characteristic workload

Characteristic write workload

write-mostly is the norm (95% to 99% of all workload)
writes are almost always sequential appends
writes to distant past or distant future are extremely rare
updates are rare
deletes happen in bulk

Characteristic read workload

reads happen rarely
reads are usually much larger than available memory (need for server-side processing)
multiple reads are usually sequential, ascending or descending
reads of multiple series and concurrent reads are common (batch reading)

Characteristic usage

(live) monitoring
explore/understand data
data aggregation
outlier/anomaly detection
forecasting
…

TSDB Designs

Based on these characteristics

proper internal representation of time series
distributed database options allow for more scalability than monolithic solutions
server-side query processing is necessary
memory caching/optimization

Storage implementation

May be based on existing DBMS (Cassandra, HBase, CouchDB…), relational DB, Event stores (Kafka) or even Parquet files – for storing data or metadata.
May use its own data format (Time Structured Merge Tree for InfluxDB)
May focus on compression to store more data in memory (Gorilla by Facebook)

TSDB Design - Wide tables

One row/doc per time period, columns are samples

series-id	start	t+1	t+2	t+3	…
s01	00:00:10	2.56	3.12	4.42	…
s02	00:00:10	4.12	5.12	6.12	…
s01	00:00:20	4.23	4.44	4.76	…

…
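
A minimal sketch of the wide-table idea: individual points are grouped into one row per series and per time period, the columns being the offsets inside the period. The function name `to_wide_rows`, the 10-second period and the sample points are assumptions made for the example.

```python
from collections import defaultdict

def to_wide_rows(points, period=10):
    """Group (series_id, t_seconds, value) points into one row per series and period."""
    rows = defaultdict(dict)
    for series_id, t, value in points:
        start = t - t % period                       # beginning of the period containing t
        rows[(series_id, start)][t - start] = value  # column = offset inside the period
    return dict(rows)

points = [
    ("s01", 10, 2.56), ("s01", 11, 3.12), ("s01", 12, 4.42),
    ("s02", 10, 4.12),
    ("s01", 20, 4.23),
]
wide = to_wide_rows(points)
for (series_id, start), columns in wide.items():
    print(series_id, start, columns)
```

With this layout, reading a whole period of a series touches a single row instead of one row per sample.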

TSDB Design - Hybrid tables

One row/doc per time period, completed lines are stored as BLOB.

series-id	start	t+1	t+2	t+3	…	compressed
s01	00:00:10				…	{…}
s02	00:00:10				…	{…}
s01	00:00:20	4.23	4.44		…

…

TSDB Design - Direct BLOB insertion

Usually with memory cache.

series-id	start	data
s01	00:00:10	{…}
s02	00:00:10	{…}
s01	00:00:20	{…}

…
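
A rough sketch of what the compressed `data` column could hold: the samples of one period packed as binary doubles, then compressed. Generic zlib compression is used here only for illustration; actual TSDBs rely on encodings specialized for time series. The helper names `pack_bucket` and `unpack_bucket` are hypothetical.

```python
import struct
import zlib

def pack_bucket(values):
    """Serialize a list of floats as little-endian doubles, then compress them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw)

def unpack_bucket(blob):
    """Inverse of pack_bucket: decompress and decode the doubles."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))

values = [4.23, 4.44, 4.76, 4.76, 4.76]
blob = pack_bucket(values)
print(len(blob), unpack_bucket(blob))
```

The blob is opaque to the database, so any query on the values has to decode it first, which is one reason for the memory cache mentioned above.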

Some optimizations

Pre-aggregation: pre-compute common aggregations at common granularities - days, months, etc
Use a custom data format as input: json, protobuf, custom format (warp10.io, Influx Line Protocol)…

Retention policies

Round-robin table approaches (RRDTool, Graphite): keep only a round-robin buffer of data. Use a fixed-size storage.
InfluxDB lets you configure the retention policy (duration, replication, shard group duration)
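
The round-robin idea can be sketched with a fixed-size buffer in which the oldest point is dropped whenever a new one arrives. This illustrates the principle only, not the on-disk format of RRDTool or Graphite; the class name `RoundRobinSeries` is hypothetical.

```python
from collections import deque

class RoundRobinSeries:
    """Keep only the `capacity` most recent (timestamp, value) points."""

    def __init__(self, capacity):
        # A deque with maxlen silently discards the oldest item when full.
        self.points = deque(maxlen=capacity)

    def append(self, timestamp, value):
        self.points.append((timestamp, value))

series = RoundRobinSeries(capacity=3)
for t in range(5):
    series.append(t, t * 1.5)

print(list(series.points))
```

Storage size is known in advance, at the price of losing everything older than the buffer.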

Some existing solutions

InfluxDB

InfluxDB
Partly open source (monolithic version, for real-time data - limited to 72h in v3 Serverless)
No external dependency (Go in 2.x, Rust in 3.x)
Custom TSM Tree storage (Time-Structured Merge Tree) in v3 Cloud and Parquet/Apache Arrow in v3 Serverless
Nanoseconds timestamps
REST API, CLI tool, language bindings
InfluxQL, an SQL-like query language (the Flux functional scripting language has been abandoned)

InfluxDB - Writing data

Using Line Protocol:

<measurement>[,<tag-key>=<tag-value>...] \
<field-key>=<field-value>[,<field2-key>=<field2-value>...] \
[unix-nano-timestamp]

Example:

cpu,host=serverA,region=us_west value=0.64
payment,device=mobile,product=Notepad,method=credit billed=33,licenses=3 1434067467100293230
stock,symbol=AAPL bid=127.46,ask=127.48
temperature,machine=unit42,type=assembly external=25,internal=37 1434067467000000000
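
A minimal sketch of a Line Protocol formatter following the syntax above. It assumes that measurement names, tag keys and tag values contain no spaces, commas or equals signs (escaping of those is not handled). The helper names `to_line` and `format_field` are hypothetical, not part of an InfluxDB client library.

```python
def format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # must be tested before int, since bool is an int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'

def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement, sorted tags, fields, optional ns timestamp."""
    tag_part = "".join(f",{key}={val}" for key, val in sorted(tags.items()))
    field_part = ",".join(f"{key}={format_field(val)}" for key, val in fields.items())
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line

print(to_line("cpu", {"host": "serverA", "region": "us_west"}, {"value": 0.64}))
print(to_line("temperature", {"machine": "unit42", "type": "assembly"},
              {"external": 25.0, "internal": 37.0}, 1434067467000000000))
```

When the timestamp is omitted, the server assigns one on reception, as in the first and third example lines above.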

InfluxQL query language

SQL-inspired - Example:

SELECT MEAN("water_level")
FROM "h2o_feet"
WHERE "location"='santa_monica'
      AND time >= '2015-09-18T21:30:00Z'
      AND time <= now()
GROUP BY time(1h) fill(none)
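
To make the semantics of `GROUP BY time(1h)` with `MEAN` more concrete, here is a plain Python sketch of the same computation on an in-memory list. It is an illustration of the aggregation, not of how InfluxDB executes it; the function name `mean_per_hour` and the sample values are made up.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean

def mean_per_hour(points):
    """points: iterable of (datetime, value). Returns {hour_start: mean value}."""
    buckets = defaultdict(list)
    for ts, value in points:
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    # Hours without any point produce no entry, like fill(none).
    return {start: mean(values) for start, values in sorted(buckets.items())}

def utc(hour, minute):
    return datetime(2015, 9, 18, hour, minute, tzinfo=timezone.utc)

points = [(utc(21, 30), 5.0), (utc(21, 42), 6.0), (utc(23, 6), 4.0)]
result = mean_per_hour(points)
for start, value in result.items():
    print(start.isoformat(), value)
```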

InfluxQL = SQL + extensions

SQL from Apache Arrow DataFusion
Influx-specific extensions like
- MEASUREMENTS, FIELD KEYS, TAG KEYS in addition to SQL TABLE and COLUMNS
- tag/field/time-based operators (aggregation, etc)
- fill operator

InfluxDB platform

Platform for collecting, storing, graphing, and alerting on time series data

Telegraf: metrics collection agent
InfluxDB
- Storage agent
- Embedded web-based UI
- Metrics and events processing and alerting engine (notifications)

Warp10.io

Warp10
Open source (Apache License 2.0)
Features high-volume GeoTime Series (GTS) handling
Collection, storage and analysis of GTS
Server-side analysis scripts
Storage through LevelDB (standalone) or HBase (distributed)

Warp10 architecture

Standalone version (single jar, leveldb)
Standalone with datalog replication or sharding
Distributed version: Kafka for messaging, HBase for storage

Warp10.io - writing data

POST queries to an ingress endpoint
Encoding: TS/LAT:LON/ELEV CLASS{LABELS} VALUE
Time precision: from ns to ms
5 value types: long, double, boolean, string, binary

POST /api/v0/update HTTP/1.1
Host: host
X-Warp10-Token: TOKEN
Content-Type: text/plain

1380475081000000// foo{label0=val0,label1=val1} 123
/48.0:-4.5/ bar{label0=val0} 3.14
1380475081123456/45.0:-0.01/10000000 foobar{label1=val1} T
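
The encoding above can be produced with a small formatter. This sketch only handles long, double and boolean values and does no escaping of class names or labels; string and binary values are deliberately left out. The helper names `gts_line` and `format_value` are hypothetical.

```python
def format_value(value):
    """Render a long, double or boolean value (other types are not handled here)."""
    if isinstance(value, bool):  # tested before int, since bool is an int
        return "T" if value else "F"
    if isinstance(value, (int, float)):
        return str(value)
    raise TypeError("only long, double and boolean values are handled in this sketch")

def gts_line(class_name, labels, value, ts=None, lat_lon=None, elev=None):
    """Build one TS/LAT:LON/ELEV CLASS{LABELS} VALUE line; optional parts are left empty."""
    ts_part = "" if ts is None else str(ts)
    geo_part = "" if lat_lon is None else f"{lat_lon[0]}:{lat_lon[1]}"
    elev_part = "" if elev is None else str(elev)
    label_part = ",".join(f"{key}={val}" for key, val in labels.items())
    return f"{ts_part}/{geo_part}/{elev_part} {class_name}{{{label_part}}} {format_value(value)}"

print(gts_line("foo", {"label0": "val0", "label1": "val1"}, 123, ts=1380475081000000))
print(gts_line("bar", {"label0": "val0"}, 3.14, lat_lon=(48.0, -4.5)))
print(gts_line("foobar", {"label1": "val1"}, True,
               ts=1380475081123456, lat_lon=(45.0, -0.01), elev=10000000))
```

The three calls rebuild the three data lines of the request body shown above.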

Warp10.io - WarpScript

Expressive query language (fully functional, Turing-complete)
6 high-level TS operations: BUCKETIZE, MAP, REDUCE, FILL, APPLY, FILTER
RPN-inspired syntax for WarpScript (stack-based)
Alternative syntax FLoWS
Output format: compact JSON objects

Warp10.io - WarpScript example

'TOKEN_READ' 'token' STORE                 // Storing token

[ $token 'consumption' {} NOW 1 h ] FETCH  // Fetch all values from now to 1 hour ago
[ SWAP bucketizer.max  0 1 m 0 ] BUCKETIZE // Get max value for each minute

[ SWAP [ 'room' ] reducer.sum ] REDUCE     // Aggregate all consumptions by room
[ SWAP mapper.rate 1 0 0 ] MAP             // Consumption being a counter, compute the rate

Warp10 test

Create a sandbox server on http://sandbox.senx.io/
Up to 10k series

Warp10 LLM

LLM trained to assist with WarpScript
https://spocky.senx.io/

Timeseries forecasting

Trend analysis: systematic change in the level of a series
Cyclical fluctuation analysis
Seasonal pattern analysis
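
A very small sketch of two of these analyses on a toy series: a moving average over one full cycle to expose the trend, and the mean value at each position of the cycle to expose the seasonal pattern. The function names, the period of 3 and the data are assumptions made for the example; real analyses would rely on dedicated statistical tools.

```python
def moving_average(values, window):
    """Trailing moving average: one output per full window."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def seasonal_profile(values, period):
    """Mean value observed at each position of the cycle."""
    return [
        sum(values[p::period]) / len(values[p::period])
        for p in range(period)
    ]

series = [10, 20, 30, 12, 22, 32, 14, 24, 34]  # toy data: cycle of 3, slowly rising
trend = moving_average(series, 3)
profile = seasonal_profile(series, 3)
print(trend)
print(profile)
```

Averaging over exactly one period cancels the seasonal component, so what remains in `trend` is the slow change in level.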

Timeseries forecasting approaches

Approaches

statistical (ARIMA, SARIMA, TBATS…)
learning (XGBoost, Random Forest…)
deep learning
- NeuralForecast - collection of neural forecasting models
- TimesFM (Time Series Foundation Model) - pretrained time-series foundation model developed by Google Research for time-series forecasting

Visualisation interfaces

InfluxDB v2 Giraffe

Kibana

Part of the ELK stack (Elasticsearch)
Generic dashboard but with time-focused components

Grafana

Grafana (Graphite, InfluxDBv3, OpenTSDB, Prometheus…)

Initially a fork of Kibana

Dash (PlotLy)

Dashboard creation platform https://plotly.com/dash/

https://dash.gallery/Portal/

For instance soccer game viz

Other visualizations

Example dynamic visualization: NBA Data visualisation (Source)

For reference/inspiration - Timeviz Browser: https://browser.timeviz.net/

ObservableHQ notebooks on Time Series viz

References

Time Series Database Query Languages (Philipp Bende)
Survey and Comparison of Open Source Time Series Databases (Andreas Bader, Oliver Kopp, Michael Falkenthal)
InfluxDB comparison with other TSDBs
Time Series France - French-speaking community on time series, from data collection to data exploitation
Awesome Time Series - collection of resources for working with time series data
https://timeseriesclassification.com/ (datasets + code)