This includes enabling functionality like auto-scaling on our querying infrastructure.

As we were analyzing the performance of our systems, we made a few discoveries.

Fragmentation of data into lots of small files:

For our slowest queries, close to 65% of the total query time was spent reading raw data from disk. Further investigation showed that most of this time went to opening and closing lots of small files.

This problem is somewhat unique to us because of the way data is sent to Amplitude. There are two key timestamps in Amplitude: event time (the time at which the event occurred) and server upload time (the time at which the event was uploaded into Amplitude systems). These two timestamps often differ significantly, for two reasons:

Offline activity: Devices are off the grid while performing events, so events get buffered on the device. All events are sent to Amplitude once the device comes back online.

Backfill of historical data: Our customers often have internal data lakes and want to backfill their historical data into Amplitude.

As all queries in Amplitude have a filter on event time, we want our raw data to be indexed by event time, so we can filter out data outside the query range.
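As a rough illustration of how a gap between event time and server upload time can fragment event-time-indexed storage into many small files, consider the sketch below. The layout of one file per (event day, upload batch), the sample events, and the names `write_batches` and `files_for_query` are assumptions made for illustration only; they are not a description of Amplitude's actual storage format.

```python
from collections import defaultdict
from datetime import date


def write_batches(batches):
    """Simulate writing one file per (event day, upload batch).

    Each batch is a list of (event_day, event_name) tuples that arrived
    together. Returns a mapping of event day -> list of files, where each
    file is the list of event names it holds.
    """
    files = defaultdict(list)
    for batch in batches:
        per_day = defaultdict(list)
        for event_day, name in batch:
            per_day[event_day].append(name)
        # Every event day touched by this batch gets its own new file.
        for event_day, names in per_day.items():
            files[event_day].append(names)
    return files


def files_for_query(files, start, end):
    """Return only the files whose event day falls within [start, end]."""
    return [
        f
        for day, day_files in files.items()
        if start <= day <= end
        for f in day_files
    ]


# Hypothetical upload batches, in arrival order.
batches = [
    # Live traffic: event time matches upload time.
    [(date(2020, 1, 10), "open"), (date(2020, 1, 10), "click")],
    # A device comes back online and flushes a buffered event.
    [(date(2020, 1, 10), "open"), (date(2020, 1, 8), "purchase")],
    # A customer backfills historical data.
    [(date(2020, 1, 1), "signup"), (date(2020, 1, 8), "open")],
]

files = write_batches(batches)
total_files = sum(len(day_files) for day_files in files.values())
```

In this toy run, six events end up spread across five files, and a query for January 8 alone has to open two single-event files. Indexing by event time does let the query skip every other day, but late-arriving data keeps adding small files to days that were already written.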