Advent of SQL - Day 13, XML Travel Manifests

Its day 13 of Advent of SQL, we have some xml to parse, which I don't think SQL can handle, but string manipulation to the rescue.

Let's get the SQL for the day:

DROP TABLE IF EXISTS travel_manifests;

CREATE TABLE travel_manifests (
    manifest_id INT PRIMARY KEY,
    vehicle_id TEXT,
    departure_time TIMESTAMP,
    manifest_xml XML
);

INSERT INTO travel_manifests (manifest_id, vehicle_id, departure_time, manifest_xml) VALUES
  (1, 'SLEIGH-01', '2025-12-22 06:00:00', 'Nia GrantovernightSofia KimovernightignoredJonah Wolfestandardignored'),
  (2, 'SLEIGH-07', '2025-12-23 13:00:00', 'Ravi PatelpriorityKeiko ItostandardAnya PavlovstandardNia GrantpriorityignoredCarter LewispriorityignoredLayla Brooksstandardignored'),
  (3, 'FLIGHT-NP-9', '2025-12-22 18:00:00', 'Keiko ItoovernightJonah WolfepriorityDiego RamosovernightignoredPriya DaspriorityLayla Brooksstandard'),
  (4, 'TRAIN-ICE-3', '2025-12-22 18:00:00', 'lowIsla TorresovernightRavi PatelovernightHiro TanakastandardPriya Daspriority'),
  (5, 'FLIGHT-NP-9', '2025-12-22 17:00:00', 'Nia GrantpriorityMateo Cruzstandardignored'),
  (6, 'CARGO-12', '2025-12-22 15:00:00', 'Carter LewisovernightSofia KimpriorityHiro TanakapriorityLucas FordpriorityignoredNia Grantovernight'),
  (7, 'SLEIGH-01', '2025-12-22 11:00:00', 'unknownPriya DasstandardDiego RamosstandardLucas FordovernightCarter LewisovernightignoredHiro TanakapriorityignoredZara Sheikhovernight'),
  (8, 'CARGO-12', '2025-12-23 13:00:00', 'Jonah WolfestandardLayla BrooksovernightLeo BeckerovernightignoredRavi PatelstandardSofia KimpriorityignoredElena Moralesovernight'),
  (9, 'SLEIGH-01', '2025-12-23 10:00:00', 'Sofia KimovernightBianca PereirastandardZara SheikhovernightElena MoralesovernightignoredHiro TanakapriorityKeiko Itopriorityignored'),
  (10, 'SLEIGH-01', '2025-12-22 21:00:00', 'lowNia GrantstandardAva JohnsonstandardPriya DaspriorityMateo CruzstandardBianca PereiraovernightLeo Beckerovernight'),
  (11, 'SLEIGH-07', '2025-12-23 10:00:00', 'Jonah WolfepriorityBianca PereirapriorityAnya Pavlovstandard'),
  (12, 'SLEIGH-01', '2025-12-22 08:00:00', 'okRavi PatelpriorityignoredBianca PereirapriorityKeiko Itostandard'),
  (13, 'FLIGHT-NP-9', '2025-12-22 11:00:00', 'okNia GrantpriorityElena Moralesstandard'),
  (14, 'SLEIGH-01', '2025-12-22 14:00:00', 'highLayla BrooksstandardignoredAnya PavlovovernightKeiko ItostandardZara SheikhstandardignoredSofia Kimovernight'),
  (15, 'FLIGHT-NP-9', '2025-12-22 14:00:00', 'okPriya DaspriorityignoredBianca PereirastandardNia Grantstandard');

Just one table, with some wild XML.

Let's see what do we want to do in the problem statement.

Problem

Using the travel_manifests table, extract the passenger information from the XML data and produce a report that shows all of the departure times for "CARGO" vehicles that have more than 20 passengers booked. Include in the results:

  • The vehicle_id
  • The departure_time
  • The total number of passengers on that departure
  • Order the results by departure_time.

Ok, so we need the number of passengers in the records which are of type CARGO and have more than 20 passengers booked.

Interesting!

Let's look at one record.

sqlite> .schema
CREATE TABLE travel_manifests (
    manifest_id INT PRIMARY KEY,
    vehicle_id TEXT,
    departure_time TIMESTAMP,
    manifest_xml XML
);
sqlite> SELECT * FROM travel_manifests WHERE id = 1;
Parse error: no such column: id
  SELECT * FROM travel_manifests WHERE id = 1;
                         error here ---^
sqlite> SELECT * FROM travel_manifests LIMIT 1;
+-------------+------------+---------------------+--------------------------------------------------------------+
| manifest_id | vehicle_id |   departure_time    |                         manifest_xml                         |
+-------------+------------+---------------------+--------------------------------------------------------------+
| 1           | SLEIGH-01  | 2025-12-22 06:00:00 | Nia GrantovernightSofia KimovernightignoredJonah Wolfestandardignored                                               |
+-------------+------------+---------------------+--------------------------------------------------------------+
sqlite> 

So, we have the following columns

  • vehicle_id which is I think related to filtering CARGO related vehicles only
  • departure_time which we just return as is
  • manifest_xml, oh! This is xml and it has passenger details.

If we look carefully, we can see the xml looks like this:

xml

    
        
            Nia Grant 
           overnight
        
        
            Sofia Kim 
           overnight
           ignored
       
       
           Jonah Wolfe
           standard
           ignored
       
    

We have manifest which has a property of passengers which is a list of passenger tags, inside of which, each element of passenger has its details like name, ticket_class, etc.

We only want the count of passengers, how can we get that? The dirtiest way to do is to count the occurances of <passenger> or </passenger> in the xml string.

We can do that with counting the full length of the xml string, and then dividing by the number of times the string can be replaced(which is the dirty part, there could be hidden <passenger> string somewhere that might break this logic, but if it is a valid xml, it works). We count the number of characters left after we replace the string <passenger> with empty string '' so that we can get the difference of the total number of character and the number of characters occupied by the string <passenger>. This difference if we divide by the length of <passenger> will give us the count of the number of times the <passenger> string is present in the xml string.

Let's take an example from the above. The length of the xml string is 374.

SELECT LENGTH(manifest_xml) FROM travel_manifests LIMIT 1;
sqlite> SELECT LENGTH(manifest_xml) FROM travel_manifests LIMIT 1;
+----------------------+
| LENGTH(manifest_xml) |
+----------------------+
| 374                  |
+----------------------+
sqlite> 

Let's replace the occurances of <passenger> with empty string in the manifest_xml` string, like so:

SELECT REPLACE(manifest_xml, '', '') FROM travel_manifests LIMIT 1;

Now, we can see the string <passenger> is gone from the returned result set. We can try getting its length now.

SELECT 
    LENGTH(
        REPLACE(manifest_xml, '', '')
    )
FROM travel_manifests LIMIT 1;

Now, it says 341 why? Because we removed (replaced with empty string) the occurences of <passenger>.

Let's get the length of '<passenger>' string, which should be 11 right?

Spell it p-a-s-s-e-n-g-e-r as pass + enger (4+5=9) and 2 for <> so 11. Sometimes I don't know how to do math, I use SQL.

SELECT LENGTH('');

There it is 11.

Now, if you compute the difference of the actual length of XML with the removed parts of the <passenger> what do we get?

SELECT LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, '', '')) FROM travel_manifests LIMIT 1;

WE got 33, why would you ask because 11 times 3 is 33. We found three instances of <passenger>, so we just need to divide by the length of <passenger> or hard code it as 11 doesn't matter.

We would get the number of occurances of <passenger> which will give the number of passenger in the xml string.

SELECT
    (
        LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, '', ''))
    ) / LENGTH('')
FROM travel_manifests LIMIT 1;

Phew, its 3!

That was a lot for a simple stuff. But hey its fun!

sqlite> SELECT (manifest_xml) FROM travel_manifests LIMIT 1;
+--------------------------------------------------------------+
|                         manifest_xml                         |
+--------------------------------------------------------------+
| Nia GrantovernightSofia KimovernightignoredJonah Wolfestandardignored                                               |
+--------------------------------------------------------------+

sqlite> SELECT REPLACE(manifest_xml, '', '') FROM travel_manifests LIMIT 1;
+--------------------------------------------------------------+
|           REPLACE(manifest_xml, '', '')           |
+--------------------------------------------------------------+
| Nia Grantov |
| ernightSofia KimovernightignoredJonah Wolfe |
| standardignored                    |
+--------------------------------------------------------------+
sqlite> 
sqlite> SELECT LENGTH(REPLACE(manifest_xml, '', '')) FROM travel_manifests LIMIT 1;
+--------------------------------------------------+
| LENGTH(REPLACE(manifest_xml, '', '')) |
+--------------------------------------------------+
| 341                                              |
+--------------------------------------------------+
sqlite> SELECT LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, '', '')) FROM travel_manifests LIMIT 1;
+--------------------------------------------------------------+
| LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, ' SELECT LENGTH('') FROM travel_manifests LIMIT 1;
+-----------------------+
| LENGTH('') |
+-----------------------+
| 11                    |
+-----------------------+

sqlite> SELECT (LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, '', '')))/LENGTH('') FROM travel_manifests LIMIT 1;
+--------------------------------------------------------------+
| (LENGTH(manifest_xml) - LENGTH(REPLACE(manifest_xml, ' 

Now, let construct the query to get the number of passengers.

SELECT
    vehicle_id,
    departure_time,
    (
        LENGTH(manifest_xml)
        - LENGTH(REPLACE(manifest_xml, '', ''))
    ) / LENGTH('') AS passengers_in_manifest
FROM travel_manifests
WHERE vehicle_id LIKE 'CARGO-%';

We need to wrap it in a CTE to grab and group by the vehicle_id I believe, as there are similar entries.

Also we need to group the records with the same departure time, so that we can combine the number of passengers for that vehicle.

WITH passenger_counts AS (
    SELECT
        vehicle_id,
        departure_time,
        (
            LENGTH(manifest_xml)
            - LENGTH(REPLACE(manifest_xml, '', ''))
        ) / LENGTH('') AS passengers_in_manifest
    FROM travel_manifests
    WHERE vehicle_id LIKE 'CARGO-%'
)
SELECT
    vehicle_id,
    departure_time,
    SUM(passengers_in_manifest) AS total_passengers
FROM passenger_counts
GROUP BY vehicle_id, departure_time
HAVING SUM(passengers_in_manifest) > 20
ORDER BY departure_time;

We count the number of passengers in the CTE and use it as a filter in the outer query as SUM(passengers_in_manifest) > 20 which will give the right condition for us to get the result. We have to use HAVING as we need to do that after grouping the same vehicle_id and records across same departure_time.

We also use the vehicle_id LIKE 'CARGO-%' in the CTE to filter it right at the inner query to avoid looping in all the queries for computing the number of passengers.

That solves this problem.

That's it from day 13 of Advent of SQL.

There are other ways, but its the same parsing, We can use JOINs and stuff, but hey that was not the point of this.

Anyways! See you tomorrow for day 14!