There seems to be an issue in the format of the value of some sports #1

Open
reallyyy opened this issue Oct 25, 2023 · 5 comments

Comments

@reallyyy

For example:
Here is the data for "Women's 100m Breaststroke".
Some participants' times are as low as 30 to 40 seconds. For context, the value according to Google is around 1:04 - 1:06, depending on the participant in question.
eventTile  value  participantName
Heat 2     31.77  Dalma Sebestyen
Heat 2     31.86  Remedy Rule
Heat 3     32.42  Erin Gallagher
Heat 4     36.08  Benedetta Pilato
Heat 2     40.94  Claudia Verdino

The values for "Men's Marathon" are also questionable. Here are some examples:
eventTile             value  participantName
Men's Marathon Final  10:09  Cameron Levins
Men's Marathon Final  10:52  Ivan Zarco Alvarez
Men's Marathon Final  11:28  Yuma Hattori
Men's Marathon Final  12:07  Christian Pacheco
Men's Marathon Final  12:07  Hassan Chahdi
Men's Marathon Final  15:36  Stephen Scullion
Men's Marathon Final  15:44  Mykola Nyzhnyk
Men's Marathon Final  15:48  Lemawork Ketema
Men's Marathon Final  16:12  Oleksandr Sitkovskiy
The names of the participants are right, but the values are wrong: you can't run a marathon in 10 minutes, and 10 hours is too long. The typical time is about 2 - 3 hours or so.

@llui85
Owner

llui85 commented Oct 28, 2023

Hi @reallyyy

All the data is what was returned from the OBS server at the time of scraping; no processing or changes were made to the data at all. There may well be inaccuracies, which would be concerning. I have not audited the data or carried out sanity checks like you seem to have done.

Where are you finding this information? The way events are laid out is a little confusing; from memory, there are many different data types for different sections of the event: SubEventUnit, Stage, Result, Phase, Event, and EventUnit. Is it possible that the data you're finding is only partial, perhaps for one section of the measurements taken? (e.g. the first 50 metres of a swimming race could be timed separately as its own leg?)

The USDF messages may also be useful for troubleshooting.

@reallyyy
Author

reallyyy commented Oct 28, 2023

Where are you finding this information?

  • I am doing an EDA on the data as part of a personal project
  • This is the code I am using, in case you find it helpful:
import json

# olympicsData is the parsed dataset, assumed to be loaded earlier (not shown here)
def get_result(sportExternalId):
    Data = []
    for itemId, item in olympicsData["EventUnit"].items():
        if sportExternalId[0:9] in item["attributes"]["externalId"]:
            # One event unit has multiple results because there are multiple
            # participants, so the results are returned as a list
            resultIDs = item["relationships"]["results"]["data"]
            for resultID in resultIDs:  # Loop through the list
                resultID = resultID["id"]  # Get the id of a single result
                # Get the id of the competitor who owns this result in this event unit
                competitorID = olympicsData["Result"][resultID]["relationships"]["competitor"]["data"]["id"]
                # Use the competitorID to find the participantID, which gives more details about the person
                participantID = olympicsData["Competitor"][competitorID]["relationships"]["participant"]["data"]["id"]
                Data.append({
                    "eventTile": item["attributes"]["title"],
                    "value": olympicsData["Result"][resultID]["attributes"]["value"],
                    "participantName": olympicsData["Participant"][participantID]["attributes"]["name"],  # Look up the participant's name
                })
    return Data
   
disciplines = olympicsData["Discipline"]
events = olympicsData["Event"]
SportsData = []
for itemId, item in events.items():
    disciplineId = item["relationships"]["discipline"]["data"]["id"]
    discipline = disciplines[disciplineId]
    if discipline["attributes"]["name"] in ["Athletics","Swimming","Weightlifting"]:
        SportsData.append({
            "name": item["attributes"]["name"],
            "id": item["attributes"]["externalId"],
            "disciplineName": discipline["attributes"]["name"]
        })

print(json.dumps(SportsData,indent = 4))
  • I use most of your code, keeping the same format for the most part
  • The problem is that, using the same code, only a fraction of the sports are wrong; in my case they are all timing-related sports. Not all of the timed sports are wrong: there are running events with the right data, for example Men's 1500m or Men's 5000m. And there are swimming events where only a few rows are wrong, for example:
Men's 200m Backstroke
 eventTile    value       participantName
  57      Final  1:51.25         Kristof Milak
  813     Final  1:53.27          Evgeny Rylov
  60      Final  1:53.73          Tomoru Honda
  790     Final  1:54.15           Ryan Murphy
  539    Heat 4  1:54.44        Kuan-Hung Wang
  ..        ...      ...                   ...
  842    Heat 1  2:17.40         Izaak Bastian
  849    Heat 1  2:17.51         Julio Horrego
  833    Heat 1  2:20.09       Arnoldo Herrera
  822    Heat 1  2:23.22  Abdulaziz Al-Obaidly
  684    Heat 5    32.11           Haiyang Qin

As you can see, only the last row is wrong.
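
A check like this can be automated during EDA. The following is only a rough sketch: the helper names `parse_time` and `flag_suspicious` are made up for illustration, the time formats are assumed to be `SS.ss`, `M:SS.ss` or `H:MM:SS.ss`, and the "less than half the median" cutoff is an arbitrary heuristic, not something taken from the dataset. The rows are copied from the table above.

```python
from statistics import median


def parse_time(value):
    """Convert 'SS.ss', 'M:SS.ss' or 'H:MM:SS.ss' into seconds (assumed formats)."""
    seconds = 0.0
    for part in value.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def flag_suspicious(rows, ratio=0.5):
    """Return the rows whose time is below `ratio` times the median time."""
    times = [parse_time(row["value"]) for row in rows]
    cutoff = median(times) * ratio  # heuristic threshold, chosen arbitrarily
    return [row for row, t in zip(rows, times) if t < cutoff]


# A few rows from the table above, in the same shape get_result() produces
rows = [
    {"eventTile": "Final", "value": "1:51.25", "participantName": "Kristof Milak"},
    {"eventTile": "Final", "value": "1:53.27", "participantName": "Evgeny Rylov"},
    {"eventTile": "Heat 1", "value": "2:17.40", "participantName": "Izaak Bastian"},
    {"eventTile": "Heat 1", "value": "2:23.22", "participantName": "Abdulaziz Al-Obaidly"},
    {"eventTile": "Heat 5", "value": "32.11", "participantName": "Haiyang Qin"},
]

suspicious = flag_suspicious(rows)
for row in suspicious:
    print(row["participantName"], row["value"])
```

With these five rows, only the 32.11 entry falls below the cutoff. A median-based cutoff is used here because a single bad row barely moves the median, whereas it would drag a mean down.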

@llui85
Owner

llui85 commented Oct 28, 2023

Ah, I see what's happening.

For the men's marathon (event unit ID f0a359cc-d859-3865-a7e3-ab3b6f68eddf), there are 106 Competitor records, but 1014 Results. Side note: Don't match on externalId, use the relationships that already exist instead.
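
A mismatch like this can be spotted by walking the existing relationships and comparing counts. This is a minimal sketch that assumes the same nested layout used in the code posted earlier in this thread (`EventUnit` → `relationships.results.data` → `Result` → `relationships.competitor.data.id`). The function name and the toy IDs below are hypothetical and only there to make the example runnable.

```python
def count_results_and_competitors(olympics_data, event_unit_id):
    """Return (number of results, number of distinct competitors) for one event unit."""
    unit = olympics_data["EventUnit"][event_unit_id]
    # Follow the relationship instead of matching on externalId
    result_ids = [ref["id"] for ref in unit["relationships"]["results"]["data"]]
    competitor_ids = {
        olympics_data["Result"][rid]["relationships"]["competitor"]["data"]["id"]
        for rid in result_ids
    }
    return len(result_ids), len(competitor_ids)


# Toy data with made-up IDs: one event unit, three results, two competitors
toy_data = {
    "EventUnit": {
        "unit-1": {
            "relationships": {
                "results": {"data": [{"id": "r1"}, {"id": "r2"}, {"id": "r3"}]}
            }
        }
    },
    "Result": {
        "r1": {"relationships": {"competitor": {"data": {"id": "c1"}}}},
        "r2": {"relationships": {"competitor": {"data": {"id": "c1"}}}},
        "r3": {"relationships": {"competitor": {"data": {"id": "c2"}}}},
    },
}

n_results, n_competitors = count_results_and_competitors(toy_data, "unit-1")
print(n_results, n_competitors)
```

If the first number is much larger than the second, the event unit contains more than one result per competitor, so the results need filtering before they are compared.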

I've attached a CSV of this subset that should help you understand what's going wrong here immediately. mensmarathon.csv

I think the thing that's being missed here is that a Result is not final: there are unofficial, partial, and final official results. To quote from the ODF spec (the data from this repo is a parsed form of the ODF spec for the most part, although not always identical in structure):

The ‘Results’ message, DT_RESULT is the key message for all competition information and is available for every unit. This message is:

  • used to provide the start list before the start of the unit;
  • updated continuously throughout the unit with results; and
  • sent with the unofficial and official results when the unit is over.

So in this case:

  • The value field is partial for a FRAME_RESULT, only providing the time change since the previous frame was sent.
  • The final official times are stored in results that satisfy resultType == STRUCTURED_RESULT && status == OFFICIAL. These results include summaries of the frame results in extendedInfo.odfExtensions, as well as final ranks and DNF statuses.
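
As a rough sketch, the filter described above could look like this in Python. It assumes that `resultType` and `status` sit under each Result's `attributes` as plain strings, next to `value`; that placement is an assumption and should be checked against the actual data. The function name and the toy IDs and values are hypothetical.

```python
def official_results(olympics_data, event_unit_id):
    """Return only the final official results of one event unit."""
    unit = olympics_data["EventUnit"][event_unit_id]
    selected = []
    for ref in unit["relationships"]["results"]["data"]:
        result = olympics_data["Result"][ref["id"]]
        attrs = result["attributes"]
        # Assumed field placement: resultType and status under "attributes"
        if attrs.get("resultType") == "STRUCTURED_RESULT" and attrs.get("status") == "OFFICIAL":
            selected.append(result)
    return selected


# Toy data with made-up IDs and values: two frame results and one official result
toy_data = {
    "EventUnit": {
        "unit-1": {
            "relationships": {
                "results": {"data": [{"id": "r1"}, {"id": "r2"}, {"id": "r3"}]}
            }
        }
    },
    "Result": {
        "r1": {"attributes": {"resultType": "FRAME_RESULT", "status": "OFFICIAL", "value": "30.50"}},
        "r2": {"attributes": {"resultType": "FRAME_RESULT", "status": "OFFICIAL", "value": "34.25"}},
        "r3": {"attributes": {"resultType": "STRUCTURED_RESULT", "status": "OFFICIAL", "value": "1:04.75"}},
    },
}

final = official_results(toy_data, "unit-1")
print([r["attributes"]["value"] for r in final])
```

Using `.get()` means results that lack either field are skipped rather than raising a `KeyError`.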

For a marathon, there would be 10 intermediate frame results sent at different checkpoints. For 100m swimming, a frame would be sent for each lap, which matches up with the data that you were seeing.

@llui85
Owner

llui85 commented Oct 28, 2023

Also, I have data from the 2020 Paralympics & 2022 Beijing Olympics/Paralympics in the same format that I never got around to uploading to Kaggle, if you'd like it.

@reallyyy
Author

reallyyy commented Oct 28, 2023

Wow, thank you so much for the fast reply and for spending time exploring the issue.
I am so grateful for the support.
This is my first time working with such a big dataset, so I am somewhat lost.
Now that you've pointed it out, I went back and checked the code, and you're right: I made the mistake of assuming that for any sport the number of competitor records and Results would be the same.
For example, when I run my code on the Women's 4 x 100m Medley Relay, the Sweden team in the final has about 10 different data points.

Also, I have data from the 2020 Paralympics & 2022 Beijing Olympics/Paralympics in the same format that I never got around to uploading to Kaggle, if you'd like it.

I would like that, if uploading doesn't take too much of your time. With the data I already have, I have roughly what I need, but more data points would make the point more concrete. Otherwise, the support you've already given me is kind enough.
