From 886ef5c9f912403549a2cef0a29493b7352a59ec Mon Sep 17 00:00:00 2001 From: Atharva_Rasane Date: Fri, 21 Jun 2024 09:40:30 +0530 Subject: [PATCH] Review Feed Back 1 plus additional changes --- papers/atharva_rasane/main.md | 60 ++++++++++++++++++---------------- papers/atharva_rasane/myst.yml | 2 +- 2 files changed, 32 insertions(+), 30 deletions(-) diff --git a/papers/atharva_rasane/main.md b/papers/atharva_rasane/main.md index 37e985084d..7047b23f7a 100644 --- a/papers/atharva_rasane/main.md +++ b/papers/atharva_rasane/main.md @@ -1,44 +1,45 @@ --- # Ensure that this title is the same as the one in `myst.yml` -title: AI driven Watermarking Technique for Safeguarding Text Integrity in the Digital Age +title: AI-Driven Watermarking Technique for Safeguarding Text Integrity in the Digital Age abstract: | The internet's growth has led to a surge in text usage. Now, with public access to generative AI models like ChatGPT/Bard, identifying the source is vital. This is crucial due to concerns about copyright infringement and plagiarism. Moreover, it's essential to differentiate AI-generated text to curb misinformation from AI model hallucinations. In this paper, we explore text watermarking as a potential solution. We examine various methods, focusing on plain ASCII text in English. Our goal is to investigate different techniques, including physical watermarking (e.g., UniSpaCh by Por et al.), where text is modified to hide a binary message using Unicode Spaces, and logical watermarking (e.g., word context proposed by Jalil et al.), where a watermark key is generated via a defined process. While logical watermarking is difficult to break, it is not detectable without prior knowledge of the algorithm and parameters used. Conversely, physical watermarks are easily detected but also easy to break. - This paper presents a unique physical watermarking technique based on word substitution to address these challenges. The core idea is that AI models consistently produce the same output for the same input. Initially, we replaced every ith word with a "[MASK]," then used a BERT model to predict the most probable token in place of "[MASK]." The resulting text constitutes the watermarked text. To verify, we reran the algorithm on the watermarked text and compared the input and output for similarity. + This paper presents a unique physical watermarking technique based on word substitution to address these challenges. The core idea is that AI models consistently produce the same output for the same input. Initially, we replaced every i-th word (for example every 5th word) with a "[MASK]," a placeholder token used in natural language processing models to indicate where a word has been removed and needs to be predicted. Then we used a BERT model to predict the most probable token in place of "[MASK]." The resulting text constitutes the watermarked text. To verify, we reran the algorithm on the watermarked text and compared the input and output for similarity. - The Python implementation of the algorithm in this paper employes models from the HuggingFace Transformer Library, namely "bert-base-uncased" and "distilroberta-base". The "[MASK]" placeholder was generated by splitting the input string using the `split()` function and then replacing every ith element in the list with "[MASK]". This modified list served as the input text for the BERT model, where the output corresponding to each "[MASK]" was replaced accordingly. Finally, applying the join() function to the list produces the watermarked text. 
+ The Python implementation of the algorithm in this paper employs models from the HuggingFace Transformers library, namely "bert-base-uncased" and "distilroberta-base". The "[MASK]" placeholder was generated by splitting the input string using the `split()` function and then replacing every 5th element in the list with "[MASK]". This modified list served as the input text for the BERT model, where the output corresponding to each "[MASK]" was replaced accordingly. Finally, applying the `join()` function to the list produces the watermarked text.

  This technique tends to produce a nearly invisible watermark. Depending on how similar the input text is to BERT's training data, the watermark either preserves the meaning of the text or changes it substantially, as observed when the algorithm was run on the story of Red Riding Hood and its meaning was altered. However, the nature of this watermark makes it extremely difficult to break due to the black-box nature of the AI model.
---

## Introduction

-The growth of the internet is primarily a the spread of web pages which in turn are written in HTML (Hyper Text Markup Language) consisting of lots and lots of text. Almost every webpage in some form or another contains text making it a popular mode of communication whether it be blogs, posts, articles, comments etc. Text can be generalized as a collection of integers or ASCII/Unicode values wherein each value is mapped to a particular character.
+The growth of the internet is driven by the spread of web pages, which are written in HTML (Hyper Text Markup Language) and consist largely of text. Almost every webpage contains text in some form, making it a popular mode of communication, whether in blogs, posts, articles, or comments. Text can be represented as a collection of ASCII or Unicode values, where each value corresponds to a specific character.

Given the text-focused nature of the internet and tools like ChatGPT and Bard, it is crucial to identify the source of text. This helps to manage copyright issues and distinguish between AI-generated and human-written text, thereby preventing the spread of misinformation. Currently, detecting AI-generated text relies on machine learning classifiers that need frequent retraining with the latest AI-generated data. However, this method has drawbacks, such as the rapid evolution of AI models producing increasingly human-like text. Therefore, a more stable approach is needed, one that does not depend on the specific AI model generating the text.

-With the majority of the internet and tools like ChatGPT and Bard being text-focused, we need to realize the importance of identifying the source of text whether due to copyright or to differentiate between AI-generated text and Human written text to prevent the flow of misinformation. The standard of detecting AI-generated text is with the use of another ML classifier which needs to be constantly trained on the latest AI-generated text data. This approach has a few drawbacks, one of which is the ever-changing nature of AI-generated text where we have bigger and better models that are giving more human-like text being released faster then ever before and thus we need a more standard/concrete approach, one that can be used regardless of the AI model i.e. we need a method of identifying that doesn't depend on one generating the text. One such approach is via the use of a watermark.

+Watermarks are an identifying pattern used to trace the origin of the data.
In this case, we specifically want to focus on text watermarking (watermarking of plain text). Text watermarking can broadly be classified into two types, Logical Embedding and Physical Embedding, which in turn can be classified further [@Atr01]. Logical Embedding involves generating a watermark key from the input text by some defined logic. Note that the input text is not altered; the user instead keeps the generated watermark key to identify the text. Physical Embedding involves altering the input text itself to insert a message into it; an algorithm is later run to recover this message and identify the text. In this paper, we will propose an algorithm to watermark text using BERT (Bidirectional Encoder Representations from Transformers), a model introduced by Google whose main purpose is to replace a special symbol "[MASK]" with the most probable word given the context.

-Watermarks are an identifying pattern used to identify the origin of the data. In this case, we specifically want to focus on text watermarking (watermarking of plain text). Text watermarking can broadly be classified into 2 types Logical Embedding and Physical Embedding which in turn can be classified further. Logical Embedding involves the user generating a watermark key by some logic from the input text. Note that this means that the input text is not altered and the user instead keeps the generated watermark key to identify the text. Physical Embedding involves the user altering the input text itself to insert a message into it and the user instead runs an algorithm to find this message to identify the text. In this paper, we will propose an algorithm to watermark text using BERT (Bidirectional Encoder Representations from Transformers), a model introduced by Google whose main purpose is to replace a special symbol [MASK] with the most probable word given the context.

+BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained model introduced by Google in 2018, which has revolutionized natural language processing (NLP) [@Atr03]. "Pre-trained" means the model has already been trained on a large dataset before being fine-tuned for specific tasks. This allows the model to learn general features and patterns from a broad range of text data. For BERT, this pre-training involves vast amounts of text from books, articles, and websites, enabling it to understand the intricacies of human language. It also allows BERT to be adapted quickly to various NLP tasks with relatively small amounts of task-specific data. Traditional models read text sequentially, either left-to-right or right-to-left. In contrast, BERT reads text in both directions simultaneously, providing a deeper understanding of context and meaning. This bidirectional approach allows BERT to perform exceptionally well in various NLP tasks, including question answering, text classification, and named entity recognition. By grasping the nuances of language more effectively, BERT sets a new standard for accuracy and efficiency in NLP applications [@Atr03].

-BERT[@Atr03] (Bidirectional Encoder Representations from Transformers) is a pre-trained model introduced by Google in 2018 that has revolutionized natural language processing (NLP). At its core, BERT employs a bi-directional Transformer encoder which allows the model to understand context from both directions simultaneously, greatly enhancing its comprehension of text.
BERT undergoes pre-training through two tasks: Masked Language Modeling (MLM), where certain words in a sentence are masked and the model predicts them based on surrounding words, and Next Sentence Prediction (NSP), which involves determining if one sentence logically follows another. This comprehensive training enables BERT to excel in numerous NLP applications like question answering, text classification, and named entity recognition. Given its deep understanding of context and semantics, BERT is highly relevant to text watermarking. Watermarking text involves embedding identifying patterns within the text to trace its origin, which can be critical for copyright protection and distinguishing between AI-generated and human-written content. BERT's sophisticated handling of language makes it ideal for embedding watermarks in a way that is subtle yet robust, ensuring that the text remains natural while the watermark is detectable. This capability provides a more stable and reliable method for watermarking text, irrespective of the model generating the text, therefore offering a concrete solution amidst the evolving landscape of AI-generated content.

+At its core, BERT employs a bi-directional Transformer encoder, which helps it understand the relationships between words in a sentence. This enhances its comprehension of text by understanding context from both directions simultaneously. BERT undergoes pre-training through two tasks: Masked Language Modeling (MLM), where certain words in a sentence are masked and the model predicts them based on surrounding words, and Next Sentence Prediction (NSP), which involves determining if one sentence logically follows another. This comprehensive training enables BERT to excel in numerous NLP applications like question answering, text classification, and named entity recognition. Given its deep understanding of context and semantics, BERT is highly relevant to text watermarking. Watermarking text involves embedding identifying patterns within the text to trace its origin, which can be critical for copyright protection and distinguishing between AI-generated and human-written content. BERT's sophisticated handling of language makes it ideal for embedding watermarks in a way that is subtle yet robust, ensuring that the text remains natural while the watermark is detectable. This capability provides a more stable and reliable method for watermarking text, irrespective of the model generating the text, therefore offering a concrete solution amidst the evolving landscape of AI-generated content.

## Related Work

-Related Work (Modified)
-In this paper, we will discuss two text watermarking algorithms in detail before delving into the suggested watermarking technique. First, let's examine the current standards for text watermarking. Word context, developed by Jalil et al.,In [@Proc01], a type of logical watermarking is developed in which a watermark key is generated without changing the source text. UniSpaCh [@Atr01], on the other hand, modifies the text's white spaces in order to implant a binary message directly into the text.
+In this section, we will review two text watermarking algorithms before introducing our proposed technique. Let's first look at the current standards for text watermarking. Text watermarking algorithms embed unique identifiers in text to protect copyright and verify authenticity. They are important because they help prevent unauthorized use, copying, and distribution of text.

-In word context, the author selects a keyword.
For the purpose of this paper, lets consider an example, let's say the keyword is "is" and the text is *"Pakistan is a developing country, with Islamabad is the capital of Pakistan. It is located in Asia."*. to generate the watermark we record the length of the words preceding and following the chosen keyword. In this case, those will be, "Pakistan" and "a", "Islamabad" and "the," and finally "It" and "located." We then append the lengths of these words one after the other, creating our watermark, which is 8-1-9-3-2-7.

+The first algorithm is Word Context, developed by Jalil & Mirza in 2009. It is a type of logical watermarking: a watermark key is generated without altering the original text [@Proc01]. Word Context generates this key by analyzing the structure of the text around selected keywords and creating a pattern based on word lengths [@Proc01]. First, a keyword is selected, chosen based on its significance in the text. For example, take the keyword 'is' in the text 'Pakistan is a developing country, with Islamabad is the capital of Pakistan. It is located in Asia.' The lengths of the words before and after each occurrence of 'is' are recorded: 'Pakistan' (8) and 'a' (1), 'Islamabad' (9) and 'the' (3), 'It' (2) and 'located' (7). The watermark is then 8-1-9-3-2-7 [@Proc01]. Word lengths are used because they provide a unique pattern without altering the text, ensuring the watermark is imperceptible [@Proc01].

-Using 2-bit categorization, UniSpaCh [@Atr04] proposes a masking strategy that creates and isolates a binary string SM (e.g., "10, 01, 00, and 11"). Every two bits are replaced with a unique space (such as a punctuation space, thin space, hair space, or six-per-em space). After that, the created file spaces are incorporated into certain areas, including the spaces between words, sentences, lines, and paragraphs. Because of the cover text, this method ensures a high degree of invisibility, but it has a low capacity (two bits per space) and is unsuitable for applications that need to integrate long-secret messages into brief cover messages.

+The second algorithm, UniSpaCh by Por et al. (2012), modifies the white spaces in text to embed a binary message directly into it [@Atr04]. A binary message is a sequence of bits (0s and 1s) that represents data; modifying the white spaces changes the spacing patterns in the text so that these bits are embedded in it. UniSpaCh uses 2-bit categorization, which assigns each pair of bits ('10', '01', '00', '11') to a specific type of space (such as a punctuation space or a thin space). These spaces are then placed in areas like between words, sentences, and paragraphs. The method is considered invisible because the changes are subtle and not easily noticeable by readers, but it has low capacity because only a few bits can be embedded per space, limiting the amount of information that can be hidden and making it unsuitable for embedding long messages [@Atr04].

-The first approach [@Proc01] is not appropriate for today's world, especially with regard to AI-generated text, as we can generate new text faster and easier than ever before, making it impractical to store a logical watermark for each one.
The second approach [@Atr04] is also not appropriate because it is relatively simple to reformat the text in order to remove the watermark; therefore, we require a watermarking technique that is both robust and imperceptible.

+The first approach, by Jalil & Mirza (2009), is not suitable for today's fast-paced generation of AI text: it is impractical to store a logical watermark for each new text because the volume of generated text is too high to manage, and AI text generation has made it easier and faster than ever to produce large amounts of text, increasing the need for scalable watermarking solutions [@Proc01]. The second approach, by Por et al. (2012), is also not suitable because the watermark can be easily removed by reformatting the text: changes in text layout, such as altering spaces or reflowing paragraphs, disrupt the embedded watermark. We therefore need a watermarking technique that is both robust and imperceptible; a robust technique withstands such changes and remains detectable, while an imperceptible one ensures the watermark is not noticeable to the reader [@Atr04].

-The technique presented in this paper is based on one that Lancaster [@Atr02] proposed for ChatGPT. In that method, he describes a way to generate watermarked text by replacing every fifth word from each non-overlapping 5-gram (a sequence of five consecutive words such that no sequence has overlapping words) with a word that is generated using a fixed random seed. Consider the following line, for instance: “The friendly robot greeted the visitors with a cheerful beep and a wave of its metal arms.” the non-overlapping 5 grams ignoring punctuation will be “The friendly robot greeted the”, “visitors with a cheerful beep” and “and a wave of its metal” here we will replace the words "the", "visitors" and "metal" using words generated by ChatGPT with a fixed random seed. The watermark will be checked using overlapping 5-grams, which are sequences of five consecutive words that overlap with each other except for one word. For the same example, the overlapping 5-grams will be "The friendly robot greeted the", "friendly robot greeted the visitors", "robot greeted the visitors with," and so on. The beauty of the approach is that we are using ChatGPT to watermark itself. However, this also means that we need to run two separate models of ChatGPT (for the sake of standardization as multiple models of ChatGPT are available to users and each model may give different output on the same random seed) or run a second model on the generated text.

+Our proposed technique is based on a method by Lancaster (2023) for ChatGPT [@Atr02]. It replaces every fifth word in a sequence of five consecutive words (non-overlapping 5-gram) with a word generated using a fixed random seed. For example, in the sentence 'The friendly robot greeted the visitors with a cheerful beep and a wave of its metal arms,' the non-overlapping 5-grams are 'The friendly robot greeted the,' 'visitors with a cheerful beep,' and 'and a wave of its metal.' We replace the words 'the,' 'visitors,' and 'metal' with words generated by ChatGPT using a fixed random seed [@Atr02]. A non-overlapping 5-gram is a sequence of five consecutive words without any overlap.
Replacing every fifth word embeds the watermark without altering the overall meaning of the text, making the method subtle and effective [@Atr02].

+We check the watermark using overlapping 5-grams: sequences of five consecutive words that overlap by four words. For example, 'The friendly robot greeted the,' 'friendly robot greeted the visitors,' 'robot greeted the visitors with,' etc. This method uses ChatGPT to watermark its own text, but it requires running two ChatGPT models: different models might produce different results with the same random seed, and consistency across outputs is crucial for verifying the watermark.

+We propose using BERT, a model designed to find missing words, as a better alternative to ChatGPT for this purpose: it is smaller and more precise, making it more efficient. BERT's bidirectional nature means it uses context from both the preceding and the following words to predict a missing word, which can lead to more accurate results. While ChatGPT-based algorithms are best suited to ChatGPT-generated text, BERT can be applied to any text, regardless of its origin.

-In this paper, we suggest using BERT to overcome this, a model created for discovering missing words, as a much superior substitute for ChatGPT because it is more precise and smaller. Not that BERT will necessarily produce better results than ChatGPT; rather, because of its bidirectional nature, which allows us to use more context for word prediction, essentially increases the amount of context that could potentially lead to better results than ChatGPT, which only uses the context of the previous words to predict the next word. While ChatGPT-based algorithms will work best for ChatGPT-generated text, using BERT will allow us to expand our horizons beyond AI-generated text to any text, regardless of its origin.

## Proposed Model

-BERT-based watermarking is derived from the 5-gram approach by Lancaster[@Atr02], but here the focus is watermarking of any text in general regardless of its origin. This paper will mainly use **bert-base-uncased** autoencoding language which finds the most probable uncased English token in place of the [MASK] token.
+BERT-based watermarking is based on the 5-gram approach by Lancaster [@Atr02]. However, our focus is on watermarking any text, regardless of its origin. This paper will use the **bert-base-uncased** model, which finds the most probable uncased English word to replace the [MASK] token.

Note that a different variant of BERT can be trained on a different language dataset and will therefore generate different results; the unique identity to consider here is the BERT model itself, i.e., if users want a unique watermark they need to train or develop their own BERT model. This paper is not concerned with the specific type of BERT model and focuses on its conceptual application to watermarking. Thus, for us, BERT is a black-box model that returns the most probable word given the context, with the only condition being that it runs with a constant temperature, i.e., it does not produce different results for the same input.
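To make this black-box view concrete, the following minimal sketch (an illustration, not the paper's implementation; the helper name `most_probable_word` is ours) uses the HuggingFace `fill-mask` pipeline with `bert-base-uncased` and always keeps the single highest-probability candidate, so the same input deterministically yields the same word.

```python
from transformers import pipeline

# Greedy, deterministic fill-in-the-blank: the highest-probability candidate is
# always chosen, so identical inputs always produce identical outputs.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

def most_probable_word(text_with_mask: str) -> str:
    # The pipeline returns candidates sorted by probability; keep only the best one.
    return unmasker(text_with_mask)[0]["token_str"]

print(most_probable_word("The friendly robot greeted the visitors with a [MASK] beep."))
```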
For our purposes, you can think of the proposed algorithm as a many-to-one function that maps input texts into the set of watermarked texts.

@@ -53,7 +54,7 @@ The above is a simple implementation of the algorithm where we are assuming
1. The only white spaces in the text are " ".
2. BERT model has infinite context.

-This simplified code allows us to grasp the core of the algorithm here we first simply split the input text into a list of words, in this case, it's with a simple spit() function due to our assumption that only white spaces are spaces, then we replace every 5th word with the [MASK] token this is a special token that tells BERT at which position to replace the word and can vary from model to model, then for every single [MASK] token in the list we pass all the words preceding and the 4 words proceeding the [MASK] here we assume that the BERT model has infinite context size which isn't true for BERT so, in that case, we will pass upto maximum_context_size - 5 words along with the [MASK] token, missing_word_form_BERT() then returns the most probable missing word which replaces the respective [MASK] token in the list this will continue until all [MASK] tokens are replaced and finally we call the " ".join() to convert the list words into a string.
+This simplified code allows us to grasp the core of the algorithm. First, we split the input text into a list of words using the `split()` function. Next, we replace every 5th word with the string "[MASK]", which represents a special token indicating where BERT should predict a word. For each [MASK] token, we pass the preceding words and the 4 following words to the BERT model, assuming BERT can handle an infinite context. In reality, BERT has a limited context, so we use up to maximum_context_size - 5 words along with the [MASK] token. The missing_word_form_BERT() function returns the most probable word, which replaces the [MASK] token in the list. We continue this process until all [MASK] tokens are replaced, then convert the list of words back into a string using `" ".join()`.

The beauty of the algorithm is that if we were to run it again on the watermarked text, the output that we would get would be the same as the input; thus, to check if a given text is watermarked we simply need to compare the input and output. To determine if a given text is watermarked we simply need to run the above algorithm again, but with a few changes: we will have to take offset into consideration, as the one plagiarizing the text might insert additional words that may lead to the text

@@ -61,26 +62,25 @@ The beauty of the algorithm is that if we were to run it again on the watermarke
The algorithm checks if a given text is watermarked by comparing the input and output texts, considering possible word insertions that may offset the watermark pattern.

-1. Input Text Preparation : Obtain the suspected watermarked text as input.
+1. **Input Text Preparation:** Obtain the suspected watermarked text as input.

-2. Run Watermark Detection Algorithm: Run the watermark detection algorithm on the input text.
+2. **Run Watermark Detection Algorithm:** Run the watermark detection algorithm on the input text.

-3. Compare Input and Output: If the input matches the output, the text is watermarked.If not, proceed to check with offsets.
+3. **Compare Input and Output:** If the input matches the output, the text is watermarked. If not, proceed to check with offsets.

-4.
Offset Consideration: Initialize an array to store match percentages for each offset: `offsets = [0, 1, 2, 3, 4]`.For each offset, adjust the input text by removing `n % 5` words where `n` is the number of words added.
+4. **Offset Consideration:** Initialize an array to store match percentages for each offset: `offsets = [0, 1, 2, 3, 4]`. For each offset, adjust the input text by removing `n % 5` words, where `n` is the number of words added.

-5. Check for Matches: For each offset, count the matches where the watermark pattern (every 5th word replaced) aligns.
+5. **Check for Matches:** For each offset, count the matches where the watermark pattern (every 5th word replaced) aligns.

-6. Store Match Percentages: Calculate the percentage of matches for each offset and store them.
+6. **Store Match Percentages:** Calculate the percentage of matches for each offset and store them.

-7. Statistical Analysis: Compute the highest percentage of matches (`Highest Ratio`). Compute the average percentage of matches for the remaining offsets (`Average Others`). Calculate the T-Statistic and P-Value to determine the statistical difference between `Highest Ratio` and `Average Others`.
+7. **Statistical Analysis:** Compute the highest percentage of matches (`Highest Ratio`). Compute the average percentage of matches for the remaining offsets (`Average Others`). Calculate the T-Statistic and P-Value to determine the statistical difference between `Highest Ratio` and `Average Others`. The T-Statistic measures the difference between groups, and the P-Value indicates the significance of this difference.

-8. Classification: Use a pre-trained model to classify the text based on the metrics (`Highest Ratio`, `Average Others`, T-Statistic, P-Value) as watermarked or not.
+8. **Classification:** Use a pre-trained classifier to label the text as watermarked or not, based on the metrics (`Highest Ratio`, `Average Others`, T-Statistic, P-Value).

## Implementation - Encoding module

-Lets take a look at a potential python code implementation for the proposed watermarking model. A "watermark_text" module identifies every 5th word in the given input string, splits them using python's in-built split() library and tracks them as the words to be modified using BERT. These word placeholders are replaced with a "[MASK]" token.
-While we are using the BERT model here, the module can easily be scaled to adapt with other AI models. The choice of BERT is solely due to its efficiency in altering individual words.
+Let's examine a Python implementation of the proposed watermarking model. The watermark_text module splits the input string using Python's built-in split() function, identifies every 5th word, and marks it for modification using BERT. Each of these placeholders is replaced with the [MASK] token. The 5th word is selected to ensure a consistent and detectable pattern. Although we use BERT here, the module can be adapted to other AI models; we chose BERT due to its efficiency in altering individual words.

@@ -138,11 +138,13 @@ In the result, the module has replaced each 5th word with the most probable repl
Further, to speed up the AI computation, we can employ GPUs in this module as well as in the Detection module.
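To summarize this module, here is a minimal, self-contained sketch that is consistent with the description above. It uses the HuggingFace `fill-mask` pipeline under the paper's two simplifying assumptions (plain-space whitespace and text that fits within BERT's context window); it is an illustration rather than the exact watermark_text code used in the paper.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def watermark_text(text: str, step: int = 5) -> str:
    """Replace every `step`-th word with BERT's most probable word (sketch)."""
    words = text.split()  # assumes the only whitespace is a plain " "
    for i in range(step - 1, len(words), step):
        words[i] = "[MASK]"
        # Passes the full text for simplicity; the paper's module limits the
        # input to the model's maximum context size around the mask.
        words[i] = unmasker(" ".join(words))[0]["token_str"]
    return " ".join(words)

print(watermark_text("The friendly robot greeted the visitors with a cheerful beep and a wave of its metal arms."))
```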
## Implementation - Detection Module

-Now that we know what our watermarked text will look like, hereforth we will assume that this is the published text that a plagiarizer will have access to and we are tasked with identifying if there is copyright infringement.
+Now that we have our watermarked text, we need to identify potential copyright infringement. We assume this text is what a plagiarizer has access to.
+
+For this, we create a module that checks the number of word matches when the AI model, with the same offset parameter, is run on the watermarked text again. The algorithm's elegance lies in its consistency: if we run it again on the watermarked text, the output will match the input because the most probable words are already present at every 5th offset. Consequently, we get a 100% match rate with a match ratio of 1. If all the 5th words were altered, our match rate would be 0.

-For this, we create a module to check the number of word matches we will get if the AI model with same offset parameter is run on the watermarked again. The elegance of the algorithm lies in its consistency: if we run it again on a watermarked text, the output will match the input as the most probable words are already present at every ith offset of the text and their position has not been altered. Consequently we would get 100% match rate with a match ratio of 1. If all the ith words were altered our match rate would be 0%. would get 0 matches with a match ratio of 0.

+The use of AI models to improve one's written text has become quite common. When an AI model is given a piece of text, it is likely to alter and/or shuffle an entire sentence based on context, as opposed to individual words. This means our module needs to look for the watermark not only at the i-th index but also at the words leading up to it, so it loops over the possible offsets and checks for matches. This also covers the scenario where a plagiarizer might insert extra words, causing the input not to match the output exactly.

-The use of AI models to improve one's written text has become quite common. When a AI model is given a sub text, it is likely to alter and/or shuffle an entire sentence based on context as opposed to individual words. Which means, our model needs to ensure it is looking for watermark in not only at the ith index, but also the words leading upto it. Which means our module will loop for offsets 1 to 5. and check for matches. This also covers the scenario where a plagiarizer might insert extra words, causing the input not to match the output exactly.

+Altering written text is a possibility we cannot ignore. Consider a scenario where a plagiarizer inserts extra words, causing the input not to match the output exactly. This means our module needs to check for the watermark not only at a specific index but also in the surrounding words, i.e., at different offsets (0 to 4), to account for potential word insertions; a sketch of this offset search is given after the list below.

Here's how the offset works:
- If 1 word is added at the start, the offset is 1.
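As referenced above, the following is a minimal sketch of that offset search, reusing the same `fill-mask` pipeline as in the earlier sketches. It only computes the per-offset match ratios and omits the `n % 5` word-removal adjustment, the T-Statistic/P-Value computation, and the final classifier from the detection procedure, so it is an illustration rather than the paper's exact detection module.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

def match_ratio(text: str, offset: int, step: int = 5) -> float:
    # Fraction of probed positions where BERT re-predicts the word already present.
    words = text.split()
    matches = total = 0
    for i in range(step - 1 + offset, len(words), step):
        original = words[i]
        words[i] = "[MASK]"
        predicted = unmasker(" ".join(words))[0]["token_str"]
        words[i] = original  # restore before probing the next position
        matches += int(predicted.lower() == original.lower())
        total += 1
    return matches / total if total else 0.0

# Probe every offset; a watermarked text shows one offset with a ratio near 1.0,
# clearly above the average of the others.
suspect_text = "The friendly robot greeted the visitors with a cheerful beep and a wave of its metal arms."
ratios = {off: match_ratio(suspect_text, off) for off in range(5)}
print(ratios)
```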
diff --git a/papers/atharva_rasane/myst.yml b/papers/atharva_rasane/myst.yml index 6b2e2bbc88..5527e8f92f 100644 --- a/papers/atharva_rasane/myst.yml +++ b/papers/atharva_rasane/myst.yml @@ -3,7 +3,7 @@ project: # Update this to match `scipy-2024-` the folder should be `` id: scipy-2024-atharva_rasane # Ensure your title is the same as in your `main.md` - title: AI driven Watermarking Technique for Safeguarding Text Integrity in the Digital Age + title: AI-Driven Watermarking Technique for Safeguarding Text Integrity in the Digital Age subtitle: # Authors should have affiliations, emails and ORCIDs if available authors: